
ADMM for Nonsmooth Composite Optimization under Orthogonality Constraints

Ganzhao Yuan
Peng Cheng Laboratory, China
[email protected]
Abstract

We consider a class of structured, nonconvex, nonsmooth optimization problems under orthogonality constraints, where the objectives combine a smooth function, a nonsmooth concave function, and a nonsmooth weakly convex function. This class of problems finds diverse applications in statistical learning and data science. Existing methods for addressing these problems often fail to exploit the specific structure of orthogonality constraints, struggle with nonsmooth functions, or result in suboptimal oracle complexity. We propose OADMM, an Alternating Direction Method of Multipliers (ADMM) designed to solve this class of problems using efficient proximal linearized strategies. Two specific variants of OADMM are explored: one based on Euclidean Projection (OADMM-EP) and the other on Riemannian Retraction (OADMM-RR). Under mild assumptions, we prove that OADMM converges to a critical point of the problem with an ergodic convergence rate of $\mathcal{O}(1/\epsilon^{3})$. Additionally, we establish a super-exponential convergence rate or polynomial convergence rate for OADMM, depending on the specific setting, under the Kurdyka-Łojasiewicz (KL) inequality. To the best of our knowledge, this is the first non-ergodic convergence result for this class of nonconvex nonsmooth optimization problems. Numerical experiments demonstrate that the proposed algorithm achieves state-of-the-art performance.

Keywords: Orthogonality Constraints; Nonconvex Optimization; Nonsmooth Composite Optimization; ADMM; Convergence Analysis

1 Introduction

This paper focuses on the following nonsmooth composite optimization problem under orthogonality constraints (where `$\triangleq$' denotes a definition):

$$\min_{\mathbf{X}\in\mathbb{R}^{n\times r}}\,F(\mathbf{X})\triangleq f(\mathbf{X})-g(\mathbf{X})+h(\mathcal{A}(\mathbf{X})),\quad\text{s.t.}~\mathbf{X}^{\mathsf{T}}\mathbf{X}=\mathbf{I}_{r}.\qquad(1)$$

Here, $n\geq r$, $\mathcal{A}(\mathbf{X})\in\mathbb{R}^{m}$ is a linear mapping of $\mathbf{X}$, and $\mathbf{I}_{r}$ is the $r\times r$ identity matrix. For conciseness, the orthogonality constraint $\mathbf{X}^{\mathsf{T}}\mathbf{X}=\mathbf{I}_{r}$ in Problem (1) is rewritten as $\mathbf{X}\in\mathcal{M}\subseteq\mathbb{R}^{n\times r}$, with $\mathcal{M}$ representing the Stiefel manifold in the literature Edelman et al. (1998); Absil et al. (2008b).

We impose the following assumptions on Problem (1) throughout this paper. ($\mathbb{A}$-i) $f(\mathbf{X})$ is $L_{f}$-smooth, satisfying $\|\nabla f(\mathbf{X})-\nabla f(\mathbf{X}')\|_{\mathsf{F}}\leq L_{f}\|\mathbf{X}-\mathbf{X}'\|_{\mathsf{F}}$ for all $\mathbf{X},\mathbf{X}'\in\mathbb{R}^{n\times r}$. This implies $|f(\mathbf{X})-f(\mathbf{X}')-\langle\nabla f(\mathbf{X}'),\mathbf{X}-\mathbf{X}'\rangle|\leq\tfrac{L_{f}}{2}\|\mathbf{X}-\mathbf{X}'\|_{\mathsf{F}}^{2}$ (cf. Lemma 1.2.3 in Nesterov (2003)). We also assume that $f(\mathbf{X})$ is $C_{f}$-Lipschitz continuous, with $\|\nabla f(\mathbf{X})\|_{\mathsf{F}}\leq C_{f}$ for all $\mathbf{X}\in\mathcal{M}$. The convexity of $f(\mathbf{X})$ is not assumed. ($\mathbb{A}$-ii) The function $g(\cdot)$ is convex, proper, and $C_{g}$-Lipschitz continuous, though not necessarily smooth. ($\mathbb{A}$-iii) The function $h(\cdot)$ is proper, lower semicontinuous, $C_{h}$-Lipschitz continuous, and potentially nonsmooth. It is also weakly convex with constant $W_{h}\geq 0$, which means that the function $h(\mathbf{y})+\tfrac{W_{h}}{2}\|\mathbf{y}\|_{2}^{2}$ is convex for all $\mathbf{y}\in\mathbb{R}^{m}$. ($\mathbb{A}$-iv) The proximal operator, $\operatorname{\mathbb{P}}_{\mu}(\mathbf{y}')\triangleq\arg\min_{\mathbf{y}}\tfrac{1}{2\mu}\|\mathbf{y}-\mathbf{y}'\|_{2}^{2}+h(\mathbf{y})$, can be computed efficiently and exactly for any given $\mu>0$ and $\mathbf{y}'\in\mathbb{R}^{m}$.
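To make assumption ($\mathbb{A}$-iv) concrete, the following minimal sketch (assuming $h=\|\cdot\|_{1}$, which is convex and hence $0$-weakly convex) evaluates $\operatorname{\mathbb{P}}_{\mu}(\cdot)$ exactly via soft-thresholding:

```python
import numpy as np

def prox_l1(y_prime, mu):
    # P_mu(y') = argmin_y (1/(2*mu)) * ||y - y'||^2 + ||y||_1,
    # solved coordinate-wise by soft-thresholding at level mu.
    return np.sign(y_prime) * np.maximum(np.abs(y_prime) - mu, 0.0)

print(prox_l1(np.array([1.5, -0.2, 0.7, -3.0]), mu=0.5))  # [ 1. -0.  0.2 -2.5]
```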

Problem (1) represents an optimization framework that plays a crucial role in a variety of statistical learning and data science models. These models include sparse Principal Component Analysis (PCA) Journée et al. (2010); Lu & Zhang (2012), deep neural networks Cho & Lee (2017); Xie et al. (2017); Bansal et al. (2018); Cogswell et al. (2016); Huang & Gao (2023), orthogonal nonnegative matrix factorization Jiang et al. (2022), range-based independent component analysis Selvan et al. (2015), and dictionary learning Zhai et al. (2020).

1.1 Related Work

▶ Optimization under Orthogonality Constraints. Solving Problem (1) is challenging due to the computationally expensive and nonconvex orthogonality constraints. Existing methods can be divided into three classes. (i) Geodesic-like methods Edelman et al. (1998); Abrudan et al. (2008); Absil et al. (2008b); Jiang & Dai (2015). These methods involve calculating geodesics by solving ordinary differential equations, which can introduce significant computational complexity. To mitigate this, geodesic-like methods iteratively compute the geodesic logarithm using simple linear algebra calculations. Efficient constraint-preserving update schemes have been integrated with the Barzilai-Borwein (BB) stepsize strategy Wen & Yin (2013); Jiang & Dai (2015) for minimizing smooth functions under orthogonality constraints. (ii) Projection and retraction methods Absil et al. (2008b); Golub & Van Loan (2013). These methods maintain orthogonality constraints through projection or retraction. They reduce the objective value by using the current Euclidean gradient direction or Riemannian tangent direction, followed by an orthogonal projection operation. This projection can be computed using polar decomposition or singular value decomposition, or approximated with QR factorization. (iii) Multiplier correction methods Gao et al. (2018; 2019); Xiao et al. (2022). Leveraging the insight that the Lagrangian multiplier associated with the orthogonality constraint is symmetric and has an explicit closed-form expression at the first-order optimality condition, these methods tackle an alternative unconstrained nonlinear objective minimization problem, rather than the original smooth function under orthogonality constraints.

▶ Optimization with Nonsmooth Objectives. Another challenge in addressing Problem (1) stems from the nonsmooth nature of the objective function. Existing methods for tackling this challenge fall into four main categories. (i) Subgradient methods Ferreira & Oliveira (1998); Hwang et al. (2015); Li et al. (2021). Subgradient methods, analogous to gradient descent methods, can incorporate various geodesic-like and projection-like techniques. However, they often exhibit slower convergence rates compared to other approaches. (ii) Proximal gradient methods Chen et al. (2020). These methods use a semi-smooth Newton approach to solve a strongly convex minimization problem over the tangent space, finding a descent direction while preserving the orthogonality constraint through a retraction operation. (iii) Operator splitting methods Lai & Osher (2014); Chen et al. (2016); Zhang et al. (2020b). These methods introduce linear constraints to break down the original problem into simpler subproblems that can be solved separately and exactly. Among these, ADMM is a promising solution for Problem (1) due to its capability to handle nonsmooth objectives and nonconvex constraints separately and alternately. Several ADMM-like algorithms have been proposed for solving nonconvex problems Boţ & Nguyen (2020); Boţ et al. (2019); Wang et al. (2019); Li & Pong (2015); He & Yuan (2012); Yuan (2024); Zhang et al. (2020b), but these methods fail to exploit the specific structure of orthogonality constraints or cannot be adapted to solve Problem (1). (iv) Other methods. OBCD Yuan (2023) has been proposed to solve a specific class of our problems, while the exact augmented Lagrangian method ManIAL was introduced in Deng et al. (2024).

▶ Detailed Discussions on Operator Splitting Methods. We list some popular variants of operator splitting methods for tackling Problem (1). Two natural splitting strategies have been used in the literature:

$$\min_{\mathbf{X},\mathbf{y}}\,F_{1}(\mathbf{X},\mathbf{y})\triangleq f(\mathbf{X})-g(\mathbf{X})+h(\mathbf{y})+\mathcal{I}_{\mathcal{M}}(\mathbf{X}),\quad\text{s.t.}~\mathcal{A}(\mathbf{X})=\mathbf{y}\qquad(2)$$
$$\min_{\mathbf{X},\mathbf{Y}}\,F_{2}(\mathbf{X},\mathbf{Y})\triangleq f(\mathbf{X})-g(\mathbf{X})+h(\mathcal{A}(\mathbf{X}))+\mathcal{I}_{\mathcal{M}}(\mathbf{Y}),\quad\text{s.t.}~\mathbf{X}=\mathbf{Y}.\qquad(3)$$

(a) Smoothing Proximal Gradient Methods (SPGM, Beck & Rosset (2023); Böhm & Wright (2021)) incorporate a penalty (or smoothing) parameter $\mu\rightarrow 0$ to penalize the squared error in the constraints, resulting in the minimization problem (Beck & Rosset (2023); Böhm & Wright (2021); Chen (2012)): $\min_{\mathbf{X},\mathbf{y}}F_{1}(\mathbf{X},\mathbf{y})+\tfrac{1}{2\mu}\|\mathcal{A}(\mathbf{X})-\mathbf{y}\|_{2}^{2}$. During each iteration, SPGM employs proximal gradient strategies to alternately minimize w.r.t. $\mathbf{X}$ and $\mathbf{y}$. (b) Splitting Orthogonality Constraints Methods (SOCM, Lai & Osher (2014)) use the following iteration scheme: $\mathbf{X}^{t+1}\approx\arg\min_{\mathbf{X}}F_{2}(\mathbf{X},\mathbf{Y}^{t})+\langle\mathbf{Z}^{t},\mathbf{X}-\mathbf{Y}^{t}\rangle+\tfrac{\beta}{2}\|\mathbf{X}-\mathbf{Y}^{t}\|_{\mathsf{F}}^{2}$; $\mathbf{Y}^{t+1}\in\arg\min_{\mathbf{Y}}F_{2}(\mathbf{X}^{t+1},\mathbf{Y})+\langle\mathbf{Z}^{t},\mathbf{X}^{t+1}-\mathbf{Y}\rangle+\tfrac{\beta}{2}\|\mathbf{X}^{t+1}-\mathbf{Y}\|_{\mathsf{F}}^{2}$; and $\mathbf{Z}^{t+1}=\mathbf{Z}^{t}+\beta(\mathbf{X}^{t+1}-\mathbf{Y}^{t+1})$, where $\beta$ is a fixed penalty constant and $\mathbf{Z}^{t}$ is the multiplier associated with the constraint $\mathbf{X}=\mathbf{Y}$ at iteration $t$. (c) Similarly, Manifold ADMM (MADMM, Kovnatsky et al. (2016)) iterates as follows: $\mathbf{X}^{t+1}\approx\arg\min_{\mathbf{X}}F_{1}(\mathbf{X},\mathbf{y}^{t})+\langle\mathbf{z}^{t},\mathcal{A}(\mathbf{X})-\mathbf{y}^{t}\rangle+\tfrac{\beta}{2}\|\mathcal{A}(\mathbf{X})-\mathbf{y}^{t}\|_{2}^{2}$; $\mathbf{y}^{t+1}\in\arg\min_{\mathbf{y}}F_{1}(\mathbf{X}^{t+1},\mathbf{y})+\langle\mathbf{z}^{t},\mathcal{A}(\mathbf{X}^{t+1})-\mathbf{y}\rangle+\tfrac{\beta}{2}\|\mathcal{A}(\mathbf{X}^{t+1})-\mathbf{y}\|_{2}^{2}$; and $\mathbf{z}^{t+1}=\mathbf{z}^{t}+\beta(\mathcal{A}(\mathbf{X}^{t+1})-\mathbf{y}^{t+1})$, where $\mathbf{z}^{t}$ is the multiplier associated with the constraint $\mathcal{A}(\mathbf{X})-\mathbf{y}=\mathbf{0}$ at iteration $t$. (d) Like MADMM, Riemannian ADMM (RADMM, Li et al. (2022)) operates on the first splitting strategy in Problem (2). In contrast, it employs a Riemannian retraction strategy to solve the $\mathbf{X}$-subproblem and a Moreau envelope smoothing strategy to solve the $\mathbf{y}$-subproblem.

Contributions. We compare existing methods for solving Problem (1) in Table 1, and our main contributions are summarized as follows. (i) We introduce OADMM, a specialized ADMM designed for structured nonsmooth composite optimization problems under orthogonality constraints in Problem (1). Two specific variants of OADMM are explored: one based on Euclidean Projection (OADMM-EP) and the other on Riemannian Retraction (OADMM-RR). Notably, while many existing works primarily address cases where $g(\mathbf{X})=0$ and $h(\cdot)$ is convex, our approach considers a more general setting where $h(\cdot)$ is weakly convex and $g(\mathbf{X})$ is convex. (ii) OADMM can achieve fast convergence by incorporating Nesterov's extrapolation Nesterov (2003) into OADMM-EP and a Monotone Barzilai-Borwein (MBB) stepsize strategy Wen & Yin (2013) into OADMM-RR to potentially accelerate primal convergence. Both variants also employ an over-relaxation strategy to enhance dual convergence Gonçalves et al. (2017); Yang et al. (2017); Li et al. (2016). (iii) By introducing a novel Lyapunov function, we establish the convergence of OADMM to critical points of Problem (1) within an oracle complexity of $\mathcal{O}(1/\epsilon^{3})$, matching the best-known results to date Beck & Rosset (2023); Böhm & Wright (2021). This is achieved through a decreasing step size for updating primal and dual variables. In contrast, RADMM employs a small constant step size for such updates, resulting in a sub-optimal oracle complexity of $\mathcal{O}(\epsilon^{-4})$ Li et al. (2022). (iv) We establish a super-exponential convergence rate or polynomial convergence rate for OADMM, depending on the specific setting, under the Kurdyka-Łojasiewicz (KL) inequality, providing the first non-ergodic convergence result for this class of nonconvex nonsmooth optimization problems.

Table 1: Comparison of existing methods for solving Problem (1).

| Reference | $h(\mathcal{A}(\mathbf{X}))$ | $g(\mathbf{X})$ | Notable Features | Complexity | Conv. Rate |
|---|---|---|---|---|---|
| SOCM Lai & Osher (2014) | convex $h(\cdot)$ | empty | $\sigma=1$, $\alpha=0$ | unknown | unknown |
| MADMM Kovnatsky et al. (2016) | convex $h(\cdot)$ | empty | $\sigma=1$, $\alpha=0$ | unknown | unknown |
| RSG Li et al. (2021) | weakly convex $h(\cdot)$ | empty | — | $\mathcal{O}(\epsilon^{-4})$ | unknown |
| ManPG Chen et al. (2020) | $h(\mathcal{A}(\mathbf{X}))=\|\mathbf{X}\|_{1}$ | empty | hard subproblem | $\mathcal{O}(\epsilon^{-2})$ | unknown |
| OBCD Yuan (2023) | separable $h(\cdot)$ | empty | hard subproblem | $\mathcal{O}(\epsilon^{-2})$ | unknown |
| RADMM Li et al. (2022) | convex $h(\cdot)$ | empty | $\sigma=1$, $\alpha=0$ | $\mathcal{O}(\epsilon^{-4})$ | unknown |
| ManIAL Deng et al. (2024) | convex $h(\cdot)$ | empty | inexact subproblem | $\mathcal{O}(\epsilon^{-3})$ | unknown |
| SPGM Beck & Rosset (2023) | convex $h(\cdot)$ | empty | — | $\mathcal{O}(\epsilon^{-3})$ | unknown |
| OADMM-EP [ours] | weakly convex $h(\cdot)$ | convex | $\sigma\in[1,2)$, $\alpha>0$ | $\mathcal{O}(\epsilon^{-3})$ | $\mathcal{O}(1/\exp(T^{\dot{u}}))$, $\dot{u}\in(0,\tfrac{2}{3}]^{\star}$ or $\mathcal{O}(1/T^{\ddot{u}})$, $\ddot{u}\in(0,+\infty)^{\ddagger}$ |
| OADMM-RR [ours] | weakly convex $h(\cdot)$ | convex | $\sigma\in[1,2)$, MBB | $\mathcal{O}(\epsilon^{-3})$ | $\mathcal{O}(1/\exp(T^{\dot{u}}))$, $\dot{u}\in(0,\tfrac{2}{3}]^{\star}$ or $\mathcal{O}(1/T^{\ddot{u}})$, $\ddot{u}\in(0,+\infty)^{\ddagger}$ |

Note $\star$: This is known as super-exponential convergence; please refer to Theorem 5.9(a) for more details.
Note $\ddagger$: This is known as polynomial convergence; please refer to Theorem 5.9(b) for more details.

2 Technical Preliminaries

This section provides some technical preliminaries on Moreau envelopes for weakly convex functions and manifold optimization.

Notations. We define $[n]\triangleq\{1,2,\ldots,n\}$. We use $\mathcal{A}^{\mathsf{T}}(\cdot)$ to denote the adjoint operator of $\mathcal{A}(\cdot)$, satisfying $\langle\mathcal{A}(\mathbf{X}),\mathbf{z}\rangle=\langle\mathbf{X},\mathcal{A}^{\mathsf{T}}(\mathbf{z})\rangle$ for all $\mathbf{X}\in\mathbb{R}^{n\times r}$ and $\mathbf{z}\in\mathbb{R}^{m}$. We define $\overline{\rm A}\triangleq\max_{\mathbf{V}}\|\mathcal{A}(\mathbf{V})\|_{\mathsf{F}}/\|\mathbf{V}\|_{\mathsf{F}}$. We use $\mathcal{I}_{\mathcal{M}}(\mathbf{X})$ to denote the indicator function of the orthogonality constraints. Further notations, technical preliminaries, and relevant lemmas are detailed in Appendix Section A.

2.1 Moreau Envelopes for Weakly Convex Functions

We provide the following useful definition.

Definition 2.1.

For a proper, convex, and Lipschitz continuous function $h(\mathbf{y}):\mathbb{R}^{m}\mapsto\mathbb{R}$, the Moreau envelope of $h(\mathbf{y})$ with parameter $\mu>0$ is given by $h_{\mu}(\mathbf{y})\triangleq\min_{\breve{\mathbf{y}}}h(\breve{\mathbf{y}})+\frac{1}{2\mu}\|\breve{\mathbf{y}}-\mathbf{y}\|_{2}^{2}$.

We now present some useful properties of the Moreau envelope for weakly convex functions.

Lemma 2.2.

Let $h:\mathbb{R}^{m}\mapsto\mathbb{R}$ be a proper, $W_{h}$-weakly convex, and lower semicontinuous function, and assume $\mu\in(0,W_{h}^{-1})$. We have the following results Böhm & Wright (2021). (a) The function $h_{\mu}(\cdot)$ is $C_{h}$-Lipschitz continuous. (b) The function $h_{\mu}(\cdot)$ is continuously differentiable with gradient $\nabla h_{\mu}(\mathbf{y})=\frac{1}{\mu}(\mathbf{y}-\operatorname{\mathbb{P}}_{\mu}(\mathbf{y}))$ for all $\mathbf{y}$, where $\operatorname{\mathbb{P}}_{\mu}(\mathbf{y})\triangleq\arg\min_{\breve{\mathbf{y}}}h(\breve{\mathbf{y}})+\frac{1}{2\mu}\|\breve{\mathbf{y}}-\mathbf{y}\|_{2}^{2}$. This gradient is $\max(\mu^{-1},\frac{W_{h}}{1-\mu W_{h}})$-Lipschitz continuous. In particular, when $\mu\in(0,\frac{1}{2W_{h}}]$, the condition $\mu^{-1}\geq\frac{W_{h}}{1-\mu W_{h}}$ ensures that $h_{\mu}(\mathbf{y})$ is $(\mu^{-1})$-smooth and $(\mu^{-1})$-weakly convex.
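As an illustration of Lemma 2.2(b), the following sketch (again assuming $h=\|\cdot\|_{1}$, so $W_{h}=0$) evaluates $h_{\mu}$ and its gradient $\nabla h_{\mu}(\mathbf{y})=\frac{1}{\mu}(\mathbf{y}-\operatorname{\mathbb{P}}_{\mu}(\mathbf{y}))$, and checks the gradient formula against finite differences:

```python
import numpy as np

def prox_l1(y, mu):
    # P_mu(y): proximal map of h = ||.||_1, i.e., soft-thresholding at level mu.
    return np.sign(y) * np.maximum(np.abs(y) - mu, 0.0)

def moreau_l1(y, mu):
    # h_mu(y) = h(P_mu(y)) + (1/(2*mu)) * ||P_mu(y) - y||^2 (Definition 2.1).
    w = prox_l1(y, mu)
    return np.sum(np.abs(w)) + np.linalg.norm(w - y)**2 / (2 * mu)

def grad_moreau_l1(y, mu):
    # Lemma 2.2(b): grad h_mu(y) = (y - P_mu(y)) / mu.
    return (y - prox_l1(y, mu)) / mu

mu, y = 0.3, np.array([1.2, -0.05, 0.8])
g = grad_moreau_l1(y, mu)
eps = 1e-6
g_fd = np.array([(moreau_l1(y + eps*e, mu) - moreau_l1(y - eps*e, mu)) / (2*eps)
                 for e in np.eye(3)])
print(np.max(np.abs(g - g_fd)))  # agreement up to finite-difference error
```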

Lemma 2.3.

(Proof in Appendix B.1) Assume $0<\mu_{2}<\mu_{1}<\frac{1}{W_{h}}$, and fix $\mathbf{y}\in\mathbb{R}^{m}$. We have: $0\leq h_{\mu_{2}}(\mathbf{y})-h_{\mu_{1}}(\mathbf{y})\leq\min\{\tfrac{\mu_{1}}{2\mu_{2}},1\}\cdot(\mu_{1}-\mu_{2})C_{h}^{2}$.

Lemma 2.4.

(Proof in Appendix B.2) Assume $0<\mu_{2}<\mu_{1}\leq\frac{1}{2W_{h}}$, and fix $\mathbf{y}\in\mathbb{R}^{m}$. We have: $\|\nabla h_{\mu_{1}}(\mathbf{y})-\nabla h_{\mu_{2}}(\mathbf{y})\|\leq(\tfrac{\mu_{1}}{\mu_{2}}-1)C_{h}$.

Lemma 2.5.

(Proof in Appendix B.3) Assume that $h(\mathbf{y})$ is $W_{h}$-weakly convex, $\mu\in(0,\tfrac{1}{2W_{h}}]$, and $\beta>\mu^{-1}$. Consider the following strongly convex optimization problem: $\bar{\mathbf{y}}=\arg\min_{\mathbf{y}}h_{\mu}(\mathbf{y})+\frac{\beta}{2}\|\mathbf{y}-\mathbf{b}\|_{2}^{2}$, which is equivalent to $(\bar{\mathbf{y}},\breve{\mathbf{y}})=\arg\min_{\mathbf{y},\mathbf{y}'}h(\mathbf{y}')+\tfrac{1}{2\mu}\|\mathbf{y}'-\mathbf{y}\|_{2}^{2}+\tfrac{\beta}{2}\|\mathbf{y}-\mathbf{b}\|_{2}^{2}$. We have: (a) $\bar{\mathbf{y}}=\frac{\breve{\mathbf{y}}+\mu\beta\mathbf{b}}{1+\mu\beta}$, where $\breve{\mathbf{y}}=\arg\min_{\mathbf{y}}h(\mathbf{y})+\tfrac{\beta}{2(1+\mu\beta)}\|\mathbf{y}-\mathbf{b}\|_{2}^{2}=\operatorname{\mathbb{P}}_{[\mu+1/\beta]}(\mathbf{b})$. (b) $\beta(\mathbf{b}-\bar{\mathbf{y}})\in\partial h(\breve{\mathbf{y}})$. (c) $\|\bar{\mathbf{y}}-\breve{\mathbf{y}}\|\leq\mu C_{h}$.
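The next sketch numerically verifies the closed form of Lemma 2.5(a) for $h=\|\cdot\|_{1}$, comparing $\bar{\mathbf{y}}=(\breve{\mathbf{y}}+\mu\beta\mathbf{b})/(1+\mu\beta)$ with a direct minimization of $h_{\mu}(\mathbf{y})+\tfrac{\beta}{2}\|\mathbf{y}-\mathbf{b}\|_{2}^{2}$ (the `scipy` call serves only as a sanity check):

```python
import numpy as np
from scipy.optimize import minimize

def prox_l1(y, mu):
    return np.sign(y) * np.maximum(np.abs(y) - mu, 0.0)

def moreau_l1(y, mu):
    w = prox_l1(y, mu)
    return np.sum(np.abs(w)) + np.linalg.norm(w - y)**2 / (2 * mu)

mu, beta = 0.2, 10.0                       # satisfies beta > 1/mu, as Lemma 2.5 requires
b = np.array([0.9, -0.03, 2.0])

# Closed form of Lemma 2.5(a): y_breve = P_{mu+1/beta}(b), y_bar = (y_breve + mu*beta*b)/(1+mu*beta).
y_breve = prox_l1(b, mu + 1.0 / beta)
y_bar = (y_breve + mu * beta * b) / (1.0 + mu * beta)

# Direct numerical minimization of h_mu(y) + (beta/2)||y - b||^2 as a sanity check.
obj = lambda v: moreau_l1(v, mu) + 0.5 * beta * np.linalg.norm(v - b)**2
y_num = minimize(obj, b, method="L-BFGS-B").x
print(np.max(np.abs(y_bar - y_num)))       # close to 0: both solutions agree
```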

Remark 2.6.

(i) Lemmas 2.3 and 2.4 presented in this paper are novel. (ii) The upper bound in Lemma 2.3 is slightly better than the bound established in Lemma 4.1 of Böhm & Wright (2021). (iii) Lemma 2.5 is critical to our algorithm development and theoretical analysis.

2.2 Manifold Optimization

We define the ϵ\epsilon-stationary point of Problem (1) as follows.

Definition 2.7.

(First-Order Optimality Conditions, Chen et al. (2020); Li et al. (2022); Beck & Rosset (2023)) A solution $(\ddot{\mathbf{X}},\ddot{\mathbf{y}},\ddot{\mathbf{z}})$ with $\ddot{\mathbf{X}}\in\mathcal{M}$ is called an $\epsilon$-stationary point of Problem (1) if $\operatorname{Crit}(\ddot{\mathbf{X}},\ddot{\mathbf{y}},\ddot{\mathbf{z}})\leq\epsilon$, where $\operatorname{Crit}(\mathbf{X},\mathbf{y},\mathbf{z})\triangleq\|\mathcal{A}(\mathbf{X})-\mathbf{y}\|+\|\partial h(\mathbf{y})-\mathbf{z}\|+\|\operatorname{Proj}_{\mathbf{T}_{\mathbf{X}}\mathcal{M}}(\nabla f(\mathbf{X})-\partial g(\mathbf{X})+\mathcal{A}^{\mathsf{T}}(\mathbf{z}))\|_{\mathsf{F}}$. Here, according to Absil et al. (2008a), for all $\mathbf{X}\in\mathcal{M}$ and $\bm{\Delta}\in\mathbb{R}^{n\times r}$, we have $\operatorname{Proj}_{\mathbf{T}_{\mathbf{X}}\mathcal{M}}(\bm{\Delta})=\bm{\Delta}-\tfrac{1}{2}\mathbf{X}(\bm{\Delta}^{\mathsf{T}}\mathbf{X}+\mathbf{X}^{\mathsf{T}}\bm{\Delta})$.
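The tangent-space projection in Definition 2.7 is a two-line computation; the sketch below implements it and checks the tangency condition $\mathbf{X}^{\mathsf{T}}\mathbf{V}+\mathbf{V}^{\mathsf{T}}\mathbf{X}=\mathbf{0}$ on random data:

```python
import numpy as np

def proj_tangent(X, Delta):
    # Proj_{T_X M}(Delta) = Delta - 0.5 * X * (Delta^T X + X^T Delta)  (Absil et al., 2008a).
    return Delta - 0.5 * X @ (Delta.T @ X + X.T @ Delta)

n, r = 6, 3
X = np.linalg.qr(np.random.randn(n, r))[0]     # a point on the Stiefel manifold
V = proj_tangent(X, np.random.randn(n, r))
# Tangency condition for the Stiefel manifold: X^T V + V^T X = 0 (up to round-off).
print(np.max(np.abs(X.T @ V + V.T @ X)))       # ~1e-16
```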

The proposed algorithm is an iterative procedure. After shifting the current iterate $\mathbf{X}\in\mathcal{M}$ along the search direction, the resulting point may no longer reside on $\mathcal{M}$; we must therefore retract it onto $\mathcal{M}$ to form the next iterate. The following definition is useful in this context.

Definition 2.8.

A retraction on $\mathcal{M}$ is a smooth map Absil et al. (2008a) $\operatorname{Retr}_{\mathbf{X}}(\bm{\Delta})\in\mathcal{M}$, with $\mathbf{X}\in\mathcal{M}$ and $\bm{\Delta}\in\mathbb{R}^{n\times r}$, satisfying $\operatorname{Retr}_{\mathbf{X}}(\mathbf{0})=\mathbf{X}$ and $\lim_{\mathbf{T}_{\mathbf{X}}\mathcal{M}\ni\bm{\Delta}\rightarrow\mathbf{0}}\tfrac{\|\operatorname{Retr}_{\mathbf{X}}(\bm{\Delta})-\mathbf{X}-\bm{\Delta}\|_{\mathsf{F}}}{\|\bm{\Delta}\|_{\mathsf{F}}}=0$ for any $\mathbf{X}\in\mathcal{M}$.

Remark 2.9.

Several retractions on the Stiefel manifold have been explored in the literature Absil & Malick (2012); Absil et al. (2008b). We present two examples below. (i) Polar decomposition-based retraction: $\operatorname{Retr}_{\mathbf{X}}(\bm{\Delta})=(\mathbf{X}+\bm{\Delta})(\mathbf{I}_{r}+\bm{\Delta}^{\mathsf{T}}\bm{\Delta})^{-1/2}$. (ii) QR decomposition-based retraction: $\operatorname{Retr}_{\mathbf{X}}(\bm{\Delta})=\operatorname{qf}(\mathbf{X}+\bm{\Delta})$, where $\operatorname{qf}(\mathbf{X})$ is the $\mathbf{Q}$-factor in the thin QR decomposition of $\mathbf{X}$.
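Both retractions of Remark 2.9 amount to a few lines of linear algebra. A minimal sketch follows; the input direction is first projected onto the tangent space, as required for the polar formula to return a feasible point:

```python
import numpy as np

def retr_polar(X, Delta):
    # Polar decomposition-based retraction: (X + Delta)(I + Delta^T Delta)^{-1/2}.
    w, V = np.linalg.eigh(np.eye(X.shape[1]) + Delta.T @ Delta)
    return (X + Delta) @ (V @ np.diag(w ** -0.5) @ V.T)

def retr_qr(X, Delta):
    # QR decomposition-based retraction: the Q-factor of the thin QR of X + Delta,
    # with signs fixed so the R-factor has a positive diagonal (the qf convention).
    Q, R = np.linalg.qr(X + Delta)
    return Q * np.sign(np.diag(R))

n, r = 6, 3
X = np.linalg.qr(np.random.randn(n, r))[0]
D = 0.1 * np.random.randn(n, r)
D = D - 0.5 * X @ (D.T @ X + X.T @ D)           # project D onto the tangent space first
for retr in (retr_polar, retr_qr):
    Y = retr(X, D)
    print(np.max(np.abs(Y.T @ Y - np.eye(r))))  # ~1e-16: the iterate stays on M
```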

The following lemma concerning the retraction operator is useful for our subsequent analysis.

Lemma 2.10.

(Boumal et al. (2019)) Let $\mathbf{X}\in\mathcal{M}$ and $\bm{\Delta}\in\mathbf{T}_{\mathbf{X}}\mathcal{M}$. There exist positive constants $\{\dot{k},\ddot{k}\}$ such that $\|\operatorname{Retr}_{\mathbf{X}}(\bm{\Delta})-\mathbf{X}\|_{\mathsf{F}}\leq\dot{k}\|\bm{\Delta}\|_{\mathsf{F}}$ and $\|\operatorname{Retr}_{\mathbf{X}}(\bm{\Delta})-\mathbf{X}-\bm{\Delta}\|_{\mathsf{F}}\leq\tfrac{1}{2}\ddot{k}\|\bm{\Delta}\|_{\mathsf{F}}^{2}$.

Furthermore, we present the following three insightful lemmas.

Lemma 2.11.

(Proof in Appendix B.4) Let $\mathbf{X}\in\mathcal{M}$ and $\bm{\Delta}\in\mathbb{R}^{n\times r}$. Then $\|\operatorname{Proj}_{\mathbf{T}_{\mathbf{X}}\mathcal{M}}(\bm{\Delta})\|_{\mathsf{F}}\leq\|\bm{\Delta}\|_{\mathsf{F}}$.

Lemma 2.12.

(Proof in Appendix B.5) Let $\rho>0$, $\mathbf{G}\in\mathbb{R}^{n\times r}$, and $\mathbf{X}\in\mathcal{M}$. We define $\mathbb{G}_{\rho}\triangleq\mathbf{G}-\rho\mathbf{X}\mathbf{G}^{\mathsf{T}}\mathbf{X}-(1-\rho)\mathbf{X}\mathbf{X}^{\mathsf{T}}\mathbf{G}$. It follows that: (a) $\max(1,2\rho)\cdot\langle\mathbf{G},\mathbb{G}_{\rho}\rangle\geq\|\mathbb{G}_{\rho}\|_{\mathsf{F}}^{2}\geq\min(1,\rho^{2})\|\mathbb{G}_{1}\|_{\mathsf{F}}^{2}$. (b) $\min(1,2\rho)\|\mathbb{G}_{1/2}\|_{\mathsf{F}}\leq\|\mathbb{G}_{\rho}\|_{\mathsf{F}}\leq\max(1,2\rho)\|\mathbb{G}_{1/2}\|_{\mathsf{F}}$.

Lemma 2.13.

(Proof in Appendix B.6) Consider the optimization problem $\min_{\mathbf{X}\in\mathcal{M}}f(\mathbf{X})$, where $f(\mathbf{X})$ is differentiable. For all $\mathbf{X}\in\mathcal{M}$, we have: $\operatorname{dist}(\mathbf{0},\partial\mathcal{I}_{\mathcal{M}}(\mathbf{X})+\nabla f(\mathbf{X}))\leq\|\nabla f(\mathbf{X})-\mathbf{X}\nabla f(\mathbf{X})^{\mathsf{T}}\mathbf{X}\|_{\mathsf{F}}$.

Remark 2.14.

The matrix $-\mathbb{G}_{\rho}\in\mathbb{R}^{n\times r}$ in Lemma 2.12 is closely related to the search descent direction of the proposed OADMM-RR algorithm. While one can set $\rho$ to typical values such as $1$ or $1/2$, we consider the setting $\rho\in(0,\infty)$ to enhance the versatility of OADMM-RR, aligning with Liu et al. (2016); Jiang & Dai (2015).
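A small sketch computing $\mathbb{G}_{\rho}$ and checking the descent property of Lemma 2.12(a) on random data (the tolerance absorbs round-off; equality can be attained, e.g., at $\rho=1/2$):

```python
import numpy as np

def G_rho(X, G, rho):
    # The direction matrix of Lemma 2.12: G - rho*X*G^T*X - (1-rho)*X*X^T*G.
    return G - rho * X @ G.T @ X - (1.0 - rho) * X @ X.T @ G

n, r = 6, 3
X = np.linalg.qr(np.random.randn(n, r))[0]
G = np.random.randn(n, r)
for rho in (0.5, 1.0, 2.0):
    Gr = G_rho(X, G, rho)
    lhs = max(1.0, 2*rho) * np.sum(G * Gr)       # max(1, 2*rho) * <G, G_rho>
    rhs = np.linalg.norm(Gr, 'fro')**2           # ||G_rho||_F^2
    print(rho, lhs >= rhs - 1e-10)               # Lemma 2.12(a): prints True
```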

3 The Proposed OADMM Algorithm

This section presents the proposed OADMM algorithm for solving Problem (1), featuring two variants: one based on Euclidean projection (OADMM-EP) and the other on Riemannian retraction (OADMM-RR).

Using the Moreau envelope smoothing technique, we consider the following optimization problem:

$$\min_{\mathbf{X},\mathbf{y}}\,f(\mathbf{X})-g(\mathbf{X})+h_{\mu}(\mathbf{y})+\mathcal{I}_{\mathcal{M}}(\mathbf{X}),\quad\text{s.t.}~\mathcal{A}(\mathbf{X})=\mathbf{y},\qquad(4)$$

where $\mu\rightarrow 0$ and $h_{\mu}(\mathbf{y})$ is the Moreau envelope of $h(\mathbf{y})$. Importantly, $h_{\mu}(\mathbf{y})$ is $(\mu^{-1})$-smooth when $\mu\leq\frac{1}{2W_{h}}$, according to Lemma 2.2. It is worth noting that similar smoothing techniques have been used in the design of augmented Lagrangian methods Zeng et al. (2022), minimax optimization Zhang et al. (2020a), and ADMMs Li et al. (2022). We define the augmented Lagrangian function of Problem (4) as follows:

$$\mathcal{L}(\mathbf{X},\mathbf{y};\mathbf{z};\beta,\mu)=\underbrace{f(\mathbf{X})+\langle\mathbf{z},\mathcal{A}(\mathbf{X})-\mathbf{y}\rangle+\tfrac{\beta}{2}\|\mathcal{A}(\mathbf{X})-\mathbf{y}\|_{2}^{2}}_{\triangleq\,\mathcal{S}(\mathbf{X},\mathbf{y};\mathbf{z};\beta)}-g(\mathbf{X})+h_{\mu}(\mathbf{y})+\mathcal{I}_{\mathcal{M}}(\mathbf{X}).\qquad(5)$$

Here, $\mathbf{z}$ is the dual variable for the equality constraint, $\mu$ is the smoothing parameter linked to the function $h(\mathbf{y})$, $\beta$ is the penalty parameter associated with the equality constraint, and $\mathcal{I}_{\mathcal{M}}(\mathbf{X})$ is the indicator function of the set $\mathcal{M}$.
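For reference, the following minimal sketch evaluates the smoothed augmented Lagrangian (5) at a feasible $\mathbf{X}\in\mathcal{M}$ (so the indicator term vanishes); here $h=\|\cdot\|_{1}$ is a placeholder nonsmooth term, and $f$, $g$, $\mathcal{A}$ are assumed user-supplied callables:

```python
import numpy as np

def prox_h(y, mu):                     # proximal map of h = ||.||_1 (soft-thresholding)
    return np.sign(y) * np.maximum(np.abs(y) - mu, 0.0)

def h_mu(y, mu):                       # Moreau envelope of h (Definition 2.1)
    w = prox_h(y, mu)
    return np.sum(np.abs(w)) + np.linalg.norm(w - y)**2 / (2 * mu)

def aug_lagrangian(X, y, z, beta, mu, f, g, A):
    # L(X, y; z; beta, mu) = S(X, y; z; beta) - g(X) + h_mu(y), with
    # S(X, y; z; beta) = f(X) + <z, A(X) - y> + (beta/2)||A(X) - y||^2.
    res = A(X) - y
    return f(X) + z @ res + 0.5 * beta * (res @ res) - g(X) + h_mu(y, mu)

# Usage with a toy smooth term and A = vec:
X = np.linalg.qr(np.random.randn(6, 3))[0]
val = aug_lagrangian(X, X.ravel(), np.zeros(18), 10.0, 0.1,
                     f=lambda X: 0.5 * np.linalg.norm(X)**2,
                     g=lambda X: 0.0, A=lambda X: X.ravel())
```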

In simple terms, OADMM updates are performed by minimizing the augmented Lagrangian function $\mathcal{L}(\mathbf{X},\mathbf{y},\mathbf{z};\beta,\mu)$ over the primal variables $\{\mathbf{X}^{t},\mathbf{y}^{t}\}$ at each iteration, while keeping all other primal and dual variables fixed. The dual variables are updated using gradient ascent on the dual problem.

For updating the primal variable $\mathbf{X}$, we use different strategies, resulting in distinct variants of OADMM. We first observe that the function $\mathcal{S}(\mathbf{X},\mathbf{y}^{t};\mathbf{z}^{t};\beta^{t})$ is $\ell(\beta^{t})$-smooth w.r.t. $\mathbf{X}$, where $\ell(\beta^{t})\triangleq\beta^{t}\overline{\rm A}^{2}+L_{f}$. In OADMM-EP, we adopt a proximal linearized method based on Euclidean projection Lai & Osher (2014), while in OADMM-RR, we apply line-search methods on the Stiefel manifold Liu et al. (2016).

Initialization:
  Choose $\{\mathbf{X}^{0},\mathbf{y}^{0},\mathbf{z}^{0}\}$. Choose $p\in(0,1)$, $\xi\in(0,\infty)$, $\theta\in(1,\infty)$, $\sigma\in[1,2)$.
  Choose $\chi\in(1+4\omega\ddot{\sigma},\infty)$, where $\omega\triangleq\tfrac{1}{\sigma}+\tfrac{\xi}{2\sigma^{2}}+\tfrac{\varepsilon_{z}}{\sigma^{2}}$, $\ddot{\sigma}\triangleq(\sigma/(2-\sigma))^{2}$, and $\varepsilon_{z}=\xi$.
  Choose $\beta^{0}$ sufficiently large such that $\beta^{0}\geq 2\chi W_{h}$.
  For OADMM-EP, choose $\alpha\in[0,\tfrac{\theta-1}{(\theta+1)(\xi+2)})$.
  For OADMM-RR, choose $\alpha=0$, $\rho\in(0,\infty)$, $\gamma\in(0,1)$, $\delta\in(0,\tfrac{1}{\max(1,2\rho)})$.
for $t$ from $0$ to $T$ do
  S1) Set $\beta^{t}=\beta^{0}(1+\xi t^{p})$ and $\mu^{t}=\chi/\beta^{t}$.
  S2) Update the primal variable $\mathbf{X}$:
  if OADMM-EP then
    Set $\mathbf{X}_{\sf c}^{t}=\mathbf{X}^{t}+\alpha(\mathbf{X}^{t}-\mathbf{X}^{t-1})$ and $\mathbf{G}^{t}\in\nabla_{\mathbf{X}}\mathcal{S}(\mathbf{X}_{\sf c}^{t},\mathbf{y}^{t};\mathbf{z}^{t};\beta^{t})-\partial g(\mathbf{X}^{t})$.
    $\mathbf{X}^{t+1}\in\arg\min_{\mathbf{X}\in\mathcal{M}}\langle\mathbf{X}-\mathbf{X}^{t},\mathbf{G}^{t}\rangle+\tfrac{\theta\ell(\beta^{t})}{2}\|\mathbf{X}-\mathbf{X}_{\sf c}^{t}\|_{\mathsf{F}}^{2}$, where $\ell(\beta^{t})\triangleq\beta^{t}\overline{\rm A}^{2}+L_{f}$.
  end if
  if OADMM-RR then
    Set $\mathbf{G}^{t}\in\nabla_{\mathbf{X}}\mathcal{S}(\mathbf{X}^{t},\mathbf{y}^{t};\mathbf{z}^{t};\beta^{t})-\partial g(\mathbf{X}^{t})$ and $\dot{\mathcal{L}}(\mathbf{X})\triangleq L(\mathbf{X},\mathbf{y}^{t};\mathbf{z}^{t};\beta^{t},\mu^{t})$. Set $\mathbb{G}_{\rho}^{t}\triangleq\mathbf{G}^{t}-\rho\mathbf{X}^{t}[\mathbf{G}^{t}]^{\mathsf{T}}\mathbf{X}^{t}-(1-\rho)\mathbf{X}^{t}[\mathbf{X}^{t}]^{\mathsf{T}}\mathbf{G}^{t}$. Set $b^{t}\in(\underline{b},\overline{b})$ as the BB step size, where $\underline{b},\overline{b}\in(0,\infty)$. Set $\mathbf{X}^{t+1}=\operatorname{Retr}_{\mathbf{X}^{t}}(-\eta^{t}\mathbb{G}_{\rho}^{t})$, where $\eta^{t}\triangleq\tfrac{b^{t}\gamma^{j}}{\beta^{t}}$ and $j\in\{0,1,2,\ldots\}$ is the smallest integer such that $\dot{\mathcal{L}}(\operatorname{Retr}_{\mathbf{X}^{t}}(-\eta^{t}\mathbb{G}_{\rho}^{t}))-\dot{\mathcal{L}}(\mathbf{X}^{t})\leq-\delta\eta^{t}\|\mathbb{G}_{\rho}^{t}\|_{\mathsf{F}}^{2}$.
  end if
  S3) Update the primal variable $\mathbf{y}$: $\mathbf{y}^{t+1}=\arg\min_{\mathbf{y}}h_{\mu^{t}}(\mathbf{y})+\tfrac{\beta^{t}}{2}\|\mathbf{y}-\mathbf{b}\|_{2}^{2}$, where $\mathbf{b}\triangleq\mathbf{y}^{t}-\tfrac{1}{\beta^{t}}\nabla_{\mathbf{y}}\mathcal{S}(\mathbf{X}^{t+1},\mathbf{y}^{t},\mathbf{z}^{t};\beta^{t})$. It is solved in closed form as $\mathbf{y}^{t+1}=\tfrac{\breve{\mathbf{y}}^{t+1}+\mu^{t}\beta^{t}\mathbf{b}}{1+\mu^{t}\beta^{t}}$, where $\breve{\mathbf{y}}^{t+1}=\operatorname{\mathbb{P}}_{[\mu^{t}+1/\beta^{t}]}(\mathbf{b})$.
  S4) Update the dual variable $\mathbf{z}$: $\mathbf{z}^{t+1}=\mathbf{z}^{t}+\sigma\beta^{t}(\mathcal{A}(\mathbf{X}^{t+1})-\mathbf{y}^{t+1})$.
end for
Algorithm 1 OADMM: The Proposed ADMM for Solving Problem (1).

We detail the iteration steps of OADMM in Algorithm 1 and make the following remarks.

(a) To achieve possibly faster dual convergence, we apply an over-relaxation step size with $\sigma\in(1,2)$ for updating the dual variable $\mathbf{z}$, as suggested by previous studies Gonçalves et al. (2017); Yang et al. (2017); Li et al. (2016; 2023).

(b) To accelerate primal convergence in OADMM-EP, we incorporate a Nesterov extrapolation strategy with parameter $\alpha\in(0,1)$.

(c) To enhance primal convergence in OADMM-RR, we use a Monotone Barzilai-Borwein (MBB) strategy Wen & Yin (2013) with a dynamically adjusted parameter $b^{t}$ to capture the problem's curvature. (Following Wen & Yin (2013), one can set $b^{t}=\langle\mathbf{S}^{t},\mathbf{S}^{t}\rangle/\langle\mathbf{S}^{t},\mathbf{Z}^{t}\rangle$ or $b^{t}=\langle\mathbf{S}^{t},\mathbf{Z}^{t}\rangle/\langle\mathbf{Z}^{t},\mathbf{Z}^{t}\rangle$, where $\mathbf{S}^{t}=\mathbf{X}^{t}-\mathbf{X}^{t-1}$ and $\mathbf{Z}^{t}=\mathbb{G}_{1}^{t-1}-\mathbb{G}_{1}^{t}$, with $\mathbb{G}_{1}^{t}$ being the Riemannian gradient.) The parameters $\{\gamma,\delta\}$ represent the decay rate and the sufficient decrease parameter, commonly used in line search procedures Chen et al. (2020).

(d) The $\mathbf{X}$-subproblem is solved as $\mathbf{X}^{t+1}=\arg\min_{\mathbf{X}\in\mathcal{M}}\|\mathbf{X}-\mathbf{X}'\|_{\mathsf{F}}^{2}=\dot{\mathbf{U}}\dot{\mathbf{V}}^{\mathsf{T}}$, where $\mathbf{X}'=\mathbf{X}_{\sf c}^{t}-\mathbf{G}^{t}/(\theta\ell(\beta^{t}))$ and $\dot{\mathbf{U}}\operatorname{diag}(\dot{\mathbf{x}})\dot{\mathbf{V}}^{\mathsf{T}}=\mathbf{X}'$ is the singular value decomposition of $\mathbf{X}'$ (see the sketch after this list).

(e) The $\mathbf{y}$-subproblem can be solved using the result from Lemma 2.5.

(f) For practical implementation, we recommend the following default parameters: $p=1/3$, $\theta=1.01$, $\sigma=1.1$, $\rho=1$, $\gamma=1/2$, $\delta=10^{-3}$, $\xi=1$, $\alpha=\frac{\theta-1}{(\theta+1)(\xi+2)}-10^{-12}$.
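Putting the pieces together, below is a minimal, self-contained sketch of OADMM-EP on a toy instance of Problem (1) with $f(\mathbf{X})=\tfrac{1}{2}\|\mathbf{X}-\mathbf{C}\|_{\mathsf{F}}^{2}$ (so $L_{f}=1$), $g\equiv 0$, $h=\rho\|\cdot\|_{1}$, and $\mathcal{A}=\operatorname{vec}$ (so $\overline{\rm A}=1$); parameters follow remark (f). This is an illustration under these assumptions, not the authors' implementation:

```python
import numpy as np

np.random.seed(0)
n, r, rho = 20, 5, 0.1                  # problem sizes and l1 weight (all illustrative)
C = np.random.randn(n, r)               # f(X) = 0.5*||X - C||_F^2, so L_f = 1

# Default parameters from remark (f); chi is chosen to satisfy chi > 1 + 4*omega*sigma_ddot.
p, xi, theta, sigma = 1/3, 1.0, 1.01, 1.1
alpha = (theta - 1) / ((theta + 1) * (xi + 2)) - 1e-12
omega = 1/sigma + xi/(2*sigma**2) + xi/sigma**2   # eps_z = xi
sig_dd = (sigma / (2 - sigma))**2
chi = 1 + 4*omega*sig_dd + 1.0
beta0 = 10.0                            # h is convex (W_h = 0), so beta0 >= 2*chi*W_h holds

A = lambda X: X.ravel()                 # A = vec, so Abar = 1 and A^T is reshaping
AT = lambda z: z.reshape(n, r)
prox_h = lambda v, t: np.sign(v) * np.maximum(np.abs(v) - rho * t, 0.0)  # prox of t*h

def proj_stiefel(M):
    # argmin_{X in M} ||X - M||_F^2 = U V^T, via the thin SVD of M (remark (d)).
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ Vt

X = proj_stiefel(np.random.randn(n, r)); X_prev = X.copy()
y = A(X); z = np.zeros(n * r)

for t in range(1, 500):
    beta = beta0 * (1 + xi * t**p); mu = chi / beta            # S1)
    ell = beta * 1.0 + 1.0                                     # ell(beta) = beta*Abar^2 + L_f
    Xc = X + alpha * (X - X_prev)                              # S2) extrapolation
    G = (Xc - C) + AT(z + beta * (A(Xc) - y))                  # grad_X S(Xc, y; z; beta)
    X_prev, X = X, proj_stiefel(Xc - G / (theta * ell))
    b = A(X) + z / beta                                        # S3) b = y - grad_y S / beta
    y_breve = prox_h(b, mu + 1/beta)                           # Lemma 2.5(a)
    y = (y_breve + mu * beta * b) / (1 + mu * beta)
    z = z + sigma * beta * (A(X) - y)                          # S4) over-relaxed dual step

print(np.max(np.abs(X.T @ X - np.eye(r))))                     # feasibility: ~1e-15
```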

4 Oracle Complexity

This section details the oracle complexity of Algorithm 1.

We define $\varepsilon_{z}=\xi$, $\varepsilon_{y}\triangleq\tfrac{1}{2}(1-\tfrac{1+4\omega\ddot{\sigma}}{\chi})$, $\dot{\sigma}\triangleq(\sigma-1)/(2-\sigma)$, $\ddot{\sigma}\triangleq(\sigma/(2-\sigma))^{2}$, and $\omega\triangleq\tfrac{1}{\sigma}+\tfrac{\xi}{2\sigma^{2}}+\tfrac{\varepsilon_{z}}{\sigma^{2}}$.

We define the potential function (or Lyapunov function), for all $t\geq 1$, as follows:

$$\Theta^{t}\triangleq\Theta(\mathbf{X}^{t},\mathbf{X}^{t-1},\mathbf{y}^{t},\mathbf{z}^{t};\beta^{t},\beta^{t-1},\mu^{t-1},t)\triangleq L(\mathbf{X}^{t},\mathbf{y}^{t},\mathbf{z}^{t};\beta^{t},\mu^{t-1})+\mu^{t-1}C_{h}^{2}+\mathbb{T}^{t}+\mathbb{Z}^{t}+\mathbb{X}^{t},\qquad(6)$$

where $\mathbb{T}^{t}\triangleq\tfrac{4\omega\ddot{\sigma}}{\beta^{0}}C_{h}^{2}\tfrac{1}{t}$, $\mathbb{Z}^{t}\triangleq\omega\dot{\sigma}\sigma^{2}\beta^{t-1}\|\mathcal{A}(\mathbf{X}^{t})-\mathbf{y}^{t}\|_{2}^{2}$, and $\mathbb{X}^{t}\triangleq\tfrac{\alpha(\theta+1)\ell(\beta^{t})}{2}\|\mathbf{X}^{t}-\mathbf{X}^{t-1}\|_{\mathsf{F}}^{2}$.

Additionally, we define:

$$e^{t}\triangleq\begin{cases}\|\mathbf{y}^{t}-\mathbf{y}^{t-1}\|+\|\mathcal{A}(\mathbf{X}^{t})-\mathbf{y}^{t}\|+\|\mathbf{X}^{t}-\mathbf{X}^{t-1}\|_{\mathsf{F}},&\text{OADMM-EP};\\ \|\mathbf{y}^{t}-\mathbf{y}^{t-1}\|+\|\mathcal{A}(\mathbf{X}^{t})-\mathbf{y}^{t}\|+\|\tfrac{1}{\beta^{t}}\mathbb{G}_{1/2}^{t-1}\|_{\mathsf{F}},&\text{OADMM-RR}.\end{cases}\qquad(9)$$

We have the following useful lemma, derived using the first-order optimality condition of $\mathbf{y}^{t+1}$.

Lemma 4.1.

(Proof in Section C.1, Bounding Dual using Primal) We have: (a) For all $t\geq 0$, $\mathbf{z}^{t}-\tfrac{1}{\sigma}(\mathbf{z}^{t}-\mathbf{z}^{t+1})=\nabla h_{\mu^{t}}(\mathbf{y}^{t+1})\in\partial h(\breve{\mathbf{y}}^{t+1})$. (b) For all $t\geq 1$, $\|\mathbf{z}^{t+1}-\mathbf{z}^{t}\|_{2}^{2}\leq\dot{\sigma}(\|\mathbf{z}^{t}-\mathbf{z}^{t-1}\|_{2}^{2}-\|\mathbf{z}^{t+1}-\mathbf{z}^{t}\|_{2}^{2})+2\ddot{\sigma}(\beta^{t}/\chi)^{2}\|\mathbf{y}^{t+1}-\mathbf{y}^{t}\|_{2}^{2}+2\ddot{\sigma}C_{h}^{2}(\tfrac{2}{t}-\tfrac{2}{t+1})$.

Remark 4.2.

Here, for OADMM-RR, we set $\alpha=0$, resulting in $\mathbb{X}^{t}=0$ for all $t$. With the choice $\sigma=1$, we have $\nabla h_{\mu^{t-1}}(\mathbf{y}^{t})=\mathbf{z}^{t}$ and $\|\mathbf{z}^{t+1}-\mathbf{z}^{t}\|\leq\|\nabla h_{\mu^{t}}(\mathbf{y}^{t+1})-\nabla h_{\mu^{t-1}}(\mathbf{y}^{t})\|$.

Lemma 4.3.

(Proof in Appendix C.2) (a) It holds that $\beta^{t+1}\leq\beta^{t}(1+\xi)$. (b) There exist constants $\{\overline{\ell},\underline{\ell}\}$ such that $\beta^{t}\underline{\ell}\leq\ell(\beta^{t})\leq\beta^{t}\overline{\ell}$.

The subsequent lemma demonstrates that the sequence $\{\Theta^{t}\}_{t=1}^{\infty}$ is always lower bounded.

Lemma 4.4.

(Proof in Section C.3) For all $t\geq 1$, there exist constants $\{\overline{\rm X},\overline{\rm z},\overline{\rm y},\underline{\Theta}\}$ such that $\|\mathbf{X}^{t}\|_{\mathsf{F}}\leq\overline{\rm X}$, $\|\mathbf{z}^{t}\|\leq\overline{\rm z}$, $\|\mathbf{y}^{t}\|\leq\overline{\rm y}$, and $\Theta^{t}\geq\underline{\Theta}$.

The following lemma is useful for our subsequent analysis, applicable to both OADMM-EP and OADMM-RR.

Lemma 4.5.

(Proof in Appendix C.4, Sufficient Decrease for Variables $\{\mathbf{y},\mathbf{z},\beta,\mu\}$) We have: $L(\mathbf{X}^{t+1},\mathbf{y}^{t+1},\mathbf{z}^{t+1};\beta^{t+1},\mu^{t})-L(\mathbf{X}^{t+1},\mathbf{y}^{t},\mathbf{z}^{t};\beta^{t},\mu^{t-1})+(\mu^{t}-\mu^{t-1})C_{h}^{2}+\mathbb{T}^{t+1}-\mathbb{T}^{t}+\mathbb{Z}^{t+1}-\mathbb{Z}^{t}+\varepsilon_{z}\beta^{t}\|\mathcal{A}(\mathbf{X}^{t+1})-\mathbf{y}^{t+1}\|_{2}^{2}\leq-\varepsilon_{y}\beta^{t}\|\mathbf{y}^{t+1}-\mathbf{y}^{t}\|_{2}^{2}$.

In the remaining content of this section, we provide separate analyses for OADMM-EP and OADMM-RR.

4.1 Analysis for OADMM-EP

Using the optimality condition of $\mathbf{X}^{t+1}$, we derive the following lemma.

Lemma 4.6.

(Proof in Appendix C.5, Sufficient Decrease for Variable $\mathbf{X}$) We define $\varepsilon_{x}\triangleq\tfrac{1}{2}\varepsilon_{x}'\underline{\ell}$, where $\varepsilon_{x}'\triangleq\theta-1-\alpha(2+\xi)(1+\theta)>0$. We have: $L(\mathbf{X}^{t+1},\mathbf{y}^{t},\mathbf{z}^{t};\beta^{t},\mu^{t-1})-L(\mathbf{X}^{t},\mathbf{y}^{t},\mathbf{z}^{t};\beta^{t},\mu^{t-1})\leq-\varepsilon_{x}\beta^{t}\|\mathbf{X}^{t+1}-\mathbf{X}^{t}\|_{\mathsf{F}}^{2}+\mathbb{X}^{t}-\mathbb{X}^{t+1}$.

Combining the results from Lemmas 4.5, and 4.6, we arrive at the following lemma.

Lemma 4.7.

(Proof in Appendix C.6) We have: (a) $\beta^{t}\{\varepsilon_{z}\|\mathcal{A}(\mathbf{X}^{t+1})-\mathbf{y}^{t+1}\|_{2}^{2}+\varepsilon_{y}\|\mathbf{y}^{t+1}-\mathbf{y}^{t}\|_{2}^{2}+\varepsilon_{x}\|\mathbf{X}^{t+1}-\mathbf{X}^{t}\|_{\mathsf{F}}^{2}\}\leq\Theta^{t}-\Theta^{t+1}$. (b) $\tfrac{1}{T}\sum_{t=1}^{T}\beta^{t}e^{t+1}\leq\mathcal{O}(T^{(p-1)/2})$.

Finally, we have the following theorem regarding the oracle complexity of OADMM-EP.

Theorem 4.8.

(Proof in Appendix C.7) Let $p=1/3$. We have $\tfrac{1}{T}\sum_{t=1}^{T}\operatorname{Crit}(\mathbf{X}^{t+1},\breve{\mathbf{y}}^{t+1},\mathbf{z}^{t+1})\leq\mathcal{O}(T^{-1/3})$. In other words, there exists $\bar{t}\leq T$ such that $\operatorname{Crit}(\mathbf{X}^{\bar{t}+1},\breve{\mathbf{y}}^{\bar{t}+1},\mathbf{z}^{\bar{t}+1})\leq\epsilon$, provided that $T\geq\mathcal{O}(1/\epsilon^{3})$.

Remark 4.9.

The oracle complexity of OADMM-EP matches the best-known complexity to date Beck & Rosset (2023); Böhm & Wright (2021).

4.2 Analysis for OADMM-RR

Using the properties of the line search procedure for updating the variable $\mathbf{X}^{t+1}$, we deduce the following lemma.

Lemma 4.10.

(Proof in Appendix C.8, Sufficient Decrease for Variable $\mathbf{X}$) We define $\varepsilon_{x}\triangleq\delta\overline{\gamma}\gamma\underline{b}\min(1,2\rho)^{2}>0$, where $\overline{\gamma}\triangleq 2(1/\max(1,2\rho)-\delta)/(\overline{\ell}\dot{k}\overline{b}+\overline{g}\ddot{k}\overline{b}/\beta^{0})>0$. We have: (a) For any $t\geq 0$, if $j$ is large enough that $\gamma^{j}\in(0,\overline{\gamma})$, then the condition of the line search procedure is satisfied. (b) It follows that $L(\mathbf{X}^{t+1},\mathbf{y}^{t},\mathbf{z}^{t};\beta^{t},\mu^{t})-L(\mathbf{X}^{t},\mathbf{y}^{t},\mathbf{z}^{t};\beta^{t},\mu^{t})\leq-\tfrac{\varepsilon_{x}}{\beta^{t}}\|\mathbb{G}_{1/2}^{t}\|_{\mathsf{F}}^{2}$. Here, $\overline{g}$ is a constant such that $\|\mathbf{G}^{t}\|_{\mathsf{F}}\leq\overline{g}$, $\{\dot{k},\ddot{k}\}$ are defined in Lemma 2.10, and $\{\rho,\gamma,\delta,\overline{b},\underline{b}\}$ are defined in Algorithm 1.

Remark 4.11.

By Lemma 4.10(a), since $\overline{\gamma}$ is a universal constant and $\gamma^{j}$ decreases exponentially, the line search procedure of OADMM-RR terminates within $\log(\overline{\gamma})/\log(\gamma)+1=\mathcal{O}(1)$ steps.

Combining the results from Lemmas 4.5, and 4.10, we obtain the following lemma.

Lemma 4.12.

(Proof in Appendix C.9) We have: (a) $\beta^{t}\{\varepsilon_{z}\|\mathcal{A}(\mathbf{X}^{t+1})-\mathbf{y}^{t+1}\|_{2}^{2}+\varepsilon_{y}\|\mathbf{y}^{t+1}-\mathbf{y}^{t}\|_{2}^{2}+\varepsilon_{x}\|\tfrac{1}{\beta^{t}}\mathbb{G}_{1/2}^{t}\|_{\mathsf{F}}^{2}\}\leq\Theta^{t}-\Theta^{t+1}$. (b) $\tfrac{1}{T}\sum_{t=1}^{T}\beta^{t}e^{t+1}\leq\mathcal{O}(T^{(p-1)/2})$.

Finally, we derive the following theorem on the oracle complexity of OADMM-RR.

Theorem 4.13.

(Proof in Appendix C.10) Let $p=1/3$. We have $\tfrac{1}{T}\sum_{t=1}^{T}\operatorname{Crit}(\mathbf{X}^{t+1},\breve{\mathbf{y}}^{t+1},\mathbf{z}^{t+1})\leq\mathcal{O}(T^{-1/3})$. In other words, there exists $\bar{t}\leq T$ such that $\operatorname{Crit}(\mathbf{X}^{\bar{t}+1},\breve{\mathbf{y}}^{\bar{t}+1},\mathbf{z}^{\bar{t}+1})\leq\epsilon$, provided that $T\geq\mathcal{O}(1/\epsilon^{3})$.

Remark 4.14.

Theorem 4.13 mirrors Theorem 4.8, and OADMM-RR shares the same oracle complexity as OADMM-EP.

5 Convergence Rate

This section establishes the convergence rates of OADMM-EP and OADMM-RR. Our analyses are based on a nonconvex analysis tool, the Kurdyka-Łojasiewicz (KL) inequality Attouch et al. (2010); Bolte et al. (2014); Li & Lin (2015); Li et al. (2023).

We define the Lyapunov function as: $\Theta(\mathbf{X},\mathbf{X}^{-},\mathbf{y},\mathbf{z};\beta,\beta^{-},\mu^{-},t)\triangleq L(\mathbf{X},\mathbf{y},\mathbf{z};\beta,\mu^{-})+\omega\ddot{\sigma}\sigma^{2}\beta^{-}\|\mathcal{A}(\mathbf{X})-\mathbf{y}\|_{2}^{2}+\tfrac{\alpha(\theta+1)\ell(\beta)}{2}\|\mathbf{X}-\mathbf{X}^{-}\|_{\mathsf{F}}^{2}+\tfrac{4\omega\ddot{\sigma}}{\beta^{0}}C_{h}^{2}\tfrac{1}{t}+C_{h}^{2}\mu^{-}$, where we let $\alpha=0$ for OADMM-RR. We define $\mathbbm{w}\triangleq\{\mathbf{X},\mathbf{X}^{-},\mathbf{y},\mathbf{z}\}$, $\mathbbm{w}^{t}\triangleq\{\mathbf{X}^{t},\mathbf{X}^{t-1},\mathbf{y}^{t},\mathbf{z}^{t}\}$, $\mathbbm{u}\triangleq\{\beta,\beta^{-},\mu^{-},t\}$, and $\mathbbm{u}^{t}\triangleq\{\beta^{t},\beta^{t-1},\mu^{t-1},t\}$. Thus, we have $\Theta^{t}=\Theta(\mathbbm{w}^{t};\mathbbm{u}^{t})$. We denote by $\mathbbm{w}^{\infty}$ a limiting point of Algorithm 1.

We make the following additional assumptions.

Assumption 5.1.

(Kurdyka-Łojasiewicz Inequality Attouch et al. (2010)). Consider a semi-algebraic function $\Theta(\mathbbm{w}^{t};\mathbbm{u}^{t})$ w.r.t. $\mathbbm{w}^{t}$ for all $t$, where $\mathbbm{w}^{t}$ is in the effective domain of $\Theta(\mathbbm{w}^{t};\mathbbm{u}^{t})$. There exist $\tilde{\eta}\in(0,+\infty)$, $\tilde{\sigma}\in[0,1)$, a neighborhood $\Upsilon$ of $\mathbbm{w}^{\infty}$, and a continuous and concave desingularization function $\varphi(s)\triangleq\tilde{c}s^{1-\tilde{\sigma}}$ with $\tilde{c}>0$ and $s\in[0,\tilde{\eta})$ such that, for all $\mathbbm{w}^{t}\in\Upsilon$ satisfying $\Theta(\mathbbm{w}^{t};\mathbbm{u}^{t})-\Theta(\mathbbm{w}^{\infty};\mathbbm{u}^{\infty})\in(0,\tilde{\eta})$, it holds that $\varphi'(\Theta(\mathbbm{w}^{t};\mathbbm{u}^{t})-\Theta(\mathbbm{w}^{\infty};\mathbbm{u}^{\infty}))\cdot\operatorname{dist}(\mathbf{0},\partial\Theta(\mathbbm{w}^{t};\mathbbm{u}^{t}))\geq 1$. Here, $\operatorname{dist}(\mathbf{0},\partial\Theta(\mathbbm{w}^{t};\mathbbm{u}^{t}))\triangleq\{\operatorname{dist}^{2}(\mathbf{0},\partial_{\mathbf{X}}\Theta(\mathbbm{w}^{t};\mathbbm{u}^{t}))+\operatorname{dist}^{2}(\mathbf{0},\partial_{\mathbf{X}^{-}}\Theta(\mathbbm{w}^{t};\mathbbm{u}^{t}))+\operatorname{dist}^{2}(\mathbf{0},\partial_{\mathbf{y}}\Theta(\mathbbm{w}^{t};\mathbbm{u}^{t}))+\operatorname{dist}^{2}(\mathbf{0},\partial_{\mathbf{z}}\Theta(\mathbbm{w}^{t};\mathbbm{u}^{t}))\}^{1/2}$.

Assumption 5.2.

The function $g(\mathbf{X})$ is $L_{g}$-smooth, i.e., $\|\nabla g(\mathbf{X})-\nabla g(\mathbf{X}')\|_{\mathsf{F}}\leq L_{g}\|\mathbf{X}-\mathbf{X}'\|_{\mathsf{F}}$ holds for all $\mathbf{X}\in\mathcal{M}$ and $\mathbf{X}'\in\mathcal{M}$.

Remark 5.3.

Semi-algebraic functions, including real polynomial functions, finite combinations, and indicator functions of semi-algebraic sets, commonly exhibit the KL property and find extensive use in applications Attouch et al. (2010).

We present the following lemma regarding subgradient bounds for each iteration.

Lemma 5.4.

(Proof in Section D.1, Subgradient Bounds) (a) For OADMM-EP, there exists a constant $K>0$ such that $\operatorname{dist}(\mathbf{0},\partial\Theta(\mathbbm{w}^{t};\mathbbm{u}^{t}))\leq\beta^{t}K(e^{t}+e^{t-1})$. (b) For OADMM-RR, there exists a constant $K>0$ such that $\operatorname{dist}(\mathbf{0},\partial\Theta(\mathbbm{w}^{t};\mathbbm{u}^{t}))\leq\beta^{t}Ke^{t}$.

Remark 5.5.

Lemma 5.4 differs significantly from prior work based on a constant penalty, owing to the crucial role played by the increasing penalty $\beta^{t}$.

The following theorem establishes a finite length property of OADMM.

Theorem 5.6.

(Proof in Section D.2, A Finite Length Property) We define $d^{t}\triangleq\sum_{i=t}^{\infty}e^{i+1}$ and $\varphi^{t}\triangleq\varphi(\Theta(\mathbbm{w}^{t};\mathbbm{u}^{t})-\Theta(\mathbbm{w}^{\infty};\mathbbm{u}^{\infty}))$, where $\varphi(\cdot)$ is the desingularization function defined in Assumption 5.1. (a) We have the following recursive inequality for both OADMM-EP and OADMM-RR: $(e^{t+1})^{2}\leq(e^{t}+e^{t-1})\cdot\dot{K}(\varphi^{t}-\varphi^{t+1})$, where $\dot{K}=\tfrac{3K}{\min(\varepsilon_{z},\varepsilon_{y},\varepsilon_{x})}$ and $K$ is defined in Lemma 5.4. (b) It holds for all $t\geq 1$ that $d^{t}\leq e^{t}+e^{t-1}+4\dot{K}\varphi^{t}$. The sequence $\{\mathbbm{w}^{t}\}_{t=1}^{\infty}$ has the finite length property, in that $d^{1}\leq e^{1}+e^{0}+4\dot{K}\varphi^{1}<+\infty$.

Remark 5.7.

The finite length property in Theorem 5.6 represents a much stronger convergence result than those outlined in Theorems 4.8 and 4.13.

We now prove a lemma demonstrating that the convergence of $d^{t}\triangleq\sum_{i=t}^{\infty}e^{i+1}$ is sufficient to establish the convergence of $\|\mathbf{X}^{t}-\mathbf{X}^{\infty}\|_{\mathsf{F}}$.

Lemma 5.8.

(Proof in Section D.3) We define $d^{t}\triangleq\sum_{i=t}^{\infty}e^{i+1}$. For both OADMM-EP and OADMM-RR, we have: (a) There exists a constant $\ddot{c}$ such that $\|\mathbf{X}^{t}-\mathbf{X}^{\infty}\|_{\mathsf{F}}\leq\ddot{c}\cdot d^{t}$. (b) We have $d^{t}\leq d^{t-2}-d^{t}+\ddot{K}[\beta^{t}(d^{t-2}-d^{t})]^{\frac{1-\tilde{\sigma}}{\tilde{\sigma}}}$, where $\ddot{K}\triangleq 4\dot{K}\tilde{c}\cdot[\tilde{c}(1-\tilde{\sigma})K]^{\frac{1-\tilde{\sigma}}{\tilde{\sigma}}}$.

Finally, we establish the convergence rate of OADMM by exploiting the KL exponent $\tilde{\sigma}$.

Theorem 5.9.

(Proof in Section D.4, Convergence Rate) We fix $p=1/3$. There exists $t'$ such that for all $t\geq t'$, we have:

(a) If $\tilde{\sigma}\in(\tfrac{1}{4},\tfrac{1}{2}]$, then $\|\mathbf{X}^{t}-\mathbf{X}^{\infty}\|_{\mathsf{F}}\leq\mathcal{O}(1/\exp(t^{1-u}))$, where $u=\frac{p(1-\tilde{\sigma})}{\tilde{\sigma}}\in[\tfrac{1}{3},1)$.

(b) If $\tilde{\sigma}\in(\tfrac{1}{2},1)$, then $\|\mathbf{X}^{t}-\mathbf{X}^{\infty}\|_{\mathsf{F}}\leq\mathcal{O}(1/t^{(1-p)/\tau})$, where $\tau=\frac{\tilde{\sigma}}{1-\tilde{\sigma}}-1\in(0,\infty)$.

Remark 5.10.

(i) To the best of our knowledge, Theorem 5.9 provides the first non-ergodic convergence rate for solving this class of nonconvex and nonsmooth problems in Problem (1). It is worth noting that the work of Li et al. (2023) establishes a non-ergodic convergence rate for subgradient methods with diminishing stepsizes by further exploring the KL exponent. (ii) Under the KL inequality assumption, with the desingularizing function chosen in the form $\varphi(s)\triangleq\tilde{c}s^{1-\tilde{\sigma}}$ with $\tilde{\sigma}\in(0,1)$, OADMM converges with a super-exponential rate when $\tilde{\sigma}\in(\tfrac{1}{4},\tfrac{1}{2}]$ and with a polynomial rate when $\tilde{\sigma}\in(\tfrac{1}{2},1)$ for the gap $\|\mathbf{X}^{t}-\mathbf{X}^{\infty}\|_{\mathsf{F}}$. Notably, super-exponential convergence is faster than polynomial convergence. (iii) Our result generalizes the classical findings of Attouch et al. (2010); Bolte et al. (2014), which characterize the convergence rate of proximal gradient methods for a specific class of nonconvex composite optimization problems.

6 Applications and Numerical Experiments

In this section, we assess the effectiveness of the proposed algorithm OADMM on the sparse PCA problem by comparing it against existing non-convex, non-smooth optimization algorithms.

▶ Application to Sparse PCA. Sparse PCA is a method for producing modified principal components with sparse loadings, which helps reduce model complexity and improve model interpretability Chen et al. (2016). It can be formulated as:

$$\min_{\mathbf{X}\in\mathbb{R}^{n\times r}}\tfrac{1}{2\dot{m}}\|\mathbf{X}\mathbf{X}^{\mathsf{T}}\mathbf{D}-\mathbf{D}\|_{\mathsf{F}}^{2}+\dot{\rho}(\|\mathbf{X}\|_{1}-\|\mathbf{X}\|_{[k]}),\quad\text{s.t.}~\mathbf{X}^{\mathsf{T}}\mathbf{X}=\mathbf{I}_{r},$$

where $\mathbf{D}\in\mathbb{R}^{n\times\dot{m}}$ is the data matrix, $\dot{m}$ is the number of data points, and $\|\mathbf{X}\|_{[k]}$ is the $\ell_{1}$ norm of the $k$ largest (in magnitude) elements of the matrix $\mathbf{X}$. Here, we consider the DC $\ell_{1}$-largest-$k$ function Gotoh et al. (2018) to induce sparsity in the solution. One advantage of this model is that when $\dot{\rho}$ is sufficiently large, we have $\|\mathbf{X}\|_{1}\approx\|\mathbf{X}\|_{[k]}$, leading to a $k$-sparse solution $\mathbf{X}$.
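A small sketch of this objective (assuming $\mathbf{D}$ is the data matrix and $k$ the target sparsity); note that the DC penalty $\|\mathbf{X}\|_{1}-\|\mathbf{X}\|_{[k]}$ is nonnegative and vanishes exactly when $\mathbf{X}$ has at most $k$ nonzero entries:

```python
import numpy as np

def sparse_pca_objective(X, D, rho_dot, k):
    m_dot = D.shape[1]
    fit = 0.5 / m_dot * np.linalg.norm(X @ X.T @ D - D, 'fro')**2
    absx = np.abs(X).ravel()
    # ||X||_[k]: the l1 norm of the k largest-magnitude entries of X.
    topk = np.sort(absx)[-k:].sum()
    return fit + rho_dot * (absx.sum() - topk)

n, r, m_dot, k = 10, 3, 50, 15
D = np.random.randn(n, m_dot)
X = np.linalg.qr(np.random.randn(n, r))[0]
print(sparse_pca_objective(X, D, rho_dot=50.0, k=k))
```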

▶ Compared Methods. We compare OADMM-EP and OADMM-RR against four state-of-the-art optimization algorithms: (i) RADMM: ADMM using Riemannian retraction with fixed, small stepsizes Li et al. (2022), tested with two different penalty parameters $\beta^{t}\in\{100,10000\}$ for all $t$, leading to two variants, RADMM-I and RADMM-II. (ii) SPGM-EP: Smoothing Proximal Gradient Method using Euclidean projection Böhm & Wright (2021). (iii) SPGM-RR: SPGM utilizing Riemannian retraction Beck & Rosset (2023). (iv) Sub-Grad: subgradient methods with Euclidean projection Davis & Drusvyatskiy (2019); Li et al. (2021).

▶ Experiment Settings. All methods are implemented in MATLAB on an Intel 2.6 GHz CPU with 64 GB RAM. For all retraction-based methods, we use only the polar decomposition-based retraction. We evaluate different regularization parameters $\dot{\rho}\in\{10,50,100,500,1000\}$. For OADMM, default parameters are used, with $\beta^{0}=10\dot{\rho}$ and corresponding values $\xi\in\{1,2,5,8,10\}$ for each $\dot{\rho}$. For simplicity, we omit the Barzilai-Borwein strategy and instead use a fixed constant $b^{t}=1$ for all iterations. All algorithms start from a common initial solution $\mathbf{X}^{0}$, generated from a standard normal distribution. Our code for reproducing the experiments is available in the supplemental material.

▶ Experiment Results. We report the objective values for different methods with varying parameters $\dot{\rho}$. The experimental results presented in Figures 1 and 2 reveal the following insights: (i) Sub-Grad essentially fails to solve this problem, as the subgradient is inaccurately estimated when the solution is sparse. (ii) SPGM-EP and SPGM-RR, which rely on a variable smoothing strategy, exhibit slower performance than the multiplier-based variable splitting methods. This observation aligns with the commonly accepted notion that primal-dual methods are generally more robust and faster than primal-only methods. (iii) The proposed OADMM-EP and OADMM-RR demonstrate similar results and generally achieve lower objective function values than the other methods.

7 Conclusions

This paper introduces OADMM, an Alternating Direction Method of Multipliers (ADMM) tailored for solving structured nonsmooth composite optimization problems under orthogonality constraints. OADMM integrates either a Nesterov extrapolation strategy or a Monotone Barzilai-Borwein (MBB) stepsize strategy to potentially accelerate primal convergence, complemented by an over-relaxation stepsize strategy for rapid dual convergence. We adjust the penalty and smoothing parameters at a controlled rate. Additionally, we develop a novel Lyapunov function to rigorously analyze the oracle complexity of OADMM and establish the first non-ergodic convergence rate for this method. Finally, numerical experiments show that our OADMM achieves state-of-the-art performance.

Figure 1: The convergence curves of the compared methods with $\dot{\rho}=50$. Panels (a)-(h): mnist-1500-500, mnist-2500-500, TDT2-1500-500, TDT2-2500-500, sector-1500-500, sector-2500-500, randn-1500-500, randn-2500-500. [Figures omitted.]
Figure 2: The convergence curves of the compared methods with $\dot{\rho}=500$. Panels (a)-(h): mnist-1500-500, mnist-2500-500, TDT2-1500-500, TDT2-2500-500, sector-1500-500, sector-2500-500, randn-1500-500, randn-2500-500. [Figures omitted.]

References

  • Abrudan et al. (2008) Traian E Abrudan, Jan Eriksson, and Visa Koivunen. Steepest descent algorithms for optimization under unitary matrix constraint. IEEE Transactions on Signal Processing, 56(3):1134–1147, 2008.
  • Absil & Malick (2012) P-A Absil and Jérôme Malick. Projection-like retractions on matrix manifolds. SIAM Journal on Optimization, 22(1):135–158, 2012.
  • Absil et al. (2008a) P-A Absil, Robert Mahony, and Rodolphe Sepulchre. Optimization algorithms on matrix manifolds. Princeton University Press, 2008a.
  • Absil et al. (2008b) Pierre-Antoine Absil, Robert E. Mahony, and Rodolphe Sepulchre. Optimization Algorithms on Matrix Manifolds. Princeton University Press, 2008b.
  • Attouch et al. (2010) Hedy Attouch, Jérôme Bolte, Patrick Redont, and Antoine Soubeyran. Proximal alternating minimization and projection methods for nonconvex problems: An approach based on the kurdyka-lojasiewicz inequality. Mathematics of Operations Research, 35(2):438–457, 2010.
  • Bansal et al. (2018) Nitin Bansal, Xiaohan Chen, and Zhangyang Wang. Can we gain more from orthogonality regularizations in training deep networks? Advances in Neural Information Processing Systems, 31, 2018.
  • Beck & Rosset (2023) Amir Beck and Israel Rosset. A dynamic smoothing technique for a class of nonsmooth optimization problems on manifolds. SIAM Journal on Optimization, 33(3):1473–1493, 2023.
  • Boţ & Nguyen (2020) Radu Ioan Boţ and Dang-Khoa Nguyen. The proximal alternating direction method of multipliers in the nonconvex setting: convergence analysis and rates. Mathematics of Operations Research, 45(2):682–712, 2020.
  • Boţ et al. (2019) Radu Ioan Boţ, Erno Robert Csetnek, and Dang-Khoa Nguyen. A proximal minimization algorithm for structured nonconvex and nonsmooth problems. SIAM Journal on Optimization, 29(2):1300–1328, 2019. doi: 10.1137/18M1190689.
  • Böhm & Wright (2021) Axel Böhm and Stephen J. Wright. Variable smoothing for weakly convex composite functions. Journal of Optimization Theory and Applications, 188(3):628–649, 2021.
  • Bolte et al. (2014) Jérôme Bolte, Shoham Sabach, and Marc Teboulle. Proximal alternating linearized minimization for nonconvex and nonsmooth problems. Mathematical Programming, 146(1-2):459–494, 2014.
  • Boumal et al. (2019) Nicolas Boumal, Pierre-Antoine Absil, and Coralia Cartis. Global rates of convergence for nonconvex optimization on manifolds. IMA Journal of Numerical Analysis, 39(1):1–33, 2019.
  • Chen et al. (2020) Shixiang Chen, Shiqian Ma, Anthony Man-Cho So, and Tong Zhang. Proximal gradient method for nonsmooth optimization over the stiefel manifold. SIAM Journal on Optimization, 30(1):210–239, 2020.
  • Chen et al. (2016) Weiqiang Chen, Hui Ji, and Yanfei You. An augmented lagrangian method for 1\ell_{1}-regularized optimization problems with orthogonality constraints. SIAM Journal on Scientific Computing, 38(4):B570–B592, 2016.
  • Chen (2012) Xiaojun Chen. Smoothing methods for nonsmooth, nonconvex minimization. Mathematical programming, 134(1):71–99, 2012.
  • Cho & Lee (2017) Minhyung Cho and Jaehyung Lee. Riemannian approach to batch normalization. Advances in Neural Information Processing Systems, 30, 2017.
  • Cogswell et al. (2016) Michael Cogswell, Faruk Ahmed, Ross B. Girshick, Larry Zitnick, and Dhruv Batra. Reducing overfitting in deep networks by decorrelating representations. In Yoshua Bengio and Yann LeCun (eds.), International Conference on Learning Representations (ICLR), 2016.
  • Davis & Drusvyatskiy (2019) Damek Davis and Dmitriy Drusvyatskiy. Stochastic model-based minimization of weakly convex functions. SIAM Journal on Optimization, 29(1):207–239, 2019.
  • Deng et al. (2024) Kangkang Deng, Jiang Hu, and Zaiwen Wen. Oracle complexity of augmented lagrangian methods for nonsmooth manifold optimization. arXiv preprint arXiv:2404.05121, 2024.
  • Drusvyatskiy & Paquette (2019) Dmitriy Drusvyatskiy and Courtney Paquette. Efficiency of minimizing compositions of convex functions and smooth maps. Mathematical Programming, 178:503–558, 2019.
  • Edelman et al. (1998) Alan Edelman, Tomás A. Arias, and Steven Thomas Smith. The geometry of algorithms with orthogonality constraints. SIAM Journal on Matrix Analysis and Applications, 20(2):303–353, 1998.
  • Ferreira & Oliveira (1998) O. P. Ferreira and P. R. Oliveira. Subgradient algorithm on riemannian manifolds. Journal of Optimization Theory and Applications, 97:93–104, 1998.
  • Gao et al. (2018) Bin Gao, Xin Liu, Xiaojun Chen, and Ya-xiang Yuan. A new first-order algorithmic framework for optimization problems with orthogonality constraints. SIAM Journal on Optimization, 28(1):302–332, 2018.
  • Gao et al. (2019) Bin Gao, Xin Liu, and Ya-xiang Yuan. Parallelizable algorithms for optimization problems with orthogonality constraints. SIAM Journal on Scientific Computing, 41(3):A1949–A1983, 2019.
  • Golub & Van Loan (2013) Gene H Golub and Charles F Van Loan. Matrix computations. Johns Hopkins University Press, 2013.
  • Gonçalves et al. (2017) Max LN Gonçalves, Jefferson G Melo, and Renato DC Monteiro. Convergence rate bounds for a proximal admm with over-relaxation stepsize parameter for solving nonconvex linearly constrained problems. arXiv preprint arXiv:1702.01850, 2017.
  • Gotoh et al. (2018) Jun-ya Gotoh, Akiko Takeda, and Katsuya Tono. Dc formulations and algorithms for sparse optimization problems. Mathematical Programming, 169(1):141–176, 2018.
  • He & Yuan (2012) Bingsheng He and Xiaoming Yuan. On the o(1/n) convergence rate of the douglas–rachford alternating direction method. SIAM Journal on Numerical Analysis, 50(2):700–709, 2012.
  • Huang & Gao (2023) Feihu Huang and Shangqian Gao. Gradient descent ascent for minimax problems on riemannian manifolds. IEEE Trans. Pattern Anal. Mach. Intell., 45(7):8466–8476, 2023.
  • Hwang et al. (2015) Seong Jae Hwang, Maxwell D. Collins, Sathya N. Ravi, Vamsi K. Ithapu, Nagesh Adluru, Sterling C. Johnson, and Vikas Singh. A projection free method for generalized eigenvalue problem with a nonsmooth regularizer. In IEEE International Conference on Computer Vision (ICCV), pp.  1841–1849, 2015.
  • Jiang & Dai (2015) Bo Jiang and Yu-Hong Dai. A framework of constraint preserving update schemes for optimization on stiefel manifold. Mathematical Programming, 153(2):535–575, 2015.
  • Jiang et al. (2022) Bo Jiang, Xiang Meng, Zaiwen Wen, and Xiaojun Chen. An exact penalty approach for optimization with nonnegative orthogonality constraints. Mathematical Programming, pp.  1–43, 2022.
  • Journée et al. (2010) Michel Journée, Yurii Nesterov, Peter Richtárik, and Rodolphe Sepulchre. Generalized power method for sparse principal component analysis. Journal of Machine Learning Research, 11(2):517–553, 2010.
  • Kovnatsky et al. (2016) Artiom Kovnatsky, Klaus Glashoff, and Michael M Bronstein. Madmm: a generic algorithm for non-smooth optimization on manifolds. In The European Conference on Computer Vision (ECCV), pp. 680–696. Springer, 2016.
  • Lai & Osher (2014) Rongjie Lai and Stanley Osher. A splitting method for orthogonality constrained problems. Journal of Scientific Computing, 58(2):431–449, 2014.
  • Li & Pong (2015) Guoyin Li and Ting Kei Pong. Global convergence of splitting methods for nonconvex composite optimization. SIAM Journal on Optimization, 25(4):2434–2460, 2015.
  • Li & Lin (2015) Huan Li and Zhouchen Lin. Accelerated proximal gradient methods for nonconvex programming. Advances in neural information processing systems, 28, 2015.
  • Li et al. (2022) Jiaxiang Li, Shiqian Ma, and Tejes Srivastava. A riemannian admm. arXiv preprint arXiv:2211.02163, 2022.
  • Li et al. (2016) Min Li, Defeng Sun, and Kim-Chuan Toh. A majorized admm with indefinite proximal terms for linearly constrained convex composite optimization. SIAM Journal on Optimization, 26(2):922–950, 2016.
  • Li et al. (2021) Xiao Li, Shixiang Chen, Zengde Deng, Qing Qu, Zhihui Zhu, and Anthony Man-Cho So. Weakly convex optimization over stiefel manifold using riemannian subgradient-type methods. SIAM Journal on Optimization, 31(3):1605–1634, 2021.
  • Li et al. (2023) Xiao Li, Andre Milzarek, and Junwen Qiu. Convergence of random reshuffling under the kurdyka–łojasiewicz inequality. SIAM Journal on Optimization, 33(2):1092–1120, 2023.
  • Liu et al. (2016) Huikang Liu, Weijie Wu, and Anthony Man-Cho So. Quadratic optimization with orthogonality constraints: Explicit lojasiewicz exponent and linear convergence of line-search methods. In International Conference on Machine Learning (ICML), pp. 1158–1167, 2016.
  • Lu & Zhang (2012) Zhaosong Lu and Yong Zhang. An augmented lagrangian approach for sparse principal component analysis. Mathematical Programming, 135(1-2):149–193, 2012.
  • Mordukhovich (2006) Boris S. Mordukhovich. Variational analysis and generalized differentiation I: Basic theory. Springer, Berlin, 330, 2006.
  • Nesterov (2003) Y. E. Nesterov. Introductory lectures on convex optimization: a basic course, volume 87 of Applied Optimization. Kluwer Academic Publishers, 2003.
  • Rockafellar & Wets (2009) R. Tyrrell Rockafellar and Roger J-B. Wets. Variational analysis. Springer Science & Business Media, 317, 2009.
  • Selvan et al. (2015) S Easter Selvan, S Thomas George, and R Balakrishnan. Range-based ica using a nonsmooth quasi-newton optimizer for electroencephalographic source localization in focal epilepsy. Neural computation, 27(3):628–671, 2015.
  • Wang et al. (2019) Yu Wang, Wotao Yin, and Jinshan Zeng. Global convergence of admm in nonconvex nonsmooth optimization. Journal of Scientific Computing, 78(1):29–63, 2019.
  • Wen & Yin (2013) Zaiwen Wen and Wotao Yin. A feasible method for optimization with orthogonality constraints. Mathematical Programming, 142(1):397–434, 2013.
  • Xiao et al. (2022) Nachuan Xiao, Xin Liu, and Ya-Xiang Yuan. A class of smooth exact penalty function methods for optimization problems with orthogonality constraints. Optimization Methods and Software, 37(4):1205–1241, 2022.
  • Xie et al. (2017) Di Xie, Jiang Xiong, and Shiliang Pu. All you need is beyond a good init: Exploring better solution for training extremely deep convolutional neural networks with orthonormality and modulation. In IEEE Conference on Computer Vision and Pattern Recognition, pp.  6176–6185, 2017.
  • Yang et al. (2017) Lei Yang, Ting Kei Pong, and Xiaojun Chen. Alternating direction method of multipliers for a class of nonconvex and nonsmooth problems with applications to background/foreground extraction. SIAM Journal on Imaging Sciences, 10(1):74–110, 2017.
  • Yuan (2023) Ganzhao Yuan. A block coordinate descent method for nonsmooth composite optimization under orthogonality constraints. arXiv preprint arXiv:2304.03641, 2023.
  • Yuan (2024) Ganzhao Yuan. Admm for nonconvex optimization under minimal continuity assumption. arXiv preprint, 2024.
  • Zeng et al. (2022) Jinshan Zeng, Wotao Yin, and Ding-Xuan Zhou. Moreau envelope augmented lagrangian method for nonconvex optimization with linear constraints. Journal of Scientific Computing, 91(2):61, 2022.
  • Zhai et al. (2020) Yuexiang Zhai, Zitong Yang, Zhenyu Liao, John Wright, and Yi Ma. Complete dictionary learning via l4-norm maximization over the orthogonal group. Journal of Machine Learning Research, 21(165):1–68, 2020.
  • Zhang et al. (2020a) Jiawei Zhang, Peijun Xiao, Ruoyu Sun, and Zhi-Quan Luo. A single-loop smoothed gradient descent-ascent algorithm for nonconvex-concave min-max problems. In Hugo Larochelle, Marc’Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin (eds.), Advances in Neural Information Processing Systems, 2020a.
  • Zhang et al. (2020b) Junyu Zhang, Shiqian Ma, and Shuzhong Zhang. Primal-dual optimization algorithms over riemannian manifolds: an iteration complexity analysis. Mathematical Programming, 184(1):445–490, 2020b.

Appendix

The appendix is organized as follows.

Appendix A provides notations, technical preliminaries, and relevant lemmas.

Appendix B contains the proofs for Section 2.

Appendix C includes the proofs for Section 4.

Appendix D encompasses the proofs for Section 5.

Appendix E presents additional experimental details and results.

Appendix A Notations, Technical Preliminaries, and Relevant Lemmas

A.1 Notations

In this paper, lowercase boldface letters signify vectors, while uppercase letters denote real-valued matrices. The following notations are utilized throughout this paper.

  • [n][n]: {1,2,,n}\{1,2,...,n\}

  • 𝐱\|\mathbf{x}\|: Euclidean norm: 𝐱=𝐱2=𝐱,𝐱\|\mathbf{x}\|=\|\mathbf{x}\|_{2}=\sqrt{\langle\mathbf{x},\mathbf{x}\rangle}

  • 𝐗𝖳\mathbf{X}^{\mathsf{T}} : the transpose of the matrix 𝐗\mathbf{X}

  • \mathbf{0}_{n,r}: A zero matrix of size n\times r; the subscript is sometimes omitted

  • 𝐈r\mathbf{I}_{r} : 𝐈rr×r\mathbf{I}_{r}\in\mathbb{R}^{r\times r}, Identity matrix

  • \mathcal{M}: Orthogonality constraint set (a.k.a. the Stiefel manifold): \mathcal{M}=\{\mathbf{X}\in\mathbb{R}^{n\times r}\,|\,\mathbf{X}^{\mathsf{T}}\mathbf{X}=\mathbf{I}_{r}\}

  • \mathbf{X}\succeq\mathbf{0} (or \succ\mathbf{0}): the matrix \mathbf{X} is symmetric positive semidefinite (or definite)

  • \operatorname{tr}(\mathbf{A}): Sum of the elements on the main diagonal of \mathbf{A}: \operatorname{tr}(\mathbf{A})=\sum_{i}\mathbf{A}_{i,i}

  • 𝐗\|\mathbf{X}\| : Operator/Spectral norm: the largest singular value of 𝐗\mathbf{X}

  • 𝐗𝖥\|\mathbf{X}\|_{\mathsf{F}} : Frobenius norm: (ij𝐗ij2)1/2(\sum_{ij}{\mathbf{X}_{ij}^{2}})^{1/2}

  • \|\mathbf{X}\|_{1}: Absolute sum of the elements of \mathbf{X}, with \|\mathbf{X}\|_{1}=\sum_{ij}|\mathbf{X}_{ij}|

  • \|\mathbf{X}\|_{[k]}: \ell_{1} norm of the k largest (in magnitude) elements of the matrix \mathbf{X}

  • g(𝐗)\partial g(\mathbf{X}) : (limiting) Euclidean subdifferential of g(𝐗)g(\mathbf{X}) at 𝐗\mathbf{X}

  • \operatorname{Proj}_{\Xi}(\mathbf{X}^{\prime}): Orthogonal projection of \mathbf{X}^{\prime} onto \Xi, with \operatorname{Proj}_{\Xi}(\mathbf{X}^{\prime})=\arg\min_{\mathbf{X}\in\Xi}\|\mathbf{X}^{\prime}-\mathbf{X}\|_{\mathsf{F}}^{2}

  • dist(Ξ,Ξ)\text{dist}(\Xi,\Xi^{\prime}) : the distance between two sets with dist(Ξ,Ξ)inf𝐗Ξ,𝐗Ξ𝐗𝐗𝖥\text{dist}(\Xi,\Xi^{\prime})\triangleq\inf_{\mathbf{X}\in\Xi,\mathbf{X}^{\prime}\in\Xi^{\prime}}\|\mathbf{X}-\mathbf{X}^{\prime}\|_{\mathsf{F}}

  • g(𝐗)𝖥\|\partial g(\mathbf{X})\|_{\mathsf{F}}: g(𝐗)𝖥=inf𝐘g(𝐗)𝐘𝖥=dist(𝟎,g(𝐗))\|\partial g(\mathbf{X})\|_{\mathsf{F}}=\inf_{\mathbf{Y}\in\partial g(\mathbf{X})}\|\mathbf{Y}\|_{\mathsf{F}}=\operatorname{dist}(\mathbf{0},\partial g(\mathbf{X})).

  • (βt)\ell(\beta^{t}): the smoothness parameter of the function 𝒮(𝐗,𝐲t;𝐳t;βt)\mathcal{S}(\mathbf{X},\mathbf{y}^{t};\mathbf{z}^{t};\beta^{t}) w.r.t. 𝐗\mathbf{X}.

  • (𝐱)\mathcal{I}_{\mathcal{M}}(\mathbf{\mathbf{x}}) : Indicator function of \mathcal{M} with (𝐱)=0\mathcal{I}_{\mathcal{M}}(\mathbf{\mathbf{x}})=0 if 𝐱\mathbf{\mathbf{x}}\in\mathcal{M} and otherwise ++\infty.

We employ the following parameters in Algorithm 1.

  • θ\theta: proximal parameter

  • \chi: coupling constant between \mu^{t} and \beta^{t}, such that \mu^{t}\beta^{t}=\chi

  • σ\sigma: over-relaxation parameter with σ[1,2)\sigma\in[1,2)

  • α\alpha: Nesterov extrapolation parameter with α[0,1)\alpha\in[0,1)

  • ρ\rho: search descent parameter with ρ(0,)\rho\in(0,\infty)

  • γ\gamma: decay rate parameter in the line search procedure with γ(0,1)\gamma\in(0,1)

  • δ\delta: sufficient decrease parameter in the line search procedure with δ(0,)\delta\in(0,\infty)

  • pp: exponent parameter used in the penalty update rule with p(0,1)p\in(0,1)

  • ξ\xi: growth factor parameter used in the penalty update rule with ξ(0,)\xi\in(0,\infty)

A.2 Technical Preliminaries

Non-convex Non-smooth Optimization. Given the potential non-convexity and non-smoothness of the function F(\cdot), we introduce tools from non-smooth analysis Mordukhovich (2006); Rockafellar & Wets (2009). The domain of any extended real-valued function F:\mathbb{R}^{n\times r}\rightarrow(-\infty,+\infty] is defined as \text{dom}(F)\triangleq\{\mathbf{X}\in\mathbb{R}^{n\times r}:|F(\mathbf{X})|<+\infty\}. At \mathbf{X}\in\text{dom}(F), the Fréchet subdifferential of F is defined as \hat{\partial}F(\mathbf{X})\triangleq\{\bm{\xi}\in\mathbb{R}^{n\times r}:\liminf_{\mathbf{Z}\rightarrow\mathbf{X},\,\mathbf{Z}\neq\mathbf{X}}\frac{F(\mathbf{Z})-F(\mathbf{X})-\langle\bm{\xi},\mathbf{Z}-\mathbf{X}\rangle}{\|\mathbf{Z}-\mathbf{X}\|_{\mathsf{F}}}\geq 0\}, while the limiting subdifferential of F at \mathbf{X}\in\text{dom}(F) is denoted as \partial F(\mathbf{X})\triangleq\{\bm{\xi}\in\mathbb{R}^{n\times r}:\exists\,\mathbf{X}^{t}\rightarrow\mathbf{X},\,F(\mathbf{X}^{t})\rightarrow F(\mathbf{X}),\,\hat{\partial}F(\mathbf{X}^{t})\ni\bm{\xi}^{t}\rightarrow\bm{\xi}\}. The gradient of F(\cdot) at \mathbf{X} in the Euclidean space is denoted as \nabla F(\mathbf{X}). The following relations hold among \hat{\partial}F(\mathbf{X}), \partial F(\mathbf{X}), and \nabla F(\mathbf{X}): (i) \hat{\partial}F(\mathbf{X})\subseteq\partial F(\mathbf{X}). (ii) If the function F(\cdot) is convex, then \partial F(\mathbf{X}) and \hat{\partial}F(\mathbf{X}) coincide with the classical subdifferential for convex functions, i.e., \partial F(\mathbf{X})=\hat{\partial}F(\mathbf{X})=\{\bm{\xi}\in\mathbb{R}^{n\times r}:F(\mathbf{Z})\geq F(\mathbf{X})+\langle\bm{\xi},\mathbf{Z}-\mathbf{X}\rangle,\,\forall\mathbf{Z}\in\mathbb{R}^{n\times r}\}. (iii) If the function F(\cdot) is differentiable, then \hat{\partial}F(\mathbf{X})=\partial F(\mathbf{X})=\{\nabla F(\mathbf{X})\}.

Optimization with Orthogonality Constraints. We introduce some prior knowledge of optimization involving orthogonality constraints Absil et al. (2008b). The nearest orthogonal matrix to an arbitrary matrix \mathbf{Y}\in\mathbb{R}^{n\times r} is given by \mathbb{P}_{\mathcal{M}}(\mathbf{Y})=\breve{\mathbf{U}}\breve{\mathbf{V}}^{\mathsf{T}}, where \mathbf{Y}=\breve{\mathbf{U}}{\rm{Diag}}(\mathbf{s})\breve{\mathbf{V}}^{\mathsf{T}} is the singular value decomposition of \mathbf{Y}. We use \mathcal{N}_{\mathcal{M}}(\mathbf{X}) to denote the limiting normal cone to \mathcal{M} at \mathbf{X}, defined as \mathcal{N}_{\mathcal{M}}(\mathbf{X})=\partial\mathcal{I}_{\mathcal{M}}(\mathbf{X})=\{\mathbf{Z}\in\mathbb{R}^{n\times r}:\langle\mathbf{Z},\mathbf{X}\rangle\geq\langle\mathbf{Z},\mathbf{Y}\rangle,\,\forall\mathbf{Y}\in\mathcal{M}\}. Moreover, the tangent and normal spaces to \mathcal{M} at \mathbf{X}\in\mathcal{M} are denoted as \mathrm{T}_{\mathbf{X}}\mathcal{M} and \mathrm{N}_{\mathbf{X}}\mathcal{M}, respectively. We have: \mathrm{T}_{\mathbf{X}}\mathcal{M}=\{\mathbf{Y}\in\mathbb{R}^{n\times r}\,|\,\mathcal{A}_{\mathbf{X}}(\mathbf{Y})=\mathbf{0}\} and \mathrm{N}_{\mathbf{X}}\mathcal{M}=\{2\mathbf{X}\mathbf{\Lambda}\,|\,\mathbf{\Lambda}=\mathbf{\Lambda}^{\mathsf{T}},\,\mathbf{\Lambda}\in\mathbb{R}^{r\times r}\}, where \mathcal{A}_{\mathbf{X}}(\mathbf{Y})\triangleq\mathbf{X}^{\mathsf{T}}\mathbf{Y}+\mathbf{Y}^{\mathsf{T}}\mathbf{X} for \mathbf{Y}\in\mathbb{R}^{n\times r} and \mathbf{X}\in\mathcal{M}.
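For concreteness, the following minimal sketch (assuming Python with NumPy; it is not part of the paper's algorithms) computes the nearest orthogonal matrix \mathbb{P}_{\mathcal{M}}(\mathbf{Y}) via the SVD and the tangent-space projection \operatorname{Proj}_{\mathrm{T}_{\mathbf{X}}\mathcal{M}}(\bm{\Delta})=\bm{\Delta}-\tfrac{1}{2}\mathbf{X}(\bm{\Delta}^{\mathsf{T}}\mathbf{X}+\mathbf{X}^{\mathsf{T}}\bm{\Delta}), and verifies the defining conditions numerically.

```python
import numpy as np

n, r = 8, 3
rng = np.random.default_rng(0)

# Nearest orthogonal matrix: P_M(Y) = U V^T, where Y = U diag(s) V^T is the SVD.
Y = rng.standard_normal((n, r))
U, s, Vt = np.linalg.svd(Y, full_matrices=False)
X = U @ Vt
assert np.allclose(X.T @ X, np.eye(r))        # X lies on the Stiefel manifold

# Tangent-space projection: Proj_{T_X M}(D) = D - 0.5 * X (D^T X + X^T D).
D = rng.standard_normal((n, r))
P = D - 0.5 * X @ (D.T @ X + X.T @ D)

# Tangent condition A_X(P) = X^T P + P^T X = 0.
print(np.abs(X.T @ P + P.T @ X).max())        # ~1e-16
```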

Weakly Convex Functions. The function h(𝐲)h(\mathbf{y}) is weakly convex if there exists a constant Wh0W_{h}\geq 0 such that h(𝐲)+12Wh𝐲22h(\mathbf{y})+\tfrac{1}{2}W_{h}\|\mathbf{y}\|_{2}^{2} is convex; the smallest such WhW_{h} is termed the modulus of weak convexity. Weakly convex functions encompass a diverse range, including convex functions, differentiable functions with Lipschitz continuous gradient, and compositions of convex, Lipschitz-continuous functions with C1C^{1}-smooth mappings having Lipschitz continuous Jacobians Drusvyatskiy & Paquette (2019).

A.3 Relevant Lemmas

Lemma A.1.

Let 𝐚,𝐛n\mathbf{a},\mathbf{b}\in\mathbb{R}^{n}, and α0\alpha\geq 0 be any constant. We have: 𝐚α𝐛22(α1)𝐚22(α2α)𝐛22-\|\mathbf{a}-\alpha\mathbf{b}\|_{2}^{2}\leq(\alpha-1)\|\mathbf{a}\|_{2}^{2}-(\alpha^{2}-\alpha)\|\mathbf{b}\|_{2}^{2}.

Proof.

We have: 𝐚α𝐛22=𝐚22α𝐛22+2α𝐚,𝐛𝐚22α𝐛22+2α(12𝐚22+12𝐛22)=(α1)𝐚22(α2α)𝐛22-\|\mathbf{a}-\alpha\mathbf{b}\|_{2}^{2}=-\|\mathbf{a}\|_{2}^{2}-\|\alpha\mathbf{b}\|_{2}^{2}+2\alpha\langle\mathbf{a},\mathbf{b}\rangle\leq-\|\mathbf{a}\|_{2}^{2}-\|\alpha\mathbf{b}\|_{2}^{2}+2\alpha\cdot(\tfrac{1}{2}\|\mathbf{a}\|_{2}^{2}+\tfrac{1}{2}\|\mathbf{b}\|_{2}^{2})=(\alpha-1)\|\mathbf{a}\|_{2}^{2}-(\alpha^{2}-\alpha)\|\mathbf{b}\|_{2}^{2}. ∎
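As a quick numerical sanity check of this inequality (a minimal sketch, assuming Python with NumPy; the dimensions and ranges below are arbitrary test choices):

```python
import numpy as np

rng = np.random.default_rng(1)
ok = True
for _ in range(1000):
    a, b = rng.standard_normal(5), rng.standard_normal(5)
    alpha = rng.uniform(0.0, 10.0)            # any alpha >= 0
    lhs = -np.sum((a - alpha * b) ** 2)
    rhs = (alpha - 1) * np.sum(a ** 2) - (alpha ** 2 - alpha) * np.sum(b ** 2)
    ok &= lhs <= rhs + 1e-10                  # tolerance for floating-point error
print(ok)   # True
```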

Lemma A.2.

Assume p(0,1)p\in(0,1), and t1t\geq 1 is an integer. We have: tp(t1)p1+(t1)p1t\tfrac{t^{p}-(t-1)^{p}}{1+(t-1)^{p}}\leq\tfrac{1}{t}.

Proof.

Part (a). When t=1t=1, the conclusion of this lemma is clearly satisfied.

Part (b). Now, we assume t2t\geq 2 and p(0,1)p\in(0,1). First, using the convexity of f(t)tpf(t)\triangleq-t^{p}, we obtain tp(t1)pp(t1)p1<(t1)p1t^{p}-(t-1)^{p}\leq p(t-1)^{p-1}<(t-1)^{p-1}. It remains to prove that (t1)p1t1(t1)p(t-1)^{p-1}t-1\leq(t-1)^{p} for all t2t\geq 2. Second, we derive the following chain of inequalities: 1(t1)1pt(t1)(t1)1ptt111(t1)p(t1)p1t1(t1)p1\leq(t-1)^{1-p}\Rightarrow t-(t-1)\leq(t-1)^{1-p}\Rightarrow\tfrac{t}{t-1}-1\leq\tfrac{1}{(t-1)^{p}}\Rightarrow(t-1)^{p-1}t-1\leq(t-1)^{p}.

Lemma A.3.

Let βt=β0(1+ξtp)\beta^{t}=\beta^{0}(1+\xi t^{p}), where t0t\geq 0, β0>0\beta^{0}>0, ξ,p(0,1)\xi,p\in(0,1). For all t1t\geq 1, we have: (βtβt11)22t2t+1(\tfrac{\beta^{t}}{\beta^{t-1}}-1)^{2}\leq\frac{2}{t}-\frac{2}{t+1}.

Proof.

We derive: (βtβt11)2=(1+ξtp1+ξ(t1)p1)2=(ξtpξ(t1)p1+ξ(t1)p)2(tp(t1)p1+(t1)p)2(1t)22t2t+1(\tfrac{\beta^{t}}{\beta^{t-1}}-1)^{2}\overset{\text{\char 172}}{=}\textstyle(\tfrac{1+\xi t^{p}}{1+\xi(t-1)^{p}}-1)^{2}=(\tfrac{\xi t^{p}-\xi(t-1)^{p}}{1+\xi(t-1)^{p}})^{2}\overset{\text{\char 173}}{\leq}\textstyle(\tfrac{t^{p}-(t-1)^{p}}{1+(t-1)^{p}})^{2}\overset{\text{\char 174}}{\leq}\textstyle(\tfrac{1}{t})^{2}\overset{\text{\char 175}}{\leq}\textstyle\frac{2}{t}-\frac{2}{t+1}, where step uses βt=β0(1+ξtp)\beta^{t}=\beta^{0}(1+\xi t^{p}); step uses ξ1+ξa<11+a\tfrac{\xi}{1+\xi a}<\tfrac{1}{1+a} for all a0a\geq 0 when ξ(0,1)\xi\in(0,1); step uses Lemma A.2; step uses the fact that 1t22t2t+1\frac{1}{t^{2}}\leq\tfrac{2}{t}-\tfrac{2}{t+1} for all t1t\geq 1.
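Since Lemma A.3 governs the penalty schedule \beta^{t}=\beta^{0}(1+\xi t^{p}), the bound can also be checked directly over a long horizon (a minimal sketch, assuming Python with NumPy; the values of \beta^{0}, \xi, p are arbitrary test choices):

```python
import numpy as np

beta0, xi, p = 1.0, 0.5, 0.4                 # arbitrary test values, xi, p in (0, 1)
t = np.arange(1, 100001, dtype=float)
beta = beta0 * (1.0 + xi * t ** p)
beta_prev = beta0 * (1.0 + xi * (t - 1.0) ** p)
lhs = (beta / beta_prev - 1.0) ** 2
rhs = 2.0 / t - 2.0 / (t + 1.0)
print(bool(np.all(lhs <= rhs)))              # True
```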

Lemma A.4.

Assume 𝐚+=ϱ𝐚+𝐛\mathbf{a}^{+}=\varrho\mathbf{a}+\mathbf{b}, where 𝐚,𝐛,𝐚+m\mathbf{a},\mathbf{b},\mathbf{a}^{+}\in\mathbb{R}^{m}, and ϱ[0,1)\varrho\in[0,1). We have: 𝐚+22ϱ1ϱ(𝐚22𝐚+22)+1(1ϱ)2𝐛22\|\mathbf{a}^{+}\|_{2}^{2}\leq\tfrac{\varrho}{1-\varrho}(\|\mathbf{a}\|_{2}^{2}-\|\mathbf{a}^{+}\|_{2}^{2})+\tfrac{1}{(1-\varrho)^{2}}\|\mathbf{b}\|_{2}^{2}.

Proof.

We have: \|\mathbf{a}^{+}\|_{2}^{2}=\|\varrho\mathbf{a}+\mathbf{b}\|_{2}^{2}=\|\varrho\mathbf{a}+(1-\varrho)\cdot\tfrac{\mathbf{b}}{1-\varrho}\|_{2}^{2}\leq\varrho\|\mathbf{a}\|_{2}^{2}+(1-\varrho)\cdot\|\tfrac{\mathbf{b}}{1-\varrho}\|_{2}^{2}=\varrho\|\mathbf{a}\|_{2}^{2}+\tfrac{1}{1-\varrho}\|\mathbf{b}\|_{2}^{2}, where the inequality holds due to the convexity of \|\cdot\|_{2}^{2}. Subtracting \varrho\|\mathbf{a}^{+}\|_{2}^{2} from both sides gives (1-\varrho)\|\mathbf{a}^{+}\|_{2}^{2}\leq\varrho(\|\mathbf{a}\|_{2}^{2}-\|\mathbf{a}^{+}\|_{2}^{2})+\tfrac{1}{1-\varrho}\|\mathbf{b}\|_{2}^{2}; dividing both sides by 1-\varrho yields the claimed inequality. ∎

Lemma A.5.

Assume that 𝐚tϱ𝐚t1+c\mathbf{a}^{t}\leq\varrho\mathbf{a}^{t-1}+c, where ϱ[0,1)\varrho\in[0,1), c0c\geq 0, and {𝐚i}i=0\{\mathbf{a}^{i}\}_{i=0}^{\infty} is a non-negative sequence. We have: 𝐚t𝐚0+c1ϱ\mathbf{a}^{t}\leq\mathbf{a}^{0}+\tfrac{c}{1-\varrho} for all t0t\geq 0.

Proof.

Using basic induction, we have the following results:

t=1,\displaystyle\textstyle t=1, 𝐚1ϱ𝐚0+c\displaystyle\textstyle\mathbf{a}^{1}\leq\varrho\mathbf{a}^{0}+c
t=2,\displaystyle\textstyle t=2, 𝐚2ϱ𝐚1+cϱ(ϱ𝐚0+c)+c=ϱ2𝐚0+c(1+ϱ)\displaystyle\textstyle\mathbf{a}^{2}\leq\varrho\mathbf{a}^{1}+c\leq\varrho(\varrho\mathbf{a}^{0}+c)+c=\varrho^{2}\mathbf{a}^{0}+c(1+\varrho)
t=3,\displaystyle\textstyle t=3, 𝐚3ϱ𝐚2+cϱ(ϱ2𝐚0+(c+ϱc))+c=ϱ3𝐚0+c(1+ϱ+ϱ2)\displaystyle\textstyle\mathbf{a}^{3}\leq\varrho\mathbf{a}^{2}+c\leq\varrho(\varrho^{2}\mathbf{a}^{0}+(c+\varrho c))+c=\varrho^{3}\mathbf{a}^{0}+c(1+\varrho+\varrho^{2})
\displaystyle\textstyle...
t=n,\displaystyle\textstyle t=n, 𝐚nϱ𝐚n1+cϱn𝐚0+c(1+ϱ++ϱn1).\displaystyle\textstyle\mathbf{a}^{n}\leq\varrho\mathbf{a}^{n-1}+c\leq\varrho^{n}\mathbf{a}^{0}+c\cdot(1+\varrho+\ldots+\varrho^{n-1}).

Therefore, we obtain: \mathbf{a}^{n}\leq\varrho^{n}\mathbf{a}^{0}+c\cdot(1+\varrho+\ldots+\varrho^{n-1})\overset{\text{\char 172}}{\leq}\mathbf{a}^{0}+\frac{c}{1-\varrho}, where step ① uses \varrho^{n}\leq 1, and the summation formula of geometric sequences: 1+\varrho+\varrho^{2}+\ldots+\varrho^{n-1}=\frac{1-\varrho^{n}}{1-\varrho}<\frac{1}{1-\varrho}. ∎

Lemma A.6.

Assume 𝐗𝖼t=𝐗t+α(𝐗t𝐗t1)\mathbf{X}_{{\sf c}}^{t}=\mathbf{X}^{t}+\alpha(\mathbf{X}^{t}-\mathbf{X}^{t-1}), where α[0,1)\alpha\in[0,1), and 𝐗t,𝐗t1\mathbf{X}^{t},\mathbf{X}^{t-1}\in\mathcal{M}. We have:

(a) 𝐗t𝐗𝖼t𝖥𝐗t𝐗t1𝖥\|\mathbf{X}^{t}-\mathbf{X}^{t}_{{\sf c}}\|_{\mathsf{F}}\leq\|\mathbf{X}^{t}-\mathbf{X}^{t-1}\|_{\mathsf{F}}.

(b) 𝐗t+1𝐗𝖼t𝖥𝐗t+1𝐗t𝖥+𝐗t𝐗t1𝖥\|\mathbf{X}^{t+1}-\mathbf{X}^{t}_{{\sf c}}\|_{\mathsf{F}}\leq\|\mathbf{X}^{t+1}-\mathbf{X}^{t}\|_{\mathsf{F}}+\|\mathbf{X}^{t}-\mathbf{X}^{t-1}\|_{\mathsf{F}}.

(c) 𝒜(𝐗𝖼t)𝐲t𝒜(𝐗t)𝐲t+A¯𝐗t𝐗t1𝖥\|\mathcal{A}(\mathbf{X}_{{\sf c}}^{t})-\mathbf{y}^{t}\|\leq\|\mathcal{A}(\mathbf{X}^{t})-\mathbf{y}^{t}\|+\overline{\rm{A}}\|\mathbf{X}^{t}-\mathbf{X}^{t-1}\|_{\mathsf{F}}.

Proof.

Part (a). We have: 𝐗t𝐗𝖼t𝖥=α𝐗t𝐗t1𝖥𝐗t𝐗t1𝖥\|\mathbf{X}^{t}-\mathbf{X}^{t}_{{\sf c}}\|_{\mathsf{F}}\overset{\text{\char 172}}{=}\alpha\|\mathbf{X}^{t}-\mathbf{X}^{t-1}\|_{\mathsf{F}}\overset{\text{\char 173}}{\leq}\|\mathbf{X}^{t}-\mathbf{X}^{t-1}\|_{\mathsf{F}}, where step uses 𝐗𝖼t=𝐗t+α(𝐗t𝐗t1)\mathbf{X}^{t}_{{\sf c}}=\mathbf{X}^{t}+\alpha(\mathbf{X}^{t}-\mathbf{X}^{t-1}); step uses α[0,1)\alpha\in[0,1).

Part (b). We have: 𝐗t+1𝐗𝖼t𝖥=𝐗t+1𝐗tα(𝐗t𝐗t1)𝖥𝐗t+1𝐗t𝖥+𝐗t𝐗t1𝖥\|\mathbf{X}^{t+1}-\mathbf{X}^{t}_{{\sf c}}\|_{\mathsf{F}}\overset{\text{\char 172}}{=}\|\mathbf{X}^{t+1}-\mathbf{X}^{t}-\alpha(\mathbf{X}^{t}-\mathbf{X}^{t-1})\|_{\mathsf{F}}\overset{\text{\char 173}}{\leq}\|\mathbf{X}^{t+1}-\mathbf{X}^{t}\|_{\mathsf{F}}+\|\mathbf{X}^{t}-\mathbf{X}^{t-1}\|_{\mathsf{F}}, where step uses 𝐗𝖼t=𝐗t+α(𝐗t𝐗t1)\mathbf{X}^{t}_{{\sf c}}=\mathbf{X}^{t}+\alpha(\mathbf{X}^{t}-\mathbf{X}^{t-1}); step uses the triangle inequality and α[0,1)\alpha\in[0,1).

Part (c). We have: \|\mathcal{A}(\mathbf{X}^{t}_{{\sf c}})-\mathbf{y}^{t}\|\overset{\text{\char 172}}{\leq}\|\mathcal{A}(\mathbf{X}^{t})-\mathbf{y}^{t}\|+\|\mathcal{A}(\mathbf{X}^{t})-\mathcal{A}(\mathbf{X}^{t}_{{\sf c}})\|\leq\|\mathcal{A}(\mathbf{X}^{t})-\mathbf{y}^{t}\|+\overline{\rm{A}}\|\mathbf{X}^{t}-\mathbf{X}^{t}_{{\sf c}}\|_{\mathsf{F}}\overset{\text{\char 173}}{\leq}\|\mathcal{A}(\mathbf{X}^{t})-\mathbf{y}^{t}\|+\overline{\rm{A}}\|\mathbf{X}^{t}-\mathbf{X}^{t-1}\|_{\mathsf{F}}, where step ① uses the triangle inequality; the second inequality uses the \overline{\rm{A}}-Lipschitz continuity of the linear mapping \mathcal{A}(\cdot); step ② uses Claim (a) of this lemma. ∎

Lemma A.7.

Let 𝐏,𝐏~n×r\mathbf{P},\tilde{\mathbf{P}}\in\mathbb{R}^{n\times r}, and 𝐗,𝐗~\mathbf{X},\tilde{\mathbf{X}}\in\mathcal{M}. We have:

Proj𝐓𝐗(𝐏)Proj𝐓𝐗~(𝐏~)𝖥2𝐏𝐏~𝖥+2r𝐏𝐗𝐗~𝖥.\displaystyle\|\operatorname{Proj}_{\mathbf{T}_{\mathbf{X}}\mathcal{M}}(\mathbf{P})-\operatorname{Proj}_{\mathbf{T}_{\tilde{\mathbf{X}}}\mathcal{M}}(\tilde{\mathbf{P}})\|_{\mathsf{F}}\leq 2\|\mathbf{P}-\tilde{\mathbf{P}}\|_{\mathsf{F}}+2\sqrt{r}\|\mathbf{P}\|\|\mathbf{X}-\tilde{\mathbf{X}}\|_{\mathsf{F}}.
Proof.

First, we obtain:

𝐗𝐏𝖳𝐗𝐗~𝐏~𝖳𝐗~𝖥\displaystyle\|\mathbf{X}\mathbf{P}^{\mathsf{T}}\mathbf{X}-\tilde{\mathbf{X}}\tilde{\mathbf{P}}^{\mathsf{T}}\tilde{\mathbf{X}}\|_{\mathsf{F}} (10)
=\displaystyle= (𝐗𝐗~)𝐏𝖳𝐗+𝐗~𝐏𝖳(𝐗𝐗~)+𝐗~(𝐏𝐏~)𝖳𝐗~𝖥\displaystyle\|(\mathbf{X}-\tilde{\mathbf{X}})\mathbf{P}^{\mathsf{T}}\mathbf{X}+\tilde{\mathbf{X}}\mathbf{P}^{\mathsf{T}}(\mathbf{X}-\tilde{\mathbf{X}})+\tilde{\mathbf{X}}(\mathbf{P}-\tilde{\mathbf{P}})^{\mathsf{T}}\tilde{\mathbf{X}}\|_{\mathsf{F}}
\displaystyle\overset{\text{\char 172}}{\leq} 𝐗𝐗~𝖥𝐏𝖳𝐗+𝐗~𝐏𝖳𝐗𝐗~𝖥+𝐗~(𝐏𝐏~)𝖳𝐗~𝖥\displaystyle\|\mathbf{X}-\tilde{\mathbf{X}}\|_{\mathsf{F}}\|\mathbf{P}^{\mathsf{T}}\mathbf{X}\|+\|\tilde{\mathbf{X}}\mathbf{P}^{\mathsf{T}}\|\|\mathbf{X}-\tilde{\mathbf{X}}\|_{\mathsf{F}}+\|\tilde{\mathbf{X}}(\mathbf{P}-\tilde{\mathbf{P}})^{\mathsf{T}}\tilde{\mathbf{X}}\|_{\mathsf{F}}
\displaystyle\overset{\text{\char 173}}{\leq} 2r𝐏𝐗𝐗~𝖥+𝐏𝐏~𝖥,\displaystyle 2\sqrt{r}\|\mathbf{P}\|\|\mathbf{X}-\tilde{\mathbf{X}}\|_{\mathsf{F}}+\|\mathbf{P}-\tilde{\mathbf{P}}\|_{\mathsf{F}},

where step uses the triangle inequality; step uses 𝐀𝐁𝖥𝐀𝐁𝖥\|\mathbf{A}\mathbf{B}\|_{\mathsf{F}}\leq\|\mathbf{A}\|\cdot\|\mathbf{B}\|_{\mathsf{F}}, and 𝐗~1\|\tilde{\mathbf{X}}\|\leq 1.

Second, we have:

𝐗𝐗𝖳𝐏𝐗~𝐗~𝖳𝐏~𝖥\displaystyle\|\mathbf{X}\mathbf{X}^{\mathsf{T}}\mathbf{P}-\tilde{\mathbf{X}}\tilde{\mathbf{X}}^{\mathsf{T}}\tilde{\mathbf{P}}\|_{\mathsf{F}} (11)
=\displaystyle= (𝐗𝐗~)𝐗𝖳𝐏+𝐗~(𝐗𝐗~)𝖳𝐏+𝐗~𝐗~𝖳(𝐏𝐏~)𝖥\displaystyle\|(\mathbf{X}-\tilde{\mathbf{X}})\mathbf{X}^{\mathsf{T}}\mathbf{P}+\tilde{\mathbf{X}}(\mathbf{X}-\tilde{\mathbf{X}})^{\mathsf{T}}\mathbf{P}+\tilde{\mathbf{X}}\tilde{\mathbf{X}}^{\mathsf{T}}(\mathbf{P}-\tilde{\mathbf{P}})\|_{\mathsf{F}}
\displaystyle\overset{\text{\char 172}}{\leq} 𝐗𝐗~𝖥𝐗𝖳𝐏+𝐗~𝐗𝐗~𝖥𝐏+𝐗~𝐗~𝖳𝐏𝐏~𝖥\displaystyle\|\mathbf{X}-\tilde{\mathbf{X}}\|_{\mathsf{F}}\|\mathbf{X}^{\mathsf{T}}\mathbf{P}\|+\|\tilde{\mathbf{X}}\|\cdot\|\mathbf{X}-\tilde{\mathbf{X}}\|_{\mathsf{F}}\cdot\|\mathbf{P}\|+\|\tilde{\mathbf{X}}\tilde{\mathbf{X}}^{\mathsf{T}}\|\cdot\|\mathbf{P}-\tilde{\mathbf{P}}\|_{\mathsf{F}}
\displaystyle\overset{\text{\char 173}}{\leq} 2r𝐏𝐗𝐗~𝖥+𝐏𝐏~𝖥,\displaystyle 2\sqrt{r}\|\mathbf{P}\|\|\mathbf{X}-\tilde{\mathbf{X}}\|_{\mathsf{F}}+\|\mathbf{P}-\tilde{\mathbf{P}}\|_{\mathsf{F}},

where step uses the triangle inequality; step uses 𝐀𝐁𝖥𝐀𝐁𝖥\|\mathbf{A}\mathbf{B}\|_{\mathsf{F}}\leq\|\mathbf{A}\|\cdot\|\mathbf{B}\|_{\mathsf{F}}, and 𝐗~1\|\tilde{\mathbf{X}}\|\leq 1.

Finally, we derive:

Proj𝐓𝐗(𝐏)Proj𝐓𝐗~(𝐏~)𝖥\displaystyle\|\operatorname{Proj}_{\mathbf{T}_{\mathbf{X}}\mathcal{M}}(\mathbf{P})-\operatorname{Proj}_{\mathbf{T}_{\tilde{\mathbf{X}}}\mathcal{M}}(\tilde{\mathbf{P}})\|_{\mathsf{F}}
=\displaystyle\overset{\text{\char 172}}{=} [𝐏12𝐗𝐏𝖳𝐗12𝐗𝐗𝖳𝐏][𝐏~12𝐗~𝐏~𝖳𝐗~12𝐗~𝐗~𝖳𝐏~]𝖥\displaystyle\|[\mathbf{P}-\tfrac{1}{2}\mathbf{X}\mathbf{P}^{\mathsf{T}}\mathbf{X}-\tfrac{1}{2}\mathbf{X}\mathbf{X}^{\mathsf{T}}\mathbf{P}]-[\tilde{\mathbf{P}}-\tfrac{1}{2}\tilde{\mathbf{X}}\tilde{\mathbf{P}}^{\mathsf{T}}\tilde{\mathbf{X}}-\tfrac{1}{2}\tilde{\mathbf{X}}\tilde{\mathbf{X}}^{\mathsf{T}}\tilde{\mathbf{P}}]\|_{\mathsf{F}}
\displaystyle\overset{\text{\char 173}}{\leq} 𝐏𝐏~𝖥+12𝐗𝐏𝖳𝐗𝐗~𝐏~𝖳𝐗~𝖥+12𝐗𝐗𝖳𝐏𝐗~𝐗~𝖳𝐏~𝖥\displaystyle\|\mathbf{P}-\tilde{\mathbf{P}}\|_{\mathsf{F}}+\tfrac{1}{2}\|\mathbf{X}\mathbf{P}^{\mathsf{T}}\mathbf{X}-\tilde{\mathbf{X}}\tilde{\mathbf{P}}^{\mathsf{T}}\tilde{\mathbf{X}}\|_{\mathsf{F}}+\tfrac{1}{2}\|\mathbf{X}\mathbf{X}^{\mathsf{T}}\mathbf{P}-\tilde{\mathbf{X}}\tilde{\mathbf{X}}^{\mathsf{T}}\tilde{\mathbf{P}}\|_{\mathsf{F}}
\displaystyle\overset{\text{\char 174}}{\leq} 𝐏𝐏~𝖥+2r𝐏𝐗𝐗~𝖥+𝐏𝐏~𝖥\displaystyle\|\mathbf{P}-\tilde{\mathbf{P}}\|_{\mathsf{F}}+2\sqrt{r}\|\mathbf{P}\|\|\mathbf{X}-\tilde{\mathbf{X}}\|_{\mathsf{F}}+\|\mathbf{P}-\tilde{\mathbf{P}}\|_{\mathsf{F}}

where step uses Proj𝐓𝐗(𝚫)=𝚫12𝐗(𝚫𝖳𝐗+𝐗𝖳𝚫)\operatorname{Proj}_{\mathbf{T}_{\mathbf{X}}\mathcal{M}}(\bm{\Delta})=\bm{\Delta}-\tfrac{1}{2}\mathbf{X}(\bm{\Delta}^{\mathsf{T}}\mathbf{X}+\mathbf{X}^{\mathsf{T}}\bm{\Delta}) for all 𝚫n×r\bm{\Delta}\in\mathbb{R}^{n\times r} Absil et al. (2008a); step uses the triangle inequality; step uses Inequalities (10) and (11).

Lemma A.8.

We let p(0,1)p\in(0,1). We define g(t)11p(t+1)(1p)11p(1p)t(1p)g(t)\triangleq\tfrac{1}{1-p}(t+1)^{(1-p)}-\tfrac{1}{1-p}-(1-p)t^{(1-p)}. We have g(t)0g(t)\geq 0 for all t1t\geq 1.

Proof.

We assume p(0,1)p\in(0,1).

First, we show that h(p)(1p)1/p1exp(1)h(p)\triangleq(1-p)^{1/p}\leq\tfrac{1}{\exp(1)}. Recall that it holds: limp0+(1+p)1/p=exp(1)\lim_{p\rightarrow 0^{+}}(1+p)^{1/p}=\exp(1) and limp0+(1p)1/p=1/exp(1)\lim_{p\rightarrow 0^{+}}(1-p)^{1/p}=1/\exp(1). Given the function h(p)h(p) is a decreasing function on p(0,1)p\in(0,1), we have h(p)limp0+(1p)1/p=1exp(1)h(p)\leq\lim_{p\rightarrow 0^{+}}(1-p)^{1/p}=\tfrac{1}{\exp(1)}.

Second, we show that f(q)=2q1q20f(q)=2^{q}-1-q^{2}\geq 0 for all q(0,1)q\in(0,1). We have f(q)=log(2)2q2q\nabla f(q)=\log(2)2^{q}-2q, and 2f(q)=2q(log(2))222(log(2))220\nabla^{2}f(q)=2^{q}(\log(2))^{2}-2\leq 2(\log(2))^{2}-2\leq 0, implying that the function f(q)f(q) is concave on q(0,1)q\in(0,1). Noticing f(0)=f(1)=0f(0)=f(1)=0, we conclude that f(q)0f(q)\geq 0.

Third, we show that g(t)g(t) is an increasing function. We have: g(t)=(t+1)p(1p)2tp=(t+1)p(1(1p)2(t+1t)p)(t+1)p(1(1p)22p)(t+1)p(1(2exp(1)2)p)0\nabla g(t)=\textstyle(t+1)^{-p}-(1-p)^{2}t^{-p}=\textstyle(t+1)^{-p}\cdot(1-(1-p)^{2}(\tfrac{t+1}{t})^{p})\overset{\text{\char 172}}{\geq}\textstyle(t+1)^{-p}\cdot(1-(1-p)^{2}2^{p})\overset{\text{\char 173}}{\geq}\textstyle(t+1)^{-p}\cdot(1-(\frac{2}{\exp(1)^{2}})^{p})\overset{\text{\char 174}}{\geq}\textstyle 0, where step uses t+1t2\frac{t+1}{t}\leq 2 for all t1t\geq 1; step uses 1p(1exp(1))p1-p\leq(\tfrac{1}{\exp(1)})^{p} for all p(0,1)p\in(0,1); step uses 2exp(1)20.2707<1\frac{2}{\exp(1)^{2}}\thickapprox 0.2707<1.

Finally, we have: t1,g(t)g(1)=(1p)1{2(1p)1(1p)2}0\forall t\geq 1,\,g(t)\overset{\text{\char 172}}{\geq}g(1)=(1-p)^{-1}\cdot\{2^{(1-p)}-1-(1-p)^{2}\}\overset{\text{\char 173}}{\geq}0, where step uses the fact that g(t)g(t) is an increasing function; step uses 2q1q202^{q}-1-q^{2}\geq 0 for all q=1p(0,1)q=1-p\in(0,1).

Lemma A.9.

Assume p(0,1)p\in(0,1). We have: (1p)T(1p)t=1T1tpT(1p)1p(1-p)T^{(1-p)}\leq\sum_{t=1}^{T}\tfrac{1}{t^{p}}\leq\tfrac{T^{(1-p)}}{1-p}.

Proof.

We define g(t)1tpg(t)\triangleq\tfrac{1}{t^{p}} and h(t)11pt(1p)h(t)\triangleq\frac{1}{1-p}t^{(1-p)}.

Using the integral test for convergence, we obtain: 1T+1g(x)𝑑xt=1Tg(t)g(1)+1Tg(x)𝑑x\int_{1}^{T+1}g(x)dx\leq\sum_{t=1}^{T}g(t)\leq g(1)+\int_{1}^{T}g(x)dx.

Part (a). We first consider the lower bound. We obtain: t=1Ttpt=1Ttt+1xp𝑑x=1T+1xp𝑑xh(T+1)h(1)=11p(T+1)1p11p(1p)T1p\sum_{t=1}^{T}t^{-p}\geq\textstyle\sum_{t=1}^{T}\int_{t}^{t+1}x^{-p}dx=\int_{1}^{T+1}x^{-p}dx\overset{\text{\char 172}}{\geq}\textstyle h(T+1)-h(1)=\tfrac{1}{1-p}(T+1)^{1-p}-\tfrac{1}{1-p}\overset{\text{\char 173}}{\geq}(1-p)T^{1-p}, where step uses h(x)=xp\nabla h(x)=x^{-p}; step uses Lemma A.8.

Part (b). We now consider the upper bound. We have: t=1Ttph(1)+1Txp𝑑x=1+h(T)h(1)=1+11p(T)1p11p=T(1p)p1p<T(1p)1p\sum_{t=1}^{T}t^{-p}\leq\textstyle h(1)+\int_{1}^{T}x^{-p}dx\overset{\text{\char 172}}{=}1+h(T)-h(1)=1+\tfrac{1}{1-p}(T)^{1-p}-\tfrac{1}{1-p}\overset{}{=}\tfrac{T^{(1-p)}-p}{1-p}<\tfrac{T^{(1-p)}}{1-p}, where step uses h(x)=xp\nabla h(x)=x^{-p}.
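Both bounds of Lemma A.9 are easy to confirm numerically (a minimal sketch, assuming Python with NumPy; p and T are arbitrary test choices):

```python
import numpy as np

p, T = 0.3, 100000
t = np.arange(1, T + 1, dtype=float)
s = np.sum(t ** (-p))                         # sum_{t=1}^T 1/t^p
print((1 - p) * T ** (1 - p) <= s <= T ** (1 - p) / (1 - p))   # True
```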

Lemma A.10.

Assume (et+1)2(et+et1)(ptpt+1)(e^{t+1})^{2}\leq(e^{t}+e^{t-1})(p^{t}-p^{t+1}) and ptpt+1p^{t}\geq p^{t+1}, where {et,pt}t=0\{e^{t},\,p^{t}\}_{t=0}^{\infty} are two nonnegative sequences. For all i1i\geq 1, we have: t=iet+1ei+ei1+4pi\sum_{t=i}^{\infty}e^{t+1}\leq e^{i}+e^{i-1}+4p^{i}.

Proof.

We define wtptpt+1w_{t}\triangleq p^{t}-p^{t+1}. We let 1i<T1\leq i<T.

First, for any i1i\geq 1, we have:

t=iTwt=t=iT(ptpt+1)=pipT+1pi,\displaystyle\textstyle\sum^{T}_{t=i}w_{t}=\sum^{T}_{t=i}(p^{t}-p^{t+1})=p^{i}-p^{T+1}\overset{\text{\char 172}}{\leq}p^{i}, (12)

where step uses pi0p^{i}\geq 0 for all ii.

Second, we obtain:

et+1\displaystyle e^{t+1} \displaystyle\overset{\text{\char 172}}{\leq} (et+et1)wt\displaystyle\textstyle\sqrt{(e^{t}+e^{t-1})w_{t}} (13)
\displaystyle\overset{\text{\char 173}}{\leq} α2(et+et1)2+(wt)2/(2α),α>0\displaystyle\textstyle\sqrt{\tfrac{\alpha}{2}(e^{t}+e^{t-1})^{2}+(w_{t})^{2}/(2\alpha)},\,\forall\alpha>0
\displaystyle\overset{\text{\char 174}}{\leq} α2(et+et1)+wt1/(2α),α>0.\displaystyle\textstyle\sqrt{\tfrac{\alpha}{2}}\cdot(e^{t}+e^{t-1})+w_{t}\sqrt{1/(2\alpha)},\,\forall\alpha>0.

Here, step uses (et+1)2(et+et1)(ptpt+1)(e^{t+1})^{2}\leq(e^{t}+e^{t-1})(p^{t}-p^{t+1}) and wtptpt+1w_{t}\triangleq p^{t}-p^{t+1}; step uses the fact that abα2a2+12αb2ab\leq\frac{\alpha}{2}a^{2}+\frac{1}{2\alpha}b^{2} for all α>0\alpha>0; step uses the fact that a+ba+b\sqrt{a+b}\leq\sqrt{a}+\sqrt{b} for all a,b0a,b\geq 0.

Assume the parameter α\alpha is sufficiently small that 12α2>01-2\sqrt{\tfrac{\alpha}{2}}>0. Telescoping Inequality (13) over tt from ii to TT, we obtain:

t=iTwt1/(2α)\displaystyle~{}\textstyle\sum_{t=i}^{T}w_{t}\sqrt{{1}/{(2\alpha)}}
\displaystyle\geq {t=iTet+1}α2{t=iTet}α2{t=iTet1}\displaystyle~{}\textstyle\{\sum_{t=i}^{T}e^{t+1}\}-\sqrt{\tfrac{\alpha}{2}}\{\sum_{t=i}^{T}e^{t}\}-\sqrt{\tfrac{\alpha}{2}}\{\sum_{t=i}^{T}e^{t-1}\}
=\displaystyle= {eT+eT+1+t=iT2et+1}α2{ei+eT+t=iT2et+1}\displaystyle~{}\textstyle\{e^{T}+e^{T+1}+\sum_{t=i}^{T-2}e^{t+1}\}-\sqrt{\tfrac{\alpha}{2}}\{e^{i}+e^{T}+\sum_{t=i}^{T-2}e^{t+1}\}
α2{ei1+ei+t=iT2et+1}\displaystyle~{}\textstyle-\sqrt{\tfrac{\alpha}{2}}\{e^{i-1}+e^{i}+\sum_{t=i}^{T-2}e^{t+1}\}
=\displaystyle= eT+eT+1α2(ei+eT+ei1+ei)+(12α2)t=iT2et+1\displaystyle~{}\textstyle e^{T}+e^{T+1}-\sqrt{\tfrac{\alpha}{2}}(e^{i}+e^{T}+e^{i-1}+e^{i})+(1-2\sqrt{\tfrac{\alpha}{2}})\sum_{t=i}^{T-2}e^{t+1}
\displaystyle\overset{\text{\char 172}}{\geq} eT(1α2)α2(ei+ei1+ei)+(12α2)t=iT2et+1\displaystyle~{}\textstyle e^{T}(1-\sqrt{\tfrac{\alpha}{2}})-\sqrt{\tfrac{\alpha}{2}}(e^{i}+e^{i-1}+e^{i})+(1-2\sqrt{\tfrac{\alpha}{2}})\sum_{t=i}^{T-2}e^{t+1}
\displaystyle\overset{\text{\char 173}}{\geq} 2α2(ei+ei1)+(12α2)t=iT2et+1,\displaystyle~{}\textstyle-2\sqrt{\tfrac{\alpha}{2}}(e^{i}+e^{i-1})+(1-2\sqrt{\tfrac{\alpha}{2}})\sum_{t=i}^{T-2}e^{t+1},

where step uses eT+10e^{T+1}\geq 0; step uses 1α2>12α2>01-\sqrt{\frac{\alpha}{2}}>1-2\sqrt{\frac{\alpha}{2}}>0. This leads to:

\sum_{t=i}^{T-2}e^{t+1}\;\leq\;(1-2\sqrt{\tfrac{\alpha}{2}})^{-1}\cdot\{2\sqrt{\tfrac{\alpha}{2}}(e^{i}+e^{i-1})+\sqrt{\tfrac{1}{2\alpha}}\sum_{t=i}^{T}w_{t}\}\;\overset{\text{\char 172}}{=}\;(e^{i}+e^{i-1})+4\sum_{t=i}^{T}w_{t}\;\overset{\text{\char 173}}{\leq}\;(e^{i}+e^{i-1})+4p^{i},

where step ① uses the fact that (1-2\sqrt{\tfrac{\alpha}{2}})^{-1}\cdot 2\sqrt{\tfrac{\alpha}{2}}=1 and (1-2\sqrt{\tfrac{\alpha}{2}})^{-1}\cdot\sqrt{\tfrac{1}{2\alpha}}=4 when \alpha=1/8; step ② uses Inequality (12). Letting T\rightarrow\infty, we conclude this lemma. ∎

Lemma A.11.

Assume t=1T(1/β~t)𝒪(Ta)\sum_{t=1}^{T}({1}/{\tilde{\beta}^{t}})\geq\mathcal{O}(T^{a}), where a0a\geq 0 is a constant, and {β~t}t=1T\{\tilde{\beta}^{t}\}_{t=1}^{T} is a nonnegative increasing sequence. If TT is an even number, we have: t=1T/2(1/β~2t)𝒪(Ta)\sum_{t=1}^{T/2}({1}/{\tilde{\beta}^{2t}})\geq\mathcal{O}(T^{a}).

Proof.

We have: \sum_{t=1}^{T/2}\tfrac{1}{\tilde{\beta}^{2t}}=\tfrac{1}{2}\sum_{t=1}^{T/2}(\tfrac{1}{\tilde{\beta}^{2t}}+\tfrac{1}{\tilde{\beta}^{2t}})\overset{\text{\char 172}}{\geq}\tfrac{1}{2}\sum_{t=1}^{T/2}(\tfrac{1}{\tilde{\beta}^{2t}}+\tfrac{1}{\tilde{\beta}^{2t+1}})=\tfrac{1}{2}\sum_{t=2}^{T+1}\tfrac{1}{\tilde{\beta}^{t}}=\tfrac{1}{2}(\sum_{t=1}^{T}\tfrac{1}{\tilde{\beta}^{t}}+\tfrac{1}{\tilde{\beta}^{T+1}}-\tfrac{1}{\tilde{\beta}^{1}})=\mathcal{O}(\sum_{t=1}^{T}\tfrac{1}{\tilde{\beta}^{t}})\geq\mathcal{O}(T^{a}), where step ① uses the fact that \{\tilde{\beta}^{t}\}_{t=1}^{T} is increasing. ∎

Lemma A.12.

Assume that \tfrac{d^{t}}{d^{t-2}}\leq\tfrac{\dot{\beta}^{t}+1}{\dot{\beta}^{t}+2}, and \sum_{i=0}^{T}({1}/{\dot{\beta}^{i}})\geq\mathcal{O}(T^{a}), where a>0 is a constant, and \{d^{t}\}_{t=0}^{\infty} and \{\dot{\beta}^{t}\}_{t=0}^{\infty} are two nonnegative sequences. Assume that \{\dot{\beta}^{t}\}_{t=0}^{\infty} is increasing. We have: d^{T}\leq\mathcal{O}({1}/{\exp(T^{a})}).

Proof.

We define γt1β˙t+2(0,1)\gamma^{t}\triangleq\tfrac{1}{\dot{\beta}^{t}+2}\in(0,1).

Given dtdt2β˙t+1β˙t+2\tfrac{d^{t}}{d^{t-2}}\leq\tfrac{\dot{\beta}^{t}+1}{\dot{\beta}^{t}+2}, we have dtdt21γt\tfrac{d^{t}}{d^{t-2}}\leq 1-\gamma^{t}, leading to:

d2t\displaystyle d^{2t} \displaystyle\leq d0(1γ2)(1γ4)(1γ6)(1γ2t).\displaystyle d^{0}(1-\gamma^{2})(1-\gamma^{4})(1-\gamma^{6})\ldots(1-\gamma^{2t}). (14)

Part (a). When TT is an even number, we have:

dT\displaystyle\textstyle d^{T} =\displaystyle= exp(log(dT))\displaystyle\textstyle\exp(\log(d^{T}))
\displaystyle\overset{\text{\char 172}}{\leq} exp(log(d0t=1T/2(1γ2t)))\displaystyle\textstyle\exp(\log(d^{0}\cdot\prod_{t=1}^{T/2}(1-\gamma^{2t})))
=\displaystyle\overset{\text{\char 173}}{=} exp(log(d0)+t=1T/2log(1γ2t))\displaystyle\textstyle\exp(\log(d^{0})+\sum_{t=1}^{T/2}\log(1-\gamma^{2t}))
\displaystyle\overset{\text{\char 174}}{\leq} exp(log(d0)+t=1T/2(γ2t))\displaystyle\textstyle\exp(\log(d^{0})+\sum_{t=1}^{T/2}(-\gamma^{2t}))
\displaystyle\overset{\text{\char 175}}{\leq} exp(log(d0))×{exp(t=1T/2(γ2t))}1\displaystyle\textstyle\exp(\log(d^{0}))\times\{\exp(\sum_{t=1}^{T/2}(\gamma^{2t}))\}^{-1}
\displaystyle\overset{\text{\char 176}}{\leq} d0×{exp(𝒪(Ta))}1=𝒪(1/exp(Ta)),\displaystyle\textstyle d^{0}\times\{\exp(\mathcal{O}(T^{a}))\}^{-1}=\mathcal{O}({1}/{\exp(T^{a})}),

where step uses Inequality (14); step uses log(ab)=log(a)+log(b)\log(ab)=\log(a)+\log(b) for all a>0a>0 and b>0b>0; step uses log(1x)x\log(1-x)\leq-x for all x(0,1)x\in(0,1), and 1γt(0,1)1-\gamma^{t}\in(0,1) for all tt; step uses exp(a+b)=exp(a)exp(b)\exp(a+b)=\exp(a)\exp(b) for all a>0a>0 and b>0b>0; step uses Lemma A.11 with β~t=1/γt=β˙t+2\tilde{\beta}^{t}=1/\gamma^{t}=\dot{\beta}^{t}+2.

Part (b). When TT is an odd number, analogous strategies result in the same complexity outcome.
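To illustrate the rate, consider the simplest instantiation in which \dot{\beta}^{t} is constant, so that \sum_{i=0}^{T}1/\dot{\beta}^{i}=\mathcal{O}(T) and hence a=1; the recursion then forces geometric decay of d^{T} (a minimal sketch, assuming Python with NumPy):

```python
import numpy as np

beta = 5.0                                   # constant dual sequence, so a = 1
d = [1.0, 1.0]                               # d^0, d^1
for t in range(2, 201):
    d.append(d[t - 2] * (beta + 1) / (beta + 2))
# log d^T decreases linearly in T, i.e., d^T = O(1/exp(T)):
print(np.log(d[-1]) / (len(d) - 1))          # ~ 0.5*log(6/7) = -0.077 per iteration
```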

Lemma A.13.

Assume that [d^{t}]^{\tau+1}\leq\ddot{\beta}^{t}(d^{t-2}-d^{t}), and \sum_{i=1}^{T}({1}/{\ddot{\beta}^{i}})\geq\mathcal{O}(T^{a}), where \tau,a>0 are positive constants, and \{d^{t}\}_{t=0}^{\infty} and \{\ddot{\beta}^{t}\}_{t=0}^{\infty} are two nonnegative sequences. Assume that \{\ddot{\beta}^{t}\}_{t=0}^{\infty} is increasing. We have: d^{T}\leq\mathcal{O}(1/(T^{a/\tau})).

Proof.

We let κ>1\kappa>1 be any constant. We define h(s)=sτ1h(s)=s^{-\tau-1}, where τ>0\tau>0.

We consider two cases for h(dt)/h(dt2)h(d^{t})/h(d^{t-2}).

Case (1). h(dt)κh(dt2)h(d^{t})\leq\kappa h(d^{t-2}). We define h˘(s)1τsτ\breve{h}(s)\triangleq-\tfrac{1}{\tau}\cdot s^{-\tau}. We derive:

1\displaystyle 1 \displaystyle\overset{\text{\char 172}}{\leq} β¨t(dt2dt)h(dt)\displaystyle\textstyle\ddot{\beta}^{t}(d^{t-2}-d^{t})\cdot h(d^{t})
\displaystyle\overset{\text{\char 173}}{\leq} β¨t(dt2dt)κh(dt2)\displaystyle\textstyle\ddot{\beta}^{t}(d^{t-2}-d^{t})\cdot\kappa h(d^{t-2})
\displaystyle\overset{\text{\char 174}}{\leq} β¨tκdtdt2h(s)𝑑s\displaystyle\textstyle\ddot{\beta}^{t}\kappa\int_{d^{t}}^{d^{t-2}}h(s)ds
=\displaystyle\overset{\text{\char 175}}{=} β¨tκ(h˘(dt2)h˘(dt))\displaystyle\textstyle\ddot{\beta}^{t}\kappa\cdot(\breve{h}(d^{t-2})-\breve{h}(d^{t}))
=\displaystyle\overset{\text{\char 176}}{=} β¨tκ1τ([dt]τ[dt2]τ),\displaystyle\textstyle\ddot{\beta}^{t}\kappa\cdot\tfrac{1}{\tau}\cdot([d^{t}]^{-\tau}-[d^{t-2}]^{-\tau}),

where step ① uses [d^{t}]^{\tau+1}\leq\ddot{\beta}^{t}(d^{t-2}-d^{t}); step ② uses h(d^{t})\leq\kappa h(d^{t-2}); step ③ uses the fact that h(s) is nonnegative and decreasing, so that (a-b)h(a)\leq\int_{b}^{a}h(s)ds for all a\geq b>0; step ④ uses \nabla\breve{h}(s)=h(s); step ⑤ uses the definition of \breve{h}(\cdot). This leads to:

[dt]τ[dt2]τκ1τβ¨t.\displaystyle[d^{t}]^{-\tau}-[d^{t-2}]^{-\tau}\geq\tfrac{\kappa^{-1}\tau}{\ddot{\beta}^{t}}. (15)

Case (2). h(dt)>κh(dt2)h(d^{t})>\kappa h(d^{t-2}). We have:

h(dt)>κh(dt2)\displaystyle h(d^{t})>\kappa h(d^{t-2}) \displaystyle\overset{\text{\char 172}}{\Rightarrow} [dt](τ+1)>κ[dt2](τ+1)\displaystyle[d^{t}]^{-(\tau+1)}>\kappa\cdot[d^{t-2}]^{-(\tau+1)} (16)
\displaystyle\overset{\text{\char 173}}{\Rightarrow} ([dt](τ+1))ττ+1>κττ+1([dt2](τ+1))ττ+1\displaystyle([d^{t}]^{-(\tau+1)})^{\tfrac{\tau}{\tau+1}}>\kappa^{\tfrac{\tau}{\tau+1}}\cdot([d^{t-2}]^{-(\tau+1)})^{\tfrac{\tau}{\tau+1}}
\displaystyle\overset{}{\Rightarrow} [dt]τ>κττ+1[dt2]τ,\displaystyle[d^{t}]^{-\tau}>\kappa^{\tfrac{\tau}{\tau+1}}\cdot[d^{t-2}]^{-\tau},

where step uses the definition of h()h(\cdot); step uses the fact that if a>b>0a>b>0, then aτ˙>bτ˙a^{\dot{\tau}}>b^{\dot{\tau}} for any exponent τ˙ττ+1(0,1)\dot{\tau}\triangleq\tfrac{\tau}{\tau+1}\in(0,1). We further derive:

[dt]τ[dt2]τ\displaystyle\textstyle[d^{t}]^{-\tau}-[d^{t-2}]^{-\tau} \displaystyle\overset{\text{\char 172}}{\geq} (κττ+11)[dt2]τ\displaystyle\textstyle(\kappa^{\tfrac{\tau}{\tau+1}}-1)\cdot[d^{t-2}]^{-\tau} (17)
\displaystyle\overset{\text{\char 173}}{\geq} (κττ+11)[d0]τ,\displaystyle\textstyle(\kappa^{\tfrac{\tau}{\tau+1}}-1)\cdot[d^{0}]^{-\tau},

where step uses Inequality (16); step uses τ>0\tau>0 and dt2d0d^{t-2}\leq d^{0} for all tt.

In view of Inequalities (15) and (17), we have:

[d^{t}]^{-\tau}-[d^{t-2}]^{-\tau}\;\geq\;\min(\tfrac{\kappa^{-1}\tau}{\ddot{\beta}^{t}},(\kappa^{\tfrac{\tau}{\tau+1}}-1)\cdot[d^{0}]^{-\tau}) (18)
\displaystyle\geq 1β¨tmin(κ1τ,(κττ+11)[d0]τβ¨0)μ0.\displaystyle\textstyle\tfrac{1}{\ddot{\beta}^{t}}\cdot\smash{\underbrace{\min(\kappa^{-1}\tau,(\kappa^{\tfrac{\tau}{\tau+1}}-1)\cdot[d^{0}]^{-\tau}\ddot{\beta}^{0})}_{\mu_{0}}}.

We now focus on Inequality (18).

Part (a). When TT is an even number, telescoping Inequality (18) over t={2,4,,T}t=\{2,4,\ldots,T\}, we have:

[dT]τ[d0]τμ0t=1T/21β¨2t𝒪(Ta),\displaystyle\textstyle[d^{T}]^{-\tau}-[d^{0}]^{-\tau}\geq\textstyle\mu_{0}\sum_{t=1}^{T/2}\tfrac{1}{\ddot{\beta}^{2t}}\overset{\text{\char 172}}{\geq}\mathcal{O}(T^{a}),

where step ① uses Lemma A.11. This leads to:

dT=([dT]τ)1/τ𝒪(Ta)1/τ=𝒪(1/(Ta/τ)).\displaystyle d^{T}=([d^{T}]^{-\tau})^{-1/\tau}\leq\mathcal{O}(T^{a})^{-1/\tau}=\mathcal{O}(1/(T^{a/\tau})).

Part (b). When TT is an odd number, analogous strategies result in the same complexity outcome.

Appendix B Proofs for Section 2

B.1 Proof of Lemma 2.3

Proof.

Assume 0<\mu_{2}<\mu_{1}<\frac{1}{W_{h}}, and fix \mathbf{y}\in\mathbb{R}^{m}.

We define hμ1(𝐲)min𝐯h(𝐯)+12μ1𝐯𝐲22h_{\mu_{1}}(\mathbf{y})\triangleq\min_{\mathbf{v}}h(\mathbf{v})+\frac{1}{2\mu_{1}}\|\mathbf{v}-\mathbf{y}\|_{2}^{2}, and μ1(𝐲)=argmin𝐯h(𝐯)+12μ1𝐯𝐲22\operatorname{\mathbb{P}}_{\mu_{1}}(\mathbf{y})=\textstyle\arg\min_{\mathbf{v}}\,h(\mathbf{v})+\tfrac{1}{2\mu_{1}}\|\mathbf{v}-\mathbf{y}\|_{2}^{2}.

We define hμ2(𝐲)min𝐯h(𝐯)+12μ2𝐯𝐲22h_{\mu_{2}}(\mathbf{y})\triangleq\min_{\mathbf{v}}h(\mathbf{v})+\frac{1}{2\mu_{2}}\|\mathbf{v}-\mathbf{y}\|_{2}^{2}, and μ2(𝐲)=argmin𝐯h(𝐯)+12μ2𝐯𝐲22\operatorname{\mathbb{P}}_{\mu_{2}}(\mathbf{y})=\textstyle\arg\min_{\mathbf{v}}\,h(\mathbf{v})+\tfrac{1}{2\mu_{2}}\|\mathbf{v}-\mathbf{y}\|_{2}^{2}.

By the optimality of μ1(𝐲)\operatorname{\mathbb{P}}_{\mu_{1}}(\mathbf{y}) and μ2(𝐲)\operatorname{\mathbb{P}}_{\mu_{2}}(\mathbf{y}), we obtain:

𝐲μ1(𝐲)\displaystyle\textstyle\mathbf{y}-\operatorname{\mathbb{P}}_{\mu_{1}}(\mathbf{y}) \displaystyle\in μ1h(μ1(𝐲))\displaystyle\textstyle\mu_{1}\partial h(\operatorname{\mathbb{P}}_{\mu_{1}}(\mathbf{y})) (19)
𝐲μ2(𝐲)\displaystyle\textstyle\mathbf{y}-\operatorname{\mathbb{P}}_{\mu_{2}}(\mathbf{y}) \displaystyle\in μ2h(μ2(𝐲)).\displaystyle\textstyle\mu_{2}\partial h(\operatorname{\mathbb{P}}_{\mu_{2}}(\mathbf{y})). (20)

Part (a). We now prove that 0hμ2(𝐲)hμ1(𝐲)0\leq h_{\mu_{2}}(\mathbf{y})-h_{\mu_{1}}(\mathbf{y}). For any 𝐬1h(μ1(𝐲))\mathbf{s}_{1}\in\partial h(\operatorname{\mathbb{P}}_{\mu_{1}}(\mathbf{y})) and 𝐬2h(μ2(𝐲))\mathbf{s}_{2}\in\partial h(\operatorname{\mathbb{P}}_{\mu_{2}}(\mathbf{y})), we have:

hμ1(𝐲)hμ2(𝐲)\displaystyle h_{\mu_{1}}(\mathbf{y})-h_{\mu_{2}}(\mathbf{y})
=\displaystyle\overset{\text{\char 172}}{=} 12μ1𝐲μ1(𝐲)2212μ2𝐲μ2(𝐲)22+h(μ1(𝐲))h(μ2(𝐲))\displaystyle\tfrac{1}{2\mu_{1}}\|\mathbf{y}-\operatorname{\mathbb{P}}_{\mu_{1}}(\mathbf{y})\|_{2}^{2}-\tfrac{1}{2\mu_{2}}\|\mathbf{y}-\operatorname{\mathbb{P}}_{\mu_{2}}(\mathbf{y})\|_{2}^{2}+h(\operatorname{\mathbb{P}}_{\mu_{1}}(\mathbf{y}))-h(\operatorname{\mathbb{P}}_{\mu_{2}}(\mathbf{y}))
\displaystyle\overset{\text{\char 173}}{\leq} 12μ1𝐲μ1(𝐲)2212μ2𝐲μ2(𝐲)22+μ1(𝐲)μ2(𝐲),𝐬1+Wh2μ2(𝐲)μ1(𝐲)22\displaystyle\tfrac{1}{2\mu_{1}}\|\mathbf{y}-\operatorname{\mathbb{P}}_{\mu_{1}}(\mathbf{y})\|_{2}^{2}-\tfrac{1}{2\mu_{2}}\|\mathbf{y}-\operatorname{\mathbb{P}}_{\mu_{2}}(\mathbf{y})\|_{2}^{2}+\langle\operatorname{\mathbb{P}}_{\mu_{1}}(\mathbf{y})-\operatorname{\mathbb{P}}_{\mu_{2}}(\mathbf{y}),\mathbf{s}_{1}\rangle+\tfrac{W_{h}}{2}\|\operatorname{\mathbb{P}}_{\mu_{2}}(\mathbf{y})-\operatorname{\mathbb{P}}_{\mu_{1}}(\mathbf{y})\|_{2}^{2}
=\displaystyle\overset{\text{\char 174}}{=} 12μ1μ1𝐬12212μ2μ2𝐬222+μ2𝐬2μ1𝐬1,𝐬1+Wh2μ1𝐬1μ2𝐬222\displaystyle\tfrac{1}{2\mu_{1}}\|\mu_{1}\mathbf{s}_{1}\|_{2}^{2}-\tfrac{1}{2\mu_{2}}\|\mu_{2}\mathbf{s}_{2}\|_{2}^{2}+\langle\mu_{2}\mathbf{s}_{2}-\mu_{1}\mathbf{s}_{1},\mathbf{s}_{1}\rangle+\tfrac{W_{h}}{2}\|\mu_{1}\mathbf{s}_{1}-\mu_{2}\mathbf{s}_{2}\|_{2}^{2}
\displaystyle\overset{\text{\char 175}}{\leq} 12μ1μ1𝐬12212μ2μ2𝐬222+μ2𝐬2μ1𝐬1,𝐬1+12μ1μ1𝐬1μ2𝐬222\displaystyle\tfrac{1}{2\mu_{1}}\|\mu_{1}\mathbf{s}_{1}\|_{2}^{2}-\tfrac{1}{2\mu_{2}}\|\mu_{2}\mathbf{s}_{2}\|_{2}^{2}+\langle\mu_{2}\mathbf{s}_{2}-\mu_{1}\mathbf{s}_{1},\mathbf{s}_{1}\rangle+\tfrac{1}{2\mu_{1}}\|\mu_{1}\mathbf{s}_{1}-\mu_{2}\mathbf{s}_{2}\|_{2}^{2}
=\displaystyle\overset{}{=} μ22𝐬222(1μ2μ1)\displaystyle-\tfrac{\mu_{2}}{2}\|\mathbf{s}_{2}\|_{2}^{2}\cdot(1-\tfrac{\mu_{2}}{\mu_{1}})
\displaystyle\overset{\text{\char 176}}{\leq} 0,\displaystyle 0,

where step ① uses the definition of h_{\mu_{1}}(\mathbf{y}) and h_{\mu_{2}}(\mathbf{y}); step ② uses the weak convexity of h(\cdot); step ③ uses the optimality of \operatorname{\mathbb{P}}_{\mu_{1}}(\mathbf{y}) and \operatorname{\mathbb{P}}_{\mu_{2}}(\mathbf{y}) in Equations (19) and (20); step ④ uses W_{h}\leq\tfrac{1}{\mu_{1}}; step ⑤ uses 1\geq\tfrac{\mu_{2}}{\mu_{1}}.

Part (b). We now prove that hμ2(𝐲)hμ1(𝐲)min{μ12μ2,1}(μ1μ2)Ch2h_{\mu_{2}}(\mathbf{y})-h_{\mu_{1}}(\mathbf{y})\leq\min\{\tfrac{\mu_{1}}{2\mu_{2}},1\}\cdot(\mu_{1}-\mu_{2})C_{h}^{2}. For any 𝐬1h(μ1(𝐲))\mathbf{s}_{1}\in\partial h(\operatorname{\mathbb{P}}_{\mu_{1}}(\mathbf{y})) and 𝐬2h(μ2(𝐲))\mathbf{s}_{2}\in\partial h(\operatorname{\mathbb{P}}_{\mu_{2}}(\mathbf{y})), we have:

hμ2(𝐲)hμ1(𝐲)\displaystyle h_{\mu_{2}}(\mathbf{y})-h_{\mu_{1}}(\mathbf{y})
=\displaystyle\overset{\text{\char 172}}{=} 12μ2𝐲μ2(𝐲)2212μ1𝐲μ1(𝐲)22+h(μ2(𝐲))h(μ1(𝐲))\displaystyle\tfrac{1}{2\mu_{2}}\|\mathbf{y}-\operatorname{\mathbb{P}}_{\mu_{2}}(\mathbf{y})\|_{2}^{2}-\tfrac{1}{2\mu_{1}}\|\mathbf{y}-\operatorname{\mathbb{P}}_{\mu_{1}}(\mathbf{y})\|_{2}^{2}+h(\operatorname{\mathbb{P}}_{\mu_{2}}(\mathbf{y}))-h(\operatorname{\mathbb{P}}_{\mu_{1}}(\mathbf{y}))
\displaystyle\overset{\text{\char 173}}{\leq} 12μ2𝐲μ2(𝐲)2212μ1𝐲μ1(𝐲)22+μ2(𝐲)μ1(𝐲),𝐬1+Wh2μ2(𝐲)μ1(𝐲)22\displaystyle\tfrac{1}{2\mu_{2}}\|\mathbf{y}-\operatorname{\mathbb{P}}_{\mu_{2}}(\mathbf{y})\|_{2}^{2}-\tfrac{1}{2\mu_{1}}\|\mathbf{y}-\operatorname{\mathbb{P}}_{\mu_{1}}(\mathbf{y})\|_{2}^{2}+\langle\operatorname{\mathbb{P}}_{\mu_{2}}(\mathbf{y})-\operatorname{\mathbb{P}}_{\mu_{1}}(\mathbf{y}),\mathbf{s}_{1}\rangle+\tfrac{W_{h}}{2}\|\operatorname{\mathbb{P}}_{\mu_{2}}(\mathbf{y})-\operatorname{\mathbb{P}}_{\mu_{1}}(\mathbf{y})\|_{2}^{2}
=\displaystyle\overset{\text{\char 174}}{=} μ22𝐬122μ12𝐬222+μ1𝐬2μ2𝐬1,𝐬1+Wh2μ1𝐬2μ2𝐬122\displaystyle\tfrac{\mu_{2}}{2}\|\mathbf{s}_{1}\|_{2}^{2}-\tfrac{\mu_{1}}{2}\|\mathbf{s}_{2}\|_{2}^{2}+\langle\mu_{1}\mathbf{s}_{2}-\mu_{2}\mathbf{s}_{1},\mathbf{s}_{1}\rangle+\tfrac{W_{h}}{2}\|\mu_{1}\mathbf{s}_{2}-\mu_{2}\mathbf{s}_{1}\|_{2}^{2}
=\displaystyle= μ22𝐬122μ12𝐬222+μ1𝐬1,𝐬2+Wh2μ1𝐬2μ2𝐬122\displaystyle-\tfrac{\mu_{2}}{2}\|\mathbf{s}_{1}\|_{2}^{2}-\tfrac{\mu_{1}}{2}\|\mathbf{s}_{2}\|_{2}^{2}+\mu_{1}\langle\mathbf{s}_{1},\mathbf{s}_{2}\rangle+\tfrac{W_{h}}{2}\|\mu_{1}\mathbf{s}_{2}-\mu_{2}\mathbf{s}_{1}\|_{2}^{2}
\displaystyle\overset{\text{\char 175}}{\leq} min{μ12𝐬222+μ1𝐬1,𝐬2+12μ2μ1𝐬2μ2𝐬122μ22𝐬122,\displaystyle\min\{-\tfrac{\mu_{1}}{2}\|\mathbf{s}_{2}\|_{2}^{2}+\mu_{1}\langle\mathbf{s}_{1},\mathbf{s}_{2}\rangle+\tfrac{1}{2\mu_{2}}\|\mu_{1}\mathbf{s}_{2}-\mu_{2}\mathbf{s}_{1}\|_{2}^{2}-\tfrac{\mu_{2}}{2}\|\mathbf{s}_{1}\|_{2}^{2},
μ12𝐬222+μ1𝐬1,𝐬2+12μ1μ1𝐬2μ2𝐬122μ22𝐬122}\displaystyle~{}~{}~{}~{}~{}~{}~{}~{}-\tfrac{\mu_{1}}{2}\|\mathbf{s}_{2}\|_{2}^{2}+\mu_{1}\langle\mathbf{s}_{1},\mathbf{s}_{2}\rangle+\tfrac{1}{2\mu_{1}}\|\mu_{1}\mathbf{s}_{2}-\mu_{2}\mathbf{s}_{1}\|_{2}^{2}-\tfrac{\mu_{2}}{2}\|\mathbf{s}_{1}\|_{2}^{2}\}
=\displaystyle\overset{}{=} min{(μ2+μ1)μ12μ2𝐬222,(μ1μ2)𝐬1,𝐬2μ22𝐬122+μ222μ1𝐬122}\displaystyle\min\{(-\mu_{2}+\mu_{1})\cdot\tfrac{\mu_{1}}{2\mu_{2}}\|\mathbf{s}_{2}\|_{2}^{2},(\mu_{1}-\mu_{2})\langle\mathbf{s}_{1},\mathbf{s}_{2}\rangle-\tfrac{\mu_{2}}{2}\|\mathbf{s}_{1}\|_{2}^{2}+\tfrac{\mu_{2}^{2}}{2\mu_{1}}\|\mathbf{s}_{1}\|_{2}^{2}\}
\displaystyle\overset{\text{\char 176}}{\leq} min{μ12μ2𝐬222(μ1μ2),(μ1μ2)𝐬1,𝐬2}\displaystyle\min\{\tfrac{\mu_{1}}{2\mu_{2}}\|\mathbf{s}_{2}\|_{2}^{2}\cdot(\mu_{1}-\mu_{2}),(\mu_{1}-\mu_{2})\langle\mathbf{s}_{1},\mathbf{s}_{2}\rangle\}
\displaystyle\overset{\text{\char 177}}{\leq} min{μ12μ2(μ1μ2),(μ1μ2)}Ch2\displaystyle\min\{\tfrac{\mu_{1}}{2\mu_{2}}\cdot(\mu_{1}-\mu_{2}),(\mu_{1}-\mu_{2})\}\cdot C_{h}^{2}
=\displaystyle\overset{}{=} min{μ12μ2,1}(μ1μ2)Ch2,\displaystyle\min\{\tfrac{\mu_{1}}{2\mu_{2}},1\}\cdot(\mu_{1}-\mu_{2})\cdot C_{h}^{2},

where step ① uses the definition of h_{\mu_{1}}(\mathbf{y}) and h_{\mu_{2}}(\mathbf{y}); step ② uses the weak convexity of h(\cdot); step ③ uses the optimality of \operatorname{\mathbb{P}}_{\mu_{2}}(\mathbf{y}) and \operatorname{\mathbb{P}}_{\mu_{1}}(\mathbf{y}) in Equations (19) and (20); step ④ uses W_{h}\leq\frac{1}{\mu_{1}} and W_{h}\leq\frac{1}{\mu_{2}}; step ⑤ uses \mu_{2}\leq\mu_{1}; step ⑥ uses \|\mathbf{s}_{1}\|\leq C_{h}, \|\mathbf{s}_{2}\|\leq C_{h}, and \langle\mathbf{s}_{1},\mathbf{s}_{2}\rangle\leq\|\mathbf{s}_{1}\|\cdot\|\mathbf{s}_{2}\|\leq C_{h}^{2}. ∎
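Both parts of the lemma can be verified numerically. A minimal sketch with h=\|\cdot\|_{1} (convex, hence W_{h}=0, with Lipschitz constant C_{h}=\sqrt{m}), whose proximal operator is componentwise soft-thresholding (assuming Python with NumPy):

```python
import numpy as np

def prox_l1(y, mu):
    # P_mu(y) for h = ||.||_1: componentwise soft-thresholding.
    return np.sign(y) * np.maximum(np.abs(y) - mu, 0.0)

def h_env(y, mu):
    # Moreau envelope h_mu(y) = min_v ||v||_1 + ||v - y||^2 / (2 mu).
    v = prox_l1(y, mu)
    return np.sum(np.abs(v)) + np.sum((v - y) ** 2) / (2 * mu)

rng = np.random.default_rng(2)
y = rng.standard_normal(6)
mu1, mu2 = 0.5, 0.2                       # 0 < mu2 < mu1 (< 1/W_h = infinity)
Ch = np.sqrt(y.size)                      # Lipschitz constant of ||.||_1 w.r.t. ||.||_2
gap = h_env(y, mu2) - h_env(y, mu1)       # Part (a): gap >= 0
bound = min(mu1 / (2 * mu2), 1.0) * (mu1 - mu2) * Ch ** 2   # Part (b)
print(0.0 <= gap <= bound)                # True
```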

B.2 Proof of Lemma 2.4

Proof.

Assume 0<\mu_{2}<\mu_{1}\leq\tfrac{1}{2W_{h}}, and fix \mathbf{y}\in\mathbb{R}^{m}.

Using the result in Lemma 2.2, we establish that the gradient of h_{\mu}(\mathbf{y}) w.r.t. \mathbf{y} can be computed as:

hμ(𝐲)=μ1(𝐲μ(𝐲)).\displaystyle\nabla h_{\mu}(\mathbf{y})=\mu^{-1}(\mathbf{y}-\operatorname{\mathbb{P}}_{\mu}(\mathbf{y})).

The gradient of the mapping hμ(𝐲)\nabla h_{\mu}(\mathbf{y}) w.r.t. the variable 1/μ1/\mu can be computed as: 1/μ(hμ(𝐲))=𝐲μ(𝐲)\nabla_{1/\mu}\left(\nabla h_{\mu}(\mathbf{y})\right)=\mathbf{y}-\operatorname{\mathbb{P}}_{\mu}(\mathbf{y}). We further obtain:

\|\nabla_{1/\mu}\left(\nabla h_{\mu}(\mathbf{y})\right)\|=\|\mathbf{y}-\operatorname{\mathbb{P}}_{\mu}(\mathbf{y})\|\overset{\text{\char 172}}{\leq}\mu C_{h}.

Here, step ① uses the optimality condition of \operatorname{\mathbb{P}}_{\mu}(\mathbf{y}), namely \mathbf{0}\in\partial h(\operatorname{\mathbb{P}}_{\mu}(\mathbf{y}))+\tfrac{1}{\mu}(\operatorname{\mathbb{P}}_{\mu}(\mathbf{y})-\mathbf{y}), so that \mathbf{y}-\operatorname{\mathbb{P}}_{\mu}(\mathbf{y})\in\mu\partial h(\operatorname{\mathbb{P}}_{\mu}(\mathbf{y})), together with the C_{h}-Lipschitz continuity of h(\cdot), which bounds every subgradient by C_{h}. Therefore, for all \mu,\mu^{\prime}\in(0,\tfrac{1}{2W_{h}}], we have:

\|\nabla h_{\mu}(\mathbf{y})-\nabla h_{\mu^{\prime}}(\mathbf{y})\|_{2}\leq\max\{\mu,\mu^{\prime}\}\cdot C_{h}\cdot|1/\mu-1/\mu^{\prime}|.

Letting μ=μ1\mu=\mu_{1} and μ=μ2\mu^{\prime}=\mu_{2}, we have: hμ1(𝐲)hμ2(𝐲)2|1μ1/μ2|Ch=(μ1/μ21)Ch\|\nabla h_{\mu_{1}}(\mathbf{y})-\nabla h_{\mu_{2}}(\mathbf{y})\|_{2}\leq|1-\mu_{1}/\mu_{2}|C_{h}=(\mu_{1}/\mu_{2}-1)C_{h}.
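The closed-form gradient \nabla h_{\mu}(\mathbf{y})=\mu^{-1}(\mathbf{y}-\operatorname{\mathbb{P}}_{\mu}(\mathbf{y})) used above can be validated against central finite differences; a minimal sketch with h=\|\cdot\|_{1} (assuming Python with NumPy):

```python
import numpy as np

def prox_l1(y, mu):                        # P_mu(y) for h = ||.||_1 (soft-thresholding)
    return np.sign(y) * np.maximum(np.abs(y) - mu, 0.0)

def h_env(y, mu):                          # Moreau envelope of ||.||_1
    v = prox_l1(y, mu)
    return np.sum(np.abs(v)) + np.sum((v - y) ** 2) / (2 * mu)

mu, eps = 0.3, 1e-6
y = np.random.default_rng(3).standard_normal(5)
grad = (y - prox_l1(y, mu)) / mu           # closed-form gradient of the envelope
fd = np.array([(h_env(y + eps * e, mu) - h_env(y - eps * e, mu)) / (2 * eps)
               for e in np.eye(y.size)])   # central finite differences
print(np.abs(grad - fd).max())             # ~1e-10
```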

B.3 Proof of Lemma 2.5

Proof.

We consider the following optimization problem:

𝐲¯=argmin𝐲hμ(𝐲)+β2𝐲𝐛22.\displaystyle\bar{\mathbf{y}}=\arg\min_{\mathbf{y}}h_{\mu}(\mathbf{y})+\tfrac{\beta}{2}\|\mathbf{y}-\mathbf{b}\|_{2}^{2}. (21)

Given hμ(𝐲)h_{\mu}(\mathbf{y}) being (μ1)(\mu^{-1})-weakly convex and β>μ1\beta>\mu^{-1}, Problem (21) becomes strongly convex and has a unique optimal solution, which leads to the following equivalent problem:

(\bar{\mathbf{y}},\breve{\mathbf{y}})=\arg\min_{\mathbf{y},\mathbf{y}^{\prime}}h(\mathbf{y}^{\prime})+\tfrac{1}{2\mu}\|\mathbf{y}-\mathbf{y}^{\prime}\|_{2}^{2}+\tfrac{\beta}{2}\|\mathbf{y}-\mathbf{b}\|_{2}^{2}.

We have the following first-order optimality conditions for (𝐲¯,𝐲˘)(\bar{\mathbf{y}},\breve{\mathbf{y}}):

1μ(𝐲¯𝐲˘)\displaystyle\tfrac{1}{\mu}(\bar{\mathbf{y}}-\breve{\mathbf{y}}) =\displaystyle= β(𝐛𝐲¯)\displaystyle\beta(\mathbf{b}-\bar{\mathbf{y}}) (22)
1μ(𝐲¯𝐲˘)\displaystyle\tfrac{1}{\mu}(\bar{\mathbf{y}}-\breve{\mathbf{y}}) \displaystyle\in h(𝐲˘).\displaystyle\partial h(\breve{\mathbf{y}}). (23)

Part (a). We have the following results:

𝟎\displaystyle\mathbf{0} \displaystyle\overset{\text{\char 172}}{\in} h(𝐲˘)+1μ(𝐲˘𝐲¯)\displaystyle\partial h(\breve{\mathbf{y}})+\tfrac{1}{\mu}(\breve{\mathbf{y}}-\bar{\mathbf{y}}) (24)
=\displaystyle\overset{\text{\char 173}}{=} h(𝐲˘)+1μ(𝐲˘11/μ+β(1μ𝐲˘+β𝐛))\displaystyle\partial h(\breve{\mathbf{y}})+\tfrac{1}{\mu}(\breve{\mathbf{y}}-\tfrac{1}{1/\mu+\beta}(\tfrac{1}{\mu}\breve{\mathbf{y}}+\beta\mathbf{b}))
=\displaystyle\overset{}{=} h(𝐲˘)+β1+μβ(𝐲˘𝐛),\displaystyle\partial h(\breve{\mathbf{y}})+\tfrac{\beta}{1+\mu\beta}(\breve{\mathbf{y}}-\mathbf{b}),

where step uses Equality (23); step uses Equality (22) that 𝐲¯=11/μ+β(1μ𝐲˘+β𝐛)\bar{\mathbf{y}}=\tfrac{1}{1/\mu+\beta}(\tfrac{1}{\mu}\breve{\mathbf{y}}+\beta\mathbf{b}). The inclusion in (24) implies that:

\breve{\mathbf{y}}=\arg\min_{\mathbf{v}}\,h(\mathbf{v})+\tfrac{1}{2}\cdot\tfrac{\beta}{1+\mu\beta}\|\mathbf{v}-\mathbf{b}\|_{2}^{2}.

Part (b). Combining Equalities (22) and (23), we have: β(𝐛𝐲¯)h(𝐲˘)\beta(\mathbf{b}-\bar{\mathbf{y}})\in\partial h(\breve{\mathbf{y}}).

Part (c). In view of Equation (23), we have \bar{\mathbf{y}}-\breve{\mathbf{y}}\in\mu\partial h(\breve{\mathbf{y}}); since h(\cdot) is C_{h}-Lipschitz continuous, every subgradient is bounded by C_{h}, leading to \|\breve{\mathbf{y}}-\bar{\mathbf{y}}\|\leq\mu C_{h}. ∎
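Parts (a)-(c) can be verified numerically in one dimension with h=|\cdot|, comparing the closed-form reduction against a brute-force minimization of h_{\mu}(y)+\tfrac{\beta}{2}(y-b)^{2} (a minimal sketch, assuming Python with NumPy; \mu, \beta, b are arbitrary test values satisfying \beta>\mu^{-1}):

```python
import numpy as np

mu, beta, b = 0.5, 4.0, 1.3                    # test values with beta > 1/mu

def prox_abs(y, lam):                          # prox of h = |.| (soft-thresholding)
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

def h_env(y):                                  # Moreau envelope of |.| (vectorized)
    v = prox_abs(y, mu)
    return np.abs(v) + (v - y) ** 2 / (2 * mu)

# Brute-force minimizer of h_mu(y) + (beta/2)(y - b)^2 on a fine grid.
grid = np.linspace(-3.0, 3.0, 600_001)
y_bar_brute = grid[np.argmin(h_env(grid) + 0.5 * beta * (grid - b) ** 2)]

y_breve = prox_abs(b, (1 + mu * beta) / beta)          # Part (a)
y_bar = (y_breve / mu + beta * b) / (1 / mu + beta)    # from Equation (22)
print(abs(y_bar - y_bar_brute) < 1e-4)                 # closed form matches brute force
print(abs(y_breve - y_bar) <= mu * 1.0 + 1e-12)        # Part (c) with C_h = 1
```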

B.4 Proof of Lemma 2.11

Proof.

We let 𝚫n×r\bm{\Delta}\in\mathbb{R}^{n\times r} and 𝐗\mathbf{X}\in\mathcal{M}. We define 𝐔𝚫𝖳𝐗r×r\mathbf{U}\triangleq\bm{\Delta}^{\mathsf{T}}\mathbf{X}\in\mathbb{R}^{r\times r}.

We derive the following results:

Proj𝐓𝐗(𝚫)𝖥2𝚫𝖥2\displaystyle\|\operatorname{Proj}_{\mathbf{T}_{\mathbf{X}}\mathcal{M}}(\bm{\Delta})\|_{\mathsf{F}}^{2}-\|\bm{\Delta}\|_{\mathsf{F}}^{2}
=\displaystyle\overset{\text{\char 172}}{=} 𝚫12𝐗(𝚫𝖳𝐗+𝐗𝖳𝚫)𝖥2𝚫𝖥2\displaystyle\|\bm{\Delta}-\tfrac{1}{2}\mathbf{X}(\bm{\Delta}^{\mathsf{T}}\mathbf{X}+\mathbf{X}^{\mathsf{T}}\bm{\Delta})\|_{\mathsf{F}}^{2}-\|\bm{\Delta}\|_{\mathsf{F}}^{2}
=\displaystyle\overset{}{=} 14𝐗(𝚫𝖳𝐗+𝐗𝖳𝚫)𝖥2𝚫,𝐗(𝚫𝖳𝐗+𝐗𝖳𝚫)\displaystyle\tfrac{1}{4}\|\mathbf{X}(\bm{\Delta}^{\mathsf{T}}\mathbf{X}+\mathbf{X}^{\mathsf{T}}\bm{\Delta})\|_{\mathsf{F}}^{2}-\langle\bm{\Delta},\mathbf{X}(\bm{\Delta}^{\mathsf{T}}\mathbf{X}+\mathbf{X}^{\mathsf{T}}\bm{\Delta})\rangle
=\displaystyle\overset{\text{\char 173}}{=} 14𝚫𝖳𝐗+𝐗𝖳𝚫𝖥2𝚫,𝐗(𝚫𝖳𝐗+𝐗𝖳𝚫)\displaystyle\tfrac{1}{4}\|\bm{\Delta}^{\mathsf{T}}\mathbf{X}+\mathbf{X}^{\mathsf{T}}\bm{\Delta}\|_{\mathsf{F}}^{2}-\langle\bm{\Delta},\mathbf{X}(\bm{\Delta}^{\mathsf{T}}\mathbf{X}+\mathbf{X}^{\mathsf{T}}\bm{\Delta})\rangle
=\displaystyle\overset{\text{\char 174}}{=} 14𝐔+𝐔𝖳𝖥2𝐔+𝐔𝖳,𝐔\displaystyle\tfrac{1}{4}\|\mathbf{U}+\mathbf{U}^{\mathsf{T}}\|_{\mathsf{F}}^{2}-\langle\mathbf{U}+\mathbf{U}^{\mathsf{T}},\mathbf{U}\rangle
=\displaystyle\overset{\text{\char 175}}{=} 14𝐔+𝐔𝖳𝖥2𝐔+𝐔𝖳,𝐔+𝐔𝖳12\displaystyle\tfrac{1}{4}\|\mathbf{U}+\mathbf{U}^{\mathsf{T}}\|_{\mathsf{F}}^{2}-\langle\mathbf{U}+\mathbf{U}^{\mathsf{T}},\mathbf{U}+\mathbf{U}^{\mathsf{T}}\rangle\cdot\tfrac{1}{2}
=\displaystyle\overset{}{=} 14𝐔+𝐔𝖳𝖥20,\displaystyle-\tfrac{1}{4}\|\mathbf{U}+\mathbf{U}^{\mathsf{T}}\|_{\mathsf{F}}^{2}\leq 0,

where step uses Proj𝐓𝐗(𝚫)=𝚫12𝐗(𝚫𝖳𝐗+𝐗𝖳𝚫)\operatorname{Proj}_{\mathbf{T}_{\mathbf{X}}\mathcal{M}}(\bm{\Delta})=\bm{\Delta}-\tfrac{1}{2}\mathbf{X}(\bm{\Delta}^{\mathsf{T}}\mathbf{X}+\mathbf{X}^{\mathsf{T}}\bm{\Delta}) for all 𝚫n×r\bm{\Delta}\in\mathbb{R}^{n\times r} Absil et al. (2008a); step uses the fact that 𝐗𝐏𝖥2=tr(𝐏𝐗𝖳𝐗𝐏𝖳)=𝐏𝖥2\|\mathbf{X}\mathbf{P}\|_{\mathsf{F}}^{2}=\operatorname{tr}(\mathbf{P}\mathbf{X}^{\mathsf{T}}\mathbf{X}\mathbf{P}^{\mathsf{T}})=\|\mathbf{P}\|_{\mathsf{F}}^{2} for all 𝐗\mathbf{X}\in\mathcal{M}; step uses the definition of 𝐔𝚫𝖳𝐗\mathbf{U}\triangleq\bm{\Delta}^{\mathsf{T}}\mathbf{X}; step uses the symmetric properties of the matrix (𝐔+𝐔𝖳)(\mathbf{U}+\mathbf{U}^{\mathsf{T}}).
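Remark (numerical sanity check, not part of the analysis). The nonexpansiveness just established can be probed by drawing 𝐗 ∈ ℳ from a QR factorization, assuming numpy:

import numpy as np

rng = np.random.default_rng(2)
n, r = 8, 3
X, _ = np.linalg.qr(rng.standard_normal((n, r)))  # columns orthonormal: X'X = I_r
for _ in range(100):
    D = rng.standard_normal((n, r))
    P = D - 0.5 * X @ (D.T @ X + X.T @ D)  # projection onto the tangent space at X
    assert np.linalg.norm(P) <= np.linalg.norm(D) + 1e-12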

B.5 Proof of Lemma 2.12

Proof.

We let ρ>0\rho>0, 𝐆n×r\mathbf{G}\in\mathbb{R}^{n\times r}, and 𝐗\mathbf{X}\in\mathcal{M}.

We define 𝐔𝐆𝖳𝐗\mathbf{U}\triangleq\mathbf{G}^{\mathsf{T}}\mathbf{X}, and 𝔾ρ𝐆ρ𝐗𝐆𝖳𝐗(1ρ)𝐗𝐗𝖳𝐆\mathbb{G}_{\rho}\triangleq\mathbf{G}-\rho\mathbf{X}\mathbf{G}^{\mathsf{T}}\mathbf{X}-(1-\rho)\mathbf{X}\mathbf{X}^{\mathsf{T}}\mathbf{G}.

First, we have the following equalities:

𝐆,𝔾ρ\displaystyle\langle\mathbf{G},\mathbb{G}_{\rho}\rangle =\displaystyle= 𝐆,𝐆ρ𝐗𝐆𝖳𝐗(1ρ)𝐗𝐗𝖳𝐆\displaystyle\langle\mathbf{G},\mathbf{G}-\rho\mathbf{X}\mathbf{G}^{\mathsf{T}}\mathbf{X}-(1-\rho)\mathbf{X}\mathbf{X}^{\mathsf{T}}\mathbf{G}\rangle (25)
=\displaystyle\overset{}{=} 𝐆,𝐆ρtr(𝐆𝖳𝐗𝐆𝖳𝐗)(1ρ)tr(𝐆𝖳𝐗𝐗𝖳𝐆)\displaystyle\langle\mathbf{G},\mathbf{G}\rangle-\rho\operatorname{tr}(\mathbf{G}^{\mathsf{T}}\mathbf{X}\mathbf{G}^{\mathsf{T}}\mathbf{X})-(1-\rho)\operatorname{tr}(\mathbf{G}^{\mathsf{T}}\mathbf{X}\mathbf{X}^{\mathsf{T}}\mathbf{G})
=\displaystyle\overset{\text{\char 172}}{=} 𝐆,𝐆ρtr(𝐔𝐔)(1ρ)tr(𝐔𝐔𝖳),\displaystyle\langle\mathbf{G},\mathbf{G}\rangle-\rho\operatorname{tr}(\mathbf{U}\mathbf{U})-(1-\rho)\operatorname{tr}(\mathbf{U}\mathbf{U}^{\mathsf{T}}),

where step uses 𝐔𝐆𝖳𝐗\mathbf{U}\triangleq\mathbf{G}^{\mathsf{T}}\mathbf{X}.

Second, we derive the following equalities:

𝔾ρ𝖥2\displaystyle\|\mathbb{G}_{\rho}\|_{\mathsf{F}}^{2} =\displaystyle= ρ𝐗𝐆𝖳𝐗+(1ρ)𝐗𝐗𝖳𝐆𝐆,ρ𝐗𝐆𝖳𝐗+(1ρ)𝐗𝐗𝖳𝐆𝐆\displaystyle\langle\rho\mathbf{X}\mathbf{G}^{\mathsf{T}}\mathbf{X}+(1-\rho)\mathbf{X}\mathbf{X}^{\mathsf{T}}\mathbf{G}-\mathbf{G},\rho\mathbf{X}\mathbf{G}^{\mathsf{T}}\mathbf{X}+(1-\rho)\mathbf{X}\mathbf{X}^{\mathsf{T}}\mathbf{G}-\mathbf{G}\rangle (26)
=\displaystyle\overset{\text{\char 172}}{=} ρ2tr(𝐔𝖳𝐔)+ρ(1ρ)tr(𝐔𝖳𝐔𝖳)ρtr(𝐔𝖳𝐔𝖳)\displaystyle\rho^{2}\operatorname{tr}(\mathbf{U}^{\mathsf{T}}\mathbf{U})+\rho(1-\rho)\operatorname{tr}(\mathbf{U}^{\mathsf{T}}\mathbf{U}^{\mathsf{T}})-\rho\operatorname{tr}(\mathbf{U}^{\mathsf{T}}\mathbf{U}^{\mathsf{T}})
+(1ρ)ρtr(𝐔𝐔)+(1ρ)2tr(𝐔𝐔𝖳)(1ρ)tr(𝐔𝐔𝖳)\displaystyle+(1-\rho)\rho\operatorname{tr}(\mathbf{U}\mathbf{U})+(1-\rho)^{2}\operatorname{tr}(\mathbf{U}\mathbf{U}^{\mathsf{T}})-(1-\rho)\operatorname{tr}(\mathbf{U}\mathbf{U}^{\mathsf{T}})
ρtr(𝐔𝐔)(1ρ)tr(𝐔𝐔𝖳)+𝐆,𝐆\displaystyle-\rho\operatorname{tr}(\mathbf{U}\mathbf{U})-(1-\rho)\operatorname{tr}(\mathbf{U}\mathbf{U}^{\mathsf{T}})+\langle\mathbf{G},\mathbf{G}\rangle
=\displaystyle\overset{\text{\char 173}}{=} (2ρ21)tr(𝐔𝖳𝐔)2ρ2tr(𝐔𝐔)+𝐆,𝐆,\displaystyle(2\rho^{2}-1)\cdot\operatorname{tr}(\mathbf{U}^{\mathsf{T}}\mathbf{U})-2\rho^{2}\cdot\operatorname{tr}(\mathbf{U}\mathbf{U})+\langle\mathbf{G},\mathbf{G}\rangle,

where step uses 𝐔𝐆𝖳𝐗\mathbf{U}\triangleq\mathbf{G}^{\mathsf{T}}\mathbf{X} and 𝐗𝖳𝐗=𝐈r\mathbf{X}^{\mathsf{T}}\mathbf{X}=\mathbf{I}_{r}; step uses tr(𝐔𝖳𝐔𝖳)=tr(𝐔𝐔)\operatorname{tr}(\mathbf{U}^{\mathsf{T}}\mathbf{U}^{\mathsf{T}})=\operatorname{tr}(\mathbf{U}\mathbf{U}).

Third, we have:

tr(𝐆𝖳𝐆)tr(𝐔𝖳𝐔)=𝐆𝐆𝖳,𝐈n𝐗𝐗𝖳0,\displaystyle\operatorname{tr}(\mathbf{G}^{\mathsf{T}}\mathbf{G})-\operatorname{tr}(\mathbf{U}^{\mathsf{T}}\mathbf{U})\overset{\text{\char 172}}{=}\langle\mathbf{G}\mathbf{G}^{\mathsf{T}},\mathbf{I}_{n}-\mathbf{X}\mathbf{X}^{\mathsf{T}}\rangle\overset{\text{\char 173}}{\geq}0, (27)

where step uses 𝐔𝐆𝖳𝐗\mathbf{U}\triangleq\mathbf{G}^{\mathsf{T}}\mathbf{X}; step uses the fact that the matrix (𝐈n𝐗𝐗𝖳)(\mathbf{I}_{n}-\mathbf{X}\mathbf{X}^{\mathsf{T}}) is positive semidefinite (its eigenvalues are 0 or 11), and so is 𝐆𝐆𝖳\mathbf{G}\mathbf{G}^{\mathsf{T}}, which makes their inner product nonnegative.

Part (a-i). We now prove that max(1,2ρ)𝐆,𝔾ρ𝔾ρ𝖥2\max(1,2\rho)\langle\mathbf{G},\mathbb{G}_{\rho}\rangle\geq\|\mathbb{G}_{\rho}\|_{\mathsf{F}}^{2}. We discuss two cases. Case (i): ρ(0,12]\rho\in(0,\frac{1}{2}]. We have:

𝔾ρ𝖥2𝐆,𝔾ρ=(2ρ2ρ)(tr(𝐔𝐔𝖳)tr(𝐔𝐔))0,\|\mathbb{G}_{\rho}\|_{\mathsf{F}}^{2}-\langle\mathbf{G},\mathbb{G}_{\rho}\rangle\overset{\text{\char 172}}{=}(2\rho^{2}-\rho)\cdot(\operatorname{tr}(\mathbf{U}\mathbf{U}^{\mathsf{T}})-\operatorname{tr}(\mathbf{U}\mathbf{U}))\overset{\text{\char 173}}{\leq}0,

where step uses Equalities (25) and (26); step uses 2ρ2ρ02\rho^{2}-\rho\leq 0 for all ρ(0,12]\rho\in(0,\frac{1}{2}], and tr(𝐔𝐔)tr(𝐔𝐔𝖳)\operatorname{tr}(\mathbf{U}\mathbf{U})\leq\operatorname{tr}(\mathbf{U}\mathbf{U}^{\mathsf{T}}) for all 𝐔r×r\mathbf{U}\in\mathbb{R}^{r\times r}.
Case (ii): ρ[12,)\rho\in[\frac{1}{2},\infty). We have:

𝔾ρ𝖥22ρ𝐆,𝔾ρ=(2ρ1)(tr(𝐔𝐔𝖳)𝐆,𝐆)0,\textstyle\|\mathbb{G}_{\rho}\|_{\mathsf{F}}^{2}-2\rho\langle\mathbf{G},\mathbb{G}_{\rho}\rangle\overset{\text{\char 172}}{=}(2\rho-1)(\operatorname{tr}(\mathbf{U}\mathbf{U}^{\mathsf{T}})-\langle\mathbf{G},\mathbf{G}\rangle)\overset{\text{\char 173}}{\leq}0,

where step uses Equalities (25) and (26); step uses 2ρ102\rho-1\geq 0 for all ρ[12,)\rho\in[\frac{1}{2},\infty), and Inequality (27). Therefore, we conclude that: max(1,2ρ)𝐆,𝔾ρ𝔾ρ𝖥2\max(1,2\rho)\langle\mathbf{G},\mathbb{G}_{\rho}\rangle\geq\|\mathbb{G}_{\rho}\|_{\mathsf{F}}^{2}.

Part (a-ii). We now prove that 𝔾ρ𝖥2min(1,ρ2)𝔾1𝖥2\|\mathbb{G}_{\rho}\|_{\mathsf{F}}^{2}\geq\min(1,\rho^{2})\|\mathbb{G}_{1}\|_{\mathsf{F}}^{2}. We consider two cases. Case (i): ρ(0,1]\rho\in(0,1]. We have:

ρ2𝔾1𝖥2𝔾ρ𝖥2=(1ρ2)(tr(𝐔𝖳𝐔)𝐆,𝐆)0,\textstyle\rho^{2}\|\mathbb{G}_{1}\|_{\mathsf{F}}^{2}-\|\mathbb{G}_{\rho}\|^{2}_{\mathsf{F}}\overset{\text{\char 172}}{=}\textstyle(1-\rho^{2})(\operatorname{tr}(\mathbf{U}^{\mathsf{T}}\mathbf{U})-\langle\mathbf{G},\mathbf{G}\rangle)\overset{\text{\char 173}}{\leq}0,

where step uses Equalities (25) and (26); step uses 1ρ201-\rho^{2}\geq 0, and Inequality (27).
Case (ii): ρ(1,)\rho\in(1,\infty). We have:

𝔾1𝖥2𝔾ρ𝖥2=(22ρ2)(tr(𝐔𝖳𝐔)tr(𝐔𝐔))0,\|\mathbb{G}_{1}\|_{\mathsf{F}}^{2}-\|\mathbb{G}_{\rho}\|^{2}_{\mathsf{F}}\overset{\text{\char 172}}{=}(2-2\rho^{2})(\operatorname{tr}(\mathbf{U}^{\mathsf{T}}\mathbf{U})-\operatorname{tr}(\mathbf{U}\mathbf{U}))\overset{}{\leq}0,

where step uses Equality (26); step uses 22ρ2<02-2\rho^{2}<0 for all ρ(1,)\rho\in(1,\infty), and the fact that tr(𝐔𝐔)tr(𝐔𝖳𝐔)0\operatorname{tr}(\mathbf{U}\mathbf{U})-\operatorname{tr}(\mathbf{U}^{\mathsf{T}}\mathbf{U})\leq 0 for all 𝐔r×r\mathbf{U}\in\mathbb{R}^{r\times r}. Therefore, we conclude that: min(1,ρ2)𝔾1𝖥2𝔾ρ𝖥2\min(1,\rho^{2})\|\mathbb{G}_{1}\|_{\mathsf{F}}^{2}\leq\|\mathbb{G}_{\rho}\|_{\mathsf{F}}^{2}.

Part (b-i). We now prove that 𝔾ρ𝖥min(1,2ρ)𝔾1/2𝖥\|\mathbb{G}_{\rho}\|_{\mathsf{F}}\geq\min(1,2\rho)\|\mathbb{G}_{1/2}\|_{\mathsf{F}}. We consider two cases. Case (i): ρ(0,12]\rho\in(0,\frac{1}{2}]. We have:

(2ρ)2𝔾1/2𝖥2𝔾ρ𝖥2=(4ρ21)(tr(𝐆𝖳𝐆)tr(𝐔𝖳𝐔))0,(2\rho)^{2}\|\mathbb{G}_{1/2}\|_{\mathsf{F}}^{2}-\|\mathbb{G}_{\rho}\|_{\mathsf{F}}^{2}\overset{\text{\char 172}}{=}(4\rho^{2}-1)\cdot(\operatorname{tr}(\mathbf{G}^{\mathsf{T}}\mathbf{G})-\operatorname{tr}(\mathbf{U}^{\mathsf{T}}\mathbf{U}))\overset{\text{\char 173}}{\leq}0,

where step uses Equality (26); step uses 4ρ2104\rho^{2}-1\leq 0 for all ρ(0,12]\rho\in(0,\frac{1}{2}], and Inequality (27).
Case (ii): ρ(12,)\rho\in(\frac{1}{2},\infty). We have:

𝔾1/2𝖥2𝔾ρ𝖥2=(2ρ212)(tr(𝐔𝐔)tr(𝐔𝖳𝐔))0,\|\mathbb{G}_{1/2}\|_{\mathsf{F}}^{2}-\|\mathbb{G}_{\rho}\|_{\mathsf{F}}^{2}\overset{\text{\char 172}}{=}(2\rho^{2}-\tfrac{1}{2})\cdot(\operatorname{tr}(\mathbf{U}\mathbf{U})-\operatorname{tr}(\mathbf{U}^{\mathsf{T}}\mathbf{U}))\overset{\text{\char 173}}{\leq}0,

where step uses Equality (26); step uses 2ρ2122\rho^{2}-\tfrac{1}{2}\geq 0 for all ρ(12,)\rho\in(\frac{1}{2},\infty), and the fact that tr(𝐔𝐔)tr(𝐔𝐔𝖳)0\operatorname{tr}(\mathbf{U}\mathbf{U})-\operatorname{tr}(\mathbf{U}\mathbf{U}^{\mathsf{T}})\leq 0 for all 𝐔r×r\mathbf{U}\in\mathbb{R}^{r\times r}. Therefore, we conclude that 𝔾ρ𝖥min(1,2ρ)𝔾1/2𝖥\|\mathbb{G}_{\rho}\|_{\mathsf{F}}\geq\min(1,2\rho)\|\mathbb{G}_{1/2}\|_{\mathsf{F}}.

Part (b-ii). We now prove that 𝔾ρ𝖥max(1,2ρ)𝔾1/2𝖥\|\mathbb{G}_{\rho}\|_{\mathsf{F}}\leq\max(1,2\rho)\|\mathbb{G}_{1/2}\|_{\mathsf{F}}. We consider two cases. Case (i): ρ(0,12]\rho\in(0,\frac{1}{2}]. We have:

𝔾1/2𝖥2𝔾ρ𝖥2=(2ρ212)(tr(𝐔𝐔)tr(𝐔𝖳𝐔))0,\|\mathbb{G}_{1/2}\|_{\mathsf{F}}^{2}-\|\mathbb{G}_{\rho}\|_{\mathsf{F}}^{2}\overset{\text{\char 172}}{=}(2\rho^{2}-\tfrac{1}{2})\cdot(\operatorname{tr}(\mathbf{U}\mathbf{U})-\operatorname{tr}(\mathbf{U}^{\mathsf{T}}\mathbf{U}))\overset{\text{\char 173}}{\geq}0,

where step uses Equality (26); step uses 2ρ2122\rho^{2}-\tfrac{1}{2}\leq 0 for all ρ(0,12]\rho\in(0,\frac{1}{2}], and the fact that tr(𝐔𝐔)tr(𝐔𝐔𝖳)0\operatorname{tr}(\mathbf{U}\mathbf{U})-\operatorname{tr}(\mathbf{U}\mathbf{U}^{\mathsf{T}})\leq 0 for all 𝐔r×r\mathbf{U}\in\mathbb{R}^{r\times r}.
Case (ii): ρ(12,)\rho\in(\frac{1}{2},\infty). We have:

(2ρ)2𝔾1/2𝖥2𝔾ρ𝖥2=(4ρ21)(tr(𝐆𝖳𝐆)tr(𝐔𝖳𝐔))0,(2\rho)^{2}\|\mathbb{G}_{1/2}\|_{\mathsf{F}}^{2}-\|\mathbb{G}_{\rho}\|_{\mathsf{F}}^{2}\overset{\text{\char 172}}{=}(4\rho^{2}-1)\cdot(\operatorname{tr}(\mathbf{G}^{\mathsf{T}}\mathbf{G})-\operatorname{tr}(\mathbf{U}^{\mathsf{T}}\mathbf{U}))\overset{\text{\char 173}}{\geq}0,

where step uses Equality (26); step uses 4ρ2104\rho^{2}-1\geq 0 for all ρ(12,)\rho\in(\frac{1}{2},\infty), and Inequality (27). Therefore, we conclude that: 𝔾ρ𝖥max(1,2ρ)𝔾1/2𝖥\|\mathbb{G}_{\rho}\|_{\mathsf{F}}\leq\max(1,2\rho)\|\mathbb{G}_{1/2}\|_{\mathsf{F}}.
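Remark (numerical sanity check, not part of the analysis). All four bounds of this lemma can be probed over random instances of 𝐆, 𝐗 ∈ ℳ, and ρ > 0, assuming numpy:

import numpy as np

def G_rho(G, X, rho):
    # the direction G - rho*X*G'X - (1-rho)*X*X'G from the proof
    return G - rho * X @ (G.T @ X) - (1.0 - rho) * X @ (X.T @ G)

rng = np.random.default_rng(3)
n, r, tol = 8, 3, 1e-9
for _ in range(200):
    X, _ = np.linalg.qr(rng.standard_normal((n, r)))
    G = rng.standard_normal((n, r))
    rho = rng.uniform(0.01, 3.0)
    Gr, G1, Gh = G_rho(G, X, rho), G_rho(G, X, 1.0), G_rho(G, X, 0.5)
    assert max(1.0, 2 * rho) * np.sum(G * Gr) >= np.linalg.norm(Gr) ** 2 - tol            # (a-i)
    assert np.linalg.norm(Gr) ** 2 >= min(1.0, rho ** 2) * np.linalg.norm(G1) ** 2 - tol  # (a-ii)
    assert np.linalg.norm(Gr) >= min(1.0, 2 * rho) * np.linalg.norm(Gh) - tol             # (b-i)
    assert np.linalg.norm(Gr) <= max(1.0, 2 * rho) * np.linalg.norm(Gh) + tol             # (b-ii)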

B.6 Proof of Lemma 2.13

Proof.

Recall that the following first-order optimality conditions are equivalent for all 𝐗n×r\mathbf{X}\in\mathbb{R}^{n\times r}:

(𝟎(𝐗)+f(𝐗))(𝟎Proj𝐓𝐗(f(𝐗))).\displaystyle\left(\mathbf{0}\in\partial\mathcal{I}_{\mathcal{M}}(\mathbf{X})+\nabla f(\mathbf{X})\right)\Leftrightarrow\left(\mathbf{0}\in\operatorname{Proj}_{\mathbf{T}_{\mathbf{X}}\mathcal{M}}(\nabla f(\mathbf{X}))\right). (28)

Therefore, we derive the following results:

dist(𝟎,(𝐗)+f(𝐗))\displaystyle\operatorname{dist}(\mathbf{0},\partial\mathcal{I}_{\mathcal{M}}(\mathbf{X})+\nabla f(\mathbf{X})) =\displaystyle= inf𝐑f(𝐗)+(𝐗)𝐑𝖥\displaystyle\inf_{\mathbf{R}\in\nabla f(\mathbf{X})+\partial\mathcal{I}_{\mathcal{M}}(\mathbf{X})}\|\mathbf{R}\|_{\mathsf{F}}
=\displaystyle\overset{\text{\char 172}}{=} inf𝐑Proj𝐓𝐗(f(𝐗))𝐑𝖥\displaystyle\inf_{\mathbf{R}\in\operatorname{Proj}_{\mathbf{T}_{\mathbf{X}}\mathcal{M}}(\nabla f(\mathbf{X}))}\|\mathbf{R}\|_{\mathsf{F}}
=\displaystyle\overset{}{=} Proj𝐓𝐗(f(𝐗))𝖥\displaystyle\|\operatorname{Proj}_{\mathbf{T}_{\mathbf{X}}\mathcal{M}}(\nabla f(\mathbf{X}))\|_{\mathsf{F}}
=\displaystyle\overset{\text{\char 173}}{=} f(𝐗)12𝐗(𝐗𝖳f(𝐗)+f(𝐗)𝖳𝐗)𝖥\displaystyle\|\nabla f(\mathbf{X})-\tfrac{1}{2}\mathbf{X}(\mathbf{X}^{\mathsf{T}}\nabla f(\mathbf{X})+\nabla f(\mathbf{X})^{\mathsf{T}}\mathbf{X})\|_{\mathsf{F}}
=\displaystyle\overset{}{=} (𝐈12𝐗𝐗𝖳)(f(𝐗)𝐗f(𝐗)𝖳𝐗)𝖥\displaystyle\|(\mathbf{I}-\tfrac{1}{2}\mathbf{X}\mathbf{X}^{\mathsf{T}})(\nabla f(\mathbf{X})-\mathbf{X}\nabla f(\mathbf{X})^{\mathsf{T}}\mathbf{X})\|_{\mathsf{F}}
\displaystyle\overset{\text{\char 174}}{\leq} f(𝐗)𝐗f(𝐗)𝖳𝐗𝖥,\displaystyle\|\nabla f(\mathbf{X})-\mathbf{X}\nabla f(\mathbf{X})^{\mathsf{T}}\mathbf{X}\|_{\mathsf{F}},

where step uses Formulation (28); step uses Proj𝐓𝐗(𝚫)=𝚫12𝐗(𝚫𝖳𝐗+𝐗𝖳𝚫)\operatorname{Proj}_{\mathbf{T}_{\mathbf{X}}\mathcal{M}}(\bm{\Delta})=\bm{\Delta}-\tfrac{1}{2}\mathbf{X}(\bm{\Delta}^{\mathsf{T}}\mathbf{X}+\mathbf{X}^{\mathsf{T}}\bm{\Delta}) for all 𝚫n×r\bm{\Delta}\in\mathbb{R}^{n\times r} Absil et al. (2008a); step uses the norm inequality 𝐀𝐁𝖥𝐀𝐁𝖥\|\mathbf{A}\mathbf{B}\|_{\mathsf{F}}\leq\|\mathbf{A}\|\|\mathbf{B}\|_{\mathsf{F}}, and the fact that the matrix 𝐈12𝐗𝐗𝖳\mathbf{I}-\frac{1}{2}\mathbf{X}\mathbf{X}^{\mathsf{T}} only contains eigenvalues that are 12\frac{1}{2} or 11.
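Remark (numerical sanity check, not part of the analysis). The final bound can be probed with a random matrix standing in for ∇f(𝐗), assuming numpy:

import numpy as np

rng = np.random.default_rng(4)
n, r = 10, 4
for _ in range(100):
    X, _ = np.linalg.qr(rng.standard_normal((n, r)))
    G = rng.standard_normal((n, r))  # stand-in for the gradient of f at X
    proj = G - 0.5 * X @ (G.T @ X + X.T @ G)
    assert np.linalg.norm(proj) <= np.linalg.norm(G - X @ G.T @ X) + 1e-12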

Appendix C Proofs for Section 4

C.1 Proof of Lemma 4.1

Proof.

We define L(𝐗,𝐲;𝐳;β,μ)f(𝐗)g(𝐗)+hμ(𝐲)+𝐳,𝒜(𝐗)𝐲+β2𝒜(𝐗)𝐲22L(\mathbf{X},\mathbf{y};\mathbf{z};\beta,\mu)\triangleq f(\mathbf{X})-g(\mathbf{X})+h_{\mu}(\mathbf{y})+\langle\mathbf{z},\mathcal{A}(\mathbf{X})-\mathbf{y}\rangle+\frac{\beta}{2}\|\mathcal{A}(\mathbf{X})-\mathbf{y}\|_{2}^{2}.

We define σ˙(σ1)/(2σ)\dot{\sigma}\triangleq(\sigma-1)/(2-\sigma), and σ¨(σ/(2σ))2\ddot{\sigma}\triangleq(\sigma/(2-\sigma))^{2}.

Part (a-i). Using the first-order optimality condition of 𝐲t+1argmin𝐲L(𝐗t+1,𝐲,𝐳t;βt,μt)\mathbf{y}^{t+1}\in\arg\min_{\mathbf{y}}L(\mathbf{X}^{t+1},\mathbf{y},\mathbf{z}^{t};\beta^{t},\mu^{t}) in Algorithm 1, for all t0t\geq 0, we have:

𝟎\displaystyle\mathbf{0} =\displaystyle= hμt(𝐲t+1)+βt(𝐲t+1𝐲t)+𝐲𝒮(𝐗t+1,𝐲t;𝐳t;βt)\displaystyle\nabla h_{\mu^{t}}(\mathbf{y}^{t+1})+\beta^{t}(\mathbf{y}^{t+1}-\mathbf{y}^{t})+\nabla_{\mathbf{y}}\mathcal{S}(\mathbf{X}^{t+1},\mathbf{y}^{t};\mathbf{z}^{t};\beta^{t}) (29)
=\displaystyle\overset{\text{\char 172}}{=} hμt(𝐲t+1)+βt(𝐲t+1𝐲t)𝐳t+βt(𝐲t𝒜(𝐗t+1))\displaystyle\nabla h_{\mu^{t}}(\mathbf{y}^{t+1})+\beta^{t}(\mathbf{y}^{t+1}-\mathbf{y}^{t})-\mathbf{z}^{t}+\beta^{t}(\mathbf{y}^{t}-\mathcal{A}(\mathbf{X}^{t+1}))
=\displaystyle= hμt(𝐲t+1)𝐳t+βt(𝐲t+1𝒜(𝐗t+1))\displaystyle\nabla h_{\mu^{t}}(\mathbf{y}^{t+1})-\mathbf{z}^{t}+\beta^{t}(\mathbf{y}^{t+1}-\mathcal{A}(\mathbf{X}^{t+1}))
=\displaystyle\overset{\text{\char 173}}{=} hμt(𝐲t+1)𝐳t+1σ(𝐳t𝐳t+1),\displaystyle\nabla h_{\mu^{t}}(\mathbf{y}^{t+1})-\mathbf{z}^{t}+\tfrac{1}{\sigma}(\mathbf{z}^{t}-\mathbf{z}^{t+1}),

where step uses 𝐲𝒮(𝐗t+1,𝐲;𝐳t;βt)=𝐳t+βt(𝐲𝒜(𝐗t+1))\nabla_{\mathbf{y}}\mathcal{S}(\mathbf{X}^{t+1},\mathbf{y};\mathbf{z}^{t};\beta^{t})=-\mathbf{z}^{t}+\beta^{t}(\mathbf{y}-\mathcal{A}(\mathbf{X}^{t+1})); step uses 𝐳t+1=𝐳t+σβt(𝒜(𝐗t+1)𝐲t+1)\mathbf{z}^{t+1}=\mathbf{z}^{t}+\sigma\beta^{t}(\mathcal{A}(\mathbf{X}^{t+1})-\mathbf{y}^{t+1}).

Part (a-ii). We obtain:

h(𝐲˘t+1)𝐳t\displaystyle\partial h(\breve{\mathbf{y}}^{t+1})-\mathbf{z}^{t} \displaystyle\overset{\text{\char 172}}{\ni} βt(𝐛𝐲t+1)𝐳t\displaystyle\beta^{t}(\mathbf{b}-\mathbf{y}^{t+1})-\mathbf{z}^{t}
=\displaystyle\overset{\text{\char 173}}{=} βt𝐲t𝐲𝒮t(𝐗t+1,𝐲t;𝐳t;βt)βt𝐲t+1𝐳t\displaystyle\beta^{t}\mathbf{y}^{t}-\nabla_{\mathbf{y}}\mathcal{S}^{t}(\mathbf{X}^{t+1},\mathbf{y}^{t};\mathbf{z}^{t};\beta^{t})-\beta^{t}\mathbf{y}^{t+1}-\mathbf{z}^{t}
=\displaystyle\overset{\text{\char 174}}{=} βt𝐲tβt(𝐲t𝒜(𝐗t+1))βt𝐲t+1\displaystyle\beta^{t}\mathbf{y}^{t}-\beta^{t}(\mathbf{y}^{t}-\mathcal{A}(\mathbf{X}^{t+1}))-\beta^{t}\mathbf{y}^{t+1}
=\displaystyle\overset{}{=} βt(𝒜(𝐗t+1)𝐲t+1)\displaystyle\beta^{t}(\mathcal{A}(\mathbf{X}^{t+1})-\mathbf{y}^{t+1})
=\displaystyle\overset{\text{\char 175}}{=} 1σ(𝐳t+1𝐳t),\displaystyle\tfrac{1}{\sigma}(\mathbf{z}^{t+1}-\mathbf{z}^{t}),

where step uses the result in Lemma 2.5 that βt(𝐛𝐲t+1)h(𝐲˘t+1)\beta^{t}(\mathbf{b}-\mathbf{y}^{t+1})\in\partial h(\breve{\mathbf{y}}^{t+1}); step uses 𝐛𝐲t𝐲𝒮t(𝐗t+1,𝐲t;𝐳t;βt)/βt\mathbf{b}\triangleq\mathbf{y}^{t}-\nabla_{\mathbf{y}}\mathcal{S}^{t}(\mathbf{X}^{t+1},\mathbf{y}^{t};\mathbf{z}^{t};\beta^{t})/\beta^{t}, as shown in Algorithm 1; step uses 𝐲𝒮t(𝐗t+1,𝐲;𝐳t;βt)=𝐳t+βt(𝐲𝒜(𝐗t+1))\nabla_{\mathbf{y}}\mathcal{S}^{t}(\mathbf{X}^{t+1},\mathbf{y};\mathbf{z}^{t};\beta^{t})=-\mathbf{z}^{t}+\beta^{t}(\mathbf{y}-\mathcal{A}(\mathbf{X}^{t+1})); step uses 𝐳t+1𝐳t=σβt(𝒜(𝐗t+1)𝐲t+1)\mathbf{z}^{t+1}-\mathbf{z}^{t}=\sigma\beta^{t}(\mathcal{A}(\mathbf{X}^{t+1})-\mathbf{y}^{t+1}).

Part (b). First, we derive:

hμt1(𝐲t)hμt(𝐲t+1)\displaystyle\|\nabla h_{\mu^{t-1}}(\mathbf{y}^{t})-\nabla h_{\mu^{t}}(\mathbf{y}^{t+1})\| (30)
\displaystyle\overset{\text{\char 172}}{\leq} hμt1(𝐲t)hμt(𝐲t)+hμt(𝐲t)hμt(𝐲t+1)\displaystyle\|\nabla h_{\mu^{t-1}}(\mathbf{y}^{t})-\nabla h_{\mu^{t}}(\mathbf{y}^{t})\|+\|\nabla h_{\mu^{t}}(\mathbf{y}^{t})-\nabla h_{\mu^{t}}(\mathbf{y}^{t+1})\|
\displaystyle\overset{\text{\char 173}}{\leq} hμt(𝐲t)hμt1(𝐲t)+1μt𝐲t+1𝐲t\displaystyle\|\nabla h_{\mu^{t}}(\mathbf{y}^{t})-\nabla h_{\mu^{t-1}}(\mathbf{y}^{t})\|+\tfrac{1}{\mu^{t}}\|\mathbf{y}^{t+1}-\mathbf{y}^{t}\|
\displaystyle\overset{\text{\char 174}}{\leq} Ch(μt1μt1)+βtχ𝐲t+1𝐲t,\displaystyle C_{h}(\tfrac{\mu^{t-1}}{\mu^{t}}-1)+\tfrac{\beta^{t}}{\chi}\|\mathbf{y}^{t+1}-\mathbf{y}^{t}\|,

where step uses 𝐚𝐛𝐚𝐜+𝐜𝐛\|\mathbf{a}-\mathbf{b}\|\leq\|\mathbf{a}-\mathbf{c}\|+\|\mathbf{c}-\mathbf{b}\|; step uses the fact that the function hμt(𝐲)h_{\mu^{t}}(\mathbf{y}) is 1μt\tfrac{1}{\mu^{t}}-smooth w.r.t. 𝐲\mathbf{y}, so that hμt(𝐲t+1)hμt(𝐲t)1μt𝐲t+1𝐲t\|\nabla h_{\mu^{t}}(\mathbf{y}^{t+1})-\nabla h_{\mu^{t}}(\mathbf{y}^{t})\|\leq\tfrac{1}{\mu^{t}}\|\mathbf{y}^{t+1}-\mathbf{y}^{t}\|; step uses the fact that hμt(𝐲t)hμt1(𝐲t)(μt1/μt1)Ch\|\nabla h_{\mu^{t}}(\mathbf{y}^{t})-\nabla h_{\mu^{t-1}}(\mathbf{y}^{t})\|\leq({\mu^{t-1}}/{\mu^{t}}-1)C_{h} which holds due to Lemma 2.4, and the equality μtβt=χ\mu^{t}\beta^{t}=\chi.

Second, we have from Equality (29):

t0,𝟎=σhμt(𝐲t+1)σ𝐳t+(𝐳t𝐳t+1),\displaystyle\forall t\geq 0,~{}\mathbf{0}=\sigma\nabla h_{\mu^{t}}(\mathbf{y}^{t+1})-\sigma\mathbf{z}^{t}+(\mathbf{z}^{t}-\mathbf{z}^{t+1}),
t1,𝟎=σhμt1(𝐲t)σ𝐳t1+(𝐳t1𝐳t).\displaystyle\forall t\geq 1,~{}\mathbf{0}=\sigma\nabla h_{\mu^{t-1}}(\mathbf{y}^{t})-\sigma\mathbf{z}^{t-1}+(\mathbf{z}^{t-1}-\mathbf{z}^{t}).

Combining these two equalities yields:

t1,𝐳t+1𝐳t=(σ1)(𝐳t1𝐳t)+σ(hμt(𝐲t+1)hμt1(𝐲t)).\displaystyle\forall t\geq 1,\,\mathbf{z}^{t+1}-\mathbf{z}^{t}=(\sigma-1)(\mathbf{z}^{t-1}-\mathbf{z}^{t})+\sigma(\nabla h_{\mu^{t}}(\mathbf{y}^{t+1})-\nabla h_{\mu^{t-1}}(\mathbf{y}^{t})).

Applying Lemma A.4 with 𝐚+=𝐳t+1𝐳t\mathbf{a}^{+}=\mathbf{z}^{t+1}-\mathbf{z}^{t}, 𝐚=𝐳t1𝐳t\mathbf{a}=\mathbf{z}^{t-1}-\mathbf{z}^{t}, 𝐛=σ{hμt(𝐲t+1)hμt1(𝐲t)}\mathbf{b}=\sigma\{\nabla h_{\mu^{t}}(\mathbf{y}^{t+1})-\nabla h_{\mu^{t-1}}(\mathbf{y}^{t})\}, and ϱ=σ1[0,1)\varrho=\sigma-1\in[0,1), we have:

𝐳t+1𝐳t22\displaystyle\|\mathbf{z}^{t+1}-\mathbf{z}^{t}\|_{2}^{2}
\displaystyle\leq ϱ1ϱ(𝐳t1𝐳t22𝐳t+1𝐳t22)+1(1ϱ)2σ(hμt(𝐲t+1)hμt1(𝐲t))22\displaystyle\tfrac{\varrho}{1-\varrho}(\|\mathbf{z}^{t-1}-\mathbf{z}^{t}\|_{2}^{2}-\|\mathbf{z}^{t+1}-\mathbf{z}^{t}\|_{2}^{2})+\tfrac{1}{(1-\varrho)^{2}}\|\sigma(\nabla h_{\mu^{t}}(\mathbf{y}^{t+1})-\nabla h_{\mu^{t-1}}(\mathbf{y}^{t}))\|_{2}^{2}
=\displaystyle\overset{\text{\char 172}}{=} σ˙(𝐳t𝐳t122𝐳t+1𝐳t22)+σ¨hμt(𝐲t+1)hμt1(𝐲t)22\displaystyle\dot{\sigma}(\|\mathbf{z}^{t}-\mathbf{z}^{t-1}\|_{2}^{2}-\|\mathbf{z}^{t+1}-\mathbf{z}^{t}\|_{2}^{2})+\ddot{\sigma}\|\nabla h_{\mu^{t}}(\mathbf{y}^{t+1})-\nabla h_{\mu^{t-1}}(\mathbf{y}^{t})\|_{2}^{2}
\displaystyle\overset{\text{\char 173}}{\leq} σ˙(𝐳t𝐳t122𝐳t+1𝐳t22)+2σ¨{(βt)2χ2𝐲t+1𝐲t22+Ch2(μt1/μt1)2}\displaystyle\dot{\sigma}(\|\mathbf{z}^{t}-\mathbf{z}^{t-1}\|_{2}^{2}-\|\mathbf{z}^{t+1}-\mathbf{z}^{t}\|_{2}^{2})+2\ddot{\sigma}\{\tfrac{(\beta^{t})^{2}}{\chi^{2}}\|\mathbf{y}^{t+1}-\mathbf{y}^{t}\|_{2}^{2}+C^{2}_{h}(\mu^{t-1}/\mu^{t}-1)^{2}\}
\displaystyle\overset{\text{\char 174}}{\leq} σ˙(𝐳t𝐳t122𝐳t+1𝐳t22)+2σ¨(βt)2χ2𝐲t+1𝐲t22+2σ¨Ch2(2t2t+1),\displaystyle\dot{\sigma}(\|\mathbf{z}^{t}-\mathbf{z}^{t-1}\|_{2}^{2}-\|\mathbf{z}^{t+1}-\mathbf{z}^{t}\|_{2}^{2})+2\ddot{\sigma}\tfrac{(\beta^{t})^{2}}{\chi^{2}}\|\mathbf{y}^{t+1}-\mathbf{y}^{t}\|_{2}^{2}+2\ddot{\sigma}C^{2}_{h}(\tfrac{2}{t}-\tfrac{2}{t+1}),

where step uses the definitions of {σ˙,σ¨}\{\dot{\sigma},\ddot{\sigma}\}; step uses Inequality (30), and the inequality (a+b)22a2+2b2(a+b)^{2}\leq 2a^{2}+2b^{2} for all a,ba,b\in\mathbb{R}; step uses Lemma A.3 that (βtβt11)22t2t+1(\tfrac{\beta^{t}}{\beta^{t-1}}-1)^{2}\leq\frac{2}{t}-\frac{2}{t+1} for all t1t\geq 1, together with μt1/μt=βt/βt1{\mu^{t-1}}/{\mu^{t}}={\beta^{t}}/{\beta^{t-1}}, which follows from μtβt=χ\mu^{t}\beta^{t}=\chi.

C.2 Proof of Lemma 4.3

Proof.

Part (a). We have:

βt+1βt(1+ξ)=β0ξ(t+1)pβ0ξtpβtξβ0ξβtξ0,\displaystyle\beta^{t+1}-\beta^{t}\cdot(1+\xi)\overset{\text{\char 172}}{=}\beta^{0}\xi(t+1)^{p}-\beta^{0}\xi t^{p}-\beta^{t}\xi\overset{\text{\char 173}}{\leq}\beta^{0}\xi-\beta^{t}\xi\overset{\text{\char 174}}{\leq}0,

where step uses βt=β0(1+ξtp)\beta^{t}=\beta^{0}(1+\xi t^{p}); step uses (t+1)ptp1(t+1)^{p}-t^{p}\leq 1 for all p(0,1)p\in(0,1); step uses β0βt\beta^{0}\leq\beta^{t} and ξ>0\xi>0.
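Remark. Part (a) states that the schedule βt = β0(1+ξt^p) grows by at most a factor (1+ξ) per iteration; a three-line check in plain Python, with illustrative parameter values:

beta0, xi, p = 1.0, 0.5, 1.0 / 3.0  # any beta0 > 0, xi > 0, p in (0, 1)
beta = lambda t: beta0 * (1.0 + xi * t ** p)
assert all(beta(t + 1) <= (1.0 + xi) * beta(t) + 1e-12 for t in range(10000))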

Part (b). It holds with ¯=A¯2\underline{\rm{\ell}}=\overline{\rm{A}}^{2} and ¯=A¯2+Lf/β0\overline{\rm{\ell}}=\overline{\rm{A}}^{2}+L_{f}/\beta^{0}.

C.3 Proof of Lemma 4.4

Proof.

We define X¯r\overline{\rm{X}}\triangleq\sqrt{r}, z¯𝐳0+σCh2σ\overline{\rm{z}}\triangleq\|\mathbf{z}^{0}\|+\tfrac{\sigma C_{h}}{2-\sigma}, y¯A¯r+2z¯β0\overline{\rm{y}}\triangleq\overline{\rm{A}}\sqrt{r}+\tfrac{2\overline{\rm{z}}}{\beta^{0}}, where σ[1,2)\sigma\in[1,2).

We let Θ¯F(𝐗¯)μ0Ch2Ch(A¯r+y¯)z¯22β0\underline{\rm{\Theta}}\triangleq F(\bar{\mathbf{X}})-\mu^{0}C_{h}^{2}-C_{h}(\overline{\rm{A}}\sqrt{r}+\overline{\rm{y}})-\tfrac{\overline{\rm{z}}^{2}}{2\beta^{0}}, where 𝐗¯\bar{\mathbf{X}} is the optimal solution of Problem (1).

Part (a). Given 𝐗t\mathbf{X}^{t}\in\mathcal{M}, we have: 𝐗t𝖥=rX¯\|\mathbf{X}^{t}\|_{\mathsf{F}}=\sqrt{r}\triangleq\overline{\rm{X}}.

Part (b). We show that 𝐳tz¯\|\mathbf{z}^{t}\|\leq\overline{\rm{z}}. For all t0t\geq 0, we have:

𝐳t+1\displaystyle\|\mathbf{z}^{t+1}\| \displaystyle\overset{\text{\char 172}}{\leq} (σ1)𝐳t+(σ1)𝐳t+𝐳t+1\displaystyle\|(\sigma-1)\mathbf{z}^{t}\|+\|(\sigma-1)\mathbf{z}^{t}+\mathbf{z}^{t+1}\|
=\displaystyle\overset{\text{\char 173}}{=} (σ1)𝐳t+σh(𝐲˘t+1)\displaystyle(\sigma-1)\|\mathbf{z}^{t}\|+\|\sigma\partial h(\breve{\mathbf{y}}^{t+1})\|
≤\displaystyle\overset{\text{\char 174}}{\leq} (σ1)𝐳t+σCh,\displaystyle(\sigma-1)\|\mathbf{z}^{t}\|+\sigma C_{h},

where step uses the triangle inequality; step uses 𝐳t+1+(σ1)𝐳tσh(𝐲˘t+1)\mathbf{z}^{t+1}+(\sigma-1)\mathbf{z}^{t}\in\sigma\partial h(\breve{\mathbf{y}}^{t+1}), as shown in Lemma 4.1(a); step uses the ChC_{h}-Lipschitz continuity of h(𝐲)h(\mathbf{y}). Applying Lemma A.5 with 𝐚t=𝐳t+1\mathbf{a}_{t}=\|\mathbf{z}^{t+1}\|, c=σChc=\sigma C_{h}, and ϱ=σ1[0,1)\varrho=\sigma-1\in[0,1), we have:

t0,𝐳t+1𝐳0+c1ϱ=𝐳0+σCh2σz¯.\displaystyle\forall t\geq 0,\,\|\mathbf{z}^{t+1}\|\leq\|\mathbf{z}^{0}\|+\tfrac{c}{1-\varrho}=\|\mathbf{z}^{0}\|+\tfrac{\sigma C_{h}}{2-\sigma}\triangleq\overline{\rm{z}}.
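Remark (illustration, not part of the analysis). Lemma A.5 can be visualized by iterating the recursion a_{t+1} ≤ (σ−1)a_t + σC_h at equality, its worst case: the iterates approach the fixed point σC_h/(2−σ) and never exceed the stated bound (plain Python, illustrative parameter values):

sigma, Ch, z0 = 1.6, 2.0, 3.0               # any sigma in [1, 2)
z_bar = z0 + sigma * Ch / (2.0 - sigma)     # the bound of Part (b)
a = z0
for _ in range(10000):
    a = (sigma - 1.0) * a + sigma * Ch      # worst case of the recursion
    assert a <= z_bar + 1e-9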

Part (c). We show that 𝐲ty¯\|\mathbf{y}^{t}\|\leq\overline{\rm{y}}. For all t0t\geq 0, we have:

𝐲t+1\displaystyle\|\mathbf{y}^{t+1}\| =\displaystyle= 𝒜(𝐗t+1)𝐳t+1𝐳tσβt\displaystyle\|\mathcal{A}(\mathbf{X}^{t+1})-\tfrac{\mathbf{z}^{t+1}-\mathbf{z}^{t}}{\sigma\beta^{t}}\|
\displaystyle\overset{\text{\char 172}}{\leq} 𝒜(𝐗t+1)+1β0𝐳t+1𝐳t\displaystyle\|\mathcal{A}(\mathbf{X}^{t+1})\|+\tfrac{1}{\beta^{0}}\|\mathbf{z}^{t+1}-\mathbf{z}^{t}\|
\displaystyle\overset{\text{\char 173}}{\leq} A¯r+1β02z¯y¯,\displaystyle\overline{\rm{A}}\sqrt{r}+\tfrac{1}{\beta^{0}}\cdot 2\overline{\rm{z}}\triangleq\overline{\rm{y}},

where step uses the triangle inequality, σ1\sigma\geq 1, and 1βt1β0\tfrac{1}{\beta^{t}}\leq\tfrac{1}{\beta^{0}}; step uses 𝒜(𝐗)𝖥A¯𝐗𝖥A¯r\|\mathcal{A}(\mathbf{X})\|_{\mathsf{F}}\leq\overline{\rm{A}}\|\mathbf{X}\|_{\mathsf{F}}\leq\overline{\rm{A}}\sqrt{r}, and 𝐳tz¯\|\mathbf{z}^{t}\|\leq\overline{\rm{z}}.

Part (d). We show that ΘtΘ¯\Theta^{t}\geq\underline{\rm{\Theta}}. For all t1t\geq 1, we have:

Θt\displaystyle\Theta^{t}\triangleq L(𝐗t,𝐲t,𝐳t;βt,μt1)+μt1Ch2+𝕋t+t+𝕏t\displaystyle~{}L(\mathbf{X}^{t},\mathbf{y}^{t},\mathbf{z}^{t};\beta^{t},\mu^{t-1})+\mu^{t-1}C_{h}^{2}+\mathbb{T}^{t}+\mathbb{Z}^{t}+\mathbb{X}^{t}
\displaystyle\overset{\text{\char 172}}{\geq} f(𝐗t)g(𝐗t)+hμt1(𝐲t)+𝐳t,𝒜(𝐗t)𝐲t+βt2𝒜(𝐗t)𝐲t22\displaystyle~{}f(\mathbf{X}^{t})-g(\mathbf{X}^{t})+h_{\mu^{t-1}}(\mathbf{y}^{t})+\langle\mathbf{z}^{t},\mathcal{A}(\mathbf{X}^{t})-\mathbf{y}^{t}\rangle+\tfrac{\beta^{t}}{2}\|\mathcal{A}(\mathbf{X}^{t})-\mathbf{y}^{t}\|_{2}^{2}
=\displaystyle= f(𝐗t)g(𝐗t)+hμt1(𝐲t)+βt2𝒜(𝐗t)𝐲t+𝐳t/βt22βt2𝐳t/βt22\displaystyle~{}f(\mathbf{X}^{t})-g(\mathbf{X}^{t})+h_{\mu^{t-1}}(\mathbf{y}^{t})+\tfrac{\beta^{t}}{2}\|\mathcal{A}(\mathbf{X}^{t})-\mathbf{y}^{t}+\mathbf{z}^{t}/\beta^{t}\|_{2}^{2}-\tfrac{\beta^{t}}{2}\|\mathbf{z}^{t}/\beta^{t}\|_{2}^{2}
\displaystyle\overset{\text{\char 173}}{\geq} f(𝐗t)g(𝐗t)+hμt1(𝒜(𝐗t))Ch𝒜(𝐗t)𝐲t12βt𝐳t22\displaystyle~{}f(\mathbf{X}^{t})-g(\mathbf{X}^{t})+h_{\mu^{t-1}}(\mathcal{A}(\mathbf{X}^{t}))-C_{h}\|\mathcal{A}(\mathbf{X}^{t})-\mathbf{y}^{t}\|-\tfrac{1}{2\beta^{t}}\|\mathbf{z}^{t}\|_{2}^{2}
\displaystyle\overset{\text{\char 174}}{\geq} f(𝐗t)g(𝐗t)+h(𝒜(𝐗t))μt1Ch2Ch(𝒜(𝐗t)+𝐲t)12βt𝐳t22\displaystyle~{}f(\mathbf{X}^{t})-g(\mathbf{X}^{t})+h(\mathcal{A}(\mathbf{X}^{t}))-\mu^{t-1}C_{h}^{2}-C_{h}(\|\mathcal{A}(\mathbf{X}^{t})\|+\|\mathbf{y}^{t}\|)-\tfrac{1}{2\beta^{t}}\|\mathbf{z}^{t}\|_{2}^{2}
\displaystyle\overset{\text{\char 175}}{\geq} F(𝐗¯)μ0Ch2Ch(A¯r+y¯)z¯22β0Θ¯,\displaystyle~{}F(\bar{\mathbf{X}})-\mu^{0}C_{h}^{2}-C_{h}(\overline{\rm{A}}\sqrt{r}+\overline{\rm{y}})-\tfrac{\overline{\rm{z}}^{2}}{2\beta^{0}}\triangleq\underline{\rm{\Theta}},

where step uses the definition of L(𝐗,𝐲;𝐳;β;μ)L(\mathbf{X},\mathbf{y};\mathbf{z};\beta;\mu) and the nonnegativity of {μt1Ch2,𝕋t,t,𝕏t}\{\mu^{t-1}C_{h}^{2},\mathbb{T}^{t},\mathbb{Z}^{t},\mathbb{X}^{t}\}; step uses the ChC_{h}-Lipschitz continuity of hμt1(𝐲)h_{\mu^{t-1}}(\mathbf{y}), ensuring hμt1(𝐲t)hμt1(𝐲)Ch𝐲t𝐲h_{\mu^{t-1}}(\mathbf{y}^{t})\geq h_{\mu^{t-1}}(\mathbf{y})-C_{h}\|\mathbf{y}^{t}-\mathbf{y}\|, with the specific choice of 𝐲=𝒜(𝐗t)\mathbf{y}=\mathcal{A}(\mathbf{X}^{t}); step uses h(𝐲)hμ(𝐲)μCh2h(\mathbf{y})-h_{\mu}(\mathbf{y})\leq\mu C_{h}^{2}, which has been shown in Lemma 2.3; step uses F(𝐗t)F(𝐗¯)F(\mathbf{X}^{t})\geq F(\bar{\mathbf{X}}) by the optimality of 𝐗¯\bar{\mathbf{X}}, μtμ0\mu^{t}\leq\mu^{0}, βtβ0\beta^{t}\geq\beta^{0}, 𝒜(𝐗)A¯𝐗𝖥A¯r\|\mathcal{A}(\mathbf{X})\|\leq\overline{\rm{A}}\|\mathbf{X}\|_{\mathsf{F}}\leq\overline{\rm{A}}\sqrt{r} for all 𝐗\mathbf{X}\in\mathcal{M}, 𝐲ty¯\|\mathbf{y}^{t}\|\leq\overline{\rm{y}}, and 𝐳tz¯\|\mathbf{z}^{t}\|\leq\overline{\rm{z}}.

C.4 Proof of Lemma 4.5

Proof.

We define L(𝐗,𝐲;𝐳;β,μ)f(𝐗)g(𝐗)+hμ(𝐲)+𝐳,𝒜(𝐗)𝐲+β2𝒜(𝐗)𝐲22L(\mathbf{X},\mathbf{y};\mathbf{z};\beta,\mu)\triangleq f(\mathbf{X})-g(\mathbf{X})+h_{\mu}(\mathbf{y})+\langle\mathbf{z},\mathcal{A}(\mathbf{X})-\mathbf{y}\rangle+\frac{\beta}{2}\|\mathcal{A}(\mathbf{X})-\mathbf{y}\|_{2}^{2}.

We define ω1σ+ξ2σ2+εzσ2\omega\triangleq\tfrac{1}{\sigma}+\tfrac{\xi}{2\sigma^{2}}+\tfrac{\varepsilon_{z}}{\sigma^{2}}.

We define tωσ˙σ2βt1𝒜(𝐗t)𝐲t22=ωσ˙βt1𝐳t𝐳t122\mathbb{Z}^{t}\triangleq\omega\dot{\sigma}\sigma^{2}\beta^{t-1}\|\mathcal{A}(\mathbf{X}^{t})-\mathbf{y}^{t}\|_{2}^{2}=\tfrac{\omega\dot{\sigma}}{\beta^{t-1}}\|\mathbf{z}^{t}-\mathbf{z}^{t-1}\|_{2}^{2}, where we use 𝐳t+1𝐳t=βtσ(𝒜(𝐗t+1)𝐲t+1)\mathbf{z}^{t+1}-\mathbf{z}^{t}=\beta^{t}\sigma(\mathcal{A}(\mathbf{X}^{t+1})-\mathbf{y}^{t+1}).

Part (a). We focus on the sufficient decrease for variables {μ,𝐲}\{\mu,\mathbf{y}\}. First, we have:

Ξ\displaystyle\Xi\triangleq 𝐲t𝐲t+1,𝐳t+βt2𝐲t+1𝒜(𝐗t+1)22βt2𝐲t𝒜(𝐗t+1)22\displaystyle~{}\langle\mathbf{y}^{t}-\mathbf{y}^{t+1},\mathbf{z}^{t}\rangle+\tfrac{\beta^{t}}{2}\|\mathbf{y}^{t+1}-\mathcal{A}(\mathbf{X}^{t+1})\|_{2}^{2}-\tfrac{\beta^{t}}{2}\|\mathbf{y}^{t}-\mathcal{A}(\mathbf{X}^{t+1})\|_{2}^{2}
=\displaystyle\overset{\text{\char 172}}{=} 𝐲t𝐲t+1,𝐳t+βt(𝒜(𝐗t+1)𝐲t+1)βt2𝐲t+1𝐲t22\displaystyle~{}\langle\mathbf{y}^{t}-\mathbf{y}^{t+1},\mathbf{z}^{t}+\beta^{t}(\mathcal{A}(\mathbf{X}^{t+1})-\mathbf{y}^{t+1})\rangle-\tfrac{\beta^{t}}{2}\|\mathbf{y}^{t+1}-\mathbf{y}^{t}\|_{2}^{2}
=\displaystyle\overset{\text{\char 173}}{=} βt2𝐲t+1𝐲t22+𝐲t𝐲t+1,𝐳t+1σ(𝐳t+1𝐳t)\displaystyle~{}-\tfrac{\beta^{t}}{2}\|\mathbf{y}^{t+1}-\mathbf{y}^{t}\|_{2}^{2}+\langle\mathbf{y}^{t}-\mathbf{y}^{t+1},\mathbf{z}^{t}+\tfrac{1}{\sigma}(\mathbf{z}^{t+1}-\mathbf{z}^{t})\rangle
=\displaystyle\overset{\text{\char 174}}{=} βt2𝐲t+1𝐲t22+𝐲t𝐲t+1,hμt(𝐲t+1)\displaystyle~{}-\tfrac{\beta^{t}}{2}\|\mathbf{y}^{t+1}-\mathbf{y}^{t}\|_{2}^{2}+\langle\mathbf{y}^{t}-\mathbf{y}^{t+1},\nabla h_{\mu^{t}}(\mathbf{y}^{t+1})\rangle
\displaystyle\overset{\text{\char 175}}{\leq} {1χ1}βt2𝐲t+1𝐲t22+hμt(𝐲t)hμt(𝐲t+1),\displaystyle~{}\{\tfrac{1}{\chi}-1\}\tfrac{\beta^{t}}{2}\|\mathbf{y}^{t+1}-\mathbf{y}^{t}\|_{2}^{2}+h_{\mu^{t}}(\mathbf{y}^{t})-h_{\mu^{t}}(\mathbf{y}^{t+1}), (31)

where step uses the Pythagoras Relation that 12𝐲+𝐚2212𝐲𝐚22=12𝐲+𝐲22+𝐲𝐲+,𝐚𝐲+\tfrac{1}{2}\|\mathbf{y}^{+}-\mathbf{a}\|_{2}^{2}-\tfrac{1}{2}\|\mathbf{y}-\mathbf{a}\|_{2}^{2}=-\tfrac{1}{2}\|\mathbf{y}^{+}-\mathbf{y}\|_{2}^{2}+\langle\mathbf{y}-\mathbf{y}^{+},\mathbf{a}-\mathbf{y}^{+}\rangle for all 𝐲,𝐲+,𝐚m\mathbf{y},\mathbf{y}^{+},\mathbf{a}\in\mathbb{R}^{m}; step uses 𝐳t+1𝐳t=σβt(𝒜(𝐗t+1)𝐲t+1)\mathbf{z}^{t+1}-\mathbf{z}^{t}=\sigma\beta^{t}(\mathcal{A}(\mathbf{X}^{t+1})-\mathbf{y}^{t+1}); step uses hμt(𝐲t+1)=𝐳t+1σ(𝐳t+1𝐳t)\nabla h_{\mu^{t}}(\mathbf{y}^{t+1})=\mathbf{z}^{t}+\tfrac{1}{\sigma}(\mathbf{z}^{t+1}-\mathbf{z}^{t}), as shown in Lemma 4.1(a); step uses the fact that the function hμt(𝐲)h_{\mu^{t}}(\mathbf{y}) is (1/μt)(1/\mu^{t})-weakly convex w.r.t. 𝐲\mathbf{y}, and μtβt=χ\mu^{t}\beta^{t}=\chi. Furthermore, we have:

L(𝐗t+1,𝐲t+1;𝐳t;βt,μt)L(𝐗t+1,𝐲t,𝐳t;βt,μt1)\displaystyle~{}L(\mathbf{X}^{t+1},\mathbf{y}^{t+1};\mathbf{z}^{t};\beta^{t},\mu^{t})-L(\mathbf{X}^{t+1},\mathbf{y}^{t},\mathbf{z}^{t};\beta^{t},\mu^{t-1})
=\displaystyle\overset{\text{\char 172}}{=} hμt(𝐲t+1)hμt1(𝐲t)+𝐲t𝐲t+1,𝐳t+βt2𝐲t+1𝒜(𝐗t+1)22βt2𝐲t𝒜(𝐗t+1)22=Ξ\displaystyle~{}h_{\mu^{t}}(\mathbf{y}^{t+1})-h_{\mu^{t-1}}(\mathbf{y}^{t})+\smash{\underbrace{\langle\mathbf{y}^{t}-\mathbf{y}^{t+1},\mathbf{z}^{t}\rangle+\tfrac{\beta^{t}}{2}\|\mathbf{y}^{t+1}-\mathcal{A}(\mathbf{X}^{t+1})\|_{2}^{2}-\tfrac{\beta^{t}}{2}\|\mathbf{y}^{t}-\mathcal{A}(\mathbf{X}^{t+1})\|_{2}^{2}}_{=\,\Xi}}
\displaystyle\overset{\text{\char 173}}{\leq} 1/χ12βt𝐲t+1𝐲t22+hμt(𝐲t)hμt1(𝐲t)\displaystyle~{}\tfrac{1/\chi-1}{2}\beta^{t}\|\mathbf{y}^{t+1}-\mathbf{y}^{t}\|_{2}^{2}+h_{\mu^{t}}(\mathbf{y}^{t})-h_{\mu^{t-1}}(\mathbf{y}^{t})
=\displaystyle\overset{\text{\char 174}}{=} 1/χ12βt𝐲t+1𝐲t22+(μt1μt)Ch2,\displaystyle~{}\tfrac{1/\chi-1}{2}\beta^{t}\|\mathbf{y}^{t+1}-\mathbf{y}^{t}\|_{2}^{2}+(\mu^{t-1}-\mu^{t})C_{h}^{2}, (32)

where step uses the definition of L(𝐗,𝐲;𝐳;β,μ)L(\mathbf{X},\mathbf{y};\mathbf{z};\beta,\mu); step uses Inequality (31); step uses Lemma 2.3 that hμt(𝐲)hμt1(𝐲)min{μt12μt,1}(μt1μt)Ch2(μt1μt)Ch2h_{\mu^{t}}(\mathbf{y})-h_{\mu^{t-1}}(\mathbf{y})\leq\min\{\tfrac{\mu^{t-1}}{2\mu^{t}},1\}\cdot(\mu^{t-1}-\mu^{t})C_{h}^{2}\leq(\mu^{t-1}-\mu^{t})C_{h}^{2} for all 𝐲\mathbf{y}.

Part (b). We focus on the sufficient decrease for variables {𝐳,β}\{\mathbf{z},\beta\}. We have:

L(𝐗t+1,𝐲t+1;𝐳t+1;βt+1,μt)L(𝐗t+1,𝐲t+1;𝐳t;βt,μt)+εzβt𝒜(𝐗t+1)𝐲t+122\displaystyle~{}L(\mathbf{X}^{t+1},\mathbf{y}^{t+1};\mathbf{z}^{t+1};\beta^{t+1},\mu^{t})-L(\mathbf{X}^{t+1},\mathbf{y}^{t+1};\mathbf{z}^{t};\beta^{t},\mu^{t})+\varepsilon_{z}\beta^{t}\|\mathcal{A}(\mathbf{X}^{t+1})-\mathbf{y}^{t+1}\|_{2}^{2}
=\displaystyle\overset{\text{\char 172}}{=} 𝒜(𝐗t+1)𝐲t+1,𝐳t+1𝐳t+βt+1βt2𝒜(𝐗t+1)𝐲t+122+εzβt𝒜(𝐗t+1)𝐲t+122\displaystyle~{}\langle\mathcal{A}(\mathbf{X}^{t+1})-\mathbf{y}^{t+1},\mathbf{z}^{t+1}-\mathbf{z}^{t}\rangle+\tfrac{\beta^{t+1}-\beta^{t}}{2}\|\mathcal{A}(\mathbf{X}^{t+1})-\mathbf{y}^{t+1}\|_{2}^{2}+\varepsilon_{z}\beta^{t}\|\mathcal{A}(\mathbf{X}^{t+1})-\mathbf{y}^{t+1}\|_{2}^{2}
=\displaystyle\overset{\text{\char 173}}{=} {1σ+βt+1βt2σ2βt+εzσ2}1βt𝐳t+1𝐳t22\displaystyle~{}\{\tfrac{1}{\sigma}+\tfrac{\beta^{t+1}-\beta^{t}}{2\sigma^{2}\beta^{t}}+\tfrac{\varepsilon_{z}}{\sigma^{2}}\}\cdot\tfrac{1}{\beta^{t}}\|\mathbf{z}^{t+1}-\mathbf{z}^{t}\|_{2}^{2}
≤\displaystyle\overset{\text{\char 174}}{\leq} {1σ+ξ2σ2+εzσ2}ω1βt𝐳t+1𝐳t22\displaystyle~{}\underbrace{\{\tfrac{1}{\sigma}+\tfrac{\xi}{2\sigma^{2}}+\tfrac{\varepsilon_{z}}{\sigma^{2}}\}}_{\triangleq\omega}\cdot\tfrac{1}{\beta^{t}}\|\mathbf{z}^{t+1}-\mathbf{z}^{t}\|_{2}^{2}
\displaystyle\overset{\text{\char 175}}{\leq} ωσ˙βt(𝐳t𝐳t122𝐳t+1𝐳t22)+2ωσ¨χ2βt𝐲t+1𝐲t22+2ωσ¨βtCh2(2t2t+1)\displaystyle~{}\tfrac{\omega\dot{\sigma}}{\beta^{t}}(\|\mathbf{z}^{t}-\mathbf{z}^{t-1}\|_{2}^{2}-\|\mathbf{z}^{t+1}-\mathbf{z}^{t}\|_{2}^{2})+\tfrac{2\omega\ddot{\sigma}}{\chi^{2}}\beta^{t}\|\mathbf{y}^{t+1}-\mathbf{y}^{t}\|_{2}^{2}+\tfrac{2\omega\ddot{\sigma}}{\beta^{t}}C^{2}_{h}(\tfrac{2}{t}-\tfrac{2}{t+1})
\displaystyle\overset{\text{\char 176}}{\leq} ωσ˙βt1𝐳t𝐳t122tωσ˙βt𝐳t+1𝐳t22+2ωσ¨χ2βt𝐲t+1𝐲t22+2ωσ¨β0Ch2(2t2t+1)=𝕋t𝕋t+1,\displaystyle~{}\smash{\underbrace{\tfrac{\omega\dot{\sigma}}{\beta^{t-1}}\|\mathbf{z}^{t}-\mathbf{z}^{t-1}\|_{2}^{2}}_{\triangleq\mathbb{Z}^{t}}}-\tfrac{\omega\dot{\sigma}}{\beta^{t}}\|\mathbf{z}^{t+1}-\mathbf{z}^{t}\|_{2}^{2}+\tfrac{2\omega\ddot{\sigma}}{\chi^{2}}\beta^{t}\|\mathbf{y}^{t+1}-\mathbf{y}^{t}\|_{2}^{2}+\underbrace{\tfrac{2\omega\ddot{\sigma}}{\beta^{0}}C^{2}_{h}(\tfrac{2}{t}-\tfrac{2}{t+1})}_{=\mathbb{T}^{t}-\mathbb{T}^{t+1}}, (33)

where step uses the definition of L(𝐗,𝐲;𝐳;β;μ)L(\mathbf{X},\mathbf{y};\mathbf{z};\beta;\mu); step uses 𝐳t+1𝐳t=σβt(𝒜(𝐗t+1)𝐲t+1)\mathbf{z}^{t+1}-\mathbf{z}^{t}=\sigma\beta^{t}(\mathcal{A}(\mathbf{X}^{t+1})-\mathbf{y}^{t+1}); step uses βt+1(1+ξ)βt\beta^{t+1}\leq(1+\xi)\beta^{t}; step uses the upper bound for 𝐳t+1𝐳t22\|\mathbf{z}^{t+1}-\mathbf{z}^{t}\|_{2}^{2} as shown in Lemma 4.1(b); step uses βtβt1β0\beta^{t}\geq\beta^{t-1}\geq\beta^{0}.

Adding Inequalities (32) and (33) together, we have:

L(𝐗t+1,𝐲t+1,𝐳t+1;βt+1,μt)L(𝐗t+1,𝐲t,𝐳t;βt,μt1)+(μtμt1)Ch2\displaystyle~{}\textstyle L(\mathbf{X}^{t+1},\mathbf{y}^{t+1},\mathbf{z}^{t+1};\beta^{t+1},\mu^{t})-L(\mathbf{X}^{t+1},\mathbf{y}^{t},\mathbf{z}^{t};\beta^{t},\mu^{t-1})+(\mu^{t}-\mu^{t-1})C_{h}^{2}
+𝕋t+1𝕋t+t+1t+εzβt𝒜(𝐗t+1)𝐲t+122\displaystyle~{}\textstyle+\mathbb{T}^{t+1}-\mathbb{T}^{t}+\mathbb{Z}^{t+1}-\mathbb{Z}^{t}+\varepsilon_{z}\beta^{t}\|\mathcal{A}(\mathbf{X}^{t+1})-\mathbf{y}^{t+1}\|_{2}^{2}
\displaystyle\leq 12{1χ1+4ωσ¨χ2}βt𝐲t+1𝐲t22\displaystyle~{}\tfrac{1}{2}\{\tfrac{1}{\chi}-1+\tfrac{4\omega\ddot{\sigma}}{\chi^{2}}\}\beta^{t}\|\mathbf{y}^{t+1}-\mathbf{y}^{t}\|_{2}^{2}
\displaystyle\overset{\text{\char 172}}{\leq} 12{1+1+4ωσ¨χ}εyβt𝐲t+1𝐲t22,\displaystyle~{}\underbrace{\tfrac{1}{2}\{-1+\tfrac{1+4\omega\ddot{\sigma}}{\chi}\}}_{\triangleq\,-\varepsilon_{y}}\beta^{t}\|\mathbf{y}^{t+1}-\mathbf{y}^{t}\|_{2}^{2},

where step uses χ1\chi\geq 1.

C.5 Proof of Lemma 4.6

Proof.

We define 𝒮(𝐗,𝐲t;𝐳t;βt)f(𝐗)+𝐳t,𝒜(𝐗)𝐲t+βt2𝒜(𝐗)𝐲t22\mathcal{S}(\mathbf{X},\mathbf{y}^{t};\mathbf{z}^{t};\beta^{t})\triangleq f(\mathbf{X})+\langle\mathbf{z}^{t},\mathcal{A}(\mathbf{X})-\mathbf{y}^{t}\rangle+\tfrac{\beta^{t}}{2}\|\mathcal{A}(\mathbf{X})-\mathbf{y}^{t}\|_{2}^{2}.

We let 𝐆t𝐗𝒮(𝐗𝖼t,𝐲t;𝐳t;βt)g(𝐗t)\mathbf{G}^{t}\in\nabla_{\mathbf{X}}\mathcal{S}(\mathbf{X}_{{\sf c}}^{t},\mathbf{y}^{t};\mathbf{z}^{t};\beta^{t})-\partial g(\mathbf{X}^{t}).

We define 𝕏t12(α+θα)(βt)𝐗t𝐗t1𝖥2\mathbb{X}^{t}\triangleq\tfrac{1}{2}(\alpha+\theta\alpha)\ell(\beta^{t})\|\mathbf{X}^{t}-\mathbf{X}^{t-1}\|_{\mathsf{F}}^{2}.

We define εx(θ1αθα)(1+ξ)(α+θα)>0\varepsilon_{x}^{\prime}\triangleq(\theta-1-\alpha-\theta\alpha)-(1+\xi)(\alpha+\theta\alpha)>0, and εx12εx¯>0\varepsilon_{x}\triangleq\tfrac{1}{2}\varepsilon_{x}^{\prime}\underline{\rm{\ell}}>0.

First, using the optimality condition of 𝐗t+1\mathbf{X}^{t+1}\in\mathcal{M}, we have:

𝐗t+1𝐗t,𝐆t+θ(βt)2𝐗t+1𝐗𝖼t𝖥2𝐗t𝐗t,𝐆t+θ(βt)2𝐗t𝐗𝖼t𝖥2.\displaystyle\langle\mathbf{X}^{t+1}-\mathbf{X}^{t},\mathbf{G}^{t}\rangle+\tfrac{\theta\ell(\beta^{t})}{2}\|\mathbf{X}^{t+1}-\mathbf{X}_{{\sf c}}^{t}\|_{\mathsf{F}}^{2}\leq\langle\mathbf{X}^{t}-\mathbf{X}^{t},\mathbf{G}^{t}\rangle+\tfrac{\theta\ell(\beta^{t})}{2}\|\mathbf{X}^{t}-\mathbf{X}_{{\sf c}}^{t}\|_{\mathsf{F}}^{2}. (34)

Second, we have:

L(𝐗t+1,𝐲t,𝐳t;μt,βt)L(𝐗t,𝐲t,𝐳t;μt,βt)\displaystyle~{}L(\mathbf{X}^{t+1},\mathbf{y}^{t},\mathbf{z}^{t};\mu^{t},\beta^{t})-L(\mathbf{X}^{t},\mathbf{y}^{t},\mathbf{z}^{t};\mu^{t},\beta^{t})
=\displaystyle= 𝒮(𝐗t+1,𝐲t;𝐳t;βt)𝒮(𝐗t,𝐲t;𝐳t;βt)+g(𝐗t)g(𝐗t+1)\displaystyle~{}\textstyle\mathcal{S}(\mathbf{X}^{t+1},\mathbf{y}^{t};\mathbf{z}^{t};\beta^{t})-\mathcal{S}(\mathbf{X}^{t},\mathbf{y}^{t};\mathbf{z}^{t};\beta^{t})+g(\mathbf{X}^{t})-g(\mathbf{X}^{t+1})
\displaystyle\overset{\text{\char 172}}{\leq} (βt)2𝐗t+1𝐗t𝖥2+𝐗t+1𝐗t,𝐗𝒮(𝐗t,𝐲t;𝐳t;βt)+𝐗t𝐗t+1,g(𝐗t),\displaystyle~{}\textstyle\tfrac{\ell(\beta^{t})}{2}\|\mathbf{X}^{t+1}-\mathbf{X}^{t}\|_{\mathsf{F}}^{2}+\langle\mathbf{X}^{t+1}-\mathbf{X}^{t},\nabla_{\mathbf{X}}\mathcal{S}(\mathbf{X}^{t},\mathbf{y}^{t};\mathbf{z}^{t};\beta^{t})\rangle+\langle\mathbf{X}^{t}-\mathbf{X}^{t+1},\partial g(\mathbf{X}^{t})\rangle, (35)

where step uses the (βt)\ell(\beta^{t})-smoothness of 𝒮(𝐗,𝐲t;𝐳t;βt)\mathcal{S}(\mathbf{X},\mathbf{y}^{t};\mathbf{z}^{t};\beta^{t}) and the convexity of g(𝐗)g(\mathbf{X}).

Third, we derive:

𝐗t+1𝐗t,𝐗𝒮(𝐗t,𝐲t;𝐳t;βt)𝐗𝒮(𝐗𝖼t,𝐲t;𝐳t;βt)\displaystyle\langle\mathbf{X}^{t+1}-\mathbf{X}^{t},\nabla_{\mathbf{X}}\mathcal{S}(\mathbf{X}^{t},\mathbf{y}^{t};\mathbf{z}^{t};\beta^{t})-\nabla_{\mathbf{X}}\mathcal{S}(\mathbf{X}_{{\sf c}}^{t},\mathbf{y}^{t};\mathbf{z}^{t};\beta^{t})\rangle (36)
\displaystyle\overset{\text{\char 172}}{\leq} 𝐗t+1𝐗t𝖥𝐗𝒮(𝐗t,𝐲t;𝐳t;βt)𝐗𝒮(𝐗𝖼t,𝐲t;𝐳t;βt)𝖥\displaystyle\|\mathbf{X}^{t+1}-\mathbf{X}^{t}\|_{\mathsf{F}}\cdot\|\nabla_{\mathbf{X}}\mathcal{S}(\mathbf{X}^{t},\mathbf{y}^{t};\mathbf{z}^{t};\beta^{t})-\nabla_{\mathbf{X}}\mathcal{S}(\mathbf{X}_{{\sf c}}^{t},\mathbf{y}^{t};\mathbf{z}^{t};\beta^{t})\|_{\mathsf{F}}
\displaystyle\overset{\text{\char 173}}{\leq} 𝐗t+1𝐗t𝖥(βt)𝐗t𝐗𝖼t𝖥\displaystyle\|\mathbf{X}^{t+1}-\mathbf{X}^{t}\|_{\mathsf{F}}\cdot\ell(\beta^{t})\|\mathbf{X}^{t}-\mathbf{X}_{{\sf c}}^{t}\|_{\mathsf{F}}
\displaystyle\overset{\text{\char 174}}{\leq} α(βt)𝐗t+1𝐗t𝖥𝐗t𝐗t1𝖥\displaystyle\alpha\ell(\beta^{t})\|\mathbf{X}^{t+1}-\mathbf{X}^{t}\|_{\mathsf{F}}\cdot\|\mathbf{X}^{t}-\mathbf{X}^{t-1}\|_{\mathsf{F}}
\displaystyle\overset{\text{\char 175}}{\leq} α(βt)2𝐗t+1𝐗t𝖥2+α(βt)2𝐗t𝐗t1𝖥2,\displaystyle\tfrac{\alpha\ell(\beta^{t})}{2}\|\mathbf{X}^{t+1}-\mathbf{X}^{t}\|_{\mathsf{F}}^{2}+\tfrac{\alpha\ell(\beta^{t})}{2}\|\mathbf{X}^{t}-\mathbf{X}^{t-1}\|^{2}_{\mathsf{F}},

where step uses the norm inequality; step uses the (βt)\ell(\beta^{t})-smoothness of 𝒮(𝐗,𝐲t;𝐳t;βt)\mathcal{S}(\mathbf{X},\mathbf{y}^{t};\mathbf{z}^{t};\beta^{t}); step uses 𝐗𝖼t=𝐗t+α(𝐗t𝐗t1)\mathbf{X}_{{\sf c}}^{t}=\mathbf{X}^{t}+\alpha(\mathbf{X}^{t}-\mathbf{X}^{t-1}); step uses ab12a2+12b2ab\leq\frac{1}{2}a^{2}+\frac{1}{2}b^{2} for all aa\in\mathbb{R} and bb\in\mathbb{R}.

Summing Inequalities (34), (35), and (36), we obtain:

L(𝐗t+1,𝐲t,𝐳t;μt,βt)L(𝐗t,𝐲t,𝐳t;μt,βt)\displaystyle L(\mathbf{X}^{t+1},\mathbf{y}^{t},\mathbf{z}^{t};\mu^{t},\beta^{t})-L(\mathbf{X}^{t},\mathbf{y}^{t},\mathbf{z}^{t};\mu^{t},\beta^{t})
\displaystyle\leq (βt)2{(1+α)𝐗t+1𝐗t𝖥2+α𝐗t𝐗t1𝖥2+θ𝐗t𝐗𝖼t𝖥2θ𝐗t+1𝐗𝖼t𝖥2}\displaystyle\textstyle\tfrac{\ell(\beta^{t})}{2}\{(1+\alpha)\|\mathbf{X}^{t+1}-\mathbf{X}^{t}\|^{2}_{\mathsf{F}}+\alpha\|\mathbf{X}^{t}-\mathbf{X}^{t-1}\|_{\mathsf{F}}^{2}+\theta\|\mathbf{X}^{t}-\mathbf{X}_{{\sf c}}^{t}\|_{\mathsf{F}}^{2}-\theta\|\mathbf{X}^{t+1}-\mathbf{X}_{{\sf c}}^{t}\|_{\mathsf{F}}^{2}\}
=\displaystyle\overset{\text{\char 172}}{=} (βt)2{(1+α)𝐗t+1𝐗t𝖥2+(α+θα2)𝐗t𝐗t1𝖥2θ𝐗t+1𝐗tα(𝐗t𝐗t1)𝖥2}\displaystyle\textstyle\tfrac{\ell(\beta^{t})}{2}\{(1+\alpha)\|\mathbf{X}^{t+1}-\mathbf{X}^{t}\|^{2}_{\mathsf{F}}+(\alpha+\theta\alpha^{2})\|\mathbf{X}^{t}-\mathbf{X}^{t-1}\|_{\mathsf{F}}^{2}-\theta\|\mathbf{X}^{t+1}-\mathbf{X}^{t}-\alpha(\mathbf{X}^{t}-\mathbf{X}^{t-1})\|_{\mathsf{F}}^{2}\}
\displaystyle\overset{\text{\char 173}}{\leq} (βt)2{(1+α)𝐗t+1𝐗t𝖥2+(α+θα2)𝐗t𝐗t1𝖥2\displaystyle\textstyle\tfrac{\ell(\beta^{t})}{2}\{(1+\alpha)\|\mathbf{X}^{t+1}-\mathbf{X}^{t}\|^{2}_{\mathsf{F}}+(\alpha+\theta\alpha^{2})\|\mathbf{X}^{t}-\mathbf{X}^{t-1}\|_{\mathsf{F}}^{2}
+θ(α1)𝐗t+1𝐗t𝖥2θα(α1)𝐗t𝐗t1𝖥2}\displaystyle+\theta(\alpha-1)\|\mathbf{X}^{t+1}-\mathbf{X}^{t}\|_{\mathsf{F}}^{2}-\theta\alpha(\alpha-1)\|\mathbf{X}^{t}-\mathbf{X}^{t-1}\|_{\mathsf{F}}^{2}\}
=\displaystyle\overset{}{=} 12(α+θα)(βt)𝐗t𝐗t1𝖥2𝕏t+(βt)2𝐗t+1𝐗t𝖥2{1+α+θαθ}\displaystyle\textstyle\underbrace{\tfrac{1}{2}(\alpha+\theta\alpha)\ell(\beta^{t})\|\mathbf{X}^{t}-\mathbf{X}^{t-1}\|_{\mathsf{F}}^{2}}_{\triangleq\mathbb{X}^{t}}+\tfrac{\ell(\beta^{t})}{2}\cdot\|\mathbf{X}^{t+1}-\mathbf{X}^{t}\|^{2}_{\mathsf{F}}\cdot\{1+\alpha+\theta\alpha-\theta\}
=\displaystyle\overset{}{=} 𝕏t𝕏t+1+12𝐗t+1𝐗t𝖥2{(βt)(1+α+θαθ)+(βt+1)(α+θα)}\displaystyle\textstyle\mathbb{X}^{t}-\mathbb{X}^{t+1}+\tfrac{1}{2}\cdot\|\mathbf{X}^{t+1}-\mathbf{X}^{t}\|^{2}_{\mathsf{F}}\cdot\{\ell(\beta^{t})(1+\alpha+\theta\alpha-\theta)+\ell(\beta^{t+1})(\alpha+\theta\alpha)\}
\displaystyle\overset{\text{\char 174}}{\leq} 𝕏t𝕏t+1+12𝐗t+1𝐗t𝖥2(βt){(1+α+θαθ)+(1+ξ)(α+θα)εx}\displaystyle\textstyle\mathbb{X}^{t}-\mathbb{X}^{t+1}+\tfrac{1}{2}\cdot\|\mathbf{X}^{t+1}-\mathbf{X}^{t}\|^{2}_{\mathsf{F}}\cdot\ell(\beta^{t})\{\underbrace{(1+\alpha+\theta\alpha-\theta)+(1+\xi)(\alpha+\theta\alpha)}_{\triangleq-\varepsilon_{x}^{\prime}}\}
\displaystyle\overset{\text{\char 175}}{\leq} 𝕏t𝕏t+112𝐗t+1𝐗t𝖥2εxβt¯\displaystyle\textstyle\mathbb{X}^{t}-\mathbb{X}^{t+1}-\tfrac{1}{2}\cdot\|\mathbf{X}^{t+1}-\mathbf{X}^{t}\|^{2}_{\mathsf{F}}\cdot\varepsilon_{x}^{\prime}\cdot\beta^{t}\underline{\rm{\ell}}
=\displaystyle\overset{\text{\char 176}}{=} 𝕏t𝕏t+1εxβt𝐗t+1𝐗t𝖥2,\displaystyle\textstyle\mathbb{X}^{t}-\mathbb{X}^{t+1}-\varepsilon_{x}\cdot\beta^{t}\|\mathbf{X}^{t+1}-\mathbf{X}^{t}\|^{2}_{\mathsf{F}},

where step uses 𝐗𝖼t=𝐗t+α(𝐗t𝐗t1)\mathbf{X}_{{\sf c}}^{t}=\mathbf{X}^{t}+\alpha(\mathbf{X}^{t}-\mathbf{X}^{t-1}); step uses Lemma A.1 with 𝐚=𝐗t+1𝐗t\mathbf{a}=\mathbf{X}^{t+1}-\mathbf{X}^{t}, and 𝐛=𝐗t𝐗t1\mathbf{b}=\mathbf{X}^{t}-\mathbf{X}^{t-1}; step uses the fact that (βt+1)(1+ξ)(βt)\ell(\beta^{t+1})\leq(1+\xi)\ell(\beta^{t}), which is implied by βt+1(1+ξ)βt\beta^{t+1}\leq(1+\xi)\beta^{t}; step uses Lemma 4.3 that βt¯(βt)βt¯\beta^{t}\underline{\rm{\ell}}\leq\ell(\beta^{t})\leq\beta^{t}\overline{\rm{\ell}}; step uses εx12εx¯>0\varepsilon_{x}\triangleq\tfrac{1}{2}\varepsilon_{x}^{\prime}\underline{\rm{\ell}}>0.

C.6 Proof of Lemma 4.7

Proof.

We define: ΘtL(𝐗t,𝐲t;𝐳t;βt,μt1)+μt1Ch2+𝕋t+t+𝕏t\Theta^{t}\triangleq L(\mathbf{X}^{t},\mathbf{y}^{t};\mathbf{z}^{t};\beta^{t},\mu^{t-1})+\mu^{t-1}C_{h}^{2}+\mathbb{T}^{t}+\mathbb{Z}^{t}+\mathbb{X}^{t}.

We define e~t𝐲t𝐲t12+𝒜(𝐗t)𝐲t2+𝐗t𝐗t1𝖥2\tilde{e}_{t}\triangleq\|\mathbf{y}^{t}-\mathbf{y}^{t-1}\|^{2}+\|\mathcal{A}(\mathbf{X}^{t})-\mathbf{y}^{t}\|^{2}+\|\mathbf{X}^{t}-\mathbf{X}^{t-1}\|_{\mathsf{F}}^{2}.

Part (a). Using Lemma 4.5, we have:

L(𝐗t+1,𝐲t+1;𝐳t+1;βt+1,μt)L(𝐗t+1,𝐲t;𝐳t;βt,μt1)(μt1μt)Ch2\displaystyle L(\mathbf{X}^{t+1},\mathbf{y}^{t+1};\mathbf{z}^{t+1};\beta^{t+1},\mu^{t})-L(\mathbf{X}^{t+1},\mathbf{y}^{t};\mathbf{z}^{t};\beta^{t},\mu^{t-1})-(\mu^{t-1}-\mu^{t})C_{h}^{2} (37)
\displaystyle\leq 𝕋t𝕋t+1+tt+1εyβt𝐲t+1𝐲t22εzβt𝒜(𝐗t+1)𝐲t+122.\displaystyle\mathbb{T}^{t}-\mathbb{T}^{t+1}+\mathbb{Z}^{t}-\mathbb{Z}^{t+1}-\varepsilon_{y}\beta^{t}\|\mathbf{y}^{t+1}-\mathbf{y}^{t}\|_{2}^{2}-\varepsilon_{z}\beta^{t}\|\mathcal{A}(\mathbf{X}^{t+1})-\mathbf{y}^{t+1}\|_{2}^{2}.

Using Lemma 4.6, we have:

L(𝐗t+1,𝐲t;𝐳t;βt,μt1)L(𝐗t,𝐲t;𝐳t;βt,μt1)𝕏t𝕏t+1εxβt𝐗t+1𝐗t𝖥2.\displaystyle L(\mathbf{X}^{t+1},\mathbf{y}^{t};\mathbf{z}^{t};\beta^{t},\mu^{t-1})-L(\mathbf{X}^{t},\mathbf{y}^{t};\mathbf{z}^{t};\beta^{t},\mu^{t-1})\leq\mathbb{X}^{t}-\mathbb{X}^{t+1}-\varepsilon_{x}\beta^{t}\|\mathbf{X}^{t+1}-\mathbf{X}^{t}\|_{\mathsf{F}}^{2}.

Adding these two inequalities together and using the definition of Θt\Theta^{t}, we have:

ΘtΘt+1\displaystyle\Theta^{t}-\Theta^{t+1} \displaystyle\geq εyβt𝐲t+1𝐲t22+εxβt𝐗t+1𝐗t𝖥2+εzβt𝒜(𝐗t+1)𝐲t+122\displaystyle\varepsilon_{y}\beta^{t}\|\mathbf{y}^{t+1}-\mathbf{y}^{t}\|_{2}^{2}+\varepsilon_{x}\beta^{t}\|\mathbf{X}^{t+1}-\mathbf{X}^{t}\|_{\mathsf{F}}^{2}+\varepsilon_{z}\beta^{t}\|\mathcal{A}(\mathbf{X}^{t+1})-\mathbf{y}^{t+1}\|_{2}^{2}
\displaystyle\geq min(εy,εx,εz)βte~t+1.\displaystyle\min(\varepsilon_{y},\varepsilon_{x},\varepsilon_{z})\cdot\beta^{t}\cdot\tilde{e}_{t+1}.

Part (b). Telescoping this inequality over tt from 1 to TT, we have:

t=1Tβte~t+1\displaystyle\textstyle\sum_{t=1}^{T}\beta^{t}\tilde{e}_{t+1} 1min(εy,εx,εz)t=1T(ΘtΘt+1)\displaystyle~{}\leq\textstyle\textstyle\tfrac{1}{\min(\varepsilon_{y},\varepsilon_{x},\varepsilon_{z})}\cdot\sum_{t=1}^{T}(\Theta^{t}-\Theta^{t+1})
=1min(εy,εx,εz)(Θ1ΘT+1)\displaystyle~{}=\textstyle\tfrac{1}{\min(\varepsilon_{y},\varepsilon_{x},\varepsilon_{z})}\cdot(\Theta^{1}-\Theta^{T+1})
1min(εy,εx,εz)(Θ1Θ¯),\displaystyle~{}\overset{\text{\char 172}}{\leq}\textstyle\tfrac{1}{\min(\varepsilon_{y},\varepsilon_{x},\varepsilon_{z})}\cdot(\Theta^{1}-\underline{\rm{\Theta}}), (38)

where step uses ΘtΘ¯\Theta^{t}\geq\underline{\rm{\Theta}}. Furthermore, we have:

t=1Tβte~t+1=t=1T1βt(βt)2e~t+11βTt=1T(βt)2e~t+113TβT(t=1Tβtet+1)2,\displaystyle\textstyle\sum_{t=1}^{T}\beta^{t}\tilde{e}_{t+1}=\textstyle\sum_{t=1}^{T}\tfrac{1}{\beta^{t}}(\beta^{t})^{2}\tilde{e}_{t+1}\geq\textstyle\tfrac{1}{\beta^{T}}\sum_{t=1}^{T}(\beta^{t})^{2}\tilde{e}_{t+1}\overset{\text{\char 172}}{\geq}\textstyle\tfrac{1}{3T\beta^{T}}(\sum_{t=1}^{T}\beta^{t}e^{t+1})^{2}, (39)

where step uses i=1n𝐱i21n(i=1n|𝐱i|)2\sum_{i=1}^{n}\mathbf{x}_{i}^{2}\geq\tfrac{1}{n}(\sum_{i=1}^{n}|\mathbf{x}_{i}|)^{2} for all 𝐱n\mathbf{x}\in\mathbb{R}^{n}, together with (et+1)23e~t+1(e^{t+1})^{2}\leq 3\tilde{e}_{t+1}, which follows from (a+b+c)23(a2+b2+c2)(a+b+c)^{2}\leq 3(a^{2}+b^{2}+c^{2}). Combining Inequalities (38) and (39), we have: t=1Tβtet+1{Θ1Θ¯min(εy,εx,εz)3TβT}1/2=𝒪(T(1+p)/2)\sum_{t=1}^{T}\beta^{t}e^{t+1}\leq\textstyle\{\tfrac{\Theta^{1}-\underline{\rm{\Theta}}}{\min(\varepsilon_{y},\varepsilon_{x},\varepsilon_{z})}\cdot 3T\beta^{T}\}^{1/2}=\mathcal{O}(T^{(1+p)/2}).
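Remark (numerical sanity check, not part of the analysis). Step ① of Inequality (39) combines βt ≤ βT, the stated Cauchy-Schwarz-type bound, and (e^{t+1})² ≤ 3ẽ_{t+1}; the sketch below assumes numpy and that e^{t+1} denotes the unsquared sum of the three error terms whose squares form ẽ_{t+1}:

import numpy as np

rng = np.random.default_rng(5)
T = 50
beta = 1.0 + 0.5 * np.arange(1, T + 1) ** (1.0 / 3.0)  # an increasing penalty schedule
errs = rng.uniform(0.0, 1.0, size=(T, 3))              # three error terms per iteration
e = errs.sum(axis=1)                                   # e^{t+1}: unsquared sum
e_tilde = (errs ** 2).sum(axis=1)                      # tilde-e_{t+1}: sum of squares
lhs = float(np.sum(beta * e_tilde))
rhs = float(np.sum(beta * e)) ** 2 / (3.0 * T * beta[-1])
assert lhs >= rhs - 1e-9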

C.7 Proof of Theorem 4.8

Proof.

We define Crit(𝐗,𝐲,𝐳)𝒜(𝐗)𝐲+h(𝐲)𝐳+Proj𝐓𝐗(f(𝐗)g(𝐗)+𝒜𝖳(𝐳))𝖥\operatorname{Crit}(\mathbf{X},\mathbf{y},\mathbf{z})\triangleq\|\mathcal{A}(\mathbf{X})-\mathbf{y}\|+\|\partial h(\mathbf{y})-\mathbf{z}\|+\|\operatorname{Proj}_{\mathbf{T}_{\mathbf{X}}\mathcal{M}}(\nabla f(\mathbf{X})-\partial g(\mathbf{X})+\mathcal{A}^{\mathsf{T}}(\mathbf{z}))\|_{\mathsf{F}}.

We define 𝐆˙f(𝐗t)g(𝐗t)+𝒜𝖳(𝐳t)\dot{\mathbf{G}}\triangleq\nabla f(\mathbf{X}^{t})-\partial g(\mathbf{X}^{t})+\mathcal{A}^{\mathsf{T}}(\mathbf{z}^{t}).

We define 𝐆¨f(𝐗𝖼t)g(𝐗t)+𝒜𝖳(𝐳t+βt(𝒜(𝐗𝖼t)𝐲t))+θ(βt)(𝐗t+1𝐗𝖼t)\ddot{\mathbf{G}}\triangleq\nabla f(\mathbf{X}_{{\sf c}}^{t})-\partial g(\mathbf{X}^{t})+\mathcal{A}^{\mathsf{T}}(\mathbf{z}^{t}+\beta^{t}(\mathcal{A}(\mathbf{X}_{{\sf c}}^{t})-\mathbf{y}^{t}))+\theta\ell(\beta^{t})(\mathbf{X}^{t+1}-\mathbf{X}_{{\sf c}}^{t}).

We first derive the following inequalities:

𝐆¨𝐆˙𝖥\displaystyle~{}\|\ddot{\mathbf{G}}-\dot{\mathbf{G}}\|_{\mathsf{F}}
=\displaystyle\overset{\text{\char 172}}{=} f(𝐗t)f(𝐗𝖼t)βt𝒜𝖳(𝒜(𝐗𝖼t)𝐲t)θ(βt)(𝐗t+1𝐗𝖼t)𝖥\displaystyle~{}\|\nabla f(\mathbf{X}^{t})-\nabla f(\mathbf{X}_{{\sf c}}^{t})-\beta^{t}\mathcal{A}^{\mathsf{T}}(\mathcal{A}(\mathbf{X}_{{\sf c}}^{t})-\mathbf{y}^{t})-\theta\ell(\beta^{t})(\mathbf{X}^{t+1}-\mathbf{X}_{{\sf c}}^{t})\|_{\mathsf{F}}
\displaystyle\overset{\text{\char 173}}{\leq} Lf𝐗t𝐗𝖼t𝖥+βtA¯𝒜(𝐗𝖼t)𝐲t+θ(βt)𝐗t+1𝐗𝖼t𝖥\displaystyle~{}L_{f}\|\mathbf{X}^{t}-\mathbf{X}_{{\sf c}}^{t}\|_{\mathsf{F}}+\beta^{t}\overline{\rm{A}}\|\mathcal{A}(\mathbf{X}_{{\sf c}}^{t})-\mathbf{y}^{t}\|+\theta\ell(\beta^{t})\|\mathbf{X}^{t+1}-\mathbf{X}_{{\sf c}}^{t}\|_{\mathsf{F}}
\displaystyle\overset{\text{\char 174}}{\leq} Lf𝐗t𝐗t1𝖥+βtA¯{𝒜(𝐗t)𝐲t+A¯𝐗t𝐗t1𝖥}\displaystyle~{}L_{f}\|\mathbf{X}^{t}-\mathbf{X}^{t-1}\|_{\mathsf{F}}+\beta^{t}\overline{\rm{A}}\{\|\mathcal{A}(\mathbf{X}^{t})-\mathbf{y}^{t}\|+\overline{\rm{A}}\|\mathbf{X}^{t}-\mathbf{X}^{t-1}\|_{\mathsf{F}}\}
+θ(βt)(𝐗t+1𝐗t𝖥+𝐗t𝐗t1𝖥)\displaystyle~{}+\theta\ell(\beta^{t})(\|\mathbf{X}^{t+1}-\mathbf{X}^{t}\|_{\mathsf{F}}+\|\mathbf{X}^{t}-\mathbf{X}^{t-1}\|_{\mathsf{F}})
\displaystyle\overset{\text{\char 175}}{\leq} (Lf+βtA¯2+θ(βt))𝐗t𝐗t1𝖥+βtA¯𝒜(𝐗t)𝐲t+θ(βt)𝐗t+1𝐗t𝖥\displaystyle~{}(L_{f}+\beta^{t}\overline{\rm{A}}^{2}+\theta\ell(\beta^{t}))\|\mathbf{X}^{t}-\mathbf{X}^{t-1}\|_{\mathsf{F}}+\beta^{t}\overline{\rm{A}}\|\mathcal{A}(\mathbf{X}^{t})-\mathbf{y}^{t}\|+\theta\ell(\beta^{t})\|\mathbf{X}^{t+1}-\mathbf{X}^{t}\|_{\mathsf{F}}
=\displaystyle\overset{\text{\char 176}}{=} 𝒪(βt1et)+𝒪(βtet+1),\displaystyle~{}\mathcal{O}(\beta^{t-1}e^{t})+\mathcal{O}(\beta^{t}e^{t+1}), (40)

where step uses the definitions of {𝐆¨,𝐆˙}\{\ddot{\mathbf{G}},\dot{\mathbf{G}}\}; step uses the triangle inequality; step uses the fact that f(𝐗)f(\mathbf{X}) is LfL_{f}-smooth, 𝐗t𝐗𝖼t𝖥𝐗t𝐗t1𝖥\|\mathbf{X}^{t}-\mathbf{X}^{t}_{{\sf c}}\|_{\mathsf{F}}\leq\|\mathbf{X}^{t}-\mathbf{X}^{t-1}\|_{\mathsf{F}}, 𝐗t+1𝐗𝖼t𝖥𝐗t+1𝐗t𝖥+𝐗t𝐗t1𝖥\|\mathbf{X}^{t+1}-\mathbf{X}^{t}_{{\sf c}}\|_{\mathsf{F}}\leq\|\mathbf{X}^{t+1}-\mathbf{X}^{t}\|_{\mathsf{F}}+\|\mathbf{X}^{t}-\mathbf{X}^{t-1}\|_{\mathsf{F}}, and 𝒜(𝐗𝖼t)𝐲t𝒜(𝐗t)𝐲t𝖥+A¯𝐗t𝐗t1𝖥\|\mathcal{A}(\mathbf{X}_{{\sf c}}^{t})-\mathbf{y}^{t}\|\leq\|\mathcal{A}(\mathbf{X}^{t})-\mathbf{y}^{t}\|_{\mathsf{F}}+\overline{\rm{A}}\|\mathbf{X}^{t}-\mathbf{X}^{t-1}\|_{\mathsf{F}}, as shown in Lemma A.6; step collects the terms; step uses βt(1+ξ)βt1\beta^{t}\leq(1+\xi)\beta^{t-1}, (βt)βt¯\ell(\beta^{t})\leq\beta^{t}\overline{\rm{\ell}}, and the definition of ete^{t}.

We derive the following inequalities:

Proj𝐓𝐗t(𝐆˙)𝖥\displaystyle\|\operatorname{Proj}_{\mathbf{T}_{\mathbf{X}^{t}}\mathcal{M}}(\dot{\mathbf{G}})\|_{\mathsf{F}}
=\displaystyle\overset{\text{\char 172}}{=} Proj𝐓𝐗t(𝐆˙)+Proj𝐓𝐗t+1(𝐆¨)𝖥\displaystyle\|\operatorname{Proj}_{\mathbf{T}_{\mathbf{X}^{t}}\mathcal{M}}(\dot{\mathbf{G}})+\operatorname{Proj}_{\mathbf{T}_{\mathbf{X}^{t+1}}\mathcal{M}}(\ddot{\mathbf{G}})\|_{\mathsf{F}}
\displaystyle\overset{\text{\char 173}}{\leq} 2𝐆˙𝐆¨𝖥+2r𝐆˙𝐗t+1𝐗t𝖥\displaystyle 2\|\dot{\mathbf{G}}-\ddot{\mathbf{G}}\|_{\mathsf{F}}+2\sqrt{r}\|\dot{\mathbf{G}}\|\cdot\|\mathbf{X}^{t+1}-\mathbf{X}^{t}\|_{\mathsf{F}}
\displaystyle\overset{\text{\char 174}}{\leq} 𝒪(βt1et)+𝒪(βtet+1)+2r(Cf+Cg+A¯z¯)𝐗t+1𝐗t𝖥\displaystyle\mathcal{O}(\beta^{t-1}e^{t})+\mathcal{O}(\beta^{t}e^{t+1})+2\sqrt{r}(C_{f}+C_{g}+\overline{\rm{A}}\overline{\rm{z}})\|\mathbf{X}^{t+1}-\mathbf{X}^{t}\|_{\mathsf{F}}
=\displaystyle= 𝒪(βt1et)+𝒪(βtet+1),\displaystyle\mathcal{O}(\beta^{t-1}e^{t})+\mathcal{O}(\beta^{t}e^{t+1}),

where step uses the optimality of 𝐗t+1\mathbf{X}^{t+1} that:

𝟎=Proj𝐓𝐗t+1(𝐆¨);\displaystyle\mathbf{0}=\operatorname{Proj}_{\mathbf{T}_{\mathbf{X}^{t+1}}\mathcal{M}}(\ddot{\mathbf{G}});

step uses the result of Lemma A.7 by applying

𝐗=𝐗t,𝐗~=𝐗t+1,𝐏=𝐆˙,and𝐏~=𝐆¨;\displaystyle\mathbf{X}=\mathbf{X}^{t},~{}\tilde{\mathbf{X}}=\mathbf{X}^{t+1},~{}\mathbf{P}=\dot{\mathbf{G}},~{}\text{and}~{}\tilde{\mathbf{P}}=\ddot{\mathbf{G}};

step uses Inequality (40), and the fact that 𝐆˙=f(𝐗t)g(𝐗t)+𝒜𝖳(𝐳t)f(𝐗t)g(𝐗t)+𝒜𝖳(𝐳t)𝖥Cf+Cg+A¯z¯\|\dot{\mathbf{G}}\|=\|\nabla f(\mathbf{X}^{t})-\partial g(\mathbf{X}^{t})+\mathcal{A}^{\mathsf{T}}(\mathbf{z}^{t})\|\leq\|\nabla f(\mathbf{X}^{t})-\partial g(\mathbf{X}^{t})+\mathcal{A}^{\mathsf{T}}(\mathbf{z}^{t})\|_{\mathsf{F}}\leq C_{f}+C_{g}+\overline{\rm{A}}\overline{\rm{z}}.

Finally, we derive:

1Tt=1TCrit(𝐗t,𝐲˘t,𝐳t)\displaystyle\textstyle\tfrac{1}{T}\sum_{t=1}^{T}\operatorname{Crit}(\mathbf{X}^{t},\breve{\mathbf{y}}^{t},\mathbf{z}^{t})
=\displaystyle\overset{\text{\char 172}}{=} 1Tt=1T{𝒜(𝐗t)𝐲˘t+h(𝐲˘t)𝐳t+Proj𝐓𝐗t(𝐆˙)𝖥}\displaystyle\textstyle\tfrac{1}{T}\sum_{t=1}^{T}\{\|\mathcal{A}(\mathbf{X}^{t})-\breve{\mathbf{y}}^{t}\|+\|\partial h(\breve{\mathbf{y}}^{t})-\mathbf{z}^{t}\|+\|\operatorname{Proj}_{\mathbf{T}_{\mathbf{X}^{t}}\mathcal{M}}(\dot{\mathbf{G}})\|_{\mathsf{F}}\}
\displaystyle\overset{\text{\char 173}}{\leq} 1Tt=1T{𝒜(𝐗t)𝐲t+(11σ)(𝐳t𝐳t1)+Proj𝐓𝐗t(𝐆˙)𝖥}\displaystyle\textstyle\tfrac{1}{T}\sum_{t=1}^{T}\{\|\mathcal{A}(\mathbf{X}^{t})-\mathbf{y}^{t}\|+\|(1-\tfrac{1}{\sigma})(\mathbf{z}^{t}-\mathbf{z}^{t-1})\|+\|\operatorname{Proj}_{\mathbf{T}_{\mathbf{X}^{t}}\mathcal{M}}(\dot{\mathbf{G}})\|_{\mathsf{F}}\}
=\displaystyle\overset{\text{\char 174}}{=} 1Tt=1T{𝒪(βt1et)+𝒪(βtet+1)}\displaystyle\textstyle\tfrac{1}{T}\sum_{t=1}^{T}\{\mathcal{O}(\beta^{t-1}e^{t})+\mathcal{O}(\beta^{t}e^{t+1})\}
=\displaystyle\overset{\text{\char 175}}{=} 𝒪(T(p1)/2)=𝒪(T1/3),\displaystyle\textstyle\mathcal{O}(T^{(p-1)/2})=\mathcal{O}(T^{-1/3}),

where step uses the definition of Crit(𝐗,𝐲,𝐳)\operatorname{Crit}(\mathbf{X},\mathbf{y},\mathbf{z}); step uses 𝐳t+1h(𝐲˘t+1)(11σ)(𝐳t+1𝐳t)\mathbf{z}^{t+1}-\partial h(\breve{\mathbf{y}}^{t+1})\ni(1-\tfrac{1}{\sigma})(\mathbf{z}^{t+1}-\mathbf{z}^{t}), as shown in Lemma 4.1, together with 𝒜(𝐗t)𝐲˘t𝒜(𝐗t)𝐲t+𝐲t𝐲˘t\|\mathcal{A}(\mathbf{X}^{t})-\breve{\mathbf{y}}^{t}\|\leq\|\mathcal{A}(\mathbf{X}^{t})-\mathbf{y}^{t}\|+\|\mathbf{y}^{t}-\breve{\mathbf{y}}^{t}\|, where 𝐲t𝐲˘tμt1Ch\|\mathbf{y}^{t}-\breve{\mathbf{y}}^{t}\|\leq\mu^{t-1}C_{h} by Lemma 2.5(c), whose average over t=1,,Tt=1,\ldots,T is 𝒪(Tp)\mathcal{O}(T^{-p}) and is absorbed into the final rate; step uses 𝐳t𝐳t1=σβt1(𝒜(𝐗t)𝐲t)2βt𝒜(𝐗t)𝐲t=𝒪(βt1et)\|\mathbf{z}^{t}-\mathbf{z}^{t-1}\|=\|\sigma\beta^{t-1}(\mathcal{A}(\mathbf{X}^{t})-\mathbf{y}^{t})\|\leq 2\beta^{t}\|\mathcal{A}(\mathbf{X}^{t})-\mathbf{y}^{t}\|=\mathcal{O}(\beta^{t-1}e^{t}); step uses the choice p=1/3p=1/3 and Lemma 4.7(b).

C.8 Proof of Lemma 4.10

Proof.

We define 𝒮(𝐗,𝐲t;𝐳t;βt)f(𝐗)+𝐳t,𝒜(𝐗)𝐲t+βt2𝒜(𝐗)𝐲t22\mathcal{S}(\mathbf{X},\mathbf{y}^{t};\mathbf{z}^{t};\beta^{t})\triangleq f(\mathbf{X})+\langle\mathbf{z}^{t},\mathcal{A}(\mathbf{X})-\mathbf{y}^{t}\rangle+\tfrac{\beta^{t}}{2}\|\mathcal{A}(\mathbf{X})-\mathbf{y}^{t}\|_{2}^{2}.

We let 𝐆t𝐗𝒮(𝐗t,𝐲t;𝐳t;βt)g(𝐗t)\mathbf{G}^{t}\in\nabla_{\mathbf{X}}\mathcal{S}(\mathbf{X}^{t},\mathbf{y}^{t};\mathbf{z}^{t};\beta^{t})-\partial g(\mathbf{X}^{t}).

We define ηtbtγjβt(0,)\eta^{t}\triangleq\tfrac{b^{t}\gamma^{j}}{\beta^{t}}\in(0,\infty).

Part (a). We first show that 𝐆t𝖥\|\mathbf{G}^{t}\|_{\mathsf{F}} is bounded for all tt, given 𝐗t\mathbf{X}^{t}\in\mathcal{M}. We have:

𝐆t𝖥\displaystyle\|\mathbf{G}^{t}\|_{\mathsf{F}} =\displaystyle= f(𝐗t)g(𝐗t)+𝒜𝖳[𝐳t+βt(𝒜(𝐗t)𝐲t)]𝖥\displaystyle\|\nabla f(\mathbf{X}^{t})-\partial g(\mathbf{X}^{t})+\mathcal{A}^{\mathsf{T}}[\mathbf{z}^{t}+\beta^{t}(\mathcal{A}(\mathbf{X}^{t})-\mathbf{y}^{t})]\|_{\mathsf{F}}
=\displaystyle\overset{\text{\char 172}}{=} f(𝐗t)g(𝐗t)+𝒜𝖳[𝐳t+βtσβt1(𝐳t𝐳t1)]𝖥\displaystyle\|\nabla f(\mathbf{X}^{t})-\partial g(\mathbf{X}^{t})+\mathcal{A}^{\mathsf{T}}[\mathbf{z}^{t}+\tfrac{\beta^{t}}{\sigma\beta^{t-1}}(\mathbf{z}^{t}-\mathbf{z}^{t-1})]\|_{\mathsf{F}}
\displaystyle\overset{\text{\char 173}}{\leq} f(𝐗t)𝖥+g(𝐗t)𝖥+A¯{𝐳t+βtσβt1(𝐳t+𝐳t1)}\displaystyle\|\nabla f(\mathbf{X}^{t})\|_{\mathsf{F}}+\|\partial g(\mathbf{X}^{t})\|_{\mathsf{F}}+\overline{\rm{A}}\cdot\{\|\mathbf{z}^{t}\|+\tfrac{\beta^{t}}{\sigma\beta^{t-1}}(\|\mathbf{z}^{t}\|+\|\mathbf{z}^{t-1}\|)\}
\displaystyle\overset{\text{\char 174}}{\leq} Cf+Cg+A¯(z¯+2(1+ξ)z¯)g¯,\displaystyle C_{f}+C_{g}+\overline{\rm{A}}\cdot(\overline{\rm{z}}+2(1+\xi)\overline{\rm{z}})\triangleq\overline{g},

where step uses 𝐳t+1=𝐳t+σβt(𝒜(𝐗t+1)𝐲t+1)\mathbf{z}^{t+1}=\mathbf{z}^{t}+\sigma\beta^{t}(\mathcal{A}(\mathbf{X}^{t+1})-\mathbf{y}^{t+1}); step uses the triangle inequality; step uses f(𝐗t)𝖥Cf\|\nabla f(\mathbf{X}^{t})\|_{\mathsf{F}}\leq C_{f}, g(𝐗t)𝖥Cg\|\partial g(\mathbf{X}^{t})\|_{\mathsf{F}}\leq C_{g}, the operator norm bound 𝒜𝖳(𝐳)𝖥A¯𝐳\|\mathcal{A}^{\mathsf{T}}(\mathbf{z})\|_{\mathsf{F}}\leq\overline{\rm{A}}\|\mathbf{z}\|, 𝐳tz¯\|\mathbf{z}^{t}\|\leq\overline{\rm{z}}, 1σ1\tfrac{1}{\sigma}\leq 1, and βtβt1(1+ξ)\beta^{t}\leq\beta^{t-1}(1+\xi).

We derive the following inequalities:

L(𝐗t+1,𝐲t;𝐳t;βt,μt)L(𝐗t,𝐲t;𝐳t;βt,μt)=˙(𝐗t+1)˙(𝐗t)\displaystyle~{}L(\mathbf{X}^{t+1},\mathbf{y}^{t};\mathbf{z}^{t};\beta^{t},\mu^{t})-L(\mathbf{X}^{t},\mathbf{y}^{t};\mathbf{z}^{t};\beta^{t},\mu^{t})=\dot{\mathcal{L}}(\mathbf{X}^{t+1})-\dot{\mathcal{L}}(\mathbf{X}^{t})
=\displaystyle\overset{\text{\char 172}}{=} {𝒮t(𝐗t+1,𝐲t;𝐳t;βt)g(𝐗t+1)}{𝒮t(𝐗t,𝐲t;𝐳t;βt)g(𝐗t)}\displaystyle~{}\{\mathcal{S}^{t}(\mathbf{X}^{t+1},\mathbf{y}^{t};\mathbf{z}^{t};\beta^{t})-g(\mathbf{X}^{t+1})\}-\{\mathcal{S}^{t}(\mathbf{X}^{t},\mathbf{y}^{t};\mathbf{z}^{t};\beta^{t})-g(\mathbf{X}^{t})\}
\displaystyle\overset{\text{\char 173}}{\leq} 12(βt)𝐗t+1𝐗t𝖥2+𝐆t,𝐗t+1𝐗t\displaystyle~{}\tfrac{1}{2}\ell(\beta^{t})\|\mathbf{X}^{t+1}-\mathbf{X}^{t}\|_{\mathsf{F}}^{2}+\langle\mathbf{G}^{t},\mathbf{X}^{t+1}-\mathbf{X}^{t}\rangle
=\displaystyle\overset{\text{\char 174}}{=} 12(βt)Retr𝐗t(ηt𝔾ρt)𝐗t𝖥2+𝐆t,Retr𝐗t(ηt𝔾ρt)𝐗t+ηt𝔾ρtηt𝐆t,𝔾ρt\displaystyle~{}\tfrac{1}{2}\ell(\beta^{t})\|\operatorname{Retr}_{\mathbf{X}^{t}}(-\eta^{t}\mathbb{G}^{t}_{\rho})-\mathbf{X}^{t}\|_{\mathsf{F}}^{2}+\langle\mathbf{G}^{t},\operatorname{Retr}_{\mathbf{X}^{t}}(-\eta^{t}\mathbb{G}^{t}_{\rho})-\mathbf{X}^{t}+\eta^{t}\mathbb{G}^{t}_{\rho}\rangle-\eta^{t}\langle\mathbf{G}^{t},\mathbb{G}^{t}_{\rho}\rangle
\displaystyle\overset{\text{\char 175}}{\leq} 12(βt)Retr𝐗t(ηt𝔾ρt)𝐗t𝖥2+g¯Retr𝐗t(ηt𝔾ρt)𝐗t+ηt𝔾ρt𝖥ηtmax(1,2ρ)𝔾ρt𝖥2\displaystyle~{}\tfrac{1}{2}\ell(\beta^{t})\|\operatorname{Retr}_{\mathbf{X}^{t}}(-\eta^{t}\mathbb{G}^{t}_{\rho})-\mathbf{X}^{t}\|_{\mathsf{F}}^{2}+\overline{g}\|\operatorname{Retr}_{\mathbf{X}^{t}}(-\eta^{t}\mathbb{G}^{t}_{\rho})-\mathbf{X}^{t}+\eta^{t}\mathbb{G}^{t}_{\rho}\|_{\mathsf{F}}-\tfrac{\eta^{t}}{\max(1,2\rho)}\|\mathbb{G}^{t}_{\rho}\|_{\mathsf{F}}^{2}
\displaystyle\overset{\text{\char 176}}{\leq} 12(βt)k˙ηt𝔾ρt𝖥2+12g¯k¨ηt𝔾ρt𝖥2ηtmax(1,2ρ)𝔾ρt𝖥2\displaystyle~{}\tfrac{1}{2}\ell(\beta^{t})\dot{k}\|\eta^{t}\mathbb{G}^{t}_{\rho}\|_{\mathsf{F}}^{2}+\tfrac{1}{2}\overline{g}\ddot{k}\|\eta^{t}\mathbb{G}^{t}_{\rho}\|_{\mathsf{F}}^{2}-\tfrac{\eta^{t}}{\max(1,2\rho)}\|\mathbb{G}^{t}_{\rho}\|_{\mathsf{F}}^{2}
=\displaystyle\overset{\text{\char 177}}{=} ηt𝔾ρt𝖥2{12(βt)k˙btγjβt+12g¯k¨btγjβt1max(1,2ρ)}\displaystyle~{}\eta^{t}\|\mathbb{G}^{t}_{\rho}\|_{\mathsf{F}}^{2}\cdot\{\tfrac{1}{2}\ell(\beta^{t})\dot{k}\tfrac{b^{t}\gamma^{j}}{\beta^{t}}+\tfrac{1}{2}\overline{g}\ddot{k}\tfrac{b^{t}\gamma^{j}}{\beta^{t}}-\tfrac{1}{\max(1,2\rho)}\}
\displaystyle\overset{\text{\char 178}}{\leq} ηt𝔾ρt𝖥2{(b¯2k˙¯+b¯2β0k¨g¯)γj1max(1,2ρ)}\displaystyle~{}\eta^{t}\|\mathbb{G}^{t}_{\rho}\|_{\mathsf{F}}^{2}\cdot\{(\tfrac{\overline{b}}{2}\dot{k}\overline{\rm{\ell}}+\tfrac{\overline{b}}{2\beta^{0}}\ddot{k}\overline{g})\gamma^{j}-\tfrac{1}{\max(1,2\rho)}\}
\displaystyle\overset{\text{\char 179}}{\leq} ηt𝔾ρt𝖥2{δ},\displaystyle~{}\eta^{t}\|\mathbb{G}^{t}_{\rho}\|_{\mathsf{F}}^{2}\cdot\{-\delta\}, (41)

where step ① uses the definition of L(𝐗,𝐲;𝐳;β,μ); step ② uses the fact that the function g(𝐗) is convex and the function 𝒮(𝐗,𝐲^t;𝐳^t;β^t) is ℓ(β^t)-smooth w.r.t. 𝐗; step ③ uses 𝐗^{t+1} = Retr_{𝐗^t}(−η^t𝔾^t_ρ); step ④ uses the Cauchy-Schwarz inequality, ‖𝐆^t‖_𝖥 ≤ ḡ, and Lemma 2.12(a), which gives ⟨𝐆^t, 𝔾^t_ρ⟩ ≥ (1/max(1,2ρ))‖𝔾^t_ρ‖²_𝖥; step ⑤ uses Lemma 2.10 with 𝚫 ≜ −η^t𝔾^t_ρ, given that 𝐗^t ∈ ℳ and 𝚫 ∈ 𝐓_{𝐗^t}ℳ; step ⑥ uses η^t ≜ b^tγ^j/β^t; step ⑦ uses ℓ(β^t) ≤ β^tℓ̄, β^0 ≤ β^t, and b^t ≤ b̄; step ⑧ uses the fact that γ^j is sufficiently small such that:

γj2(1max(1,2ρ)δ)¯k˙b¯+g¯k¨b¯/β0γ¯.\displaystyle\gamma^{j}\leq\frac{2(\tfrac{1}{\max(1,2\rho)}-\delta)}{\overline{\rm{\ell}}\dot{k}\overline{b}+\overline{g}\ddot{k}\overline{b}/\beta^{0}}\triangleq\overline{\gamma}. (42)

Since Inequality (41) coincides with the condition of the line search procedure, we complete the proof.
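To make the line search procedure concrete, the following minimal numpy sketch mimics the backtracking rule certified by Part (a): the exponent j is increased until the sufficient-decrease condition L(𝐗^{t+1}) − L(𝐗^t) ≤ −δη^t‖𝔾^t_ρ‖²_𝖥 holds. The QR-based retraction and the callable L_eval (evaluating 𝐗 ↦ L(𝐗,𝐲^t;𝐳^t;β^t,μ^t)) are illustrative assumptions of this sketch, not the paper's implementation.

```python
import numpy as np

def qf(M):
    # QR decomposition with the sign convention diag(R) > 0, a standard retraction choice.
    Q, R = np.linalg.qr(M)
    return Q * np.sign(np.sign(np.diag(R)) + 0.5)

def x_update_line_search(X, G_rho, L_eval, beta, b=1.0, gamma=0.5, delta=1e-4, max_j=50):
    # Backtracking over j: accept X+ = Retr_X(-eta * G_rho) with eta = b * gamma**j / beta
    # once L(X+) - L(X) <= -delta * eta * ||G_rho||_F^2, the condition certified in Part (a).
    L0, g2 = L_eval(X), np.linalg.norm(G_rho) ** 2
    for j in range(max_j):
        eta = b * gamma ** j / beta
        X_new = qf(X - eta * G_rho)          # Retr_X(-eta * G_rho) via QR retraction
        if L_eval(X_new) - L0 <= -delta * eta * g2:
            return X_new, eta
    return X, 0.0  # unreachable once gamma**j <= gamma_bar, by Part (a)
```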

Part (b). We derive the following inequalities:

L(𝐗t+1,𝐲t;𝐳t;βt,μt)L(𝐗t,𝐲t;𝐳t;βt,μt)\displaystyle L(\mathbf{X}^{t+1},\mathbf{y}^{t};\mathbf{z}^{t};\beta^{t},\mu^{t})-L(\mathbf{X}^{t},\mathbf{y}^{t};\mathbf{z}^{t};\beta^{t},\mu^{t})
\displaystyle\overset{\text{\char 172}}{\leq} 𝔾ρt𝖥2δηt\displaystyle-\|\mathbb{G}^{t}_{\rho}\|_{\mathsf{F}}^{2}\delta{\eta}^{t}
\displaystyle\overset{\text{\char 173}}{\leq} 𝔾1/2t𝖥2δηtmin(1,2ρ)2\displaystyle-\|\mathbb{G}^{t}_{1/2}\|^{2}_{\mathsf{F}}\delta{\eta}^{t}\cdot\min(1,2\rho)^{2}
=\displaystyle\overset{\text{\char 174}}{=} 1βt𝔾1/2t𝖥2δbtγj1γmin(1,2ρ)2\displaystyle-\tfrac{1}{\beta^{t}}\|\mathbb{G}^{t}_{1/2}\|^{2}_{\mathsf{F}}\cdot\delta b^{t}\gamma^{j-1}\gamma\cdot\min(1,2\rho)^{2}
\displaystyle\overset{\text{\char 175}}{\leq} 1βt𝔾1/2t𝖥2δb¯γ¯γmin(1,2ρ)2εx,\displaystyle-\tfrac{1}{\beta^{t}}\|\mathbb{G}^{t}_{1/2}\|^{2}_{\mathsf{F}}\cdot\underbrace{\delta\underline{b}\overline{\gamma}\gamma\cdot\min(1,2\rho)^{2}}_{\triangleq\varepsilon_{x}},

where step ① uses Inequality (41); step ② uses Lemma 2.12(b), which gives ‖𝔾_ρ‖_𝖥 ≥ min(1,2ρ)‖𝔾_{1/2}‖_𝖥; step ③ uses the definition η^t ≜ b^tγ^j/β^t; step ④ uses b^t ≥ b̲ and the following inequality:

γj1γ¯γj,\displaystyle\gamma^{j-1}\geq\overline{\gamma}\geq\gamma^{j},

which is implied by the stopping criterion of the line search procedure.

C.9 Proof of Lemma 4.12

Proof.

We define: Θ^t ≜ L(𝐗^t,𝐲^t;𝐳^t;β^t,μ^{t−1}) + μ^{t−1}C_h² + 𝕋^t + ℤ^t + 0×𝕏^t.

We define ẽ_{t+1} ≜ ‖𝐲^{t+1} − 𝐲^t‖² + ‖𝒜(𝐗^{t+1}) − 𝐲^{t+1}‖² + ‖(1/β^t)𝔾^t_{1/2}‖²_𝖥, matching its use below.

Part (a). Using Lemma 4.5, we have:

L(𝐗t+1,𝐲t+1;𝐳t+1;βt+1,μt)L(𝐗t+1,𝐲t;𝐳t;βt,μt1)(μt1μt)Ch2\displaystyle L(\mathbf{X}^{t+1},\mathbf{y}^{t+1};\mathbf{z}^{t+1};\beta^{t+1},\mu^{t})-L(\mathbf{X}^{t+1},\mathbf{y}^{t};\mathbf{z}^{t};\beta^{t},\mu^{t-1})-(\mu^{t-1}-\mu^{t})C_{h}^{2} (43)
\displaystyle\leq 𝕋t𝕋t+1+tt+1εyβt𝐲t+1𝐲t22εzβt𝒜(𝐗t+1)𝐲t+122.\displaystyle\mathbb{T}^{t}-\mathbb{T}^{t+1}+\mathbb{Z}^{t}-\mathbb{Z}^{t+1}-\varepsilon_{y}\beta^{t}\|\mathbf{y}^{t+1}-\mathbf{y}^{t}\|_{2}^{2}-\varepsilon_{z}\beta^{t}\|\mathcal{A}(\mathbf{X}^{t+1})-\mathbf{y}^{t+1}\|_{2}^{2}.

Using Lemma 4.10, we have:

L(𝐗t+1,𝐲t,𝐳t;βt,μt1)L(𝐗t,𝐲t,𝐳t;βt,μt1)0×𝕏t0×𝕏t+1εxβt1βt𝔾1/2t𝖥2.\displaystyle L(\mathbf{X}^{t+1},\mathbf{y}^{t},\mathbf{z}^{t};\beta^{t},\mu^{t-1})-L(\mathbf{X}^{t},\mathbf{y}^{t},\mathbf{z}^{t};\beta^{t},\mu^{t-1})\leq 0\times\mathbb{X}^{t}-0\times\mathbb{X}^{t+1}-\varepsilon_{x}\beta^{t}\|\tfrac{1}{\beta^{t}}\mathbb{G}^{t}_{1/2}\|^{2}_{\mathsf{F}}.

Adding these two inequalities together and using the definition of Θt\Theta^{t}, we have:

ΘtΘt+1\displaystyle\Theta^{t}-\Theta^{t+1} \displaystyle\geq εyβt𝐲t+1𝐲t22+εxβt1βt𝔾1/2t𝖥2+εzβt𝒜(𝐗t+1)𝐲t+122\displaystyle\varepsilon_{y}\beta^{t}\|\mathbf{y}^{t+1}-\mathbf{y}^{t}\|_{2}^{2}+\varepsilon_{x}\beta^{t}\|\tfrac{1}{\beta^{t}}\mathbb{G}^{t}_{1/2}\|^{2}_{\mathsf{F}}+\varepsilon_{z}\beta^{t}\|\mathcal{A}(\mathbf{X}^{t+1})-\mathbf{y}^{t+1}\|_{2}^{2}
\displaystyle\geq min(εy,εx,εz)βte~t+1.\displaystyle\min(\varepsilon_{y},\varepsilon_{x},\varepsilon_{z})\cdot\beta^{t}\cdot\tilde{e}_{t+1}.

Part (b). Using the same strategy as in the proof of Lemma 4.7(b), we finish the proof.

C.10 Proof of Theorem 4.13

Proof.

We define Crit(𝐗,𝐲,𝐳)𝒜(𝐗)𝐲+h(𝐲)𝐳+Proj𝐓𝐗(f(𝐗)g(𝐗)+𝒜𝖳(𝐳))𝖥\operatorname{Crit}(\mathbf{X},\mathbf{y},\mathbf{z})\triangleq\|\mathcal{A}(\mathbf{X})-\mathbf{y}\|+\|\partial h(\mathbf{y})-\mathbf{z}\|+\|\operatorname{Proj}_{\mathbf{T}_{\mathbf{X}}\mathcal{M}}(\nabla f(\mathbf{X})-\partial g(\mathbf{X})+\mathcal{A}^{\mathsf{T}}(\mathbf{z}))\|_{\mathsf{F}}.

We define 𝐆˙f(𝐗t)g(𝐗t)+𝒜𝖳(𝐳t)\dot{\mathbf{G}}\triangleq\nabla f(\mathbf{X}^{t})-\partial g(\mathbf{X}^{t})+\mathcal{A}^{\mathsf{T}}(\mathbf{z}^{t}), and 𝐆¨βt𝒜𝖳(𝒜(𝐗t)𝐲t)\ddot{\mathbf{G}}\triangleq\beta^{t}\mathcal{A}^{\mathsf{T}}(\mathcal{A}(\mathbf{X}^{t})-\mathbf{y}^{t}).

We let 𝐆=𝐆t𝐗L(𝐗t,𝐲t;𝐳t;βt,μt)\mathbf{G}=\mathbf{G}^{t}\in\partial_{\mathbf{X}}L(\mathbf{X}^{t},\mathbf{y}^{t};\mathbf{z}^{t};\beta^{t},\mu^{t}).

First, we obtain:

𝔾1/2t\displaystyle\mathbb{G}^{t}_{1/2} =\displaystyle\overset{\text{\char 172}}{=} 𝐆12𝐗t𝐆𝖳𝐗t12𝐗t[𝐗t]𝖳𝐆\displaystyle\mathbf{G}-\tfrac{1}{2}\mathbf{X}^{t}\mathbf{G}^{\mathsf{T}}\mathbf{X}^{t}-\tfrac{1}{2}\mathbf{X}^{t}[\mathbf{X}^{t}]^{\mathsf{T}}\mathbf{G}
=\displaystyle\overset{\text{\char 173}}{=} (𝐆˙12𝐗t𝐆˙𝖳𝐗t12𝐗t[𝐗t]𝖳𝐆˙)+(𝐆¨12𝐗t𝐆¨𝖳𝐗t12𝐗t[𝐗t]𝖳𝐆¨)\displaystyle(\dot{\mathbf{G}}-\tfrac{1}{2}\mathbf{X}^{t}\dot{\mathbf{G}}^{\mathsf{T}}\mathbf{X}^{t}-\tfrac{1}{2}\mathbf{X}^{t}[\mathbf{X}^{t}]^{\mathsf{T}}\dot{\mathbf{G}})+(\ddot{\mathbf{G}}-\tfrac{1}{2}\mathbf{X}^{t}\ddot{\mathbf{G}}^{\mathsf{T}}\mathbf{X}^{t}-\tfrac{1}{2}\mathbf{X}^{t}[\mathbf{X}^{t}]^{\mathsf{T}}\ddot{\mathbf{G}})
=\displaystyle\overset{\text{\char 174}}{=} Proj𝐓𝐗t(𝐆˙)+Proj𝐓𝐗t(𝐆¨)\displaystyle\operatorname{Proj}_{\mathbf{T}_{\mathbf{X}^{t}}\mathcal{M}}(\dot{\mathbf{G}})+\operatorname{Proj}_{\mathbf{T}_{\mathbf{X}^{t}}\mathcal{M}}(\ddot{\mathbf{G}})

where step ① uses the definition 𝔾^t_ρ ≜ 𝐆 − ρ𝐗^t𝐆^𝖳𝐗^t − (1−ρ)𝐗^t[𝐗^t]^𝖳𝐆, as shown in Algorithm 1; step ② uses 𝐆 ∈ 𝐆̇ + 𝐆̈; step ③ uses the fact that Proj_{𝐓_𝐗ℳ}(𝚫) = 𝚫 − ½𝐗(𝚫^𝖳𝐗 + 𝐗^𝖳𝚫) for all 𝚫 ∈ ℝ^{n×r} Absil et al. (2008a). This leads to:

Proj𝐓𝐗t(𝐆˙)𝖥\displaystyle\|\operatorname{Proj}_{\mathbf{T}_{\mathbf{X}^{t}}\mathcal{M}}(\dot{\mathbf{G}})\|_{\mathsf{F}} =\displaystyle= 𝔾1/2tProj𝐓𝐗t(𝐆¨)𝖥\displaystyle\|\mathbb{G}^{t}_{1/2}-\operatorname{Proj}_{\mathbf{T}_{\mathbf{X}^{t}}\mathcal{M}}(\ddot{\mathbf{G}})\|_{\mathsf{F}}
\displaystyle\overset{\text{\char 172}}{\leq} 𝔾1/2t𝖥+Proj𝐓𝐗t(𝐆¨)𝖥\displaystyle\|\mathbb{G}^{t}_{1/2}\|_{\mathsf{F}}+\|\operatorname{Proj}_{\mathbf{T}_{\mathbf{X}^{t}}\mathcal{M}}(\ddot{\mathbf{G}})\|_{\mathsf{F}}
\displaystyle\overset{\text{\char 173}}{\leq} 𝔾1/2t𝖥+𝐆¨𝖥\displaystyle\|\mathbb{G}^{t}_{1/2}\|_{\mathsf{F}}+\|\ddot{\mathbf{G}}\|_{\mathsf{F}}
\displaystyle\overset{}{\leq} 𝔾1/2t𝖥+βtA¯𝒜(𝐗t)𝐲t\displaystyle\|\mathbb{G}^{t}_{1/2}\|_{\mathsf{F}}+\beta^{t}\overline{\rm{A}}\|\mathcal{A}(\mathbf{X}^{t})-\mathbf{y}^{t}\|
\displaystyle\overset{}{\leq} βtet+1+𝒪(βt1et),\displaystyle\beta^{t}e^{t+1}+\mathcal{O}(\beta^{t-1}e^{t}),

where step ① uses the triangle inequality; step ② uses Lemma 2.11, which gives ‖Proj_{𝐓_𝐗ℳ}(𝚫)‖_𝖥 ≤ ‖𝚫‖_𝖥 for all 𝚫 ∈ ℝ^{n×r}; the last two steps use the definition of 𝐆̈ together with ‖𝒜^𝖳(𝜹)‖_𝖥 ≤ Ā‖𝜹‖, the definition of e^t, and β^t ≤ (1+ξ)β^{t−1}.
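As a quick numerical sanity check of the identity 𝔾^t_{1/2} = Proj_{𝐓_{𝐗^t}ℳ}(𝐆) used in step ③ of the previous display, the two expressions can be compared directly (a numpy sketch under the stated definitions):

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 8, 3
X, _ = np.linalg.qr(rng.standard_normal((n, r)))  # X'X = I: a point on the Stiefel manifold
G = rng.standard_normal((n, r))                   # an arbitrary (sub)gradient matrix

G_half = G - 0.5 * X @ G.T @ X - 0.5 * X @ (X.T @ G)  # G_{1/2} from Algorithm 1 (rho = 1/2)
proj = G - 0.5 * X @ (G.T @ X + X.T @ G)              # Proj_{T_X M}(G), Absil et al. (2008a)

print(np.linalg.norm(G_half - proj))  # ~1e-16: the two expressions coincide
```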

Finally, we derive:

1Tt=1TCrit(𝐗t,𝐲˘t,𝐳t)\displaystyle\textstyle\tfrac{1}{T}\sum_{t=1}^{T}\operatorname{Crit}(\mathbf{X}^{t},\breve{\mathbf{y}}^{t},\mathbf{z}^{t})
=\displaystyle\overset{\text{\char 172}}{=} 1Tt=1T{𝒜(𝐗t)𝐲˘t+h(𝐲˘t)𝐳t+Proj𝐓𝐗t(𝐆˙)𝖥}\displaystyle\textstyle\tfrac{1}{T}\sum_{t=1}^{T}\{\|\mathcal{A}(\mathbf{X}^{t})-\breve{\mathbf{y}}^{t}\|+\|\partial h(\breve{\mathbf{y}}^{t})-\mathbf{z}^{t}\|+\|\operatorname{Proj}_{\mathbf{T}_{\mathbf{X}^{t}}\mathcal{M}}(\dot{\mathbf{G}})\|_{\mathsf{F}}\}
\displaystyle\overset{\text{\char 173}}{\leq} 1Tt=1T{𝒜(𝐗t)𝐲t+(11σ)(𝐳t𝐳t1)+Proj𝐓𝐗t(𝐆˙)𝖥}\displaystyle\textstyle\tfrac{1}{T}\sum_{t=1}^{T}\{\|\mathcal{A}(\mathbf{X}^{t})-\mathbf{y}^{t}\|+\|(1-\tfrac{1}{\sigma})(\mathbf{z}^{t}-\mathbf{z}^{t-1})\|+\|\operatorname{Proj}_{\mathbf{T}_{\mathbf{X}^{t}}\mathcal{M}}(\dot{\mathbf{G}})\|_{\mathsf{F}}\}
=\displaystyle\overset{\text{\char 174}}{=} 1Tt=1T{𝒪(βtet+1)+𝒪(βt1et)}\displaystyle\textstyle\tfrac{1}{T}\sum_{t=1}^{T}\{\mathcal{O}(\beta^{t}e^{t+1})+\mathcal{O}(\beta^{t-1}e^{t})\}
=\displaystyle\overset{\text{\char 175}}{=} 𝒪(T(p1)/2)=𝒪(T1/3),\displaystyle\textstyle\mathcal{O}(T^{(p-1)/2})=\mathcal{O}(T^{-1/3}),

where step ① uses the definition of Crit(𝐗,𝐲,𝐳); step ② uses 𝐳^{t+1} − ∂h(𝐲̆^{t+1}) ∋ (1 − 1/σ)(𝐳^{t+1} − 𝐳^t), as shown in Lemma 4.1; step ③ uses ‖𝐳^t − 𝐳^{t−1}‖ = ‖σβ^{t−1}(𝒜(𝐗^t) − 𝐲^t)‖ ≤ 2β^t‖𝒜(𝐗^t) − 𝐲^t‖ = 𝒪(β^{t−1}e^t); step ④ uses the choice p = 1/3 and Lemma 4.7(b).
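For monitoring purposes, the ergodic bound of Theorem 4.13 can be checked by averaging the three residuals in Crit over the iterates; the following sketch assumes a matrix representation A ∈ ℝ^{m×nr} of 𝒜 acting on the (row-major) vectorization of 𝐗, and single-valued subgradient selections for g and h — both assumptions of this illustration, not part of the theorem.

```python
import numpy as np

def crit(X, y, z, A, grad_f, sub_g, sub_h):
    # Crit(X, y, z) with fixed subgradient selections (an assumption of this sketch).
    r1 = np.linalg.norm(A @ X.reshape(-1) - y)            # ||A(X) - y||
    r2 = np.linalg.norm(sub_h(y) - z)                     # ||dh(y) - z||
    D = grad_f(X) - sub_g(X) + (A.T @ z).reshape(X.shape)
    r3 = np.linalg.norm(D - 0.5 * X @ (D.T @ X + X.T @ D))  # projection onto T_X M
    return r1 + r2 + r3

# Ergodic average over iterates collected during a run; Theorem 4.13 bounds it by O(T^{-1/3}):
# avg_crit = np.mean([crit(X, y, z, A, grad_f, sub_g, sub_h) for (X, y, z) in iterates])
```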

Appendix D Proofs for Section 5

D.1 Proof of Lemma 5.4

We begin by presenting three useful lemmas, and then restate and prove the main result as Lemma D.4.

Lemma D.1.

For both OADMM-EP and OADMM-RR, we have:

(𝐝𝐗,𝐝𝐗,𝐝𝐲,𝐝𝐳)Θ(𝐗t,𝐗t1,𝐲t,𝐳t;βt,βt1,μt1,t),\displaystyle(\mathbf{d}_{\mathbf{X}},\mathbf{d}_{\mathbf{X}^{-}},\mathbf{d}_{\mathbf{y}},\mathbf{d}_{\mathbf{z}})\in\partial\Theta(\mathbf{X}^{t},\mathbf{X}^{t-1},\mathbf{y}^{t},\mathbf{z}^{t};\beta^{t},\beta^{t-1},\mu^{t-1},t), (44)

where 𝐝𝐗𝔸t+{βt+2ωσ¨σ2βt1}𝒜𝖳(𝒜(𝐗t)𝐲t)+α(θ+1)(βt)(𝐗t𝐗t1)\mathbf{d}_{\mathbf{X}}\triangleq\mathbb{A}^{t}+\{\beta^{t}+2\omega\ddot{\sigma}\sigma^{2}\beta^{t-1}\}\cdot\mathcal{A}^{\mathsf{T}}(\mathcal{A}(\mathbf{X}^{t})-\mathbf{y}^{t})+\alpha(\theta+1)\ell(\beta^{t})(\mathbf{X}^{t}-\mathbf{X}^{t-1}), 𝐝𝐗α(θ+1)(βt)(𝐗t1𝐗t)\mathbf{d}_{\mathbf{X}^{-}}\triangleq\alpha(\theta+1)\ell(\beta^{t})(\mathbf{X}^{t-1}-\mathbf{X}^{t}), 𝐝𝐲hμt1(𝐲t)𝐳t+(𝐲t𝒜(𝐗t))(βt+2ωσ¨σ2βt1)\mathbf{d}_{\mathbf{y}}\triangleq\nabla h_{\mu^{t-1}}(\mathbf{y}^{t})-\mathbf{z}^{t}+(\mathbf{y}^{t}-\mathcal{A}(\mathbf{X}^{t}))\cdot(\beta^{t}+2\omega\ddot{\sigma}\sigma^{2}\beta^{t-1}), 𝐝𝐳𝒜(𝐗t)𝐲t\mathbf{d}_{\mathbf{z}}\triangleq\mathcal{A}(\mathbf{X}^{t})-\mathbf{y}^{t}. Here, 𝔸tI(𝐗t)+f(𝐗t)g(𝐗t)+𝒜𝖳(𝐳t)\mathbb{A}^{t}\triangleq\partial I_{\mathcal{M}}(\mathbf{X}^{t})+\nabla f(\mathbf{X}^{t})-\nabla g(\mathbf{X}^{t})+\mathcal{A}^{\mathsf{T}}(\mathbf{z}^{t}).

Proof.

We define the Lyapunov function as: Θ(𝐗,𝐗,𝐲,𝐳;β,β,μ,t)L(𝐗,𝐲;𝐳;β,μ)+ωσ¨σ2β𝒜(𝐗)𝐲22+α(θ+1)(β)2𝐗𝐗𝖥2+4ωσ¨β0Ch21t+Ch2μ\Theta(\mathbf{X},\mathbf{X}^{-},\mathbf{y},\mathbf{z};\beta,\beta^{-},\mu^{-},t)\triangleq L(\mathbf{X},\mathbf{y};\mathbf{z};\beta,\mu^{-})+\omega\ddot{\sigma}\sigma^{2}\beta^{-}\|\mathcal{A}(\mathbf{X})-\mathbf{y}\|_{2}^{2}+\tfrac{\alpha(\theta+1)\ell(\beta)}{2}\|\mathbf{X}-\mathbf{X}^{-}\|_{\mathsf{F}}^{2}+\tfrac{4\omega\ddot{\sigma}}{\beta^{0}}C_{h}^{2}\tfrac{1}{t}+C_{h}^{2}\mu^{-}.

Using this definition, the conclusion of the lemma follows directly.
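For reference, the Lyapunov function can be assembled term by term; in the minimal sketch below, the callables L_eval (the augmented Lagrangian) and A (the map 𝒜), as well as all constants, are assumptions of the illustration rather than quantities fixed by the paper.

```python
import numpy as np

def theta(X, X_prev, y, z, beta, beta_prev, mu_prev, t,
          L_eval, A, omega, sig_dd, sigma, alpha, theta_p, ell, beta0, Ch):
    # Theta(X, X^-, y, z; beta, beta^-, mu^-, t) of Lemma D.1, term by term.
    res = np.linalg.norm(A(X) - y)  # ||A(X) - y||
    return (L_eval(X, y, z, beta, mu_prev)
            + omega * sig_dd * sigma**2 * beta_prev * res**2
            + 0.5 * alpha * (theta_p + 1) * ell(beta) * np.linalg.norm(X - X_prev)**2
            + 4 * omega * sig_dd * Ch**2 / (beta0 * t)
            + Ch**2 * mu_prev)
```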

Lemma D.2.

For OADMM-EP, we define {𝐝𝐗,𝐝𝐗,𝐝𝐲,𝐝𝐳}\{\mathbf{d}_{\mathbf{X}},\mathbf{d}_{\mathbf{X}^{-}},\mathbf{d}_{\mathbf{y}},\mathbf{d}_{\mathbf{z}}\} as in Lemma D.1. There exists a constant KK such that:

1βt{𝐝𝐗𝖥+𝐝𝐗𝖥+𝐝𝐲+𝐝𝐳}K{𝒳t+𝒵t+𝒳t1+𝒵t1}.\displaystyle\textstyle\tfrac{1}{\beta^{t}}\{\|\mathbf{d}_{\mathbf{X}}\|_{\mathsf{F}}+\|\mathbf{d}_{\mathbf{X}^{-}}\|_{\mathsf{F}}+\|\mathbf{d}_{\mathbf{y}}\|+\|\mathbf{d}_{\mathbf{z}}\|\}\leq\textstyle K\{\mathcal{X}^{t}+\mathcal{Z}^{t}+\mathcal{X}^{t-1}+\mathcal{Z}^{t-1}\}. (45)

Here, 𝒳t𝐗t𝐗t1𝖥\mathcal{X}^{t}\triangleq\|\mathbf{X}^{t}-\mathbf{X}^{t-1}\|_{\mathsf{F}}, and 𝒵t𝒜(𝐗t)𝐲t\mathcal{Z}^{t}\triangleq\|\mathcal{A}(\mathbf{X}^{t})-\mathbf{y}^{t}\|.

Proof.

First, we obtain:

\displaystyle~{}\tfrac{1}{\beta^{t}}\|\mathbb{A}^{t}\|_{\mathsf{F}}=\tfrac{1}{\beta^{t}}\|\partial I_{\mathcal{M}}(\mathbf{X}^{t})+\nabla f(\mathbf{X}^{t})-\nabla g(\mathbf{X}^{t})+\mathcal{A}^{\mathsf{T}}(\mathbf{z}^{t})\|_{\mathsf{F}}
=\displaystyle\overset{\text{\char 172}}{=} 1βtg(𝐗t1)g(𝐗t)+f(𝐗t)f(𝐗𝖼t1)θ(βt1)(𝐗t𝐗𝖼t1)\displaystyle~{}\tfrac{1}{\beta^{t}}\|\nabla g(\mathbf{X}^{t-1})-\nabla g(\mathbf{X}^{t})+\nabla f(\mathbf{X}^{t})-\nabla f(\mathbf{X}^{t-1}_{{\sf c}})-\theta\ell(\beta^{t-1})(\mathbf{X}^{t}-\mathbf{X}^{t-1}_{{\sf c}})
~{}+\mathcal{A}^{\mathsf{T}}(\mathbf{z}^{t}-\mathbf{z}^{t-1})-\beta^{t-1}\mathcal{A}^{\mathsf{T}}(\mathcal{A}(\mathbf{X}^{t-1}_{{\sf c}})-\mathbf{y}^{t-1})\|_{\mathsf{F}}
\displaystyle\overset{\text{\char 173}}{\leq} 1βtLg𝐗t𝐗t1𝖥+1βt(Lf+θ(βt1))𝐗t𝐗𝖼t1𝖥\displaystyle~{}\tfrac{1}{\beta^{t}}L_{g}\|\mathbf{X}^{t}-\mathbf{X}^{t-1}\|_{\mathsf{F}}+\tfrac{1}{\beta^{t}}(L_{f}+\theta\ell(\beta^{t-1}))\|\mathbf{X}^{t}-\mathbf{X}^{t-1}_{{\sf c}}\|_{\mathsf{F}}
+1βtA¯𝐳t𝐳t1+1βtβt1A¯{𝒜(𝐗t1)𝐲t1+A¯𝐗t1𝐗t2𝖥}\displaystyle~{}+\tfrac{1}{\beta^{t}}\overline{\rm{A}}\|\mathbf{z}^{t}-\mathbf{z}^{t-1}\|+\tfrac{1}{\beta^{t}}\beta^{t-1}\overline{\rm{A}}\{\|\mathcal{A}(\mathbf{X}^{t-1})-\mathbf{y}^{t-1}\|+\overline{\rm{A}}\|\mathbf{X}^{t-1}-\mathbf{X}^{t-2}\|_{\mathsf{F}}\}
=\displaystyle\overset{}{=} 𝒪(𝐗t𝐗t1𝖥)+𝒪(𝒜(𝐗t)𝐲t)\displaystyle~{}\mathcal{O}(\|\mathbf{X}^{t}-\mathbf{X}^{t-1}\|_{\mathsf{F}})+\mathcal{O}(\|\mathcal{A}(\mathbf{X}^{t})-\mathbf{y}^{t}\|)
+𝒪(𝐗t1𝐗t2𝖥)+𝒪(𝒜(𝐗t1)𝐲t1),\displaystyle~{}+\mathcal{O}(\|\mathbf{X}^{t-1}-\mathbf{X}^{t-2}\|_{\mathsf{F}})+\mathcal{O}(\|\mathcal{A}(\mathbf{X}^{t-1})-\mathbf{y}^{t-1}\|), (46)

where step ① uses the optimality condition of 𝐗^{t+1} for OADMM-EP (applied here with t shifted to t−1), namely:

I(𝐗t+1)g(𝐗t)\displaystyle~{}\textstyle\partial I_{\mathcal{M}}(\mathbf{X}^{t+1})-\nabla g(\mathbf{X}^{t})
\displaystyle\ni θ(βt)(𝐗t+1𝐗𝖼t)𝐗𝒮(𝐗𝖼t,𝐲t;𝐳t;βt)\displaystyle~{}\textstyle-\theta\ell(\beta^{t})(\mathbf{X}^{t+1}-\mathbf{X}^{t}_{{\sf c}})-\nabla_{\mathbf{X}}\mathcal{S}(\mathbf{X}^{t}_{{\sf c}},\mathbf{y}^{t};\mathbf{z}^{t};\beta^{t})
=\displaystyle= θ(βt)(𝐗t+1𝐗𝖼t)f(𝐗𝖼t)𝒜𝖳[𝐳t+βt(𝒜(𝐗𝖼t)𝐲t)];\displaystyle~{}-\theta\ell(\beta^{t})(\mathbf{X}^{t+1}-\mathbf{X}^{t}_{{\sf c}})-\nabla f(\mathbf{X}^{t}_{{\sf c}})-\mathcal{A}^{\mathsf{T}}[\mathbf{z}^{t}+\beta^{t}(\mathcal{A}(\mathbf{X}^{t}_{{\sf c}})-\mathbf{y}^{t})]; (47)

step ② uses the triangle inequality, the L_f-Lipschitz continuity of ∇f(𝐗), the L_g-Lipschitz continuity of ∇g(𝐗), and the upper bound of ‖𝒜(𝐗^t_𝖼) − 𝐲^t‖ shown in Lemma A.6(c); step ③ uses the upper bound of ‖𝐗^t − 𝐗^{t−1}_𝖼‖_𝖥 and 𝐳^t − 𝐳^{t−1} = σβ^{t−1}(𝒜(𝐗^t) − 𝐲^t).

Part (a). We bound the term 1βt𝐝𝐗𝖥\tfrac{1}{\beta^{t}}\|\mathbf{d}_{\mathbf{X}}\|_{\mathsf{F}}. We have:

1βt𝐝𝐗𝖥\displaystyle~{}\tfrac{1}{\beta^{t}}\|\mathbf{d}_{\mathbf{X}}\|_{\mathsf{F}}
=\displaystyle\overset{\text{\char 172}}{=} 1βt𝔸t+(βt+2ωσ¨σ2βt1)𝒜𝖳(𝒜(𝐗t)𝐲t)+α(θ+1)(βt)(𝐗t𝐗t1)𝖥\displaystyle~{}\textstyle\tfrac{1}{\beta^{t}}\|\mathbb{A}^{t}+(\beta^{t}+2\omega\ddot{\sigma}\sigma^{2}\beta^{t-1})\mathcal{A}^{\mathsf{T}}(\mathcal{A}(\mathbf{X}^{t})-\mathbf{y}^{t})+\alpha(\theta+1)\ell(\beta^{t})(\mathbf{X}^{t}-\mathbf{X}^{t-1})\|_{\mathsf{F}}
\displaystyle\overset{\text{\char 173}}{\leq} 1βt𝔸t𝖥+(1+2ωσ¨σ2)A¯𝒜(𝐗t)𝐲t𝖥+α(θ+1)¯𝐗t𝐗t1𝖥\displaystyle~{}\textstyle\tfrac{1}{\beta^{t}}\|\mathbb{A}^{t}\|_{\mathsf{F}}+(1+2\omega\ddot{\sigma}\sigma^{2})\overline{\rm{A}}\|\mathcal{A}(\mathbf{X}^{t})-\mathbf{y}^{t}\|_{\mathsf{F}}+\alpha(\theta+1)\overline{\rm{\ell}}\|\mathbf{X}^{t}-\mathbf{X}^{t-1}\|_{\mathsf{F}}
\displaystyle\overset{\text{\char 174}}{\leq} 𝒪(𝐗t𝐗t1𝖥)+𝒪(𝒜(𝐗t)𝐲t)+𝒪(𝐗t1𝐗t2𝖥)+𝒪(𝒜(𝐗t1)𝐲t1),\displaystyle~{}\textstyle\mathcal{O}(\|\mathbf{X}^{t}-\mathbf{X}^{t-1}\|_{\mathsf{F}})+\mathcal{O}(\|\mathcal{A}(\mathbf{X}^{t})-\mathbf{y}^{t}\|)+\mathcal{O}(\|\mathbf{X}^{t-1}-\mathbf{X}^{t-2}\|_{\mathsf{F}})+\mathcal{O}(\|\mathcal{A}(\mathbf{X}^{t-1})-\mathbf{y}^{t-1}\|),

where step ① uses the definition of 𝐝_𝐗 in Lemma D.1; step ② uses the triangle inequality, β^{t−1} ≤ β^t, and ℓ(β^t) ≤ β^tℓ̄; step ③ uses Inequality (46).

Part (b). We bound the term 1βt𝐝𝐗𝖥\tfrac{1}{\beta^{t}}\|\mathbf{d}_{\mathbf{X}^{-}}\|_{\mathsf{F}}. We have:

1βt𝐝𝐗𝖥=1βtα(θ+1)(βt)𝐗t1𝐗t𝖥=𝒪(𝐗t𝐗t1𝖥),\displaystyle\tfrac{1}{\beta^{t}}\|\mathbf{d}_{\mathbf{X}^{-}}\|_{\mathsf{F}}\overset{\text{\char 172}}{=}\tfrac{1}{\beta^{t}}\alpha(\theta+1)\ell(\beta^{t})\|\mathbf{X}^{t-1}-\mathbf{X}^{t}\|_{\mathsf{F}}\overset{\text{\char 173}}{=}\mathcal{O}(\|\mathbf{X}^{t}-\mathbf{X}^{t-1}\|_{\mathsf{F}}), (48)

where step ① uses the definition of 𝐝_{𝐗⁻} in Lemma D.1; step ② uses ℓ(β^t) ≤ β^tℓ̄.

Part (c). We bound the term 1βt𝐝𝐲𝖥\tfrac{1}{\beta^{t}}\|\mathbf{d}_{\mathbf{y}}\|_{\mathsf{F}}. We have:

1βt𝐝𝐲\displaystyle\tfrac{1}{\beta^{t}}\|\mathbf{d}_{\mathbf{y}}\| =\displaystyle\overset{\text{\char 172}}{=} 1βthμt1(𝐲t)𝐳t+(𝐲t𝒜(𝐗t))(βt+2ωσ¨σ2βt1)\displaystyle\tfrac{1}{\beta^{t}}\|\nabla h_{\mu^{t-1}}(\mathbf{y}^{t})-\mathbf{z}^{t}+(\mathbf{y}^{t}-\mathcal{A}(\mathbf{X}^{t}))\cdot(\beta^{t}+2\omega\ddot{\sigma}\sigma^{2}\beta^{t-1})\|
=\displaystyle\overset{\text{\char 173}}{=} 1βt(11σ)(𝐳t1𝐳t)+(𝐲t𝒜(𝐗t))(βt+2ωσ¨σ2βt1)\displaystyle\tfrac{1}{\beta^{t}}\|(1-\tfrac{1}{\sigma})(\mathbf{z}^{t-1}-\mathbf{z}^{t})+(\mathbf{y}^{t}-\mathcal{A}(\mathbf{X}^{t}))\cdot(\beta^{t}+2\omega\ddot{\sigma}\sigma^{2}\beta^{t-1})\|
=\displaystyle\overset{\text{\char 174}}{=} 𝒪(𝒜(𝐗t)𝐲t),\displaystyle\mathcal{O}(\|\mathcal{A}(\mathbf{X}^{t})-\mathbf{y}^{t}\|),

where step ① uses the definition of 𝐝_𝐲 in Lemma D.1; step ② uses the fact that 𝐳^t − (1/σ)(𝐳^t − 𝐳^{t+1}) = ∇h_{μ^t}(𝐲^{t+1}), as shown in Lemma 4.1; step ③ uses (1/β^t)(𝐳^{t+1} − 𝐳^t) = σ(𝒜(𝐗^{t+1}) − 𝐲^{t+1}) and β^{t−1} = 𝒪(β^t).

Part (d). We bound the term 1βt𝐝𝐳𝖥\tfrac{1}{\beta^{t}}\|\mathbf{d}_{\mathbf{z}}\|_{\mathsf{F}}. We have: 1βt𝐝𝐳1β0𝒜(𝐗t)𝐲t\tfrac{1}{\beta^{t}}\|\mathbf{d}_{\mathbf{z}}\|\leq\tfrac{1}{\beta^{0}}\|\mathcal{A}(\mathbf{X}^{t})-\mathbf{y}^{t}\|.

Part (e). Combining the upper bounds for the terms {1βt𝐝𝐗𝖥,1βt𝐝𝐗𝖥,1βt𝐝𝐲𝖥,1βt𝐝𝐳𝖥}\{\tfrac{1}{\beta^{t}}\|\mathbf{d}_{\mathbf{X}}\|_{\mathsf{F}},\tfrac{1}{\beta^{t}}\|\mathbf{d}_{\mathbf{X}^{-}}\|_{\mathsf{F}},\tfrac{1}{\beta^{t}}\|\mathbf{d}_{\mathbf{y}}\|_{\mathsf{F}},\tfrac{1}{\beta^{t}}\|\mathbf{d}_{\mathbf{z}}\|_{\mathsf{F}}\}, we finish the proof of this lemma. ∎

Lemma D.3.

For OADMM-RR, we define {𝐝_𝐗, 𝐝_{𝐗⁻}, 𝐝_𝐲, 𝐝_𝐳} as in Lemma D.1. There exists a constant K such that:

\tfrac{1}{\beta^{t}}\{\|\mathbf{d}_{\mathbf{X}}\|_{\mathsf{F}}+\|\mathbf{d}_{\mathbf{X}^{-}}\|_{\mathsf{F}}+\|\mathbf{d}_{\mathbf{y}}\|+\|\mathbf{d}_{\mathbf{z}}\|\}\leq K\{\mathcal{X}^{t}+\mathcal{Z}^{t}\}.

Here, 𝒳^t ≜ ‖(1/β^t)𝔾^t_{1/2}‖_𝖥, and 𝒵^t ≜ ‖𝒜(𝐗^t) − 𝐲^t‖.

Proof.

We define 𝐆^t ≜ ∇f(𝐗^t) − ∇g(𝐗^t) + 𝒜^𝖳(𝐳^t + β^t(𝒜(𝐗^t) − 𝐲^t)).

We define ℒ̇(𝐗) ≜ L(𝐗,𝐲^t;𝐳^t;β^t,μ^t); then ∇ℒ̇(𝐗^t) = 𝐆^t.

First, given 𝐗t\mathbf{X}^{t}\in\mathcal{M}, we obtain:

1βtI(𝐗t)+˙(𝐗t)𝖥\displaystyle\textstyle\tfrac{1}{\beta^{t}}\|\partial I_{\mathcal{M}}(\mathbf{X}^{t})+\nabla\dot{\mathcal{L}}(\mathbf{X}^{t})\|_{\mathsf{F}} \displaystyle\overset{\text{\char 172}}{\leq} 1βt˙(𝐗t)𝐗t[˙(𝐗t)]𝖳𝐗t𝖥\displaystyle\textstyle\tfrac{1}{\beta^{t}}\|\nabla\dot{\mathcal{L}}(\mathbf{X}^{t})-\mathbf{X}^{t}[\nabla\dot{\mathcal{L}}(\mathbf{X}^{t})]^{\mathsf{T}}\mathbf{X}^{t}\|_{\mathsf{F}} (49)
=\displaystyle\overset{\text{\char 173}}{=} 1βt𝐆t𝐗t[𝐆t]𝖳𝐗t𝖥=1βt𝔾1t𝖥\displaystyle\textstyle\tfrac{1}{\beta^{t}}\|\mathbf{G}^{t}-\mathbf{X}^{t}[\mathbf{G}^{t}]^{\mathsf{T}}\mathbf{X}^{t}\|_{\mathsf{F}}=\tfrac{1}{\beta^{t}}\|\mathbb{G}^{t}_{1}\|_{\mathsf{F}}
\displaystyle\overset{\text{\char 174}}{\leq} \tfrac{1}{\beta^{t}}\max(1,1/\rho)\cdot\|\mathbb{G}^{t}_{1/2}\|_{\mathsf{F}}=\mathcal{O}(\mathcal{X}^{t}),

where step ① uses Lemma 2.13; step ② uses the definitions of {𝐆^t, 𝐃^t_ρ} as in Algorithm 1; step ③ uses ‖𝔾_1‖_𝖥 ≤ max(1,1/ρ)‖𝔾_ρ‖_𝖥, as shown in Lemma 2.12(b).

Part (a). We bound the term 1βt𝐝𝐗𝖥\tfrac{1}{\beta^{t}}\|\mathbf{d}_{\mathbf{X}}\|_{\mathsf{F}}. We have:

1βt𝐝𝐗𝖥\displaystyle\tfrac{1}{\beta^{t}}\|\mathbf{d}_{\mathbf{X}}\|_{\mathsf{F}}
=\displaystyle\overset{\text{\char 172}}{=} 1βtI(𝐗t)+˙(𝐗t)+2ωσ¨σ2βt1𝒜𝖳(𝒜(𝐗t)𝐲t)𝖥\displaystyle\textstyle\tfrac{1}{\beta^{t}}\|\partial I_{\mathcal{M}}(\mathbf{X}^{t})+\nabla\dot{\mathcal{L}}(\mathbf{X}^{t})+2\omega\ddot{\sigma}\sigma^{2}\beta^{t-1}\mathcal{A}^{\mathsf{T}}(\mathcal{A}(\mathbf{X}^{t})-\mathbf{y}^{t})\|_{\mathsf{F}}
\displaystyle\overset{\text{\char 173}}{\leq} 1βtI(𝐗t)+˙(𝐗t)𝖥+2ωσ¨σ2𝒜𝖳(𝒜(𝐗t)𝐲t)𝖥\displaystyle\textstyle\tfrac{1}{\beta^{t}}\|\partial I_{\mathcal{M}}(\mathbf{X}^{t})+\nabla\dot{\mathcal{L}}(\mathbf{X}^{t})\|_{\mathsf{F}}+2\omega\ddot{\sigma}\sigma^{2}\|\mathcal{A}^{\mathsf{T}}(\mathcal{A}(\mathbf{X}^{t})-\mathbf{y}^{t})\|_{\mathsf{F}}
\displaystyle\overset{\text{\char 174}}{\leq} 𝒪(𝒳t)+𝒪(𝒵t),\displaystyle\textstyle\mathcal{O}(\mathcal{X}^{t})+\mathcal{O}(\mathcal{Z}^{t}),

where step ① uses 𝐝_𝐗 = ∂I_ℳ(𝐗^t) + ∇f(𝐗^t) − ∇g(𝐗^t) + 𝒜^𝖳(𝐳^t) + {β^t + 2ωσ̈σ²β^{t−1}}·𝒜^𝖳(𝒜(𝐗^t) − 𝐲^t) with the choice α = 0 for OADMM-RR; step ② uses the triangle inequality and β^{t−1} ≤ β^t; step ③ uses Inequality (49).

Part (b). We bound the term 1βt𝐝𝐗𝖥\tfrac{1}{\beta^{t}}\|\mathbf{d}_{\mathbf{X}^{-}}\|_{\mathsf{F}}. Given α=0\alpha=0, we conclude that 1βt𝐝𝐗𝖥=0\tfrac{1}{\beta^{t}}\|\mathbf{d}_{\mathbf{X}^{-}}\|_{\mathsf{F}}=0.

Part (c). We bound the terms 1/β^t‖𝐝_𝐲‖ and 1/β^t‖𝐝_𝐳‖. Since the same strategies for updating {𝐲^t, 𝐳^t} are employed, their bounds for OADMM-RR are identical to those for OADMM-EP.

Part (d). Combining the upper bounds for the terms {1βt𝐝𝐗𝖥,1βt𝐝𝐗𝖥,1βt𝐝𝐲𝖥,1βt𝐝𝐳𝖥}\{\tfrac{1}{\beta^{t}}\|\mathbf{d}_{\mathbf{X}}\|_{\mathsf{F}},\tfrac{1}{\beta^{t}}\|\mathbf{d}_{\mathbf{X}^{-}}\|_{\mathsf{F}},\tfrac{1}{\beta^{t}}\|\mathbf{d}_{\mathbf{y}}\|_{\mathsf{F}},\tfrac{1}{\beta^{t}}\|\mathbf{d}_{\mathbf{z}}\|_{\mathsf{F}}\}, we finish the proof of this lemma.

Now, we proceed to prove the main result, Lemma 5.4, restated as Lemma D.4 below.

Lemma D.4.

(Subgradient Bounds) (a) For OADMM-EP, there exists a constant K>0K>0 such that: dist(𝟎,Θ(𝕨t;𝕦t))βtK(et+et1)\operatorname{dist}(\mathbf{0},\partial\Theta({\mathbbm{w}}^{t};{\mathbbm{u}}^{t}))\leq\beta^{t}K(e^{t}+e^{t-1}). (b) For OADMM-RR, there exists a constant K>0K>0 such that: dist(𝟎,Θ(𝕨t;𝕦t))βtKet\operatorname{dist}(\mathbf{0},\partial\Theta({\mathbbm{w}}^{t};{\mathbbm{u}}^{t}))\leq\beta^{t}Ke^{t}. Here, dist(𝟎,Θ(𝕨t;𝕦t)){dist2(𝟎,𝐗Θ(𝕨t;𝕦t))+dist2(𝟎,𝐗Θ(𝕨t;𝕦t))+dist2(𝟎,𝐲Θ(𝕨t;𝕦t))+dist2(𝟎,𝐳Θ(𝕨t;𝕦t))}1/2\operatorname{dist}(\mathbf{0},\partial\Theta({\mathbbm{w}}^{t};{\mathbbm{u}}^{t}))\triangleq\{\operatorname{dist}^{2}(\mathbf{0},\partial_{\mathbf{X}}\Theta({\mathbbm{w}}^{t};{\mathbbm{u}}^{t}))+\operatorname{dist}^{2}(\mathbf{0},\partial_{\mathbf{X}^{-}}\Theta({\mathbbm{w}}^{t};{\mathbbm{u}}^{t}))+\operatorname{dist}^{2}(\mathbf{0},\partial_{\mathbf{y}}\Theta({\mathbbm{w}}^{t};{\mathbbm{u}}^{t}))+\operatorname{dist}^{2}(\mathbf{0},\partial_{\mathbf{z}}\Theta({\mathbbm{w}}^{t};{\mathbbm{u}}^{t}))\}^{1/2}.

Proof.

For OADMM-EP, we have:

dist(𝟎,Θ(𝕨t;𝕦t))=\displaystyle\operatorname{dist}(\mathbf{0},\partial\Theta({\mathbbm{w}}^{t};{\mathbbm{u}}^{t}))= 𝐝𝐗𝖥2+𝐝𝐗𝖥2+𝐝𝐲𝖥2+𝐝𝐳𝖥2\displaystyle~{}\sqrt{\|\mathbf{d}_{\mathbf{X}}\|_{\mathsf{F}}^{2}+\|\mathbf{d}_{\mathbf{X}^{-}}\|_{\mathsf{F}}^{2}+\|\mathbf{d}_{\mathbf{y}}\|_{\mathsf{F}}^{2}+\|\mathbf{d}_{\mathbf{z}}\|_{\mathsf{F}}^{2}}
\displaystyle\overset{\text{\char 172}}{\leq} 𝐝𝐗𝖥+𝐝𝐗𝖥+𝐝𝐲𝖥+𝐝𝐳𝖥\displaystyle~{}\|\mathbf{d}_{\mathbf{X}}\|_{\mathsf{F}}+\|\mathbf{d}_{\mathbf{X}^{-}}\|_{\mathsf{F}}+\|\mathbf{d}_{\mathbf{y}}\|_{\mathsf{F}}+\|\mathbf{d}_{\mathbf{z}}\|_{\mathsf{F}}
\displaystyle\overset{\text{\char 173}}{\leq} Kβt{𝒳t+𝒵t+𝒳t1+𝒵t1}\displaystyle~{}K\beta^{t}\{\mathcal{X}^{t}+\mathcal{Z}^{t}+\mathcal{X}^{t-1}+\mathcal{Z}^{t-1}\}
\displaystyle\overset{\text{\char 174}}{\leq} Kβt(et+et1),\displaystyle~{}K\beta^{t}(e^{t}+e^{t-1}),

where step ① uses √(a+b) ≤ √a + √b for all a ≥ 0 and b ≥ 0; step ② uses Lemma D.2; step ③ uses the definitions of e^t and e^{t−1}.

For OADMM-RR, using Lemma D.3 and similar strategies, we have: dist(𝟎,Θ(𝕨t;𝕦t))βtKet\operatorname{dist}(\mathbf{0},\partial\Theta({\mathbbm{w}}^{t};{\mathbbm{u}}^{t}))\leq\beta^{t}Ke^{t}.

D.2 Proof of Theorem 5.6

Proof.

We define K˙3K/min(εx,εy,εz)\dot{K}\triangleq 3K/\min(\varepsilon_{x},\varepsilon_{y},\varepsilon_{z}).

First, using Assumption 5.1, we have:

φ(Θ(𝕨t;𝕦t)Θ(𝕨;𝕦))dist(𝟎,Θ(𝕨t;𝕦t))1.\displaystyle\varphi^{\prime}(\Theta({\mathbbm{w}}^{t};{\mathbbm{u}}^{t})-\Theta({\mathbbm{w}}^{\infty};{\mathbbm{u}}^{\infty}))\cdot\operatorname{dist}(\mathbf{0},\partial\Theta({\mathbbm{w}}^{t};{\mathbbm{u}}^{t}))\geq 1. (50)

Second, since the desingularization function φ(·) is concave, for any a, b in its domain we have: φ(b) + (a−b)φ′(a) ≤ φ(a). Applying this inequality with a = Θ(𝕨^t;𝕦^t) − Θ(𝕨^∞;𝕦^∞) and b = Θ(𝕨^{t+1};𝕦^{t+1}) − Θ(𝕨^∞;𝕦^∞), we have:

(Θ(𝕨t;𝕦t)Θ(𝕨t+1;𝕦t+1))φ(Θ(𝕨t;𝕦t)Θ(𝕨;𝕦))\displaystyle\textstyle(\Theta({\mathbbm{w}}^{t};{\mathbbm{u}}^{t})-\Theta({\mathbbm{w}}^{t+1};{\mathbbm{u}}^{t+1}))\cdot\varphi^{\prime}(\Theta({\mathbbm{w}}^{t};{\mathbbm{u}}^{t})-\Theta({\mathbbm{w}}^{\infty};{\mathbbm{u}}^{\infty})) (51)
\displaystyle\leq φ(Θ(𝕨t;𝕦t)Θ(𝕨;𝕦))φtφ(Θ(𝕨t+1;𝕦t+1)Θ(𝕨;𝕦))φt+1.\displaystyle\textstyle\underbrace{\varphi(\Theta({\mathbbm{w}}^{t};{\mathbbm{u}}^{t})-\Theta({\mathbbm{w}}^{\infty};{\mathbbm{u}}^{\infty}))}_{\triangleq\varphi^{t}}-\underbrace{\varphi(\Theta({\mathbbm{w}}^{t+1};{\mathbbm{u}}^{t+1})-\Theta({\mathbbm{w}}^{\infty};{\mathbbm{u}}^{\infty}))}_{\triangleq\varphi^{t+1}}.

Third, we derive the following inequalities for OADMM-EP:

min(εz,εy,εx)βt{𝒜(𝐗t+1)𝐲t+122+𝐲t+1𝐲t22+𝐗t+1𝐗t𝖥2}\displaystyle\min(\varepsilon_{z},\varepsilon_{y},\varepsilon_{x})\beta^{t}\{\|\mathcal{A}(\mathbf{X}^{t+1})-\mathbf{y}^{t+1}\|_{2}^{2}+\|\mathbf{y}^{t+1}-\mathbf{y}^{t}\|_{2}^{2}+\|\mathbf{X}^{t+1}-\mathbf{X}^{t}\|_{\mathsf{F}}^{2}\} (52)
\displaystyle\overset{\text{\char 172}}{\leq} εzβt𝒜(𝐗t+1)𝐲t+122+εyβt𝐲t+1𝐲t22+εx(βt)𝐗t+1𝐗t𝖥2\displaystyle\varepsilon_{z}\beta^{t}\|\mathcal{A}(\mathbf{X}^{t+1})-\mathbf{y}^{t+1}\|_{2}^{2}+\varepsilon_{y}\beta^{t}\|\mathbf{y}^{t+1}-\mathbf{y}^{t}\|_{2}^{2}+\varepsilon_{x}\ell(\beta^{t})\|\mathbf{X}^{t+1}-\mathbf{X}^{t}\|_{\mathsf{F}}^{2}
\displaystyle\overset{\text{\char 173}}{\leq} ΘtΘt+1=Θ(𝕨t;𝕦t)Θ(𝕨t+1;𝕦t+1)\displaystyle\Theta^{t}-\Theta^{t+1}=\Theta({\mathbbm{w}}^{t};{\mathbbm{u}}^{t})-\Theta({\mathbbm{w}}^{t+1};{\mathbbm{u}}^{t+1})
\displaystyle\overset{\text{\char 174}}{\leq} (φtφt+1)1φ(Θ(𝕨t;𝕦t)Θ(𝕨;𝕦)))\displaystyle\textstyle(\varphi^{t}-\varphi^{t+1})\cdot\tfrac{1}{\varphi^{\prime}(\Theta({\mathbbm{w}}^{t};{\mathbbm{u}}^{t})-\Theta({\mathbbm{w}}^{\infty};{\mathbbm{u}}^{\infty})))}
\displaystyle\overset{\text{\char 175}}{\leq} (φtφt+1)dist(𝟎,Θ(𝕨t;𝕦t))\displaystyle\textstyle(\varphi^{t}-\varphi^{t+1})\cdot\operatorname{dist}(\mathbf{0},\partial\Theta({\mathbbm{w}}^{t};{\mathbbm{u}}^{t}))
\displaystyle\overset{\text{\char 176}}{\leq} (φtφt+1)Kβt(et+et1),\displaystyle\textstyle(\varphi^{t}-\varphi^{t+1})\cdot K\beta^{t}(e^{t}+e^{t-1}),

where step ① uses ℓ(β^t) ≥ β^tℓ̲; step ② uses Lemma 4.7; step ③ uses Inequality (51); step ④ uses Inequality (50); step ⑤ uses Lemma 5.4. We further derive the following inequalities:

(et+1)2\displaystyle\textstyle(e^{t+1})^{2} \displaystyle\triangleq (𝒜(𝐗t+1)𝐲t+12+𝐲t+1𝐲t2+𝐗t+1𝐗t𝖥)2\displaystyle(\|\mathcal{A}(\mathbf{X}^{t+1})-\mathbf{y}^{t+1}\|_{2}+\|\mathbf{y}^{t+1}-\mathbf{y}^{t}\|_{2}+\|\mathbf{X}^{t+1}-\mathbf{X}^{t}\|_{\mathsf{F}})^{2} (53)
\displaystyle\overset{\text{\char 172}}{\leq} 3{𝒜(𝐗t+1)𝐲t+122+𝐲t+1𝐲t22+𝐗t+1𝐗t𝖥2}\displaystyle\textstyle 3\cdot\{\|\mathcal{A}(\mathbf{X}^{t+1})-\mathbf{y}^{t+1}\|_{2}^{2}+\|\mathbf{y}^{t+1}-\mathbf{y}^{t}\|_{2}^{2}+\|\mathbf{X}^{t+1}-\mathbf{X}^{t}\|_{\mathsf{F}}^{2}\}
\displaystyle\overset{\text{\char 173}}{\leq} {3K/min(εz,εy,εx)}(et+et1)(φtφt+1),\displaystyle\textstyle\{3K/\min(\varepsilon_{z},\varepsilon_{y},\varepsilon_{x})\}\cdot(e^{t}+e^{t-1})\cdot(\varphi^{t}-\varphi^{t+1}),

where step ① uses the norm inequality (a+b+c)² ≤ 3(a²+b²+c²) for any a, b, c ∈ ℝ; step ② uses Inequality (52).

Fourth, we derive the following inequalities for OADMM-RR:

\displaystyle\min(\varepsilon_{z},\varepsilon_{y},\varepsilon_{x})\beta^{t}\{\|\mathcal{A}(\mathbf{X}^{t+1})-\mathbf{y}^{t+1}\|_{2}^{2}+\|\mathbf{y}^{t+1}-\mathbf{y}^{t}\|_{2}^{2}+\|\tfrac{1}{\beta^{t}}\mathbb{G}^{t}_{1/2}\|_{\mathsf{F}}^{2}\} (54)
\displaystyle\overset{}{\leq} \varepsilon_{z}\beta^{t}\|\mathcal{A}(\mathbf{X}^{t+1})-\mathbf{y}^{t+1}\|_{2}^{2}+\varepsilon_{y}\beta^{t}\|\mathbf{y}^{t+1}-\mathbf{y}^{t}\|_{2}^{2}+\tfrac{\varepsilon_{x}}{\beta^{t}}\|\mathbb{G}^{t}_{1/2}\|_{\mathsf{F}}^{2}
\displaystyle\overset{\text{\char 172}}{\leq} ΘtΘt+1=Θ(𝕨t;𝕦t)Θ(𝕨t+1;𝕦t+1)\displaystyle\Theta^{t}-\Theta^{t+1}=\Theta({\mathbbm{w}}^{t};{\mathbbm{u}}^{t})-\Theta({\mathbbm{w}}^{t+1};{\mathbbm{u}}^{t+1})
\displaystyle\overset{\text{\char 173}}{\leq} (φtφt+1)1φ(Θ(𝕨t;𝕦t)Θ(𝕨;𝕦)))\displaystyle\textstyle(\varphi^{t}-\varphi^{t+1})\cdot\tfrac{1}{\varphi^{\prime}(\Theta({\mathbbm{w}}^{t};{\mathbbm{u}}^{t})-\Theta({\mathbbm{w}}^{\infty};{\mathbbm{u}}^{\infty})))}
\displaystyle\overset{\text{\char 174}}{\leq} (φtφt+1)dist(𝟎,Θ(𝕨t;𝕦t))\displaystyle\textstyle(\varphi^{t}-\varphi^{t+1})\cdot\operatorname{dist}(\mathbf{0},\partial\Theta({\mathbbm{w}}^{t};{\mathbbm{u}}^{t}))
\displaystyle\overset{\text{\char 175}}{\leq} (φtφt+1)Kβt(et+et1),\displaystyle\textstyle(\varphi^{t}-\varphi^{t+1})\cdot K\beta^{t}(e^{t}+e^{t-1}),

where step ① uses Lemma 4.12; step ② uses Inequality (51); step ③ uses Inequality (50); step ④ uses Lemma 5.4. We further derive the following inequalities:

\textstyle(e^{t+1})^{2}\triangleq(\|\mathcal{A}(\mathbf{X}^{t+1})-\mathbf{y}^{t+1}\|+\|\mathbf{y}^{t+1}-\mathbf{y}^{t}\|+\|\tfrac{1}{\beta^{t}}\mathbb{G}^{t}_{1/2}\|_{\mathsf{F}})^{2} (55)
\displaystyle\overset{\text{\char 172}}{\leq} \textstyle 3\cdot\{\|\mathcal{A}(\mathbf{X}^{t+1})-\mathbf{y}^{t+1}\|_{2}^{2}+\|\mathbf{y}^{t+1}-\mathbf{y}^{t}\|_{2}^{2}+\|\tfrac{1}{\beta^{t}}\mathbb{G}^{t}_{1/2}\|_{\mathsf{F}}^{2}\}
\displaystyle\overset{\text{\char 173}}{\leq} {3K/min(εz,εy,εx)}(φtφt+1)(et+et1),\displaystyle\textstyle\{3K/\min(\varepsilon_{z},\varepsilon_{y},\varepsilon_{x})\}\cdot(\varphi^{t}-\varphi^{t+1})\cdot(e^{t}+e^{t-1}),

where step ① uses the norm inequality (a+b+c)² ≤ 3(a²+b²+c²) for any a, b, c ∈ ℝ; step ② uses Inequality (54).

Part (a). Given Inequalities (53) and (55), we establish the following unified inequality applicable to both OADMM-EP and OADMM-RR:

(et+1)2(et+et1){3K/min(εz,εy,εx)}K˙(φtφt+1).\displaystyle(e^{t+1})^{2}\leq(e^{t}+e^{t-1})\cdot\underbrace{\{3K/\min(\varepsilon_{z},\varepsilon_{y},\varepsilon_{x})\}}_{\triangleq\dot{K}}\cdot(\varphi^{t}-\varphi^{t+1}). (56)

Part (b). Considering Inequality (56) and applying Lemma A.10 with ptK˙φtp^{t}\triangleq\dot{K}\varphi^{t}, we have:

t,i=tei+1et+et1+4K˙φt.\displaystyle\textstyle\forall t,\,\sum_{i=t}^{\infty}e^{i+1}\leq e^{t}+e^{t-1}+4\dot{K}\varphi^{t}.

Letting t=1t=1, we have: i=1ei+1e1+e0+4K˙φ1\sum_{i=1}^{\infty}e^{i+1}\leq\textstyle e^{1}+e^{0}+4\dot{K}\varphi^{1}.

D.3 Proof of Lemma 5.8

Proof.

We define dti=tei+1d^{t}\triangleq\sum_{i=t}^{\infty}e^{i+1}.

Part (a-i). For OADMM-EP, we have for all t ≥ 1: ‖𝐗^t − 𝐗^∞‖_𝖥 ≤① ∑_{i=t}^∞ ‖𝐗^i − 𝐗^{i+1}‖_𝖥 ≤ ∑_{i=t}^∞ {‖𝐗^{i+1} − 𝐗^i‖_𝖥 + ‖𝐲^{i+1} − 𝐲^i‖ + ‖𝒜(𝐗^{i+1}) − 𝐲^{i+1}‖} = ∑_{i=t}^∞ e^{i+1} ≜ d^t, where step ① uses the triangle inequality.

Part (a-ii). For OADMM-RR, we have: ‖𝐗^{t+1} − 𝐗^t‖_𝖥 =① ‖Retr_{𝐗^t}(−η^t𝔾^t_ρ) − 𝐗^t‖_𝖥 ≤② k̇‖η^t𝔾^t_ρ‖_𝖥 ≤③ k̇η^t max(2ρ,1)‖𝔾^t_{1/2}‖_𝖥 =④ k̇ max(2ρ,1)(b^tγ^j/β^t)‖𝔾^t_{1/2}‖_𝖥 ≤⑤ k̇ max(2ρ,1) b̄γ̄·‖(1/β^t)𝔾^t_{1/2}‖_𝖥 = 𝒪(‖(1/β^t)𝔾^t_{1/2}‖_𝖥), where step ① uses the update rule of 𝐗^{t+1}; step ② uses Lemma 2.10; step ③ uses Lemma 2.12(c); step ④ uses the definition η^t ≜ b^tγ^j/β^t; step ⑤ uses b^t ≤ b̄ and the fact that γ^j ≤ γ̄. Furthermore, we derive for all t ≥ 1: ‖𝐗^t − 𝐗^∞‖_𝖥 ≤ ∑_{i=t}^∞ ‖𝐗^i − 𝐗^{i+1}‖_𝖥 ≤ 𝒪(∑_{i=t}^∞ ‖(1/β^i)𝔾^i_{1/2}‖_𝖥) ≤ 𝒪(∑_{i=t}^∞ e^{i+1}) = 𝒪(d^t).
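Step ② above (Lemma 2.10) says the retraction displacement is controlled by the step size ‖𝚫‖_𝖥; for the QR retraction this is easy to probe empirically (a numpy sketch; the printed ratio is an empirical surrogate for k̇, not the lemma's constant):

```python
import numpy as np

rng = np.random.default_rng(2)
n, r = 10, 4
X, _ = np.linalg.qr(rng.standard_normal((n, r)))   # X on the Stiefel manifold

ratios = []
for _ in range(1000):
    D = rng.standard_normal((n, r))
    D -= 0.5 * X @ (D.T @ X + X.T @ D)             # project D onto T_X M
    D *= rng.uniform(1e-3, 1.0) / np.linalg.norm(D)  # random step length in (0, 1]
    Q, R = np.linalg.qr(X + D)                     # QR retraction Retr_X(D) = qf(X + D)
    Q *= np.sign(np.sign(np.diag(R)) + 0.5)        # enforce diag(R) > 0
    ratios.append(np.linalg.norm(Q - X) / np.linalg.norm(D))

print(max(ratios))  # stays bounded (around 1), consistent with Lemma 2.10
```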

Part (b). We define φtφ(st)\varphi^{t}\triangleq\varphi(s^{t}), where stΘ(𝕨t;𝕦t)Θ(𝕨;𝕦)s^{t}\triangleq\Theta({\mathbbm{w}}^{t};{\mathbbm{u}}^{t})-\Theta({\mathbbm{w}}^{\infty};{\mathbbm{u}}^{\infty}). Using the definition of dtd^{t}, we derive:

dt\displaystyle\textstyle d^{t} \displaystyle\overset{}{\triangleq} i=tei+1\displaystyle\textstyle\sum_{i=t}^{\infty}e^{i+1}
\displaystyle\overset{\text{\char 172}}{\leq} et+et1+4K˙φt\displaystyle\textstyle e^{t}+e^{t-1}+4\dot{K}\varphi^{t}
=\displaystyle\overset{\text{\char 173}}{=} et+et1+4K˙c~{[st]σ~}1σ~σ~\displaystyle\textstyle e^{t}+e^{t-1}+4\dot{K}\tilde{c}\cdot\{[s^{t}]^{\tilde{\sigma}}\}^{\frac{1-\tilde{\sigma}}{\tilde{\sigma}}}
=\displaystyle\overset{\text{\char 174}}{=} et+et1+4K˙c~{c~(1σ~)1φ(st)}1σ~σ~\displaystyle\textstyle e^{t}+e^{t-1}+4\dot{K}\tilde{c}\cdot\{\tilde{c}(1-\tilde{\sigma})\cdot\frac{1}{\varphi^{\prime}(s^{t})}\}^{\frac{1-\tilde{\sigma}}{\tilde{\sigma}}}
\displaystyle\overset{\text{\char 175}}{\leq} et+et1+4K˙c~{c~(1σ~)dist(𝟎,Θ(𝕨t;𝕦t))}1σ~σ~\displaystyle\textstyle e^{t}+e^{t-1}+4\dot{K}\tilde{c}\cdot\{\tilde{c}(1-\tilde{\sigma})\cdot\operatorname{dist}(\mathbf{0},\partial\Theta({\mathbbm{w}}^{t};{\mathbbm{u}}^{t}))\}^{\frac{1-\tilde{\sigma}}{\tilde{\sigma}}}
\displaystyle\overset{\text{\char 176}}{\leq} et+et1+4K˙c~{c~(1σ~)βtK(et+et1)}1σ~σ~\displaystyle\textstyle e^{t}+e^{t-1}+4\dot{K}\tilde{c}\cdot\{\tilde{c}(1-\tilde{\sigma})\cdot\beta^{t}K(e^{t}+e^{t-1})\}^{\frac{1-\tilde{\sigma}}{\tilde{\sigma}}}
=\displaystyle\overset{\text{\char 177}}{=} \textstyle d^{t-2}-d^{t}+4\dot{K}\tilde{c}\cdot\{\tilde{c}(1-\tilde{\sigma})\cdot\beta^{t}K(d^{t-2}-d^{t})\}^{\frac{1-\tilde{\sigma}}{\tilde{\sigma}}}
=\displaystyle\overset{}{=} dt2dt+4K˙c~[c~(1σ~)K]1σ~σ~K¨{(βt(dt2dt))1σ~σ~},\displaystyle\textstyle d^{t-2}-d^{t}+\underbrace{4\dot{K}\tilde{c}\cdot[\tilde{c}(1-\tilde{\sigma})K]^{\frac{1-\tilde{\sigma}}{\tilde{\sigma}}}}_{\triangleq\ddot{K}}\cdot\{(\beta^{t}(d^{t-2}-d^{t}))^{\frac{1-\tilde{\sigma}}{\tilde{\sigma}}}\},

where step ① uses ∑_{i=t}^∞ e^{i+1} ≤ e^t + e^{t−1} + 4K̇φ^t, as shown in Theorem 5.6(b); step ② uses the definitions φ^t ≜ φ(s^t), s^t ≜ Θ(𝕨^t;𝕦^t) − Θ(𝕨^∞;𝕦^∞), and φ(s) = c̃s^{1−σ̃}; step ③ uses φ′(s) = c̃(1−σ̃)·s^{−σ̃}, which gives [s^t]^{σ̃} = c̃(1−σ̃)·(1/φ′(s^t)); step ④ uses Assumption 5.1 that 1 ≤ dist(𝟎,∂Θ(𝕨^t;𝕦^t))·φ′(s^t); step ⑤ uses dist(𝟎,∂Θ(𝕨^t;𝕦^t)) ≤ Kβ^t(e^t + e^{t−1}) for both OADMM-EP and OADMM-RR, as shown in Lemma 5.4; step ⑥ uses the fact that e^t = d^{t−1} − d^t, which implies:

et+et1=(dt1dt)+(dt2dt1)=dt2dt.\displaystyle e^{t}+e^{t-1}=(d^{t-1}-d^{t})+(d^{t-2}-d^{t-1})=d^{t-2}-d^{t}.

D.4 Proof of Theorem 5.9

Proof.

Using Lemma 5.8(b), we have:

dtdt2dt+K¨{(βt(dt2dt))1σ~σ~}.\displaystyle d^{t}\leq d^{t-2}-d^{t}+\ddot{K}\cdot\{(\beta^{t}(d^{t-2}-d^{t}))^{\frac{1-\tilde{\sigma}}{\tilde{\sigma}}}\}. (57)

We consider two cases for Inequality (57).

Part (a). σ~(14,12]\tilde{\sigma}\in(\tfrac{1}{4},\frac{1}{2}]. We define up(1σ~)σ~[13,1)u\triangleq\frac{p(1-\tilde{\sigma})}{\tilde{\sigma}}\in[\tfrac{1}{3},1), where p=13p=\tfrac{1}{3} is a fixed constant.

We define β̃^t ≜ K̈(β^t)^{(1−σ̃)/σ̃}. We define t′ ≜ min{i | d^{i−2} − d^i ≤ 1}.

For all ttt\geq t^{\prime}, we have from Inequality (57):

dt\displaystyle d^{t} \displaystyle\leq dt2dt+(dt2dt)1σ~σ~K¨(βt)1σ~σ~β~t\displaystyle d^{t-2}-d^{t}+(d^{t-2}-d^{t})^{\frac{1-\tilde{\sigma}}{\tilde{\sigma}}}\cdot\underbrace{\ddot{K}(\beta^{t})^{\frac{1-\tilde{\sigma}}{\tilde{\sigma}}}}_{\triangleq\tilde{\beta}^{t}} (58)
\displaystyle\overset{\text{\char 172}}{\leq} dt2dt+(dt2dt)β~t\displaystyle d^{t-2}-d^{t}+(d^{t-2}-d^{t})\cdot\tilde{\beta}^{t}
\displaystyle\overset{}{\leq} dt2β~t+1β~t+2,\displaystyle d^{t-2}\cdot\tfrac{\tilde{\beta}^{t}+1}{\tilde{\beta}^{t}+2},

where step ① uses the fact that [Δ^{(1−σ̃)/σ̃}]/Δ = Δ^{(1−2σ̃)/σ̃} = Δ^{1/σ̃−2} ≤ Δ^0 = 1 for all Δ = d^{t−2} − d^t ∈ (0,1] and σ̃ ∈ (0,1/2].

Furthermore, we derive:

t=1T(β~t)1=𝒪(t=1T[tp]1σ~σ~)=𝒪(t=1Ttu)𝒪(T1u),\displaystyle\textstyle\sum_{t=1}^{T}(\tilde{\beta}^{t})^{-1}\overset{\text{\char 172}}{=}\textstyle\mathcal{O}\left(\sum_{t=1}^{T}[t^{p}]^{-\frac{1-\tilde{\sigma}}{\tilde{\sigma}}}\right)\overset{\text{\char 173}}{=}\textstyle\mathcal{O}(\sum_{t=1}^{T}t^{-u})\overset{\text{\char 174}}{\geq}\textstyle\mathcal{O}(T^{1-u}),

where step ① uses β̃^t ≜ K̈(β^t)^{(1−σ̃)/σ̃} and β^t ≜ β^0(1+ξt^p) = 𝒪(t^p); step ② uses the definition of u; step ③ uses Lemma A.9, which gives ∑_{t=1}^T t^{−u} ≥ (1−u)T^{1−u} = 𝒪(T^{1−u}) for all u ∈ (0,1).

Applying Lemma A.12 with a=1−u, we have:

dT𝒪(1exp(T1u)).\displaystyle d^{T}\leq\mathcal{O}\left(\tfrac{1}{\exp(T^{1-u})}\right).
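To see this rate concretely, one can iterate the contraction d^t ≤ d^{t−2}(β̃^t+1)/(β̃^t+2) from Inequality (58) with β̃^t = 𝒪(t^u) (a sketch; all constants are illustrative):

```python
import numpy as np

p, sig_t = 1.0 / 3.0, 0.4            # sigma~ in (1/4, 1/2]; p = 1/3 as in the theorem
u = p * (1.0 - sig_t) / sig_t        # here u = 0.5, inside [1/3, 1)
d = [1.0, 1.0]                       # d^0, d^1
for t in range(2, 2001):
    beta_t = (1.0 + t) ** u          # beta~^t = O(t^u), with illustrative constant 1
    d.append(d[t - 2] * (beta_t + 1.0) / (beta_t + 2.0))

T = 2000
print(np.log(d[T]) / T ** (1.0 - u))  # approaches a negative constant:
                                      # d^T = O(exp(-c * T^(1-u)))
```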

Part (b). σ~(12,1)\tilde{\sigma}\in(\frac{1}{2},1). We define w1σ~σ~(0,1)w\triangleq\frac{1-\tilde{\sigma}}{\tilde{\sigma}}\in(0,1), and τ1/w1(0,)\tau\triangleq 1/w-1\in(0,\infty).

We define β~t=K˙1/wβt\tilde{\beta}^{t}={\dot{K}}^{1/w}\beta^{t}, where K˙K¨+R1w(β0)w{\dot{K}}\triangleq\ddot{K}+R^{1-w}(\beta^{0})^{-w}, and Rd0R\triangleq d^{0}.

Notably, we have: dt2dtd0Rd^{t-2}-d^{t}\leq d^{0}\triangleq R for all t2t\geq 2.

For all t2t\geq 2, we have from Inequality (57):

dt\displaystyle d^{t} \displaystyle\leq dt2dt+K¨(βt)1σ~σ~(dt2dt)1σ~σ~\displaystyle d^{t-2}-d^{t}+\ddot{K}(\beta^{t})^{\frac{1-\tilde{\sigma}}{\tilde{\sigma}}}(d^{t-2}-d^{t})^{\frac{1-\tilde{\sigma}}{\tilde{\sigma}}}
=\displaystyle\overset{\text{\char 172}}{=} K¨{βt(dt2dt)}w+dt2dt\displaystyle\ddot{K}\{\beta^{t}(d^{t-2}-d^{t})\}^{w}+d^{t-2}-d^{t}
\displaystyle\overset{\text{\char 173}}{\leq} K¨{βt(dt2dt)}w+(dt2dt)wR1w\displaystyle\ddot{K}\{\beta^{t}(d^{t-2}-d^{t})\}^{w}+(d^{t-2}-d^{t})^{w}\cdot R^{1-w}
\displaystyle\overset{\text{\char 174}}{\leq} K¨{βt(dt2dt)}w+(dt2dt)wR1w(βtβ0)w\displaystyle\ddot{K}\{\beta^{t}(d^{t-2}-d^{t})\}^{w}+(d^{t-2}-d^{t})^{w}\cdot R^{1-w}\cdot(\tfrac{\beta^{t}}{\beta^{0}})^{w}
=\displaystyle\overset{}{=} {βt(dt2dt)}w(K¨+R1w(β0)w)K˙,\displaystyle\{\beta^{t}(d^{t-2}-d^{t})\}^{w}\cdot\underbrace{(\ddot{K}+R^{1-w}\cdot(\beta^{0})^{-w})}_{\triangleq\dot{K}},

where step ① uses the definition of w; step ② uses the fact that max_{x∈(0,R]} x/x^w ≤ R^{1−w} if w ∈ (0,1) and R > 0; step ③ uses β^0 ≤ β^t and w ∈ (0,1). We further obtain:

[dt]1/w=[dt]τ+1(dt2dt)βtK˙1/wβ~t.\displaystyle\underbrace{[d^{t}]^{1/w}}_{=[d^{t}]^{\tau+1}}\leq(d^{t-2}-d^{t})\cdot\underbrace{\beta^{t}{\dot{K}}^{1/w}}_{\triangleq\tilde{\beta}^{t}}.

Additionally, we have:

t=1T(1/β~t)=𝒪(t=1T(1/βt))=𝒪(t=1Ttp)𝒪(T1p),\displaystyle\textstyle\sum_{t=1}^{T}({1}/{\tilde{\beta}^{t}})\overset{\text{\char 172}}{=}\textstyle\mathcal{O}(\sum_{t=1}^{T}({1}/{{\beta}^{t}}))\overset{\text{\char 173}}{=}\textstyle\mathcal{O}(\sum_{t=1}^{T}t^{-p})\overset{\text{\char 174}}{\geq}\textstyle\mathcal{O}(T^{1-p}),

where step ① uses β̃^t = K̇^{1/w}β^t; step ② uses β^t ≜ β^0(1+ξt^p) = 𝒪(t^p); step ③ uses Lemma A.9, which gives ∑_{t=1}^T t^{−p} ≥ (1−p)T^{1−p} = 𝒪(T^{1−p}) for all p ∈ (0,1).

Applying Lemma A.13 with a=1pa=1-p, we have:

dT𝒪(1/(T(1p)/τ)).\displaystyle d^{T}\leq\mathcal{O}(1/(T^{(1-p)/\tau})).
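Similarly, the implicit recurrence [d^t]^{τ+1} ≤ β̃^t(d^{t−2} − d^t) can be simulated by solving its equality case with bisection (a sketch; all constants are illustrative):

```python
import numpy as np

p, sig_t = 1.0 / 3.0, 0.75           # sigma~ in (1/2, 1)
w = (1.0 - sig_t) / sig_t            # w = 1/3
tau = 1.0 / w - 1.0                  # tau = 2
d = [1.0, 1.0]
for t in range(2, 5001):
    beta_t = 1.0 + t ** p            # beta~^t = O(t^p)
    lo, hi = 0.0, d[t - 2]           # root of x^(tau+1) + beta*x = beta*d^{t-2}
    for _ in range(60):              # bisection (the left-hand side is increasing in x)
        mid = 0.5 * (lo + hi)
        if mid ** (tau + 1.0) + beta_t * mid > beta_t * d[t - 2]:
            hi = mid
        else:
            lo = mid
    d.append(hi)

T = 5000
print(np.log(d[T]) / np.log(T))      # approaches -(1-p)/tau = -1/3: d^T = O(T^{-(1-p)/tau})
```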

Part (c). Finally, using the fact that ‖𝐗^T − 𝐗^∞‖_𝖥 ≤ 𝒪(d^T), as shown in Parts (a-i) and (a-ii) of the proof of Lemma 5.8, we finish the proof of this theorem.

Appendix E Additional Experiments Details and Results

\blacktriangleright Datasets. In our experiments, we utilize several datasets comprising both randomly generated and publicly available real-world data. These datasets are structured as data matrices 𝐃 ∈ ℝ^{ṁ×ḋ}. They are denoted as follows: ‘mnist-ṁ-ḋ’, ‘TDT2-ṁ-ḋ’, ‘sector-ṁ-ḋ’, and ‘randn-ṁ-ḋ’, where randn(m,n) generates a standard Gaussian random matrix of size m×n. The construction of 𝐃 ∈ ℝ^{ṁ×ḋ} involves randomly selecting ṁ examples and ḋ dimensions from the original real-world dataset, sourced from http://www.cad.zju.edu.cn/home/dengcai/Data/TextData.html and https://www.csie.ntu.edu.tw/~cjlin/libsvm/. Subsequently, we normalize each column of 𝐃 to unit norm and center the data by subtracting the column means, i.e., 𝐃 ⇐ 𝐃 − (1/ṁ)𝟏𝟏^𝖳𝐃.
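A minimal numpy sketch of this preprocessing (the random instance below stands in for a loaded real-world matrix):

```python
import numpy as np

def preprocess(D, eps=1e-12):
    # Normalize each column of D to unit norm, then subtract the column means:
    # D <= D - (1/m) 1 1^T D.
    D = D / np.maximum(np.linalg.norm(D, axis=0, keepdims=True), eps)
    return D - D.mean(axis=0, keepdims=True)

D = np.random.randn(1500, 500)   # e.g., the 'randn-1500-500' instance
D = preprocess(D)
```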

\blacktriangleright Additional Experimental Results. We present additional experimental results in Figures 3, 4, and 5. The figures demonstrate that the proposed OADMM method generally outperforms the other methods, with OADMM-EP surpassing OADMM-RR. These results reinforce our previous conclusions.

Figure 3: The convergence curve of the compared methods with ρ̇=10. Panels (a)–(h): mnist-1500-500, mnist-2500-500, TDT2-1500-500, TDT2-2500-500, sector-1500-500, sector-2500-500, randn-1500-500, randn-2500-500.
Figure 4: The convergence curve of the compared methods with ρ̇=100. Panels (i)–(p): mnist-1500-500, mnist-2500-500, TDT2-1500-500, TDT2-2500-500, sector-1500-500, sector-2500-500, randn-1500-500, randn-2500-500.
Figure 5: The convergence curve of the compared methods with ρ̇=1000. Panels (a)–(h): mnist-1500-500, mnist-2500-500, TDT2-1500-500, TDT2-2500-500, sector-1500-500, sector-2500-500, randn-1500-500, randn-2500-500.