
First-Order Bilevel Topology Optimization for Fast Mechanical Design

Zherong Pan, Xifeng Gao, and Kui Wu
Authors are with the Lightspeed & Quantum Studio, Tencent-America. {zrpan,xifgao,[email protected]}
Abstract

Topology Optimization (TO), which maximizes structural robustness under material weight constraints, is becoming an essential step in the automatic design of mechanical parts. However, existing TO algorithms rely on Finite Element Analysis (FEA), which requires massive computational resources. We present a novel TO algorithm that incurs a much lower iterative cost. Unlike conventional methods that require exact inversions of large FEA system matrices at every iteration, we reformulate the problem as a bilevel optimization that can be solved using a first-order algorithm and only inverts the system matrix approximately. As a result, our method incurs a low iterative cost, and users can preview the TO results interactively for fast design updates. Theoretical convergence analysis and numerical experiments are conducted to verify the effectiveness of our method. We further discuss extensions to use high-performance preconditioners and fine-grained parallelism on the Graphics Processing Unit (GPU).

Index Terms:
Topology Optimization, Bilevel Optimization, Low-Cost Robotics, Mechanisms and Design

I Introduction

Several emerging techniques over the past decade are contributing to the new trend of lightweight, low-cost mechanical designs. Commodity-level additive manufacturing has been made possible by 3D printers and user-friendly modeling software, allowing a novice to prototype devices, consumer products, and low-end robots. While the appearance of a mechanical part is intuitive to perceive and modify, other essential properties, such as structural robustness, fragility, vibration frequencies, and manufacturing cost, are relatively hard to either visualize or manage. These properties are closely related to the ultimate performance criteria, including flexibility, robotic maneuverability, and safety. For example, the mass and inertia of a collaborative robot should be carefully controlled to avoid harming humans in the event of a collision. For manipulators on spaceships, the length of each part should be optimized to maximize the end-effector reachability under strict weight limits. Therefore, many mechanical designers rely on TO methods to search for the optimal structure of mechanical parts by tuning their center-of-mass [1], inertia [2], infill pattern [3], or material distribution [4].

Figure 1: Given the same amount of computational time (200 s) on a grid resolution of $128\times 256$, our method using approximate inverses (a) finds complex mechanical designs that resist the external load (red force pointing downwards), while the prior method [5] (b) is still far from convergence.

Unfortunately, current TO algorithms [6] pose a major computational bottleneck. Indeed, TO algorithms rely on FEA to predict the relationship between the extrinsic status (boundary loads) and intrinsic status (stress, strain, and energy densities) of a mechanical part. Such predictions are made by solving systems of (non-)linear equations during each iteration of TO. The design space and resulting system size are often gigantic when designing fine-grained material shapes. For example, the $128\times 256$ design space illustrated in Figure 1 induces $10^{4}$ decision variables. Prior works make full use of high-end GPUs [7] or many-core CPUs [8, 9] to solve such problems. Even after such aggressive optimization, the total time spent on FEA is typically at the level of hours. As a result, mechanical engineers cannot modify or preview their designs without waiting for tens of minutes or even hours, which significantly slows down the loops of design refinement.

Main Results: We propose a low-cost algorithm that solves a large subclass of TO problems known as Solid Isotropic Material with Penalization (SIMP) [10]. The SIMP model is widely used to search for fine-grained geometric shapes by optimizing the infill levels. Our main observation is that, in conventional TO solvers, the computational resources are dominated by solving the system of equations to machine precision. We argue that pursuing such precision for intermediary TO iterations is unnecessary, as one only expects the system to be exactly solved at final convergence. Therefore, we propose to maintain an approximate solution that is refined over iterations. We show that such a TO solver corresponds to a bilevel reformulation of the SIMP problems (Section III). We further propose a first-order algorithm to solve these bilevel problems with guaranteed convergence to a first-order critical point (Section IV) of moderate accuracy. Compared with conventional TO algorithms [6], our method incurs a much lower iterative cost of $\mathcal{O}(E\log E)$ ($E$ is the number of decision variables), allowing mechanical designers to interactively preview the TO results and accelerate the loops of design refinement. We then discuss several useful extensions, including different preconditioning schemes for refining the approximate solutions (Section IV-A) and a GPU-based implementation that achieves an asymptotic iterative cost of $\mathcal{O}(\log E)$ (Section IV-C). We highlight the effectiveness of our method through a suite of 2D benchmark problems from [11, 12, 13] (Section V), where our iterative cost can be as low as 11-40 ms when handling a resolution of up to 110k grid cells.

II Related Work

We review representative TO formulations, numerical solvers for TO problems, and bilevel optimizations.

TO Formulations & Applications: Generally speaking, TO involves all mathematical programming problems that use geometric shapes as decision variables. A TO problem has three essential components: a geometric representation, a governing system of equations, and design objectives/constraints. The SIMP model [10] uses a density-based representation where the geometric shape is discretized using FEA and the decision variables are the infill levels of discrete elements. There are other popular representations. For example, truss optimization [14] uses decision variables to model the existence of discrete beams. Infill pattern optimization [3] searches for the local arrangements of shape interiors. Compared with the SIMP model, these formulations typically induce far fewer decision variables due to the simplicity of shapes or locality.

In terms of the governing system of equations, the SIMP model assumes linearized elastic models, which means that the extrinsic and intrinsic properties of a mechanical part are linearly related. Linear models make fairly accurate predictions under the small-deformation assumption, which is the case in most mechanical design problems with rigid parts. For large deformations, there are extensions to the SIMP model that use nonlinear elastic models [15, 16] or even elastoplastic models [17]. However, the use of these nonlinear, non-convex models induces optimization problems of a more complex class than the SIMP model [18], to which our analysis does not apply. In a similar fashion to SIMP, truss optimization [19] has small- and large-deformation variants. We speculate that similar ideas and analysis apply to truss optimization problems under small deformations.

TO formulations deviate from each other mostly in their design objectives and constraints. The SIMP model minimizes the total potential energy under a single external load and a total-volume constraint. The standard SIMP model does not have constraints to limit deformations, but there are extensions that include strain/stress constraints [20]. Unfortunately, stress-constrained formulations also induce a more complex problem class, to which our analysis does not apply. Yet another extension to the SIMP model [21, 22, 23] is to consider multiple or uncertain external loads. Such extensions are favorable because a mechanical system can undergo a variety of unknown extrinsic situations. We speculate that it is possible to adapt our method to these cases by sampling the uncertain situations and using our method as the underlying solver for each sampled scenario.

Numerical TO Solvers: Three categories of numerical approaches have been extensively used: truncated gradient methods, general-purpose methods, and hierarchical methods. The reference implementation [10] of the SIMP solver uses the heuristic, truncated gradient method [24]. This method first computes the exact gradient and then updates the design along the negative gradient direction using heuristic step sizes. Although the convergence of these heuristic schemes is hard to analyze, they perform reasonably well in practice and are thus widely used, e.g. in [12]. General-purpose methods are off-the-shelf optimization algorithms (e.g., sequential quadratic programming (SQP), the interior point method [25], and the augmented Lagrangian method (ALM) [26]). The generality of these solvers makes it easier to experiment with different TO formulations. Without problem-specific optimization, however, they typically scale poorly to the high-dimensional decision spaces of practical problems. In particular, the (Globally Convergent) Method of Moving Asymptotes (GCMMA) [27] is a special form of SQP that can handle a large set of box constraints, making it particularly suitable for SIMP-type problems. Finally, solving the problem hierarchically is an attractive way to tackle high-dimensional problems. [28] used a quadtree to focus computational resources on the material boundaries. The multigrid method [12, 29] constructs a hierarchy of grids to accelerate FEA system solves.

Bilevel Optimization [18]: When a constraint of an optimization problem (the high-level problem) involves another optimization (the low-level problem), the problem is considered bilevel. In our formulation, the high-level problem optimizes the material infill levels, while the low-level problem performs the FEA. Bilevel problems frequently arise in robotics for time optimality [30], trajectory search [31], and task-and-motion planning [32]. As a special case considered in our method, if the low-level problem is strictly convex, then its solution is unique and the low-level problem can be replaced by a well-defined solution mapping function. This treatment essentially reduces the bilevel problem to a single-level optimization. Otherwise, the low-level problem can be non-convex and the solution map multi-valued, and an additional mechanism is needed to compare and select low-level solutions (e.g., a maximum- or minimum-value function can be used, and the associated problem is NP-hard [33]). It is noteworthy that oftentimes an arbitrary solution of the low-level problem suffices, in which case a penalty method [34] can be used to approximate one of the low-level solutions. Bilevel formulations have been used in the TO community for years, but prior works resort to off-the-shelf solvers [35] or machine learning [36] to (approximately) solve the underlying optimization problem, while we utilize the special structure of the SIMP model to design our efficient, first-order method.

III Problem Formulation

In this section, we introduce the SIMP model using the notation of [12] (see Table II in our appendix for a list of symbols). This model uses FEA to discretize a mechanical part governed by the linear elastic constitutive law. If the material is undergoing an internal displacement $u(x):\Omega\to\mathbb{R}^{2}$, where $\Omega$ is the material domain, then the internal potential energy is accumulated according to the constitutive law as follows:

P[u(\bullet),v(\bullet)]=\int_{\Omega}\frac{1}{2}u(x)^{T}k_{e}(v(x),x)u(x)\,dx,

where $P[\bullet,\bullet]$ is the potential energy functional and $k_{e}$ is the spatially varying stiffness tensor parameterized by the infinite-dimensional decision variable $v(x):\Omega\to\mathbb{R}$ that determines the infill levels. FEA discretizes the material domain $\Omega$ into a set of $E$ elements, each spanning the sub-domain $\Omega_{e}$. The elements are connected by a set of $N$ nodes. By restricting $P[\bullet]$ to a certain finite-dimensional functional space (the shape space), the potential energy can be written as the following sum over elements:

P(u,v)=\frac{1}{2}u^{T}\sum_{e=1}^{E}K_{e}(v)u\triangleq\frac{1}{2}u^{T}K(v)u,
K_{e}(v)\triangleq\int_{\Omega_{e}}k_{e}(v,x)\,dx.

Here $K_{e}$ is the stiffness matrix of the $e$th element and $K$ is the global stiffness matrix assembled from the $K_{e}$. With a slight abuse of notation, we denote by $P(\bullet,\bullet)$ a finite-dimensional potential energy function, by $u\in\mathbb{R}^{2N}$ without brackets a vector of 2D nodal displacements ($u\in\mathbb{R}^{3N}$ in 3D), and by $v\in\mathbb{R}^{E}$ without brackets a vector of elementwise infill levels. The SIMP model is compatible with all kinds of FEA discretization schemes; the case of a regular grid is illustrated in Figure 2.
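For concreteness, the following is a minimal Python/NumPy sketch of this assembly on a regular nelx-by-nely grid, assuming the standard bilinear-quad element stiffness matrix from the 99-line SIMP code [10] and the power-law parameterization introduced below; all names are our own illustration rather than the paper's implementation.

```python
import numpy as np
from scipy.sparse import coo_matrix

def element_stiffness(nu=0.3):
    # 8x8 stiffness matrix K_e of a bilinear quadrilateral element with unit
    # Young's modulus and Poisson ratio nu, as in the 99-line SIMP code [10].
    k = np.array([1/2 - nu/6, 1/8 + nu/8, -1/4 - nu/12, -1/8 + 3*nu/8,
                  -1/4 + nu/12, -1/8 - nu/8, nu/6, 1/8 - 3*nu/8])
    idx = np.array([[0, 1, 2, 3, 4, 5, 6, 7],
                    [1, 0, 7, 6, 5, 4, 3, 2],
                    [2, 7, 0, 5, 6, 3, 4, 1],
                    [3, 6, 5, 0, 7, 2, 1, 4],
                    [4, 5, 6, 7, 0, 1, 2, 3],
                    [5, 4, 3, 2, 1, 0, 7, 6],
                    [6, 3, 4, 1, 2, 7, 0, 5],
                    [7, 2, 1, 4, 3, 6, 5, 0]])
    return k[idx] / (1 - nu**2)

def element_dofs(ex, ey, nely):
    # The 8 global DOF indices of element (ex, ey): two DOFs per corner node,
    # with nodes numbered column-by-column as in [10].
    n1 = (nely + 1) * ex + ey        # upper-left node
    n2 = (nely + 1) * (ex + 1) + ey  # upper-right node
    return np.array([2*n1, 2*n1+1, 2*n2, 2*n2+1,
                     2*n2+2, 2*n2+3, 2*n1+2, 2*n1+3])

def assemble_K(v, nelx, nely, eta=3.0, KE=None):
    # K(v) = sum_e v_e^eta * K_e (power-law SIMP), with v indexed e = ex*nely + ey.
    KE = element_stiffness() if KE is None else KE
    ndof = 2 * (nelx + 1) * (nely + 1)
    rows, cols, vals = [], [], []
    for ex in range(nelx):
        for ey in range(nely):
            edof = element_dofs(ex, ey, nely)
            rows.append(np.repeat(edof, 8))
            cols.append(np.tile(edof, 8))
            vals.append((v[ex * nely + ey] ** eta) * KE.ravel())
    return coo_matrix((np.concatenate(vals),
                       (np.concatenate(rows), np.concatenate(cols))),
                      shape=(ndof, ndof)).tocsr()
```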

Figure 2: Key steps of the SIMP model and the design loop: define the SIMP problem with the external load (red) and fixture (brown); compute the displacement $u$ using FEA (a); perform sensitivity analysis to compute $\nabla l$ (b); update the infill levels (c); repeat (b,c,d) until convergence; deploy on a real robot (e).

Suppose an external force $f$ is exerted on the mechanical part; then the total (internal+external) potential energy is:

P_{f}(u,v)=\frac{1}{2}u^{T}K(v)u-f^{T}u.

The SIMP model works with arbitrary force setups, but external forces are only applied to the boundary in practice. For example, if only the $i$th boundary node is under a force $f_{i}$, then we have $f=e_{i}f_{i}$ with $e_{i}$ being the corresponding unit vector. The equilibrium state is derived by minimizing $P_{f}$ with respect to $u$, giving $u_{f}=K^{-1}(v)f$ (Figure 2a). As the name suggests, the two variables $u_{f},f$ are linearly related under the linear elastic constitutive law. The goal of TO is to minimize the induced internal potential energy due to $f$:

\underset{v\in X}{\text{argmin}}\;l(v)\triangleq\frac{1}{2}u_{f}^{T}(v)K(v)u_{f}(v)=\frac{1}{2}f^{T}K^{-1}(v)f,\qquad(1)

where $X$ is a compact, convex set bounding $v$ away from potentially singular configurations. Several widely used stiffness matrix parameterizations of $K(v)$ include the linear law $K(v)=\sum_{e=1}^{E}v_{e}K_{e}$ and the power law $K(v)=\sum_{e=1}^{E}v_{e}^{\eta}K_{e}$, where $\eta>0$ is some constant. In both cases, we need to choose $X$ such that the spectrum of $K$ is bounded from both sides:

\mathbb{0}<\underline{\rho}\mathbb{1}\leq\rho(K(v))\leq\bar{\rho}\mathbb{1},\qquad(2)

for any $v\in X$, where $\rho(\bullet)$ denotes the eigenvalues of a matrix. A typical choice of $X$ to this end is:

X=\{v\;|\;\mathbb{1}\geq v\geq\underline{v}\mathbb{1}>\mathbb{0}\;\land\;\mathbb{1}^{T}v\leq\bar{v}\},

where $v_{e}$ is the infill level of the $e$th element, $\underline{v}$ is the minimal elementwise infill level, $\bar{v}$ is the total infill level, and $\mathbb{1},\mathbb{0}$ are the all-one and all-zero vectors, respectively. In this work, we use the following more general assumption on the stiffness matrix parameterization and the shape of $X$:

Assumption III.1.

$X$ is a compact polytopic subset of $\mathbb{R}^{E}$, on which Equation 2 holds and $K(v)$ is smooth.

Remark III.2.

The compactness of $X$ is obvious. $X$ is also polytopic because we only have linear constraints, which is essential when we measure optimality using the relative projection error (see Section IV for details). Assumption III.1 is more general than the standard SIMP model. Indeed, $K(v)$ can be parameterized using any smooth activation function $\mathcal{A}_{e}(v)$ for the $e$th element as $K(v)=\sum_{e=1}^{E}\mathcal{A}_{e}(v)K_{e}$, which obviously satisfies Assumption III.1 as long as $\mathcal{A}_{e}(v)$ is bounded from both above and below on $X$, away from zero. Our assumption also allows a variety of topology modifications and constraints, some of which are illustrated here. Symmetry-Constraint: We can plug in a left-right symmetric mapping matrix $\mathbb{S}$ and define $K(v)=K(\mathbb{S}v_{s})$, where $v_{s}$ contains the infill levels for the left half of the material block. Component-Cost-Constraint: A mechanical part can be divided into different sub-components and the amount of material can be assigned for each component as follows:

X\triangleq\{v\;|\;v=({v^{1}}^{T},{v^{2}}^{T},\cdots,{v^{\#}}^{T})^{T}\;\land\;\mathbb{1}^{T}v^{i}\leq\bar{v}^{i}\},

where $\#$ is the number of sub-components, $v^{i}$ is the sub-vector of $v$ consisting of the decision variables for the $i$th component, and $\bar{v}^{i}$ is the total allowed amount of material for that sub-component. Material-Filtering: [37] proposed to control the thickness of materials by filtering the infill levels using some convolutional kernel denoted as $\mathbb{C}$. As long as the convolution operator $\mathbb{C}(\bullet)$ is smooth, Assumption III.1 holds by plugging in $K(v)=\sum_{e=1}^{E}\mathcal{A}_{e}(\mathbb{C}(v))K_{e}$.
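As a concrete instance of such a filter, below is a minimal sketch of a separable Gaussian kernel $\mathbb{C}$ applied to the infill grid; the kernel width sigma is our own assumption (the experiments in Section V use a $7\times 7$ kernel, i.e., radius 3).

```python
import numpy as np

def gaussian_filter_separable(v_grid, radius=3, sigma=1.5):
    # Material filter C(v): a (2*radius+1)^2 Gaussian kernel applied as two
    # 1D passes (separability). v_grid is the infill field of shape (nely, nelx).
    x = np.arange(-radius, radius + 1)
    w = np.exp(-0.5 * (x / sigma) ** 2)
    w /= w.sum()  # weights sum to 1, so the output stays inside [v_min, 1]
    # Pad with edge values so the filtered field keeps its range near boundaries.
    padded = np.pad(v_grid, radius, mode='edge')
    # Horizontal pass, then vertical pass.
    tmp = np.apply_along_axis(lambda r: np.convolve(r, w, mode='valid'), 1, padded)
    return np.apply_along_axis(lambda c: np.convolve(c, w, mode='valid'), 0, tmp)
```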

All conventional methods for solving Equation 1, such as (GC)MMA [38, 27], SQP [5, 25], and [12], involve computing the exact gradient via sensitivity analysis (Figure 2b):

\nabla l=\frac{1}{2}\frac{\partial u_{f}^{T}}{\partial v}Ku_{f}+\frac{1}{2}u_{f}^{T}\frac{\partial K}{\partial v}u_{f}+\frac{1}{2}u_{f}^{T}K\frac{\partial u_{f}}{\partial v}=-\frac{1}{2}u_{f}^{T}\frac{\partial K}{\partial v}u_{f},

where we have used the derivative of the matrix inverse. The major bottleneck lies in the computation of $u_{f}$, which involves an exact matrix inversion. Here we assume that $\frac{\partial K}{\partial v}$ is a third-order tensor of size $|u|\times|u|\times|v|$ and that its left- or right-multiplication means contraction along the first and second dimensions. Note that the convergence of GCMMA and SQP solvers relies on line-search schemes that involve the computation of objective function values, which in turn requires matrix inversion. Practical methods [12] and [10] avoid line-search schemes by using heuristic step sizes. Although these heuristic rules work well in practice, their theoretical convergence is difficult to establish.
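For the power law, this gradient reduces to an elementwise expression; below is a sketch consistent with the assembly helpers above (again our own naming, not the authors' code).

```python
def grad_l(u, v, nelx, nely, eta=3.0, KE=None):
    # Sensitivity analysis (Figure 2b) for K(v) = sum_e v_e^eta K_e: the e-th
    # entry is -1/2 u^T (dK/dv_e) u = -(eta/2) v_e^(eta-1) u_e^T K_e u_e,
    # where u_e gathers the 8 nodal DOFs of element e.
    KE = element_stiffness() if KE is None else KE
    g = np.empty(nelx * nely)
    for ex in range(nelx):
        for ey in range(nely):
            e = ex * nely + ey
            ue = u[element_dofs(ex, ey, nely)]
            g[e] = -0.5 * eta * v[e] ** (eta - 1) * (ue @ KE @ ue)
    return g
```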

IV Bilevel Topology Optimization

In this section, we first propose our core framework to solve SIMP problems that allows inexact matrix inversions. We then discuss extensions to accelerate convergence via preconditioning (Section IV-A) and fast projection operators (Section IV-B). Finally, we briefly discuss parallel algorithm implementations (Section IV-C). Inspired by the recent advent of first-order bilevel optimization algorithms [39, 40, 41], we propose to reformulate Equation 1 as the following bilevel optimization:

\underset{v\in X}{\text{argmin}}\;f(u,v)\triangleq\frac{1}{2}u_{f}^{T}K(v)u_{f}\qquad(3)
\text{s.t.}\;u_{f}\in\underset{u}{\text{argmin}}\;P_{f}(u,v).

The low-level part of Equation 3 is a least-squares problem in $u$ and, since $K(v)$ is always positive definite when $v\in X$, the low-level solution is unique. Plugging the low-level solution into the high-level objective function recovers Equation 1. The first-order bilevel optimization solves Equation 3 by time-integrating the discrete dynamic system described in Algorithm 1, and we denote this algorithm as First-Order Bilevel Topology Optimization (FBTO). Line 2 of Algorithm 1 is a single damped Jacobi iteration for refining the low-level solution using an adaptive step size $\beta_{k}$. But instead of performing multiple Jacobi iterations until convergence, we go ahead and use the inexact result after one iteration for sensitivity analysis. Finally, we use projected gradient descent to update the infill levels with an adaptive step size $\alpha_{k}$ (Line 3).

Algorithm 1 FBTO
1: for $k\leftarrow 1,2,\cdots$ do
2:     $u_{k+1}\leftarrow u_{k}-\beta_{k}(K(v_{k})u_{k}-f)$
3:     $v_{k+1}\leftarrow\text{Proj}_{X}(v_{k}+\frac{\alpha_{k}}{2}u_{k}^{T}\frac{\partial K}{\partial v_{k}}u_{k})$

Here $\text{Proj}_{X}(\bullet)$ is the projection onto the convex set $X$ under the Euclidean distance. This scheme has a linear iterative cost of $\mathcal{O}(E)$, as compared with conventional methods that have superlinear iterative costs due to sparse matrix inversions (some operations of Algorithm 1 have a complexity of $\mathcal{O}(N)$, but we know that $E=\Theta(N)$).
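In code, Algorithm 1 is only a few lines. The sketch below reuses the helpers above and a projection proj_X (sketched in Section IV-B); the step-size constants are illustrative placeholders that must satisfy Assumptions IV.2 and IV.4 in practice.

```python
def fbto(v0, f, nelx, nely, v_bar, alpha0=0.001, beta=0.5, m=0.75, iters=5000):
    # First-Order Bilevel Topology Optimization (Algorithm 1). beta must obey
    # Assumption IV.2 (0 < beta < 2*rho_min/rho_max^2); alpha_k = alpha0 * k^-m.
    u = np.zeros_like(f)
    v = v0.copy()
    for k in range(1, iters + 1):
        K = assemble_K(v, nelx, nely)
        u = u - beta * (K @ u - f)           # Line 2: one inexact low-level step
        g = grad_l(u, v, nelx, nely)         # approximate gradient (Remark IV.1)
        v = proj_X(v - alpha0 * k**(-m) * g, v_bar=v_bar)  # Line 3: projected step
    return v, u
```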

Remark IV.1.

The succinct form of our approximate gradient ($-u_{k}^{T}(\partial K/\partial v_{k})u_{k}/2$ in Line 3) makes use of the special structure of the SIMP model, which is not possible for more general bilevel problems. If a general-purpose first-order bilevel solver were used, e.g. [39] and follow-up works, one would first compute/store $\nabla_{uv}P_{f}$ and then approximate $\nabla_{uu}P_{f}^{-1}\nabla_{uv}P_{f}$ via sampling. Although such stochastic approximation schemes do not require exact matrix inversions, they induce an increasing number of samples as the number of iterations grows. Instead, we make use of the cancellation between $K(v)$ in the high-level objective and $K^{-1}(v)$ in the low-level objective to avoid matrix inversion or its approximations, which allows our algorithm to scale to high dimensions without increasing the sample complexity.

As our main result, we show that Algorithm 1 is convergent under the following choices of parameters.

Assumption IV.2.

We choose a constant $\beta_{k}$ and some constant $p>0$ satisfying the following condition:

0<\beta_{k}<\frac{2\underline{\rho}}{\bar{\rho}^{2}}\;\land\;p>\frac{1-2\beta_{k}\underline{\rho}+\bar{\rho}^{2}\beta_{k}^{2}}{2\beta_{k}\underline{\rho}-\bar{\rho}^{2}\beta_{k}^{2}}.\qquad(4)
Assumption IV.3.

Define $\Delta_{k}\triangleq\|u_{k}-u_{f}(v_{k})\|^{2}$ and suppose $\Delta_{1}\leq U^{2}$; we choose the following uniformly bounded sequence $\{\Gamma_{k}\}$ with constants $\bar{\Gamma},\bar{\Theta}$:

\bar{\Theta}\triangleq\frac{p+1}{p}(1-2\beta_{k}\underline{\rho}+\beta_{k}^{2}\bar{\rho}^{2})\;\land\;\Gamma_{k}\leq\bar{\Gamma}\leq\frac{U^{2}(1-\bar{\Theta})}{(U+U_{f})^{4}},

where $U_{f}$ is the finite upper bound of $u_{f}(v)$ on $X$.

Assumption IV.4.

We choose $\alpha_{k}$ as follows:

\alpha_{k}=\frac{1}{k^{m}}\sqrt{\frac{\bar{\Gamma}}{L_{u}^{2}L_{\nabla K}^{2}}\frac{4}{p+1}},

for positive constants $m,p$, where $L_{u}$ is the L-constant of $u_{f}$ and $L_{\nabla K}$ is the upper bound of the Frobenius norm of $\partial K(v)/\partial v$ when $v\in X$.

Assumption IV.5.

We choose positive constants $m,n$ satisfying the following condition:

\max(1-2m+mn,\;2-2m-mn,\;2-3m)<0\;\land\;m<1.
Theorem IV.6.

Suppose $\{v_{k}\}$ is the sequence generated by Algorithm 1 under Assumptions III.1, IV.2, IV.3, IV.4, and IV.5. For any $\epsilon>0$, there exists an iteration number $k$ such that $\|\nabla_{X}l(v_{k})\|\leq\epsilon$, where $\nabla_{X}l(v_{k})$ is the negative gradient projected into the tangent cone $\mathcal{T}_{X}$ of $X$, defined as:

\nabla_{X}l(v_{k})=\underset{d\in\mathcal{T}_{X}}{\text{argmin}}\;\|d+\nabla l(v_{k})\|.

We prove Theorem IV.6 in Section VIII, where we further show that the high-level optimality error scales as $\mathcal{O}(k^{m-1})$, so that $m$ should be as small as possible for the best convergence rate. Direct calculation leads to the constraint $m>3/4$ from the first two conditions of Assumption IV.5, so the convergence rate can be arbitrarily close to $\mathcal{O}(k^{-1/4})$ as measured by the following relative projection error [42]:

\Delta_{k}^{v}\triangleq\frac{1}{\alpha_{k}^{2}}\|\text{Proj}_{X}(v_{k}-\alpha_{k}\nabla l(v_{k}))-v_{k}\|^{2}.

Although our convergence analysis uses a similar technique to that of the two-timescale method [40], our result is single-timescale. Indeed, the low-level step size $\beta_{k}$ can be constant, and users only need to tune a decaying step size $\alpha_{k}$. We will further show that certain versions of our algorithm allow a large choice of $\beta_{k}=1$ (see Section IX-E). In addition, our formula for choosing $\alpha_{k}$ does not rely on the maximal number of iterations of the algorithm.

Algorithm 2 PFBTO
1: for $k\leftarrow 1,2,\cdots$ do
2:     $\delta_{k}\leftarrow M^{-1}(v_{k})(K(v_{k})u_{k}-f)$
3:     $u_{k+1}\leftarrow u_{k}-\beta_{k}K(v_{k})M^{-1}(v_{k})\delta_{k}$
4:     $v_{k+1}\leftarrow\text{Proj}_{X}(v_{k}+\frac{\alpha_{k}}{2}u_{k}^{T}\frac{\partial K}{\partial v_{k}}u_{k})$
Algorithm 3 CPFBTO
1: for $k\leftarrow 1,2,\cdots$ do
2:     $u_{k+1}\leftarrow u_{k}-\beta_{k}M^{-1}(v_{k})(K(v_{k})u_{k}-f)$
3:     $v_{k+1}\leftarrow\text{Proj}_{X}(v_{k}+\frac{\alpha_{k}}{2}u_{k}^{T}\frac{\partial K}{\partial v_{k}}u_{k})$

IV-A Preconditioning

Our Algorithm 1 uses steepest gradient descent to solve the linear system in the low-level problem, which is known to have a slow convergence rate of $(1-1/\kappa)^{2k}/(1+1/\kappa)^{2k}$ [43], with $\kappa$ being the condition number of the linear system matrix. A well-developed technique to boost the convergence rate is preconditioning [44], i.e., pre-multiplying by a symmetric positive-definite matrix $M(v)$ that approximates $K(v)$ and whose inverse can be computed at low cost. Preconditioning is a widely used technique in the TO community [12, 45] to accelerate the convergence of iterative linear solvers in computing $u_{f}(v)=K^{-1}(v)f$. In this section, we show that FBTO can be extended to this setting by pre-multiplying by $M(v)$ in the low-level problem, and we name Algorithm 2 Pre-conditioned FBTO or PFBTO. Algorithm 2 solves the following bilevel program, different from Equation 3:

\underset{v\in X}{\text{argmin}}\;f(u,v)\triangleq\frac{1}{2}u_{f}^{T}K(v)u_{f}\qquad(5)
\text{s.t.}\;u_{f}\in\underset{u}{\text{argmin}}\;\|K(v)u-f\|^{2},

where the only difference is that the low-level system matrix is squared to $K^{2}(v)$. Since $K(v)$ is non-singular, the two problems have the same solution set. Equation 5 allows us to multiply by $M^{-1}(v)$ twice in Algorithm 2 to get the symmetric form $K(v)M^{-2}(v)K(v)$. To show the convergence of Algorithm 2, we only need the additional assumption that the spectrum of $M(v)$ is uniformly bounded:

Assumption IV.7.

For any $v\in X$, we have:

\mathbb{0}<\underline{\rho}_{M}\mathbb{1}\leq\rho(M(v))\leq\bar{\rho}_{M}\mathbb{1}.

Assumption IV.7 holds for all symmetric positive-definite preconditioners. The convergence can then be proved following the same reasoning as Theorem IV.6, with a minor modification summarized in Section VIII-E. Our analysis sheds light on the convergent behavior of prior works using highly efficient preconditioners such as the geometric multigrid [12, 45] and conjugate gradient methods [46]. We further extend these prior works by enabling convergent algorithms with lightweight preconditioners such as Jacobi/Gauss-Seidel iterations and approximate inverse schemes, to name just a few. The practical performance of Algorithm 2 highly depends on the design and implementation of the specific preconditioner.
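For instance, a Jacobi preconditioner $M(v)=\text{diag}(K(v))$ is one such lightweight choice, and it satisfies Assumption IV.7 whenever Equation 2 holds; a minimal sketch:

```python
from scipy.sparse import diags

def jacobi_precond_inv(K):
    # M(v) = diag(K(v)): symmetric positive-definite with spectrum bounded on X
    # whenever K(v)'s is, so Assumption IV.7 holds. Returns M^{-1} as a sparse
    # operator to be applied in Algorithm 2.
    return diags(1.0 / K.diagonal())
```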

On the downside of Algorithm 2, the system matrix in the low-level problem of Equation 5 is squared, and so is the condition number. Since the convergence speed of the low-level problem is $(1-1/\kappa)^{2k}/(1+1/\kappa)^{2k}$, squaring the system matrix can significantly slow down the convergence, counteracting the acceleration brought by preconditioning. This is because we need to derive a symmetric operator of the form $K(v)M^{-2}(v)K(v)$. As an important special case, however, a single application of $M^{-1}(v)$ suffices if $M(v)$ commutes with $K(v)$, leading to the Commutable PFBTO or CPFBTO Algorithm 3. A useful commuting preconditioner is the Arnoldi process used by the GMRES solver [47], which is defined as:

M^{-1}(v)b\triangleq\sum_{i=0}^{D}c_{i}^{*}K^{i}(v)b\qquad(6)
c_{i}^{*}\triangleq\underset{c_{i}}{\text{argmin}}\;\|b-\sum_{i=1}^{D+1}c_{i-1}K^{i}(v)b\|^{2},

where $D$ is the size of the Krylov subspace. Since it shares the same eigenvectors, $M(v)$ clearly commutes with $K(v)$. This is a standard technique used as the inner loop of the GMRES solver, where the least-squares problem in Equation 6 is solved via the Arnoldi iteration. The Arnoldi process can be efficiently updated through iterations if $K(v)$ is fixed, which is not the case in our problem. Therefore, we choose the more numerically stable Householder QR factorization to solve the least-squares problem, and we name Equation 6 the Krylov preconditioner.
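Below is a minimal sketch of applying this Krylov preconditioner; for brevity it solves the small $(D+1)$-variable least-squares problem with a dense solver instead of the Householder QR factorization used in the paper, which does not change the result.

```python
def krylov_apply(K, b, D=20):
    # M^{-1}(v) b = sum_{i=0}^{D} c_i* K^i b, with c* minimizing
    # ||b - sum_{i=1}^{D+1} c_{i-1} K^i b||^2 (Equation 6).
    cols = []
    w = b
    for _ in range(D + 1):          # Krylov basis [K b, K^2 b, ..., K^{D+1} b]
        w = K @ w
        cols.append(w)
    A = np.stack(cols, axis=1)      # |u| x (D+1)
    c, *_ = np.linalg.lstsq(A, b, rcond=None)  # c_i* = argmin ||b - A c||
    # Shift the basis down by one power of K: [b, K b, ..., K^D b].
    basis = [b] + cols[:-1]
    return sum(ci * wi for ci, wi in zip(c, basis))
```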

To show the convergence of Algorithm 3, we use a slightly different analysis summarized in the following assumptions and theorem:

Assumption IV.8.

We choose a constant $\beta_{k}$ and some constant $p>0$ satisfying the following condition:

0<\beta_{k}<\frac{2\underline{\rho}\underline{\rho}_{M}^{2}}{\bar{\rho}^{2}\bar{\rho}_{M}}\;\land\;p>\frac{1-2\beta_{k}\frac{\underline{\rho}}{\bar{\rho}_{M}}+\beta_{k}^{2}\frac{\bar{\rho}^{2}}{\underline{\rho}_{M}^{2}}}{2\beta_{k}\frac{\underline{\rho}}{\bar{\rho}_{M}}-\beta_{k}^{2}\frac{\bar{\rho}^{2}}{\underline{\rho}_{M}^{2}}}.\qquad(7)
Assumption IV.9.

Define $\Xi_{k}\triangleq\|K(v_{k})u_{k}-u_{f}\|^{2}$ and suppose $\Xi_{1}\leq U^{2}$; we choose the following uniformly bounded sequence $\{\Gamma_{k}\}$ with constants $\bar{\Gamma},\bar{\Theta}$:

\Theta_{k}\triangleq\frac{p+1}{p}(1-2\beta_{k}\frac{\underline{\rho}}{\bar{\rho}_{M}}+\beta_{k}^{2}\frac{\bar{\rho}^{2}}{\underline{\rho}_{M}})\;\land\;\Gamma_{k}\leq\bar{\Gamma}\leq\frac{U^{2}(1-\bar{\Theta})}{(U/\underline{\rho}+U_{f})^{6}},

where $U_{f}$ is the finite upper bound of $u_{f}(v)$ on $X$.

Assumption IV.10.

We choose $\alpha_{k}$ as follows:

\alpha_{k}=\frac{1}{k^{m}}\sqrt{\frac{\bar{\Gamma}}{L_{K}^{2}L_{\nabla K}^{2}}\frac{4}{p+1}},

for some positive constants $m,p$, where $L_{K}$ is the L-coefficient of $K(v)$'s Frobenius norm on $X$.

Theorem IV.11.

Suppose $M(v)$ commutes with $K(v)$ and $\{v_{k}\}$ is the sequence generated by Algorithm 3 under Assumptions III.1, IV.8, IV.9, IV.10, IV.5, and IV.7. For any $\epsilon>0$, there exists an iteration number $k$ such that $\|\nabla_{X}l(v_{k})\|\leq\epsilon$, where $\nabla_{X}l(v_{k})$ is the negative gradient projected into the tangent cone of $X$.

The proof of Theorem IV.11 is provided in Section VIII-E, which allows the choice of a large, constant step size $\beta_{k}=1$ when the Krylov preconditioner is used.
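Putting the pieces together, the following sketch assembles Algorithm 3 with the Krylov preconditioner from the helpers sketched throughout this section (including proj_X from Section IV-B); the default parameters mirror the experimental choices in Section V but remain illustrative.

```python
def cpfbto(v0, f, nelx, nely, v_bar, alpha0=0.25, beta=1.0, m=0.75, D=20, iters=5000):
    # Commutable Preconditioned FBTO (Algorithm 3) with the Krylov
    # preconditioner of Equation 6, for which beta = 1 is admissible.
    u = np.zeros_like(f)
    v = v0.copy()
    for k in range(1, iters + 1):
        K = assemble_K(v, nelx, nely)
        u = u - beta * krylov_apply(K, K @ u - f, D)  # preconditioned low-level step
        g = grad_l(u, v, nelx, nely)
        v = proj_X(v - alpha0 * k**(-m) * g, v_bar=v_bar)
    return v, u
```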

IV-B Efficient Implementation

We present some practical strategies to further improve the computational efficiency of Algorithms 1, 2, and 3 that are compatible with our theoretical analysis. First, the total volume constraint is almost always active, as noted in [38]. Therefore, it is useful to maintain the total volume constraint when computing the approximate gradient. If we define the approximate gradient as $\tilde{\nabla}l\triangleq-\frac{1}{2}u_{k}^{T}(\partial K/\partial v_{k})u_{k}$ and consider the total volume constraint $\mathbb{1}^{T}v_{k}=\bar{v}$, then a projected gradient can be computed by solving:

\tilde{\nabla}_{P}l\triangleq\underset{\mathbb{1}^{T}\tilde{\nabla}_{P}l=0}{\text{argmin}}\;\|\tilde{\nabla}_{P}l-\tilde{\nabla}l\|^{2},

with the analytic solution being $\tilde{\nabla}_{P}l=\tilde{\nabla}l-\mathbb{1}\mathbb{1}^{T}\tilde{\nabla}l/E$. In our experiments, using $\tilde{\nabla}_{P}l$ in place of $\tilde{\nabla}l$ in the last line of the FBTO algorithms can boost the convergence rate at an early stage of optimization.
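In code, this projection is a one-liner (a sketch; g stands for the approximate gradient $\tilde{\nabla}l$):

```python
def project_sum_zero(g):
    # Analytic solution above: g_P = g - 1 (1^T g) / E, i.e. subtract the mean,
    # so the projected step preserves the total volume 1^T v = v_bar.
    return g - g.mean()
```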

Second, the implementation of the projection operator $\text{Proj}_{X}(\bullet)$ can be costly in high-dimensional cases. However, our convex set $X$ typically takes a special form that consists of mostly bound constraints $\mathbb{1}\geq v\geq\underline{v}\mathbb{1}$ with only one summation constraint $\mathbb{1}^{T}v\leq\bar{v}$. Such a special convex set is known as a simplex, and a special $\mathcal{O}(E\log E)$ algorithm exists for implementing $\text{Proj}_{X}(\bullet)$, as proposed in [48]. Although the original algorithm [48] only considers the equality constraint $\mathbb{1}^{T}v=1$ and single-sided bounds $v\geq 0$, a similar technique can be adapted to handle our two-sided bounds and inequality summation constraint, and we provide its derivation for completeness. We begin by checking whether the equality constraint is active. We can immediately return if $\mathbb{1}^{T}\text{Proj}_{\{\mathbb{1}\geq v\geq\underline{v}\mathbb{1}\}}(v_{k})\leq\bar{v}$. Otherwise, the projection operator amounts to solving the following quadratic program:

\underset{\mathbb{1}\geq v\geq\underline{v}\mathbb{1}}{\text{argmin}}\;\frac{1}{2}\|v-v_{k}\|^{2}\quad\text{s.t.}\;\mathbb{1}^{T}v=\bar{v},

whose Lagrangian $\mathcal{L}$ and first-order optimality conditions are:

\mathcal{L}=\frac{1}{2}\|v-v_{k}\|^{2}+\lambda_{\mathbb{1}}^{T}(\mathbb{1}-v)+\lambda_{\underline{v}}^{T}(v-\underline{v}\mathbb{1})+\lambda(\mathbb{1}^{T}v-\bar{v})
v=v_{k}+\lambda_{\mathbb{1}}-\lambda_{\underline{v}}-\lambda\mathbb{1}\;\land\;\mathbb{0}\leq\lambda_{\mathbb{1}}\perp\mathbb{1}-v\geq\mathbb{0}\;\land\;\mathbb{0}\leq\lambda_{\underline{v}}\perp v-\underline{v}\mathbb{1}\geq\mathbb{0},

where $\lambda_{\mathbb{1}},\lambda_{\underline{v}},\lambda$ are Lagrange multipliers. We can conclude that $v=\text{Clamp}(v_{k}-\lambda\mathbb{1},\underline{v}\mathbb{1},\mathbb{1})$, and the following equality holds due to the constraint $\mathbb{1}^{T}v=\bar{v}$ being active:

\bar{v}=\mathbb{1}^{T}\text{Clamp}(v_{k}-\lambda\mathbb{1},\underline{v}\mathbb{1},\mathbb{1}),\qquad(8)

which is a piecewise linear equation having at most $2E$ pieces. There are $E$ left-nodes of the form $v_{k}-\underline{v}\mathbb{1}$ and $E$ right-nodes of the form $v_{k}-\mathbb{1}$, separating the different linear pieces. The piecewise linear equation can be solved for $\lambda$, and thus $v$, by first sorting the $2E$ nodes at a cost of $\mathcal{O}(E\log E)$ and then examining each piece for the solution. We summarize this process in Algorithm 4, where we maintain the sorted end points of the line segments via running sums.

Algorithm 4 Project $v_{k}$ onto $X$ (SLOPE is the slope of each line segment, RHS is the right-hand side of Equation 8)
1: if $\mathbb{1}^{T}\text{Proj}_{\{\mathbb{1}\geq v\geq\underline{v}\mathbb{1}\}}(v_{k})\leq\bar{v}$ then
2:     Return $\text{Proj}_{\{\mathbb{1}\geq v\geq\underline{v}\mathbb{1}\}}(v_{k})$
3: $\text{ARRAY}\leftarrow\emptyset$
4: for $v_{k}^{e}\in v_{k}$ do
5:     $\text{ARRAY}\leftarrow\text{ARRAY}\cup\{v_{k}^{e}-\underline{v},\;v_{k}^{e}-1\}$
6: Sort ARRAY from low to high ▷ $\mathcal{O}(E\log E)$
7: $\text{SLOPE}\leftarrow 0$, $\text{RHS}\leftarrow E$, $\text{NODES}\leftarrow\emptyset$, $\lambda_{\text{last}}^{*}\leftarrow 0$
8: for $\lambda^{*}\in\text{ARRAY}$ do ▷ $\mathcal{O}(E)$
9:     $\text{RHS}\leftarrow\text{RHS}+\text{SLOPE}\,(\lambda^{*}-\lambda_{\text{last}}^{*})$
10:     if $\lambda^{*}$ is of the form $v_{k}^{e}-1$ then
11:         $\text{SLOPE}\leftarrow\text{SLOPE}-1$
12:     else $\text{SLOPE}\leftarrow\text{SLOPE}+1$ ▷ $\lambda^{*}$ is of the form $v_{k}^{e}-\underline{v}$
13:     $\text{NODES}\leftarrow\text{NODES}\cup\{\langle\lambda^{*},\text{RHS}\rangle\}$, $\lambda_{\text{last}}^{*}\leftarrow\lambda^{*}$
14: for consecutive $\text{NODE},\text{NODE}^{\prime}\in\text{NODES}$ do ▷ $\mathcal{O}(E)$
15:     if segment $\langle\text{NODE},\text{NODE}^{\prime}\rangle$ passes through $\bar{v}$ then
16:         Solve for $\lambda$ and return $\text{Clamp}(v_{k}-\lambda\mathbb{1},\underline{v}\mathbb{1},\mathbb{1})$
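The following is a compact sketch of Algorithm 4. For clarity, it re-evaluates the clamp sum on each candidate segment, giving an $\mathcal{O}(E^{2})$ worst case; the running-sum bookkeeping of Algorithm 4 restores the $\mathcal{O}(E\log E)$ cost.

```python
def proj_X(vk, v_bar=None, v_min=0.1):
    # Euclidean projection onto X = {v_min*1 <= v <= 1, 1^T v <= v_bar}.
    box = np.clip(vk, v_min, 1.0)
    if v_bar is None or box.sum() <= v_bar:
        return box                          # Lines 1-2: volume budget inactive
    def total(lam):
        # 1^T Clamp(vk - lam, v_min, 1): the right-hand side of Equation 8.
        return np.clip(vk - lam, v_min, 1.0).sum()
    # 2E breakpoints separating the linear pieces of total(.), sorted low to high.
    bps = np.sort(np.concatenate([vk - 1.0, vk - v_min]))
    lo = 0
    for hi in range(len(bps)):              # find the segment crossing v_bar
        if total(bps[hi]) <= v_bar:
            break
        lo = hi
    # Linear interpolation on the crossing segment [bps[lo], bps[hi]].
    tlo, thi = total(bps[lo]), total(bps[hi])
    lam = bps[lo] if tlo == thi else \
        bps[lo] + (tlo - v_bar) * (bps[hi] - bps[lo]) / (tlo - thi)
    return np.clip(vk - lam, v_min, 1.0)
```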
Figure 3: A gallery of benchmark problems selected from [11, 12] and 2D versions of problems in [13], which are solved using Algorithm 3.

IV-C Fine-Grained Parallelism

Figure 4: The acceleration rate of a low-end mobile GPU (1280 cores) versus a CPU (6 cores) implementation under different resolutions of discretization. We achieve a maximal speedup of 16 times.

Prior work [12] proposes to accelerate the TO solver on the GPU, but it requires a complex GPU multigrid implementation that is also costly to compute per iteration. In comparison, our Algorithms 1, 2, and 3 can make full use of many-core hardware with slight modifications. In this section, we discuss the necessary modifications for a GPU implementation, assuming the availability of basic parallel scan, reduction, inner-product, and radix sort operations [49]. A bottleneck in Algorithms 1, 2, and 3 is the matrix-vector product $K(v)u$, whose implementation depends on the type of FEA discretization. If the discretization uses a regular pattern, such as the regular grid in Figure 2, then the element-to-node mapping is known and can be hard-coded, so the computation for each node is independent and costs $\mathcal{O}(1)$ if $E$ thread blocks are available. For irregular discretizations, we have to store the sparse matrix explicitly and suggest using a parallel sparse matrix-vector product routine [50]. To implement the Krylov preconditioner, we perform an in-place QR factorization to solve for $c_{i}^{*}$. The in-place QR factorization involves computing inner products $D^{2}$ times and then solving the upper-triangular system. Altogether, the least-squares solve in Equation 6 costs $\mathcal{O}(D^{2}\log E+D^{2})$ when $E$ threads are available. Here we use a serial implementation of the upper-triangular solve, which is not a performance penalty when $D\ll E$. Finally, Algorithm 4 involves one radix sort that costs $\mathcal{O}(\log E)$ and three for-loops; the first and last loops have no data dependency and cost $\mathcal{O}(1)$. The second loop accumulates two variables (SLOPE, RHS), which can be accomplished by a parallel scan taking $\mathcal{O}(\log E)$. In summary, the cost of each iteration of Algorithms 1, 2, and 3 can be reduced to $\mathcal{O}(D^{2}\log E)$ when $\mathcal{O}(E)$ threads are available. We profile the parallel acceleration rate in Figure 4.
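As an illustration of the regular-grid case, the following sketch evaluates $K(v)u$ matrix-free in vectorized NumPy; the gather/apply stage maps to one GPU thread per element, while the scatter-add (emulated here with np.add.at) requires atomics or graph coloring on a GPU.

```python
def matvec_regular_grid(u, v, nelx, nely, eta=3.0, KE=None):
    # Matrix-free K(v)u: gather the 8 DOFs of every element, apply v_e^eta * K_e,
    # and scatter-add the element contributions back to the global vector.
    KE = element_stiffness() if KE is None else KE
    ex, ey = np.meshgrid(np.arange(nelx), np.arange(nely), indexing='ij')
    n1 = (nely + 1) * ex + ey
    n2 = (nely + 1) * (ex + 1) + ey
    edof = np.stack([2*n1, 2*n1+1, 2*n2, 2*n2+1,
                     2*n2+2, 2*n2+3, 2*n1+2, 2*n1+3], axis=-1).reshape(-1, 8)
    fe = (v ** eta)[:, None] * (u[edof] @ KE)  # KE symmetric, so u_e @ KE = KE @ u_e
    out = np.zeros_like(u)
    np.add.at(out, edof, fe)                   # scatter-add (atomics on GPU)
    return out
```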

Figure 5: For the benchmark problem of Figure 1, we run the bilevel optimization (Algorithm 3) for 5000 iterations and the single-level optimization (PGD) for 200 iterations. (a): We profile our low- and high-level error evolution, plotted against the number of iterations. (b): We compare the convergence history of the true energy function value $l(v_{k})$ of both algorithms, plotted against computational time. Our method not only converges faster but also converges to a better solution.
Figure 6: For the benchmark problem in Figure 3e, we compare the convergence history of three choices of $\alpha_{0}$. A larger $\alpha_{0}$ leads to more fluctuations at an early stage, but all three optimizations ultimately converge. When $\alpha_{0}>10^{-2}$, we observe divergence.
Figure 7: We compare the convergence history of Algorithm 1 and Algorithm 3 under different $D$, in terms of the exact energy value plotted against computational time. We need to use an extremely small $\alpha_{0}=0.001$ for Algorithm 1 to converge. For Algorithm 3 with $D\geq 5$, a much larger $\alpha_{0}=0.25$ can be used, although there are some initial fluctuations when $D=5$. Algorithm 3 with $D=20$ exhibits the highest overall convergence speed.
Figure 8: We run our method for 5000 iterations and plot the final exact energy value against $\alpha_{0}$, where $\log(l(v_{k}))>9$ indicates divergence. When $D\geq 15$, our method is convergent under all the step sizes. Therefore, we choose $D=20$ in all examples and only leave $\alpha_{0}$ to be tuned by users.

V Experiments

In this section, we compare our most efficient Algorithm 3 with other methods that provide convergence guarantees. We implement our method in native C++ on a laptop machine with a 2.5 GHz 6-core Intel i7 CPU and a 1.48 GHz Nvidia GTX 1060 mobile GPU with 1280 cores. Our implementation makes full use of the CPU cores via OpenMP and of the GPU via CUDA, and we implement the regular grid FEA discretization. In all the examples, we apply a $7\times 7$ separable Gaussian kernel $\mathbb{C}$ as our material filter (see Remark III.2), which can be implemented efficiently on the GPU as a product of multiple linear kernels. We found that a $7\times 7$ kernel achieves the best balance between material thickness and boundary sharpness. Unless otherwise stated, we choose Algorithm 3 with $\underline{v}=0.1$, $\eta=3$, $\alpha_{k}=\alpha_{0}k^{-3/4}$, $\beta_{k}=1$, and $D=20$ for our Krylov preconditioner in all the experiments, where $\alpha_{0}$ varies across different experiments. We justify these parameter choices in Section IX-E. For the three algorithms in Section IV, we terminate when the relative change of the infill levels over two iterations ($\|v_{k+1}-v_{k}\|_{\infty}$) is smaller than $10^{-4}$ and the absolute error of the linear system solve ($\|K(v_{k})u_{k}-f\|_{\infty}$) is smaller than $10^{-2}$, which can be efficiently checked on the GPU by using a reduction operator that costs $\mathcal{O}(\log E)$. The two conditions measure the optimality of the high- and low-level problems, respectively. The latter is otherwise known as the tracking error, which measures how closely the low-level solution $u_{k}$ follows its optimum as the high-level solution $v_{k}$ changes. We choose 7 benchmark problems from [11, 12, 13] to profile our method, as illustrated in Figure 3 and summarized in Table I. Our iterative cost is at most 40 ms when handling a grid resolution of $192\times 576$, which allows mechanical designers to quickly preview the results.

Benchmark  | Res.    | Frac. | #Iter. | Time (Total) | Time (Iter.) | Mem.
Figure 1   | 128×256 | 0.4   | 49600  | 694          | 0.014        | 8
Figure 3a  | 192×576 | 0.3   | 14600  | 584          | 0.040        | 24
Figure 3b  | 192×576 | 0.3   | 8900   | 356          | 0.040        | 24
Figure 3c  | 192×384 | 0.3   | 37400  | 1159         | 0.031        | 18
Figure 3d  | 160×240 | 0.4   | 20400  | 346          | 0.017        | 8
Figure 3e  | 160×160 | 0.5   | 13300  | 146          | 0.011        | 4
Figure 3f  | 128×384 | 0.5   | 9300   | 195          | 0.021        | 10
Figure 3g  | 128×256 | 0.3   | 19400  | 252          | 0.013        | 8
TABLE I: Statistics of benchmark problems: benchmark index, grid resolution, volume fraction ($\bar{v}$), iterations until convergence, total computational time (s), iterative cost (s), and GPU memory cost (MB).

Convergence Analysis

We highlight the benefit of using approximate matrix inversions by comparing our method with Projected Gradient Descent (PGD). Using a single-level formulation, PGD differs from Algorithm 3 by using exact matrix inversions, i.e., replacing Line 2 with $u_{k+1}\leftarrow K^{-1}(v_{k})f$. We profile the empirical convergence rate of the two formulations in Figure 5 for solving the benchmark problem in Figure 1. The comparisons on other benchmark problems are given in Section VII. Exact, sparse matrix factorization prevents PGD from being implemented on the GPU efficiently, while our method allows fine-grained parallelism and is much more GPU-friendly. For comparison, we implement the other steps of PGD on the GPU while reverting to the CPU for the sparse matrix factorization. As shown in Figure 5a, the high- and low-level errors in our method suffer from some local fluctuations, but they are both converging overall. We further plot the evolution of the exact energy value $l(v_{k})$. After about 70 s of computation, our method converges to a better optimum, while PGD is still far from convergence, as shown in Figure 5b. We further notice that our method can significantly increase the objective function value at an early stage of optimization. This is because first-order methods rely on adding proximal regularization terms to the objective function to form a Lyapunov function (see e.g. [51]). Although our analysis is not based on Lyapunov candidates, we speculate that a similar argument to [51] applies.

Parameter Sensitivity

When setting up the SIMP problem, the user needs to determine the material model $K_{e}$ and the external load profile $f$. We use a linear elastic material parameterized by Young's modulus and Poisson's ratio. We always set Young's modulus to 1 and normalize the external force $f$, since these two parameters only scale the objective function without changing the solutions. When solving the SIMP problem, our algorithm relies on two critical parameters: the initial high-level step size $\alpha_{0}$ and the Krylov subspace size $D$ for preconditioning. We profile the sensitivity with respect to $\alpha_{0}$ in Figure 6 for the benchmark problem in Figure 3e. We find our method convergent over a wide range of $\alpha_{0}$, although an overly large $\alpha_{0}$ can lead to excessive initial fluctuation. In practice, the matrix-vector product is the costliest step and takes up 96% of the computational time, so the iterative cost is almost linear in $D$. As illustrated in Figure 7, Algorithm 1 requires an extremely small $\alpha_{0}$ and more iterations to converge, while Algorithm 3, even with a small $D$, can effectively reduce the low-level error and allow a wide range of choices for $\alpha_{0}$, leading to a faster overall convergence speed. We profile the sensitivity to the subspace size $D$ in Figure 8, and we observe convergent behavior for all $\alpha_{0}$ when $D\geq 15$, so we choose $D=20$ in all examples and only leave $\alpha_{0}$ to be tuned by users.

Figure 9: We compare SQP and FBTO on the benchmark problem (Figure 3c) using a resolution of $64\times 128$. While SQP ultimately converges to a better solution, the two solutions are visually similar. Our method finds an approximate solution much more quickly, allowing designers to preview interactively.

Comparison with SQP

SQP is an off-the-shelf algorithm with a global convergence guarantee [5] and has been used for topology optimization. We conduct comparisons with SQP on the benchmark problem in Figure 3c. We use the commercial SQP solver [52], where we avoid computing the exact Hessian matrix of $l(v)$ by using an L-BFGS approximation, and we avoid factorizing the approximate Hessian matrix by using a small number of conjugate gradient iterations. The comparisons are conducted at a resolution of $64\times 128$, which is one third that of Figure 3c. Using an exact Hessian factorization or a larger resolution in SQP would use up the memory of our desktop machine. As shown in Figure 9, our method converges to an approximate solution much more quickly than SQP, allowing users to preview the results interactively. Such performance of our method is consistent with many other first-order methods, e.g., the Alternating Direction Method of Multipliers (ADMM), which is faster on large-scale problems and converges to a solution of moderate accuracy, while SQP takes longer to converge to a more accurate solution.

Figure 10: We run ALM on the benchmark problem (Figure 3c) at a resolution of $32\times 64$. ALM only exhibits a useful solution after 12600 s (3.5 hr) and converges in 22680 s (6.3 hr).

Comparison with ALM

ALM formulates both $u$ and $v$ as decision variables and treats the FEA as additional hard constraints:

\underset{v\in X,u}{\text{argmin}}\;\frac{1}{2}u^{T}K(v)u\quad\text{s.t.}\;K(v)u=f.

By turning the hard constraints into soft penalty functions via the augmented Lagrangian penalty term, ALM stands out as another solver for TO that allows inexact matrix inversions. Although we are unaware of ALM being used for TO problems, it has been proposed for solving general bilevel programs in [34]. We compare our method with the widely used ALM solver [53]. As its major drawback, ALM needs to square the FEA system matrix $K(v)$, which essentially squares its condition number and slows down the convergence. We have observed this phenomenon in Figure 10, where we run ALM at a resolution of $32\times 64$, which is one-sixth that of Figure 3c. Even at such a coarse resolution, ALM cannot present meaningful results until after 12600 s (3.5 hr), which is much longer than either FBTO or SQP. Our Algorithm 2 suffers from the same problem of squaring the system matrix. Therefore, we suggest using Algorithm 2 only when one has a highly efficient, non-commuting preconditioner such as a geometric multigrid.
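To see where the squaring enters, consider the gradient of the augmented Lagrangian with respect to $u$; the sketch below is for intuition only (lam is the multiplier estimate and mu the penalty weight, both updated by an ALM outer loop that we omit).

```python
def alm_u_gradient(K, u, lam, f, mu):
    # d/du [ 1/2 u^T K u + lam^T (K u - f) + mu/2 ||K u - f||^2 ]
    #   = K u + K lam + mu * K (K u - f).
    # The penalty term applies K twice: the K^2(v) that squares the condition
    # number, as discussed above.
    r = K @ u - f
    return K @ u + K @ lam + mu * (K @ r)
```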

VI Conclusion & Limitation

We present a novel approach to solve a large subclass of TO problems known as the SIMP model. Unlike classical optimization solvers that perform the exact sensitivity analysis during each iteration, we propose to use an inexact sensitivity analysis that is refined over iterations. We show that this method corresponds to a first-order algorithm for a bilevel reformulation of TO problems. Inspired by the recent analytical result [39], we show that this approach has guaranteed convergence to a first-order critical point of the original problem. Our new approach leads to a low iterative cost and allows interactive result previews for mechanical designers. We finally discuss extensions to use preconditioners and fine-grained GPU parallelism. Experiments on several 2D benchmarks from [11, 12, 13] highlight the computational advantage of our method.

An inherent limitation of our method is the requirement to estimate the step size parameters $\alpha_{k},\beta_{k}$. We have shown that $\beta_{k}$ can be fixed and that our results are not sensitive to $\alpha_{0}$, but users still need to give a rough estimate of $\alpha_{0}$, which can affect the practical convergence speed. Further, our first-order solver is based on plain gradient descent steps, for which several acceleration schemes are available, e.g. Nesterov [54] and Anderson [55] accelerations. The analysis of these techniques is left as future work. Finally, when additional constraints are incorporated into the polytopic set $X$, or when non-polytopic constraints are considered, our analysis needs to be adjusted and the efficiency of Algorithm 4 might be lost.

References

  • [1] Jun Wu, Lou Kramer and Rüdiger Westermann “Shape interior modeling and mass property optimization using ray-reps” In Computers & Graphics 58 Elsevier, 2016, pp. 66–72
  • [2] Moritz Bächer, Emily Whiting, Bernd Bickel and Olga Sorkine-Hornung “Spin-It: Optimizing Moment of Inertia for Spinnable Objects” In ACM Trans. Graph. 33.4 New York, NY, USA: Association for Computing Machinery, 2014 DOI: 10.1145/2601097.2601157
  • [3] Jun Wu, Anders Clausen and Ole Sigmund “Minimum compliance topology optimization of shell–infill composites for additive manufacturing” In Computer Methods in Applied Mechanics and Engineering 326, 2017, pp. 358–375 DOI: https://doi.org/10.1016/j.cma.2017.08.018
  • [4] Thanh T. Banh and Dongkyu Lee “Topology Optimization of Multi-Directional Variable Thickness Thin Plate with Multiple Materials” In Struct. Multidiscip. Optim. 59.5 Berlin, Heidelberg: Springer-Verlag, 2019, pp. 1503–1520 DOI: 10.1007/s00158-018-2143-8
  • [5] Susana Rojas-Labanda and Mathias Stolpe “An efficient second-order SQP method for structural topology optimization” In Structural and Multidisciplinary Optimization 53.6 Springer, 2016, pp. 1315–1333
  • [6] Ole Sigmund and Kurt Maute “Topology optimization approaches” In Structural and Multidisciplinary Optimization 48.6 Springer, 2013, pp. 1031–1055
  • [7] Jesús Martínez-Frutos, Pedro J. Martínez-Castejón and David Herrero-Pérez “Efficient topology optimization using GPU computing with multilevel granularity” In Advances in Engineering Software 106, 2017, pp. 47–62 DOI: https://doi.org/10.1016/j.advengsoft.2017.01.009
  • [8] A Mahdavi, R Balaji, M Frecker and Eric M Mockensturm “Topology optimization of 2D continua for minimum compliance using parallel computing” In Structural and Multidisciplinary Optimization 32.2 Springer, 2006, pp. 121–132
  • [9] Niels Aage, Erik Andreassen and Boyan Stefanov Lazarov “Topology optimization using PETSc: An easy-to-use, fully parallel, open source topology optimization framework” In Structural and Multidisciplinary Optimization 51.3 Springer, 2015, pp. 565–572
  • [10] Ole Sigmund “A 99 line topology optimization code written in Matlab” In Structural and multidisciplinary optimization 21.2 Springer, 2001, pp. 120–127
  • [11] S Ivvan Valdez et al. “Topology optimization benchmarks in 2D: Results for minimum compliance and minimum volume in planar stress problems” In Archives of Computational Methods in Engineering 24.4 Springer, 2017, pp. 803–839
  • [12] Jun Wu, Christian Dick and Rüdiger Westermann “A System for High-Resolution Topology Optimization” In IEEE Transactions on Visualization and Computer Graphics 22.3, 2016, pp. 1195–1208 DOI: 10.1109/TVCG.2015.2502588
  • [13] Sandilya Kambampati, Carolina Jauregui, Ken Museth and H Alicia Kim “Large-scale level set topology optimization for elasticity and heat conduction” In Structural and Multidisciplinary Optimization 61.1 Springer, 2020, pp. 19–38
  • [14] Herbert Martins Gomes “Truss optimization with dynamic constraints using a particle swarm algorithm” In Expert Systems with Applications 38.1 Elsevier, 2011, pp. 957–968
  • [15] Thomas Buhl, Claus BW Pedersen and Ole Sigmund “Stiffness design of geometrically nonlinear structures using topology optimization” In Structural and Multidisciplinary Optimization 19.2 Springer, 2000, pp. 93–104
  • [16] Tyler E Bruns and Daniel A Tortorelli “Topology optimization of non-linear elastic structures and compliant mechanisms” In Computer methods in applied mechanics and engineering 190.26-27 Elsevier, 2001, pp. 3443–3459
  • [17] Kurt Maute, Stefan Schwarz and Ekkehard Ramm “Adaptive topology optimization of elastoplastic structures” In Structural optimization 15.2 Springer, 1998, pp. 81–91
  • [18] Benoît Colson, Patrice Marcotte and Gilles Savard “Bilevel programming: A survey” In 4or 3.2 Springer, 2005, pp. 87–107
  • [19] Krister Svanberg “Optimization of geometry in truss design” In Computer Methods in Applied Mechanics and Engineering 28.1, 1981, pp. 63–80 DOI: https://doi.org/10.1016/0045-7825(81)90027-X
  • [20] Erik Holmberg, Bo Torstenfelt and Anders Klarbring “Stress constrained topology optimization” In Structural and Multidisciplinary Optimization 48.1 Springer, 2013, pp. 33–47
  • [21] Abderrahmane Habbal, Joakim Petersson and Mikael Thellner “Multidisciplinary topology optimization solved as a Nash game” In International Journal for Numerical Methods in Engineering 61.7 Wiley Online Library, 2004, pp. 949–963
  • [22] Erik Holmberg, Carl-Johan Thore and Anders Klarbring “Game theory approach to robust topology optimization with uncertain loading” In Structural and Multidisciplinary Optimization 55.4 Springer, 2017, pp. 1383–1397
  • [23] Peter D Dunning, H Alicia Kim and Glen Mullineux “Introducing loading uncertainty in topology optimization” In AIAA journal 49.4, 2011, pp. 760–768
  • [24] Martin P Bendsøe and Ole Sigmund “Optimization of structural topology, shape, and material” Springer, 1995
  • [25] Susana Rojas-Labanda and Mathias Stolpe “Benchmarking optimization solvers for structural topology optimization” In Structural and Multidisciplinary Optimization 52.3 Springer, 2015, pp. 527–547
  • [26] Gustavo Assis Silva and André Teófilo Beck “Reliability-based topology optimization of continuum structures subject to local stress constraints” In Structural and Multidisciplinary Optimization 57.6 Springer, 2018, pp. 2339–2355
  • [27] Krister Svanberg “The method of moving asymptotes—a new method for structural optimization” In International Journal for Numerical Methods in Engineering 24.2, 1987, pp. 359–373 DOI: https://doi.org/10.1002/nme.1620240207
  • [28] H Nguyen-Xuan “A polytree-based adaptive polygonal finite element method for topology optimization” In International Journal for Numerical Methods in Engineering 110.10 Wiley Online Library, 2017, pp. 972–1000
  • [29] Oded Amir, Niels Aage and Boyan S Lazarov “On multigrid-CG for efficient topology optimization” In Structural and Multidisciplinary Optimization 49.5 Springer, 2014, pp. 815–829
  • [30] Gao Tang, Weidong Sun and Kris Hauser “Enhancing bilevel optimization for uav time-optimal trajectory using a duality gap approach” In 2020 IEEE International Conference on Robotics and Automation (ICRA), 2020, pp. 2515–2521 IEEE
  • [31] Benoit Landry, Zachary Manchester and Marco Pavone “A Differentiable Augmented Lagrangian Method for Bilevel Nonlinear Optimization” In Proceedings of Robotics: Science and Systems, 2019 DOI: 10.15607/RSS.2019.XV.012
  • [32] Marc Toussaint “Logic-geometric programming: An optimization-based approach to combined task and motion planning” In Twenty-Fourth International Joint Conference on Artificial Intelligence, 2015
  • [33] Jonathan F Bard “Some properties of the bilevel programming problem” In Journal of optimization theory and applications 68.2 Springer, 1991, pp. 371–378
  • [34] Akshay Mehra and Jihun Hamm “Penalty method for inversion-free deep bilevel optimization” In Asian Conference on Machine Learning, 2021, pp. 347–362 PMLR
  • [35] Michal Kočvara “Topology optimization with displacement constraints: a bilevel programming approach” In Structural optimization 14.4 Springer, 1997, pp. 256–263
  • [36] Jonas Zehnder, Yue Li, Stelian Coros and Bernhard Thomaszewski “NTopo: Mesh-free Topology Optimization using Implicit Neural Representations” In Advances in Neural Information Processing Systems 34 Curran Associates, Inc., 2021
  • [37] James K Guest, Jean H Prévost and Ted Belytschko “Achieving minimum length scale in topology optimization using nodal design variables and projection functions” In International journal for numerical methods in engineering 61.2 Wiley Online Library, 2004, pp. 238–254
  • [38] Krister Svanberg “MMA and GCMMA, versions September 2007” In Optimization and Systems Theory 104 KTH Stockholm, Sweden, 2007
  • [39] Saeed Ghadimi and Mengdi Wang “Approximation methods for bilevel programming” In arXiv preprint arXiv:1802.02246, 2018
  • [40] Mingyi Hong, Hoi-To Wai, Zhaoran Wang and Zhuoran Yang “A two-timescale framework for bilevel optimization: Complexity analysis and application to actor-critic” In arXiv preprint arXiv:2007.05170, 2020
  • [41] Prashant Khanduri et al. “A Near-Optimal Algorithm for Stochastic Bilevel Optimization via Double-Momentum” In Advances in neural information processing systems 34, 2021
  • [42] Paul H Calamai and Jorge J Moré “Projected gradient methods for linearly constrained problems” In Mathematical programming 39.1 Springer, 1987, pp. 93–116
  • [43] David G Luenberger and Yinyu Ye “Linear and nonlinear programming” Springer, 1984
  • [44] Michele Benzi “Preconditioning techniques for large linear systems: a survey” In Journal of computational Physics 182.2 Elsevier, 2002, pp. 418–477
  • [45] Haixiang Liu et al. “Narrow-Band Topology Optimization on a Sparsely Populated Grid” In ACM Trans. Graph. 37.6 New York, NY, USA: Association for Computing Machinery, 2018 DOI: 10.1145/3272127.3275012
  • [46] Thomas Borrvall and Joakim Petersson “Large-scale topology optimization in 3D using parallel computing” In Computer methods in applied mechanics and engineering 190.46-47 Elsevier, 2001, pp. 6201–6229
  • [47] Youcef Saad and Martin H Schultz “GMRES: A generalized minimal residual algorithm for solving nonsymmetric linear systems” In SIAM Journal on scientific and statistical computing 7.3 SIAM, 1986, pp. 856–869
  • [48] Laurent Condat “Fast projection onto the simplex and the l1-ball” In Mathematical Programming 158.1 Springer, 2016, pp. 575–585
  • [49] Jason Sanders and Edward Kandrot “CUDA by example: an introduction to general-purpose GPU programming” Addison-Wesley Professional, 2010
  • [50] Nathan Bell and Michael Garland “Efficient sparse matrix-vector multiplication on CUDA”, 2008
  • [51] Tianyi Chen, Yuejiao Sun and Wotao Yin “Closing the Gap: Tighter Analysis of Alternating Stochastic Gradient Methods for Bilevel Problems” In Advances in Neural Information Processing Systems 34, 2021
  • [52] Richard A Waltz and Jorge Nocedal “KNITRO 2.0 User’s Manual” Ziena Optimization, Inc., available online at http://www.ziena.com, 2004, pp. 33–34
  • [53] Robert J Vanderbei “LOQO user’s manual—version 3.10” In Optimization methods and software 11.1-4 Taylor & Francis, 1999, pp. 485–514
  • [54] Amir Beck and Marc Teboulle “A fast iterative shrinkage-thresholding algorithm for linear inverse problems” In SIAM journal on imaging sciences 2.1 SIAM, 2009, pp. 183–202
  • [55] Junzi Zhang, Brendan O’Donoghue and Stephen Boyd “Globally convergent type-I Anderson acceleration for nonsmooth fixed-point iterations” In SIAM Journal on Optimization 30.4 SIAM, 2020, pp. 3170–3197
TABLE II: Table of Symbols
Variable: Definition
$x\in\Omega$: a point in the material domain
$u$: displacement field
$P$: internal potential energy
$k_e$: stiffness tensor field
$v,v_e$: infill level, infill level of the $e$th element
$u_k,v_k,\delta_k$: solutions at the $k$th iteration
$\underline{v},\bar{v}$: infill level bounds
$\Omega_e$: sub-domain of the $e$th element
$E,N$: number of elements, nodes
$K,K_e$: stiffness matrix, stiffness of the $e$th element
$f$: external force field
$P_f$: total energy
$e_i$: unit vector at the $i$th element
$u_f$: displacement caused by force $f$
$l$: induced internal potential energy
$\eta$: power law coefficient
$\rho(\bullet)$: eigenvalue function
$\underline{\rho},\bar{\rho}$: eigenvalue bounds
$\mathbb{1},\mathbb{0}$: all-one, all-zero vectors
$\mathcal{A}_e$: activation function of the $e$th element
$\mathbb{S}$: symmetry mapping matrix
$v_s$: infill levels of the left-half material block
$\#$: number of sub-components
$\bar{v}^i$: total infill level of the $i$th component
$\mathbb{C}$: material filter operator
$\alpha_k,\beta_k$: high-level and low-level step sizes
$\Delta_k,\Delta_k^v$: low-level and high-level error metrics
$\Xi_k$: low-level error metric used to analyze Algorithm 3
$U$: upper bound of $\sqrt{\Delta_1}$
$\Gamma_k,\Theta_k$: coefficients of the low-level error
$\bar{\Gamma},\bar{\Theta}$: upper bounds of $\Gamma_k,\Theta_k$
$p,m,n$: algorithmic constants
$L_u,L_K,L_{\nabla K},L_{\nabla l}$: L-constants of $u_f,K,\nabla K,\nabla l$
$C,C_v$: coefficients of the reduction rate of $\Delta_k^v$
$\mathcal{T}_X$: tangent cone of $X$
$d_k,\tau_k$: feasible direction in $\mathcal{T}_X$ and step size
$\underline{l}$: lower bound of $l$
$U_f$: uniform upper bound of $\|u_f\|$
$\nabla_X$: projected gradient into the normal cone
$\epsilon$: error tolerance of the gradient norm
$\text{Proj}_X(\bullet)$: projection operator onto $X$
$\kappa$: condition number
$M$: preconditioner
$\underline{\rho}_M,\bar{\rho}_M$: eigenvalue bounds of the preconditioner
$c_i$: coefficients of Krylov vectors
$D$: dimension of the Krylov subspace
$\tilde{\nabla}l$: approximate gradient
$\tilde{\nabla}_Pl$: mean-projected approximate gradient
$\lambda_{\mathbb{1}},\lambda_{\underline{v}}$: Lagrange multipliers for the projection problem
Figure 11: We summarize the comparative convergence history of PGD and FBTO for all the benchmarks (panels (a)-(g), corresponding to Figure 3a-g). PGD and FBTO are implemented on the GPU, with the matrix factorization of PGD performed on the CPU. All other parameters are the same.

VII Additional Experiments

We compare the performance of FBTO and PGD on all the benchmarks in Figure 3; the results are summarized in Figure 11. Note that PGD uses an exact low-level solver, so it requires fewer iterations than FBTO, while FBTO uses more iterations at a much lower per-iteration cost. For fairness, we compare them based on computational time. PGD outperforms or performs similarly to FBTO on two of the problems (Figure 11d and 11f), where PGD converges rapidly in $5$-$10$ iterations while FBTO requires thousands of iterations. In all other cases, FBTO converges faster to an approximate solution.

VIII Convergence Analysis of Algorithm 1

We provide the proof of Theorem IV.6. We start from some immediate observations on the low- and high-level objective functions. Then, we prove the rule of error propagation for the low-level optimality, i.e., the low-level difference between $u_k$ and its optimal solution $u_f(v_k)$. Next, we choose parameters that let this difference diminish as the number of iterations increases. Finally, we focus on the high-level objective function and show that the difference between $v_k$ and a local optimum of $l(v)$ also diminishes. Our analysis resembles recent analyses of two-timescale bilevel optimization, but we use problem-specific treatment to handle our novel gradient estimation for the high-level problem without matrix inversion.
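To make the objects of the analysis concrete, the following is a minimal Python sketch of the alternating first-order iteration on a toy problem: the low-level step is the gradient step $u_{k+1}=u_k-\beta_k(K(v_k)u_k-f)$ that is substituted in Equation 10 below, and the high-level step is the projected step with the approximate gradient $\tilde{\nabla}l=-\frac{1}{2}u_k^T\frac{\partial{K}}{\partial{v_k}}u_k$ (Table II), whose step length is bounded in Equation 11. The spring-chain FEA, the SIMP-style power law $k_e=v_e^\eta$, and all numerical constants are illustrative assumptions for this sketch, not our benchmark setup:

import numpy as np

# Toy "spring chain": E springs in series, node 0 grounded, unit tip load.
E, eta = 16, 3.0
v_lo, v_hi, budget = 0.01, 1.0, 0.5 * E      # box bounds and total infill level

def assemble_K(v):
    k = v ** eta                              # SIMP-style power law k_e = v_e^eta
    K = np.zeros((E, E))
    for e in range(E):                        # spring e couples nodes e-1 and e
        K[e, e] += k[e]
        if e + 1 < E:
            K[e, e] += k[e + 1]
            K[e, e + 1] = K[e + 1, e] = -k[e + 1]
    return K

def proj_X(v):
    # Proj_X onto {v_lo <= v <= v_hi, sum(v) = budget}: bisect the multiplier
    # of the sum constraint (lambda_1 in Table II); the box part is a clamp.
    lo, hi = -v_hi, v_hi
    for _ in range(60):
        lam = 0.5 * (lo + hi)
        if np.clip(v - lam, v_lo, v_hi).sum() > budget:
            lo = lam
        else:
            hi = lam
    return np.clip(v - 0.5 * (lo + hi), v_lo, v_hi)

f = np.zeros(E); f[-1] = 1.0                  # external load at the tip
v, u = np.full(E, budget / E), np.zeros(E)
beta = 0.9 / np.linalg.norm(assemble_K(np.full(E, v_hi)), 2)   # beta_k < 2/rho_bar
for k in range(1, 5001):
    K = assemble_K(v)
    u = u - beta * (K @ u - f)                          # inexact low-level step
    stretch = np.diff(u, prepend=0.0)                   # per-spring elongation
    grad = -0.5 * eta * v ** (eta - 1.0) * stretch**2   # -1/2 u^T dK/dv u
    v = proj_X(v - grad / k)                            # alpha_k = O(k^-m), m = 1

The single inexpensive $u$-update per iteration is what replaces the exact solve of $K(v_k)u=f$ that an exact low-level method such as PGD would perform.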

VIII-A Low-Level Error Propagation

If Assumption III.1 holds, then the low-level objective function is $\underline{\rho}$-strongly convex and $\bar{\rho}$-smooth in $u$ for any $v$ (i.e., $\bar{\rho}$ is the L-constant of $\nabla_uP_f$), so that:

P_f(u_{k+1},v_k) \leq P_f(u_k,v_k) + \left<u_{k+1}-u_k,\nabla_uP_f(u_k,v_k)\right> + \frac{\bar{\rho}}{2}\|u_{k+1}-u_k\|^2 \leq P_f(u_k,v_k) + \left[\frac{\bar{\rho}}{2}-\frac{1}{\beta_k}\right]\|u_{k+1}-u_k\|^2.   (9)

Similarly, we can immediately estimate the approximation error of the low-level problem due to an update on $u$:

Lemma VIII.1.

Under Assumption III.1, the following relationship holds for all $k\geq 1$:

\Delta_{k+1} \leq \frac{p+1}{p}(1-2\beta_k\underline{\rho}+\beta_k^2\bar{\rho}^2)\Delta_k + L_u^2L_{\nabla K}^2\alpha_k^2\frac{p+1}{4}\|u_k\|^4.
Proof.

The following result holds by expanding the square, substituting the update $u_{k+1}=u_k-\beta_k(K(v_k)u_k-f)$, and using the bounded spectrum of $K(v)$ (Equation 2):

\|u_{k+1}-u_f(v_k)\|^2
= \|u_{k+1}-u_k\|^2 + \|u_k-u_f(v_k)\|^2 + 2\left<u_{k+1}-u_k,u_k-u_f(v_k)\right>
= \|u_{k+1}-u_k\|^2 + \|u_k-u_f(v_k)\|^2 - 2\beta_k\left<K(v_k)u_k-f,u_k-u_f(v_k)\right>
\leq \|u_{k+1}-u_k\|^2 + (1-2\beta_k\underline{\rho})\|u_k-u_f(v_k)\|^2
\leq \beta_k^2\bar{\rho}^2\|u_k-u_f(v_k)\|^2 + (1-2\beta_k\underline{\rho})\|u_k-u_f(v_k)\|^2.   (10)

The optimal solution to the low-level problem is $u_f(v)$, a smooth function defined on a compact domain, so every derivative of $u_f(v)$ is bounded and L-continuous. We thus have the following estimate of the change to $u_f$ due to an update on $v$:

\|u_{k+1}-u_f(v_k)\|\|u_f(v_k)-u_f(v_{k+1})\|
\leq \frac{1}{2p}\|u_{k+1}-u_f(v_k)\|^2 + \frac{p}{2}\|u_f(v_k)-u_f(v_{k+1})\|^2
\leq \frac{1}{2p}\|u_{k+1}-u_f(v_k)\|^2 + \frac{pL_u^2}{2}\|v_k-v_{k+1}\|^2
\leq \frac{1}{2p}\|u_{k+1}-u_f(v_k)\|^2 + \frac{pL_u^2\alpha_k^2}{8}\|u_k^T\frac{\partial{K}}{\partial{v_k}}u_k\|^2,   (11)

where we have used Young’s inequality and the contractive property of the $\text{Proj}_X(\bullet)$ operator. Here we denote by $L_u$ the L-constant of $u_f$. By the triangle inequality, we further have:

\|u_{k+1}-u_f(v_{k+1})\|^2
\leq \|u_{k+1}-u_f(v_k)\|^2 + \|u_f(v_k)-u_f(v_{k+1})\|^2 + 2\|u_{k+1}-u_f(v_k)\|\|u_f(v_k)-u_f(v_{k+1})\|
\leq \frac{p+1}{p}\|u_{k+1}-u_f(v_k)\|^2 + \|u_f(v_k)-u_f(v_{k+1})\|^2 + \frac{pL_u^2\alpha_k^2}{4}\|u_k^T\frac{\partial{K}}{\partial{v_k}}u_k\|^2
\leq \frac{p+1}{p}\|u_{k+1}-u_f(v_k)\|^2 + L_u^2\|v_k-v_{k+1}\|^2 + \frac{pL_u^2\alpha_k^2}{4}\|u_k^T\frac{\partial{K}}{\partial{v_k}}u_k\|^2
\leq \frac{p+1}{p}\|u_{k+1}-u_f(v_k)\|^2 + L_u^2\alpha_k^2\frac{p+1}{4}\|u_k^T\frac{\partial{K}}{\partial{v_k}}u_k\|^2
\leq \frac{p+1}{p}(1-2\beta_k\underline{\rho}+\beta_k^2\bar{\rho}^2)\|u_k-u_f(v_k)\|^2 + L_u^2\alpha_k^2\frac{p+1}{4}\|u_k^T\frac{\partial{K}}{\partial{v_k}}u_k\|^2
\leq \frac{p+1}{p}(1-2\beta_k\underline{\rho}+\beta_k^2\bar{\rho}^2)\|u_k-u_f(v_k)\|^2 + L_u^2L_{\nabla K}^2\alpha_k^2\frac{p+1}{4}\|u_k\|^4,   (12)

where we have used Equation 10 and Equation 11. ∎

The result of Lemma VIII.1 is a recurrence relation on the low-level optimality error $\Delta_k$, which will be used to prove low-level convergence via recursive expansion.

VIII-B Low-Level Convergence

We use the following shorthand notation for the result in Lemma VIII.1:

\Delta_{k+1} \leq \Theta_k\Delta_k + \Gamma_k\|u_k\|^4
\Theta_k \triangleq \frac{p+1}{p}(1-2\beta_k\underline{\rho}+\beta_k^2\bar{\rho}^2) \quad \Gamma_k \triangleq L_u^2L_{\nabla K}^2\alpha_k^2\frac{p+1}{4}.   (13)

By taking Assumption IV.2 and a direct calculation, we can ensure that $\Theta_k\leq\bar{\Theta}<1$ (i.e., the first term is contractive). To bound the growth of the second term above, we show by induction that both $\Delta_k$ and $u_k$ can be uniformly bounded for all $k$ via a sufficiently small, constant $\Gamma_k\leq\bar{\Gamma}$.

Lemma VIII.2.

Taking Assumptions III.1, IV.2, and IV.3, we have $\Delta_k\leq U^2$ and $\|u_k\|\leq U_f+U$ for all $k\geq 1$, where $U_f$ is the uniform upper bound for $\|u_f(v)\|$.

Proof.

First, since $v\in X$ and $X$ is compact, we have $\|u_f(v)\|\leq U_f<\infty$. Next, we prove $\Delta_k\leq U^2$ by induction. We already have $\Delta_1\leq U^2$. Now suppose $\Delta_k\leq U^2$; then Equation 13 and our assumption on $\bar{\Gamma}$ immediately lead to:

\Delta_{k+1} \leq \bar{\Theta}U^2 + \bar{\Gamma}(\|u_k-u_f(v_k)\|+\|u_f(v_k)\|)^4 \leq \bar{\Theta}U^2 + \bar{\Gamma}(U+U_f)^4 \leq U^2.

Finally, we have $\|u_k\|\leq\|u_k-u_f(v_k)\|+\|u_f(v_k)\|\leq U+U_f$ and our lemma follows. ∎

The shrinking coefficient $\bar{\Theta}$ and the uniform boundedness of $\Delta_k,u_k$ allow us to establish low-level convergence with the appropriate choice of $\alpha_k=\mathcal{O}(k^{-m})$ with $m\geq 1$.

Theorem VIII.3.

Taking Assumptions III.1, IV.2, IV.3, and IV.4, we can upper bound $\Delta_{k+1}$ as:

\Delta_{k+1} \leq \bar{\Theta}^{k-1}\Delta_1 + (U+U_f)^4\Gamma_1\sum_{i=0}^{k-1}\frac{\bar{\Theta}^i}{(k-i)^{2m}} = \mathcal{O}(k^{1-2m}).
Proof.

Recursively expanding Equation 13, we have:

\Delta_{k+1} \leq \bar{\Theta}\Delta_k + \Gamma_k\|u_k\|^4
\leq \bar{\Theta}(\bar{\Theta}\Delta_{k-1}+\Gamma_{k-1}\|u_{k-1}\|^4) + \Gamma_k\|u_k\|^4
= \bar{\Theta}^2\Delta_{k-1} + \bar{\Theta}\Gamma_{k-1}\|u_{k-1}\|^4 + \Gamma_k\|u_k\|^4
\leq \bar{\Theta}^2(\bar{\Theta}\Delta_{k-2}+\Gamma_{k-2}\|u_{k-2}\|^4) + \bar{\Theta}\Gamma_{k-1}\|u_{k-1}\|^4 + \Gamma_k\|u_k\|^4
\cdots
\leq \bar{\Theta}^k\Delta_1 + (U+U_f)^4\sum_{i=0}^{k-1}\bar{\Theta}^i\Gamma_{k-i}
\leq \bar{\Theta}^k\Delta_1 + (U+U_f)^4\Gamma_1\sum_{i=0}^{k-1}\frac{\bar{\Theta}^i}{(k-i)^{2m}}
\leq \bar{\Theta}^k\Delta_1 + (U+U_f)^4\Gamma_1\left[\sum_{i=0}^{\lceil\frac{k-1}{2}\rceil}\frac{\bar{\Theta}^i}{(k-i)^{2m}} + \sum_{i=\lceil\frac{k+1}{2}\rceil}^{k-1}\frac{\bar{\Theta}^i}{(k-i)^{2m}}\right]
\leq \bar{\Theta}^k\Delta_1 + (U+U_f)^4\Gamma_1\left[\frac{\lceil\frac{k+1}{2}\rceil}{(k-\lceil\frac{k-1}{2}\rceil)^{2m}} + \frac{\bar{\Theta}^{\lceil\frac{k}{2}\rceil}}{1-\bar{\Theta}}\right] = \mathcal{O}(k^{1-2m}),

where we have used our choice of αk\alpha_{k} and Lemma VIII.2. ∎
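To see the claimed rate concretely, the recursion of Equation 13 can be iterated numerically. A minimal Python sketch follows; the constants $\bar{\Theta}$, $\Gamma_1$, and $m$ are illustrative placeholders, with $\|u_k\|^4$ absorbed into $\Gamma_1$ via Lemma VIII.2:

import numpy as np

# Iterate Delta_{k+1} <= Theta_bar * Delta_k + Gamma_1 * k^(-2m), i.e.,
# Equation 13 with alpha_k = O(k^-m); all constants here are illustrative.
Theta_bar, Gamma_1, m = 0.9, 1.0, 1.5
K_max = 100000
Delta = np.zeros(K_max + 1)
Delta[1] = 1.0                                # Delta_1 <= U^2
for k in range(1, K_max):
    Delta[k + 1] = Theta_bar * Delta[k] + Gamma_1 * float(k) ** (-2.0 * m)

# Delta_{k+1} / k^(1-2m) remains bounded, matching the O(k^(1-2m)) rate.
for k in [10, 100, 1000, 10000, K_max - 1]:
    print(k, Delta[k + 1] / float(k) ** (1.0 - 2.0 * m))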

Theorem VIII.3 is pivotal: it allows us to choose $m$ to tune the convergence speed of the low-level problem, which is then used to establish the convergence of the high-level problem.

VIII-C High-Level Error Propagation

The high-level error propagation is similar to the low-level analysis, owing to the L-continuity of all derivatives of $l(v)$ on the compact domain $X$. The following result reveals the rule of error propagation over a single iteration:

Lemma VIII.4.

Under Assumptions III.1, IV.2, IV.3, and IV.4, the following high-level error propagation rule holds for all $k\geq 1$ and $q>0$:

l(v_{k+1}) - l(v_k) \leq -\frac{1}{\alpha_k}\|v_{k+1}-v_k\|^2 + \frac{L_{\nabla K}^2\alpha_k^2(L_{\nabla l}q+1)}{8q}(U+U_f)^4 + \frac{L_{\nabla K}^2q}{8}\Delta_k(U+2U_f)^2.
Proof.

By the smoothness of $l(v)$, the compactness of $X$, and the obtuse angle criterion, we have:

l(v_{k+1}) - l(v_k) \leq \left<v_{k+1}-v_k,\nabla l(v_k)\right> + \frac{L_{\nabla l}}{2}\|v_{k+1}-v_k\|^2
\leq \left<v_{k+1}-v_k,-\frac{1}{2}u_f^T(v_k)\frac{\partial{K}}{\partial{v_k}}u_f(v_k)\right> + \frac{L_{\nabla l}L_{\nabla K}^2\alpha_k^2}{8}\|u_k\|^4
= \left<v_{k+1}-v_k,-\frac{1}{2}u_k^T\frac{\partial{K}}{\partial{v_k}}u_k\right> + \frac{L_{\nabla l}L_{\nabla K}^2\alpha_k^2}{8}\|u_k\|^4 + \left<v_{k+1}-v_k,\frac{1}{2}u_k^T\frac{\partial{K}}{\partial{v_k}}u_k-\frac{1}{2}u_f^T(v_k)\frac{\partial{K}}{\partial{v_k}}u_f(v_k)\right>
\leq -\frac{1}{\alpha_k}\|v_{k+1}-v_k\|^2 + \frac{L_{\nabla l}L_{\nabla K}^2\alpha_k^2}{8}\|u_k\|^4 + \left<v_{k+1}-v_k,\frac{1}{2}u_k^T\frac{\partial{K}}{\partial{v_k}}u_k-\frac{1}{2}u_f^T(v_k)\frac{\partial{K}}{\partial{v_k}}u_f(v_k)\right>,   (14)

where $L_{\nabla l}$ is the L-constant of $\nabla l$ on $X$. The last term above can be bounded by applying Young’s inequality:

\left<v_{k+1}-v_k,\frac{1}{2}u_k^T\frac{\partial{K}}{\partial{v_k}}u_k-\frac{1}{2}u_f^T(v_k)\frac{\partial{K}}{\partial{v_k}}u_f(v_k)\right>
\leq \frac{1}{2q}\|v_{k+1}-v_k\|^2 + \frac{q}{2}\|\frac{1}{2}u_k^T\frac{\partial{K}}{\partial{v_k}}u_k-\frac{1}{2}u_f^T(v_k)\frac{\partial{K}}{\partial{v_k}}u_f(v_k)\|^2
\leq \frac{L_{\nabla K}^2\alpha_k^2}{8q}\|u_k\|^4 + \frac{q}{2}\|\frac{1}{2}u_k^T\frac{\partial{K}}{\partial{v_k}}u_k-\frac{1}{2}u_f^T(v_k)\frac{\partial{K}}{\partial{v_k}}u_f(v_k)\|^2
= \frac{L_{\nabla K}^2\alpha_k^2}{8q}\|u_k\|^4 + \frac{q}{2}\|\frac{1}{2}(u_k-u_f(v_k))^T\frac{\partial{K}}{\partial{v_k}}(u_k+u_f(v_k))\|^2
\leq \frac{L_{\nabla K}^2\alpha_k^2}{8q}\|u_k\|^4 + \frac{L_{\nabla K}^2q}{8}\Delta_k\|u_k+u_f(v_k)\|^2.   (15)

The lemma follows by combining Equation 14, Equation 15, and Lemma VIII.2. ∎

VIII-D High-Level Convergence

We first show that $\Delta_k^v$ is diminishing via the following lemma:

Lemma VIII.5.

Under Assumptions III.1, IV.2, IV.3, and IV.4, we have the following bound on the accumulated high-level error $\Delta_k^v$:

\frac{1}{2}\sum_{i=1}^k\alpha_i\Delta_i^v \leq l(v_1) - \underline{l} + \frac{L_{\nabla K}^2}{8}(U+U_f)^4\sum_{i=1}^k\alpha_i^{2-n}(L_{\nabla l}\alpha_i^n+1) + \frac{L_{\nabla K}^2}{8}(U+2U_f)^2\sum_{i=1}^k\alpha_i^n\Delta_i + \frac{L_{\nabla K}^2}{4}(U+2U_f)^2\sum_{i=1}^k\alpha_i\Delta_i,

where $\underline{l}$ is the lower bound of $l$ on $X$.

Proof.

By choosing $q=\alpha_k^n$ for some constant $n$ and summing up the recursive rule of Lemma VIII.4, we have the following result:

l(v_{k+1}) - l(v_1) \leq -\sum_{i=1}^k\frac{1}{\alpha_i}\|v_{i+1}-v_i\|^2 + \frac{L_{\nabla K}^2}{8}(U+U_f)^4\sum_{i=1}^k\alpha_i^{2-n}(L_{\nabla l}\alpha_i^n+1) + \frac{L_{\nabla K}^2}{8}(U+2U_f)^2\sum_{i=1}^k\alpha_i^n\Delta_i.   (16)

Equation 16 immediately leads to the $\mathcal{O}(k^{-1})$ convergence of $\|v_{i+1}-v_i\|^2/\alpha_i$. However, the update from $v_i$ to $v_{i+1}$ uses the approximate gradient. To show the convergence of $l(v)$, we need to consider an update using the exact gradient. This can be achieved by combining Equation 16 and the gradient error estimation in Equation 15:

\Delta_k^v \leq \frac{1}{\alpha_k^2}\|v_{k+1}-v_k\|^2 + \frac{1}{\alpha_k^2}\|\frac{\alpha_k}{2}u_f(v_k)^T\frac{\partial{K}}{\partial{v_k}}u_f(v_k)-\frac{\alpha_k}{2}u_k^T\frac{\partial{K}}{\partial{v_k}}u_k\|^2 + \frac{2}{\alpha_k^2}\|v_{k+1}-v_k\|\|\frac{\alpha_k}{2}u_f(v_k)^T\frac{\partial{K}}{\partial{v_k}}u_f(v_k)-\frac{\alpha_k}{2}u_k^T\frac{\partial{K}}{\partial{v_k}}u_k\|
\leq \frac{2}{\alpha_k^2}\|v_{k+1}-v_k\|^2 + \frac{2}{\alpha_k^2}\|\frac{\alpha_k}{2}u_f(v_k)^T\frac{\partial{K}}{\partial{v_k}}u_f(v_k)-\frac{\alpha_k}{2}u_k^T\frac{\partial{K}}{\partial{v_k}}u_k\|^2
\leq \frac{2}{\alpha_k^2}\|v_{k+1}-v_k\|^2 + \frac{L_{\nabla K}^2}{2}\Delta_k(U+2U_f)^2.   (17)

We can prove our lemma by combining Lemma VIII.4, Equation 16, and Equation 17. ∎

The last three terms on the right-hand side of Lemma VIII.5 are power series whose summands scale as $\mathcal{O}(k^{-2m+mn})$, $\mathcal{O}(k^{1-2m-mn})$, and $\mathcal{O}(k^{1-3m})$, respectively. Therefore, for the three summations to be upper bounded for arbitrary $k$, we need the first condition in Assumption IV.5. The following corollary is immediate:

Corollary VIII.6.

Under Assumptions III.1, IV.2, IV.3, IV.4, and IV.5, we have $\min_{i=1,\cdots,k}\alpha_i\Delta_i^v\leq Ck^{-1}$ and there are infinitely many $k$ such that $\Delta_k^v\leq C_vk^{m-1}$ for some constants $C,C_v$ independent of $k$.

Proof.

By Lemma VIII.5 and Assumption IV.5, we have $\sum_{i=1}^k\alpha_i\Delta_i^v\leq C$ for some constant $C$. Now suppose only finitely many $k$ satisfy $\Delta_k^v\leq C_vk^{m-1}$; then, for sufficiently large $k\geq K_0$, we have:

\sum_{i=K_0}^k\alpha_i\Delta_i^v \geq \sqrt{\frac{\bar{\Gamma}}{L_u^2L_{\nabla K}^2}\frac{4}{p+1}}\,C_v\sum_{i=K_0}^k\frac{1}{i},

which is divergent, leading to a contradiction. ∎

Our remaining argument is similar to the standard convergence proof of PGD [42], with minor modifications to account for our approximate gradient:

Proof of Theorem IV.6.

We define $\tilde{v}_{k+1}$ as:

\tilde{v}_{k+1} \triangleq \text{Proj}_X(v_k-\alpha_k\nabla l(v_k)).

Note that $\tilde{v}_{k+1}$ is derived from $v_k$ using the true gradient, while $v_{k+1}$ is derived using our approximate gradient. Due to the polytopic shape of $X$ in Assumption III.1, for $v_k\in X$, we can always choose a feasible direction $d_k$ from $\mathcal{T}_X(v_k)$ with $\|d_k\|\leq 1$ such that:

\|\nabla_Xl(v_k)\| \leq -\left<\nabla l(v_k),d_k\right> + \frac{\epsilon}{4}.

We further have that the following inequality holds for any $z\in X$, due to the obtuse angle criterion and the convexity of $X$:

\left<\alpha_k\nabla l(v_k),\tilde{v}_{k+1}-z\right>
\leq \left<\alpha_k\nabla l(v_k),\tilde{v}_{k+1}-z\right> + \left<v_k-\alpha_k\nabla l(v_k)-\tilde{v}_{k+1},\tilde{v}_{k+1}-z\right>
= \left<v_k-\tilde{v}_{k+1},\tilde{v}_{k+1}-z\right> \leq \|v_k-\tilde{v}_{k+1}\|\|\tilde{v}_{k+1}-z\|.

Applying Corollary VIII.6, we can choose sufficiently large $k$ such that:

-\left<\nabla l(\tilde{v}_{k+1}),\frac{\tau_kd_k}{\tau_k\|d_k\|}\right> \leq \frac{1}{\alpha_k}\|v_k-\tilde{v}_{k+1}\|
\leq \frac{1}{\alpha_k}\|v_k-v_{k+1}\| + \frac{1}{\alpha_k}\|v_{k+1}-\tilde{v}_{k+1}\|
= \sqrt{\Delta_k^v} + \frac{1}{\alpha_k}\|v_{k+1}-\tilde{v}_{k+1}\|
\leq \frac{\epsilon}{4} + \sqrt{\frac{L_{\nabla K}^2}{4}\Delta_k\|u_k+u_f(v_k)\|^2} = \frac{\epsilon}{4} + \mathcal{O}(k^{(1-2m)/2}).

Here the first inequality holds by choosing a sufficiently small, $k$-dependent $\tau_k$ such that $z=\tilde{v}_{k+1}+\tau_kd_k\in X$ and using the fact that $\|d_k\|\leq 1$. The third inequality holds by choosing sufficiently large $k$ and applying Corollary VIII.6. The last equality holds by the contractive property of the projection operator, Theorem VIII.3, and Lemma VIII.4. ∎

VIII-E Convergence Analysis of Algorithm 2

Informally, we establish convergence for the preconditioned Algorithm 2 by modifying Equation 10 as follows:

\|u_{k+1}-u_f(v_k)\|^2
= \|u_{k+1}-u_k\|^2 + \|u_k-u_f(v_k)\|^2 + 2\left<u_{k+1}-u_k,u_k-u_f(v_k)\right>
= \|u_{k+1}-u_k\|^2 + \|u_k-u_f(v_k)\|^2 - 2\beta_k\left<K(v_k)M^{-2}(v_k)(K(v_k)u_k-f),u_k-u_f(v_k)\right>
\leq \|u_{k+1}-u_k\|^2 + (1-2\beta_k\frac{\underline{\rho}^2}{\bar{\rho}_M^2})\|u_k-u_f(v_k)\|^2
\leq \beta_k^2\frac{\bar{\rho}^4}{\underline{\rho}_M^4}\|u_k-u_f(v_k)\|^2 + (1-2\beta_k\frac{\underline{\rho}^2}{\bar{\rho}_M^2})\|u_k-u_f(v_k)\|^2.

Assumption IV.2 must also be modified to account for the spectrum of $M(v)$. All the other steps are identical to those of Theorem IV.6.
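For concreteness, the following is a minimal Python sketch of this modified low-level step, assuming the illustrative Jacobi-style choice $M^2=\text{diag}(K)$; any positive definite $M(v)$ with bounded spectrum fits the analysis above:

import numpy as np

def preconditioned_low_level_step(K, u, f, beta):
    # u_{k+1} = u_k - beta_k M^{-2} (K(v_k) u_k - f); with M^2 = diag(K)
    # this is a Jacobi-scaled gradient step, and diag(K) > 0 since K is SPD.
    return u - beta * (K @ u - f) / np.diag(K)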

IX Convergence Analysis of Algorithm 3

The main difference in the analysis of Algorithm 3 lies in the use of a different low-level error metric, defined as $\Xi_k\triangleq\|K(v_k)u_k-f\|^2$. Unlike $\Delta_k$, which requires an exact matrix inversion to evaluate, $\Xi_k$ can be computed at a rather low cost. We will further show that using $\Xi_k$ for the analysis leads to a much larger, constant choice of the low-level step size, $\beta_k=1$. To prove Theorem IV.11, we follow similar steps as Theorem IV.6 and only list the necessary changes in this section.
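The cost gap between the two metrics is immediate in code; a minimal sketch follows, where the dense solve stands in for the exact FEA inversion:

import numpy as np

def Xi(K, u, f):
    # Xi_k = ||K(v_k) u_k - f||^2: one matrix-vector product, O(nnz(K)).
    r = K @ u - f
    return r @ r

def Delta(K, u, f):
    # Delta_k = ||u_k - u_f(v_k)||^2: needs the exact solve K u_f = f.
    d = u - np.linalg.solve(K, f)
    return d @ d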

IX-A Low-Level Error Propagation

Lemma IX.1.

Assuming $M(v)$ commutes with $K(v)$ and Assumption III.1, the following relationship holds for all $k\geq 1$:

\Xi_{k+1} \leq \frac{p+1}{p}(1-2\beta_k\frac{\underline{\rho}}{\bar{\rho}_M}+\beta_k^2\frac{\bar{\rho}^2}{\underline{\rho}_M^2})\Xi_k + L_K^2L_{\nabla K}^2\alpha_k^2\frac{p+1}{4}\|u_k\|^6,

where we define $L_K$ via $\|K(v_{k+1})-K(v_k)\|_F\leq L_K\|v_{k+1}-v_k\|$ for any $v_k,v_{k+1}\in X$.

Proof.

The following result holds by expanding the square, substituting the update $u_{k+1}=u_k-\beta_kM^{-1}(v_k)(K(v_k)u_k-f)$, and using the bounded spectra of $K(v)$ and $M(v)$ (Equation 2):

\|K(v_k)u_{k+1}-f\|^2
= \|K(v_k)(u_{k+1}-u_k)\|^2 + \|K(v_k)u_k-f\|^2 + 2\left<K(v_k)(u_{k+1}-u_k),K(v_k)u_k-f\right>
= \|K(v_k)(u_{k+1}-u_k)\|^2 + \|K(v_k)u_k-f\|^2 - 2\beta_k\left<K(v_k)M^{-1}(v_k)(K(v_k)u_k-f),K(v_k)u_k-f\right>
\leq \|K(v_k)(u_{k+1}-u_k)\|^2 + (1-2\beta_k\frac{\underline{\rho}}{\bar{\rho}_M})\|K(v_k)u_k-f\|^2
\leq (1-2\beta_k\frac{\underline{\rho}}{\bar{\rho}_M}+\beta_k^2\frac{\bar{\rho}^2}{\underline{\rho}_M^2})\|K(v_k)u_k-f\|^2.   (18)

Next, we estimate the effect of an update on $v_k$ via Young’s inequality:

\|(K(v_{k+1})-K(v_k))u_{k+1}\|\|K(v_k)u_{k+1}-f\|
\leq \frac{1}{2p}\|K(v_k)u_{k+1}-f\|^2 + \frac{p}{2}\|(K(v_{k+1})-K(v_k))u_{k+1}\|^2
\leq \frac{1}{2p}\|K(v_k)u_{k+1}-f\|^2 + \frac{pL_K^2\alpha_k^2}{8}\|u_k^T\frac{\partial{K}}{\partial{v_k}}u_k\|^2\|u_{k+1}\|^2.

Putting the above two equations together, we have:

\|K(v_{k+1})u_{k+1}-f\|^2
\leq \|(K(v_{k+1})-K(v_k))u_{k+1}\|^2 + \|K(v_k)u_{k+1}-f\|^2 + 2\|(K(v_{k+1})-K(v_k))u_{k+1}\|\|K(v_k)u_{k+1}-f\|
\leq \frac{p+1}{p}(1-2\beta_k\frac{\underline{\rho}}{\bar{\rho}_M}+\beta_k^2\frac{\bar{\rho}^2}{\underline{\rho}_M^2})\|K(v_k)u_k-f\|^2 + L_K^2\alpha_k^2\frac{p+1}{4}\|u_k^T\frac{\partial{K}}{\partial{v_k}}u_k\|^2\|u_{k+1}\|^2,   (19)

which is a recursive relationship to be used for proving the low-level convergence. ∎

IX-B Low-Level Convergence

We use the following shorthand notation for the result in Lemma IX.1:

\Xi_{k+1} \leq \Theta_k\Xi_k + \Gamma_k\|u_k\|^6
\Theta_k \triangleq \frac{p+1}{p}(1-2\beta_k\frac{\underline{\rho}}{\bar{\rho}_M}+\beta_k^2\frac{\bar{\rho}^2}{\underline{\rho}_M^2}) \quad \Gamma_k \triangleq L_K^2L_{\nabla K}^2\alpha_k^2\frac{p+1}{4}.   (20)

By taking Assumption IV.8 and a direct calculation, we can ensure that $\Theta_k\leq\bar{\Theta}<1$ (i.e., the first term is contractive). To bound the growth of the second term above, we show by induction that both $\Xi_k$ and $u_k$ can be uniformly bounded for all $k$ via a sufficiently small, constant $\Gamma_k\leq\bar{\Gamma}$.

Lemma IX.2.

Assuming $M(v)$ commutes with $K(v)$ and Assumptions III.1, IV.8, and IV.9, we have $\Xi_k\leq U^2$ and $\|u_k\|\leq U_f+U$ for all $k\geq 1$, where $U_f$ is the uniform upper bound for $\|u_f(v)\|$.

Proof.

First, since $v\in X$ and $X$ is compact, we have $\|u_f(v)\|\leq U_f<\infty$. Next, we prove $\Xi_k\leq U^2$ by induction. We already have $\Xi_1\leq U^2$. Now suppose $\Xi_k\leq U^2$; then Equation 20 and our assumption on $\bar{\Gamma}$ immediately lead to:

\Xi_{k+1} \leq \bar{\Theta}U^2 + \bar{\Gamma}(\|K^{-1}(v_k)(K(v_k)u_k-f)\|+\|u_f(v_k)\|)^6 \leq \bar{\Theta}U^2 + \bar{\Gamma}(U/\underline{\rho}+U_f)^6 \leq U^2.

Finally, we have $\|u_k\|\leq\|u_k-u_f(v_k)\|+\|u_f(v_k)\|\leq U+U_f$ and our lemma follows. ∎

The shrinking coefficient $\bar{\Theta}$ and the uniform boundedness of $\Xi_k,u_k$ allow us to establish low-level convergence.

Theorem IX.3.

Taking Assumptions III.1, IV.8, IV.9, and IV.10, we can upper bound $\Xi_{k+1}$ as:

\Xi_{k+1} \leq \bar{\Theta}^{k-1}\Xi_1 + (U+U_f)^6\Gamma_1\sum_{i=0}^{k-1}\frac{\bar{\Theta}^i}{(k-i)^{2m}} = \mathcal{O}(k^{1-2m}).
Proof.

Recursively expanding Equation 20, we have:

\Xi_{k+1} \leq \bar{\Theta}^k\Xi_1 + (U+U_f)^6\sum_{i=0}^{k-1}\bar{\Theta}^i\Gamma_{k-i}
\leq \bar{\Theta}^k\Xi_1 + (U+U_f)^6\Gamma_1\left[\frac{\lceil\frac{k+1}{2}\rceil}{(k-\lceil\frac{k-1}{2}\rceil)^{2m}} + \frac{\bar{\Theta}^{\lceil\frac{k}{2}\rceil}}{1-\bar{\Theta}}\right] = \mathcal{O}(k^{1-2m}),

where we have used our choice of $\alpha_k$, Lemma IX.2, and a similar argument as in Theorem VIII.3. ∎

Theorem IX.3 allows us to choose $m$ to tune the convergence speed of the low-level problem, which is then used to establish the convergence of the high-level problem.

IX-C High-Level Error Propagation

We first establish the high-level rule of error propagation over a single iteration:

Lemma IX.4.

Assuming $M(v)$ commutes with $K(v)$ and Assumptions III.1, IV.2, IV.3, and IV.4, the following high-level error propagation rule holds for all $k\geq 1$ and $q>0$:

l(v_{k+1}) - l(v_k) \leq -\frac{1}{\alpha_k}\|v_{k+1}-v_k\|^2 + \frac{L_{\nabla K}^2\alpha_k^2(L_{\nabla l}q+1)}{8q}(U+U_f)^4 + \frac{L_{\nabla K}^2q}{8\underline{\rho}}\Xi_k(U+2U_f)^2.
Proof.

Equation 14 holds by the same argument as in Lemma VIII.4. For the last term in Equation 14, we have:

\left<v_{k+1}-v_k,\frac{1}{2}u_k^T\frac{\partial{K}}{\partial{v_k}}u_k-\frac{1}{2}u_f^T(v_k)\frac{\partial{K}}{\partial{v_k}}u_f(v_k)\right>
\leq \frac{L_{\nabla K}^2\alpha_k^2}{8q}\|u_k\|^4 + \frac{q}{2}\|\frac{1}{2}(u_k-u_f(v_k))^T\frac{\partial{K}}{\partial{v_k}}(u_k+u_f(v_k))\|^2
\leq \frac{L_{\nabla K}^2\alpha_k^2}{8q}\|u_k\|^4 + \frac{L_{\nabla K}^2q}{8\underline{\rho}}\Xi_k\|u_k+u_f(v_k)\|^2.   (21)

The lemma follows by combining Equation 14, Equation 21, and Lemma IX.2. ∎

IX-D High-Level Convergence

We first show that $\Delta_k^v$ is diminishing via the following lemma:

Lemma IX.5.

Assuming $M(v)$ commutes with $K(v)$ and Assumptions III.1, IV.8, IV.9, and IV.10, we have the following bound on the accumulated high-level error $\Delta_k^v$:

\frac{1}{2}\sum_{i=1}^k\alpha_i\Delta_i^v \leq l(v_1) - \underline{l} + \frac{L_{\nabla K}^2}{8}(U+U_f)^4\sum_{i=1}^k\alpha_i^{2-n}(L_{\nabla l}\alpha_i^n+1) + \frac{L_{\nabla K}^2}{8\underline{\rho}}(U+2U_f)^2\sum_{i=1}^k\alpha_i^n\Xi_i + \frac{L_{\nabla K}^2}{4\underline{\rho}}(U+2U_f)^2\sum_{i=1}^k\alpha_i\Xi_i,

where $\underline{l}$ is the lower bound of $l$ on $X$.

Proof.

By choosing $q=\alpha_k^n$ for some constant $n$ and summing up the recursive rule of Lemma IX.4, we have the following result:

l(v_{k+1}) - l(v_1) \leq -\sum_{i=1}^k\frac{1}{\alpha_i}\|v_{i+1}-v_i\|^2 + \frac{L_{\nabla K}^2}{8}(U+U_f)^4\sum_{i=1}^k\alpha_i^{2-n}(L_{\nabla l}\alpha_i^n+1) + \frac{L_{\nabla K}^2}{8\underline{\rho}}(U+2U_f)^2\sum_{i=1}^k\alpha_i^n\Xi_i.   (22)

We further estimate the update using the true gradient:

\Delta_k^v \leq \frac{2}{\alpha_k^2}\|v_{k+1}-v_k\|^2 + \frac{L_{\nabla K}^2}{2\underline{\rho}}\Xi_k(U+2U_f)^2.   (23)

We can prove our lemma by combining Lemma IX.4, Equation 22, and Equation 23. ∎

The following corollary can be proved by the same argument as Corollary VIII.6:

Corollary IX.6.

Assuming $M(v)$ commutes with $K(v)$ and Assumptions III.1, IV.8, IV.9, IV.10, and IV.5, we have $\min_{i=1,\cdots,k}\alpha_i\Delta_i^v\leq Ck^{-1}$ and there are infinitely many $k$ such that $\Delta_k^v\leq C_vk^{m-1}$ for some constants $C,C_v$ independent of $k$.

Finally, we list the changes necessary to prove Theorem IV.11:

Proof of Theorem IV.11.

We choose $\tilde{v}_{k+1},d_k,\tau_k$ as in the proof of Theorem IV.6. Applying Corollary IX.6, we can choose sufficiently large $k$ such that:

-\left<\nabla l(\tilde{v}_{k+1}),\frac{\tau_kd_k}{\tau_k\|d_k\|}\right> \leq \frac{1}{\alpha_k}\|v_k-\tilde{v}_{k+1}\|
\leq \frac{1}{\alpha_k}\|v_k-v_{k+1}\| + \frac{1}{\alpha_k}\|v_{k+1}-\tilde{v}_{k+1}\|
= \sqrt{\Delta_k^v} + \frac{1}{\alpha_k}\|v_{k+1}-\tilde{v}_{k+1}\|
\leq \frac{\epsilon}{4} + \sqrt{\frac{L_{\nabla K}^2}{4\underline{\rho}}\Xi_k\|u_k+u_f(v_k)\|^2} = \frac{\epsilon}{4} + \mathcal{O}(k^{(1-2m)/2}).

The last inequality holds by the contractive property of the projection operator, Theorem IX.3, and Lemma IX.4. ∎

IX-E Parameter Choices

We argue that $\beta_k=1$ is a valid choice in Algorithm 3 if a Krylov preconditioner is used. Note that choosing $\beta_k=1$ is generally incompatible with Assumption IV.8, because we make minimal assumptions on the preconditioner $M(v)$, requiring only that it commutes with $K(v)$ and is positive definite with a bounded spectrum. However, many practical preconditioners provide much stronger guarantees, making $\beta_k=1$ a valid choice. To see this, observe that the purpose of choosing $\beta_k$ according to Assumption IV.8 is to establish the following contractive property in Equation 18:

\|K(v_k)u_{k+1}-f\|^2 \leq \bar{\Theta}\|K(v_k)u_k-f\|^2.   (24)

Any preconditioner can be used as long as it retains this property with $\beta_k=1$. The following result shows that the Krylov preconditioner does:

Lemma IX.7.

Suppose $M(v)$ is the Krylov preconditioner with positive $D$ and $\beta_k=1$; then there exists some constant $\bar{\Theta}<1$ such that Equation 24 holds in Algorithm 3 for any $k$.

Proof.

We define $b\triangleq K(v_k)u_k-f$ and, by the definition of $M(v)$, we have:

\|K(v_k)u_{k+1}-f\|^2 = \|K(v_k)(u_k-M^{-1}(v_k)b)-f\|^2
= \|b-K(v_k)M^{-1}(v_k)b\|^2 = \|b-\sum_{i=1}^{D+1}c_i^*K^i(v_k)b\|^2
\leq \|(I-\frac{1}{\bar{\rho}}K(v_k))b\|^2 \leq (1-\frac{\underline{\rho}}{\bar{\rho}})\|b\|^2 \triangleq \bar{\Theta}\|K(v_k)u_k-f\|^2,

where the first inequality holds by the definition of cic_{i}^{*}. ∎
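Lemma IX.7 is easy to verify numerically; a minimal Python sketch follows, where the random SPD matrix, its spectrum $[0.5,2]$, and the choice $D=3$ are illustrative:

import numpy as np

def krylov_Minv(K, b, D):
    # Least-squares Krylov preconditioner: pick c minimizing
    # ||b - sum_{i=1}^{D+1} c_i K^i b||, then M^{-1} b = sum_i c_i K^{i-1} b.
    vecs = [b]
    for _ in range(D + 1):
        vecs.append(K @ vecs[-1])                 # b, K b, ..., K^{D+1} b
    A = np.stack(vecs[1:], axis=1)                # columns K^1 b, ..., K^{D+1} b
    c, *_ = np.linalg.lstsq(A, b, rcond=None)
    return np.stack(vecs[:-1], axis=1) @ c

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(50, 50)))
K = Q @ np.diag(rng.uniform(0.5, 2.0, 50)) @ Q.T  # eigenvalues in [0.5, 2]
u, f = rng.normal(size=50), rng.normal(size=50)
b = K @ u - f
u_next = u - krylov_Minv(K, b, D=3)               # one step with beta_k = 1
# Check Equation 24 with Theta_bar = 1 - rho_lo / rho_hi = 0.75:
print(np.sum((K @ u_next - f) ** 2) <= 0.75 * (b @ b))   # prints True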

Unfortunately, it is very difficult to establish this property theoretically for other practical preconditioners, such as incomplete LU and multigrid, although it is almost always observed in practice.