
Augmenting Iterative Trajectory for Bilevel Optimization: Methodology, Analysis and Extensions

Risheng Liu,  Yaohua Liu, Shangzhi Zeng, Jin Zhang R. Liu and Y. Liu are with the DUT-RU International School of Information Science & Engineering, Dalian University of Technology, and also with the Key Laboratory for Ubiquitous Network and Service Software of Liaoning Province, Dalian, Liaoning, P.R., China. R. Liu is also with the Peng Cheng Laboratory, Shenzhen, Guangdong, P.R. China. E-mail: [email protected], [email protected]. S. Zeng is with the Department of Mathematics and Statistics, University of Victoria, Victoria, B.C., Canada. E-mail: [email protected]. J. Zhang is with the Department of Mathematics, Southern University of Science and Technology, and National Center for Applied Mathematics Shenzhen, China, (Corresponding author, E-mail: [email protected].) Manuscript received April 19, 2005; revised August 26, 2015.
Abstract

In recent years, there has been a surge of machine learning applications with hierarchical structure, which can be approached from a Bi-Level Optimization (BLO) perspective. However, most existing gradient-based methods overlook the interdependence between hyper-gradient calculation and the Lower-Level (LL) iterative trajectory, focusing solely on the former. Consequently, their convergence theory is built on restrictive LL assumptions that are often challenging to satisfy in real-world scenarios. In this work, we thoroughly analyze the constructed iterative trajectory and highlight two deficiencies: the empirically chosen initialization and the default use of the entire trajectory for hyper-gradient calculation. To address these issues, we incrementally introduce two augmentation techniques, Initialization Auxiliary (IA) and Pessimistic Trajectory Truncation (PTT), and investigate various extension strategies, such as prior regularization, different iterative mapping schemes and acceleration dynamics, to construct the Augmented Iterative Trajectory (AIT) for the corresponding BLO scenarios (e.g., LL convexity and LL non-convexity). Theoretically, we provide convergence analysis for AIT and its variations under different LL assumptions, and establish the first convergence analysis for BLOs with non-convex LL subproblems. Finally, we demonstrate the effectiveness of AIT through three numerical examples, typical learning and vision applications (e.g., data hyper-cleaning and few-shot learning) and more challenging tasks such as neural architecture search.

Index Terms:
Bilevel optimization, gradient-based method, initialization auxiliary, pessimistic trajectory truncation, deep learning.

1 Introduction

Recently, a variety of emerging deep learning applications share similar principles in constructing hierarchical optimization models, including Hyperparameter Optimization [1, 2], Meta Learning [3, 4], Neural Architecture Search [5, 6] and Reinforcement Learning [7, 8, 9], and have achieved highly competitive performance in different fields. With proper reformulation, all of these models can be cast as Bi-Level Optimization (BLO) problems, which optimize two nested subproblems of the form:

\min_{\mathbf{x}\in\mathcal{X}}F(\mathbf{x},\mathbf{y}),\ \text{ s.t. }\ \mathbf{y}\in\mathcal{S}(\mathbf{x}):=\arg\min_{\mathbf{y}\in\mathcal{Y}}f(\mathbf{x},\mathbf{y}), (1)

where $\mathbf{x}\in\mathbb{R}^{m}$ and $\mathbf{y}\in\mathbb{R}^{n}$ correspond to the Upper-Level (UL) and Lower-Level (LL) variables, respectively, the UL objective $F$ and LL objective $f$ are continuously differentiable functions, and $\mathcal{S}(\mathbf{x})$ is the LL solution set for all $\mathbf{x}\in\mathcal{X}$.

In general, the BLO problem can be viewed as a leader-follower game, in which the follower (i.e., the LL subproblem) continuously responds to the decision $\mathbf{x}$ of the leader (i.e., the UL subproblem). From the optimistic BLO perspective, we assume that the obtained LL solution $\mathbf{y}\in\mathcal{S}(\mathbf{x})$ always leads to the optimal value of the UL objective for any given $\mathbf{x}$; then the original BLO in Eq. (1) reduces to the single-level problem

\min_{\mathbf{x}\in\mathcal{X}}\varphi(\mathbf{x}):=\inf_{\mathbf{y}\in\mathcal{S}(\mathbf{x})}F(\mathbf{x},\mathbf{y}), (2)

where $\varphi$ is the value function of a simple bilevel problem w.r.t. $\mathbf{y}$ for any given $\mathbf{x}$. In this formulation, $\mathbf{x}$ and $\mathbf{y}$ have a cooperative relationship towards minimizing the UL subproblem. Based on this form, various branches of methods [10, 11, 12, 13] have explored how to optimize $\varphi(\mathbf{x})$ with respect to $\mathbf{x}$ and $\mathbf{y}$ simultaneously, and thus solve the applications captured by this hierarchical structure with high accuracy and efficiency.

Typically, BLO problems remain challenging to solve owing to the sophisticated dependence between the UL and LL variables, especially when multiple LL solutions exist in $\mathcal{S}(\mathbf{x})$. In classical theory, the KKT conditions [14] have served as an efficient tool for such optimization problems; however, this type of method is impractical for modern machine learning tasks because of the large number of multipliers involved. Meanwhile, Gradient-Based Methods (GBMs) [15] have been widely adopted in various learning and vision tasks to handle real-world applications. Practically speaking, the key to solving Eq. (2) with GBMs is to access the highly coupled nested gradient induced by the iterative optimization of the LL subproblem in Eq. (1). We denote by $\mathbf{y}^{*}(\mathbf{x})$ the best-response mapping of the follower for a given UL variable $\mathbf{x}$. By embedding $\mathbf{y}^{*}(\mathbf{x})$ into the simple bilevel problem in Eq. (2) as $F\left(\mathbf{x},\mathbf{y}^{*}(\mathbf{x})\right)$, the hyper-gradient of $\varphi(\mathbf{x})$ can be derived by the chain rule as

\nabla_{\mathbf{x}}\varphi(\mathbf{x})=\nabla_{\mathbf{x}}F\left(\mathbf{x},\mathbf{y}^{*}(\mathbf{x})\right)+G(\mathbf{x}), (3)

where $\nabla_{\mathbf{x}}F\left(\mathbf{x},\mathbf{y}^{*}(\mathbf{x})\right)$ can be regarded as the direct gradient, and $G(\mathbf{x})=\left(\nabla_{\mathbf{x}}\mathbf{y}^{*}(\mathbf{x})\right)^{\top}\nabla_{\mathbf{y}}F\left(\mathbf{x},\mathbf{y}^{*}(\mathbf{x})\right)$ denotes the indirect gradient.
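To make the decomposition in Eq. (3) concrete, the following sketch evaluates the direct and indirect terms on a hypothetical toy problem with a closed-form best response ($F(x,y)=(y-1)^2+x^2$ and $f(x,y)=(y-x)^2$ are illustrative choices, not from the paper) and checks the result against a finite-difference approximation of $\varphi$:

```python
# Toy BLO instance (illustrative, not from the paper):
#   F(x, y) = (y - 1)^2 + x^2,   f(x, y) = (y - x)^2,
# so the best response is y*(x) = x and phi(x) = F(x, y*(x)).
def hyper_gradient(x):
    y = x                                  # y*(x), known in closed form here
    direct = 2.0 * x                       # direct gradient: partial F / partial x
    dy_dx = 1.0                            # derivative of the best response y*(x)
    indirect = dy_dx * 2.0 * (y - 1.0)     # G(x) = (d y*/dx)^T * partial F / partial y
    return direct + indirect

# Finite-difference check of the hyper-gradient at x = 0.3
phi = lambda x: (x - 1.0) ** 2 + x ** 2    # phi(x) = F(x, y*(x))
eps = 1e-6
fd = (phi(0.3 + eps) - phi(0.3 - eps)) / (2 * eps)
assert abs(hyper_gradient(0.3) - fd) < 1e-5
```

For this toy problem the exact hyper-gradient is $4x-2$, so both terms of Eq. (3) are needed: the direct gradient alone ($2x$) would miss how $\mathbf{y}^{*}$ moves with $\mathbf{x}$.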

Since the calculation of the hyper-gradient requires proper approximations of $\mathbf{y}^{*}(\mathbf{x})$ and the indirect gradient $G(\mathbf{x})$, classical dynamics-embedded GBMs [15] construct these approximations by introducing a dynamical system obtained from iterative optimization steps. Typically, this type of gradient-based BLO method approximates $\mathbf{y}^{*}(\mathbf{x})$ with $\mathbf{y}_{K}(\mathbf{x})$ constructed by the following optimization dynamics: $\mathbf{y}_{k+1}(\mathbf{x})=\mathcal{T}_{k+1}\left(\mathbf{x},\mathbf{y}_{k}(\mathbf{x})\right),\ k=0,\cdots,K-1$, where $\mathbf{y}_{0}(\mathbf{x})=\mathbf{y}_{0}$ is a fixed given initial value and $\mathcal{T}_{k+1}(\mathbf{x},\cdot)$ denotes the iterative dynamics mapping at the $k$-th step derived from a specific updating scheme. Based on this formulation, we use $\mathcal{T}_{0:K}$ to denote the historical sequence $\mathbf{y}_{k}(\mathbf{x})$, $k=0,\ldots,K$, which will be discussed more thoroughly in this work. In essence, $\mathcal{T}_{0:K}$ reflects the iterative trajectory of the LL variables used to obtain $\mathbf{y}_{K}(\mathbf{x})$; the indirect gradient $G(\mathbf{x})$ can then be approximated via $\mathbf{y}_{K}(\mathbf{x})$ by $\left(\nabla_{\mathbf{x}}\mathbf{y}_{K}(\mathbf{x})\right)^{\top}\nabla_{\mathbf{y}}F\left(\mathbf{x},\mathbf{y}_{K}(\mathbf{x})\right)$. In this way, these GBMs can be regarded as applying gradient descent to the following approximation of Eq. (2),

\min_{\mathbf{x}\in\mathcal{X}}\varphi_{K}(\mathbf{x}):=F(\mathbf{x},\mathbf{y}_{K}(\mathbf{x})). (4)

It is worth noting that $\mathbf{y}_{K}(\mathbf{x})$ can be regarded as a good approximation as long as $\nabla_{\mathbf{y}}f\left(\mathbf{x},\mathbf{y}_{K}(\mathbf{x})\right)$ uniformly converges to zero w.r.t. $\mathbf{x}\in\mathcal{X}$ and $F(\mathbf{x},\mathbf{y}_{K}(\mathbf{x}))$ converges to the true BLO objective $\varphi(\mathbf{x})$.

To ensure that the approximation in Eq. (4) is good enough to guarantee convergence towards the original BLO, restrictive assumptions on the LL subproblem, such as LL Singleton (LLS) and LL Convexity (LLC), are usually required by existing GBMs. However, as complex network structures with multiple convolutional blocks and loss functions with specialized regularization terms are widely employed in many applications, the LL subproblems are often non-convex, which implies a gap between the available theoretical guarantees and the complex properties of real-world problems. To address these challenges, several works [12, 16] have designed new iterative mapping schemes or approximation theories, partly relaxing the theoretical constraints and obtaining new convergence results. Nevertheless, there is still an urgent need for a general BLO framework with solid theoretical analysis that covers the non-convex scenario.

1.1 Contributions

In this work, we propose the Augmented Iterative Trajectory (AIT) framework, with two augmentation techniques, Initialization Auxiliary (IA) and Pessimistic Trajectory Truncation (PTT), together with a series of extension strategies, to handle the above limitations of existing GBMs. More specifically, we first analyze the vanilla iterative trajectory constructed by the gradient-based BLO scheme, and highlight two significant deficiencies: the empirically chosen initialization of the iterative trajectory (i.e., $\mathbf{y}_{0}(\mathbf{x})$) and the default use of the entire trajectory for hyper-gradient calculation, both of which have received little attention from existing GBMs.

Aiming at these deficiencies of the iterative trajectory, we first introduce IA to optimize the initialization of the LL trajectory simultaneously with the UL variables. We then propose prior regularization as a warm start to guide the optimization of IA, and investigate different iterative dynamics mapping schemes and acceleration dynamics to construct our AIT framework for BLOs with a convex LL subproblem ($\textrm{AIT}_{C}$ for short). To handle the challenging non-convex scenario, we further propose the PTT operation to dynamically adjust the iterative trajectory used for hyper-gradient calculation, and introduce an accelerated gradient scheme to construct our AIT framework for non-convex BLOs ($\textrm{AIT}_{NC}$ for short). Theoretically, we conduct a detailed convergence analysis of $\textrm{AIT}_{C}$, $\textrm{AIT}_{NC}$ and their variations to support different challenging scenarios, especially the unexplored non-convex one. Finally, we demonstrate our convergence results and performance improvements with extensive experiments on various numerical examples, typical learning and vision applications, and other more challenging deep learning tasks. We summarize our main contributions as follows.

  • We conduct a thorough analysis of the iterative trajectory construction in the gradient-based BLO scheme, a significant departure from existing GBMs. We then identify two limitations of the vanilla iterative trajectory adopted by current GBMs, and introduce our AIT to address BLOs under more relaxed LL assumptions.

  • We first propose IA to guide the optimization of the LL variables, and then introduce prior regularization based on IA and explore various iterative dynamics mapping schemes (e.g., Nesterov's acceleration dynamics) to construct $\textrm{AIT}_{C}$ for BLOs with convex followers. To handle the more challenging non-convex scenarios, we further propose the PTT operation to dynamically truncate the trajectory, and investigate an accelerated gradient scheme to construct $\textrm{AIT}_{NC}$.

  • We provide a detailed theoretical investigation of $\textrm{AIT}_{C}$, $\textrm{AIT}_{NC}$ and the corresponding extension strategies (e.g., proximal prior regularization and the accelerated gradient scheme) to ensure the convergence of AIT without the LLS or even the LLC assumption. Notably, we establish the first convergence guarantee for BLO problems with non-convex LL subproblems.

  • Extensive experiments have been conducted on numerical examples with varying LL properties. Besides, we not only implement typical learning and vision applications (e.g., few-shot classification and data hyper-cleaning), but also demonstrate the generalization performance of AIT on more challenging tasks such as neural architecture search and generative adversarial networks.

2 A Brief Review of Existing Works

In this section, we first provide a brief review of previous works to summarize the general gradient-based BLO scheme, and then discuss the deficiencies of the existing LL iterative trajectory, which have limited the application of existing GBMs to more challenging BLO scenarios.

As mentioned before, existing GBMs have explored different heuristic techniques to calculate the hyper-gradient (i.e., $\nabla_{\mathbf{x}}\varphi(\mathbf{x})$) with designed approximation operations. One of the most representative classes of GBMs [17, 18, 19] first constructs a dynamical system to track the optimization path of the LL variables, and then explicitly calculates the indirect gradient with respect to the UL variables based on ITerative Differentiation (ITD). Typically, RHG and FHG [17, 18] approximate $\mathbf{y}^{*}(\mathbf{x})$, and thus $\varphi(\mathbf{x})$, by finitely many iterative optimization steps, and then solve the resulting approximated optimization problems. However, both methods suffer from the huge time and space complexity of calculating the indirect gradient along the whole iterative trajectory with Automatic Differentiation (AD). Shaban et al. [2] then proposed T-RHG, which manually truncates the backpropagation path to reduce the computational burden. For these methods, the iterative dynamics mapping $\mathcal{T}_{k+1}(\mathbf{x},\cdot)$ used to construct the iterative trajectory is specified as projected gradient descent. However, the convergence analyses of these methods all require the LL subproblem to be strongly convex, which severely restricts the application scenarios. Recently, Liu et al. [19] proposed BDA, which aggregates the gradient information of both the UL and LL subproblems to construct a new $\mathcal{T}_{k+1}(\mathbf{x},\cdot)$ for approximating $\mathbf{y}^{*}(\mathbf{x})$ and $\varphi(\mathbf{x})$, thus successfully overcoming the LLS restriction. When the nested hyper-gradient is calculated with ITD based on different forms of $\mathcal{T}_{k+1}(\mathbf{x},\cdot)$, the gradient-based BLO scheme can be summarized as in Alg. 1, where $\varphi_{K}(\mathbf{x})=F(\mathbf{x},\mathbf{y}_{K}(\mathbf{x}))$.
It can be observed that the iterative update of the UL variables for solving the approximated subproblem (i.e., $\varphi_{K}(\mathbf{x})$) relies on an effective iterative trajectory for hyper-gradient calculation (Steps 4-6).

Algorithm 1 Gradient-Based BLO Scheme
0:  UL iteration $T$ and LL iteration $K$.
1:  Initialize $\mathbf{x}^{0}$.
2:  for $t=0\rightarrow T-1$ do
3:     Initialize $\mathbf{y}_{0}$.
4:     for $k=0\rightarrow K-1$ do
5:        % Update $\mathbf{y}_{k}$ with $\mathbf{x}^{t}$.
6:        $\mathbf{y}_{k+1}(\mathbf{x}^{t})=\mathcal{T}_{k+1}\left(\mathbf{x}^{t},\mathbf{y}_{k}(\mathbf{x}^{t})\right)$.
7:     end for
8:     % Update $\mathbf{x}^{t}$ with $\mathbf{y}_{K}(\mathbf{x}^{t})$ based on ITD.
9:     $\mathbf{x}^{t+1}=\mathtt{Proj}_{\mathcal{X}}(\mathbf{x}^{t}-\alpha_{\mathbf{x}}\nabla_{\mathbf{x}}\varphi_{K}(\mathbf{x}^{t}))$.
10:  end for
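Alg. 1 can be sketched end-to-end on an illustrative toy problem ($f(x,y)=(y-x)^2$, $F(x,y)=(y-1)^2+x^2$; all choices are assumptions, not from the paper). Here the derivative $d\mathbf{y}_k/d\mathbf{x}$ is accumulated in forward mode alongside the LL iterates, playing the role that ITD/automatic differentiation plays in practice:

```python
# Toy instance (illustrative): f(x, y) = (y - x)^2, F(x, y) = (y - 1)^2 + x^2.
# Unrolls K lower-level gradient steps and accumulates J = dy_k/dx by the
# chain rule, mimicking what ITD computes by differentiating the trajectory.
def approx_hyper_gradient(x, y0=0.0, K=200, beta=0.1):
    y, J = y0, 0.0                        # y_0 is a fixed value, so J_0 = 0
    for _ in range(K):
        y = y - beta * 2.0 * (y - x)      # T_{k+1}: gradient descent on f
        J = J - beta * (2.0 * J - 2.0)    # chain rule with f_yy = 2, f_yx = -2
    # grad of phi_K(x) = dF/dx + (dy_K/dx) * dF/dy
    return 2.0 * x + J * 2.0 * (y - 1.0)
```

Since $y^{*}(x)=x$ here, the exact hyper-gradient is $4x-2$; for example, `approx_hyper_gradient(0.3)` approaches $-0.8$ as $K$ grows, illustrating that $\varphi_{K}$ approximates $\varphi$ only in the limit of the LL dynamics.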

Since most ITD-based approaches establish their convergence theory only under the LLS restriction, they neither pay attention to the initialization of the iterative trajectory nor seriously consider how to choose the iterative trajectory for hyper-gradient calculation. However, an improper initialization position for the LL variables may slow down the convergence of both the LL and UL subproblems. More importantly, when multiple solutions exist in $\mathcal{S}(\mathbf{x})$ (i.e., without LLS), the initialization has a significant influence on the UL convergence behavior, so there is no guarantee that the iterative trajectory constructed from the chosen $\mathbf{y}_{0}(\mathbf{x})$ provides iterates that optimize the UL objective while converging to $\mathcal{S}(\mathbf{x})$. Furthermore, if the LLC assumption is not satisfied, the subsequences of the whole iterative trajectory (i.e., $\mathcal{T}_{0:K}$) no longer exhibit the consistent convergence behavior enjoyed in the LLC case. On this basis, using the whole iterative trajectory for hyper-gradient calculation may lead to oscillation or divergence of the approximated UL subproblem. Overall, the vanilla iterative trajectory adopted by ITD-based approaches has limited flexibility and has become the bottleneck preventing the existing gradient-based BLO scheme from covering more challenging scenarios. Though several works have explored potential techniques, such as aggregating LL and UL gradient information to design a new $\mathcal{T}_{k}(\mathbf{x},\cdot)$ that yields a good approximated BLO problem in Eq. (4), and have successfully relaxed the theoretical restrictions on the LL subproblem to some extent, the situation where the LLC assumption does not hold still needs to be addressed.

In contrast to the ITD-based approaches, which compute the hyper-gradient based on the constructed LL iterative trajectories, another branch of methods [20, 11, 21] instead adopts the implicit function theorem and derives the hyper-gradient of the UL variables by solving a linear system. However, computing the Hessian matrix and its inverse is much more expensive and challenging, especially when the linear system is ill-conditioned. Besides, these methods based on Approximated Implicit Differentiation (AID) require the LL subproblem to be strongly convex, which greatly limits the application scenarios. Another line of work [22] addresses the non-convex issue by constructing value-function-based smooth approximation problems, while the extra assumptions introduced on the constrained LL subproblems remain cumbersome to handle. Among other recent efforts [23, 24] that design specific update formats of the UL or LL variables to derive new convergence-rate analyses, Hong et al. [25] proposed TTSA, which introduces cheap gradient estimates and updates the LL and UL variables simultaneously with SGD and projected SGD. In addition, Khanduri et al. [26] proposed the MSTSA algorithm to estimate the hyper-gradient with assisted momentum, and Quan et al. [27] proposed two variants of the alternating implicit gradient SGD approach with improved efficiency for solving equality-constrained BLOs. These methods also build their theoretical analysis for BLOs with strongly convex LL subproblems.
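A minimal sketch of the AID idea under the strong-convexity assumption it relies on (the quadratic LL subproblem, the UL objective and all matrices below are hypothetical): the indirect gradient is obtained by solving one linear system with the LL Hessian instead of differentiating through an unrolled trajectory.

```python
import numpy as np

# Quadratic LL problem: f(x, y) = 0.5 y^T A y - y^T B x with A positive
# definite, so y*(x) = A^{-1} B x and grad_yy f = A. The implicit function
# theorem gives the indirect gradient via one linear solve A v = grad_y F.
rng = np.random.default_rng(0)
n, m = 3, 2
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)          # SPD Hessian of f in y
B = rng.standard_normal((n, m))
y_tgt = rng.standard_normal(n)       # UL target: F = 0.5||y - y_tgt||^2 + 0.5||x||^2

def hyper_gradient_aid(x):
    y = np.linalg.solve(A, B @ x)            # exact LL solution y*(x)
    grad_yF = y - y_tgt                      # grad_y F at (x, y*(x))
    v = np.linalg.solve(A, grad_yF)          # linear system instead of unrolling
    return x + B.T @ v                       # direct + indirect gradient

# Finite-difference check of the hyper-gradient
phi = lambda x: 0.5 * np.sum((np.linalg.solve(A, B @ x) - y_tgt) ** 2) + 0.5 * np.sum(x ** 2)
x0, eps = rng.standard_normal(m), 1e-6
fd = np.array([(phi(x0 + eps * np.eye(m)[i]) - phi(x0 - eps * np.eye(m)[i])) / (2 * eps)
               for i in range(m)])
assert np.allclose(hyper_gradient_aid(x0), fd, atol=1e-5)
```

The sketch also makes the cost profile mentioned above visible: each hyper-gradient requires solving a system with the LL Hessian, which is exactly what becomes expensive or unstable when that system is large or ill-conditioned.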

TABLE I: We report the LL assumptions required by the mainstream GBMs to derive their convergence results, the technique used to compute $\nabla_{\mathbf{x}}\varphi(\mathbf{x})$, and whether they are compatible with acceleration techniques (i.e., A.). Note that w/ and w/o are abbreviations for with and without, respectively.

| Method  | $\nabla_{\mathbf{x}}\varphi(\mathbf{x})$ | w/o LLS | w/o LLC | w/ A. |
| RHG/FHG | ITD | ✗ | ✗ | ✗ |
| T-RHG   | ITD | ✗ | ✗ | ✗ |
| BDA     | ITD | ✓ | ✗ | ✗ |
| CG      | AID | ✗ | ✗ | ✗ |
| Neumann | AID | ✗ | ✗ | ✗ |
| AIT     | ITD | ✓ | ✓ | ✓ |

In Tab. I, we compare the LL assumptions that mainstream GBMs require to build their convergence analyses and whether they support acceleration techniques. Based on the above analysis of the gradient-based BLO scheme for ITD-based approaches, in the next section we construct our AIT framework to address these theoretical issues and handle more challenging BLO scenarios. As shown, AIT not only derives similar convergence results without the LLS and LLC restrictions, but also supports more flexible iterative mapping schemes, including acceleration techniques.

3 Augmented Iterative Trajectory (AIT)

Based on the prior analysis of the iterative trajectory, to relax the theoretical restrictions and improve the efficiency of the gradient-based BLO scheme, we first introduce the initialization auxiliary augmentation technique, which attaches a new auxiliary variable to the initialization of the iterative trajectory and optimizes it together with the UL variables. We then investigate different extension strategies related to this technique, including prior regularization and acceleration dynamics, to construct the Augmented Iterative Trajectory (AIT) for BLOs in which the LLC assumption is satisfied. To make the gradient-based BLO scheme workable on BLOs without LLC, we further propose the pessimistic trajectory truncation augmentation technique to dynamically adjust the iterative trajectory used for hyper-gradient calculation. Together with the initialization auxiliary, this constructs the AIT for solving BLOs with non-convex followers.

3.1 BLOs with Convex LL Subproblem

Typically speaking, the initialization of the iterative trajectory, i.e., $\mathbf{y}_{0}(\mathbf{x})$, is usually a fixed, manually set initial value of the LL variable (chosen randomly or by some specific strategy), which takes no account of the convergence behavior of the UL subproblem. As mentioned before, when the LL solution set $\mathcal{S}(\mathbf{x})$ contains multiple elements, performing projected gradient descent steps (i.e., the $\mathcal{T}_{k}(\mathbf{x},\cdot)$ adopted by most GBMs) to construct the iterative trajectory from an improper initial value $\mathbf{y}_{0}(\mathbf{x})$ does not necessarily optimize the UL objective while converging to $\mathcal{S}(\mathbf{x})$. Besides, an improper initialization position for $\mathcal{T}_{0:K}$ also slows down the convergence of the LL subproblem, even when $\mathcal{S}(\mathbf{x})$ is a singleton.

To address the issues mentioned above, we introduce an Initialization Auxiliary (IA) variable $\mathbf{z}$ as the initialization position of the LL variable to construct $\mathcal{T}_{0:K}$ for approximating the LL subproblem as follows,

\mathbf{y}_{0}(\mathbf{x},\mathbf{z})=\mathbf{z},\quad\mathbf{y}_{k+1}(\mathbf{x},\mathbf{z})=\mathcal{T}_{k+1}\left(\mathbf{x},\mathbf{y}_{k}(\mathbf{x},\mathbf{z})\right), (5)

which leads to the following approximation of Eq. (2): $\min_{\mathbf{x}\in\mathcal{X},\mathbf{z}\in\mathcal{Y}}\varphi_{K}(\mathbf{x},\mathbf{z}):=F(\mathbf{x},\mathbf{y}_{K}(\mathbf{x},\mathbf{z}))$. It is worth noting that the iterative trajectory constructed from these optimization dynamics yields the approximation $\mathbf{y}_{K}(\mathbf{x},\mathbf{z})$ parameterized by both $\mathbf{x}$ and $\mathbf{z}$. In this form, we avoid fixing an initialization position chosen by empirical strategies, and instead update the initialization auxiliary $\mathbf{z}$ together with $\mathbf{x}$ to continuously search for the "best" initial points of the LL variables for calculating $\mathbf{y}_{K}(\mathbf{x},\mathbf{z})$. By taking the convergence behavior of the UL subproblem into consideration, $\mathcal{T}_{k}(\mathbf{x},\cdot)$ with the updated initial value can approach solutions to the BLO problem in Eq. (1). The idea of the IA variable $\mathbf{z}$ was first proposed in the convergence theory [22] of non-convex first-order optimization methods.

As mentioned above, the initialization auxiliary variable $\mathbf{z}$ is introduced and optimized simultaneously with the UL variable $\mathbf{x}$ to help the LL subproblem dynamics approach the appropriate $\mathbf{y}^{*}(\mathbf{x})$ that minimizes the UL objective $F$. To further guide the update of the IA variable $\mathbf{z}$ and thus boost the convergence of the LL subproblem dynamics, we introduce a constraint set $\mathcal{Z}$ and an extra prior regularization term $g(\mathbf{z})$ on the IA variable $\mathbf{z}$, which leads to the following approximation of Eq. (2),

\min_{\mathbf{x}\in\mathcal{X},\mathbf{z}\in\mathcal{Z}}\ \psi_{K}(\mathbf{x},\mathbf{z})=\varphi_{K}(\mathbf{x},\mathbf{z})+g(\mathbf{z}). (6)

In practice, $g(\mathbf{z})$ can be specified as a penalty on the distance between $\mathbf{z}$ and a predetermined value $\mathbf{z}_{p}$, where $\mathbf{z}_{p}$ usually encodes some known state of $\mathbf{z}$ that is close to the optimal LL variable. For machine learning tasks, one natural option is to define $\mathbf{z}_{p}$ as pretrained weights of the LL variables; we verify its effectiveness on BLO applications.

Another feature of this new approximation format is that the iterative dynamics mapping $\mathcal{T}_{k+1}(\mathbf{x},\cdot)$ for constructing the iterative trajectory can be flexibly specified in the different forms proposed in existing works. Here, for general BLOs with a convex LL subproblem, we suggest applying the aggregation scheme proposed in BDA [19] to construct $\mathcal{T}_{k+1}(\mathbf{x},\cdot)$ as follows

\mathcal{T}_{k+1}\left(\mathbf{x},\mathbf{y}_{k}(\mathbf{x},\mathbf{z})\right)=\mathtt{Proj}_{\mathcal{Y}}\left(\mathbf{y}_{k}(\mathbf{x},\mathbf{z})-\left(\mu\alpha_{k}\mathbf{d}^{F}_{k}(\mathbf{x},\mathbf{z})+(1-\mu)\beta_{k}\mathbf{d}^{f}_{k}(\mathbf{x},\mathbf{z})\right)\right), (7)

where $\mathbf{d}^{F}_{k}(\mathbf{x},\mathbf{z})=s_{u}\nabla_{\mathbf{y}}F(\mathbf{x},\mathbf{y}_{k}(\mathbf{x},\mathbf{z}))$ and $\mathbf{d}^{f}_{k}(\mathbf{x},\mathbf{z})=s_{l}\nabla_{\mathbf{y}}f(\mathbf{x},\mathbf{y}_{k}(\mathbf{x},\mathbf{z}))$ denote the gradient directions of the UL and LL objectives with the introduced IA, respectively.

In particular, for the case where the LLS condition is satisfied, following the vanilla update format in RHG, $\mathcal{T}_{k+1}\left(\mathbf{x},\cdot\right)$ can simply be chosen as

\mathcal{T}_{k+1}\left(\mathbf{x},\mathbf{y}_{k}(\mathbf{x},\mathbf{z})\right)=\mathtt{Proj}_{\mathcal{Y}}\left(\mathbf{y}_{k}(\mathbf{x},\mathbf{z})-\alpha_{\mathbf{y}}^{k}\nabla_{\mathbf{y}}f(\mathbf{x},\mathbf{y}_{k}(\mathbf{x},\mathbf{z}))\right), (8)

where $\alpha_{\mathbf{y}}^{k}$ denotes the step size parameter. In this way, embedding the IA technique still helps optimize the initialization used to construct the iterative trajectory, further improving the convergence behavior of BLOs under the LLS assumption.

Besides, our new scheme with the IA technique also consistently improves existing methods when acceleration dynamics are embedded. As one practical implementation, we can combine our IA technique with Nesterov's acceleration method [28], which has been widely used in convex optimization to accelerate the convergence of gradient descent, obtaining the accelerated version

\begin{cases}\begin{aligned}&\mathbf{y}_{0}(\mathbf{x},\mathbf{z})=\mathbf{u}_{0}(\mathbf{x},\mathbf{z})=\mathbf{z},\quad t_{0}=1,\quad t_{k+1}=\left(1+\sqrt{1+4t_{k}^{2}}\right)/2,\\&\mathbf{y}_{k+1}(\mathbf{x},\mathbf{z})=\mathtt{Proj}_{\mathcal{Y}}\left(\mathbf{u}_{k}(\mathbf{x},\mathbf{z})-\alpha\nabla_{\mathbf{y}}f(\mathbf{x},\mathbf{u}_{k}(\mathbf{x},\mathbf{z}))\right),\\&\mathbf{u}_{k+1}(\mathbf{x},\mathbf{z})=\mathbf{y}_{k+1}(\mathbf{x},\mathbf{z})+\left(\frac{t_{k}-1}{t_{k+1}}\right)\left(\mathbf{y}_{k+1}(\mathbf{x},\mathbf{z})-\mathbf{y}_{k}(\mathbf{x},\mathbf{z})\right),\\&\mathcal{T}_{k+1}\left(\mathbf{x},\mathbf{y}_{k}(\mathbf{x},\mathbf{z})\right)=\mathbf{y}_{k+1}(\mathbf{x},\mathbf{z}),\end{aligned}\end{cases} (9)

where $\alpha$ denotes the step size parameter.
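A minimal sketch of the accelerated dynamics in Eq. (9), run on an illustrative one-dimensional convex LL objective ($f(x,y)=0.5(y-x)^2$, an assumption for demonstration) with the projection taken as the identity:

```python
# Illustrative LL objective f(x, y) = 0.5 * (y - x)^2; Proj_Y is the identity.
def nesterov_trajectory(x, z, K=60, alpha=0.5):
    ys = [z]                       # y_0(x, z) = z
    u, t = z, 1.0                  # u_0 = z, t_0 = 1
    for _ in range(K):
        y_next = u - alpha * (u - x)     # gradient step from the extrapolated point u_k
        t_next = (1.0 + (1.0 + 4.0 * t * t) ** 0.5) / 2.0
        u = y_next + ((t - 1.0) / t_next) * (y_next - ys[-1])  # momentum extrapolation
        ys.append(y_next)
        t = t_next
    return ys

traj = nesterov_trajectory(x=1.0, z=0.0)
assert abs(traj[-1] - 1.0) < 1e-4  # y_K approaches y*(x) = x
```

The returned list is exactly the trajectory $\mathcal{T}_{0:K}$ that the IA-augmented scheme differentiates through, with the momentum variable $\mathbf{u}_k$ kept internal to the mapping.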

Algorithm 2 The Proposed $\textrm{AIT}_{C}$ Algorithm
0:  UL iteration $T$, LL iteration $K$.
1:  Initialize $\mathbf{x}^{0}$ and $\mathbf{z}^{0}$.
2:  for $t=0\rightarrow T-1$ do
3:     $\mathbf{y}_{0}=\mathbf{z}^{t}$.
4:     for $k=0\rightarrow K-1$ do
5:        % Iterative dynamics mapping.
6:        $\mathbf{y}_{k+1}(\mathbf{x}^{t},\mathbf{z}^{t})=\mathcal{T}_{k+1}\left(\mathbf{x}^{t},\mathbf{y}_{k}(\mathbf{x}^{t},\mathbf{z}^{t})\right)$.
7:     end for
8:     % Update $\mathbf{x}$ with $\mathbf{y}_{K}(\mathbf{x}^{t},\mathbf{z}^{t})$.
9:     $\mathbf{x}^{t+1}=\mathtt{Proj}_{\mathcal{X}}(\mathbf{x}^{t}-\alpha_{\mathbf{x}}\nabla_{\mathbf{x}}\psi_{K}(\mathbf{x}^{t},\mathbf{z}^{t}))$.
10:     % Update $\mathbf{z}$ with $\mathbf{y}_{K}(\mathbf{x}^{t},\mathbf{z}^{t})$.
11:     $\mathbf{z}^{t+1}=\mathtt{Proj}_{\mathcal{Z}}(\mathbf{z}^{t}-\alpha_{\mathbf{z}}\nabla_{\mathbf{z}}\psi_{K}(\mathbf{x}^{t},\mathbf{z}^{t}))$.
12:  end for

In Alg. 2, we present our Augmented Iterative Trajectory algorithm for BLOs with a convex LL subproblem ($\textrm{AIT}_{C}$ for short), where $\psi_{K}(\mathbf{x},\mathbf{z})=F(\mathbf{x},\mathbf{y}_{K}(\mathbf{x},\mathbf{z}))+g(\mathbf{z})$. However, to make the gradient-based BLO scheme efficient and convergent for BLOs with a non-convex LL subproblem, the IA technique alone is not enough. In the next part, we propose another augmentation technique, named Pessimistic Trajectory Truncation, and discuss how to use it together with IA to derive a new gradient-based BLO algorithm that is convergent and efficient for BLOs with non-convex LL subproblems. The detailed convergence analysis can be found in the next section.
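A compact sketch of Alg. 2 on an illustrative quadratic toy problem (the objectives $f(x,y)=(y-x)^2$, $F(x,y)=(y-1)^2+x^2$, the step sizes and the prior $z_p$ are all assumptions); forward-mode accumulation of the derivatives of $\mathbf{y}_K$ w.r.t. $x$ and $z$ stands in for ITD when forming $\nabla_{\mathbf{x}}\psi_K$ and $\nabla_{\mathbf{z}}\psi_K$:

```python
# AIT_C sketch with g(z) = lam * (z - z_p)^2 as the prior regularization.
def ait_c(T=200, K=50, beta=0.1, a_x=0.05, a_z=0.05, lam=0.01, z_p=0.0):
    x, z = 0.0, 0.0
    for _ in range(T):
        y, Jx, Jz = z, 0.0, 1.0            # y_0(x, z) = z, so dy_0/dz = 1
        for _ in range(K):
            y = y - beta * 2.0 * (y - x)   # T_{k+1}: gradient step on f
            Jx = Jx - beta * (2.0 * Jx - 2.0)   # accumulate dy_k/dx
            Jz = Jz * (1.0 - 2.0 * beta)        # accumulate dy_k/dz
        gF_y = 2.0 * (y - 1.0)             # dF/dy at (x, y_K)
        x = x - a_x * (2.0 * x + Jx * gF_y)                 # grad_x psi_K
        z = z - a_z * (Jz * gF_y + 2.0 * lam * (z - z_p))   # grad_z psi_K
    return x, z

x_final, z_final = ait_c()
assert abs(x_final - 0.5) < 1e-3   # minimizer of phi(x) = (x-1)^2 + x^2
```

Both projections are the identity in this unconstrained sketch; Steps 9 and 11 of Alg. 2 correspond to the last two updates in the outer loop.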

3.2 BLOs with Non-convex LL Subproblem

As another important issue with the existing gradient-based BLO scheme, it selects the whole iterative trajectory (i.e., $\mathcal{T}_{0:K}$) to perform the hyper-gradient calculation. When the LL subproblem is non-convex, the subsequences of $\mathcal{T}_{0:K}$ show no consistent convergence behavior as in the LLC case, even when the initialization is continuously optimized by the IA technique. Therefore, directly using $\mathcal{T}_{0:K}$ results in an improper approximation of $F(\mathbf{x},\mathbf{y}_{K}(\mathbf{x}))$: $\nabla_{\mathbf{y}}f\left(\mathbf{x},\mathbf{y}_{K}(\mathbf{x})\right)$ may not uniformly converge to zero w.r.t. $\mathbf{x}\in\mathcal{X}$ while $F(\mathbf{x},\mathbf{y}_{K}(\mathbf{x}))$ converges to the true BLO objective $\varphi(\mathbf{x})$. In short, this empirical strategy for constructing $\mathcal{T}_{0:K}$ invalidates convergence and may result in oscillation or even divergence of the optimization process in the non-convex case.

Based on the approximation formats constructed in Eq. (5), we introduce another significant technique, Pessimistic Trajectory Truncation (PTT), to extend the application scenarios and derive new convergence results without requiring the LLC assumption. The idea of the PTT operation is inspired by the convergence behavior of the projected gradient method on non-convex optimization problems. We define $\mathcal{R}_{\alpha}(\mathbf{x},\mathbf{y}):=\mathbf{y}-\mathtt{Proj}_{\mathcal{Y}}\left(\mathbf{y}-\alpha\nabla_{\mathbf{y}}f(\mathbf{x},\mathbf{y})\right)$ as the proximal gradient residual mapping, which measures the optimality of the LL subproblem. When the LL subproblem is non-convex, $\mathcal{R}_{\alpha}(\mathbf{x},\mathbf{y}_{K}(\mathbf{x},\mathbf{z}))$ may not uniformly converge to zero w.r.t. $\mathbf{x}$ and $\mathbf{z}$, so directly embedding $\mathbf{y}_{K}(\mathbf{x},\mathbf{z})$ into the UL objective $F(\mathbf{x},\mathbf{y})$ does not necessarily provide a proper approximation strategy, as mentioned before. Fortunately, it is well understood that as $K$ tends to infinity, $\min_{0\leq k\leq K}\|\mathcal{R}_{\alpha}(\mathbf{x},\mathbf{y}_{k}(\mathbf{x},\mathbf{z}))\|$ uniformly converges to zero w.r.t. $\mathbf{x}$ and $\mathbf{z}$.

However, it remains impractical to explicitly select i(K):=argmin0kKα(𝐱,𝐲k(𝐱,𝐳))i(K):=\arg\min_{0\leq k\leq K}\|\mathcal{R}_{\alpha}(\mathbf{x},\mathbf{y}_{k}(\mathbf{x},\mathbf{z}))\| and use 𝐲i(K)(𝐱,𝐳)\mathbf{y}_{i(K)}(\mathbf{x},\mathbf{z}) to construct an approximation. As a compromise, we choose to minimize the worst case of 𝐲i(K)(𝐱,𝐳){\mathbf{y}_{i(K)}(\mathbf{x},\mathbf{z})} in terms of the UL objective values, i.e., ϕK¯(𝐱,𝐳):=max1kK{F(𝐱,𝐲k(𝐱,𝐳))}\phi_{\bar{K}}(\mathbf{x},\mathbf{z}):=\max_{1\leq k\leq K}\left\{F(\mathbf{x},\mathbf{y}_{k}(\mathbf{x},\mathbf{z}))\right\}, which guarantees the convergence of α(𝐱,𝐲k(𝐱,𝐳))\mathcal{R}_{\alpha}(\mathbf{x},\mathbf{y}_{k}(\mathbf{x},\mathbf{z})) no matter how this i(K)i(K) is chosen. The theoretical convergence guarantee for the above approximation is presented in Section 4. In practice, the varying K¯\bar{K} takes the behavior of the UL objective during the LL optimization into account to truncate the whole iterative trajectory. The truncated 𝒯0:K¯\mathcal{T}_{0:\bar{K}} thus represents the whole trajectory for hyper-gradient calculation more faithfully, which cannot be achieved by using a fixed KK without the PTT operation.
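To make the PTT operation concrete, the following sketch computes the proximal gradient residual along a projected gradient LL trajectory and selects the truncation index. The one-dimensional objectives F and f below are illustrative toy choices of ours, not instances from the paper:

```python
import numpy as np

# Illustrative 1-D toy instance (our own choice): F is the UL objective and
# f, given via grad_f, is nonconvex in y since f_yy = 1 + 1.5*cos(y).
def F(x, y): return (y - 1.0)**2
def grad_f(x, y): return (y - x) + 1.5*np.sin(y)

def proj_Y(v, lo=-4.0, hi=4.0):  # projection onto the box Y = [-4, 4]
    return np.clip(v, lo, hi)

def residual(x, y, alpha):
    # R_alpha(x, y) = y - Proj_Y(y - alpha * grad_y f(x, y))
    return y - proj_Y(y - alpha*grad_f(x, y))

x, z, alpha, K = 0.5, 3.0, 0.3, 50
ys = [z]                               # y_0 = z (the IA variable)
for _ in range(K):
    ys.append(proj_Y(ys[-1] - alpha*grad_f(x, ys[-1])))

# min_k ||R_alpha(x, y_k)|| vanishes as K grows, but the index attaining it
# is unknown a priori; PTT instead truncates at the worst UL value.
min_res = min(abs(residual(x, y, alpha)) for y in ys)
K_bar = max(range(1, K + 1), key=lambda k: F(x, ys[k]))
```

The index attaining the minimal residual is unknown in advance, while K̄ is computable from the UL values alone, which is exactly the compromise PTT makes.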

By introducing both the IA and PTT operations, our new algorithm avoids directly using the oscillating iterative trajectory of LL variables for hyper-gradient calculation, and successfully removes the LLC assumption so as to handle non-convex scenarios in various real-world applications. In addition, the PTT operation typically yields a small K¯:=argmax1kK{F(𝐱,𝐲k(𝐱,𝐳))}\bar{K}:=\arg\max_{1\leq k\leq K}\left\{F(\mathbf{x},\mathbf{y}_{k}(\mathbf{x},\mathbf{z}))\right\}, which effectively shortens the backpropagation path for computing the indirect hyper-gradient and thus saves computational cost in practice. We also conduct a running time analysis on the nonconvex numerical example to confirm this conclusion numerically.

Note that the iterative mapping scheme 𝒯k+1(𝐱,𝐲k(𝐱,𝐳))\mathcal{T}_{k+1}\left(\mathbf{x},\mathbf{y}_{k}(\mathbf{x},\mathbf{z})\right) can be chosen as the projected gradient method in Eq. (8). Besides, we also investigate an accelerated gradient scheme [29] to improve the efficiency in this non-convex case, which is given as follows

{𝐮0(𝐱,𝐳)=𝐯0(𝐱,𝐳)=𝐳,𝐲k+1(𝐱,𝐳)=(1αk)𝐮k(𝐱,𝐳)+αk𝐯k(𝐱,𝐳),𝐮k+1(𝐱,𝐳)=𝙿𝚛𝚘𝚓𝒴(𝐲k+1(𝐱,𝐳)βk𝐲f(𝐱,𝐲k+1(𝐱,𝐳))),𝐯k+1(𝐱,𝐳)=𝙿𝚛𝚘𝚓𝒴(𝐯k(𝐱,𝐳)λk𝐲f(𝐱,𝐲k+1(𝐱,𝐳))),𝒯k+1(𝐱,𝐲k(𝐱,𝐳))=𝐲k+1(𝐱,𝐳),\begin{cases}\begin{aligned} &\mathbf{u}_{0}(\mathbf{x},\mathbf{z})=\mathbf{v}_{0}(\mathbf{x},\mathbf{z})=\mathbf{z},\\ &\mathbf{y}_{k+1}(\mathbf{x},\mathbf{z})=(1-\alpha_{k})\mathbf{u}_{k}(\mathbf{x},\mathbf{z})+\alpha_{k}\mathbf{v}_{k}(\mathbf{x},\mathbf{z}),\\ &\mathbf{u}_{k+1}(\mathbf{x},\mathbf{z})=\mathtt{Proj}_{\mathcal{Y}}\left(\mathbf{y}_{k+1}(\mathbf{x},\mathbf{z})-\beta_{k}\nabla_{\mathbf{y}}f(\mathbf{x},\mathbf{y}_{k+1}(\mathbf{x},\mathbf{z}))\right),\\ &\mathbf{v}_{k+1}(\mathbf{x},\mathbf{z})=\mathtt{Proj}_{\mathcal{Y}}\left(\mathbf{v}_{k}(\mathbf{x},\mathbf{z})-\lambda_{k}\nabla_{\mathbf{y}}f(\mathbf{x},\mathbf{y}_{k+1}(\mathbf{x},\mathbf{z}))\right),\\ &\mathcal{T}_{k+1}\left(\mathbf{x},\mathbf{y}_{k}(\mathbf{x},\mathbf{z})\right)=\mathbf{y}_{k+1}(\mathbf{x},\mathbf{z}),\end{aligned}\end{cases} (10)

where αk=2k+1\alpha_{k}=\frac{2}{k+1}, βk=12Lf\beta_{k}=\frac{1}{2L_{f}} and λk=kβk2\lambda_{k}=\frac{k\beta_{k}}{2}, and k=0,,K1k=0,\ldots,K-1.
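As a sanity check, the scheme in Eq. (10) with the parameter schedule above can be transcribed directly; the quadratic LL objective below (with L_f = 1, so that the limit point equals x) is an illustrative stand-in for f:

```python
import numpy as np

# Sketch of Eq. (10) with alpha_k = 2/(k+1), beta_k = 1/(2*L_f),
# lambda_k = k*beta_k/2, on the illustrative objective f(x, y) = 0.5*(y - x)^2.
L_f = 1.0
def grad_f(x, y): return y - x           # grad_y f for the toy quadratic

def proj_Y(v, lo=-4.0, hi=4.0): return np.clip(v, lo, hi)

def accelerated_trajectory(x, z, K):
    u = v = z                            # u_0 = v_0 = z
    ys = []
    for k in range(K):
        alpha_k = 2.0/(k + 1)
        beta_k = 1.0/(2.0*L_f)
        lambda_k = k*beta_k/2.0
        y = (1 - alpha_k)*u + alpha_k*v  # y_{k+1}
        g = grad_f(x, y)
        u = proj_Y(y - beta_k*g)         # u_{k+1}
        v = proj_Y(v - lambda_k*g)       # v_{k+1}
        ys.append(y)                     # T_{k+1}(x, y_k) = y_{k+1}
    return ys

ys = accelerated_trajectory(x=0.5, z=3.0, K=60)
```

For a genuinely nonconvex f, only the minimal residual along the trajectory is guaranteed to vanish (cf. Theorem 4.7 below), so the last iterate need not be the best one.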

Algorithm 3 The Proposed AITNC\textrm{AIT}_{NC} Algorithm
0:  UL iteration TT, LL iteration KK.
1:  Initialize 𝐱0\mathbf{x}^{0} and 𝐳0\mathbf{z}^{0}.
2:  for t=0T1t=0\rightarrow T-1 do
3:     𝐲0=𝐳t\mathbf{y}_{0}=\mathbf{z}^{t}.
4:     for k=0K1k=0\rightarrow K-1 do
5:        % Update yky_{k} with 𝐱t\mathbf{x}^{t} and 𝐳t\mathbf{z}^{t}.
6:        𝐲k+1(𝐱t,𝐳t)=𝒯k+1(𝐱t,𝐲k(𝐱t,𝐳t))\mathbf{y}_{k+1}(\mathbf{x}^{t},\mathbf{z}^{t})=\mathcal{T}_{k+1}\left(\mathbf{x}^{t},\mathbf{y}_{k}(\mathbf{x}^{t},\mathbf{z}^{t})\right).
7:     end for
8:     % Pessimistic Trajectory Truncation.
9:     K¯=argmax1kK{F(𝐱t,𝐲k(𝐱t,𝐳t))}\bar{K}=\arg\max_{1\leq k\leq K}\{F(\mathbf{x}^{t},\mathbf{y}_{k}(\mathbf{x}^{t},\mathbf{z}^{t}))\}.
10:     % Update 𝐱\mathbf{x} with 𝐲K¯(𝐱t,𝐳t)\mathbf{y}_{\bar{K}}(\mathbf{x}^{t},\mathbf{z}^{t}).
11:     Denote ϕK¯(𝐱,𝐳)=F(𝐱,𝐲K¯(𝐱,𝐳))\phi_{\bar{K}}(\mathbf{x},\mathbf{z})=F(\mathbf{x},\mathbf{y}_{\bar{K}}(\mathbf{x},\mathbf{z})).
12:     𝐱t+1=𝙿𝚛𝚘𝚓𝒳(𝐱tα𝐱𝐱ϕK¯(𝐱t,𝐳t))\mathbf{x}^{t+1}=\mathtt{Proj}_{\mathcal{X}}(\mathbf{x}^{t}-\alpha_{\mathbf{x}}\nabla_{\mathbf{x}}\phi_{\bar{K}}(\mathbf{x}^{t},\mathbf{z}^{t})).
13:     % Update 𝐳\mathbf{z} with 𝐲K¯(𝐱t,𝐳t)\mathbf{y}_{\bar{K}}(\mathbf{x}^{t},\mathbf{z}^{t}).
14:     𝐳t+1=𝙿𝚛𝚘𝚓𝒵(𝐳tα𝐳𝐳ϕK¯(𝐱t,𝐳t))\mathbf{z}^{t+1}=\mathtt{Proj}_{\mathcal{Z}}(\mathbf{z}^{t}-\alpha_{\mathbf{z}}\nabla_{\mathbf{z}}\phi_{\bar{K}}(\mathbf{x}^{t},\mathbf{z}^{t})).
15:  end for
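A minimal numerical sketch of Alg. 3 is given below. The toy UL/LL objectives are our own illustrative choices, and the hyper-gradient is approximated by finite differences instead of backpropagation to keep the sketch dependency-free:

```python
import numpy as np

# Numerical sketch of Alg. 3 (AIT_NC) on an illustrative 1-D toy problem:
# F(x, y) = (y - 1)^2 + x^2 and f(x, y) = sin(x*y) + 0.5*y^2 (nonconvex in y
# for suitable x). These objectives are made up for demonstration.
def F(x, y): return (y - 1.0)**2 + x**2
def grad_f_y(x, y): return x*np.cos(x*y) + y

def proj(v, lo=-5.0, hi=5.0): return np.clip(v, lo, hi)

def phi_bar(x, z, K=20, alpha=0.1):
    """PTT value: worst UL objective along the LL trajectory started at z."""
    y, worst = z, -np.inf
    for _ in range(K):
        y = proj(y - alpha*grad_f_y(x, y))
        worst = max(worst, F(x, y))      # implements max_{1<=k<=K} F(x, y_k)
    return worst

def ait_nc(x=2.0, z=2.0, T=200, lr=0.05, eps=1e-5):
    for _ in range(T):
        # finite-difference surrogate for the hyper-gradient of phi_bar
        gx = (phi_bar(x + eps, z) - phi_bar(x - eps, z))/(2*eps)
        gz = (phi_bar(x, z + eps) - phi_bar(x, z - eps))/(2*eps)
        x = proj(x - lr*gx)              # UL update on x
        z = proj(z - lr*gz)              # IA update on z
    return x, z

phi_init = phi_bar(2.0, 2.0)
x_T, z_T = ait_nc()
phi_final = phi_bar(x_T, z_T)
```

On this toy instance the truncated objective decreases markedly from its initial value; in practice the gradients of ϕ_K̄ would be obtained by backpropagating through the truncated trajectory rather than by finite differences.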

With the two proposed augmentation techniques and the supported accelerated gradient scheme, the complete algorithm of our AIT for BLOs with a Non-convex LL subproblem (AITNC\textrm{AIT}_{NC} for short) is illustrated in Alg. 3. In the following sections, we provide the asymptotic convergence analyses of AITC\textrm{AIT}_{C} and AITNC\textrm{AIT}_{NC} and further discuss the applicable scenarios of the different variations.

4 Theoretical Analysis

In this section, we present the asymptotic convergence analysis of AITC\textrm{AIT}_{C} in Alg. 2 and AITNC\textrm{AIT}_{NC} in Alg. 3 under mild conditions on the general schematic iterative module 𝒯k(𝐱,)\mathcal{T}_{k}(\mathbf{x},\cdot), and show that these conditions are satisfied for some specific choices of 𝒯k(𝐱,)\mathcal{T}_{k}(\mathbf{x},\cdot). Specifically, for convex BLOs, we show the asymptotic convergence of Alg. 2 when 𝒯k(𝐱,)\mathcal{T}_{k}(\mathbf{x},\cdot) is chosen as BDA [19] in Eq. (7). For convex BLOs whose LL solution set is a singleton, we show the asymptotic convergence of Alg. 2 when 𝒯k(𝐱,)\mathcal{T}_{k}(\mathbf{x},\cdot) is chosen as the projected gradient method in Eq. (8) or Nesterov's acceleration scheme in Eq. (9). For nonconvex BLOs, we show the asymptotic convergence of Alg. 3 when 𝒯k(𝐱,)\mathcal{T}_{k}(\mathbf{x},\cdot) is chosen as the projected gradient method or the accelerated gradient scheme proposed by [29] in Eq. (10).

4.1 Convergence Analysis for AITC\textrm{AIT}_{C}

We first present the asymptotic convergence analysis of Alg. 2 for convex BLOs with a general schematic iterative module 𝒯k(𝐱,)\mathcal{T}_{k}(\mathbf{x},\cdot). Alg. 2 can be regarded as the application of the projected gradient method to the following single-level approximation problem of Eq. (1),

min𝐱𝒳,𝐳𝒵F(𝐱,𝐲K(𝐱,𝐳))+g(𝐳)\displaystyle\min_{\mathbf{x}\in\mathcal{X},\mathbf{z}\in\mathcal{Z}}~{}~{}F(\mathbf{x},\mathbf{y}_{K}(\mathbf{x},\mathbf{z}))+g(\mathbf{z}) (11)
s.t.{𝐲0(𝐱,𝐳)=𝐳,𝐲k+1(𝐱,𝐳)=𝒯k(𝐱,𝐲k(𝐱,𝐳)),\displaystyle s.t.~{}\begin{cases}\mathbf{y}_{0}(\mathbf{x},\mathbf{z})=\mathbf{z},&\\ \mathbf{y}_{k+1}(\mathbf{x},\mathbf{z})=\mathcal{T}_{k}(\mathbf{x},\mathbf{y}_{k}(\mathbf{x},\mathbf{z})),&\end{cases}

where k=0,,K1k=0,\ldots,K-1, and the function g:mg:\mathbb{R}^{m}\rightarrow\mathbb{R} and the compact set 𝒵m\mathcal{Z}\subset\mathbb{R}^{m} encode prior information on the lower-level variable 𝐲\mathbf{y}. To show the asymptotic convergence of Alg. 2, i.e., that the approximation problem in Eq. (11) converges to the original bilevel problem min𝐱𝒳φ(𝐱)\min_{\mathbf{x}\in\mathcal{X}}\varphi(\mathbf{x}) in Eq. (1), we make the following mild assumptions on BLOs throughout this subsection.

Assumption 4.1.

We make the following standing assumptions throughout this subsection:

  • (1)

    F,𝐲F,f,𝐲fF,\nabla_{\mathbf{y}}F,f,\nabla_{\mathbf{y}}f are continuous functions on 𝒳×m\mathcal{X}\times\mathbb{R}^{m}.

  • (2)

    F(𝐱,):mF(\mathbf{x},\cdot):\mathbb{R}^{m}\rightarrow\mathbb{R} is LFL_{F}-smooth, convex and bounded below by M0M_{0} for any 𝐱𝒳\mathbf{x}\in\mathcal{X}.

  • (3)

    f(𝐱,):mf(\mathbf{x},\cdot):\mathbb{R}^{m}\rightarrow\mathbb{R} is LfL_{f}-smooth and convex for any 𝐱𝒳\mathbf{x}\in\mathcal{X}. f(𝐱,𝐲)f(\mathbf{x},\mathbf{y}) is level-bounded in 𝐲\mathbf{y} uniformly in 𝐱𝒳\mathbf{x}\in\mathcal{X}.

  • (4)

    𝒳\mathcal{X}, 𝒴\mathcal{Y} and 𝒵\mathcal{Z} are compact sets, and 𝒴\mathcal{Y} is a convex set.

  • (5)

    𝒮^(𝐱):=argmin𝐲{F(𝐱,𝐲),s.t.𝐲𝒴𝒮(𝐱)}\hat{\mathcal{S}}(\mathbf{x}):=\mathrm{argmin}_{\mathbf{y}}\{F(\mathbf{x},\mathbf{y}),\ s.t.\ \mathbf{y}\in\mathcal{Y}\cap\mathcal{S}(\mathbf{x})\} is nonempty for all 𝐱𝒳\mathbf{x}\in\mathcal{X}.

  • (6)

    g(𝐳)g(\mathbf{z}) is continuous on 𝒵\mathcal{Z}.

Assumption 4.1 is standard for convex BLOs in the related literature, and it is satisfied by the convex BLO numerical example given in Section 6.1. Before presenting the convergence result, we recall the auxiliary function φK(𝐱,𝐳)=F(𝐱,𝐲K(𝐱,𝐳))\varphi_{K}(\mathbf{x},\mathbf{z})=F(\mathbf{x},\mathbf{y}_{K}(\mathbf{x},\mathbf{z})) and rewrite the approximation problem in Eq. (11) into the following compact form,

min𝐱𝒳,𝐳𝒵ψK(𝐱,𝐳)=φK(𝐱,𝐳)+g(𝐳).\min_{\mathbf{x}\in\mathcal{X},\mathbf{z}\in\mathcal{Z}}\psi_{K}(\mathbf{x},\mathbf{z})=\varphi_{K}(\mathbf{x},\mathbf{z})+g(\mathbf{z}). (12)

We propose the following two essential properties of the general schematic iterative module 𝒯k(𝐱,)\mathcal{T}_{k}(\mathbf{x},\cdot) for the convergence analysis:

  1. (1)

    UL property with IA: For each 𝐱𝒳\mathbf{x}\in\mathcal{X}, 𝐳𝒵\mathbf{z}\in\mathcal{Z},

    limKφK(𝐱,𝐳)=φ(𝐱).\lim\limits_{K\rightarrow\infty}\varphi_{K}(\mathbf{x},\mathbf{z})=\varphi(\mathbf{x}).
  2. (2)

    LL property with IA: {𝐲K(𝐱,𝐳)}\{\mathbf{y}_{K}(\mathbf{x},\mathbf{z})\} is uniformly bounded on 𝒳×𝒵\mathcal{X}\times\mathcal{Z}, and for any ϵ>0\epsilon>0, there exists k(ϵ)>0k(\epsilon)>0 such that whenever K>k(ϵ)K>k(\epsilon),

    sup𝐱𝒳,𝐳𝒵{f(𝐱,𝐲K(𝐱,𝐳))f(𝐱)}ϵ.\sup_{\mathbf{x}\in\mathcal{X},\mathbf{z}\in\mathcal{Z}}\left\{f(\mathbf{x},\mathbf{y}_{K}(\mathbf{x},\mathbf{z}))-f^{\ast}(\mathbf{x})\right\}\leq\epsilon.

Now, we are ready to present the asymptotic convergence result of Alg. 2.

Theorem 4.1.

Suppose both the above UL and LL properties with IA hold. Let (𝐱K,𝐳K)(\mathbf{x}_{K},\mathbf{z}_{K}) be a minimum of the approximation problem in Eq. (11), i.e.,

ψK(𝐱K,𝐳K)ψK(𝐱,𝐳),𝐱𝒳,𝐳𝒵.\psi_{K}(\mathbf{x}_{K},\mathbf{z}_{K})\leq\psi_{K}(\mathbf{x},\mathbf{z}),\quad\forall\mathbf{x}\in\mathcal{X},\mathbf{z}\in\mathcal{Z}.

Then, any limit point x¯\bar{x} of the sequence {xK}\{x_{K}\} satisfies that x¯argminxXφ(x)\bar{x}\in\mathrm{argmin}_{x\in X}\varphi(x), i.e., x¯\bar{x} is a solution to BLO in Eq. (1).

Proof.

For any limit point 𝐱¯\bar{\mathbf{x}} of the sequence {𝐱K}\{\mathbf{x}_{K}\}, let {𝐱l}\{\mathbf{x}_{l}\} be a subsequence of {𝐱K}\{\mathbf{x}_{K}\} such that 𝐱l𝐱¯𝒳\mathbf{x}_{l}\rightarrow\bar{\mathbf{x}}\in\mathcal{X}. As {𝐲K(𝐱,𝐳)}\{\mathbf{y}_{K}(\mathbf{x},\mathbf{z})\} is uniformly bounded on 𝒳×𝒵\mathcal{X}\times\mathcal{Z} and {𝐳K}𝒵\{\mathbf{z}_{K}\}\subset\mathcal{Z} is bounded, we can further find a subsequence {(𝐱m,𝐳m)}\{(\mathbf{x}_{m},\mathbf{z}_{m})\} of {(𝐱l,𝐳l)}\{(\mathbf{x}_{l},\mathbf{z}_{l})\} satisfying 𝐲m(𝐱m,𝐳m)𝐲¯\mathbf{y}_{m}(\mathbf{x}_{m},\mathbf{z}_{m})\rightarrow\bar{\mathbf{y}} and 𝐳m𝐳¯\mathbf{z}_{m}\rightarrow\bar{\mathbf{z}} for some 𝐲¯\bar{\mathbf{y}} and 𝐳¯\bar{\mathbf{z}}. It follows from the LL property with IA that for any ϵ>0\epsilon>0, there exists M(ϵ)>0M(\epsilon)>0 such that for any m>M(ϵ)m>M(\epsilon), we have

f(𝐱m,𝐲m(𝐱m,𝐳m))f(𝐱m)ϵ.f(\mathbf{x}_{m},\mathbf{y}_{m}(\mathbf{x}_{m},\mathbf{z}_{m}))-f^{\ast}(\mathbf{x}_{m})\leq\epsilon.

Thanks to the continuity of f(𝐱,𝐲)f(\mathbf{x},\mathbf{y}), we have that f(𝐱):=min𝐲f(𝐱,𝐲)f^{\ast}(\mathbf{x}):=\min_{\mathbf{y}}f(\mathbf{x},\mathbf{y}) is Upper Semi-Continuous (USC for short) on 𝒳\mathcal{X} (see, e.g., [12, Lemma 2]). By letting mm\rightarrow\infty, and since ff is continuous and f(𝐱)f^{\ast}(\mathbf{x}) is USC on 𝒳\mathcal{X}, we have f(𝐱¯,𝐲¯)f(𝐱¯)ϵf(\bar{\mathbf{x}},\bar{\mathbf{y}})-f^{\ast}(\bar{\mathbf{x}})\leq\epsilon. As ϵ\epsilon is arbitrarily chosen, we have f(𝐱¯,𝐲¯)f(𝐱¯)0f(\bar{\mathbf{x}},\bar{\mathbf{y}})-f^{\ast}(\bar{\mathbf{x}})\leq 0 and thus 𝐲¯𝒮(𝐱¯)\bar{\mathbf{y}}\in\mathcal{S}(\bar{\mathbf{x}}). Next, as FF is continuous at (𝐱¯,𝐲¯)(\bar{\mathbf{x}},\bar{\mathbf{y}}) and g is continuous at 𝐳¯\bar{\mathbf{z}}, for any ϵ>0\epsilon>0, there exists M(ϵ)>0M(\epsilon)>0 such that for any m>M(ϵ)m>M(\epsilon), it holds

F(𝐱¯,𝐲¯)+g(𝐳¯)F(𝐱m,𝐲m(𝐱m,𝐳m))+g(𝐳m)+ϵ.F(\bar{\mathbf{x}},\bar{\mathbf{y}})+g(\bar{\mathbf{z}})\leq F(\mathbf{x}_{m},\mathbf{y}_{m}(\mathbf{x}_{m},\mathbf{z}_{m}))+g(\mathbf{z}_{m})+\epsilon.

Then, we have, for any m>M(ϵ)m>M(\epsilon) and 𝐱𝒳\mathbf{x}\in\mathcal{X},

φ(𝐱¯)+g(𝐳¯)\displaystyle\ \varphi(\bar{\mathbf{x}})+g(\bar{\mathbf{z}}) =inf𝐲𝒮(𝐱¯)F(𝐱¯,𝐲)+g(𝐳¯)\displaystyle=\inf_{\mathbf{y}\in\mathcal{S}(\bar{\mathbf{x}})}F(\bar{\mathbf{x}},\mathbf{y})+g(\bar{\mathbf{z}}) (13)
F(𝐱¯,𝐲¯)+g(𝐳¯)\displaystyle\leq F(\bar{\mathbf{x}},\bar{\mathbf{y}})+g(\bar{\mathbf{z}})
F(𝐱m,𝐲m(𝐱m,𝐳m))+g(𝐳m)+ϵ\displaystyle\leq F(\mathbf{x}_{m},\mathbf{y}_{m}(\mathbf{x}_{m},\mathbf{z}_{m}))+g(\mathbf{z}_{m})+\epsilon
φm(𝐱,𝐳)+g(𝐳)+ϵ.\displaystyle\leq\varphi_{m}(\mathbf{x},\mathbf{z})+g(\mathbf{z})+\epsilon.\quad\quad

Taking 𝐳=𝐳¯\mathbf{z}=\bar{\mathbf{z}}, mm\rightarrow\infty and by the UL property with IA, we have

φ(𝐱¯)limmφm(𝐱,𝐳¯)+ϵ=φ(𝐱)+ϵ,𝐱𝒳.\displaystyle\varphi(\bar{\mathbf{x}})\leq\lim_{m\rightarrow\infty}\varphi_{m}(\mathbf{x},\bar{\mathbf{z}})+\epsilon=\varphi(\mathbf{x})+\epsilon,\ \forall\mathbf{x}\in\mathcal{X}.

By taking ϵ0\epsilon\rightarrow 0, we have φ(𝐱¯)φ(𝐱),𝐱𝒳\varphi(\bar{\mathbf{x}})\leq\varphi(\mathbf{x}),\ \forall\mathbf{x}\in\mathcal{X} which implies 𝐱¯argmin𝐱𝒳φ(𝐱)\bar{\mathbf{x}}\in\arg\min_{\mathbf{x}\in\mathcal{X}}\varphi(\mathbf{x}). ∎

The above convergence result shows that the convergence of Eq. (11) to the original bilevel problem in Eq. (1) does not require any restrictive assumptions on gg and 𝒵\mathcal{Z}. This means that we can encode any prior information into gg and 𝒵\mathcal{Z} to speed up the method without compromising the convergence of the algorithm.

Next, we show that both the UL and LL properties with IA hold when 𝒯k(𝐱,)\mathcal{T}_{k}(\mathbf{x},\cdot) is chosen as some specific schemes. Here we provide the asymptotic convergence analysis of Alg. 2 when 𝒯k(𝐱,)\mathcal{T}_{k}(\mathbf{x},\cdot) is defined as BDA [19] in Eq. (7).

Theorem 4.2.

Let {𝐲k(𝐱,𝐳)}\{\mathbf{y}_{k}(\mathbf{x},\mathbf{z})\} be the output generated by (7) with sl(0,1/Lf)s_{l}\in(0,1/L_{f}), su(0,1/LF)s_{u}\in(0,1/L_{F}), μ(0,1)\mu\in(0,1), αk=1k+1\alpha_{k}=\frac{1}{k+1}, βk[β¯,1]\beta_{k}\in[\underline{\beta},1] with some β¯>0\underline{\beta}>0, |βkβk1|cβ(k+1)2|\beta_{k}-\beta_{k-1}|\leq\frac{c_{\beta}}{(k+1)^{2}} with some cβ>0c_{\beta}>0, then we have that both the LL and UL properties with IA hold.

Proof.

We can easily obtain the UL property with IA from [19, Theorem 3]. Next, since Assumption 4.1 holds and 𝒳\mathcal{X}, 𝒴\mathcal{Y} and 𝒵\mathcal{Z} are compact, by using the convergence rate result established in [19, Theorem 4] and arguments similar to those in the proof of [19, Theorem 5], it can be shown that there exists C>0C>0 such that for any 𝐱𝒳\mathbf{x}\in\mathcal{X}, 𝐳𝒵\mathbf{z}\in\mathcal{Z}, we have

f(𝐱,𝐲K(𝐱,𝐳))f(𝐱)C1+lnKK14.f(\mathbf{x},\mathbf{y}_{K}(\mathbf{x},\mathbf{z}))-f^{*}(\mathbf{x})\leq C\sqrt{\frac{1+\ln K}{K^{\frac{1}{4}}}}.

Because 1+lnKK140\sqrt{\frac{1+\ln K}{K^{\frac{1}{4}}}}\rightarrow 0 as KK\rightarrow\infty, {𝐲K(𝐱,𝐳)}𝒴\{\mathbf{y}_{K}(\mathbf{x},\mathbf{z})\}\subset\mathcal{Y}, and 𝒴\mathcal{Y} is compact, LL property with IA holds. ∎

4.1.1 Lower-Level Singleton Case

In this part, we focus on the convex BLOs whose lower-level problem solution set is singleton for any 𝐱𝒳\mathbf{x}\in\mathcal{X}. We first consider constructing the iterative module 𝒯k(𝐱,)\mathcal{T}_{k}(\mathbf{x},\cdot) by projected gradient method as described in Eq. (8).

Theorem 4.3.

Suppose 𝒮(𝐱)\mathcal{S}(\mathbf{x}) is singleton for all 𝐱𝒳\mathbf{x}\in\mathcal{X}. Let {𝐲k(𝐱,𝐳)}\{\mathbf{y}_{k}(\mathbf{x},\mathbf{z})\} be the output generated by (8) with α(0,1/Lf)\alpha\in(0,1/L_{f}), then we have that both the LL and UL convergence properties with IA hold.

Proof.

According to [22, Theorem 10.21], when f(𝐱,)f(\mathbf{x},\cdot) is convex and LfL_{f}-smooth for any 𝐱𝒳\mathbf{x}\in\mathcal{X}, and α(0,1/Lf)\alpha\in(0,1/L_{f}), {𝐲k(𝐱,𝐳)}\{\mathbf{y}_{k}(\mathbf{x},\mathbf{z})\} satisfies

f(𝐱,𝐲K(𝐱,𝐳))f(𝐱)\displaystyle f(\mathbf{x},\mathbf{y}_{K}(\mathbf{x},\mathbf{z}))-f^{\ast}(\mathbf{x}) dist2(𝐲0(𝐱,𝐳),𝒮(𝐱))2αK\displaystyle\leq\frac{\mathrm{dist}^{2}(\mathbf{y}_{0}(\mathbf{x},\mathbf{z}),\mathcal{S}(\mathbf{x}))}{2\alpha K} (14)
=dist2(𝐳,𝒮(𝐱))2αK,\displaystyle=\frac{\mathrm{dist}^{2}(\mathbf{z},\mathcal{S}(\mathbf{x}))}{2\alpha K},

where dist(𝐳,𝒮(𝐱))\mathrm{dist}(\mathbf{z},\mathcal{S}(\mathbf{x})) denotes the distance from 𝐳\mathbf{z} to the set 𝒮(𝐱)\mathcal{S}(\mathbf{x}). Since 𝒳\mathcal{X}, 𝒴\mathcal{Y} and 𝒵\mathcal{Z} are all compact sets, there exists M>0M>0 such that dist(𝐳,𝒮(𝐱))M\mathrm{dist}(\mathbf{z},\mathcal{S}(\mathbf{x}))\leq M for (𝐱,𝐳)𝒳×𝒵(\mathbf{x},\mathbf{z})\in\mathcal{X}\times\mathcal{Z}. Then the LL property with IA follows easily from Eq. (14), {𝐲K(𝐱,𝐳)}𝒴\{\mathbf{y}_{K}(\mathbf{x},\mathbf{z})\}\subset\mathcal{Y} and the compactness of 𝒴\mathcal{Y}. Next, given any (𝐱,𝐳)𝒳×𝒵(\mathbf{x},\mathbf{z})\in\mathcal{X}\times\mathcal{Z}, for any limit point 𝐲¯(𝐱,𝐳)\bar{\mathbf{y}}(\mathbf{x},\mathbf{z}) of the sequence {𝐲k(𝐱,𝐳)}\{\mathbf{y}_{k}(\mathbf{x},\mathbf{z})\}, it follows from Eq. (14) that f(𝐱,𝐲¯(𝐱,𝐳))=f(𝐱)f(\mathbf{x},\bar{\mathbf{y}}(\mathbf{x},\mathbf{z}))=f^{*}(\mathbf{x}) and thus 𝐲¯(𝐱,𝐳)𝒮(𝐱)\bar{\mathbf{y}}(\mathbf{x},\mathbf{z})\in\mathcal{S}(\mathbf{x}). Therefore, since {𝐲k(𝐱,𝐳)}\{\mathbf{y}_{k}(\mathbf{x},\mathbf{z})\} is bounded and 𝒮(𝐱)\mathcal{S}(\mathbf{x}) is a singleton, we have 𝐲k(𝐱,𝐳)𝒮(𝐱)=𝒮^(𝐱)\mathbf{y}_{k}(\mathbf{x},\mathbf{z})\rightarrow\mathcal{S}(\mathbf{x})=\hat{\mathcal{S}}(\mathbf{x}) and thus the UL property with IA holds. ∎
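The O(1/K) bound used in this proof can be checked numerically. In the sketch below, the strongly convex quadratic is an illustrative LL objective with singleton solution set S(x) = {x}, and the bound is instantiated with the squared initial distance in the numerator, following [22, Theorem 10.21]:

```python
import numpy as np

# Projected gradient on the illustrative convex LL objective
# f(x, y) = 0.5*||y - x||^2, whose solution set S(x) = {x} is a singleton.
def run_pg(x, z, alpha, K):
    y = z.copy()
    for _ in range(K):
        y = np.clip(y - alpha*(y - x), -5.0, 5.0)    # Proj_Y with Y a box
    return y

x = np.array([0.3, -0.8])
z = np.array([2.0, 2.0])                             # y_0 = z
alpha = 0.5                                          # alpha in (0, 1/L_f), L_f = 1
gaps, bounds = [], []
for K in (10, 50, 250):
    yK = run_pg(x, z, alpha, K)
    gaps.append(0.5*np.sum((yK - x)**2))             # f(x, y_K) - f*(x)
    bounds.append(np.sum((z - x)**2)/(2*alpha*K))    # dist^2(z, S(x))/(2*alpha*K)
```

On this toy instance the gap decays geometrically, so the sublinear bound holds with plenty of slack at every tested K.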

Next, we consider the iterative module 𝒯k(𝐱,)\mathcal{T}_{k}(\mathbf{x},\cdot) chosen as Nesterov's acceleration scheme described in Eq. (9).

Theorem 4.4.

Suppose 𝒮(𝐱)\mathcal{S}(\mathbf{x}) is singleton for all 𝐱𝒳\mathbf{x}\in\mathcal{X}. Let {𝐲k(𝐱,𝐳)}\{\mathbf{y}_{k}(\mathbf{x},\mathbf{z})\} be the output generated by Eq. (9) with α(0,1/Lf)\alpha\in(0,1/L_{f}), then we have that both the LL and UL convergence properties with IA hold.

Proof.

According to [22, Theorem 10.34], when f(𝐱,)f(\mathbf{x},\cdot) is convex and LfL_{f}-smooth for any 𝐱𝒳\mathbf{x}\in\mathcal{X}, and α(0,1/Lf)\alpha\in(0,1/L_{f}), {𝐲k(𝐱,𝐳)}\{\mathbf{y}_{k}(\mathbf{x},\mathbf{z})\} admits the following property,

f(𝐱,𝐲K(𝐱,𝐳))f(𝐱)\displaystyle f(\mathbf{x},\mathbf{y}_{K}(\mathbf{x},\mathbf{z}))-f^{\ast}(\mathbf{x}) 2dist2(𝐲0(𝐱,𝐳),𝒮(𝐱))α(K+1)2\displaystyle\leq\frac{2\,\mathrm{dist}^{2}(\mathbf{y}_{0}(\mathbf{x},\mathbf{z}),\mathcal{S}(\mathbf{x}))}{\alpha(K+1)^{2}} (15)
=2dist2(𝐳,𝒮(𝐱))α(K+1)2.\displaystyle=\frac{2\,\mathrm{dist}^{2}(\mathbf{z},\mathcal{S}(\mathbf{x}))}{\alpha(K+1)^{2}}.

Since 𝒳\mathcal{X}, 𝒴\mathcal{Y} and 𝒵\mathcal{Z} are compact sets, by using the above inequality and the similar arguments as in the proof of Theorem 4.3, we can show that both the LL and UL convergence properties with IA hold. ∎
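Eq. (9) is defined earlier in the paper; as a stand-in, the standard Nesterov/FISTA-style projected iteration below exhibits the same O(1/K²) behavior on an illustrative quadratic. The bound is instantiated with the squared initial distance, cf. [22, Theorem 10.34]:

```python
import numpy as np

# Standard Nesterov accelerated projected-gradient iteration (an illustrative
# stand-in for Eq. (9)) on the toy convex LL objective f(x, y) = 0.5*(y - x)^2.
def nesterov(x, z, alpha, K):
    y_prev, w, t = z, z, 1.0
    for _ in range(K):
        y = np.clip(w - alpha*(w - x), -5.0, 5.0)    # projected step at w
        t_next = (1 + np.sqrt(1 + 4*t*t))/2
        w = y + ((t - 1)/t_next)*(y - y_prev)        # momentum extrapolation
        y_prev, t = y, t_next
    return y_prev

x, z, alpha = 0.3, 2.0, 0.9                          # alpha in (0, 1/L_f), L_f = 1
results = []
for K in (5, 20, 80):
    gap = 0.5*(nesterov(x, z, alpha, K) - x)**2      # f(x, y_K) - f*(x)
    bound = 2*(z - x)**2/(alpha*(K + 1)**2)          # 2*dist^2/(alpha*(K+1)^2)
    results.append((gap, bound))
```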

4.2 Convergence Analysis for AITNC\textrm{AIT}_{NC}

In this part, we first conduct the asymptotic convergence analysis of AITNC\textrm{AIT}_{NC} in Alg. 3 with a general schematic iterative module 𝒯k\mathcal{T}_{k}. Indeed, Alg. 3 can be regarded as the application of the projected gradient method to the following single-level approximation problem of Eq. (1),

min𝐱𝒳,𝐳𝒴ϕK¯(𝐱,𝐳):=max1kK{F(𝐱,𝐲k(𝐱,𝐳))}\displaystyle\min_{\mathbf{x}\in\mathcal{X},\mathbf{z}\in\mathcal{Y}}~{}~{}\phi_{\bar{K}}(\mathbf{x},\mathbf{z}):=\max_{1\leq k\leq K}\left\{F(\mathbf{x},\mathbf{y}_{k}(\mathbf{x},\mathbf{z}))\right\} (16)
s.t.{𝐲0(𝐱,𝐳)=𝐳,𝐲k+1(𝐱,𝐳)=𝒯k(𝐱,𝐲k(𝐱,𝐳)),k=0,,K1.\displaystyle s.t.\begin{cases}\mathbf{y}_{0}(\mathbf{x},\mathbf{z})=\mathbf{z},&\\ \mathbf{y}_{k+1}(\mathbf{x},\mathbf{z})=\mathcal{T}_{k}(\mathbf{x},\mathbf{y}_{k}(\mathbf{x},\mathbf{z})),~{}~{}k=0,\ldots,K-1.&\end{cases}

Then we investigate the convergence of solutions of this approximation problem towards those of the original BLO under the following mild assumptions.

Assumption 4.2.

We make the following standing assumptions throughout this subsection:

  • (1)

    F,f,𝐲fF,f,\nabla_{\mathbf{y}}f are continuous functions on 𝒳×m\mathcal{X}\times\mathbb{R}^{m}.

  • (2)

    f(𝐱,):mf(\mathbf{x},\cdot):\mathbb{R}^{m}\rightarrow\mathbb{R} is LfL_{f}-smooth for any 𝐱𝒳\mathbf{x}\in\mathcal{X}.

  • (3)

    𝒳\mathcal{X} and 𝒴\mathcal{Y} are convex compact sets.

  • (4)

    𝒮(𝐱)\mathcal{S}(\mathbf{x}) is nonempty for any 𝐱𝒳\mathbf{x}\in\mathcal{X}.

  • (5)

    For any (𝐱¯,𝐲¯)(\bar{\mathbf{x}},\bar{\mathbf{y}}) minimizing F(𝐱,𝐲)F(\mathbf{x},\mathbf{y}) over the constraints 𝐱𝒳,𝐲𝒴\mathbf{x}\in\mathcal{X},\mathbf{y}\in\mathcal{Y} and 𝐲𝒮~(𝐱)\mathbf{y}\in\tilde{\mathcal{S}}(\mathbf{x}), it holds that 𝐲¯𝒮(𝐱¯)\bar{\mathbf{y}}\in\mathcal{S}(\bar{\mathbf{x}}).

Note that 𝒮~(𝐱)\tilde{\mathcal{S}}(\mathbf{x}) denotes the set of LL stationary points, i.e., 𝒮~(𝐱)={𝐲𝒴|0𝐲f(𝐱,𝐲)+𝒩𝒴(𝐲)}.\tilde{\mathcal{S}}(\mathbf{x})=\{\mathbf{y}\in\mathcal{Y}|0\in\nabla_{\mathbf{y}}f(\mathbf{x},\mathbf{y})+\mathcal{N}_{\mathcal{Y}}(\mathbf{y})\}. Assumption 4.2 is standard in the bilevel optimization literature, and it is satisfied by the numerical example given in Section 6.1.

We propose the weak LL property with IA for the asymptotic convergence analysis of Alg. 3 with a general schematic iterative module 𝒯k(𝐱,)\mathcal{T}_{k}(\mathbf{x},\cdot). We say that the weak LL property with IA is satisfied if {𝐲K(𝐱,𝐳)}\{\mathbf{y}_{K}(\mathbf{x},\mathbf{z})\} is uniformly bounded on 𝒳×𝒵\mathcal{X}\times\mathcal{Z}, and there exists α>0\alpha>0 such that for any ϵ>0\epsilon>0, there exists k(ϵ)>0k(\epsilon)>0 such that whenever K>k(ϵ)K>k(\epsilon),

sup𝐱𝒳,𝐳𝒵{min0kKα(𝐱,𝐲k(𝐱,𝐳))}ϵ,\sup_{\mathbf{x}\in\mathcal{X},\mathbf{z}\in\mathcal{Z}}\left\{\min_{0\leq k\leq K}\|\mathcal{R}_{\alpha}(\mathbf{x},\mathbf{y}_{k}(\mathbf{x},\mathbf{z}))\|\right\}\leq\epsilon,

where

α(𝐱,𝐲):=𝐲𝙿𝚛𝚘𝚓𝒴(𝐲α𝐲f(𝐱,𝐲)).\mathcal{R}_{\alpha}(\mathbf{x},\mathbf{y}):=\mathbf{y}-\mathtt{Proj}_{\mathcal{Y}}\left(\mathbf{y}-\alpha\nabla_{\mathbf{y}}f(\mathbf{x},\mathbf{y})\right).

Note that α(𝐱,𝐲)=0\mathcal{R}_{\alpha}(\mathbf{x},\mathbf{y})=0 if and only if 𝐲𝒮~(𝐱)\mathbf{y}\in\tilde{\mathcal{S}}(\mathbf{x}). In the following theorem, we establish the asymptotic convergence of Alg. 3 when the weak LL property with IA is satisfied.

Theorem 4.5.

Suppose the above weak LL property with IA holds. Let (𝐱K,𝐳K)(\mathbf{x}_{K},\mathbf{z}_{K}) be a minimum of approximation problem in Eq. (16), i.e.,

ϕK(𝐱K,𝐳K)ϕK(𝐱,𝐳),𝐱𝒳,𝐳𝒵.\phi_{K}(\mathbf{x}_{K},\mathbf{z}_{K})\leq\phi_{K}(\mathbf{x},\mathbf{z}),\quad\forall\mathbf{x}\in\mathcal{X},\mathbf{z}\in\mathcal{Z}.

Then, any limit point x¯\bar{x} of the sequence {xK}\{x_{K}\} satisfies that x¯argminxXφ(x)\bar{x}\in\mathrm{argmin}_{x\in X}\varphi(x), i.e., x¯\bar{x} is a solution to BLO in Eq. (1).

Proof.

For any K>0K>0, we define

i(K):=argmin0kKα(𝐱,𝐲k(𝐱,𝐳)).i(K):=\mathrm{argmin}_{0\leq k\leq K}\|\mathcal{R}_{\alpha}(\mathbf{x},\mathbf{y}_{k}(\mathbf{x},\mathbf{z}))\|.

For any limit point 𝐱¯\bar{\mathbf{x}} of the sequence {𝐱K}\{\mathbf{x}_{K}\}, let {𝐱l}\{\mathbf{x}_{l}\} be a subsequence of {𝐱K}\{\mathbf{x}_{K}\} such that 𝐱l𝐱¯𝒳\mathbf{x}_{l}\rightarrow\bar{\mathbf{x}}\in\mathcal{X}. As {𝐲i(K)(𝐱K,𝐳K)}𝒴\{\mathbf{y}_{i(K)}(\mathbf{x}_{K},\mathbf{z}_{K})\}\subset\mathcal{Y} and 𝒴\mathcal{Y} is compact, we can find a subsequence {𝐱j}\{\mathbf{x}_{j}\} of {𝐱l}\{\mathbf{x}_{l}\} satisfying 𝐲i(j)(𝐱j,𝐳j)𝐲¯\mathbf{y}_{i(j)}(\mathbf{x}_{j},\mathbf{z}_{j})\rightarrow\bar{\mathbf{y}} for some 𝐲¯𝒴\bar{\mathbf{y}}\in\mathcal{Y}. It follows from the weak LL convergence property that for any ϵ>0\epsilon>0, there exists J(ϵ)>0J(\epsilon)>0 such that for any j>J(ϵ)j>J(\epsilon), we have

α(𝐱j,𝐲i(j)(𝐱j,𝐳j))ϵ.\|\mathcal{R}_{{\alpha}}(\mathbf{x}_{j},\mathbf{y}_{i(j)}(\mathbf{x}_{j},\mathbf{z}_{j}))\|\leq\epsilon.

As 𝒴\mathcal{Y} is a convex compact set, it follows from [30, Theorem 6.42] that 𝙿𝚛𝚘𝚓𝒴\mathtt{Proj}_{\mathcal{Y}} is continuous. Combined with the assumed continuity of 𝐲f(𝐱,𝐲)\nabla_{\mathbf{y}}f(\mathbf{x},\mathbf{y}), we get that α(𝐱,𝐲)\mathcal{R}_{\alpha}(\mathbf{x},\mathbf{y}) is continuous. Then, by letting jj\rightarrow\infty, we have

α(𝐱¯,𝐲¯)ϵ.\|\mathcal{R}_{{\alpha}}(\bar{\mathbf{x}},\bar{\mathbf{y}})\|\leq\epsilon.

As ϵ\epsilon is arbitrarily chosen, we have α(𝐱¯,𝐲¯)=0\mathcal{R}_{{\alpha}}(\bar{\mathbf{x}},\bar{\mathbf{y}})=0 and thus 𝐲¯𝒮~(𝐱¯)\bar{\mathbf{y}}\in\tilde{\mathcal{S}}(\bar{\mathbf{x}}). Then, by using the same arguments as in the proof of [31, Theorem 3.1], we obtain the conclusion. ∎

Next, we will show that when 𝒯k\mathcal{T}_{k} is chosen as the classical projected gradient method and the accelerated gradient scheme proposed by [29] in Eq. (10), the weak LL property with IA is satisfied. We first consider constructing the iterative module 𝒯k(𝐱,)\mathcal{T}_{k}(\mathbf{x},\cdot) by the classical projected gradient method as,

𝒯k+1(𝐱,𝐲k(𝐱,𝐳))=𝙿𝚛𝚘𝚓𝒴(𝐲k(𝐱,𝐳)α𝐲f(𝐱,𝐲k(𝐱,𝐳))),\mathcal{T}_{k+1}\left(\mathbf{x},\mathbf{y}_{k}(\mathbf{x},\mathbf{z})\right)=\mathtt{Proj}_{\mathcal{Y}}\left(\mathbf{y}_{k}(\mathbf{x},\mathbf{z})-\alpha\nabla_{\mathbf{y}}f(\mathbf{x},\mathbf{y}_{k}(\mathbf{x},\mathbf{z}))\right), (17)

where α\alpha denotes the step size parameter, and k=0,,K1k=0,\ldots,K-1.

Theorem 4.6.

Let {𝐲k(𝐱,𝐳)}\{\mathbf{y}_{k}(\mathbf{x},\mathbf{z})\} be the output generated by (17) with α𝐲k[α¯𝐲,α¯𝐲](0,2Lf)\alpha^{k}_{\mathbf{y}}\in[\underline{\alpha}_{\mathbf{y}},\overline{\alpha}_{\mathbf{y}}]\subset(0,\frac{2}{L_{f}}), then we have that the weak LL property with IA holds.

Proof.

It follows from [31, Lemma 3.1] that there exists Cf>0C_{f}>0 such that

min0kKα¯𝐲(𝐱,𝐲k(𝐱,𝐳))CfK+1,𝐱𝒳,𝐳𝒴.\min_{0\leq k\leq K}\|\mathcal{R}_{\underline{\alpha}_{\mathbf{y}}}(\mathbf{x},\mathbf{y}_{k}(\mathbf{x},\mathbf{z}))\|\leq\frac{C_{f}}{\sqrt{K+1}},\quad\forall\mathbf{x}\in\mathcal{X},\mathbf{z}\in\mathcal{Y}.

Then the conclusion follows immediately. ∎

The iterative module 𝒯k(𝐱,)\mathcal{T}_{k}(\mathbf{x},\cdot) can also be the accelerated gradient scheme proposed in [29], which has been described in Eq. (10).

Theorem 4.7.

Let {𝐲k(𝐱,𝐳)}\{\mathbf{y}_{k}(\mathbf{x},\mathbf{z})\} be the output generated by Eq. (10), then we have that the weak LL property with IA holds.

Proof.

Since 𝒴\mathcal{Y} is a compact set, 𝐯0(𝐱,𝐳)\mathbf{v}_{0}(\mathbf{x},\mathbf{z}) and 𝒮(𝐱)\mathcal{S}(\mathbf{x}) are both uniformly bounded for any 𝐱𝒳\mathbf{x}\in\mathcal{X} and 𝐳𝒴\mathbf{z}\in\mathcal{Y}. Then, it follows from [29, Corollary 2] that there exists C>0C>0 such that

min0kK12Lf(𝐱,𝐲k(𝐱,𝐳))CK,𝐱𝒳,𝐳𝒴.\min_{0\leq k\leq K}\|\mathcal{R}_{\frac{1}{2L_{f}}}(\mathbf{x},\mathbf{y}_{k}(\mathbf{x},\mathbf{z}))\|\leq\frac{C}{\sqrt{K}},\quad\forall\mathbf{x}\in\mathcal{X},\mathbf{z}\in\mathcal{Y}.

Then the conclusion follows immediately. ∎

5 Applications

As mentioned above, although many popular hierarchical learning and vision tasks are well understood as the BLO problem in Eq. (1), almost no efficient and accurate solution strategies are available for some challenging tasks, due to the high-dimensional search space and complex theoretical properties. Moreover, many complex deep learning models endowed with nested constrained objectives still lack a proper reformulation and a reliable methodology with grounded theoretical guarantees.

Neural Architecture Search. We first consider one of the most representative large-scale, high-dimensional BLO applications, i.e., Neural Architecture Search (NAS). Typically, NAS aims to find high-performance neural network structures through an automated process. Here, we focus on gradient-based differentiable NAS methods, which have been widely investigated recently. These methods adopt a search strategy to find the optimal network architecture over a well-defined architecture search space. DARTS [7], one of the representative differentiable NAS methods, relaxes the discrete search space to be continuous, and jointly optimizes the architecture parameters (i.e., the UL variable 𝐱\mathbf{x}) and network weights (i.e., the LL variable 𝐲\mathbf{y}) with gradient descent steps, which yields a typical BLO problem. In detail, the searched architecture consists of multiple computation cells, and each cell is a directed acyclic graph with NN nodes. The nodes in each cell, denoted as dd, represent latent feature maps, while the directed edges o(i,j),i,j[1,N]o_{(i,j)},i,j\in[1,N] between nodes did^{i} and djd^{j} indicate candidate operations from the defined search space 𝒪\mathcal{O}. Then 𝐱\mathbf{x} is defined as the collection of continuous variables 𝐱={𝐱(i,j)}\mathbf{x}=\{\mathbf{x}_{(i,j)}\}, where 𝐱(i,j)\mathbf{x}_{(i,j)} denotes the vector of operation mixing weights for the node pair (i,j)(i,j), whose dimension equals the size of 𝒪\mathcal{O}. Hence o(i,j)o_{(i,j)} is the operation chosen by the argmax\arg\max operation, i.e., o(i,j)=argmaxo𝒪𝐱(i,j)o_{(i,j)}=\arg\max_{o\in\mathcal{O}}\mathbf{x}_{{(i,j)}}. For most learning and vision tasks that employ NAS to optimize the network structure, the LL subproblem usually has a high-dimensional search space and complex task-specific constraints.
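The continuous relaxation for a single edge (i, j) can be sketched as follows; the three-operation search space and the mixing weights below are made up for illustration and are not the actual DARTS operation set:

```python
import numpy as np

# Illustrative relaxation for one edge (i, j): the architecture weights
# x_(i,j) mix the candidate operations via softmax, and the discrete
# operation o_(i,j) is recovered by argmax. Operations are toy stand-ins.
ops = {
    "skip": lambda d: d,
    "scale": lambda d: 0.5*d,
    "neg": lambda d: -d,
}
x_ij = np.array([0.2, 1.5, -0.3])         # one mixing weight per operation in O

def mixed_op(d, x_ij):
    w = np.exp(x_ij)/np.sum(np.exp(x_ij))  # softmax over the operation set
    return sum(wi*op(d) for wi, op in zip(w, ops.values()))

d_i = np.array([1.0, -2.0])                # a latent feature map at node i
out = mixed_op(d_i, x_ij)                  # continuous, differentiable in x_(i,j)
chosen = list(ops)[int(np.argmax(x_ij))]   # discrete o_(i,j) = argmax_o x_(i,j)
```

The softmax-mixed output stays differentiable in the architecture weights for the UL update, while the argmax recovers the discrete operation at the end of the search.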

As one of the most widely used solution strategies, DARTS introduces a finite difference approximation with a single-step gradient update of 𝐲\mathbf{y} to avoid complex matrix-vector products. On top of that, a series of variations [32, 33, 34] based on DARTS have been developed and applied in numerous applications. In addition to the approximation adopted by DARTS, one could also solve this BLO application with other GBMs such as RHG and CG. However, since the LL variables contain a cumbersome number of parameters from different search cells, all the above methods lack theoretical support for this non-convex high-dimensional BLO problem. In comparison, by effectively augmenting the iterative trajectory, AIT updates the UL variables with a more accurate hyper-gradient, and is thus better qualified for this BLO problem with a non-convex follower.

Figure 1: The first two subfigures illustrate the convergence behavior of ΦΦ/(Φ+1)\left\|\Phi-\Phi^{*}\right\|/(\left\|\Phi^{*}\right\|+1) and 𝐱𝐱/𝐱\|\mathbf{x}-\mathbf{x}^{*}\|/\|\mathbf{x}^{*}\| for the baseline (i.e., BDA) and variations with techniques from AITC\textrm{AIT}_{C} including IA (i.e., +IA+\textrm{IA}) and prior regularization (i.e., +P+\textrm{P}). As for the last two subfigures, we compare the convergence behavior of AITC\textrm{AIT}_{C} and mainstream GBMs including CG, Neumann, RHG, BDA. We choose the same representative initialization point for all the methods as (𝐱0,𝐲0)=(2𝐞,(2𝐞,2𝐞))(\mathbf{x}_{0},\mathbf{y}_{0})=(2\mathbf{e},(2\mathbf{e},2\mathbf{e})).

Generative Adversarial Networks. We move one step further and apply our AIT to another challenging task with nested constrained learning objectives, i.e., the Generative Adversarial Network (GAN), which has been one of the most popular deep learning applications and has achieved impressive progress in various computer vision tasks. In general, a GAN [35] based model is widely recognized as a minimax game between two adversarial players: the generator network G(𝐱;)G(\mathbf{x};\cdot) (parameterized by 𝐱\mathbf{x}) and the discriminator network D(𝐲;)D(\mathbf{y};\cdot) (parameterized by 𝐲\mathbf{y}). The generative model GG depicts the distribution PGP_{G}, and is supposed to fool DD with generated data sampled from PGP_{G} by optimizing FF to minimize the distance between PGP_{G} and the real data distribution P𝚍𝚊𝚝𝚊P_{\mathtt{data}}. Thus the generic GAN can be defined as min𝐱max𝐲V(𝐱,𝐲)=log(D(𝐲;𝐮))log(1D(𝐲;G(𝐱;𝐯)))\min_{\mathbf{x}}\max_{\mathbf{y}}V(\mathbf{x},\mathbf{y})=-\log(D(\mathbf{y};\mathbf{u}))-\log(1-D(\mathbf{y};G(\mathbf{x};\mathbf{v}))), where V(𝐱,𝐲)V(\mathbf{x},\mathbf{y}) denotes a joint loss function for both objectives, and 𝐮\mathbf{u} and 𝐯\mathbf{v} are data sampled from the real data distribution PdataP_{\text{data}} and the generated fake distribution PGP_{G}, respectively.

Typically, $G(\mathbf{x};\cdot)$ first generates $\mathbf{v}$, and $D(\mathbf{y};\cdot)$ then outputs the predicted probability that $\mathbf{v}$ is sampled from $P_{\mathtt{data}}$. Based on this, most related methods, e.g., the vanilla GAN (VGAN) [36] and the Wasserstein GAN (WGAN) [37], train $G$ and $D$ alternately and simply treat them on an equal footing, which ignores the intrinsic leader-follower relationship between $G(\mathbf{x};\cdot)$ and $D(\mathbf{y};\cdot)$. In this work, we first follow the reformulation in [1] to express the original GAN as:

F(\mathbf{x},\mathbf{y}) = \log(D(\mathbf{y};G(\mathbf{x};\mathbf{v}))),
f(\mathbf{x},\mathbf{y}) = \log D(\mathbf{y};\mathbf{u}) + \log(1-D(\mathbf{y};G(\mathbf{x};\mathbf{v}))).

Thus the LL objective $f$ aims to enhance the ability of $D(\mathbf{y};\cdot)$ to distinguish the fake data generated by $G(\mathbf{x};\cdot)$, while the UL objective $F$ optimizes $\mathbf{x}$ to generate a more "real" data distribution that confuses the discriminator. Under this BLO formulation, most existing methods with alternating training strategies essentially compute the hyper-gradient directly from the final state of the trajectory rather than from the whole sequence, which contains more high-order gradient information. In comparison, our method admits multiple solutions and non-convexity of the LL subproblem, and could thus solve the above problem with a self-contained theoretical guarantee.
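As a hedged illustration of the BLO reformulation above, the two objectives can be sketched as plain functions of the discriminator's outputs; `d_real` and `d_fake` are illustrative stand-ins for $D(\mathbf{y};\mathbf{u})$ and $D(\mathbf{y};G(\mathbf{x};\mathbf{v}))$, not the paper's networks:

```python
import numpy as np

# d_real and d_fake stand in for the discriminator outputs D(y; u) and
# D(y; G(x; v)), both probabilities in (0, 1); the names are illustrative.

def upper_level_F(d_fake):
    # UL objective F(x, y): the generator pushes D(y; G(x; v)) toward 1.
    return np.log(d_fake)

def lower_level_f(d_real, d_fake):
    # LL objective f(x, y): encodes how well the discriminator scores
    # real samples high and generated samples low.
    return np.log(d_real) + np.log(1.0 - d_fake)
```

Under this reading, the discriminator is the follower solving the LL subproblem for a fixed generator, and the generator is the leader evaluated through the resulting LL trajectory.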

6 Experiments

In this section, we report the quantitative and visualization results of the numerical examples, typical learning and vision applications and more challenging tasks. We run the experiments of toy examples based on a PC with Intel Core i7-8700 CPU, 32GB RAM and implement the real-world BLO applications on a Linux server with NVIDIA GeForce RTX 2080Ti GPU (12GB VRAM).

6.1 Numerical Evaluations

At first, we validate the theoretical results of our AIT on numerical examples with different LL assumptions. More specifically, we focus on the UL variable (i.e., $\mathbf{x}$) and the approximated UL subproblem ($\Phi$ is used as a unified notation for $\varphi_{K}(\mathbf{x})$ in Eq. (4), $\varphi_{K}(\mathbf{x},\mathbf{z})$ in Eq. (12) and $\phi_{\bar{K}}(\mathbf{x},\mathbf{z})$ in Eq. (16) under the convex and non-convex scenarios, respectively), and analyze the convergence behavior of $\|\Phi-\Phi^{*}\|/(\|\Phi^{*}\|+1)$ (where $\|\Phi^{*}\|$ equals 0 in some cases) and $\|\mathbf{x}-\mathbf{x}^{*}\|/\|\mathbf{x}^{*}\|$ during the UL optimization. We use a numerical example under the LLC assumption to show the effectiveness of $\textrm{AIT}_{C}$ with IA and the prior regularization term, and compare the convergence behavior of $\textrm{AIT}_{C}$ with a series of mainstream GBMs. Next, we use another example under the well-defined LLS assumption to implement our $\textrm{AIT}_{C}$ with IA and Nesterov's acceleration dynamics.

Moving forward to the toy example with LL non-convexity, we first verify the effectiveness of $\textrm{AIT}_{NC}$ with IA, PTT and the accelerated gradient scheme, and compare $\textrm{AIT}_{NC}$ with the above GBMs to show its substantial improvement. Furthermore, we depict the loss surfaces and the derived solutions to show the significant improvement of $\textrm{AIT}_{NC}$ under the non-convex scenario, and analyze the running time as influenced by different factors, including the initialization points, the dimension of the LL variables (i.e., $n$) and the number of LL iterations (i.e., $K$).

6.1.1 Convex Scenario

We first consider the LLC scenario and conduct numerical experiments with the following example from [12], where $\mathbf{x}\in\mathbb{R}^{n}$, $[\mathbf{y}]_{1}\in\mathbb{R}^{n}$ and $[\mathbf{y}]_{2}\in\mathbb{R}^{n}$:

\min_{\mathbf{x}\in\mathcal{X}} \|\mathbf{x}-[\mathbf{y}]_{2}\|^{4}+\|[\mathbf{y}]_{1}-\mathbf{e}\|^{4}, \quad (18)
\text{s.t. } ([\mathbf{y}]_{1},[\mathbf{y}]_{2})\in\operatorname{argmin}_{[\mathbf{y}]_{1}\in\mathbb{R}^{n},[\mathbf{y}]_{2}\in\mathbb{R}^{n}} \frac{1}{2}\|[\mathbf{y}]_{1}\|^{2}-\mathbf{x}^{\top}[\mathbf{y}]_{1},

where $\mathcal{X}=[-10,10]\times\cdots\times[-10,10]\subset\mathbb{R}^{n}$, and $\mathbf{e}$ denotes the all-ones vector. Note that we use bold text to represent vectors in all of the following description. We set $n=20$, $T=1000$ and choose $(\mathbf{x}_{0},\mathbf{y}_{0})=(2\mathbf{e},(2\mathbf{e},2\mathbf{e}))$ as the initialization point. By simple calculation, we obtain the unique optimal solution $\mathbf{x}^{*}=\mathbf{e}$, $\mathbf{y}^{*}=(\mathbf{e},\mathbf{e})$. This numerical example satisfies the LLC assumption of BDA while violating the LLS assumption required by most GBMs such as RHG, CG [20] and Neumann [21].
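A minimal check (our illustration, not the paper's code) that the LL subproblem of Eq. (18) pins down $[\mathbf{y}]_{1}=\mathbf{x}$ but leaves $[\mathbf{y}]_{2}$ completely free, so the LL solution set is not a singleton and the LLS assumption fails:

```python
import numpy as np

def ll_gradient_step(x, y1, y2, step=0.5):
    # The LL objective of Eq. (18) is (1/2)||y1||^2 - x^T y1, so its
    # gradient w.r.t. y1 is y1 - x, and it does not depend on y2 at all.
    return y1 - step * (y1 - x), y2

n = 20
x = np.ones(n)                          # the optimal UL value x* = e
y1, y2 = 2 * np.ones(n), 2 * np.ones(n)
for _ in range(100):
    y1, y2 = ll_gradient_step(x, y1, y2)
# y1 converges to x, while y2 never moves from its initialization.
```

Running the unrolled LL iterations drives $[\mathbf{y}]_{1}$ to $\mathbf{x}$ while $[\mathbf{y}]_{2}$ stays at its (empirically chosen) initialization, which is precisely why the choice of initialization matters on this example.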

To investigate the effectiveness of Alg. 2, we take BDA as the baseline and implement our $\textrm{AIT}_{C}$ by incrementally adding IA and the prior regularization. In the first two subfigures of Fig. 1, we analyze the convergence behavior of $\|\Phi-\Phi^{*}\|/(\|\Phi^{*}\|+1)$ and $\|\mathbf{x}-\mathbf{x}^{*}\|/\|\mathbf{x}^{*}\|$. As proved in [1], the convergence property of BDA is consistently verified, and it obtains the expected convergence results on this BLO problem. As for the augmentation techniques, the IA embedded in $\textrm{AIT}_{C}$ dynamically adjusts the initialization of $[\mathbf{y}]_{1}$ and $[\mathbf{y}]_{2}$, and the augmented optimization trajectory further improves the convergence speed of the UL variable (i.e., $\mathbf{x}$) and the approximated UL subproblem (i.e., $\Phi$). Besides, with the prior regularization guiding the optimization process of $\mathbf{z}$, our $\textrm{AIT}_{C}$ obtains higher convergence precision and speed.
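To make the IA idea concrete on Eq. (18), the following sketch (our illustration with hypothetical step sizes, not the paper's Alg. 2) treats the LL initialization $\mathbf{z}=(\mathbf{z}_{1},\mathbf{z}_{2})$ as an extra optimization variable. After $K$ LL gradient steps with step size $a$, the iterate has the closed form $\mathbf{y}_{1}^{K}=\mathbf{x}+(1-a)^{K}(\mathbf{z}_{1}-\mathbf{x})$ and $\mathbf{y}_{2}^{K}=\mathbf{z}_{2}$, so the approximated UL objective can be minimized jointly over $\mathbf{x}$ and $\mathbf{z}$:

```python
import numpy as np

def approx_ul(x, z1, z2, K=16, a=0.5):
    # phi_K(x, z): UL objective of Eq. (18) evaluated at the K-th LL
    # iterate started from the auxiliary initialization z = (z1, z2).
    e = np.ones_like(x)
    y1K = x + (1 - a) ** K * (z1 - x)   # closed-form unrolled LL iterate
    y2K = z2                            # the LL updates never touch y2
    return np.sum((x - y2K) ** 2) ** 2 + np.sum((y1K - e) ** 2) ** 2

def num_grad(f, v, eps=1e-6):
    # Central-difference gradient, enough for this low-dimensional toy.
    g = np.zeros_like(v)
    for i in range(v.size):
        d = np.zeros_like(v)
        d[i] = eps
        g[i] = (f(v + d) - f(v - d)) / (2 * eps)
    return g

n, lr = 5, 0.01
x, z1, z2 = 2 * np.ones(n), 2 * np.ones(n), 2 * np.ones(n)
for _ in range(3000):
    gx = num_grad(lambda v: approx_ul(v, z1, z2), x)
    gz1 = num_grad(lambda v: approx_ul(x, v, z2), z1)
    gz2 = num_grad(lambda v: approx_ul(x, z1, v), z2)
    x, z1, z2 = x - lr * gx, z1 - lr * gz1, z2 - lr * gz2
# x approaches the optimum e, and z2 is dragged along to follow x,
# instead of staying frozen at an empirically chosen initialization.
```

Because $[\mathbf{y}]_{2}$ receives no LL gradient, updating $\mathbf{z}_{2}$ through the hyper-gradient is the only mechanism that moves it, which is the role IA plays here.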

In the last two subfigures of Fig. 1, we compare our $\textrm{AIT}_{C}$ with a series of mainstream GBMs, including CG, Neumann, RHG and BDA. Since the LLS assumption of CG, Neumann and RHG cannot be satisfied, no convergence property can be guaranteed for these methods in this case. In comparison with BDA, $\textrm{AIT}_{C}$ also gains an advantage in convergence speed and precision.

Furthermore, to demonstrate the performance of other iterative formats and acceleration dynamics, we introduce another toy example with a stricter LL property, i.e., the well-defined LLS assumption:

\min_{\mathbf{x}\in\mathcal{X}} -\mathbf{e}^{\top}\mathbf{x}+\|\mathbf{x}-[\mathbf{y}]_{2}\|^{4}+\|[\mathbf{y}]_{1}-\mathbf{e}\|^{4}, \quad (19)
\text{s.t. } ([\mathbf{y}]_{1},[\mathbf{y}]_{2})\in\operatorname{argmin}_{[\mathbf{y}]_{1}\in\mathbb{R}^{n},[\mathbf{y}]_{2}\in\mathbb{R}^{n}} \|[\mathbf{y}]_{1}+[\mathbf{y}]_{2}-\mathbf{x}\|^{4},

where $\mathbf{x}\in\mathbb{R}^{n}$, $[\mathbf{y}]_{1}\in\mathbb{R}^{n}$, $[\mathbf{y}]_{2}\in\mathbb{R}^{n}$ and $\mathcal{X}=[-1,1]\times\cdots\times[-1,1]\subset\mathbb{R}^{n}$. For any $\mathbf{x}\in\mathcal{X}$, it holds that $\min_{[\mathbf{y}]_{1}\in\mathbb{R}^{n},[\mathbf{y}]_{2}\in\mathbb{R}^{n}}\|[\mathbf{y}]_{1}+[\mathbf{y}]_{2}-\mathbf{x}\|^{4}=0$, and the unique optimal solution of this BLO problem is $\mathbf{x}^{*}=\mathbf{e}$, $\mathbf{y}^{*}=(\frac{1}{2}\mathbf{e},\frac{1}{2}\mathbf{e})$. Since the solution set of this LL subproblem (i.e., $\mathcal{S}(\mathbf{x})$) is a singleton, the above example can be properly solved by GBMs under the LLS assumption.

Refer to caption Refer to caption
(a) (b)
Figure 2: Illustrating the convergence curves of $\|\Phi-\Phi^{*}\|/(\|\Phi^{*}\|+1)$ and $\|\mathbf{x}-\mathbf{x}^{*}\|/\|\mathbf{x}^{*}\|$ for the baseline (i.e., RHG) and its variations with techniques of $\textrm{AIT}_{C}$, including IA (i.e., $+\textrm{IA}$) and Nesterov's acceleration dynamics (i.e., $+\textrm{A}$), during the LL optimization. The initialization point is chosen as $(\mathbf{x}_{0},\mathbf{y}_{0})=(-\mathbf{e},(-\mathbf{e},-2\mathbf{e}))$.

In this case, we implement our $\textrm{AIT}_{C}$ with RHG as the baseline, and further incorporate Nesterov's acceleration dynamics to improve the convergence behavior. When an appropriate initialization position is chosen, commonly used GBMs are also expected to converge at an acceptable speed. To demonstrate the effectiveness of IA, we set the initialization point to $(\mathbf{x}_{0},\mathbf{y}_{0})=(-\mathbf{e},(-\mathbf{e},-2\mathbf{e}))$, where $\mathbf{z}$ is farther away from the optimal solution $\mathbf{z}^{*}$. As shown in Fig. 2, the IA technique effectively accelerates the convergence of the UL variables towards the global optimal solution, even when RHG converges more slowly due to the improper initialization of the LL variable. In addition, Nesterov's strategy further helps the LL variable (i.e., $\mathbf{y}$) converge faster with the embedded acceleration dynamics.
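The accelerated LL dynamics can be sketched as follows; the quadratic objective below is an assumption chosen for illustration, not the paper's subproblem, and each unrolled iterate takes a gradient step from a momentum look-ahead point:

```python
import numpy as np

def f_ll(y, x):
    # Illustrative smooth, convex LL objective: f(y) = 0.5 * ||y - x||^2.
    return 0.5 * np.sum((y - x) ** 2)

def unroll_nesterov(x, y0, step=0.1, K=100):
    # Unrolled LL trajectory with Nesterov's acceleration dynamics.
    y, y_prev = y0.copy(), y0.copy()
    for k in range(1, K + 1):
        v = y + (k - 1) / (k + 2) * (y - y_prev)  # momentum look-ahead
        y_prev, y = y, v - step * (v - x)         # grad f(v) = v - x
    return y

x, y0 = np.ones(4), 5.0 * np.ones(4)
y_K = unroll_nesterov(x, y0)
```

The hyper-gradient is then backpropagated through this accelerated sequence instead of the plain gradient-descent trajectory, which is how the acceleration dynamics are embedded in the unrolled scheme.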

6.1.2 Non-Convex Scenario

Refer to caption Refer to caption Refer to caption Refer to caption
(a) $(\mathbf{x}_{0},\mathbf{y}_{0})=(-6,0)$ (b) $(\mathbf{x}_{0},\mathbf{y}_{0})=(-8,-8)$
Figure 3: We plot the convergence behavior of $\|\Phi-\Phi^{*}\|/(\|\Phi^{*}\|+1)$ and $\|\mathbf{x}-\mathbf{x}^{*}\|/\|\mathbf{x}^{*}\|$ for the baseline (i.e., RHG) and its variations with IA (i.e., $+\textrm{IA}$), PTT (i.e., $+\textrm{PTT}$) and the accelerated gradient scheme (i.e., $+\textrm{A}$).
Refer to caption Refer to caption Refer to caption Refer to caption
(a) $(\mathbf{x}_{0},\mathbf{y}_{0})=(-6,0)$ (b) $(\mathbf{x}_{0},\mathbf{y}_{0})=(-8,-8)$
Figure 4: In this figure, we compare the convergence behavior of $\|\Phi-\Phi^{*}\|/(\|\Phi^{*}\|+1)$ and $\|\mathbf{x}-\mathbf{x}^{*}\|/\|\mathbf{x}^{*}\|$ for our $\textrm{AIT}_{NC}$ and a series of mainstream GBMs, including CG, Neumann, RHG, BDA and BVFIM.

To validate the superiority of AIT, we move on to a more challenging scenario and introduce another toy example [16] with a non-convex LL subproblem, which takes the form:

\min_{\mathbf{x}\in\mathbb{R},\mathbf{y}\in\mathbb{R}^{n}} \|\mathbf{x}-a\|^{2}+\|\mathbf{y}-a-\mathbf{c}\|^{2}, \quad (20)
\text{s.t. } [\mathbf{y}]_{i}\in\operatorname{argmin}_{[\mathbf{y}]_{i}\in\mathbb{R}} \sin\left(\mathbf{x}+[\mathbf{y}]_{i}-[\mathbf{c}]_{i}\right),\ \forall i,

where both $a\in\mathbb{R}$ and $\mathbf{c}\in\mathbb{R}^{n}$ are constants, and $[\cdot]_{i}$ denotes the $i$-th element of the vector. The optimal solution of this problem can then be calculated in the following form:

\mathbf{x}=\frac{(1-n)a+nC}{1+n}, \text{ and } [\mathbf{y}]_{i}=C+[\mathbf{c}]_{i}-\mathbf{x},\ \forall i,

where

C=\operatorname{argmin}_{C}\left\{\|C-2a\|:C=-\frac{\pi}{2}+2k\pi,\ k\in\mathbb{Z}\right\},

and the corresponding optimal value of the UL objective is $F^{*}=\frac{n(C-2a)^{2}}{1+n}$. Here we set $a=2$, $[\mathbf{c}]_{i}=2$ for $i=1,\ldots,n$, and $[\mathbf{y}]_{i}\in[-10,10]$. It can be easily verified that this example satisfies the theoretical assumptions of $\textrm{AIT}_{NC}$ while violating the LLS assumption of RHG, CG and Neumann and the LLC assumption of BDA, due to the existence of multiple local optimal solutions of the LL subproblem. At first, we set $n=1$ to investigate the convergence property based on the simplified form and to draw the loss surfaces. We then use larger $n$ when analyzing the influence of the dimension of the LL variables on the running time. For $n=1$, the corresponding optimal solution is $(\mathbf{x}^{*},\mathbf{y}^{*})=(3\pi/4,3\pi/4+2)$. In consideration of the distance between the initialization position and the global optimal solution, we choose two initialization points for $\mathbf{x}$ and $\mathbf{y}$: $(\mathbf{x}_{0},\mathbf{y}_{0})=(-6,0)$ and $(\mathbf{x}_{0},\mathbf{y}_{0})=(-8,-8)$.
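The closed-form solution above can be reproduced directly for the setting used in the text ($a=2$, $[\mathbf{c}]_{i}=2$, $n=1$):

```python
import math

def optimal_solution(a=2.0, c=2.0, n=1):
    # C minimizes |C - 2a| over the LL minimizer grid C = -pi/2 + 2*k*pi;
    # rounding to the nearest integer k picks the closest grid point.
    k = round((2 * a + math.pi / 2) / (2 * math.pi))
    C = -math.pi / 2 + 2 * k * math.pi
    x = ((1 - n) * a + n * C) / (1 + n)
    y = C + c - x
    return x, y

x_star, y_star = optimal_solution()  # (3*pi/4, 3*pi/4 + 2)
```

With $a=2$ the nearest grid point is $C=3\pi/2$, which recovers the stated optimum $(\mathbf{x}^{*},\mathbf{y}^{*})=(3\pi/4,3\pi/4+2)$.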

In Fig. 3, we take RHG as the baseline, and start by incrementally embedding the two augmentation techniques and the accelerated gradient scheme of $\textrm{AIT}_{NC}$ into the baseline. It can be seen that adding the PTT technique consistently yields better convergence results with respect to $\mathbf{x}$, while the convergence of $\phi_{\bar{K}}(\mathbf{x},\mathbf{z})$ may be influenced by different initialization points when $\mathbf{z}$ is not dynamically optimized. When we implement our $\textrm{AIT}_{NC}$ with both techniques, since the theoretical convergence property is guaranteed, our algorithm uniformly improves the convergence behavior of $\phi_{\bar{K}}(\mathbf{x},\mathbf{z})$ and $\mathbf{x}$ on this non-convex problem. Furthermore, by embedding the accelerated gradient scheme of Eq. (10) in our $\textrm{AIT}_{NC}$, the convergence rate of gradient descent on this non-convex BLO problem can be further improved.
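Our reading of the PTT step can be sketched as follows; the selection rule below (truncating at the LL iterate where the UL objective is largest, i.e., the pessimistic choice) is our interpretation for illustration, and the exact rule in Alg. 3 may differ in detail:

```python
import numpy as np

def ptt_index(ul_values):
    # ul_values[k-1] = F(x, y_k) evaluated along the K LL iterates.
    # Pessimistic truncation: pick the 1-based index with the largest
    # UL value, and backpropagate only through the first K_bar steps.
    return int(np.argmax(ul_values)) + 1

ul_along_trajectory = [0.8, 1.7, 1.2, 0.9, 0.6]  # toy UL values
K_bar = ptt_index(ul_along_trajectory)           # truncates at step 2
```

The hyper-gradient is then computed from $\mathbf{y}_{\bar{K}}$ rather than $\mathbf{y}_{K}$, which both shortens backpropagation and avoids committing to a tail of the trajectory that may have settled into a poor LL stationary point.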

Refer to caption
Refer to caption
Refer to caption
(b) UL Trajectory
Refer to caption
(a) Loss Surface (c) LL Trajectory
Figure 5: In the first subfigure, we plot the surfaces of the two subproblems and illustrate the iterative solutions derived by RHG and $\textrm{AIT}_{NC}$ with the same initialization point $(\mathbf{x}_{0},\mathbf{y}_{0})=(-6,0)$. In the subfigures on the right, we plot the trajectory of $(\mathbf{x}^{t},\mathbf{z}^{t})$ for our $\textrm{AIT}_{NC}$ during the UL optimization.

Furthermore, in Fig. 4, we compare $\textrm{AIT}_{NC}$ with mainstream GBMs, including CG, Neumann, RHG, BDA and BVFIM, using the same initialization points. As expected, all of the above methods except BVFIM easily get stuck at a local optimal solution. Besides, BVFIM may not converge exactly to the true optimal solution, influenced by its sensitivity to the added penalty terms. In comparison, under our $\textrm{AIT}_{NC}$ algorithm, by dynamically adjusting the initial value of the LL variables and truncating $\mathcal{T}$ for the computation of hyper-gradients, the convergence results without the LLC assumption are consistently verified.

More vividly, in Fig. 5, we mark the initialization point (i.e., $(\mathbf{x}_{0},\mathbf{y}_{0})$), the optimal solution (i.e., $(\mathbf{x}^{*},\mathbf{y}^{*})$), the iterative solutions (i.e., $(\mathbf{x}^{t},\mathbf{y}^{t})$) and the derived convergence solutions (i.e., $(\mathbf{x}^{T},\mathbf{y}^{T})$) for RHG and $\textrm{AIT}_{NC}$ on the UL and LL surfaces. As confirmed earlier by the convergence analysis, since numerous local optima exist along the path from $(\mathbf{x}_{0},\mathbf{y}_{0})$ to $(\mathbf{x}^{*},\mathbf{y}^{*})$, the iterative solutions of RHG easily get stuck at local optima on the loss surface, which leads to a convergence solution distant from $(\mathbf{x}^{*},\mathbf{y}^{*})$. In contrast, the introduced $\mathbf{z}$ together with PTT under our $\textrm{AIT}_{NC}$ (i.e., $\mathbf{z}^{t}$) passes over the local optima and makes its way towards the true optimal solution, which further helps find a reasonable $\mathbf{y}_{\bar{K}}(\mathbf{x}^{t},\mathbf{z}^{t})$ for the computation of the hyper-gradient. Finally, $(\mathbf{x}^{t},\mathbf{y}^{t})$ overcomes the trap of local optima and converges to the global optimal solution, demonstrating the superiority of AIT.

Refer to caption Refer to caption
(a) $(\mathbf{x}_{0},\mathbf{y}_{0})=(-6,0)$ (b) $(\mathbf{x}_{0},\mathbf{y}_{0})=(-8,-8)$
Figure 6: In the two subfigures, we report the average $\bar{K}$ and the running time per iteration of our $\textrm{AIT}_{NC}$ during the LL optimization with different initialization points. The part with a green background indicates that the convergence curve has stabilized ($\|\mathbf{x}-\mathbf{x}^{*}\|<1\times 10^{-3}$).

In Fig. 6, we report the average running time of the hyper-gradient computation with respect to different initialization points in this numerical case. It can be observed that, as the PTT operation of $\textrm{AIT}_{NC}$ dynamically truncates the optimization trajectory and selects $\bar{K}\in[1,K]$ after $K$ iterations of LL optimization, the computational burden of the hyper-gradients w.r.t. $\mathbf{x}$ and $\mathbf{z}$ is continuously eased. In addition, we find that once the curve converges at a favorable speed in Fig. 4, $\bar{K}$ tends to stabilize around a fixed number independent of the chosen initialization points.

Refer to caption Refer to caption
(a) Influence of $K$ (b) Influence of $n$
Figure 7: The two subfigures report the total time of RHG and $\textrm{AIT}_{NC}$ for a single UL iteration as the number of LL iterations $K$ or the dimension of $\mathbf{y}$ increases. We use $\textrm{TG}_{(\cdot)}$ to denote the time for the gradient computation of the corresponding variables, and TF denotes the time for the forward propagation used to construct $\mathcal{T}$. Note that we use the finally stabilized $\bar{K}$ for $\textrm{AIT}_{NC}$ and run 5 steps to compute the average time for comparison.

To evaluate the performance on high-dimensional BLO problems, we further investigate how the dimension of the LL subproblem (i.e., $n$) and the number of LL iterations (i.e., $K$) influence the computational efficiency. Since the computational burden of calculating the hyper-gradient with AD [15] (adopted by classical GBMs, e.g., RHG) has been shown to depend only weakly on the dimension of the UL subproblem, in Fig. 7 we continuously increase the dimension of $\mathbf{y}$ and record the average time for the gradient computation of RHG and $\textrm{AIT}_{NC}$. We observe that $\bar{K}\in[1,K]$ always decreases the time cost of the UL optimization process compared with RHG, even though extra computation for $\bar{K}$ and the hyper-gradient of $\mathbf{z}$ is introduced in $\textrm{AIT}_{NC}$. Besides, we also evaluate the runtime of RHG and $\textrm{AIT}_{NC}$ with increasing $K$. Compared with $n$, increasing $K$ has more influence on the computation of the iterative optimization of $\mathbf{y}_{k}$ and the hyper-gradient of $\mathbf{x}$. Accordingly, the PTT technique chooses a smaller $\bar{K}$ for backpropagation (Steps 12-14 in Alg. 3), thus reducing the computational burden of backpropagation (i.e., $\textrm{PT}_{\mathbf{x}}$ and $\textrm{PT}_{\mathbf{z}}$).

6.2 Typical Learning and Vision Applications

In the following, we further demonstrate the performance of AIT on typical BLO applications characterized by non-convex LL objectives and non-convex LL network structures, including few-shot learning and data hyper-cleaning.

6.2.1 Data Hyper-Cleaning

The goal of data hyper-cleaning is to clean up a dataset in which some labels have been corrupted. This BLO problem defines the UL variable $\mathbf{x}$ as a vector whose dimension equals the number of corrupted samples, while the LL variable $\mathbf{y}$ corresponds to the parameters of the classification model. Following the common problem setting of existing works [2], we first split the dataset into three disjoint subsets, $\mathcal{D}_{\mathtt{tr}}$, $\mathcal{D}_{\mathtt{val}}$ and $\mathcal{D}_{\mathtt{test}}$, and then randomly pollute the labels in $\mathcal{D}_{\mathtt{tr}}$ with a fixed ratio. The UL subproblem can then be defined as $F(\mathbf{x},\mathbf{y})=\sum_{(\mathbf{u}_{i},\mathbf{v}_{i})\in\mathcal{D}_{\mathtt{val}}}\ell(\mathbf{y}(\mathbf{x});\mathbf{u}_{i},\mathbf{v}_{i})$, where $(\mathbf{u}_{i},\mathbf{v}_{i})$ represents a data pair and $\ell(\mathbf{y}(\mathbf{x});\mathbf{u}_{i},\mathbf{v}_{i})$ denotes the cross-entropy loss with classifier parameter $\mathbf{y}$ and data pairs from the different subsets. By introducing the element-wise sigmoid function $\sigma(\mathbf{x})$ to constrain each element to the range $[0,1]$, the LL subproblem can be defined as $f(\mathbf{x},\mathbf{y})=\sum_{(\mathbf{u}_{i},\mathbf{v}_{i})\in\mathcal{D}_{\mathtt{tr}}}[\sigma(\mathbf{x})]_{i}\ell(\mathbf{y};\mathbf{u}_{i},\mathbf{v}_{i})$.
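The two objectives above can be sketched with a toy linear classifier; the data layout and helper names below are illustrative, not the paper's code:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cross_entropy(W, u, v):
    # Softmax cross-entropy of a linear classifier (weights W) on one
    # data pair (u, v), with the usual max-subtraction for stability.
    logits = W @ u
    logits = logits - logits.max()
    return -logits[v] + np.log(np.exp(logits).sum())

def ul_objective(W, val_set):
    # F(x, y): plain (unweighted) loss on the clean validation split.
    return sum(cross_entropy(W, u, v) for u, v in val_set)

def ll_objective(W, x, tr_set):
    # f(x, y): training losses reweighted per-sample by sigmoid(x_i),
    # so corrupted samples can be down-weighted toward zero.
    return sum(sigmoid(x[i]) * cross_entropy(W, u, v)
               for i, (u, v) in enumerate(tr_set))
```

Driving $[\sigma(\mathbf{x})]_{i}$ toward 0 effectively removes the $i$-th (presumably corrupted) training sample from the LL objective, which is the cleaning mechanism the UL optimization learns.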

TABLE II: With a convex network structure, we report the results of existing GBMs, including CG, Neumann, RHG and BDA, for solving data hyper-cleaning tasks. Acc. and F1 score denote the test accuracy and the harmonic mean of precision and recall, respectively. $\textrm{AIT}_{P}$ denotes our AIT with embedded prior regularization.

Method | MNIST Acc. | MNIST F1 score | CIFAR10 Acc. | CIFAR10 F1 score
CG | 89.12 | 87.51 | 36.80 | 68.56
Neumann | 88.90 | 89.90 | 35.15 | 67.79
RHG | 88.96 | 90.49 | 34.80 | 69.75
BDA | 88.55 | 90.99 | 35.56 | 69.58
AIT | 89.58 | 91.35 | 36.95 | $\mathbf{72.19}$
$\textrm{AIT}_{P}$ | $\mathbf{89.63}$ | $\mathbf{91.98}$ | $\mathbf{38.08}$ | 70.75
TABLE III: Reporting the results of existing methods for solving data hyper-cleaning tasks with a non-convex LL network structure.

Method | MNIST Acc. | MNIST F1 score | FashionMNIST Acc. | FashionMNIST F1 score | CIFAR10 Acc. | CIFAR10 F1 score
CG | 89.19 | 85.96 | 83.15 | 85.13 | 34.16 | 69.10
Neumann | 87.54 | 89.58 | 81.37 | 87.28 | 33.45 | 68.87
RHG | 87.90 | 89.36 | 81.91 | 87.12 | 34.95 | 68.27
BDA | 87.15 | 90.38 | 79.97 | 88.24 | 36.41 | 67.33
AIT | $\mathbf{90.88}$ | 91.57 | 83.67 | 90.37 | 37.16 | 71.57
$\textrm{AIT}_{P}$ | 90.41 | $\mathbf{91.95}$ | $\mathbf{83.80}$ | $\mathbf{90.40}$ | $\mathbf{37.70}$ | $\mathbf{72.74}$
TABLE IV: Mean test accuracy of 5-way classification on MiniImageNet and TieredImageNet with non-convex LL objectives. We use $\pm$ to report the accuracy with $95\%$ confidence intervals over tasks.

Network | Method | MiniImagenet 5-way 1-shot | MiniImagenet 5-way 5-shot | TieredImagenet 5-way 1-shot | TieredImagenet 5-way 5-shot
ConvNet-4 | MAML | $48.70\pm 0.75\%$ | $63.11\pm 0.11\%$ | $49.06\pm 0.50\%$ | $67.48\pm 0.47\%$
ConvNet-4 | RHG | $48.89\pm 0.81\%$ | $63.02\pm 0.70\%$ | $49.63\pm 0.67\%$ | $66.14\pm 0.57\%$
ConvNet-4 | T-RHG | $47.67\pm 0.82\%$ | $63.70\pm 0.76\%$ | $50.79\pm 0.69\%$ | $67.39\pm 0.60\%$
ConvNet-4 | BDA | $49.08\pm 0.82\%$ | $62.17\pm 0.70\%$ | $51.56\pm 0.68\%$ | $\mathbf{68.21\pm 0.58\%}$
ConvNet-4 | AIT | $\mathbf{49.80\pm 0.61\%}$ | $\mathbf{64.76\pm 0.54\%}$ | $\mathbf{51.86\pm 0.68\%}$ | $68.01\pm 0.60\%$
ResNet-12 | MAML | $51.03\pm 0.50\%$ | $68.26\pm 0.47\%$ | $58.58\pm 0.49\%$ | $71.24\pm 0.43\%$
ResNet-12 | RHG | $50.54\pm 0.85\%$ | $64.53\pm 0.68\%$ | $58.19\pm 0.76\%$ | $75.20\pm 0.60\%$
ResNet-12 | AIT | $\mathbf{56.69\pm 0.66\%}$ | $\mathbf{70.21\pm 0.55\%}$ | $\mathbf{60.71\pm 0.77\%}$ | $\mathbf{75.85\pm 0.59\%}$

Typically, to satisfy the LLC assumption required by existing GBMs, the LL classification model is defined as a single fully-connected layer, and multi-layer classifiers with more parameters are not accessible. Based on AIT, these constraints can be relaxed, so we have more choices when designing the LL model. Besides, the prior regularization also serves as an effective tool to improve the performance of AIT. To find the prior-determined value $\mathbf{z}_{p}$ for the IA variable (i.e., $\mathbf{z}$), we pretrain the LL classification model on $\mathcal{D}_{\mathtt{tr}}$ and restore the pretrained model weights as the proximal prior to adjust the updates and improve the convergence behavior of $\mathbf{z}$. We can then specify the form of the prior regularization in Eq. (12). In practice, we implement the prior regularization $g(\mathbf{z})$ as $\ell_{2}$ regularization, and add this term only in the first iteration to investigate how it influences the subsequent optimization process. Three well-known datasets, including MNIST, FashionMNIST and CIFAR10, are used to conduct the experiments. Specifically, 5000, 5000 and 10000 samples are randomly selected to construct $\mathcal{D}_{\mathtt{tr}}$, $\mathcal{D}_{\mathtt{val}}$ and $\mathcal{D}_{\mathtt{test}}$, and then half of the labels in $\mathcal{D}_{\mathtt{tr}}$ are tampered with.

We first implement this application with a convex LL network structure, i.e., a single fully-connected classification layer. The shape of $\mathbf{y}$ is then $\mathbf{y}\in\mathbb{R}^{10\times d}$, where $d$ denotes the dimension of the flattened input data. In Tab. II, we compare our AIT with CG, Neumann, RHG and BDA. Consistent with the numerical experiments above, we observe that AIT improves the performance in both accuracy and F1 score on the MNIST and CIFAR10 datasets. Besides, though the prior regularization term only exists in the first iteration, it helps correct the initial state of the constructed optimization dynamics and further improves the performance of AIT on both datasets.

In addition, we introduce a non-convex network structure with two fully-connected layers to expand the search space of the LL subproblem, where $\mathbf{y}\in\mathbb{R}^{10}\times\mathbb{R}^{301\times d}$. In Tab. III, we compare our AIT and its variation with prior regularization (i.e., $\textrm{AIT}_{P}$) against RHG, BDA, CG and Neumann on validation accuracy, F1 score and running time. As shown, our method performs the best among the GBMs in both accuracy and F1 score. Besides, although only used as a practical technique for solving non-convex BLOs, the embedded proximal prior also consistently improves the performance of $\textrm{AIT}_{P}$ on all three datasets.

6.2.2 Few-Shot Learning

Few-shot learning applications, more specifically few-shot classification tasks [38], train a model to classify unseen instances with only a few samples, based on prior training data from similar tasks. For N-way K-shot classification, we first define the training dataset as $\mathcal{D}=\{\mathcal{D}^{j}\}$, where $\mathcal{D}^{j}=\mathcal{D}_{\mathtt{tr}}^{j}\bigcup\mathcal{D}_{\mathtt{val}}^{j}$ corresponds to the meta-train and meta-validation sets of the $j$-th task. In the training phase, we select $K$ samples from each of the $N$ classes in $\mathcal{D}_{\mathtt{tr}}$ for meta training, and use more data from these classes in $\mathcal{D}_{\mathtt{val}}$ for meta validation. In the test phase, we construct new tasks on the test dataset under the same data distribution to evaluate the performance.

We follow the commonly used parameterization for GBMs, where the UL variables $\mathbf{x}$ correspond to the shared hyper-representation module, while the LL variables $\mathbf{y}$ correspond to the task-specific classifiers. For this classification problem, the cross-entropy function (denoted as $\ell$) is commonly used for both the LL objective $f$ and the UL objective $F$. Besides, $\ell_{q}$ regularization with $0<q<1$ can be applied to stabilize the training and effectively avoid over-fitting; while existing methods only guarantee convergence in the convex scenario where $q\geq 1$, our AIT supports a more flexible non-convex LL objective without the LLC or LLS constraints. We thus add $\|\mathbf{y}^{j}\|_{q}$ as a non-convex regularization term and define the objectives as $F(\mathbf{x},\{\mathbf{y}^{j}\})=\sum_{j}\ell(\mathbf{x},\mathbf{y}^{j};\mathcal{D}_{\mathtt{val}}^{j})$ and $f(\mathbf{x},\{\mathbf{y}^{j}\})=\sum_{j}\ell(\mathbf{x},\mathbf{y}^{j};\mathcal{D}_{\mathtt{tr}}^{j})+\|\mathbf{y}^{j}\|_{q}$. This implementation of a non-convex LL objective represents another class of complex LL subproblems for BLOs, which can be used to verify the significant improvement of AIT.
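The non-convex $\ell_{q}$ penalty can be sketched as follows; the small smoothing constant `eps` is our addition to avoid the non-differentiable point at zero, not part of the paper's formulation:

```python
import numpy as np

def lq_penalty(y, q=0.5, eps=1e-12):
    # sum_i |[y]_i|^q with 0 < q < 1; this is non-convex (q < 1 makes
    # the map t -> |t|^q concave on t >= 0), unlike the usual q >= 1 norms.
    return np.sum((np.abs(y) + eps) ** q)
```

For $q\geq 1$ the penalty reduces to the standard convex norm regularizer covered by existing convergence theory; choosing $0<q<1$ is exactly what pushes the LL subproblem outside the LLC setting.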

We implement 5-way 1-shot and 5-way 5-shot classification tasks to validate the performance on the MiniImagenet [39] and TieredImagenet [40] datasets. Both datasets are subsets constructed from the larger ILSVRC-12 dataset, while TieredImagenet includes more classes than MiniImagenet. Besides, TieredImagenet categorizes the source data with a hierarchical structure to split the training and test datasets, which simulates more realistic scenarios.

In Tab. IV, we compare our AIT with representative methods including MAML [4], RHG, T-RHG [2] and BDA. We consider two backbones for the hyper-representation module: ConvNet-4, with 4 convolutional blocks, and the larger ResNet-12, with residual blocks. We use a single fully-connected layer as the task-specific classifier for both backbones. For MAML, the hyper-representation module and the classifier are treated as a whole and updated as the initialization parameters. We first use ConvNet-4 as the backbone and compare AIT with RHG, T-RHG, BDA and MAML; we then implement the ResNet-12 backbone and compare AIT with the initialization-based method (i.e., MAML) and the recurrence-based method (i.e., RHG). As shown in Tab. IV, for the 5-way 1-shot and 5-way 5-shot classification tasks on MiniImagenet and TieredImagenet, AIT shows significant accuracy improvements with both backbones as the hyper-representation module.

6.3 Extensions for More Challenging Tasks

In this subsection, we first apply AIT to a large-scale and high-dimensional real-world BLO application, i.e., neural architecture search. We then move on to another complex deep learning model, i.e., the generative adversarial network, to demonstrate its generalization performance.

6.3.1 Neural Architecture Search

We denote the training, validation and test sets as $(\mathcal{D}_{\mathtt{tr}},\mathcal{D}_{\mathtt{val}},\mathcal{D}_{\mathtt{test}})$; the UL and LL objectives can then be written as $F(\mathbf{x},\mathbf{y};\mathcal{D}_{\mathtt{val}})$ and $f(\mathbf{x},\mathbf{y};\mathcal{D}_{\mathtt{tr}})$. We follow the standard training strategy of DARTS, which consists of a searching stage and an inference stage. In the searching stage, DARTS executes a single-step LL optimization (Step 4 in Alg. 1) with $f(\mathbf{x},\mathbf{y};\mathcal{D}_{\mathtt{tr}})$, and continuously optimizes the architecture parameter $\mathbf{x}$ (Step 9 in Alg. 1) to find better cell structures according to the UL objective $F(\mathbf{x},\mathbf{y};\mathcal{D}_{\mathtt{val}})$. In the inference stage, the architecture parameter $\mathbf{x}$ is fixed, and we reuse the searched cell structure to construct a larger architecture and train it from scratch on $\mathcal{D}_{\mathtt{tr}}$. Finally, we test the performance of the model trained at the inference stage on $\mathcal{D}_{\mathtt{test}}$. In this paper, we conduct the experiments on the CIFAR-10 dataset, and use the same search space and similar experimental settings as DARTS. The stacked cells of the final architecture have two types: reduction cells and normal cells. In a reduction cell, the operations adjacent to the input nodes use stride 2 to reduce the input shape, while normal cells maintain the original shape with stride 1 for the first two nodes. We use 3 layers of cells for searching over 50 epochs, construct a larger structure with 8 layers of the searched cells, and train the model from scratch for 600 epochs in total.
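The alternating single-step scheme of the searching stage can be sketched as follows; the quadratic objectives are toy stand-ins for $f(\mathbf{x},\mathbf{y};\mathcal{D}_{\mathtt{tr}})$ and $F(\mathbf{x},\mathbf{y};\mathcal{D}_{\mathtt{val}})$, chosen only to make the loop runnable, not the actual DARTS losses:

```python
import numpy as np

def search(x, y, steps=200, lr=0.1):
    # Alternating scheme: one LL gradient step on the weights y
    # (cf. Step 4 in Alg. 1), then one UL gradient step on the
    # architecture parameters x (cf. Step 9 in Alg. 1).
    for _ in range(steps):
        y = y - lr * (y - x)     # grad_y f for toy f = 0.5 * ||y - x||^2
        x = x - lr * (x - 1.0)   # grad_x F for toy F = 0.5 * ||x - 1||^2
    return x, y

x, y = search(np.zeros(3), np.zeros(3))
```

In the real application, the LL step trains the supernet weights for one iteration and the UL step moves the architecture mixing parameters, after which the searched cells are discretized and retrained from scratch.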

Figure 8: Comparison of the validation loss (a) and validation accuracy (b) of DARTS, CG, RHG and AIT in the searching process.

In Fig. 8, we report the validation loss and accuracy after each epoch to evaluate the convergence behavior of DARTS, RHG, CG and our AIT. RHG, CG and DARTS attain similar validation performance on this non-convex BLO application, while AIT significantly improves the convergence results on \mathcal{D}_{\mathtt{val}}. Note that several works [41, 42] found that DARTS may find the global minimum within a very small search space, but it then starts overfitting the architecture parameters on \mathcal{D}_{\mathtt{val}}, which leads to poor generalization performance. To verify that the searched architecture performs consistently, in Tab. V we also report the accuracy on \mathcal{D}_{\mathtt{tr}} and \mathcal{D}_{\mathtt{val}} at different stages. Only AIT maintains strong performance on both \mathcal{D}_{\mathtt{tr}} and \mathcal{D}_{\mathtt{val}}, and it also improves the test performance of the finally constructed network by a large margin.

TABLE V: Top-1 accuracy at the searching, inference and final test stages for DARTS, RHG, CG and AIT. We also report the number of parameters of the searched structure (MB) for different methods.
Method   Searching (Train / Valid)   Inference (Train / Valid)   Test     Params
DARTS    98.320 / 88.940             97.952 / 96.710             96.670   1.277
RHG      98.448 / 89.556             98.494 / 97.010             96.920   1.359
CG       99.126 / 89.298             98.286 / 96.690             96.640   1.268
AIT      98.904 / 99.512             98.718 / 97.470             97.330   1.963
Figure 9: Visualization of the searched normal cells (top) and reduction cells (bottom) for (a) DARTS, (b) RHG, (c) CG and (d) AIT. Note that edges labeled in blue denote operations with fewer parameters, such as skip connections and pooling operations, while edges labeled in red denote more complex operations such as 5×5 separable convolutions.

Figure 10: Results on the synthesized 2D Pentagram distribution, comparing AIT with existing GAN methods: the target distribution and samples generated by (a) VGAN, (b) WGAN, (c) UGAN and (d) AIT.

Besides, it has been observed [42] that DARTS tends to choose parameter-free operations such as skip connections instead of convolutions, since skip connections often lead to rapid loss descent, especially on proxy datasets that are small and easy to fit. As a result, the generated structure may contain relatively few learnable parameters. In Fig. 9, we illustrate the searched normal cells and reduction cells to show the differences between these GBMs and AIT. Since DARTS derives an approximate hyper-gradient w.r.t. \mathbf{x} compared to RHG, the two methods find similar structures, although RHG uses more complex edges with more parameters, as reported in Tab. V. While CG finds structures that differ markedly from those of DARTS and RHG, it offers no guarantee of convergence or of generalization performance on the test set. In comparison, with its convergence guarantee, AIT constructs normal cells and reduction cells with more complex separable convolutions, thereby significantly improving the expressive ability and generalization performance of the searched architecture.

6.3.2 Generative Adversarial Networks

In practice, we conduct the experiments on a synthesized two-dimensional mixture of Gaussian distributions. Specifically, we generate 10 Gaussians arranged in the shape of a pentagram with maximum radius r=8. In Fig. 10, we compare the visualization results generated by VGAN, Unrolled GAN [43] (UGAN), WGAN and AIT. AIT captures more modes of the target distribution than the above methods, whereas the original GAN methods only capture a small number of modes; this demonstrates the effectiveness of AIT for solving GAN-based applications. As for quantitative results, in Tab. VI we report three commonly used metrics: Fréchet Inception Distance (FID) [36], Jensen-Shannon divergence (JS) [44], and the number of recovered modes (Mode). Our AIT not only recovers all the modes under different random seeds, but also gains significant performance improvements on both JS and FID.
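The synthetic target above can be reproduced with a short sampler. The inner-vertex radius and the component standard deviation sigma are illustrative choices, as the paper only specifies 10 Gaussians and a maximum radius of 8:

```python
import numpy as np

# Sketch of the 2D Pentagram target: 10 Gaussians placed on a pentagram,
# 5 outer star tips at radius r_out = 8 and 5 inner vertices at r_in.
# r_in and sigma are assumptions, not values given in the paper.
rng = np.random.default_rng(0)
r_out, r_in, sigma = 8.0, 3.0, 0.1

angles_out = np.pi / 2 + 2 * np.pi * np.arange(5) / 5  # outer star tips
angles_in = angles_out + np.pi / 5                     # inner vertices
centers = np.concatenate([
    r_out * np.stack([np.cos(angles_out), np.sin(angles_out)], axis=1),
    r_in * np.stack([np.cos(angles_in), np.sin(angles_in)], axis=1),
])  # (10, 2) array of mode centers

def sample_target(n):
    """Draw n points from the equal-weight 10-component mixture."""
    idx = rng.integers(0, 10, size=n)
    return centers[idx] + sigma * rng.standard_normal((n, 2))

pts = sample_target(5000)
print(pts.shape)  # (5000, 2)
```

The GAN's generator is then trained to match samples drawn from `sample_target`.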

TABLE VI: Results of existing methods on the 2D Pentagram Gaussian distribution, obtained by running each experiment with 3 random seeds. We compare our AIT with VGAN, UGAN and WGAN on three metrics.
Method 2D Pentagram (Max Mode=10)
FID JS Mode
VGAN 1.214 ± 0.37 0.191 ± 0.094 8.00 ± 0.00
WGAN 2.404 ± 3.80 0.746 ± 0.064 3.33 ± 2.52
UGAN 2.355 ± 2.93 0.762 ± 0.083 3.33 ± 2.89
AIT 0.256 ± 0.13 0.187 ± 0.046 10.00 ± 0.00
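The "Mode" metric in Tab. VI can be computed by counting mixture components that attract enough generated samples. The distance threshold (3 sigma) and the 1% sample quota below are illustrative conventions, not specified in the paper:

```python
import numpy as np

def count_modes(samples, centers, sigma=0.1, min_frac=0.01):
    """Count mixture components 'recovered' by the generated samples: a mode
    counts if at least min_frac of samples land within 3*sigma of its center."""
    d = np.linalg.norm(samples[:, None, :] - centers[None, :, :], axis=2)
    nearest = np.argmin(d, axis=1)                      # assign each sample
    close = d[np.arange(len(samples)), nearest] < 3 * sigma
    covered = 0
    for k in range(len(centers)):
        frac = np.mean(close & (nearest == k))          # hits on mode k
        covered += int(frac >= min_frac)
    return covered

# Toy check with 3 modes, of which the samples only hit the first two
centers = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
rng = np.random.default_rng(0)
samples = np.concatenate([
    centers[0] + 0.1 * rng.standard_normal((500, 2)),
    centers[1] + 0.1 * rng.standard_normal((500, 2)),
])
print(count_modes(samples, centers))  # 2
```

The same routine applied to the 10-center pentagram layout yields the "Max Mode = 10" ceiling reported in the table header.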

7 Conclusion

To summarize, our approach involves a thorough analysis of the gradient-based BLO scheme, identifying two significant deficiencies of the existing iterative trajectory. To address these deficiencies, we propose two augmentation techniques, namely Initialization Auxiliary (IA) and Pessimistic Trajectory Truncation (PTT), and develop a series of variations based on these techniques, such as prior regularization and different forms of acceleration dynamics, to construct our Augmented Iterative Trajectory (AIT) for both convex and non-convex scenarios (i.e., \textrm{AIT}_{C} and \textrm{AIT}_{NC}). Theoretically, we establish a detailed convergence analysis of AIT for different types of BLOs, which fills the gap in non-convex theory. Finally, we conduct a series of experiments on various BLO settings using numerical examples, and demonstrate the effectiveness of our approach on typical learning and vision tasks as well as more challenging deep learning models such as NAS and GANs.

Acknowledgments

This work is partially supported by the National Key R&D Program of China (2022YFA1004101), the National Natural Science Foundation of China (Nos. U22B2052, 61922019, 12222106), the Shenzhen Science and Technology Program (No. RCYX20200714114700072), the Pacific Institute for the Mathematical Sciences (PIMS), and the Guangdong Basic and Applied Basic Research Foundation (No. 2022B1515020082).

References

  • [1] R. Liu, J. Gao, J. Zhang, D. Meng, and Z. Lin, “Investigating bi-level optimization for learning and vision from a unified perspective: A survey and beyond,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021.
  • [2] A. Shaban, C.-A. Cheng, N. Hatch, and B. Boots, “Truncated back-propagation for bilevel optimization,” in AISTATS, 2019.
  • [3] A. Nichol, J. Achiam, and J. Schulman, “On first-order meta-learning algorithms,” arXiv:1803.02999, 2018.
  • [4] C. Finn, P. Abbeel, and S. Levine, “Model-agnostic meta-learning for fast adaptation of deep networks,” in ICML, 2017.
  • [5] B. Wu, X. Dai, P. Zhang, Y. Wang, F. Sun, Y. Wu, Y. Tian, P. Vajda, Y. Jia, and K. Keutzer, “Fbnet: Hardware-aware efficient convnet design via differentiable neural architecture search,” in IEEE CVPR, 2019, pp. 10 734–10 742.
  • [6] Y. Hu, X. Wu, and R. He, “Tf-nas: Rethinking three search freedoms of latency-constrained differentiable neural architecture search,” in ECCV, 2020.
  • [7] H. Liu, K. Simonyan, and Y. Yang, “Darts: Differentiable architecture search,” arXiv preprint arXiv:1806.09055, 2018.
  • [8] H. Zhang, W. Chen, Z. Huang, M. Li, Y. Yang, W. Zhang, and J. Wang, “Bi-level actor-critic for multi-agent coordination,” in AAAI, vol. 34, no. 05, 2020, pp. 7325–7332.
  • [9] L. Wang, Q. Cai, Z. Yang, and Z. Wang, “On the global optimality of model-agnostic meta-learning: Reinforcement learning and supervised learning,” in ICML, 2020.
  • [10] G. M. Moore, Bilevel programming algorithms for machine learning model selection.   Rensselaer Polytechnic Institute, 2010.
  • [11] A. Rajeswaran, C. Finn, S. M. Kakade, and S. Levine, “Meta-learning with implicit gradients,” in NeurIPS, 2019.
  • [12] R. Liu, P. Mu, X. Yuan, S. Zeng, and J. Zhang, “A generic first-order algorithmic framework for bi-level programming beyond lower-level singleton,” in International Conference on Machine Learning, 2020, pp. 6305–6315.
  • [13] L. L. Gao, J. Ye, H. Yin, S. Zeng, and J. Zhang, “Value function based difference-of-convex algorithm for bilevel hyperparameter selection problems,” in International Conference on Machine Learning.   PMLR, 2022, pp. 7164–7182.
  • [14] A. B. Zemkoho and S. Zhou, “Theoretical and numerical comparison of the karush–kuhn–tucker and value function reformulations in bilevel optimization,” Computational Optimization and Applications, pp. 1–50, 2020.
  • [15] L. Franceschi, M. Donini, P. Frasconi, and M. Pontil, “Forward and reverse gradient-based hyperparameter optimization,” arXiv:1703.01785, 2017.
  • [16] R. Liu, X. Liu, S. Zeng, J. Zhang, and Y. Zhang, “Value-function-based sequential minimization for bi-level optimization,” arXiv preprint arXiv:2110.04974, 2021.
  • [17] D. Maclaurin, D. Duvenaud, and R. Adams, “Gradient-based hyperparameter optimization through reversible learning,” in ICML, 2015.
  • [18] L. Franceschi, P. Frasconi, S. Salzo, R. Grazzi, and M. Pontil, “Bilevel programming for hyperparameter optimization and meta-learning,” in ICML, 2018.
  • [19] R. Liu, P. Mu, X. Yuan, S. Zeng, and J. Zhang, “A general descent aggregation framework for gradient-based bi-level optimization,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.
  • [20] F. Pedregosa, “Hyperparameter optimization with approximate gradient,” arXiv:1602.02355, 2016.
  • [21] J. Lorraine, P. Vicol, and D. Duvenaud, “Optimizing millions of hyperparameters by implicit differentiation,” in AISTATS, 2020.
  • [22] R. Liu, X. Liu, X. Yuan, S. Zeng, and J. Zhang, “A value-function-based interior-point method for non-convex bi-level optimization,” in ICML, 2021.
  • [23] K. Ji, J. Yang, and Y. Liang, “Provably faster algorithms for bilevel optimization and applications to meta-learning,” 2020.
  • [24] T. Chen, Y. Sun, Q. Xiao, and W. Yin, “A single-timescale method for stochastic bilevel optimization,” in International Conference on Artificial Intelligence and Statistics.   PMLR, 2022, pp. 2466–2488.
  • [25] M. Hong, H.-T. Wai, Z. Wang, and Z. Yang, “A two-timescale framework for bilevel optimization: Complexity analysis and application to actor-critic,” arXiv preprint arXiv:2007.05170, 2020.
  • [26] P. Khanduri, S. Zeng, M. Hong, H.-T. Wai, Z. Wang, and Z. Yang, “A momentum-assisted single-timescale stochastic approximation algorithm for bilevel optimization,” 2021.
  • [27] Q. Xiao, H. Shen, W. Yin, and T. Chen, “Alternating implicit projected sgd and its efficient variants for equality-constrained bilevel optimization,” arXiv preprint arXiv:2211.07096, 2022.
  • [28] A. Beck and M. Teboulle, “A fast iterative shrinkage-thresholding algorithm for linear inverse problems,” SIAM Journal on Imaging Sciences, vol. 2, no. 1, pp. 183–202, 2009.
  • [29] S. Ghadimi and G. Lan, “Accelerated gradient methods for nonconvex nonlinear and stochastic programming,” Mathematical Programming, vol. 156, no. 1, pp. 59–99, 2016.
  • [30] A. Beck, First-order methods in optimization.   SIAM, 2017.
  • [31] R. Liu, Y. Liu, S. Zeng, and J. Zhang, “Towards gradient-based bilevel optimization with non-convex followers and beyond,” Advances in Neural Information Processing Systems, vol. 34, pp. 8662–8675, 2021.
  • [32] Y. Xu, L. Xie, X. Zhang, X. Chen, G.-J. Qi, Q. Tian, and H. Xiong, “Pc-darts: Partial channel connections for memory-efficient architecture search,” arXiv preprint arXiv:1907.05737, 2019.
  • [33] X. Chen, L. Xie, J. Wu, and Q. Tian, “Progressive differentiable architecture search: Bridging the depth gap between search and evaluation,” in Proceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 1294–1303.
  • [34] X. Dong and Y. Yang, “Searching for a robust neural architecture in four gpu hours,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 1761–1770.
  • [35] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial networks,” Communications of the ACM, vol. 63, no. 11, pp. 139–144, 2020.
  • [36] ——, “Generative adversarial nets,” Advances in neural information processing systems, vol. 27, 2014.
  • [37] M. Arjovsky, S. Chintala, and L. Bottou, “Wasserstein gan,” 2017. [Online]. Available: https://arxiv.org/abs/1701.07875
  • [38] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
  • [39] O. Vinyals, C. Blundell, T. Lillicrap, D. Wierstra et al., “Matching networks for one shot learning,” in NeurIPS, 2016.
  • [40] M. Ren, E. Triantafillou, S. Ravi, J. Snell, K. Swersky, J. B. Tenenbaum, H. Larochelle, and R. S. Zemel, “Meta-learning for semi-supervised few-shot classification,” arXiv:1803.00676, 2018.
  • [41] X. Chu, T. Zhou, B. Zhang, and J. Li, “Fair darts: Eliminating unfair advantages in differentiable architecture search,” in European conference on computer vision.   Springer, 2020, pp. 465–480.
  • [42] A. Zela, T. Elsken, T. Saikia, Y. Marrakchi, T. Brox, and F. Hutter, “Understanding and robustifying differentiable architecture search,” in International Conference on Learning Representations, 2020.
  • [43] L. Metz, B. Poole, D. Pfau, and J. Sohl-Dickstein, “Unrolled generative adversarial networks,” arXiv preprint arXiv:1611.02163, 2016.
  • [44] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, “Gans trained by a two time-scale update rule converge to a local nash equilibrium,” Advances in neural information processing systems, vol. 30, 2017.