
Augmenting Iterative Trajectory for Bilevel Optimization: Methodology, Analysis and Extensions

Risheng Liu,  Yaohua Liu, Shangzhi Zeng, Jin Zhang R. Liu and Y. Liu are with the DUT-RU International School of Information Science & Engineering, Dalian University of Technology, and also with the Key Laboratory for Ubiquitous Network and Service Software of Liaoning Province, Dalian, Liaoning, P.R., China. R. Liu is also with the Peng Cheng Laboratory, Shenzhen, Guangdong, P.R. China. E-mail: [email protected], [email protected]. S. Zeng is with the Department of Mathematics and Statistics, University of Victoria, Victoria, B.C., Canada. E-mail: [email protected]. J. Zhang is with the Department of Mathematics, Southern University of Science and Technology, and National Center for Applied Mathematics Shenzhen, China, (Corresponding author, E-mail: [email protected].) Manuscript received April 19, 2005; revised August 26, 2015.
Abstract

In recent years, there has been a surge of machine learning applications with hierarchical structure, which can be approached from a Bi-Level Optimization (BLO) perspective. However, most existing gradient-based methods overlook the interdependence between hyper-gradient calculation and the Lower-Level (LL) iterative trajectory, focusing solely on the former. Consequently, their convergence theory is built on restrictive LL assumptions that are often challenging to satisfy in real-world scenarios. In this work, we thoroughly analyze the constructed iterative trajectory and highlight two deficiencies: the empirically chosen initialization and the default use of the entire trajectory for hyper-gradient calculation. To address these issues, we incrementally introduce two augmentation techniques, Initialization Auxiliary (IA) and Pessimistic Trajectory Truncation (PTT), and investigate various extension strategies, such as prior regularization, different iterative mapping schemes and acceleration dynamics, to construct the Augmented Iterative Trajectory (AIT) for the corresponding BLO scenarios (e.g., LL convexity and LL non-convexity). Theoretically, we provide convergence analysis for AIT and its variations under different LL assumptions, and establish the first convergence analysis for BLOs with non-convex LL subproblems. Finally, we demonstrate the effectiveness of AIT through three numerical examples, typical learning and vision applications (e.g., data hyper-cleaning and few-shot learning) and more challenging tasks such as neural architecture search.

Index Terms:
Bilevel optimization, gradient-based method, initialization auxiliary, pessimistic trajectory truncation, deep learning.

1 Introduction

Recently, a variety of emerging deep learning applications share similar principles in constructing hierarchical optimization models, including Hyperparameter Optimization [1, 2], Meta Learning [3, 4], Neural Architecture Search [5, 6] and Reinforcement Learning [7, 8, 9], and have achieved highly competitive performance in different fields. With proper reformulation, all of these models can be cast as Bi-Level Optimization (BLO) problems, which optimize two nested subproblems of the form:

\min_{\mathbf{x}\in\mathcal{X}}F(\mathbf{x},\mathbf{y}),\ \text{ s.t. }\ \mathbf{y}\in\mathcal{S}(\mathbf{x}):=\arg\min_{\mathbf{y}\in\mathcal{Y}}f(\mathbf{x},\mathbf{y}), (1)

where $\mathbf{x}\in\mathbb{R}^{m}$ and $\mathbf{y}\in\mathbb{R}^{n}$ correspond to the Upper-Level (UL) and Lower-Level (LL) variables, respectively, the UL objective $F$ and LL objective $f$ are continuously differentiable functions, and $\mathcal{S}(\mathbf{x})$ is the LL solution set for all $\mathbf{x}\in\mathcal{X}$.

In general, the BLO problem can be viewed as a leader-follower game, in which the follower (i.e., the LL subproblem) continuously responds to the decision $\mathbf{x}$ of the leader (i.e., the UL subproblem). From the optimistic BLO perspective, we assume that the obtained LL solution $\mathbf{y}\in\mathcal{S}(\mathbf{x})$ always leads to the optimal value of the UL objective for any given $\mathbf{x}$; then the original BLO in Eq. (1) reduces to the single-level problem

\min_{\mathbf{x}\in\mathcal{X}}\varphi(\mathbf{x}):=\inf_{\mathbf{y}\in\mathcal{S}(\mathbf{x})}F(\mathbf{x},\mathbf{y}), (2)

where $\varphi$ is the value function of a simple bilevel problem w.r.t. $\mathbf{y}$ for any given $\mathbf{x}$. In this formulation, $\mathbf{x}$ and $\mathbf{y}$ have a cooperative relationship towards minimizing the UL subproblem. Based on this form, various branches of methods [10, 11, 12, 13] have explored how to optimize $\varphi(\mathbf{x})$ with respect to $\mathbf{x}$ and $\mathbf{y}$ simultaneously, and thus solve the applications captured by this hierarchical structure with high accuracy and efficiency.

Typically, BLO problems remain challenging to solve owing to the sophisticated dependence between the UL and LL variables, especially when multiple LL solutions exist in $\mathcal{S}(\mathbf{x})$. In classical theory, the KKT conditions [14] have served as an efficient tool for such optimization problems; however, this type of method is impractical for modern machine learning tasks because of the large number of multipliers involved. Meanwhile, Gradient-Based Methods (GBMs) [15] have been widely adopted in various learning and vision tasks to handle real-world applications. Practically speaking, the key to solving Eq. (2) with GBMs is to access the highly coupled nested gradient induced by the iterative optimization of the LL subproblem in Eq. (1). We denote by $\mathbf{y}^{*}(\mathbf{x})$ the best-response mapping of the follower for a given UL variable $\mathbf{x}$. By embedding $\mathbf{y}^{*}(\mathbf{x})$ into the simple bilevel problem in Eq. (2) as $F\left(\mathbf{x},\mathbf{y}^{*}(\mathbf{x})\right)$, the hyper-gradient of $\varphi(\mathbf{x})$ can be derived by the chain rule as

\nabla_{\mathbf{x}}\varphi(\mathbf{x})=\nabla_{\mathbf{x}}F\left(\mathbf{x},\mathbf{y}^{*}(\mathbf{x})\right)+G(\mathbf{x}), (3)

where $\nabla_{\mathbf{x}}F\left(\mathbf{x},\mathbf{y}^{*}(\mathbf{x})\right)$ can be regarded as the direct gradient, and $G(\mathbf{x})=\left(\nabla_{\mathbf{x}}\mathbf{y}^{*}(\mathbf{x})\right)^{\top}\nabla_{\mathbf{y}}F\left(\mathbf{x},\mathbf{y}^{*}(\mathbf{x})\right)$ denotes the indirect gradient.
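To make the decomposition in Eq. (3) concrete, the following sketch evaluates the direct and indirect terms on a hypothetical toy problem with a closed-form best response ($F(x,y)=(y-1)^2+x^2$ and $f(x,y)=(y-x)^2$ are illustrative choices, not from the paper) and checks the result against a finite-difference approximation of $\varphi$:

```python
# Toy BLO instance (illustrative, not from the paper):
#   F(x, y) = (y - 1)^2 + x^2,   f(x, y) = (y - x)^2,
# so the best response is y*(x) = x and phi(x) = F(x, y*(x)).
def hyper_gradient(x):
    y = x                                  # y*(x), known in closed form here
    direct = 2.0 * x                       # direct gradient: partial F / partial x
    dy_dx = 1.0                            # derivative of the best response y*(x)
    indirect = dy_dx * 2.0 * (y - 1.0)     # G(x) = (d y*/dx)^T * partial F / partial y
    return direct + indirect

# Finite-difference check of the hyper-gradient at x = 0.3
phi = lambda x: (x - 1.0) ** 2 + x ** 2    # phi(x) = F(x, y*(x))
eps = 1e-6
fd = (phi(0.3 + eps) - phi(0.3 - eps)) / (2 * eps)
assert abs(hyper_gradient(0.3) - fd) < 1e-5
```

For this toy problem the exact hyper-gradient is $4x-2$, so both terms of Eq. (3) are needed: the direct gradient alone ($2x$) would miss how $\mathbf{y}^{*}$ moves with $\mathbf{x}$.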

Since the calculation of the hyper-gradient requires proper approximations of $\mathbf{y}^{*}(\mathbf{x})$ and the indirect gradient $G(\mathbf{x})$, classical dynamics-embedded GBMs [15] construct these approximations by introducing a dynamical system obtained from iterative optimization steps. Typically, this type of gradient-based BLO method approximates $\mathbf{y}^{*}(\mathbf{x})$ with $\mathbf{y}_{K}(\mathbf{x})$ constructed by the following optimization dynamics: $\mathbf{y}_{k+1}(\mathbf{x})=\mathcal{T}_{k+1}\left(\mathbf{x},\mathbf{y}_{k}(\mathbf{x})\right),\ k=0,\cdots,K-1$, where $\mathbf{y}_{0}(\mathbf{x})=\mathbf{y}_{0}$ is a fixed given initial value and $\mathcal{T}_{k+1}(\mathbf{x},\cdot)$ denotes the iterative dynamics mapping at the $k$-th step derived from a specific updating scheme. Based on this formulation, we use $\mathcal{T}_{0:K}$ to denote the historical sequence $\mathbf{y}_{k}(\mathbf{x})$, $k=0,\ldots,K$, which will be discussed more thoroughly in this work. In essence, $\mathcal{T}_{0:K}$ reflects the iterative trajectory of the LL variables used to obtain $\mathbf{y}_{K}(\mathbf{x})$; the indirect gradient $G(\mathbf{x})$ can then be approximated via $\mathbf{y}_{K}(\mathbf{x})$ by $\left(\nabla_{\mathbf{x}}\mathbf{y}_{K}(\mathbf{x})\right)^{\top}\nabla_{\mathbf{y}}F\left(\mathbf{x},\mathbf{y}_{K}(\mathbf{x})\right)$. In this way, these GBMs can be regarded as applying gradient descent to the following approximation of Eq. (2),

\min_{\mathbf{x}\in\mathcal{X}}\varphi_{K}(\mathbf{x}):=F(\mathbf{x},\mathbf{y}_{K}(\mathbf{x})). (4)

It is worth noting that $\mathbf{y}_{K}(\mathbf{x})$ can be regarded as a good approximation as long as $\nabla_{\mathbf{y}}f\left(\mathbf{x},\mathbf{y}_{K}(\mathbf{x})\right)$ uniformly converges to zero w.r.t. $\mathbf{x}\in\mathcal{X}$ and $F(\mathbf{x},\mathbf{y}_{K}(\mathbf{x}))$ converges to the true BLO objective $\varphi(\mathbf{x})$.

To ensure that the approximation in Eq. (4) is good enough to guarantee convergence towards the original BLO, restrictive assumptions on the LL subproblem, such as LL Singleton (LLS) and LL Convexity (LLC), are usually required by existing GBMs. However, as complex network structures with multiple convolutional blocks and loss functions with specialized regularization terms are widely employed in many applications, the LL subproblems are often non-convex, which implies a gap between the available theoretical guarantees and the complex properties of real-world problems. To address these challenges, several works [12, 16] have designed new iterative mapping schemes or approximation theories, partly relaxing the theoretical constraints and obtaining new convergence results. Nevertheless, there is still an urgent need for a general BLO framework with solid theoretical analysis that covers the non-convex scenario.

1.1 Contributions

In this work, we propose the Augmented Iterative Trajectory (AIT) framework, with two augmentation techniques, Initialization Auxiliary (IA) and Pessimistic Trajectory Truncation (PTT), together with a series of extension strategies, to handle the above limitations of existing GBMs. More specifically, we first analyze the vanilla iterative trajectory constructed by the gradient-based BLO scheme, and highlight two significant deficiencies: the empirically chosen initialization of the iterative trajectory (i.e., $\mathbf{y}_{0}(\mathbf{x})$) and the default use of the entire trajectory for hyper-gradient calculation, both of which have received little attention from existing GBMs.

Aiming at these deficiencies of the iterative trajectory, we first introduce IA to optimize the initialization of the LL trajectory simultaneously with the UL variables. We then propose prior regularization as a warm start to guide the optimization of IA, and investigate different iterative dynamics mapping schemes and acceleration dynamics to construct our AIT framework for BLOs with a convex LL subproblem ($\textrm{AIT}_{C}$ for short). To handle the challenging non-convex scenario, we further propose the PTT operation to dynamically adjust the iterative trajectory used for hyper-gradient calculation, and introduce an accelerated gradient scheme to construct our AIT framework for non-convex BLOs ($\textrm{AIT}_{NC}$ for short). Theoretically, we conduct a detailed convergence analysis of $\textrm{AIT}_{C}$, $\textrm{AIT}_{NC}$ and their variations to support different challenging scenarios, especially the unexplored non-convex one. Finally, we demonstrate our convergence results and performance improvements with extensive experiments on various numerical examples, typical learning and vision applications, and other more challenging deep learning tasks. We summarize our main contributions as follows.

  • We conduct a thorough analysis of the iterative trajectory construction in the gradient-based BLO scheme, a significant departure from existing GBMs. We then identify two limitations of the vanilla iterative trajectory adopted by current GBMs, and introduce our AIT to address BLOs under more relaxed LL assumptions.

  • We first propose IA to guide the optimization of the LL variables, and then introduce prior regularization based on IA and explore various iterative dynamics mapping schemes (e.g., Nesterov's acceleration dynamics) to construct $\textrm{AIT}_{C}$ for BLOs with convex followers. To handle the more challenging non-convex scenarios, we further propose the PTT operation to dynamically truncate the trajectory, and investigate an accelerated gradient scheme to construct $\textrm{AIT}_{NC}$.

  • We provide a detailed theoretical investigation of $\textrm{AIT}_{C}$, $\textrm{AIT}_{NC}$ and the corresponding extension strategies (e.g., proximal prior regularization and the accelerated gradient scheme) to ensure the convergence of AIT without the LLS or even the LLC assumption. Notably, we establish the first convergence guarantee for BLO problems with non-convex LL subproblems.

  • Extensive experiments have been conducted on numerical examples with varying LL properties. Besides, we not only implement typical learning and vision applications (e.g., few-shot classification and data hyper-cleaning), but also demonstrate the generalization performance of AIT on more challenging tasks such as neural architecture search and generative adversarial networks.

2 A Brief Review of Existing Works

In this section, we first provide a brief review of previous works to summarize the general gradient-based BLO scheme, and then discuss the deficiencies of the existing LL iterative trajectory, which have limited the application of existing GBMs to more challenging BLO scenarios.

As mentioned before, existing GBMs have explored different heuristic techniques to calculate the hyper-gradient (i.e., $\nabla_{\mathbf{x}}\varphi(\mathbf{x})$) with designed approximation operations. One of the most representative classes of GBMs [17, 18, 19] first constructs a dynamical system to track the optimization path of the LL variables, and then explicitly calculates the indirect gradient with respect to the UL variables based on ITerative Differentiation (ITD). Typically, RHG and FHG [17, 18] approximate $\mathbf{y}^{*}(\mathbf{x})$, and thus $\varphi(\mathbf{x})$, by finitely many iterative optimization steps, and then solve the resulting approximated optimization problems. However, both methods suffer from the huge time and space complexity of calculating the indirect gradient along the whole iterative trajectory with Automatic Differentiation (AD). Shaban et al. [2] then proposed T-RHG, which manually truncates the backpropagation path to reduce the computational burden. For these methods, the iterative dynamics mapping $\mathcal{T}_{k+1}(\mathbf{x},\cdot)$ used to construct the iterative trajectory is specified as projected gradient descent. However, the convergence analyses of these methods all require the LL subproblem to be strongly convex, which severely restricts the application scenarios. Recently, Liu et al. [19] proposed BDA, which aggregates the gradient information of both the UL and LL subproblems to construct a new $\mathcal{T}_{k+1}(\mathbf{x},\cdot)$ for approximating $\mathbf{y}^{*}(\mathbf{x})$ and $\varphi(\mathbf{x})$, thus successfully overcoming the LLS restriction. When the nested hyper-gradient is calculated with ITD based on different forms of $\mathcal{T}_{k+1}(\mathbf{x},\cdot)$, the gradient-based BLO scheme can be summarized as in Alg. 1, where $\varphi_{K}(\mathbf{x})=F(\mathbf{x},\mathbf{y}_{K}(\mathbf{x}))$.
It can be observed that the iterative update of the UL variables for solving the approximated subproblem (i.e., $\varphi_{K}(\mathbf{x})$) relies on an effective iterative trajectory for hyper-gradient calculation (Steps 4-6).

Algorithm 1 Gradient-Based BLO Scheme
0:  UL iteration $T$ and LL iteration $K$.
1:  Initialize $\mathbf{x}^{0}$.
2:  for $t=0\rightarrow T-1$ do
3:     Initialize $\mathbf{y}_{0}$.
4:     for $k=0\rightarrow K-1$ do
5:        % Update $\mathbf{y}_{k}$ with $\mathbf{x}^{t}$.
6:        $\mathbf{y}_{k+1}(\mathbf{x}^{t})=\mathcal{T}_{k+1}\left(\mathbf{x}^{t},\mathbf{y}_{k}(\mathbf{x}^{t})\right)$.
7:     end for
8:     % Update $\mathbf{x}^{t}$ with $\mathbf{y}_{K}(\mathbf{x}^{t})$ based on ITD.
9:     $\mathbf{x}^{t+1}=\mathtt{Proj}_{\mathcal{X}}(\mathbf{x}^{t}-\alpha_{\mathbf{x}}\nabla_{\mathbf{x}}\varphi_{K}(\mathbf{x}^{t}))$.
10:  end for
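Alg. 1 can be sketched end-to-end on an illustrative toy problem ($f(x,y)=(y-x)^2$, $F(x,y)=(y-1)^2+x^2$; all choices are assumptions, not from the paper). Here the derivative $d\mathbf{y}_k/d\mathbf{x}$ is accumulated in forward mode alongside the LL iterates, playing the role that ITD/automatic differentiation plays in practice:

```python
# Toy instance (illustrative): f(x, y) = (y - x)^2, F(x, y) = (y - 1)^2 + x^2.
# Unrolls K lower-level gradient steps and accumulates J = dy_k/dx by the
# chain rule, mimicking what ITD computes by differentiating the trajectory.
def approx_hyper_gradient(x, y0=0.0, K=200, beta=0.1):
    y, J = y0, 0.0                        # y_0 is a fixed value, so J_0 = 0
    for _ in range(K):
        y = y - beta * 2.0 * (y - x)      # T_{k+1}: gradient descent on f
        J = J - beta * (2.0 * J - 2.0)    # chain rule with f_yy = 2, f_yx = -2
    # grad of phi_K(x) = dF/dx + (dy_K/dx) * dF/dy
    return 2.0 * x + J * 2.0 * (y - 1.0)
```

Since $y^{*}(x)=x$ here, the exact hyper-gradient is $4x-2$; for example, `approx_hyper_gradient(0.3)` approaches $-0.8$ as $K$ grows, illustrating that $\varphi_{K}$ approximates $\varphi$ only in the limit of the LL dynamics.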

Since most ITD-based approaches establish their convergence theory only under the LLS restriction, they neither pay attention to the initialization of the iterative trajectory nor seriously consider how to choose the iterative trajectory for hyper-gradient calculation. However, an improper initialization position for the LL variables may slow down the convergence of both the LL and UL subproblems. More importantly, when multiple solutions exist in $\mathcal{S}(\mathbf{x})$ (i.e., without LLS), the initialization has a significant influence on the UL convergence behavior, so there is no guarantee that the iterative trajectory constructed from the chosen $\mathbf{y}_{0}(\mathbf{x})$ provides iterates that optimize the UL objective while converging to $\mathcal{S}(\mathbf{x})$. Furthermore, if the LLC assumption is not satisfied, the subsequences of the whole iterative trajectory (i.e., $\mathcal{T}_{0:K}$) no longer exhibit the consistent convergence behavior enjoyed in the LLC case. On this basis, using the whole iterative trajectory for hyper-gradient calculation may lead to oscillation or divergence of the approximated UL subproblem. Overall, the vanilla iterative trajectory adopted by ITD-based approaches has limited flexibility and has become the bottleneck preventing the existing gradient-based BLO scheme from covering more challenging scenarios. Though several works have explored potential techniques, such as aggregating LL and UL gradient information to design a new $\mathcal{T}_{k}(\mathbf{x},\cdot)$ that yields a good approximated BLO problem in Eq. (4), and have successfully relaxed the theoretical restrictions on the LL subproblem to some extent, the situation where the LLC assumption does not hold still needs to be addressed.

In contrast to the ITD-based approaches, which compute the hyper-gradient based on the constructed LL iterative trajectories, another branch of methods [20, 11, 21] instead adopts the implicit function theorem and derives the hyper-gradient of the UL variables by solving a linear system. However, computing the Hessian matrix and its inverse is much more expensive and challenging, especially when the linear system is ill-conditioned. Besides, these methods based on Approximated Implicit Differentiation (AID) require the LL subproblem to be strongly convex, which greatly limits the application scenarios. Another line of work [22] addresses the non-convex issue by constructing value-function-based smooth approximation problems, while the extra assumptions introduced on the constrained LL subproblems remain cumbersome to handle. Among other recent efforts [23, 24] that design specific update formats of the UL or LL variables to derive new convergence-rate analyses, Hong et al. [25] proposed TTSA, which introduces cheap gradient estimates and updates the LL and UL variables simultaneously with SGD and projected SGD. In addition, Khanduri et al. [26] proposed the MSTSA algorithm to estimate the hyper-gradient with assisted momentum, and Quan et al. [27] proposed two variants of the alternating implicit gradient SGD approach with improved efficiency for solving equality-constrained BLOs. These methods also build their theoretical analysis for BLOs with strongly convex LL subproblems.
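A minimal sketch of the AID idea under the strong-convexity assumption it relies on (the quadratic LL subproblem, the UL objective and all matrices below are hypothetical): the indirect gradient is obtained by solving one linear system with the LL Hessian instead of differentiating through an unrolled trajectory.

```python
import numpy as np

# Quadratic LL problem: f(x, y) = 0.5 y^T A y - y^T B x with A positive
# definite, so y*(x) = A^{-1} B x and grad_yy f = A. The implicit function
# theorem gives the indirect gradient via one linear solve A v = grad_y F.
rng = np.random.default_rng(0)
n, m = 3, 2
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)          # SPD Hessian of f in y
B = rng.standard_normal((n, m))
y_tgt = rng.standard_normal(n)       # UL target: F = 0.5||y - y_tgt||^2 + 0.5||x||^2

def hyper_gradient_aid(x):
    y = np.linalg.solve(A, B @ x)            # exact LL solution y*(x)
    grad_yF = y - y_tgt                      # grad_y F at (x, y*(x))
    v = np.linalg.solve(A, grad_yF)          # linear system instead of unrolling
    return x + B.T @ v                       # direct + indirect gradient

# Finite-difference check of the hyper-gradient
phi = lambda x: 0.5 * np.sum((np.linalg.solve(A, B @ x) - y_tgt) ** 2) + 0.5 * np.sum(x ** 2)
x0, eps = rng.standard_normal(m), 1e-6
fd = np.array([(phi(x0 + eps * np.eye(m)[i]) - phi(x0 - eps * np.eye(m)[i])) / (2 * eps)
               for i in range(m)])
assert np.allclose(hyper_gradient_aid(x0), fd, atol=1e-5)
```

The sketch also makes the cost profile mentioned above visible: each hyper-gradient requires solving a system with the LL Hessian, which is exactly what becomes expensive or unstable when that system is large or ill-conditioned.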

TABLE I: We report the LL assumptions required by the mainstream GBMs to derive their convergence results, the technique used to compute $\nabla_{\mathbf{x}}\varphi(\mathbf{x})$, and whether they are compatible with acceleration techniques (i.e., A.). Note that w/ and w/o are abbreviations for with and without, respectively.

| Method  | $\nabla_{\mathbf{x}}\varphi(\mathbf{x})$ | w/o LLS | w/o LLC | w/ A. |
| RHG/FHG | ITD | ✗ | ✗ | ✗ |
| T-RHG   | ITD | ✗ | ✗ | ✗ |
| BDA     | ITD | ✓ | ✗ | ✗ |
| CG      | AID | ✗ | ✗ | ✗ |
| Neumann | AID | ✗ | ✗ | ✗ |
| AIT     | ITD | ✓ | ✓ | ✓ |

In Tab. I, we compare the LL assumptions that mainstream GBMs require to build their convergence analyses and whether they support acceleration techniques. Based on the above analysis of the gradient-based BLO scheme for ITD-based approaches, in the next section we construct our AIT framework to address these theoretical issues and handle more challenging BLO scenarios. As shown, AIT not only derives similar convergence results without the LLS and LLC restrictions, but also supports more flexible iterative mapping schemes, including acceleration techniques.

3 Augmented Iterative Trajectory (AIT)

Based on the prior analysis of the iterative trajectory, to relax the theoretical restrictions and improve the efficiency of the gradient-based BLO scheme, we first introduce the initialization auxiliary augmentation technique, which attaches a new auxiliary variable to the initialization of the iterative trajectory and optimizes it together with the UL variables. We then investigate different extension strategies related to this technique, including prior regularization and acceleration dynamics, to construct the Augmented Iterative Trajectory (AIT) for BLOs in which the LLC assumption is satisfied. To make the gradient-based BLO scheme workable on BLOs without LLC, we further propose the pessimistic trajectory truncation augmentation technique to dynamically adjust the iterative trajectory used for hyper-gradient calculation. Together with the initialization auxiliary, this constructs the AIT for solving BLOs with non-convex followers.

3.1 BLOs with Convex LL Subproblem

Typically speaking, the initialization of the iterative trajectory, i.e., $\mathbf{y}_{0}(\mathbf{x})$, is usually a fixed, manually set initial value of the LL variable (chosen randomly or by some specific strategy), which takes no account of the convergence behavior of the UL subproblem. As mentioned before, when the LL solution set $\mathcal{S}(\mathbf{x})$ contains multiple elements, performing projected gradient descent steps (i.e., the $\mathcal{T}_{k}(\mathbf{x},\cdot)$ adopted by most GBMs) to construct the iterative trajectory from an improper initial value $\mathbf{y}_{0}(\mathbf{x})$ does not necessarily optimize the UL objective while converging to $\mathcal{S}(\mathbf{x})$. Besides, an improper initialization position for $\mathcal{T}_{0:K}$ also slows down the convergence of the LL subproblem, even when $\mathcal{S}(\mathbf{x})$ is a singleton.

To address the issues mentioned above, we introduce an Initialization Auxiliary (IA) variable $\mathbf{z}$ as the initialization position of the LL variable to construct $\mathcal{T}_{0:K}$ for approximating the LL subproblem as follows,

\mathbf{y}_{0}(\mathbf{x},\mathbf{z})=\mathbf{z},\quad\mathbf{y}_{k+1}(\mathbf{x},\mathbf{z})=\mathcal{T}_{k+1}\left(\mathbf{x},\mathbf{y}_{k}(\mathbf{x},\mathbf{z})\right), (5)

which leads to the following approximation of Eq. (2): $\min_{\mathbf{x}\in\mathcal{X},\mathbf{z}\in\mathcal{Y}}\varphi_{K}(\mathbf{x},\mathbf{z}):=F(\mathbf{x},\mathbf{y}_{K}(\mathbf{x},\mathbf{z}))$. It is worth noting that the iterative trajectory constructed from these optimization dynamics yields the approximation $\mathbf{y}_{K}(\mathbf{x},\mathbf{z})$ parameterized by both $\mathbf{x}$ and $\mathbf{z}$. In this form, we avoid fixing an initialization position chosen by empirical strategies, and instead update the initialization auxiliary $\mathbf{z}$ together with $\mathbf{x}$ to continuously search for the "best" initial points of the LL variables for calculating $\mathbf{y}_{K}(\mathbf{x},\mathbf{z})$. By taking the convergence behavior of the UL subproblem into consideration, $\mathcal{T}_{k}(\mathbf{x},\cdot)$ with the updated initial value can approach solutions to the BLO problem in Eq. (1). The idea of the IA variable $\mathbf{z}$ was first proposed in the convergence theory [22] of non-convex first-order optimization methods.

As mentioned above, the initialization auxiliary variable $\mathbf{z}$ is introduced and optimized simultaneously with the UL variable $\mathbf{x}$ to help the LL subproblem dynamics approach the appropriate $\mathbf{y}^{*}(\mathbf{x})$ that minimizes the UL objective $F$. To further guide the update of the IA variable $\mathbf{z}$ and thus boost the convergence of the LL subproblem dynamics, we introduce a constraint set $\mathcal{Z}$ and an extra prior regularization term $g(\mathbf{z})$ on the IA variable $\mathbf{z}$, which leads to the following approximation of Eq. (2),

\min_{\mathbf{x}\in\mathcal{X},\mathbf{z}\in\mathcal{Z}}\ \psi_{K}(\mathbf{x},\mathbf{z})=\varphi_{K}(\mathbf{x},\mathbf{z})+g(\mathbf{z}). (6)

In practice, $g(\mathbf{z})$ can be specified as a penalty on the distance between $\mathbf{z}$ and a predetermined value $\mathbf{z}_{p}$, where $\mathbf{z}_{p}$ usually encodes some known state of $\mathbf{z}$ that is close to the optimal LL variable. For machine learning tasks, one natural option is to define $\mathbf{z}_{p}$ as pretrained weights of the LL variables; we verify its effectiveness on BLO applications.

Another feature of this new approximation format is that the iterative dynamics mapping $\mathcal{T}_{k+1}(\mathbf{x},\cdot)$ for constructing the iterative trajectory can be flexibly specified in the different forms proposed in existing works. Here, for general BLOs with a convex LL subproblem, we suggest applying the aggregation scheme proposed in BDA [19] to construct $\mathcal{T}_{k+1}(\mathbf{x},\cdot)$ as follows

\mathcal{T}_{k+1}\left(\mathbf{x},\mathbf{y}_{k}(\mathbf{x},\mathbf{z})\right)=\mathtt{Proj}_{\mathcal{Y}}\left(\mathbf{y}_{k}(\mathbf{x},\mathbf{z})-\left(\mu\alpha_{k}\mathbf{d}^{F}_{k}(\mathbf{x},\mathbf{z})+(1-\mu)\beta_{k}\mathbf{d}^{f}_{k}(\mathbf{x},\mathbf{z})\right)\right), (7)

where $\mathbf{d}^{F}_{k}(\mathbf{x},\mathbf{z})=s_{u}\nabla_{\mathbf{y}}F(\mathbf{x},\mathbf{y}_{k}(\mathbf{x},\mathbf{z}))$ and $\mathbf{d}^{f}_{k}(\mathbf{x},\mathbf{z})=s_{l}\nabla_{\mathbf{y}}f(\mathbf{x},\mathbf{y}_{k}(\mathbf{x},\mathbf{z}))$ denote the gradient directions of the UL and LL objectives with the introduced IA, respectively.

In particular, for the case where the LLS condition is satisfied, following the vanilla update format in RHG, $\mathcal{T}_{k+1}\left(\mathbf{x},\cdot\right)$ can simply be chosen as

\mathcal{T}_{k+1}\left(\mathbf{x},\mathbf{y}_{k}(\mathbf{x},\mathbf{z})\right)=\mathtt{Proj}_{\mathcal{Y}}\left(\mathbf{y}_{k}(\mathbf{x},\mathbf{z})-\alpha_{\mathbf{y}}^{k}\nabla_{\mathbf{y}}f(\mathbf{x},\mathbf{y}_{k}(\mathbf{x},\mathbf{z}))\right), (8)

where $\alpha_{\mathbf{y}}^{k}$ denotes the step size parameter. In this way, embedding the IA technique still helps optimize the initialization used to construct the iterative trajectory, further improving the convergence behavior of BLOs under the LLS assumption.

Besides, our new scheme with the IA technique also consistently improves existing methods when acceleration dynamics are embedded. As one practical implementation, we can combine our IA technique with Nesterov's acceleration method [28], which has been widely used in convex optimization to accelerate the convergence of gradient descent, obtaining the accelerated version

\begin{cases}\begin{aligned}&\mathbf{y}_{0}(\mathbf{x},\mathbf{z})=\mathbf{u}_{0}(\mathbf{x},\mathbf{z})=\mathbf{z},\quad t_{0}=1,\quad t_{k+1}=\left(1+\sqrt{1+4t_{k}^{2}}\right)/2,\\&\mathbf{y}_{k+1}(\mathbf{x},\mathbf{z})=\mathtt{Proj}_{\mathcal{Y}}\left(\mathbf{u}_{k}(\mathbf{x},\mathbf{z})-\alpha\nabla_{\mathbf{y}}f(\mathbf{x},\mathbf{u}_{k}(\mathbf{x},\mathbf{z}))\right),\\&\mathbf{u}_{k+1}(\mathbf{x},\mathbf{z})=\mathbf{y}_{k+1}(\mathbf{x},\mathbf{z})+\left(\frac{t_{k}-1}{t_{k+1}}\right)\left(\mathbf{y}_{k+1}(\mathbf{x},\mathbf{z})-\mathbf{y}_{k}(\mathbf{x},\mathbf{z})\right),\\&\mathcal{T}_{k+1}\left(\mathbf{x},\mathbf{y}_{k}(\mathbf{x},\mathbf{z})\right)=\mathbf{y}_{k+1}(\mathbf{x},\mathbf{z}),\end{aligned}\end{cases} (9)

where $\alpha$ denotes the step size parameter.
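A minimal sketch of the accelerated dynamics in Eq. (9), run on an illustrative one-dimensional convex LL objective ($f(x,y)=0.5(y-x)^2$, an assumption for demonstration) with the projection taken as the identity:

```python
# Illustrative LL objective f(x, y) = 0.5 * (y - x)^2; Proj_Y is the identity.
def nesterov_trajectory(x, z, K=60, alpha=0.5):
    ys = [z]                       # y_0(x, z) = z
    u, t = z, 1.0                  # u_0 = z, t_0 = 1
    for _ in range(K):
        y_next = u - alpha * (u - x)     # gradient step from the extrapolated point u_k
        t_next = (1.0 + (1.0 + 4.0 * t * t) ** 0.5) / 2.0
        u = y_next + ((t - 1.0) / t_next) * (y_next - ys[-1])  # momentum extrapolation
        ys.append(y_next)
        t = t_next
    return ys

traj = nesterov_trajectory(x=1.0, z=0.0)
assert abs(traj[-1] - 1.0) < 1e-4  # y_K approaches y*(x) = x
```

The returned list is exactly the trajectory $\mathcal{T}_{0:K}$ that the IA-augmented scheme differentiates through, with the momentum variable $\mathbf{u}_k$ kept internal to the mapping.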

Algorithm 2 The Proposed $\textrm{AIT}_{C}$ Algorithm
0:  UL iteration $T$, LL iteration $K$.
1:  Initialize $\mathbf{x}^{0}$ and $\mathbf{z}^{0}$.
2:  for $t=0\rightarrow T-1$ do
3:     $\mathbf{y}_{0}=\mathbf{z}^{t}$.
4:     for $k=0\rightarrow K-1$ do
5:        % Iterative dynamics mapping.
6:        $\mathbf{y}_{k+1}(\mathbf{x}^{t},\mathbf{z}^{t})=\mathcal{T}_{k+1}\left(\mathbf{x}^{t},\mathbf{y}_{k}(\mathbf{x}^{t},\mathbf{z}^{t})\right)$.
7:     end for
8:     % Update $\mathbf{x}$ with $\mathbf{y}_{K}(\mathbf{x}^{t},\mathbf{z}^{t})$.
9:     $\mathbf{x}^{t+1}=\mathtt{Proj}_{\mathcal{X}}(\mathbf{x}^{t}-\alpha_{\mathbf{x}}\nabla_{\mathbf{x}}\psi_{K}(\mathbf{x}^{t},\mathbf{z}^{t}))$.
10:     % Update $\mathbf{z}$ with $\mathbf{y}_{K}(\mathbf{x}^{t},\mathbf{z}^{t})$.
11:     $\mathbf{z}^{t+1}=\mathtt{Proj}_{\mathcal{Z}}(\mathbf{z}^{t}-\alpha_{\mathbf{z}}\nabla_{\mathbf{z}}\psi_{K}(\mathbf{x}^{t},\mathbf{z}^{t}))$.
12:  end for

In Alg. 2, we present our Augmented Iterative Trajectory algorithm for BLOs with a convex LL subproblem ($\textrm{AIT}_{C}$ for short), where $\psi_{K}(\mathbf{x},\mathbf{z})=F(\mathbf{x},\mathbf{y}_{K}(\mathbf{x},\mathbf{z}))+g(\mathbf{z})$. However, to make the gradient-based BLO scheme efficient and convergent for BLOs with a non-convex LL subproblem, the IA technique alone is not enough. In the next part, we propose another augmentation technique, named Pessimistic Trajectory Truncation, and discuss how to use it together with IA to derive a new gradient-based BLO algorithm that is convergent and efficient for BLOs with non-convex LL subproblems. The detailed convergence analysis can be found in the next section.
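A compact sketch of Alg. 2 on an illustrative quadratic toy problem (the objectives $f(x,y)=(y-x)^2$, $F(x,y)=(y-1)^2+x^2$, the step sizes and the prior $z_p$ are all assumptions); forward-mode accumulation of the derivatives of $\mathbf{y}_K$ w.r.t. $x$ and $z$ stands in for ITD when forming $\nabla_{\mathbf{x}}\psi_K$ and $\nabla_{\mathbf{z}}\psi_K$:

```python
# AIT_C sketch with g(z) = lam * (z - z_p)^2 as the prior regularization.
def ait_c(T=200, K=50, beta=0.1, a_x=0.05, a_z=0.05, lam=0.01, z_p=0.0):
    x, z = 0.0, 0.0
    for _ in range(T):
        y, Jx, Jz = z, 0.0, 1.0            # y_0(x, z) = z, so dy_0/dz = 1
        for _ in range(K):
            y = y - beta * 2.0 * (y - x)   # T_{k+1}: gradient step on f
            Jx = Jx - beta * (2.0 * Jx - 2.0)   # accumulate dy_k/dx
            Jz = Jz * (1.0 - 2.0 * beta)        # accumulate dy_k/dz
        gF_y = 2.0 * (y - 1.0)             # dF/dy at (x, y_K)
        x = x - a_x * (2.0 * x + Jx * gF_y)                 # grad_x psi_K
        z = z - a_z * (Jz * gF_y + 2.0 * lam * (z - z_p))   # grad_z psi_K
    return x, z

x_final, z_final = ait_c()
assert abs(x_final - 0.5) < 1e-3   # minimizer of phi(x) = (x-1)^2 + x^2
```

Both projections are the identity in this unconstrained sketch; Steps 9 and 11 of Alg. 2 correspond to the last two updates in the outer loop.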

3.2 BLOs with Non-convex LL Subproblem

As another important issue with the existing gradient-based BLO scheme, it selects the whole iterative trajectory (i.e., $\mathcal{T}_{0:K}$) to perform the hyper-gradient calculation. When the LL subproblem is non-convex, the subsequences of $\mathcal{T}_{0:K}$ show no consistent convergence behavior as in the LLC case, even when the initialization is continuously optimized by the IA technique. Therefore, directly using $\mathcal{T}_{0:K}$ results in an improper approximation of $F(\mathbf{x},\mathbf{y}_{K}(\mathbf{x}))$: $\nabla_{\mathbf{y}}f\left(\mathbf{x},\mathbf{y}_{K}(\mathbf{x})\right)$ may not uniformly converge to zero w.r.t. $\mathbf{x}\in\mathcal{X}$ while $F(\mathbf{x},\mathbf{y}_{K}(\mathbf{x}))$ converges to the true BLO objective $\varphi(\mathbf{x})$. In short, this empirical strategy for constructing $\mathcal{T}_{0:K}$ invalidates convergence and may result in oscillation or even divergence of the optimization process in the non-convex case.

Based on the approximation formats constructed in Eq. (5), we introduce another significant technique, Pessimistic Trajectory Truncation (PTT), to extend the application scenarios and derive new convergence results without requiring the LLC assumption. The idea of the PTT operation is inspired by the convergence behavior of the projected gradient method on non-convex optimization problems. We define $\mathcal{R}_{\alpha}(\mathbf{x},\mathbf{y}):=\mathbf{y}-\mathtt{Proj}_{\mathcal{Y}}\left(\mathbf{y}-\alpha\nabla_{\mathbf{y}}f(\mathbf{x},\mathbf{y})\right)$ as the proximal gradient residual mapping, which measures the optimality of the LL subproblem. When the LL subproblem is non-convex, $\mathcal{R}_{\alpha}(\mathbf{x},\mathbf{y}_{K}(\mathbf{x},\mathbf{z}))$ may not uniformly converge to zero w.r.t. $\mathbf{x}$ and $\mathbf{z}$, so directly embedding $\mathbf{y}_{K}(\mathbf{x},\mathbf{z})$ into the UL objective $F(\mathbf{x},\mathbf{y})$ does not necessarily provide a proper approximation strategy, as mentioned before. Fortunately, it is well understood that as $K$ tends to infinity, $\min_{0\leq k\leq K}\|\mathcal{R}_{\alpha}(\mathbf{x},\mathbf{y}_{k}(\mathbf{x},\mathbf{z}))\|$ uniformly converges to zero w.r.t. $\mathbf{x}$ and $\mathbf{z}$.

However, it remains impractical to explicitly select i(K):=argmin0kKα(𝐱,𝐲k(𝐱,𝐳))i(K):=\arg\min_{0\leq k\leq K}\|\mathcal{R}_{\alpha}(\mathbf{x},\mathbf{y}_{k}(\mathbf{x},\mathbf{z}))\| and use 𝐲i(K)(𝐱,𝐳)\mathbf{y}_{i(K)}(\mathbf{x},\mathbf{z}) to construct an approximation. As a compromise, we choose to minimize the worst case of 𝐲i(K)(𝐱,𝐳){\mathbf{y}_{i(K)}(\mathbf{x},\mathbf{z})} in terms of the UL objective values, i.e., ϕK¯(𝐱,𝐳):=max1kK{F(𝐱,𝐲k(𝐱,𝐳))}\phi_{\bar{K}}(\mathbf{x},\mathbf{z}):=\max_{1\leq k\leq K}\left\{F(\mathbf{x},\mathbf{y}_{k}(\mathbf{x},\mathbf{z}))\right\}, which guarantees the convergence of α(𝐱,𝐲k(𝐱,𝐳))\mathcal{R}_{\alpha}(\mathbf{x},\mathbf{y}_{k}(\mathbf{x},\mathbf{z})) no matter how this i(K)i(K) is chosen. The theoretical convergence guarantee for the above approximation is presented in Section 4. In practice, the varying K¯\bar{K} takes the behavior of the UL objective during the LL optimization into account to truncate the whole iterative trajectory. The truncated 𝒯0:K¯\mathcal{T}_{0:\bar{K}} thus represents the whole trajectory for hyper-gradient calculation more faithfully, which cannot be achieved by using a fixed KK without the PTT operation.
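To make the PTT operation concrete, the following sketch computes the proximal gradient residual along a projected gradient LL trajectory and selects the truncation index. The one-dimensional objectives F and f below are illustrative toy choices of ours, not instances from the paper:

```python
import numpy as np

# Illustrative 1-D toy instance (our own choice): F is the UL objective and
# f, given via grad_f, is nonconvex in y since f_yy = 1 + 1.5*cos(y).
def F(x, y): return (y - 1.0)**2
def grad_f(x, y): return (y - x) + 1.5*np.sin(y)

def proj_Y(v, lo=-4.0, hi=4.0):  # projection onto the box Y = [-4, 4]
    return np.clip(v, lo, hi)

def residual(x, y, alpha):
    # R_alpha(x, y) = y - Proj_Y(y - alpha * grad_y f(x, y))
    return y - proj_Y(y - alpha*grad_f(x, y))

x, z, alpha, K = 0.5, 3.0, 0.3, 50
ys = [z]                               # y_0 = z (the IA variable)
for _ in range(K):
    ys.append(proj_Y(ys[-1] - alpha*grad_f(x, ys[-1])))

# min_k ||R_alpha(x, y_k)|| vanishes as K grows, but the index attaining it
# is unknown a priori; PTT instead truncates at the worst UL value.
min_res = min(abs(residual(x, y, alpha)) for y in ys)
K_bar = max(range(1, K + 1), key=lambda k: F(x, ys[k]))
```

The index attaining the minimal residual is unknown in advance, while K̄ is computable from the UL values alone, which is exactly the compromise PTT makes.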

By introducing both the IA and PTT operations, our new algorithm avoids directly using the oscillating iterative trajectory of LL variables for hyper-gradient calculation, and successfully removes the LLC assumption so as to handle non-convex scenarios in various real-world applications. In addition, the PTT operation typically yields a small K¯:=argmax1kK{F(𝐱,𝐲k(𝐱,𝐳))}\bar{K}:=\arg\max_{1\leq k\leq K}\left\{F(\mathbf{x},\mathbf{y}_{k}(\mathbf{x},\mathbf{z}))\right\}, which effectively shortens the backpropagation path for computing the indirect hyper-gradient and thus saves computational cost in practice. We also conduct a running time analysis on the nonconvex numerical example to confirm this conclusion numerically.

Note that the iterative mapping scheme 𝒯k+1(𝐱,𝐲k(𝐱,𝐳))\mathcal{T}_{k+1}\left(\mathbf{x},\mathbf{y}_{k}(\mathbf{x},\mathbf{z})\right) can be chosen as the projected gradient method in Eq. (8). Besides, we also investigate an accelerated gradient scheme [29] to improve the efficiency in this non-convex case, which is given as follows

{𝐮0(𝐱,𝐳)=𝐯0(𝐱,𝐳)=𝐳,𝐲k+1(𝐱,𝐳)=(1αk)𝐮k(𝐱,𝐳)+αk𝐯k(𝐱,𝐳),𝐮k+1(𝐱,𝐳)=𝙿𝚛𝚘𝚓𝒴(𝐲k+1(𝐱,𝐳)βk𝐲f(𝐱,𝐲k+1(𝐱,𝐳))),𝐯k+1(𝐱,𝐳)=𝙿𝚛𝚘𝚓𝒴(𝐯k(𝐱,𝐳)λk𝐲f(𝐱,𝐲k+1(𝐱,𝐳))),𝒯k+1(𝐱,𝐲k(𝐱,𝐳))=𝐲k+1(𝐱,𝐳),\begin{cases}\begin{aligned} &\mathbf{u}_{0}(\mathbf{x},\mathbf{z})=\mathbf{v}_{0}(\mathbf{x},\mathbf{z})=\mathbf{z},\\ &\mathbf{y}_{k+1}(\mathbf{x},\mathbf{z})=(1-\alpha_{k})\mathbf{u}_{k}(\mathbf{x},\mathbf{z})+\alpha_{k}\mathbf{v}_{k}(\mathbf{x},\mathbf{z}),\\ &\mathbf{u}_{k+1}(\mathbf{x},\mathbf{z})=\mathtt{Proj}_{\mathcal{Y}}\left(\mathbf{y}_{k+1}(\mathbf{x},\mathbf{z})-\beta_{k}\nabla_{\mathbf{y}}f(\mathbf{x},\mathbf{y}_{k+1}(\mathbf{x},\mathbf{z}))\right),\\ &\mathbf{v}_{k+1}(\mathbf{x},\mathbf{z})=\mathtt{Proj}_{\mathcal{Y}}\left(\mathbf{v}_{k}(\mathbf{x},\mathbf{z})-\lambda_{k}\nabla_{\mathbf{y}}f(\mathbf{x},\mathbf{y}_{k+1}(\mathbf{x},\mathbf{z}))\right),\\ &\mathcal{T}_{k+1}\left(\mathbf{x},\mathbf{y}_{k}(\mathbf{x},\mathbf{z})\right)=\mathbf{y}_{k+1}(\mathbf{x},\mathbf{z}),\end{aligned}\end{cases} (10)

where αk=2k+1\alpha_{k}=\frac{2}{k+1}, βk=12Lf\beta_{k}=\frac{1}{2L_{f}} and λk=kβk2\lambda_{k}=\frac{k\beta_{k}}{2}, and k=0,,K1k=0,\ldots,K-1.
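As a sanity check, the scheme in Eq. (10) with the parameter schedule above can be transcribed directly; the quadratic LL objective below (with L_f = 1, so that the limit point equals x) is an illustrative stand-in for f:

```python
import numpy as np

# Sketch of Eq. (10) with alpha_k = 2/(k+1), beta_k = 1/(2*L_f),
# lambda_k = k*beta_k/2, on the illustrative objective f(x, y) = 0.5*(y - x)^2.
L_f = 1.0
def grad_f(x, y): return y - x           # grad_y f for the toy quadratic

def proj_Y(v, lo=-4.0, hi=4.0): return np.clip(v, lo, hi)

def accelerated_trajectory(x, z, K):
    u = v = z                            # u_0 = v_0 = z
    ys = []
    for k in range(K):
        alpha_k = 2.0/(k + 1)
        beta_k = 1.0/(2.0*L_f)
        lambda_k = k*beta_k/2.0
        y = (1 - alpha_k)*u + alpha_k*v  # y_{k+1}
        g = grad_f(x, y)
        u = proj_Y(y - beta_k*g)         # u_{k+1}
        v = proj_Y(v - lambda_k*g)       # v_{k+1}
        ys.append(y)                     # T_{k+1}(x, y_k) = y_{k+1}
    return ys

ys = accelerated_trajectory(x=0.5, z=3.0, K=60)
```

For a genuinely nonconvex f, only the minimal residual along the trajectory is guaranteed to vanish (cf. Theorem 4.7 below), so the last iterate need not be the best one.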

Algorithm 3 The Proposed AITNC\textrm{AIT}_{NC} Algorithm
0:  UL iteration TT, LL iteration KK.
1:  Initialize 𝐱0\mathbf{x}^{0} and 𝐳0\mathbf{z}^{0}.
2:  for t=0T1t=0\rightarrow T-1 do
3:     𝐲0=𝐳t\mathbf{y}_{0}=\mathbf{z}^{t}.
4:     for k=0K1k=0\rightarrow K-1 do
5:        % Update yky_{k} with 𝐱t\mathbf{x}^{t} and 𝐳t\mathbf{z}^{t}.
6:        𝐲k+1(𝐱t,𝐳t)=𝒯k+1(𝐱t,𝐲k(𝐱t,𝐳t))\mathbf{y}_{k+1}(\mathbf{x}^{t},\mathbf{z}^{t})=\mathcal{T}_{k+1}\left(\mathbf{x}^{t},\mathbf{y}_{k}(\mathbf{x}^{t},\mathbf{z}^{t})\right).
7:     end for
8:     % Pessimistic Trajectory Truncation.
9:     K¯=argmax1kK{F(𝐱t,𝐲k(𝐱t,𝐳t))}\bar{K}=\arg\max_{1\leq k\leq K}\{F(\mathbf{x}^{t},\mathbf{y}_{k}(\mathbf{x}^{t},\mathbf{z}^{t}))\}.
10:     % Update 𝐱\mathbf{x} with 𝐲K¯(𝐱t,𝐳t)\mathbf{y}_{\bar{K}}(\mathbf{x}^{t},\mathbf{z}^{t}).
11:     Denote ϕK¯(𝐱,𝐳)=F(𝐱,𝐲K¯(𝐱,𝐳))\phi_{\bar{K}}(\mathbf{x},\mathbf{z})=F(\mathbf{x},\mathbf{y}_{\bar{K}}(\mathbf{x},\mathbf{z})).
12:     𝐱t+1=𝙿𝚛𝚘𝚓𝒳(𝐱tα𝐱𝐱ϕK¯(𝐱t,𝐳t))\mathbf{x}^{t+1}=\mathtt{Proj}_{\mathcal{X}}(\mathbf{x}^{t}-\alpha_{\mathbf{x}}\nabla_{\mathbf{x}}\phi_{\bar{K}}(\mathbf{x}^{t},\mathbf{z}^{t})).
13:     % Update 𝐳\mathbf{z} with 𝐲K¯(𝐱t,𝐳t)\mathbf{y}_{\bar{K}}(\mathbf{x}^{t},\mathbf{z}^{t}).
14:     𝐳t+1=𝙿𝚛𝚘𝚓𝒵(𝐳tα𝐳𝐳ϕK¯(𝐱t,𝐳t))\mathbf{z}^{t+1}=\mathtt{Proj}_{\mathcal{Z}}(\mathbf{z}^{t}-\alpha_{\mathbf{z}}\nabla_{\mathbf{z}}\phi_{\bar{K}}(\mathbf{x}^{t},\mathbf{z}^{t})).
15:  end for
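A minimal numerical sketch of Alg. 3 is given below. The toy UL/LL objectives are our own illustrative choices, and the hyper-gradient is approximated by finite differences instead of backpropagation to keep the sketch dependency-free:

```python
import numpy as np

# Numerical sketch of Alg. 3 (AIT_NC) on an illustrative 1-D toy problem:
# F(x, y) = (y - 1)^2 + x^2 and f(x, y) = sin(x*y) + 0.5*y^2 (nonconvex in y
# for suitable x). These objectives are made up for demonstration.
def F(x, y): return (y - 1.0)**2 + x**2
def grad_f_y(x, y): return x*np.cos(x*y) + y

def proj(v, lo=-5.0, hi=5.0): return np.clip(v, lo, hi)

def phi_bar(x, z, K=20, alpha=0.1):
    """PTT value: worst UL objective along the LL trajectory started at z."""
    y, worst = z, -np.inf
    for _ in range(K):
        y = proj(y - alpha*grad_f_y(x, y))
        worst = max(worst, F(x, y))      # implements max_{1<=k<=K} F(x, y_k)
    return worst

def ait_nc(x=2.0, z=2.0, T=200, lr=0.05, eps=1e-5):
    for _ in range(T):
        # finite-difference surrogate for the hyper-gradient of phi_bar
        gx = (phi_bar(x + eps, z) - phi_bar(x - eps, z))/(2*eps)
        gz = (phi_bar(x, z + eps) - phi_bar(x, z - eps))/(2*eps)
        x = proj(x - lr*gx)              # UL update on x
        z = proj(z - lr*gz)              # IA update on z
    return x, z

phi_init = phi_bar(2.0, 2.0)
x_T, z_T = ait_nc()
phi_final = phi_bar(x_T, z_T)
```

On this toy instance the truncated objective decreases markedly from its initial value; in practice the gradients of ϕ_K̄ would be obtained by backpropagating through the truncated trajectory rather than by finite differences.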

With the two proposed augmentation techniques and the supported accelerated gradient scheme, the complete algorithm of our AIT for BLOs with a Non-convex LL subproblem (AITNC\textrm{AIT}_{NC} for short) is illustrated in Alg. 3. In the following sections, we provide the asymptotic convergence analyses of AITC\textrm{AIT}_{C} and AITNC\textrm{AIT}_{NC} and further discuss the applicable scenarios of the different variations.

4 Theoretical Analysis

In this section, we present the asymptotic convergence analysis of AITC\textrm{AIT}_{C} in Alg. 2 and AITNC\textrm{AIT}_{NC} in Alg. 3 under mild conditions on the general schematic iterative module 𝒯k(𝐱,)\mathcal{T}_{k}(\mathbf{x},\cdot), and show that these conditions are satisfied for some specific choices of 𝒯k(𝐱,)\mathcal{T}_{k}(\mathbf{x},\cdot). Specifically, for convex BLOs, we show the asymptotic convergence of Alg. 2 when 𝒯k(𝐱,)\mathcal{T}_{k}(\mathbf{x},\cdot) is chosen as BDA [19] in Eq. (7). For convex BLOs whose LL solution set is a singleton, we show the asymptotic convergence of Alg. 2 when 𝒯k(𝐱,)\mathcal{T}_{k}(\mathbf{x},\cdot) is chosen as the projected gradient method in Eq. (8) or Nesterov's acceleration scheme in Eq. (9). For nonconvex BLOs, we show the asymptotic convergence of Alg. 3 when 𝒯k(𝐱,)\mathcal{T}_{k}(\mathbf{x},\cdot) is chosen as the projected gradient method or the accelerated gradient scheme proposed by [29] in Eq. (10).

4.1 Convergence Analysis for AITC\textrm{AIT}_{C}

We first present the asymptotic convergence analysis of Alg. 2 for convex BLOs with a general schematic iterative module 𝒯k(𝐱,)\mathcal{T}_{k}(\mathbf{x},\cdot). Alg. 2 can be regarded as the application of the projected gradient method to the following single-level approximation problem of Eq. (1),

min𝐱𝒳,𝐳𝒵F(𝐱,𝐲K(𝐱,𝐳))+g(𝐳)\displaystyle\min_{\mathbf{x}\in\mathcal{X},\mathbf{z}\in\mathcal{Z}}~{}~{}F(\mathbf{x},\mathbf{y}_{K}(\mathbf{x},\mathbf{z}))+g(\mathbf{z}) (11)
s.t.{𝐲0(𝐱,𝐳)=𝐳,𝐲k+1(𝐱,𝐳)=𝒯k(𝐱,𝐲k(𝐱,𝐳)),\displaystyle s.t.~{}\begin{cases}\mathbf{y}_{0}(\mathbf{x},\mathbf{z})=\mathbf{z},&\\ \mathbf{y}_{k+1}(\mathbf{x},\mathbf{z})=\mathcal{T}_{k}(\mathbf{x},\mathbf{y}_{k}(\mathbf{x},\mathbf{z})),&\end{cases}

where k=0,,K1k=0,\ldots,K-1, and the function g:mg:\mathbb{R}^{m}\rightarrow\mathbb{R} and the compact set 𝒵m\mathcal{Z}\subset\mathbb{R}^{m} encode prior information on the lower-level variable 𝐲\mathbf{y}. To show the asymptotic convergence of Alg. 2, i.e., that the approximation problem in Eq. (11) converges to the original bilevel problem min𝐱𝒳φ(𝐱)\min_{\mathbf{x}\in\mathcal{X}}\varphi(\mathbf{x}) in Eq. (1), we make the following mild assumptions on BLOs throughout this subsection.

Assumption 4.1.

We make the following standing assumptions throughout this subsection:

  • (1)

    F,𝐲F,f,𝐲fF,\nabla_{\mathbf{y}}F,f,\nabla_{\mathbf{y}}f are continuous functions on 𝒳×m\mathcal{X}\times\mathbb{R}^{m}.

  • (2)

    F(𝐱,):mF(\mathbf{x},\cdot):\mathbb{R}^{m}\rightarrow\mathbb{R} is LFL_{F}-smooth, convex and bounded below by M0M_{0} for any 𝐱𝒳\mathbf{x}\in\mathcal{X}.

  • (3)

    f(𝐱,):mf(\mathbf{x},\cdot):\mathbb{R}^{m}\rightarrow\mathbb{R} is LfL_{f}-smooth and convex for any 𝐱𝒳\mathbf{x}\in\mathcal{X}. f(𝐱,𝐲)f(\mathbf{x},\mathbf{y}) is level-bounded in 𝐲\mathbf{y} uniformly in 𝐱𝒳\mathbf{x}\in\mathcal{X}.

  • (4)

    𝒳\mathcal{X}, 𝒴\mathcal{Y} and 𝒵\mathcal{Z} are compact sets, and 𝒴\mathcal{Y} is a convex set.

  • (5)

    𝒮^(𝐱):=argmin𝐲{F(𝐱,𝐲),s.t.𝐲𝒴𝒮(𝐱)}\hat{\mathcal{S}}(\mathbf{x}):=\mathrm{argmin}_{\mathbf{y}}\{F(\mathbf{x},\mathbf{y}),\ s.t.\ \mathbf{y}\in\mathcal{Y}\cap\mathcal{S}(\mathbf{x})\} is nonempty for all 𝐱𝒳\mathbf{x}\in\mathcal{X}.

  • (6)

    g(𝐳)g(\mathbf{z}) is continuous on 𝒵\mathcal{Z}.

Assumption 4.1 is standard for convex BLOs in the related literature, and it is satisfied by the convex BLO numerical example given in Section 6.1. Before presenting the convergence result, we recall the auxiliary function φK(𝐱,𝐳)=F(𝐱,𝐲K(𝐱,𝐳))\varphi_{K}(\mathbf{x},\mathbf{z})=F(\mathbf{x},\mathbf{y}_{K}(\mathbf{x},\mathbf{z})) and rewrite the approximation problem in Eq. (11) into the following compact form,

min𝐱𝒳,𝐳𝒵ψK(𝐱,𝐳)=φK(𝐱,𝐳)+g(𝐳).\min_{\mathbf{x}\in\mathcal{X},\mathbf{z}\in\mathcal{Z}}\psi_{K}(\mathbf{x},\mathbf{z})=\varphi_{K}(\mathbf{x},\mathbf{z})+g(\mathbf{z}). (12)

We propose the following two essential properties of the general schematic iterative module 𝒯k(𝐱,)\mathcal{T}_{k}(\mathbf{x},\cdot) for the convergence analysis:

  1. (1)

    UL property with IA: For each 𝐱𝒳\mathbf{x}\in\mathcal{X}, 𝐳𝒵\mathbf{z}\in\mathcal{Z},

    limKφK(𝐱,𝐳)=φ(𝐱).\lim\limits_{K\rightarrow\infty}\varphi_{K}(\mathbf{x},\mathbf{z})=\varphi(\mathbf{x}).
  2. (2)

    LL property with IA: {𝐲K(𝐱,𝐳)}\{\mathbf{y}_{K}(\mathbf{x},\mathbf{z})\} is uniformly bounded on 𝒳×𝒵\mathcal{X}\times\mathcal{Z}, and for any ϵ>0\epsilon>0, there exists k(ϵ)>0k(\epsilon)>0 such that whenever K>k(ϵ)K>k(\epsilon),

    sup𝐱𝒳,𝐳𝒵{f(𝐱,𝐲K(𝐱,𝐳))f(𝐱)}ϵ.\sup_{\mathbf{x}\in\mathcal{X},\mathbf{z}\in\mathcal{Z}}\left\{f(\mathbf{x},\mathbf{y}_{K}(\mathbf{x},\mathbf{z}))-f^{\ast}(\mathbf{x})\right\}\leq\epsilon.

Now, we are ready to present the asymptotic convergence result of Alg. 2.

Theorem 4.1.

Suppose both the above UL and LL properties with IA hold. Let (𝐱K,𝐳K)(\mathbf{x}_{K},\mathbf{z}_{K}) be a minimum of the approximation problem in Eq. (11), i.e.,

ψK(𝐱K,𝐳K)ψK(𝐱,𝐳),𝐱𝒳,𝐳𝒵.\psi_{K}(\mathbf{x}_{K},\mathbf{z}_{K})\leq\psi_{K}(\mathbf{x},\mathbf{z}),\quad\forall\mathbf{x}\in\mathcal{X},\mathbf{z}\in\mathcal{Z}.

Then, any limit point x¯\bar{x} of the sequence {xK}\{x_{K}\} satisfies that x¯argminxXφ(x)\bar{x}\in\mathrm{argmin}_{x\in X}\varphi(x), i.e., x¯\bar{x} is a solution to BLO in Eq. (1).

Proof.

For any limit point 𝐱¯\bar{\mathbf{x}} of the sequence {𝐱K}\{\mathbf{x}_{K}\}, let {𝐱l}\{\mathbf{x}_{l}\} be a subsequence of {𝐱K}\{\mathbf{x}_{K}\} such that 𝐱l𝐱¯𝒳\mathbf{x}_{l}\rightarrow\bar{\mathbf{x}}\in\mathcal{X}. As {𝐲K(𝐱,𝐳)}\{\mathbf{y}_{K}(\mathbf{x},\mathbf{z})\} is uniformly bounded on 𝒳×𝒵\mathcal{X}\times\mathcal{Z} and {𝐳K}𝒵\{\mathbf{z}_{K}\}\subset\mathcal{Z} is bounded, we can further find a subsequence {(𝐱m,𝐳m)}\{(\mathbf{x}_{m},\mathbf{z}_{m})\} of {(𝐱l,𝐳l)}\{(\mathbf{x}_{l},\mathbf{z}_{l})\} satisfying 𝐲m(𝐱m,𝐳m)𝐲¯\mathbf{y}_{m}(\mathbf{x}_{m},\mathbf{z}_{m})\rightarrow\bar{\mathbf{y}} and 𝐳m𝐳¯\mathbf{z}_{m}\rightarrow\bar{\mathbf{z}} for some 𝐲¯\bar{\mathbf{y}} and 𝐳¯\bar{\mathbf{z}}. It follows from the LL property with IA that for any ϵ>0\epsilon>0, there exists M(ϵ)>0M(\epsilon)>0 such that for any m>M(ϵ)m>M(\epsilon), we have

f(𝐱m,𝐲m(𝐱m,𝐳m))f(𝐱m)ϵ.f(\mathbf{x}_{m},\mathbf{y}_{m}(\mathbf{x}_{m},\mathbf{z}_{m}))-f^{\ast}(\mathbf{x}_{m})\leq\epsilon.

Thanks to the continuity of f(𝐱,𝐲)f(\mathbf{x},\mathbf{y}), we have that f(𝐱):=min𝐲f(𝐱,𝐲)f^{\ast}(\mathbf{x}):=\min_{\mathbf{y}}f(\mathbf{x},\mathbf{y}) is Upper Semi-Continuous (USC for short) on 𝒳\mathcal{X} (see, e.g., [12, Lemma 2]). By letting mm\rightarrow\infty, and since ff is continuous and f(𝐱)f^{\ast}(\mathbf{x}) is USC on 𝒳\mathcal{X}, we have f(𝐱¯,𝐲¯)f(𝐱¯)ϵf(\bar{\mathbf{x}},\bar{\mathbf{y}})-f^{\ast}(\bar{\mathbf{x}})\leq\epsilon. As ϵ\epsilon is arbitrarily chosen, we have f(𝐱¯,𝐲¯)f(𝐱¯)0f(\bar{\mathbf{x}},\bar{\mathbf{y}})-f^{\ast}(\bar{\mathbf{x}})\leq 0 and thus 𝐲¯𝒮(𝐱¯)\bar{\mathbf{y}}\in\mathcal{S}(\bar{\mathbf{x}}). Next, as FF is continuous at (𝐱¯,𝐲¯)(\bar{\mathbf{x}},\bar{\mathbf{y}}) and g is continuous at 𝐳¯\bar{\mathbf{z}}, for any ϵ>0\epsilon>0, there exists M(ϵ)>0M(\epsilon)>0 such that for any m>M(ϵ)m>M(\epsilon), it holds

F(𝐱¯,𝐲¯)+g(𝐳¯)F(𝐱m,𝐲m(𝐱m,𝐳m))+g(𝐳m)+ϵ.F(\bar{\mathbf{x}},\bar{\mathbf{y}})+g(\bar{\mathbf{z}})\leq F(\mathbf{x}_{m},\mathbf{y}_{m}(\mathbf{x}_{m},\mathbf{z}_{m}))+g(\mathbf{z}_{m})+\epsilon.

Then, we have, for any m>M(ϵ)m>M(\epsilon) and 𝐱𝒳\mathbf{x}\in\mathcal{X},

φ(𝐱¯)+g(𝐳¯)\displaystyle\ \varphi(\bar{\mathbf{x}})+g(\bar{\mathbf{z}}) =inf𝐲𝒮(𝐱¯)F(𝐱¯,𝐲)+g(𝐳¯)\displaystyle=\inf_{\mathbf{y}\in\mathcal{S}(\bar{\mathbf{x}})}F(\bar{\mathbf{x}},\mathbf{y})+g(\bar{\mathbf{z}}) (13)
F(𝐱¯,𝐲¯)+g(𝐳¯)\displaystyle\leq F(\bar{\mathbf{x}},\bar{\mathbf{y}})+g(\bar{\mathbf{z}})
F(𝐱m,𝐲m(𝐱m,𝐳m))+g(𝐳m)+ϵ\displaystyle\leq F(\mathbf{x}_{m},\mathbf{y}_{m}(\mathbf{x}_{m},\mathbf{z}_{m}))+g(\mathbf{z}_{m})+\epsilon
φm(𝐱,𝐳)+g(𝐳)+ϵ.\displaystyle\leq\varphi_{m}(\mathbf{x},\mathbf{z})+g(\mathbf{z})+\epsilon.\quad\quad

Taking 𝐳=𝐳¯\mathbf{z}=\bar{\mathbf{z}}, mm\rightarrow\infty and by the UL property with IA, we have

φ(𝐱¯)limmφm(𝐱,𝐳¯)+ϵ=φ(𝐱)+ϵ,𝐱𝒳.\displaystyle\varphi(\bar{\mathbf{x}})\leq\lim_{m\rightarrow\infty}\varphi_{m}(\mathbf{x},\bar{\mathbf{z}})+\epsilon=\varphi(\mathbf{x})+\epsilon,\ \forall\mathbf{x}\in\mathcal{X}.

By taking ϵ0\epsilon\rightarrow 0, we have φ(𝐱¯)φ(𝐱),𝐱𝒳\varphi(\bar{\mathbf{x}})\leq\varphi(\mathbf{x}),\ \forall\mathbf{x}\in\mathcal{X} which implies 𝐱¯argmin𝐱𝒳φ(𝐱)\bar{\mathbf{x}}\in\arg\min_{\mathbf{x}\in\mathcal{X}}\varphi(\mathbf{x}). ∎

The above convergence result shows that the convergence of Eq. (11) to the original bilevel problem in Eq. (1) does not require any restrictive assumptions on gg and 𝒵\mathcal{Z}. This means that we can encode any prior information into gg and 𝒵\mathcal{Z} to speed up the method without compromising the convergence of the algorithm.

Next, we show that both the UL and LL properties with IA hold when 𝒯k(𝐱,)\mathcal{T}_{k}(\mathbf{x},\cdot) is chosen as some specific schemes. Here we provide the asymptotic convergence analysis of Alg. 2 when 𝒯k(𝐱,)\mathcal{T}_{k}(\mathbf{x},\cdot) is defined as BDA [19] in Eq. (7).

Theorem 4.2.

Let {𝐲k(𝐱,𝐳)}\{\mathbf{y}_{k}(\mathbf{x},\mathbf{z})\} be the output generated by (7) with sl(0,1/Lf)s_{l}\in(0,1/L_{f}), su(0,1/LF)s_{u}\in(0,1/L_{F}), μ(0,1)\mu\in(0,1), αk=1k+1\alpha_{k}=\frac{1}{k+1}, βk[β¯,1]\beta_{k}\in[\underline{\beta},1] with some β¯>0\underline{\beta}>0, |βkβk1|cβ(k+1)2|\beta_{k}-\beta_{k-1}|\leq\frac{c_{\beta}}{(k+1)^{2}} with some cβ>0c_{\beta}>0, then we have that both the LL and UL properties with IA hold.

Proof.

We can easily obtain the UL property with IA from [19, Theorem 3]. Next, since Assumption 4.1 holds and 𝒳\mathcal{X}, 𝒴\mathcal{Y} and 𝒵\mathcal{Z} are compact, by using the convergence rate result established in [19, Theorem 4] and arguments similar to those in the proof of [19, Theorem 5], it can be shown that there exists C>0C>0 such that for any 𝐱𝒳\mathbf{x}\in\mathcal{X}, 𝐳𝒵\mathbf{z}\in\mathcal{Z}, we have

f(𝐱,𝐲K(𝐱,𝐳))f(𝐱)C1+lnKK14.f(\mathbf{x},\mathbf{y}_{K}(\mathbf{x},\mathbf{z}))-f^{*}(\mathbf{x})\leq C\sqrt{\frac{1+\ln K}{K^{\frac{1}{4}}}}.

Because 1+lnKK140\sqrt{\frac{1+\ln K}{K^{\frac{1}{4}}}}\rightarrow 0 as KK\rightarrow\infty, {𝐲K(𝐱,𝐳)}𝒴\{\mathbf{y}_{K}(\mathbf{x},\mathbf{z})\}\subset\mathcal{Y}, and 𝒴\mathcal{Y} is compact, LL property with IA holds. ∎

4.1.1 Lower-Level Singleton Case

In this part, we focus on the convex BLOs whose lower-level problem solution set is singleton for any 𝐱𝒳\mathbf{x}\in\mathcal{X}. We first consider constructing the iterative module 𝒯k(𝐱,)\mathcal{T}_{k}(\mathbf{x},\cdot) by projected gradient method as described in Eq. (8).

Theorem 4.3.

Suppose 𝒮(𝐱)\mathcal{S}(\mathbf{x}) is singleton for all 𝐱𝒳\mathbf{x}\in\mathcal{X}. Let {𝐲k(𝐱,𝐳)}\{\mathbf{y}_{k}(\mathbf{x},\mathbf{z})\} be the output generated by (8) with α(0,1/Lf)\alpha\in(0,1/L_{f}), then we have that both the LL and UL convergence properties with IA hold.

Proof.

According to [22, Theorem 10.21], when f(𝐱,)f(\mathbf{x},\cdot) is convex and LfL_{f}-smooth for any 𝐱𝒳\mathbf{x}\in\mathcal{X}, and α(0,1/Lf)\alpha\in(0,1/L_{f}), {𝐲k(𝐱,𝐳)}\{\mathbf{y}_{k}(\mathbf{x},\mathbf{z})\} satisfies

f(𝐱,𝐲K(𝐱,𝐳))f(𝐱)\displaystyle f(\mathbf{x},\mathbf{y}_{K}(\mathbf{x},\mathbf{z}))-f^{\ast}(\mathbf{x}) dist2(𝐲0(𝐱,𝐳),𝒮(𝐱))2αK\displaystyle\leq\frac{\mathrm{dist}^{2}(\mathbf{y}_{0}(\mathbf{x},\mathbf{z}),\mathcal{S}(\mathbf{x}))}{2\alpha K} (14)
=dist2(𝐳,𝒮(𝐱))2αK,\displaystyle=\frac{\mathrm{dist}^{2}(\mathbf{z},\mathcal{S}(\mathbf{x}))}{2\alpha K},

where dist(𝐳,𝒮(𝐱))\mathrm{dist}(\mathbf{z},\mathcal{S}(\mathbf{x})) denotes the distance from 𝐳\mathbf{z} to the set 𝒮(𝐱)\mathcal{S}(\mathbf{x}). Since 𝒳\mathcal{X}, 𝒴\mathcal{Y} and 𝒵\mathcal{Z} are all compact sets, there exists M>0M>0 such that dist(𝐳,𝒮(𝐱))M\mathrm{dist}(\mathbf{z},\mathcal{S}(\mathbf{x}))\leq M for (𝐱,𝐳)𝒳×𝒵(\mathbf{x},\mathbf{z})\in\mathcal{X}\times\mathcal{Z}. Then the LL property with IA follows easily from Eq. (14), {𝐲K(𝐱,𝐳)}𝒴\{\mathbf{y}_{K}(\mathbf{x},\mathbf{z})\}\subset\mathcal{Y} and the compactness of 𝒴\mathcal{Y}. Next, given any (𝐱,𝐳)𝒳×𝒵(\mathbf{x},\mathbf{z})\in\mathcal{X}\times\mathcal{Z}, for any limit point 𝐲¯(𝐱,𝐳)\bar{\mathbf{y}}(\mathbf{x},\mathbf{z}) of the sequence {𝐲k(𝐱,𝐳)}\{\mathbf{y}_{k}(\mathbf{x},\mathbf{z})\}, it follows from Eq. (14) that f(𝐱,𝐲¯(𝐱,𝐳))=f(𝐱)f(\mathbf{x},\bar{\mathbf{y}}(\mathbf{x},\mathbf{z}))=f^{*}(\mathbf{x}) and thus 𝐲¯(𝐱,𝐳)𝒮(𝐱)\bar{\mathbf{y}}(\mathbf{x},\mathbf{z})\in\mathcal{S}(\mathbf{x}). Therefore, since {𝐲k(𝐱,𝐳)}\{\mathbf{y}_{k}(\mathbf{x},\mathbf{z})\} is bounded and 𝒮(𝐱)\mathcal{S}(\mathbf{x}) is a singleton, we have 𝐲k(𝐱,𝐳)𝒮(𝐱)=𝒮^(𝐱)\mathbf{y}_{k}(\mathbf{x},\mathbf{z})\rightarrow\mathcal{S}(\mathbf{x})=\hat{\mathcal{S}}(\mathbf{x}) and thus the UL property with IA holds. ∎
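The O(1/K) bound used in this proof can be checked numerically. In the sketch below, the strongly convex quadratic is an illustrative LL objective with singleton solution set S(x) = {x}, and the bound is instantiated with the squared initial distance in the numerator, following [22, Theorem 10.21]:

```python
import numpy as np

# Projected gradient on the illustrative convex LL objective
# f(x, y) = 0.5*||y - x||^2, whose solution set S(x) = {x} is a singleton.
def run_pg(x, z, alpha, K):
    y = z.copy()
    for _ in range(K):
        y = np.clip(y - alpha*(y - x), -5.0, 5.0)    # Proj_Y with Y a box
    return y

x = np.array([0.3, -0.8])
z = np.array([2.0, 2.0])                             # y_0 = z
alpha = 0.5                                          # alpha in (0, 1/L_f), L_f = 1
gaps, bounds = [], []
for K in (10, 50, 250):
    yK = run_pg(x, z, alpha, K)
    gaps.append(0.5*np.sum((yK - x)**2))             # f(x, y_K) - f*(x)
    bounds.append(np.sum((z - x)**2)/(2*alpha*K))    # dist^2(z, S(x))/(2*alpha*K)
```

On this toy instance the gap decays geometrically, so the sublinear bound holds with plenty of slack at every tested K.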

Next, we consider the iterative module 𝒯k(𝐱,)\mathcal{T}_{k}(\mathbf{x},\cdot) chosen as Nesterov's acceleration scheme described in Eq. (9).

Theorem 4.4.

Suppose 𝒮(𝐱)\mathcal{S}(\mathbf{x}) is singleton for all 𝐱𝒳\mathbf{x}\in\mathcal{X}. Let {𝐲k(𝐱,𝐳)}\{\mathbf{y}_{k}(\mathbf{x},\mathbf{z})\} be the output generated by Eq. (9) with α(0,1/Lf)\alpha\in(0,1/L_{f}), then we have that both the LL and UL convergence properties with IA hold.

Proof.

According to [22, Theorem 10.34], when f(𝐱,)f(\mathbf{x},\cdot) is convex and LfL_{f}-smooth for any 𝐱𝒳\mathbf{x}\in\mathcal{X}, and α(0,1/Lf)\alpha\in(0,1/L_{f}), {𝐲k(𝐱,𝐳)}\{\mathbf{y}_{k}(\mathbf{x},\mathbf{z})\} admits the following property,

f(𝐱,𝐲K(𝐱,𝐳))f(𝐱)\displaystyle f(\mathbf{x},\mathbf{y}_{K}(\mathbf{x},\mathbf{z}))-f^{\ast}(\mathbf{x}) 2dist2(𝐲0(𝐱,𝐳),𝒮(𝐱))α(K+1)2\displaystyle\leq\frac{2\,\mathrm{dist}^{2}(\mathbf{y}_{0}(\mathbf{x},\mathbf{z}),\mathcal{S}(\mathbf{x}))}{\alpha(K+1)^{2}} (15)
=2dist2(𝐳,𝒮(𝐱))α(K+1)2.\displaystyle=\frac{2\,\mathrm{dist}^{2}(\mathbf{z},\mathcal{S}(\mathbf{x}))}{\alpha(K+1)^{2}}.

Since 𝒳\mathcal{X}, 𝒴\mathcal{Y} and 𝒵\mathcal{Z} are compact sets, by using the above inequality and the similar arguments as in the proof of Theorem 4.3, we can show that both the LL and UL convergence properties with IA hold. ∎
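Eq. (9) is defined earlier in the paper; as a stand-in, the standard Nesterov/FISTA-style projected iteration below exhibits the same O(1/K²) behavior on an illustrative quadratic. The bound is instantiated with the squared initial distance, cf. [22, Theorem 10.34]:

```python
import numpy as np

# Standard Nesterov accelerated projected-gradient iteration (an illustrative
# stand-in for Eq. (9)) on the toy convex LL objective f(x, y) = 0.5*(y - x)^2.
def nesterov(x, z, alpha, K):
    y_prev, w, t = z, z, 1.0
    for _ in range(K):
        y = np.clip(w - alpha*(w - x), -5.0, 5.0)    # projected step at w
        t_next = (1 + np.sqrt(1 + 4*t*t))/2
        w = y + ((t - 1)/t_next)*(y - y_prev)        # momentum extrapolation
        y_prev, t = y, t_next
    return y_prev

x, z, alpha = 0.3, 2.0, 0.9                          # alpha in (0, 1/L_f), L_f = 1
results = []
for K in (5, 20, 80):
    gap = 0.5*(nesterov(x, z, alpha, K) - x)**2      # f(x, y_K) - f*(x)
    bound = 2*(z - x)**2/(alpha*(K + 1)**2)          # 2*dist^2/(alpha*(K+1)^2)
    results.append((gap, bound))
```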

4.2 Convergence Analysis for AITNC\textrm{AIT}_{NC}

In this part, we first conduct the asymptotic convergence analysis of AITNC\textrm{AIT}_{NC} in Alg. 3 with a general schematic iterative module 𝒯k\mathcal{T}_{k}. Indeed, Alg. 3 can be regarded as the application of the projected gradient method to the following single-level approximation problem of Eq. (1),

min𝐱𝒳,𝐳𝒴ϕK¯(𝐱,𝐳):=max1kK{F(𝐱,𝐲k(𝐱,𝐳))}\displaystyle\min_{\mathbf{x}\in\mathcal{X},\mathbf{z}\in\mathcal{Y}}~{}~{}\phi_{\bar{K}}(\mathbf{x},\mathbf{z}):=\max_{1\leq k\leq K}\left\{F(\mathbf{x},\mathbf{y}_{k}(\mathbf{x},\mathbf{z}))\right\} (16)
s.t.{𝐲0(𝐱,𝐳)=𝐳,𝐲k+1(𝐱,𝐳)=𝒯k(𝐱,𝐲k(𝐱,𝐳)),k=0,,K1.\displaystyle s.t.\begin{cases}\mathbf{y}_{0}(\mathbf{x},\mathbf{z})=\mathbf{z},&\\ \mathbf{y}_{k+1}(\mathbf{x},\mathbf{z})=\mathcal{T}_{k}(\mathbf{x},\mathbf{y}_{k}(\mathbf{x},\mathbf{z})),~{}~{}k=0,\ldots,K-1.&\end{cases}

Then we investigate the convergence of solutions of this approximation problem towards those of the original BLO under the following mild assumptions.

Assumption 4.2.

We make the following standing assumptions throughout this subsection:

  • (1)

    F,f,𝐲fF,f,\nabla_{\mathbf{y}}f are continuous functions on 𝒳×m\mathcal{X}\times\mathbb{R}^{m}.

  • (2)

    f(𝐱,):mf(\mathbf{x},\cdot):\mathbb{R}^{m}\rightarrow\mathbb{R} is LfL_{f}-smooth for any 𝐱𝒳\mathbf{x}\in\mathcal{X}.

  • (3)

    𝒳\mathcal{X} and 𝒴\mathcal{Y} are convex compact sets.

  • (4)

    𝒮(𝐱)\mathcal{S}(\mathbf{x}) is nonempty for any 𝐱𝒳\mathbf{x}\in\mathcal{X}.

  • (5)

    For any (𝐱¯,𝐲¯)(\bar{\mathbf{x}},\bar{\mathbf{y}}) minimizing F(𝐱,𝐲)F(\mathbf{x},\mathbf{y}) over the constraints 𝐱𝒳,𝐲𝒴\mathbf{x}\in\mathcal{X},\mathbf{y}\in\mathcal{Y} and 𝐲𝒮~(𝐱)\mathbf{y}\in\tilde{\mathcal{S}}(\mathbf{x}), it holds that 𝐲¯𝒮(𝐱¯)\bar{\mathbf{y}}\in\mathcal{S}(\bar{\mathbf{x}}).

Note that 𝒮~(𝐱)\tilde{\mathcal{S}}(\mathbf{x}) denotes the set of LL stationary points, i.e., 𝒮~(𝐱)={𝐲𝒴|0𝐲f(𝐱,𝐲)+𝒩𝒴(𝐲)}.\tilde{\mathcal{S}}(\mathbf{x})=\{\mathbf{y}\in\mathcal{Y}|0\in\nabla_{\mathbf{y}}f(\mathbf{x},\mathbf{y})+\mathcal{N}_{\mathcal{Y}}(\mathbf{y})\}. Assumption 4.2 is standard in the bilevel optimization literature, and it is satisfied by the numerical example given in Section 6.1.

We propose the weak LL property with IA for the asymptotic convergence analysis of Alg. 3 with a general schematic iterative module 𝒯k(𝐱,)\mathcal{T}_{k}(\mathbf{x},\cdot). We say that the weak LL property with IA is satisfied if {𝐲K(𝐱,𝐳)}\{\mathbf{y}_{K}(\mathbf{x},\mathbf{z})\} is uniformly bounded on 𝒳×𝒵\mathcal{X}\times\mathcal{Z}, and there exists α>0\alpha>0 such that for any ϵ>0\epsilon>0, there exists k(ϵ)>0k(\epsilon)>0 such that whenever K>k(ϵ)K>k(\epsilon),

sup𝐱𝒳,𝐳𝒵{min0kKα(𝐱,𝐲k(𝐱,𝐳))}ϵ,\sup_{\mathbf{x}\in\mathcal{X},\mathbf{z}\in\mathcal{Z}}\left\{\min_{0\leq k\leq K}\|\mathcal{R}_{\alpha}(\mathbf{x},\mathbf{y}_{k}(\mathbf{x},\mathbf{z}))\|\right\}\leq\epsilon,

where

α(𝐱,𝐲):=𝐲𝙿𝚛𝚘𝚓𝒴(𝐲α𝐲f(𝐱,𝐲)).\mathcal{R}_{\alpha}(\mathbf{x},\mathbf{y}):=\mathbf{y}-\mathtt{Proj}_{\mathcal{Y}}\left(\mathbf{y}-\alpha\nabla_{\mathbf{y}}f(\mathbf{x},\mathbf{y})\right).

Note that α(𝐱,𝐲)=0\mathcal{R}_{\alpha}(\mathbf{x},\mathbf{y})=0 if and only if 𝐲𝒮~(𝐱)\mathbf{y}\in\tilde{\mathcal{S}}(\mathbf{x}). In the following theorem, we establish the asymptotic convergence of Alg. 3 when the weak LL property with IA is satisfied.

Theorem 4.5.

Suppose the above weak LL property with IA holds. Let (𝐱K,𝐳K)(\mathbf{x}_{K},\mathbf{z}_{K}) be a minimum of approximation problem in Eq. (16), i.e.,

ϕK(𝐱K,𝐳K)ϕK(𝐱,𝐳),𝐱𝒳,𝐳𝒵.\phi_{K}(\mathbf{x}_{K},\mathbf{z}_{K})\leq\phi_{K}(\mathbf{x},\mathbf{z}),\quad\forall\mathbf{x}\in\mathcal{X},\mathbf{z}\in\mathcal{Z}.

Then, any limit point x¯\bar{x} of the sequence {xK}\{x_{K}\} satisfies that x¯argminxXφ(x)\bar{x}\in\mathrm{argmin}_{x\in X}\varphi(x), i.e., x¯\bar{x} is a solution to BLO in Eq. (1).

Proof.

For any K>0K>0, we define

i(K):=argmin0kKα(𝐱,𝐲k(𝐱,𝐳)).i(K):=\mathrm{argmin}_{0\leq k\leq K}\|\mathcal{R}_{\alpha}(\mathbf{x},\mathbf{y}_{k}(\mathbf{x},\mathbf{z}))\|.

For any limit point 𝐱¯\bar{\mathbf{x}} of the sequence {𝐱K}\{\mathbf{x}_{K}\}, let {𝐱l}\{\mathbf{x}_{l}\} be a subsequence of {𝐱K}\{\mathbf{x}_{K}\} such that 𝐱l𝐱¯𝒳\mathbf{x}_{l}\rightarrow\bar{\mathbf{x}}\in\mathcal{X}. As {𝐲i(K)(𝐱K,𝐳K)}𝒴\{\mathbf{y}_{i(K)}(\mathbf{x}_{K},\mathbf{z}_{K})\}\subset\mathcal{Y} and 𝒴\mathcal{Y} is compact, we can find a subsequence {𝐱j}\{\mathbf{x}_{j}\} of {𝐱l}\{\mathbf{x}_{l}\} satisfying 𝐲i(j)(𝐱j,𝐳j)𝐲¯\mathbf{y}_{i(j)}(\mathbf{x}_{j},\mathbf{z}_{j})\rightarrow\bar{\mathbf{y}} for some 𝐲¯𝒴\bar{\mathbf{y}}\in\mathcal{Y}. It follows from the weak LL convergence property that for any ϵ>0\epsilon>0, there exists J(ϵ)>0J(\epsilon)>0 such that for any j>J(ϵ)j>J(\epsilon), we have

α(𝐱j,𝐲i(j)(𝐱j,𝐳j))ϵ.\|\mathcal{R}_{{\alpha}}(\mathbf{x}_{j},\mathbf{y}_{i(j)}(\mathbf{x}_{j},\mathbf{z}_{j}))\|\leq\epsilon.

As 𝒴\mathcal{Y} is a convex compact set, it follows from [30, Theorem 6.42] that 𝙿𝚛𝚘𝚓𝒴\mathtt{Proj}_{\mathcal{Y}} is continuous. Combined with the assumed continuity of 𝐲f(𝐱,𝐲)\nabla_{\mathbf{y}}f(\mathbf{x},\mathbf{y}), we get that α(𝐱,𝐲)\mathcal{R}_{\alpha}(\mathbf{x},\mathbf{y}) is continuous. Then, by letting jj\rightarrow\infty, we have

α(𝐱¯,𝐲¯)ϵ.\|\mathcal{R}_{{\alpha}}(\bar{\mathbf{x}},\bar{\mathbf{y}})\|\leq\epsilon.

As ϵ\epsilon is arbitrarily chosen, we have α(𝐱¯,𝐲¯)=0\mathcal{R}_{{\alpha}}(\bar{\mathbf{x}},\bar{\mathbf{y}})=0 and thus 𝐲¯𝒮~(𝐱¯)\bar{\mathbf{y}}\in\tilde{\mathcal{S}}(\bar{\mathbf{x}}). Then, by using the same arguments as in the proof of [31, Theorem 3.1], we obtain the conclusion. ∎

Next, we will show that when 𝒯k\mathcal{T}_{k} is chosen as the classical projected gradient method and the accelerated gradient scheme proposed by [29] in Eq. (10), the weak LL property with IA is satisfied. We first consider constructing the iterative module 𝒯k(𝐱,)\mathcal{T}_{k}(\mathbf{x},\cdot) by the classical projected gradient method as,

𝒯k+1(𝐱,𝐲k(𝐱,𝐳))=𝙿𝚛𝚘𝚓𝒴(𝐲k(𝐱,𝐳)α𝐲f(𝐱,𝐲k(𝐱,𝐳))),\mathcal{T}_{k+1}\left(\mathbf{x},\mathbf{y}_{k}(\mathbf{x},\mathbf{z})\right)=\mathtt{Proj}_{\mathcal{Y}}\left(\mathbf{y}_{k}(\mathbf{x},\mathbf{z})-\alpha\nabla_{\mathbf{y}}f(\mathbf{x},\mathbf{y}_{k}(\mathbf{x},\mathbf{z}))\right), (17)

where α\alpha denotes the step size parameter, and k=0,,K1k=0,\ldots,K-1.

Theorem 4.6.

Let {𝐲k(𝐱,𝐳)}\{\mathbf{y}_{k}(\mathbf{x},\mathbf{z})\} be the output generated by (17) with α𝐲k[α¯𝐲,α¯𝐲](0,2Lf)\alpha^{k}_{\mathbf{y}}\in[\underline{\alpha}_{\mathbf{y}},\overline{\alpha}_{\mathbf{y}}]\subset(0,\frac{2}{L_{f}}), then we have that the weak LL property with IA holds.

Proof.

It follows from [31, Lemma 3.1] that there exists Cf>0C_{f}>0 such that

min0kKα¯𝐲(𝐱,𝐲k(𝐱,𝐳))CfK+1,𝐱𝒳,𝐳𝒴.\min_{0\leq k\leq K}\|\mathcal{R}_{\underline{\alpha}_{\mathbf{y}}}(\mathbf{x},\mathbf{y}_{k}(\mathbf{x},\mathbf{z}))\|\leq\frac{C_{f}}{\sqrt{K+1}},\quad\forall\mathbf{x}\in\mathcal{X},\mathbf{z}\in\mathcal{Y}.

Then the conclusion follows immediately. ∎

The iterative module 𝒯k(𝐱,)\mathcal{T}_{k}(\mathbf{x},\cdot) can also be the accelerated gradient scheme proposed in [29], which has been described in Eq. (10).

Theorem 4.7.

Let {𝐲k(𝐱,𝐳)}\{\mathbf{y}_{k}(\mathbf{x},\mathbf{z})\} be the output generated by Eq. (10), then we have that the weak LL property with IA holds.

Proof.

Since 𝒴\mathcal{Y} is a compact set, 𝐯0(𝐱,𝐳)\mathbf{v}_{0}(\mathbf{x},\mathbf{z}) and 𝒮(𝐱)\mathcal{S}(\mathbf{x}) are both uniformly bounded for any 𝐱𝒳\mathbf{x}\in\mathcal{X} and 𝐳𝒴\mathbf{z}\in\mathcal{Y}. Then, it follows from [29, Corollary 2] that there exists C>0C>0 such that

min0kK12Lf(𝐱,𝐲k(𝐱,𝐳))CK,𝐱𝒳,𝐳𝒴.\min_{0\leq k\leq K}\|\mathcal{R}_{\frac{1}{2L_{f}}}(\mathbf{x},\mathbf{y}_{k}(\mathbf{x},\mathbf{z}))\|\leq\frac{C}{\sqrt{K}},\quad\forall\mathbf{x}\in\mathcal{X},\mathbf{z}\in\mathcal{Y}.

Then the conclusion follows immediately. ∎

5 Applications

As mentioned above, although many popular hierarchical learning and vision tasks are well understood as the BLO problem in Eq. (1), almost no efficient and accurate solution strategies are available for some challenging tasks, due to the high-dimensional search space and complex theoretical properties. Moreover, many complex deep learning models endowed with nested constrained objectives still lack a proper reformulation and a reliable methodology with grounded theoretical guarantees.

Neural Architecture Search. We first consider one of the most representative large-scale, high-dimensional BLO applications, i.e., Neural Architecture Search (NAS). Typically, NAS aims to find high-performance neural network structures through an automated process. Here, we focus on gradient-based differentiable NAS methods, which have been widely investigated recently. These methods adopt a search strategy to find the optimal network architecture over a well-defined architecture search space. DARTS [7], one of the representative differentiable NAS methods, relaxes the discrete search space to be continuous, and jointly optimizes the architecture parameters (i.e., the UL variable 𝐱\mathbf{x}) and network weights (i.e., the LL variable 𝐲\mathbf{y}) with gradient descent steps, which yields a typical BLO problem. In detail, the searched architecture consists of multiple computation cells, and each cell is a directed acyclic graph with NN nodes. The nodes in each cell, denoted as dd, represent latent feature maps, while the directed edges o(i,j),i,j[1,N]o_{(i,j)},i,j\in[1,N] between nodes did^{i} and djd^{j} indicate candidate operations from the defined search space 𝒪\mathcal{O}. Then 𝐱\mathbf{x} is defined as the collection of continuous variables 𝐱={𝐱(i,j)}\mathbf{x}=\{\mathbf{x}_{(i,j)}\}, where 𝐱(i,j)\mathbf{x}_{(i,j)} denotes the vector of operation mixing weights for the node pair (i,j)(i,j), whose dimension equals the size of 𝒪\mathcal{O}. Hence o(i,j)o_{(i,j)} is the operation chosen by the argmax\arg\max operation, i.e., o(i,j)=argmaxo𝒪𝐱(i,j)o_{(i,j)}=\arg\max_{o\in\mathcal{O}}\mathbf{x}_{{(i,j)}}. For most learning and vision tasks that employ NAS to optimize the network structure, the LL subproblem usually has a high-dimensional search space and complex task-specific constraints.
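The continuous relaxation for a single edge (i, j) can be sketched as follows; the three-operation search space and the mixing weights below are made up for illustration and are not the actual DARTS operation set:

```python
import numpy as np

# Illustrative relaxation for one edge (i, j): the architecture weights
# x_(i,j) mix the candidate operations via softmax, and the discrete
# operation o_(i,j) is recovered by argmax. Operations are toy stand-ins.
ops = {
    "skip": lambda d: d,
    "scale": lambda d: 0.5*d,
    "neg": lambda d: -d,
}
x_ij = np.array([0.2, 1.5, -0.3])         # one mixing weight per operation in O

def mixed_op(d, x_ij):
    w = np.exp(x_ij)/np.sum(np.exp(x_ij))  # softmax over the operation set
    return sum(wi*op(d) for wi, op in zip(w, ops.values()))

d_i = np.array([1.0, -2.0])                # a latent feature map at node i
out = mixed_op(d_i, x_ij)                  # continuous, differentiable in x_(i,j)
chosen = list(ops)[int(np.argmax(x_ij))]   # discrete o_(i,j) = argmax_o x_(i,j)
```

The softmax-mixed output stays differentiable in the architecture weights for the UL update, while the argmax recovers the discrete operation at the end of the search.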

As one of the most widely used solution strategies, DARTS introduces a finite difference approximation with a single-step gradient update of 𝐲\mathbf{y} to avoid complex matrix-vector products. On top of that, a series of variations [32, 33, 34] based on DARTS have been developed and applied in numerous applications. In addition to the approximation adopted by DARTS, one could also solve this BLO application with other GBMs such as RHG and CG. However, since the LL variables contain a cumbersome number of parameters from different search cells, all the above methods lack theoretical support for this non-convex high-dimensional BLO problem. In comparison, by effectively augmenting the iterative trajectory, AIT updates the UL variables with a more accurate hyper-gradient, and is thus better qualified for this BLO problem with a non-convex follower.

Figure 1: The first two subfigures illustrate the convergence behavior of ΦΦ/(Φ+1)\left\|\Phi-\Phi^{*}\right\|/(\left\|\Phi^{*}\right\|+1) and 𝐱𝐱/𝐱\|\mathbf{x}-\mathbf{x}^{*}\|/\|\mathbf{x}^{*}\| for the baseline (i.e., BDA) and variations with techniques from AITC\textrm{AIT}_{C} including IA (i.e., +IA+\textrm{IA}) and prior regularization (i.e., +P+\textrm{P}). As for the last two subfigures, we compare the convergence behavior of AITC\textrm{AIT}_{C} and mainstream GBMs including CG, Neumann, RHG, BDA. We choose the same representative initialization point for all the methods as (𝐱0,𝐲0)=(2𝐞,(2𝐞,2𝐞))(\mathbf{x}_{0},\mathbf{y}_{0})=(2\mathbf{e},(2\mathbf{e},2\mathbf{e})).

Generative Adversarial Networks. We move one step further and apply our AIT to another challenging task with nested constrained learning objectives, i.e., the Generative Adversarial Network (GAN), which has been one of the most popular deep learning applications and has achieved impressive progress in various computer vision tasks. In general, a GAN [35] based model is widely recognized as a minimax game between two adversarial players: the generator network G(𝐱;)G(\mathbf{x};\cdot) (parameterized by 𝐱\mathbf{x}) and the discriminator network D(𝐲;)D(\mathbf{y};\cdot) (parameterized by 𝐲\mathbf{y}). The generative model GG depicts the distribution PGP_{G}, and is supposed to fool DD with generated data sampled from PGP_{G} by optimizing FF to minimize the distance between PGP_{G} and the real data distribution P𝚍𝚊𝚝𝚊P_{\mathtt{data}}. Thus the generic GAN can be defined as min𝐱max𝐲V(𝐱,𝐲)=log(D(𝐲;𝐮))log(1D(𝐲;G(𝐱;𝐯)))\min_{\mathbf{x}}\max_{\mathbf{y}}V(\mathbf{x},\mathbf{y})=-\log(D(\mathbf{y};\mathbf{u}))-\log(1-D(\mathbf{y};G(\mathbf{x};\mathbf{v}))), where V(𝐱,𝐲)V(\mathbf{x},\mathbf{y}) denotes a joint loss function for both objectives, and 𝐮\mathbf{u} and 𝐯\mathbf{v} are data sampled from the real data distribution PdataP_{\text{data}} and the generated fake distribution PGP_{G}, respectively.

Typically, $G(\mathbf{x};\cdot)$ first generates $\mathbf{v}$, and $D(\mathbf{y};\cdot)$ then outputs the predicted probability that $\mathbf{v}$ is sampled from $P_{\mathtt{data}}$. Based on this, most related methods, e.g., the vanilla GAN (VGAN) [36] and the Wasserstein GAN (WGAN) [37], train $G$ and $D$ alternately and simply treat them on an equal footing, which ignores the intrinsic leader-follower relationship between $G(\mathbf{x};\cdot)$ and $D(\mathbf{y};\cdot)$. In this work, we first follow the reformulation in [1] to express the original GAN as:

F(\mathbf{x},\mathbf{y}) = \log(D(\mathbf{y};G(\mathbf{x};\mathbf{v}))),
f(\mathbf{x},\mathbf{y}) = \log D(\mathbf{y};\mathbf{u}) + \log(1-D(\mathbf{y};G(\mathbf{x};\mathbf{v}))).

Thus the LL objective $f$ aims to enhance the ability of $D(\mathbf{y};\cdot)$ to distinguish the fake data generated by $G(\mathbf{x};\cdot)$, while the UL objective $F$ optimizes $\mathbf{x}$ to generate a more "real" data distribution that confuses the discriminator. Under this BLO formulation, most existing methods with alternating training strategies essentially compute the hyper-gradient directly from the final state of the trajectory rather than from the whole sequence, which contains more high-order gradient information. In comparison, our method admits multiple solutions and non-convexity of the LL subproblem, and could thus solve the above problem with a self-contained theoretical guarantee.
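As a hedged illustration of the BLO reformulation above, the two objectives can be sketched as plain functions of the discriminator's outputs; `d_real` and `d_fake` are illustrative stand-ins for $D(\mathbf{y};\mathbf{u})$ and $D(\mathbf{y};G(\mathbf{x};\mathbf{v}))$, not the paper's networks:

```python
import numpy as np

# d_real and d_fake stand in for the discriminator outputs D(y; u) and
# D(y; G(x; v)), both probabilities in (0, 1); the names are illustrative.

def upper_level_F(d_fake):
    # UL objective F(x, y): the generator pushes D(y; G(x; v)) toward 1.
    return np.log(d_fake)

def lower_level_f(d_real, d_fake):
    # LL objective f(x, y): encodes how well the discriminator scores
    # real samples high and generated samples low.
    return np.log(d_real) + np.log(1.0 - d_fake)
```

Under this reading, the discriminator is the follower solving the LL subproblem for a fixed generator, and the generator is the leader evaluated through the resulting LL trajectory.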

6 Experiments

In this section, we report the quantitative and visualization results of the numerical examples, typical learning and vision applications and more challenging tasks. We run the experiments of toy examples based on a PC with Intel Core i7-8700 CPU, 32GB RAM and implement the real-world BLO applications on a Linux server with NVIDIA GeForce RTX 2080Ti GPU (12GB VRAM).

6.1 Numerical Evaluations

At first, we validate the theoretical results of our AIT on numerical examples with different LL assumptions. More specifically, we focus on the UL variable (i.e., $\mathbf{x}$) and the approximated UL subproblem ($\Phi$ is used as a unified notation for $\varphi_{K}(\mathbf{x})$ in Eq. (4), $\varphi_{K}(\mathbf{x},\mathbf{z})$ in Eq. (12) and $\phi_{\bar{K}}(\mathbf{x},\mathbf{z})$ in Eq. (16) under the convex and non-convex scenarios, respectively), and analyze the convergence behavior of $\|\Phi-\Phi^{*}\|/(\|\Phi^{*}\|+1)$ (where $\|\Phi^{*}\|$ equals 0 in some cases) and $\|\mathbf{x}-\mathbf{x}^{*}\|/\|\mathbf{x}^{*}\|$ during the UL optimization. We use a numerical example under the LLC assumption to show the effectiveness of $\textrm{AIT}_{C}$ with IA and the prior regularization term, and compare the convergence behavior of $\textrm{AIT}_{C}$ with a series of mainstream GBMs. Next, we use another example under the well-defined LLS assumption to implement our $\textrm{AIT}_{C}$ with IA and Nesterov's acceleration dynamics.

Moving forward to the toy example with LL non-convexity, we first verify the effectiveness of $\textrm{AIT}_{NC}$ with IA, PTT and the accelerated gradient scheme, and compare $\textrm{AIT}_{NC}$ with the above GBMs to show its substantial improvement. Furthermore, we depict the loss surfaces and the derived solutions to show the significant improvement of $\textrm{AIT}_{NC}$ under the non-convex scenario, and analyze the running time as influenced by different factors, including the initialization points, the dimension of the LL variables (i.e., $n$) and the number of LL iterations (i.e., $K$).

6.1.1 Convex Scenario

We first consider the LLC scenario and conduct numerical experiments with the following example from [12], where $\mathbf{x}\in\mathbb{R}^{n}$, $[\mathbf{y}]_{1}\in\mathbb{R}^{n}$ and $[\mathbf{y}]_{2}\in\mathbb{R}^{n}$:

\min_{\mathbf{x}\in\mathcal{X}} \|\mathbf{x}-[\mathbf{y}]_{2}\|^{4}+\|[\mathbf{y}]_{1}-\mathbf{e}\|^{4}, \quad (18)
\text{s.t. } ([\mathbf{y}]_{1},[\mathbf{y}]_{2})\in\operatorname{argmin}_{[\mathbf{y}]_{1}\in\mathbb{R}^{n},[\mathbf{y}]_{2}\in\mathbb{R}^{n}} \frac{1}{2}\|[\mathbf{y}]_{1}\|^{2}-\mathbf{x}^{\top}[\mathbf{y}]_{1},

where $\mathcal{X}=[-10,10]\times\cdots\times[-10,10]\subset\mathbb{R}^{n}$, and $\mathbf{e}$ denotes the all-ones vector. Note that we use bold text to represent vectors in all of the following description. We set $n=20$, $T=1000$ and choose $(\mathbf{x}_{0},\mathbf{y}_{0})=(2\mathbf{e},(2\mathbf{e},2\mathbf{e}))$ as the initialization point. By simple calculation, we obtain the unique optimal solution $\mathbf{x}^{*}=\mathbf{e}$, $\mathbf{y}^{*}=(\mathbf{e},\mathbf{e})$. This numerical example satisfies the LLC assumption of BDA while violating the LLS assumption required by most GBMs such as RHG, CG [20] and Neumann [21].
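A minimal check (our illustration, not the paper's code) that the LL subproblem of Eq. (18) pins down $[\mathbf{y}]_{1}=\mathbf{x}$ but leaves $[\mathbf{y}]_{2}$ completely free, so the LL solution set is not a singleton and the LLS assumption fails:

```python
import numpy as np

def ll_gradient_step(x, y1, y2, step=0.5):
    # The LL objective of Eq. (18) is (1/2)||y1||^2 - x^T y1, so its
    # gradient w.r.t. y1 is y1 - x, and it does not depend on y2 at all.
    return y1 - step * (y1 - x), y2

n = 20
x = np.ones(n)                          # the optimal UL value x* = e
y1, y2 = 2 * np.ones(n), 2 * np.ones(n)
for _ in range(100):
    y1, y2 = ll_gradient_step(x, y1, y2)
# y1 converges to x, while y2 never moves from its initialization.
```

Running the unrolled LL iterations drives $[\mathbf{y}]_{1}$ to $\mathbf{x}$ while $[\mathbf{y}]_{2}$ stays at its (empirically chosen) initialization, which is precisely why the choice of initialization matters on this example.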

To investigate the effectiveness of Alg. 2, we take BDA as the baseline and implement our $\textrm{AIT}_{C}$ by incrementally adding IA and the prior regularization. In the first two subfigures of Fig. 1, we analyze the convergence behavior of $\|\Phi-\Phi^{*}\|/(\|\Phi^{*}\|+1)$ and $\|\mathbf{x}-\mathbf{x}^{*}\|/\|\mathbf{x}^{*}\|$. As proved in [1], the convergence property of BDA is consistently verified, and it obtains the expected convergence results on this BLO problem. As for the augmentation techniques, the IA embedded in $\textrm{AIT}_{C}$ dynamically adjusts the initialization of $[\mathbf{y}]_{1}$ and $[\mathbf{y}]_{2}$, and the augmented optimization trajectory further improves the convergence speed of the UL variable (i.e., $\mathbf{x}$) and the approximated UL subproblem (i.e., $\Phi$). Besides, with the prior regularization guiding the optimization process of $\mathbf{z}$, our $\textrm{AIT}_{C}$ obtains higher convergence precision and speed.
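To make the IA idea concrete on Eq. (18), the following sketch (our illustration with hypothetical step sizes, not the paper's Alg. 2) treats the LL initialization $\mathbf{z}=(\mathbf{z}_{1},\mathbf{z}_{2})$ as an extra optimization variable. After $K$ LL gradient steps with step size $a$, the iterate has the closed form $\mathbf{y}_{1}^{K}=\mathbf{x}+(1-a)^{K}(\mathbf{z}_{1}-\mathbf{x})$ and $\mathbf{y}_{2}^{K}=\mathbf{z}_{2}$, so the approximated UL objective can be minimized jointly over $\mathbf{x}$ and $\mathbf{z}$:

```python
import numpy as np

def approx_ul(x, z1, z2, K=16, a=0.5):
    # phi_K(x, z): UL objective of Eq. (18) evaluated at the K-th LL
    # iterate started from the auxiliary initialization z = (z1, z2).
    e = np.ones_like(x)
    y1K = x + (1 - a) ** K * (z1 - x)   # closed-form unrolled LL iterate
    y2K = z2                            # the LL updates never touch y2
    return np.sum((x - y2K) ** 2) ** 2 + np.sum((y1K - e) ** 2) ** 2

def num_grad(f, v, eps=1e-6):
    # Central-difference gradient, enough for this low-dimensional toy.
    g = np.zeros_like(v)
    for i in range(v.size):
        d = np.zeros_like(v)
        d[i] = eps
        g[i] = (f(v + d) - f(v - d)) / (2 * eps)
    return g

n, lr = 5, 0.01
x, z1, z2 = 2 * np.ones(n), 2 * np.ones(n), 2 * np.ones(n)
for _ in range(3000):
    gx = num_grad(lambda v: approx_ul(v, z1, z2), x)
    gz1 = num_grad(lambda v: approx_ul(x, v, z2), z1)
    gz2 = num_grad(lambda v: approx_ul(x, z1, v), z2)
    x, z1, z2 = x - lr * gx, z1 - lr * gz1, z2 - lr * gz2
# x approaches the optimum e, and z2 is dragged along to follow x,
# instead of staying frozen at an empirically chosen initialization.
```

Because $[\mathbf{y}]_{2}$ receives no LL gradient, updating $\mathbf{z}_{2}$ through the hyper-gradient is the only mechanism that moves it, which is the role IA plays here.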

In the last two subfigures of Fig. 1, we compare our $\textrm{AIT}_{C}$ with a series of mainstream GBMs, including CG, Neumann, RHG and BDA. Since the LLS assumption of CG, Neumann and RHG cannot be satisfied, no convergence property can be guaranteed for these methods in this case. In comparison with BDA, $\textrm{AIT}_{C}$ also gains an advantage in convergence speed and precision.

Furthermore, to demonstrate the performance of other iterative formats and acceleration dynamics, we introduce another toy example with a stricter LL property, i.e., the well-defined LLS assumption:

\min_{\mathbf{x}\in\mathcal{X}} -\mathbf{e}^{\top}\mathbf{x}+\|\mathbf{x}-[\mathbf{y}]_{2}\|^{4}+\|[\mathbf{y}]_{1}-\mathbf{e}\|^{4}, \quad (19)
\text{s.t. } ([\mathbf{y}]_{1},[\mathbf{y}]_{2})\in\operatorname{argmin}_{[\mathbf{y}]_{1}\in\mathbb{R}^{n},[\mathbf{y}]_{2}\in\mathbb{R}^{n}} \|[\mathbf{y}]_{1}+[\mathbf{y}]_{2}-\mathbf{x}\|^{4},

where $\mathbf{x}\in\mathbb{R}^{n}$, $[\mathbf{y}]_{1}\in\mathbb{R}^{n}$, $[\mathbf{y}]_{2}\in\mathbb{R}^{n}$ and $\mathcal{X}=[-1,1]\times\cdots\times[-1,1]\subset\mathbb{R}^{n}$. For any $\mathbf{x}\in\mathcal{X}$, it holds that $\min_{[\mathbf{y}]_{1}\in\mathbb{R}^{n},[\mathbf{y}]_{2}\in\mathbb{R}^{n}}\|[\mathbf{y}]_{1}+[\mathbf{y}]_{2}-\mathbf{x}\|^{4}=0$, and the unique optimal solution of this BLO problem is $\mathbf{x}^{*}=\mathbf{e}$, $\mathbf{y}^{*}=(\frac{1}{2}\mathbf{e},\frac{1}{2}\mathbf{e})$. Since the solution set of this LL subproblem (i.e., $\mathcal{S}(\mathbf{x})$) is a singleton, the above example can be properly solved by GBMs under the LLS assumption.

Refer to caption Refer to caption
(a) (b)
Figure 2: Illustrating the convergence curves of $\|\Phi-\Phi^{*}\|/(\|\Phi^{*}\|+1)$ and $\|\mathbf{x}-\mathbf{x}^{*}\|/\|\mathbf{x}^{*}\|$ for the baseline (i.e., RHG) and its variations with techniques of $\textrm{AIT}_{C}$, including IA (i.e., $+\textrm{IA}$) and Nesterov's acceleration dynamics (i.e., $+\textrm{A}$), during the LL optimization. The initialization point is chosen as $(\mathbf{x}_{0},\mathbf{y}_{0})=(-\mathbf{e},(-\mathbf{e},-2\mathbf{e}))$.

In this case, we implement our $\textrm{AIT}_{C}$ with RHG as the baseline, and further incorporate Nesterov's acceleration dynamics to improve the convergence behavior. When an appropriate initialization position is chosen, commonly used GBMs are also expected to converge at an acceptable speed. To demonstrate the effectiveness of IA, we set the initialization point to $(\mathbf{x}_{0},\mathbf{y}_{0})=(-\mathbf{e},(-\mathbf{e},-2\mathbf{e}))$, where $\mathbf{z}$ is farther away from the optimal solution $\mathbf{z}^{*}$. As shown in Fig. 2, the IA technique effectively accelerates the convergence of the UL variables towards the global optimal solution, even when RHG converges more slowly due to the improper initialization of the LL variable. In addition, Nesterov's strategy further helps the LL variable (i.e., $\mathbf{y}$) converge faster with the embedded acceleration dynamics.
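The accelerated LL dynamics can be sketched as follows; the quadratic objective below is an assumption chosen for illustration, not the paper's subproblem, and each unrolled iterate takes a gradient step from a momentum look-ahead point:

```python
import numpy as np

def f_ll(y, x):
    # Illustrative smooth, convex LL objective: f(y) = 0.5 * ||y - x||^2.
    return 0.5 * np.sum((y - x) ** 2)

def unroll_nesterov(x, y0, step=0.1, K=100):
    # Unrolled LL trajectory with Nesterov's acceleration dynamics.
    y, y_prev = y0.copy(), y0.copy()
    for k in range(1, K + 1):
        v = y + (k - 1) / (k + 2) * (y - y_prev)  # momentum look-ahead
        y_prev, y = y, v - step * (v - x)         # grad f(v) = v - x
    return y

x, y0 = np.ones(4), 5.0 * np.ones(4)
y_K = unroll_nesterov(x, y0)
```

The hyper-gradient is then backpropagated through this accelerated sequence instead of the plain gradient-descent trajectory, which is how the acceleration dynamics are embedded in the unrolled scheme.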

6.1.2 Non-Convex Scenario

Refer to caption Refer to caption Refer to caption Refer to caption
(a) $(\mathbf{x}_{0},\mathbf{y}_{0})=(-6,0)$ (b) $(\mathbf{x}_{0},\mathbf{y}_{0})=(-8,-8)$
Figure 3: We plot the convergence behavior of $\|\Phi-\Phi^{*}\|/(\|\Phi^{*}\|+1)$ and $\|\mathbf{x}-\mathbf{x}^{*}\|/\|\mathbf{x}^{*}\|$ for the baseline (i.e., RHG) and its variations with IA (i.e., $+\textrm{IA}$), PTT (i.e., $+\textrm{PTT}$) and the accelerated gradient scheme (i.e., $+\textrm{A}$).
Refer to caption Refer to caption Refer to caption Refer to caption
(a) $(\mathbf{x}_{0},\mathbf{y}_{0})=(-6,0)$ (b) $(\mathbf{x}_{0},\mathbf{y}_{0})=(-8,-8)$
Figure 4: In this figure, we compare the convergence behavior of $\|\Phi-\Phi^{*}\|/(\|\Phi^{*}\|+1)$ and $\|\mathbf{x}-\mathbf{x}^{*}\|/\|\mathbf{x}^{*}\|$ for our $\textrm{AIT}_{NC}$ and a series of mainstream GBMs, including CG, Neumann, RHG, BDA and BVFIM.

To validate the superiority of AIT, we move on to a more challenging scenario and introduce another toy example [16] with a non-convex LL subproblem, which takes the form:

\min_{\mathbf{x}\in\mathbb{R},\mathbf{y}\in\mathbb{R}^{n}} \|\mathbf{x}-a\|^{2}+\|\mathbf{y}-a-\mathbf{c}\|^{2}, \quad (20)
\text{s.t. } [\mathbf{y}]_{i}\in\operatorname{argmin}_{[\mathbf{y}]_{i}\in\mathbb{R}} \sin\left(\mathbf{x}+[\mathbf{y}]_{i}-[\mathbf{c}]_{i}\right),\ \forall i,

where both $a\in\mathbb{R}$ and $\mathbf{c}\in\mathbb{R}^{n}$ are constants, and $[\cdot]_{i}$ denotes the $i$-th element of the vector. The optimal solution of this problem can then be calculated in the following form:

\mathbf{x}=\frac{(1-n)a+nC}{1+n}, \text{ and } [\mathbf{y}]_{i}=C+[\mathbf{c}]_{i}-\mathbf{x},\ \forall i,

where

C=\operatorname{argmin}_{C}\left\{\|C-2a\|:C=-\frac{\pi}{2}+2k\pi,\ k\in\mathbb{Z}\right\},

and the corresponding optimal value of the UL objective is $F^{*}=\frac{n(C-2a)^{2}}{1+n}$. Here we set $a=2$, $[\mathbf{c}]_{i}=2$ for $i=1,\ldots,n$, and $[\mathbf{y}]_{i}\in[-10,10]$. It can be easily verified that this example satisfies the theoretical assumptions of $\textrm{AIT}_{NC}$ while violating the LLS assumption of RHG, CG and Neumann and the LLC assumption of BDA, due to the existence of multiple local optimal solutions of the LL subproblem. At first, we set $n=1$ to investigate the convergence property based on the simplified form and to draw the loss surfaces. We then use larger $n$ when analyzing the influence of the dimension of the LL variables on the running time. For $n=1$, the corresponding optimal solution is $(\mathbf{x}^{*},\mathbf{y}^{*})=(3\pi/4,3\pi/4+2)$. In consideration of the distance between the initialization position and the global optimal solution, we choose two initialization points for $\mathbf{x}$ and $\mathbf{y}$: $(\mathbf{x}_{0},\mathbf{y}_{0})=(-6,0)$ and $(\mathbf{x}_{0},\mathbf{y}_{0})=(-8,-8)$.
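The closed-form solution above can be reproduced directly for the setting used in the text ($a=2$, $[\mathbf{c}]_{i}=2$, $n=1$):

```python
import math

def optimal_solution(a=2.0, c=2.0, n=1):
    # C minimizes |C - 2a| over the LL minimizer grid C = -pi/2 + 2*k*pi;
    # rounding to the nearest integer k picks the closest grid point.
    k = round((2 * a + math.pi / 2) / (2 * math.pi))
    C = -math.pi / 2 + 2 * k * math.pi
    x = ((1 - n) * a + n * C) / (1 + n)
    y = C + c - x
    return x, y

x_star, y_star = optimal_solution()  # (3*pi/4, 3*pi/4 + 2)
```

With $a=2$ the nearest grid point is $C=3\pi/2$, which recovers the stated optimum $(\mathbf{x}^{*},\mathbf{y}^{*})=(3\pi/4,3\pi/4+2)$.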

In Fig. 3, we take RHG as the baseline, and start by incrementally embedding the two augmentation techniques and the accelerated gradient scheme of $\textrm{AIT}_{NC}$ into the baseline. It can be seen that adding the PTT technique consistently yields better convergence results with respect to $\mathbf{x}$, while the convergence of $\phi_{\bar{K}}(\mathbf{x},\mathbf{z})$ may be influenced by different initialization points when $\mathbf{z}$ is not dynamically optimized. When we implement our $\textrm{AIT}_{NC}$ with both techniques, since the theoretical convergence property is guaranteed, our algorithm uniformly improves the convergence behavior of $\phi_{\bar{K}}(\mathbf{x},\mathbf{z})$ and $\mathbf{x}$ on this non-convex problem. Furthermore, by embedding the accelerated gradient scheme of Eq. (10) in our $\textrm{AIT}_{NC}$, the convergence rate of gradient descent on this non-convex BLO problem can be further improved.
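Our reading of the PTT step can be sketched as follows; the selection rule below (truncating at the LL iterate where the UL objective is largest, i.e., the pessimistic choice) is our interpretation for illustration, and the exact rule in Alg. 3 may differ in detail:

```python
import numpy as np

def ptt_index(ul_values):
    # ul_values[k-1] = F(x, y_k) evaluated along the K LL iterates.
    # Pessimistic truncation: pick the 1-based index with the largest
    # UL value, and backpropagate only through the first K_bar steps.
    return int(np.argmax(ul_values)) + 1

ul_along_trajectory = [0.8, 1.7, 1.2, 0.9, 0.6]  # toy UL values
K_bar = ptt_index(ul_along_trajectory)           # truncates at step 2
```

The hyper-gradient is then computed from $\mathbf{y}_{\bar{K}}$ rather than $\mathbf{y}_{K}$, which both shortens backpropagation and avoids committing to a tail of the trajectory that may have settled into a poor LL stationary point.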

Refer to caption
Refer to caption
Refer to caption
(b) UL Trajectory
Refer to caption
(a) Loss Surface (c) LL Trajectory
Figure 5: In the first subfigure, we plot the surfaces of the two subproblems and illustrate the iterative solutions derived by RHG and $\textrm{AIT}_{NC}$ with the same initialization point $(\mathbf{x}_{0},\mathbf{y}_{0})=(-6,0)$. In the subfigures on the right, we plot the trajectory of $(\mathbf{x}^{t},\mathbf{z}^{t})$ for our $\textrm{AIT}_{NC}$ during the UL optimization.

Furthermore, in Fig. 4, we compare $\textrm{AIT}_{NC}$ with mainstream GBMs, including CG, Neumann, RHG, BDA and BVFIM, using the same initialization points. As expected, all of the above methods except BVFIM easily get stuck at a local optimal solution. Besides, BVFIM may not converge exactly to the true optimal solution, influenced by its sensitivity to the added penalty terms. In comparison, under our $\textrm{AIT}_{NC}$ algorithm, by dynamically adjusting the initial value of the LL variables and truncating $\mathcal{T}$ for the computation of hyper-gradients, the convergence results without the LLC assumption are consistently verified.

More vividly, in Fig. 5, we mark the initialization point (i.e., $(\mathbf{x}_{0},\mathbf{y}_{0})$), the optimal solution (i.e., $(\mathbf{x}^{*},\mathbf{y}^{*})$), the iterative solutions (i.e., $(\mathbf{x}^{t},\mathbf{y}^{t})$) and the derived convergence solutions (i.e., $(\mathbf{x}^{T},\mathbf{y}^{T})$) for RHG and $\textrm{AIT}_{NC}$ on the UL and LL surfaces. As confirmed earlier by the convergence analysis, since numerous local optima exist along the path from $(\mathbf{x}_{0},\mathbf{y}_{0})$ to $(\mathbf{x}^{*},\mathbf{y}^{*})$, the iterative solutions of RHG easily get stuck at local optima on the loss surface, which leads to a convergence solution distant from $(\mathbf{x}^{*},\mathbf{y}^{*})$. In contrast, the introduced $\mathbf{z}$ together with PTT under our $\textrm{AIT}_{NC}$ (i.e., $\mathbf{z}^{t}$) passes over the local optima and makes its way towards the true optimal solution, which further helps find a reasonable $\mathbf{y}_{\bar{K}}(\mathbf{x}^{t},\mathbf{z}^{t})$ for the computation of the hyper-gradient. Finally, $(\mathbf{x}^{t},\mathbf{y}^{t})$ overcomes the trap of local optima and converges to the global optimal solution, demonstrating the superiority of AIT.

Refer to caption Refer to caption
(a) $(\mathbf{x}_{0},\mathbf{y}_{0})=(-6,0)$ (b) $(\mathbf{x}_{0},\mathbf{y}_{0})=(-8,-8)$
Figure 6: In the two subfigures, we report the average $\bar{K}$ and the running time per iteration of our $\textrm{AIT}_{NC}$ during the LL optimization with different initialization points. The part with a green background indicates that the convergence curve has stabilized ($\|\mathbf{x}-\mathbf{x}^{*}\|<1\times 10^{-3}$).

In Fig. 6, we report the average running time of the hyper-gradient computation with respect to different initialization points in this numerical case. It can be observed that, as the PTT operation of $\textrm{AIT}_{NC}$ dynamically truncates the optimization trajectory and selects $\bar{K}\in[1,K]$ after $K$ iterations of LL optimization, the computational burden of the hyper-gradients w.r.t. $\mathbf{x}$ and $\mathbf{z}$ is continuously eased. In addition, we find that once the curve converges at a favorable speed in Fig. 4, $\bar{K}$ tends to stabilize around a fixed number independent of the chosen initialization points.

Refer to caption Refer to caption
(a) Influence of $K$ (b) Influence of $n$
Figure 7: The two subfigures report the total time of RHG and $\textrm{AIT}_{NC}$ for a single UL iteration as the number of LL iterations $K$ or the dimension of $\mathbf{y}$ increases. We use $\textrm{TG}_{(\cdot)}$ to denote the time for the gradient computation of the corresponding variables, and TF denotes the time for the forward propagation used to construct $\mathcal{T}$. Note that we use the finally stabilized $\bar{K}$ for $\textrm{AIT}_{NC}$ and run 5 steps to compute the average time for comparison.

To evaluate the performance on high-dimensional BLO problems, we further investigate how the dimension of the LL subproblem (i.e., $n$) and the number of LL iterations (i.e., $K$) influence the computational efficiency. Since the computational burden of calculating the hyper-gradient with AD [15] (adopted by classical GBMs, e.g., RHG) has been shown to depend only weakly on the dimension of the UL subproblem, in Fig. 7 we continuously increase the dimension of $\mathbf{y}$ and record the average time for the gradient computation of RHG and $\textrm{AIT}_{NC}$. We observe that $\bar{K}\in[1,K]$ always decreases the time cost of the UL optimization process compared with RHG, even though extra computation for $\bar{K}$ and the hyper-gradient of $\mathbf{z}$ is introduced in $\textrm{AIT}_{NC}$. Besides, we also evaluate the runtime of RHG and $\textrm{AIT}_{NC}$ with increasing $K$. Compared with $n$, increasing $K$ has more influence on the computation of the iterative optimization of $\mathbf{y}_{k}$ and the hyper-gradient of $\mathbf{x}$. Accordingly, the PTT technique chooses a smaller $\bar{K}$ for backpropagation (Steps 12-14 in Alg. 3), thus reducing the computational burden of backpropagation (i.e., $\textrm{PT}_{\mathbf{x}}$ and $\textrm{PT}_{\mathbf{z}}$).

6.2 Typical Learning and Vision Applications

In the following, we further demonstrate the performance of AIT on typical BLO applications characterized by non-convex LL objectives and non-convex LL network structures, including few-shot learning and data hyper-cleaning.

6.2.1 Data Hyper-Cleaning

The goal of data hyper-cleaning is to clean up a dataset in which some labels have been corrupted. This BLO problem defines the UL variable $\mathbf{x}$ as a vector whose dimension equals the number of corrupted samples, while the LL variable $\mathbf{y}$ corresponds to the parameters of the classification model. Following the common problem setting of existing works [2], we first split the dataset into three disjoint subsets, $\mathcal{D}_{\mathtt{tr}}$, $\mathcal{D}_{\mathtt{val}}$ and $\mathcal{D}_{\mathtt{test}}$, and then randomly pollute the labels in $\mathcal{D}_{\mathtt{tr}}$ with a fixed ratio. The UL subproblem can then be defined as $F(\mathbf{x},\mathbf{y})=\sum_{(\mathbf{u}_{i},\mathbf{v}_{i})\in\mathcal{D}_{\mathtt{val}}}\ell(\mathbf{y}(\mathbf{x});\mathbf{u}_{i},\mathbf{v}_{i})$, where $(\mathbf{u}_{i},\mathbf{v}_{i})$ represents a data pair and $\ell(\mathbf{y}(\mathbf{x});\mathbf{u}_{i},\mathbf{v}_{i})$ denotes the cross-entropy loss with classifier parameter $\mathbf{y}$ and data pairs from the different subsets. By introducing the element-wise sigmoid function $\sigma(\mathbf{x})$ to constrain each element to the range $[0,1]$, the LL subproblem can be defined as $f(\mathbf{x},\mathbf{y})=\sum_{(\mathbf{u}_{i},\mathbf{v}_{i})\in\mathcal{D}_{\mathtt{tr}}}[\sigma(\mathbf{x})]_{i}\ell(\mathbf{y};\mathbf{u}_{i},\mathbf{v}_{i})$.
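The two objectives above can be sketched with a toy linear classifier; the data layout and helper names below are illustrative, not the paper's code:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cross_entropy(W, u, v):
    # Softmax cross-entropy of a linear classifier (weights W) on one
    # data pair (u, v), with the usual max-subtraction for stability.
    logits = W @ u
    logits = logits - logits.max()
    return -logits[v] + np.log(np.exp(logits).sum())

def ul_objective(W, val_set):
    # F(x, y): plain (unweighted) loss on the clean validation split.
    return sum(cross_entropy(W, u, v) for u, v in val_set)

def ll_objective(W, x, tr_set):
    # f(x, y): training losses reweighted per-sample by sigmoid(x_i),
    # so corrupted samples can be down-weighted toward zero.
    return sum(sigmoid(x[i]) * cross_entropy(W, u, v)
               for i, (u, v) in enumerate(tr_set))
```

Driving $[\sigma(\mathbf{x})]_{i}$ toward 0 effectively removes the $i$-th (presumably corrupted) training sample from the LL objective, which is the cleaning mechanism the UL optimization learns.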

TABLE II: With a convex network structure, we report the results of existing GBMs, including CG, Neumann, RHG and BDA, for solving data hyper-cleaning tasks. Acc. and F1 score denote the test accuracy and the harmonic mean of precision and recall, respectively. $\textrm{AIT}_{P}$ denotes our AIT with embedded prior regularization.

Method | MNIST Acc. | MNIST F1 score | CIFAR10 Acc. | CIFAR10 F1 score
CG | 89.12 | 87.51 | 36.80 | 68.56
Neumann | 88.90 | 89.90 | 35.15 | 67.79
RHG | 88.96 | 90.49 | 34.80 | 69.75
BDA | 88.55 | 90.99 | 35.56 | 69.58
AIT | 89.58 | 91.35 | 36.95 | $\mathbf{72.19}$
$\textrm{AIT}_{P}$ | $\mathbf{89.63}$ | $\mathbf{91.98}$ | $\mathbf{38.08}$ | 70.75
TABLE III: Reporting the results of existing methods for solving data hyper-cleaning tasks with a non-convex LL network structure.

Method | MNIST Acc. | MNIST F1 score | FashionMNIST Acc. | FashionMNIST F1 score | CIFAR10 Acc. | CIFAR10 F1 score
CG | 89.19 | 85.96 | 83.15 | 85.13 | 34.16 | 69.10
Neumann | 87.54 | 89.58 | 81.37 | 87.28 | 33.45 | 68.87
RHG | 87.90 | 89.36 | 81.91 | 87.12 | 34.95 | 68.27
BDA | 87.15 | 90.38 | 79.97 | 88.24 | 36.41 | 67.33
AIT | $\mathbf{90.88}$ | 91.57 | 83.67 | 90.37 | 37.16 | 71.57
$\textrm{AIT}_{P}$ | 90.41 | $\mathbf{91.95}$ | $\mathbf{83.80}$ | $\mathbf{90.40}$ | $\mathbf{37.70}$ | $\mathbf{72.74}$
TABLE IV: Mean test accuracy of 5-way classification on MiniImageNet and TieredImageNet with non-convex LL objectives. We use $\pm$ to report the accuracy with $95\%$ confidence intervals over tasks.

Network | Method | MiniImagenet 5-way 1-shot | MiniImagenet 5-way 5-shot | TieredImagenet 5-way 1-shot | TieredImagenet 5-way 5-shot
ConvNet-4 | MAML | $48.70\pm 0.75\%$ | $63.11\pm 0.11\%$ | $49.06\pm 0.50\%$ | $67.48\pm 0.47\%$
ConvNet-4 | RHG | $48.89\pm 0.81\%$ | $63.02\pm 0.70\%$ | $49.63\pm 0.67\%$ | $66.14\pm 0.57\%$
ConvNet-4 | T-RHG | $47.67\pm 0.82\%$ | $63.70\pm 0.76\%$ | $50.79\pm 0.69\%$ | $67.39\pm 0.60\%$
ConvNet-4 | BDA | $49.08\pm 0.82\%$ | $62.17\pm 0.70\%$ | $51.56\pm 0.68\%$ | $\mathbf{68.21\pm 0.58\%}$
ConvNet-4 | AIT | $\mathbf{49.80\pm 0.61\%}$ | $\mathbf{64.76\pm 0.54\%}$ | $\mathbf{51.86\pm 0.68\%}$ | $68.01\pm 0.60\%$
ResNet-12 | MAML | $51.03\pm 0.50\%$ | $68.26\pm 0.47\%$ | $58.58\pm 0.49\%$ | $71.24\pm 0.43\%$
ResNet-12 | RHG | $50.54\pm 0.85\%$ | $64.53\pm 0.68\%$ | $58.19\pm 0.76\%$ | $75.20\pm 0.60\%$
ResNet-12 | AIT | $\mathbf{56.69\pm 0.66\%}$ | $\mathbf{70.21\pm 0.55\%}$ | $\mathbf{60.71\pm 0.77\%}$ | $\mathbf{75.85\pm 0.59\%}$

Typically, to satisfy the LLC assumption required by existing GBMs, the LL classification model is defined as a single fully-connected layer, and multi-layer classifiers with more parameters are not accessible. Based on AIT, these constraints can be relaxed, so we have more choices when designing the LL model. Besides, the prior regularization also serves as an effective tool to improve the performance of AIT. To find the prior-determined value $\mathbf{z}_{p}$ for the IA variable (i.e., $\mathbf{z}$), we pretrain the LL classification model on $\mathcal{D}_{\mathtt{tr}}$ and restore the pretrained model weights as the proximal prior to adjust the updates and improve the convergence behavior of $\mathbf{z}$. We can then specify the form of the prior regularization in Eq. (12). In practice, we implement the prior regularization $g(\mathbf{z})$ as $\ell_{2}$ regularization, and add this term only in the first iteration to investigate how it influences the subsequent optimization process. Three well-known datasets, including MNIST, FashionMNIST and CIFAR10, are used to conduct the experiments. Specifically, 5000, 5000 and 10000 samples are randomly selected to construct $\mathcal{D}_{\mathtt{tr}}$, $\mathcal{D}_{\mathtt{val}}$ and $\mathcal{D}_{\mathtt{test}}$, and then half of the labels in $\mathcal{D}_{\mathtt{tr}}$ are tampered with.

We first implement this application with a convex LL network structure, i.e., a single fully-connected classification layer. The shape of $\mathbf{y}$ is then $\mathbf{y}\in\mathbb{R}^{10\times d}$, where $d$ denotes the dimension of the flattened input data. In Tab. II, we compare our AIT with CG, Neumann, RHG and BDA. Consistent with the numerical experiments above, we observe that AIT improves the performance in both accuracy and F1 score on the MNIST and CIFAR10 datasets. Besides, though the prior regularization term only exists in the first iteration, it helps correct the initial state of the constructed optimization dynamics and further improves the performance of AIT on both datasets.

In addition, we introduce a non-convex network structure with two fully-connected layers to expand the search space of the LL subproblem, where $\mathbf{y}\in\mathbb{R}^{10}\times\mathbb{R}^{301\times d}$. In Tab. III, we compare our AIT and its variation with prior regularization (i.e., $\textrm{AIT}_{P}$) against RHG, BDA, CG and Neumann on validation accuracy, F1 score and running time. As shown, our method performs the best among the GBMs in both accuracy and F1 score. Besides, although only used as a practical technique for solving non-convex BLOs, the embedded proximal prior also consistently improves the performance of $\textrm{AIT}_{P}$ on all three datasets.

6.2.2 Few-Shot Learning

Few-shot learning applications, more specifically few-shot classification tasks [38], train a model to classify unseen instances with only a few samples, based on prior training data from similar tasks. For N-way K-shot classification, we first define the training dataset as $\mathcal{D}=\{\mathcal{D}^{j}\}$, where $\mathcal{D}^{j}=\mathcal{D}_{\mathtt{tr}}^{j}\bigcup\mathcal{D}_{\mathtt{val}}^{j}$ corresponds to the meta-train and meta-validation sets of the $j$-th task. In the training phase, we select $K$ samples from each of the $N$ classes in $\mathcal{D}_{\mathtt{tr}}$ for meta training, and use more data from these classes in $\mathcal{D}_{\mathtt{val}}$ for meta validation. In the test phase, we construct new tasks on the test dataset under the same data distribution to evaluate the performance.

We follow the commonly used parameterization for GBMs, where the UL variables $\mathbf{x}$ correspond to the shared hyper-representation module, while the LL variables $\mathbf{y}$ correspond to the task-specific classifiers. For this classification problem, the cross-entropy function (denoted as $\ell$) is commonly used for both the LL objective $f$ and the UL objective $F$. Besides, $\ell_{q}$ regularization with $0<q<1$ can be applied to stabilize the training and effectively avoid over-fitting; while existing methods only guarantee convergence in the convex scenario where $q\geq 1$, our AIT supports a more flexible non-convex LL objective without the LLC or LLS constraints. We thus add $\|\mathbf{y}^{j}\|_{q}$ as a non-convex regularization term and define the objectives as $F(\mathbf{x},\{\mathbf{y}^{j}\})=\sum_{j}\ell(\mathbf{x},\mathbf{y}^{j};\mathcal{D}_{\mathtt{val}}^{j})$ and $f(\mathbf{x},\{\mathbf{y}^{j}\})=\sum_{j}\ell(\mathbf{x},\mathbf{y}^{j};\mathcal{D}_{\mathtt{tr}}^{j})+\|\mathbf{y}^{j}\|_{q}$. This implementation of a non-convex LL objective represents another class of complex LL subproblems for BLOs, which can be used to verify the significant improvement of AIT.
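The non-convex $\ell_{q}$ penalty can be sketched as follows; the small smoothing constant `eps` is our addition to avoid the non-differentiable point at zero, not part of the paper's formulation:

```python
import numpy as np

def lq_penalty(y, q=0.5, eps=1e-12):
    # sum_i |[y]_i|^q with 0 < q < 1; this is non-convex (q < 1 makes
    # the map t -> |t|^q concave on t >= 0), unlike the usual q >= 1 norms.
    return np.sum((np.abs(y) + eps) ** q)
```

For $q\geq 1$ the penalty reduces to the standard convex norm regularizer covered by existing convergence theory; choosing $0<q<1$ is exactly what pushes the LL subproblem outside the LLC setting.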

We implement 5-way 1-shot and 5-way 5-shot classification tasks to validate the performance on the MiniImagenet [39] and TieredImagenet [40] datasets. Both datasets are subsets constructed from the larger ILSVRC-12 dataset, while TieredImagenet includes more classes than MiniImagenet. Besides, TieredImagenet categorizes the source data with a hierarchical structure to split the training and test datasets, which simulates more realistic scenarios.

In Tab. IV, we compare our AIT with representative methods including MAML [4], RHG, T-RHG [2] and BDA. We consider two backbones for the hyper-representation module: ConvNet-4, with 4 convolutional blocks, and the larger ResNet-12, with residual blocks. We use a single fully-connected layer as the task-specific classifier for both backbones. For MAML, the hyper-representation module and the classifier are treated as a whole and updated as the initialization parameters. We first use ConvNet-4 as the backbone and compare AIT with RHG, T-RHG, BDA and MAML; we then implement the ResNet-12 backbone and compare AIT with the initialization-based method (i.e., MAML) and the recurrence-based method (i.e., RHG). As shown in Tab. IV, for the 5-way 1-shot and 5-way 5-shot classification tasks on MiniImagenet and TieredImagenet, AIT shows significant accuracy improvements with both backbones as the hyper-representation module.

6.3 Extensions for More Challenging Tasks

In this subsection, we first apply AIT to a large-scale and high-dimensional real-world BLO application, i.e., neural architecture search. We then move on to another complex deep learning model, i.e., the generative adversarial network, to demonstrate its generalization performance.

6.3.1 Neural Architecture Search

We denote the training, validation and test sets as $(\mathcal{D}_{\mathtt{tr}},\mathcal{D}_{\mathtt{val}},\mathcal{D}_{\mathtt{test}})$; the UL and LL objectives can then be written as $F(\mathbf{x},\mathbf{y};\mathcal{D}_{\mathtt{val}})$ and $f(\mathbf{x},\mathbf{y};\mathcal{D}_{\mathtt{tr}})$. We follow the standard training strategy of DARTS, which consists of a searching stage and an inference stage. In the searching stage, DARTS executes a single-step LL optimization (Step 4 in Alg. 1) with $f(\mathbf{x},\mathbf{y};\mathcal{D}_{\mathtt{tr}})$, and continuously optimizes the architecture parameter $\mathbf{x}$ (Step 9 in Alg. 1) to find better cell structures according to the UL objective $F(\mathbf{x},\mathbf{y};\mathcal{D}_{\mathtt{val}})$. In the inference stage, the architecture parameter $\mathbf{x}$ is fixed, and we reuse the searched cell structure to construct a larger architecture and train it from scratch on $\mathcal{D}_{\mathtt{tr}}$. Finally, we test the performance of the model trained at the inference stage on $\mathcal{D}_{\mathtt{test}}$. In this paper, we conduct the experiments on the CIFAR-10 dataset, and use the same search space and similar experimental settings as DARTS. The stacked cells of the final architecture have two types: reduction cells and normal cells. In a reduction cell, the operations adjacent to the input nodes use stride 2 to reduce the input shape, while normal cells maintain the original shape with stride 1 for the first two nodes. We use 3 layers of cells for searching over 50 epochs, construct a larger structure with 8 layers of the searched cells, and train the model from scratch for 600 epochs in total.
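The alternating single-step scheme of the searching stage can be sketched as follows; the quadratic objectives are toy stand-ins for $f(\mathbf{x},\mathbf{y};\mathcal{D}_{\mathtt{tr}})$ and $F(\mathbf{x},\mathbf{y};\mathcal{D}_{\mathtt{val}})$, chosen only to make the loop runnable, not the actual DARTS losses:

```python
import numpy as np

def search(x, y, steps=200, lr=0.1):
    # Alternating scheme: one LL gradient step on the weights y
    # (cf. Step 4 in Alg. 1), then one UL gradient step on the
    # architecture parameters x (cf. Step 9 in Alg. 1).
    for _ in range(steps):
        y = y - lr * (y - x)     # grad_y f for toy f = 0.5 * ||y - x||^2
        x = x - lr * (x - 1.0)   # grad_x F for toy F = 0.5 * ||x - 1||^2
    return x, y

x, y = search(np.zeros(3), np.zeros(3))
```

In the real application, the LL step trains the supernet weights for one iteration and the UL step moves the architecture mixing parameters, after which the searched cells are discretized and retrained from scratch.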

Figure 8: Comparison of the validation loss (a) and validation accuracy (b) of DARTS, CG, RHG and AIT in the searching process.

In Fig. 8, we report the validation loss and accuracy after each epoch to evaluate the convergence behavior of DARTS, RHG, CG and our AIT. RHG, CG and DARTS attain similar validation performance on this non-convex BLO application, while AIT significantly improves the convergence results on \mathcal{D}_{\mathtt{val}}. Note that several works [41, 42] found that DARTS may find the global minimum within a very small search space, but it then starts overfitting the architecture parameters on \mathcal{D}_{\mathtt{val}}, which leads to poor generalization performance. To verify that the searched architecture performs consistently, in Tab. V we also report the accuracy on \mathcal{D}_{\mathtt{tr}} and \mathcal{D}_{\mathtt{val}} at different stages. Only AIT maintains strong performance on both \mathcal{D}_{\mathtt{tr}} and \mathcal{D}_{\mathtt{val}}, and it also improves the test performance of the finally constructed network by a large margin.

TABLE V: Top-1 accuracy at the searching, inference and final test stages for DARTS, RHG, CG and AIT. We also report the number of parameters of the searched structure (MB) for different methods.
Method   Searching (Train / Valid)   Inference (Train / Valid)   Test     Params
DARTS    98.320 / 88.940             97.952 / 96.710             96.670   1.277
RHG      98.448 / 89.556             98.494 / 97.010             96.920   1.359
CG       99.126 / 89.298             98.286 / 96.690             96.640   1.268
AIT      98.904 / 99.512             98.718 / 97.470             97.330   1.963
Figure 9: Visualization of the searched normal cells (top) and reduction cells (bottom) for (a) DARTS, (b) RHG, (c) CG and (d) AIT. Note that edges labeled in blue denote operations with fewer parameters, such as skip connections and pooling operations, while edges labeled in red denote more complex operations such as 5×5 separable convolutions.

Figure 10: Results on the synthesized 2D Pentagram distribution, comparing AIT with existing GAN methods: the target distribution and samples generated by (a) VGAN, (b) WGAN, (c) UGAN and (d) AIT.

Besides, it has been observed [42] that DARTS tends to choose parameter-free operations such as skip connections instead of convolutions, since skip connections often lead to rapid loss descent, especially on proxy datasets that are small and easy to fit. As a result, the generated structure may contain relatively few learnable parameters. In Fig. 9, we illustrate the searched normal cells and reduction cells to show the differences between these GBMs and AIT. Since DARTS derives an approximate hyper-gradient w.r.t. \mathbf{x} compared to RHG, the two methods find similar structures, although RHG uses more complex edges with more parameters, as reported in Tab. V. While CG finds structures that differ markedly from those of DARTS and RHG, it offers no guarantee of convergence or of generalization performance on the test set. In comparison, with its convergence guarantee, AIT constructs normal cells and reduction cells with more complex separable convolutions, thereby significantly improving the expressive ability and generalization performance of the searched architecture.

6.3.2 Generative Adversarial Networks

In practice, we conduct the experiments on a synthesized two-dimensional mixture of Gaussian distributions. Specifically, we generate 10 Gaussians arranged in the shape of a pentagram with maximum radius r=8. In Fig. 10, we compare the visualization results generated by VGAN, Unrolled GAN [43] (UGAN), WGAN and AIT. AIT captures more modes of the target distribution than the above methods, whereas the original GAN methods only capture a small number of modes; this demonstrates the effectiveness of AIT for solving GAN-based applications. As for quantitative results, in Tab. VI we report three commonly used metrics: Fréchet Inception Distance (FID) [36], Jensen-Shannon divergence (JS) [44], and the number of recovered modes (Mode). Our AIT not only recovers all the modes under different random seeds, but also gains significant performance improvements on both JS and FID.
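The synthetic target above can be reproduced with a short sampler. The inner-vertex radius and the component standard deviation sigma are illustrative choices, as the paper only specifies 10 Gaussians and a maximum radius of 8:

```python
import numpy as np

# Sketch of the 2D Pentagram target: 10 Gaussians placed on a pentagram,
# 5 outer star tips at radius r_out = 8 and 5 inner vertices at r_in.
# r_in and sigma are assumptions, not values given in the paper.
rng = np.random.default_rng(0)
r_out, r_in, sigma = 8.0, 3.0, 0.1

angles_out = np.pi / 2 + 2 * np.pi * np.arange(5) / 5  # outer star tips
angles_in = angles_out + np.pi / 5                     # inner vertices
centers = np.concatenate([
    r_out * np.stack([np.cos(angles_out), np.sin(angles_out)], axis=1),
    r_in * np.stack([np.cos(angles_in), np.sin(angles_in)], axis=1),
])  # (10, 2) array of mode centers

def sample_target(n):
    """Draw n points from the equal-weight 10-component mixture."""
    idx = rng.integers(0, 10, size=n)
    return centers[idx] + sigma * rng.standard_normal((n, 2))

pts = sample_target(5000)
print(pts.shape)  # (5000, 2)
```

The GAN's generator is then trained to match samples drawn from `sample_target`.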

TABLE VI: Results of existing methods on the 2D Pentagram Gaussian distribution, obtained by running each experiment with 3 random seeds. We compare our AIT with VGAN, UGAN and WGAN on three metrics.
Method 2D Pentagram (Max Mode=10)
FID JS Mode
VGAN 1.214 ± 0.37 0.191 ± 0.094 8.00 ± 0.00
WGAN 2.404 ± 3.80 0.746 ± 0.064 3.33 ± 2.52
UGAN 2.355 ± 2.93 0.762 ± 0.083 3.33 ± 2.89
AIT 0.256 ± 0.13 0.187 ± 0.046 10.00 ± 0.00
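The "Mode" metric in Tab. VI can be computed by counting mixture components that attract enough generated samples. The distance threshold (3 sigma) and the 1% sample quota below are illustrative conventions, not specified in the paper:

```python
import numpy as np

def count_modes(samples, centers, sigma=0.1, min_frac=0.01):
    """Count mixture components 'recovered' by the generated samples: a mode
    counts if at least min_frac of samples land within 3*sigma of its center."""
    d = np.linalg.norm(samples[:, None, :] - centers[None, :, :], axis=2)
    nearest = np.argmin(d, axis=1)                      # assign each sample
    close = d[np.arange(len(samples)), nearest] < 3 * sigma
    covered = 0
    for k in range(len(centers)):
        frac = np.mean(close & (nearest == k))          # hits on mode k
        covered += int(frac >= min_frac)
    return covered

# Toy check with 3 modes, of which the samples only hit the first two
centers = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
rng = np.random.default_rng(0)
samples = np.concatenate([
    centers[0] + 0.1 * rng.standard_normal((500, 2)),
    centers[1] + 0.1 * rng.standard_normal((500, 2)),
])
print(count_modes(samples, centers))  # 2
```

The same routine applied to the 10-center pentagram layout yields the "Max Mode = 10" ceiling reported in the table header.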

7 Conclusion

To summarize, our approach involves a thorough analysis of the gradient-based BLO scheme, identifying two significant deficiencies of the existing iterative trajectory. To address these deficiencies, we propose two augmentation techniques, namely Initialization Auxiliary (IA) and Pessimistic Trajectory Truncation (PTT), and develop a series of variations based on these techniques, such as prior regularization and different forms of acceleration dynamics, to construct our Augmented Iterative Trajectory (AIT) for both convex and non-convex scenarios (i.e., \textrm{AIT}_{C} and \textrm{AIT}_{NC}). Theoretically, we establish a detailed convergence analysis of AIT for different types of BLOs, which fills the gap in non-convex theory. Finally, we conduct a series of experiments on various BLO settings using numerical examples, and demonstrate the effectiveness of our approach on typical learning and vision tasks as well as more challenging deep learning models such as NAS and GANs.

Acknowledgments

This work is partially supported by the National Key R&D Program of China (2022YFA1004101), the National Natural Science Foundation of China (Nos. U22B2052, 61922019, 12222106), the Shenzhen Science and Technology Program (No. RCYX20200714114700072), the Pacific Institute for the Mathematical Sciences (PIMS), and the Guangdong Basic and Applied Basic Research Foundation (No. 2022B1515020082).

References

  • [1] R. Liu, J. Gao, J. Zhang, D. Meng, and Z. Lin, “Investigating bi-level optimization for learning and vision from a unified perspective: A survey and beyond,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021.
  • [2] A. Shaban, C.-A. Cheng, N. Hatch, and B. Boots, “Truncated back-propagation for bilevel optimization,” in AISTATS, 2019.
  • [3] A. Nichol, J. Achiam, and J. Schulman, “On first-order meta-learning algorithms,” arXiv:1803.02999, 2018.
  • [4] C. Finn, P. Abbeel, and S. Levine, “Model-agnostic meta-learning for fast adaptation of deep networks,” in ICML, 2017.
  • [5] B. Wu, X. Dai, P. Zhang, Y. Wang, F. Sun, Y. Wu, Y. Tian, P. Vajda, Y. Jia, and K. Keutzer, “Fbnet: Hardware-aware efficient convnet design via differentiable neural architecture search,” in IEEE CVPR, 2019, pp. 10 734–10 742.
  • [6] Y. Hu, X. Wu, and R. He, “Tf-nas: Rethinking three search freedoms of latency-constrained differentiable neural architecture search,” in ECCV, 2020.
  • [7] H. Liu, K. Simonyan, and Y. Yang, “Darts: Differentiable architecture search,” arXiv preprint arXiv:1806.09055, 2018.
  • [8] H. Zhang, W. Chen, Z. Huang, M. Li, Y. Yang, W. Zhang, and J. Wang, “Bi-level actor-critic for multi-agent coordination,” in AAAI, vol. 34, no. 05, 2020, pp. 7325–7332.
  • [9] L. Wang, Q. Cai, Z. Yang, and Z. Wang, “On the global optimality of model-agnostic meta-learning: Reinforcement learning and supervised learning,” in ICML, 2020.
  • [10] G. M. Moore, Bilevel programming algorithms for machine learning model selection.   Rensselaer Polytechnic Institute, 2010.
  • [11] A. Rajeswaran, C. Finn, S. M. Kakade, and S. Levine, “Meta-learning with implicit gradients,” in NeurIPS, 2019.
  • [12] R. Liu, P. Mu, X. Yuan, S. Zeng, and J. Zhang, “A generic first-order algorithmic framework for bi-level programming beyond lower-level singleton,” in International Conference on Machine Learning, 2020, pp. 6305–6315.
  • [13] L. L. Gao, J. Ye, H. Yin, S. Zeng, and J. Zhang, “Value function based difference-of-convex algorithm for bilevel hyperparameter selection problems,” in International Conference on Machine Learning.   PMLR, 2022, pp. 7164–7182.
  • [14] A. B. Zemkoho and S. Zhou, “Theoretical and numerical comparison of the karush–kuhn–tucker and value function reformulations in bilevel optimization,” Computational Optimization and Applications, pp. 1–50, 2020.
  • [15] L. Franceschi, M. Donini, P. Frasconi, and M. Pontil, “Forward and reverse gradient-based hyperparameter optimization,” arXiv:1703.01785, 2017.
  • [16] R. Liu, X. Liu, S. Zeng, J. Zhang, and Y. Zhang, “Value-function-based sequential minimization for bi-level optimization,” arXiv preprint arXiv:2110.04974, 2021.
  • [17] D. Maclaurin, D. Duvenaud, and R. Adams, “Gradient-based hyperparameter optimization through reversible learning,” in ICML, 2015.
  • [18] L. Franceschi, P. Frasconi, S. Salzo, R. Grazzi, and M. Pontil, “Bilevel programming for hyperparameter optimization and meta-learning,” in ICML, 2018.
  • [19] R. Liu, P. Mu, X. Yuan, S. Zeng, and J. Zhang, “A general descent aggregation framework for gradient-based bi-level optimization,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.
  • [20] F. Pedregosa, “Hyperparameter optimization with approximate gradient,” arXiv:1602.02355, 2016.
  • [21] J. Lorraine, P. Vicol, and D. Duvenaud, “Optimizing millions of hyperparameters by implicit differentiation,” in AISTATS, 2020.
  • [22] R. Liu, X. Liu, X. Yuan, S. Zeng, and J. Zhang, “A value-function-based interior-point method for non-convex bi-level optimization,” in ICML, 2021.
  • [23] K. Ji, J. Yang, and Y. Liang, “Provably faster algorithms for bilevel optimization and applications to meta-learning,” 2020.
  • [24] T. Chen, Y. Sun, Q. Xiao, and W. Yin, “A single-timescale method for stochastic bilevel optimization,” in International Conference on Artificial Intelligence and Statistics.   PMLR, 2022, pp. 2466–2488.
  • [25] M. Hong, H.-T. Wai, Z. Wang, and Z. Yang, “A two-timescale framework for bilevel optimization: Complexity analysis and application to actor-critic,” arXiv preprint arXiv:2007.05170, 2020.
  • [26] P. Khanduri, S. Zeng, M. Hong, H.-T. Wai, Z. Wang, and Z. Yang, “A momentum-assisted single-timescale stochastic approximation algorithm for bilevel optimization,” 2021.
  • [27] Q. Xiao, H. Shen, W. Yin, and T. Chen, “Alternating implicit projected sgd and its efficient variants for equality-constrained bilevel optimization,” arXiv preprint arXiv:2211.07096, 2022.
  • [28] A. Beck and M. Teboulle, “A fast iterative shrinkage-thresholding algorithm for linear inverse problems,” SIAM Journal on Imaging Sciences, vol. 2, no. 1, pp. 183–202, 2009.
  • [29] S. Ghadimi and G. Lan, “Accelerated gradient methods for nonconvex nonlinear and stochastic programming,” Mathematical Programming, vol. 156, no. 1, pp. 59–99, 2016.
  • [30] A. Beck, First-order methods in optimization.   SIAM, 2017.
  • [31] R. Liu, Y. Liu, S. Zeng, and J. Zhang, “Towards gradient-based bilevel optimization with non-convex followers and beyond,” Advances in Neural Information Processing Systems, vol. 34, pp. 8662–8675, 2021.
  • [32] Y. Xu, L. Xie, X. Zhang, X. Chen, G.-J. Qi, Q. Tian, and H. Xiong, “Pc-darts: Partial channel connections for memory-efficient architecture search,” arXiv preprint arXiv:1907.05737, 2019.
  • [33] X. Chen, L. Xie, J. Wu, and Q. Tian, “Progressive differentiable architecture search: Bridging the depth gap between search and evaluation,” in Proceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 1294–1303.
  • [34] X. Dong and Y. Yang, “Searching for a robust neural architecture in four gpu hours,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 1761–1770.
  • [35] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial networks,” Communications of the ACM, vol. 63, no. 11, pp. 139–144, 2020.
  • [36] ——, “Generative adversarial nets,” Advances in neural information processing systems, vol. 27, 2014.
  • [37] M. Arjovsky, S. Chintala, and L. Bottou, “Wasserstein gan,” 2017. [Online]. Available: https://arxiv.org/abs/1701.07875
  • [38] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
  • [39] O. Vinyals, C. Blundell, T. Lillicrap, D. Wierstra et al., “Matching networks for one shot learning,” in NeurIPS, 2016.
  • [40] M. Ren, E. Triantafillou, S. Ravi, J. Snell, K. Swersky, J. B. Tenenbaum, H. Larochelle, and R. S. Zemel, “Meta-learning for semi-supervised few-shot classification,” arXiv:1803.00676, 2018.
  • [41] X. Chu, T. Zhou, B. Zhang, and J. Li, “Fair darts: Eliminating unfair advantages in differentiable architecture search,” in European conference on computer vision.   Springer, 2020, pp. 465–480.
  • [42] A. Zela, T. Elsken, T. Saikia, Y. Marrakchi, T. Brox, and F. Hutter, “Understanding and robustifying differentiable architecture search,” in International Conference on Learning Representations, 2020.
  • [43] L. Metz, B. Poole, D. Pfau, and J. Sohl-Dickstein, “Unrolled generative adversarial networks,” arXiv preprint arXiv:1611.02163, 2016.
  • [44] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, “Gans trained by a two time-scale update rule converge to a local nash equilibrium,” Advances in neural information processing systems, vol. 30, 2017.