
Scalable Acceleration for Classification-Based Derivative-Free Optimization

Tianyi Han*, Jingya Li, Zhipeng Guo, Yuan Jin*
Abstract

Derivative-free optimization algorithms play an important role in scientific and engineering design optimization problems, especially when derivative information is not accessible. In this paper, we study the framework of sequential classification-based derivative-free optimization algorithms. By introducing the learning-theoretic concept of hypothesis-target shattering rate, we revisit the computational complexity upper bound of SRACOS (Hu, Qian, and Yu 2017). Inspired by the revisited upper bound, we propose an algorithm named RACE-CARS, which adds a random region-shrinking step to SRACOS. We further establish theorems showing the acceleration afforded by region shrinking. Experiments on synthetic functions as well as black-box tuning for language-model-as-a-service demonstrate empirically the efficiency of RACE-CARS. An ablation experiment on the introduced hyper-parameters is also conducted, revealing the mechanism of RACE-CARS and offering empirical guidance for hyper-parameter tuning.

Introduction

In recent years, there has been a growing interest in the field of derivative-free optimization (DFO) algorithms, also known as zeroth-order optimization. These algorithms aim to optimize objective functions without relying on explicit gradient information, making them suitable for scenarios where obtaining derivatives is either infeasible or computationally expensive (Conn, Scheinberg, and Vicente 2009; Larson, Menickelly, and Wild 2019). For example, DFO techniques can be applied to hyperparameter tuning, which involves optimizing complex objective functions with unavailable first-order information (Falkner, Klein, and Hutter 2018; Akiba et al. 2019; Yang and Shami 2020). Moreover, DFO algorithms find applications in engineering design optimization, where the objective functions are computationally expensive to evaluate and lack explicit derivatives (Ray and Saini 2001; Liao 2010; Akay and Karaboga 2012).

Classical DFO methods such as the Nelder-Mead method (Nelder and Mead 1965) or the directional direct-search (DDS) method (Céa 1971; Yu 1979) have shown great success on convex problems. However, their performance degrades when the objective is nonconvex, which is the commonly faced situation for black-box problems. In this paper, we focus on optimization problems of the form

\min_{x\in\Omega}~f(x), \qquad (1)

where the $n$-dimensional compact cube $\Omega\subseteq\mathbb{R}^{n}$ is the solution space. In addition, we will not stipulate any convexity, smoothness or separability assumption on $f$.

In recent decades, progress has been made in the extensive exploration of DFO algorithms for nonconvex problems, and various kinds of algorithms have been established: for example, evolutionary algorithms (Bäck and Schwefel 1993; Fortin et al. 2012; Hansen 2016; Opara and Arabas 2019), Bayesian optimization (BO) methods (Snoek, Larochelle, and Adams 2012; Shahriari et al. 2015; Frazier 2018) and gradient-approximation surrogate modeling algorithms (Nesterov and Spokoiny 2017; Chen et al. 2019; Ge et al. 2022; Ragonneau and Zhang 2023).

Given the nonconvex and black-box nature of the problem, the quest for enhanced sample efficiency inevitably presents the classic trade-off between exploration and exploitation. Similar to typical nonlinear/nonconvex numerical issues, the aforementioned algorithms are also susceptible to the curse of dimensionality (Bickel and Levina 2004; Hall, Marron, and Neeman 2005; Fan and Fan 2008; Shi et al. 2021; Scheinberg 2022; Yue et al. 2023). Improving scalability and efficiency can be distilled into a few key areas, such as function decomposition (Wang et al. 2018b), dimension reduction (Wang et al. 2016; Nayebi, Munteanu, and Poloczek 2019) and sample strategy refinement. Regarding the latter, previous algorithms have focused on preventing over-exploitation through, for instance, trying more efficient sampling dynamics (Ros and Hansen 2008; Hensman, Fusi, and Lawrence 2013; Yi et al. 2024), search space discretization (Sobol’ 1967) or search region restriction (Eriksson et al. 2019; Wang, Fonseca, and Tian 2020).

Regardless of these improvements, many model-based DFO algorithms share the mechanism of fitting the objective function, locally or globally. The motivation of classification-based DFO algorithms is different: they learn a hypothesis (classifier) to fit the sub-level set $\Omega_{\alpha_{t}}:=\{x\in\Omega\colon f(x)-f^{*}\leq\alpha_{t}\}$ at each iteration $t=1,\ldots,T$, with which new candidates are generated for the successive round (Michalski 2000; Gotovos 2013; Bogunovic et al. 2016; Yu, Qian, and Hu 2016; Hashimoto, Yadlowsky, and Duchi 2018). Sub-level sets are considered to contain less information than objective functions, since they are already obtained whenever the objective function is known. By this means, sample usage is more efficient. Moreover, this approach is less sensitive to function oscillation or noise (Liu et al. 2017) and easy to parallelize (Liu et al. 2017, 2019b; Hashimoto, Yadlowsky, and Duchi 2018).

Training strategies for hypotheses have branched out in various directions. Hashimoto, Yadlowsky, and Duchi pursue accurate hypotheses using a conservative strategy raised by El-Yaniv and Wiener. They demonstrate that for a hypothesis class $\mathcal{H}$ with VC-dimension $VC(\mathcal{H})$, an $\epsilon$-accurate hypothesis requires $\mathcal{O}(\epsilon^{-1}(VC(\mathcal{H})\log(\epsilon^{-1})))$ samples per batch. While a computationally efficient approximation has indeed been developed, its success was demonstrated only in the context of low-dimensional problems; the extent of its effectiveness in high-dimensional scenarios is yet to be determined. However, a contrasting approach successfully led to improvements in sample efficiency. Yu, Qian, and Hu propose to "RAndomly COordinate-wise Shrink" the solution region (RACOS), a more radical approach that diverges from the goal of producing accurate hypotheses. They prove RACOS converges to the global minimum in polynomial time when the objective is locally Hölder continuous, and experiments indicate success on high-dimensional problems ranging from hundreds to thousands of dimensions. An added benefit is the ease of sampling through the generated hypotheses, as their active regions are coordinate-orthogonal cubes. Building on this, the sequential counterpart of RACOS, known as SRACOS, takes an even more radical stance by sampling only once per iteration and learning from biased data. Despite this, SRACOS has been shown to outperform RACOS under certain mild restrictions (Hu, Qian, and Yu 2017).

Outline and Contributions

  • (1)

    Our research extends the groundwork laid by (Yu, Qian, and Hu 2016; Hu, Qian, and Yu 2017), yet we discover that the upper bound they proposed for SRACOS does not fully encapsulate its behavior, potentially allowing for an overflow. On the other hand, we construct a counter-example (eq. 3) illustrating that the upper bound is not tight enough to delineate the convergence accurately.

  • (2)

    We also identify a contentious assumption regarding the concept of error-target dependence that seems to underpin these limitations. In this paper, we introduce a novel learning-theoretic concept termed the hypothesis-target $\eta$-shattering rate, which serves as a foundation for our reassessment of the upper bound on the computational complexity of SRACOS.

  • (3)

    Inspired by the reanalysis, we recognize that the accuracy of the trained hypotheses is even less critical than previously thought. With this insight, we formulate a novel algorithm, RACE-CARS (an acronym for "RAndomized CoordinatE Classifying And Region Shrinking"), which perpetuates the essence of SRACOS while promising a theoretical enhancement in convergence. We design experiments on both synthetic functions and black-box tuning for LMaaS. These experiments juxtapose RACE-CARS against a selection of DFO algorithms, empirically highlighting its superior efficacy.

  • (4)

    In the discussion and appendix, we further substantiate the empirical prowess of RACE-CARS beyond the confines of continuity, supported by a theoretical framework. An ablation study is also conducted to demystify the selection of hyper-parameters for our proposed algorithm.

The rest of the paper is organized in five sections, sequentially presenting the background, theoretical study, experiments, discussion and conclusion.

Background

Assumption 1.

In eq. 1, we presume $f(x)$ is lower bounded within $\Omega$, with $f^{*}:=\min_{x\in\Omega} f(x)$.

In our theoretical analysis, we formalize the procedure as a stochastic process. Let $\mathcal{F}$ denote the Borel $\sigma$-algebra on $\Omega$, and $\mathbb{P}$ the probability measure defined on $\mathcal{F}$. For instance, if $\Omega$ is a continuous space, $\mathbb{P}$ is derived from the Lebesgue measure $m$, such that $\mathbb{P}(B):=m(B)/m(\Omega)$ for all $B\in\mathcal{F}$.

Assumption 2.

Define $\Omega_{\epsilon}:=\{x\in\Omega\colon f(x)-f^{*}\leq\epsilon\}$, with the assumption $|\Omega_{\epsilon}|:=\mathbb{P}(\Omega_{\epsilon})>0$ for all $\epsilon>0$.

A hypothesis (or classifier) $h$ is a function mapping the solution space $\Omega$ to $\{0,1\}$. Define

D_{h}(x):=\begin{cases}\mathbb{P}(\{x\in\Omega\colon h(x)=1\})^{-1},&h(x)=1\\ 0,&\text{otherwise},\end{cases} \qquad (2)

a probability distribution on $\Omega$. Let $\mathbf{X}_{h}$ be the random vector in $(\Omega,\mathcal{F},\mathbb{P})$ drawn from $D_{h}$, implying that $\Pr(\mathbf{X}_{h}\in B)=\int_{x\in B}D_{h}(x)\,d\mathbb{P}$ for all $B\in\mathcal{F}$. Denote $\mathbb{T}:=\{1,2,\ldots,T\}$, and let $\mathbb{F}:=(\mathcal{F}_{t})_{t\in\mathbb{T}}$ be a filtration of $\sigma$-algebras on $\Omega$ indexed by $\mathbb{T}$, satisfying $\mathcal{F}_{1}\subseteq\mathcal{F}_{2}\subseteq\cdots\subseteq\mathcal{F}_{T}\subseteq\mathcal{F}$. A typical classification-based optimization algorithm learns an $\mathbb{F}$-adapted stochastic process $\mathbf{X}:=(\mathbf{X}_{t})_{t\in\mathbb{T}}$, where each $\mathbf{X}_{t}$ is induced by $\mathbf{X}_{h_{t}}$ and $h_{t}$ is the hypothesis updated at step $t$. Subsequently, it samples new data with a stochastic process $\mathbf{Y}:=(\mathbf{Y}_{t})_{t\in\mathbb{T}}$ generated by $\mathbf{X}$. Typically, the new candidates at step $t\geq r+1$ are sampled from

\mathbf{Y}_{t}:=\begin{cases}\mathbf{X}_{t},&\text{with probability }\lambda\\ \mathbf{X}_{\Omega},&\text{with probability }1-\lambda,\end{cases}

where $\mathbf{X}_{\Omega}$ is a random vector drawn from the uniform distribution $\mathcal{U}_{\Omega}$ and $\lambda\in[0,1]$ is the exploitation rate.
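As a concrete illustration, the $\lambda$-mixture sampling of $\mathbf{Y}_{t}$ can be sketched in Python; the helper names below are our own, not from the paper:

```python
import random

def sample_candidate(sample_from_hypothesis, bounds, lam):
    """Draw Y_t: exploit the current hypothesis with probability lam,
    otherwise explore by sampling uniformly from the box Omega."""
    if random.random() < lam:
        return sample_from_hypothesis()
    return tuple(random.uniform(lo, hi) for lo, hi in bounds)
```

With `lam=1` the sampler always exploits the hypothesis; with `lam=0` it degenerates to uniform random search over $\Omega$.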

A simplified batch-mode classification-based optimization algorithm is outlined in Algorithm 1. At each step $t$, it selects a positive set $\mathcal{S}_{positive}$ from $\mathcal{S}$ containing the best $m$ samples, with the remainder in the negative set $\mathcal{S}_{negative}$. Then it trains a hypothesis $h_{t}$ distinguishing the positive set from the negative set, ensuring $h_{t}(x_{j})=0$ for all $(x_{j},y_{j})\in\mathcal{S}_{negative}$. Finally, it samples $r$ new solutions with the sampling random vector $\mathbf{Y}_{t}$. The sub-procedure $h_{t}\leftarrow\mathcal{T}(\mathcal{S}_{positive},\mathcal{S}_{negative})$ denotes hypothesis training.

Algorithm 1 Batch-Mode Classification-Based Optimization Algorithm
  Input: $T$: Budget; $r$: Training size; $m$: Positive size.
  Output: $(x_{best},y_{best})$.
  Collect: $\mathcal{S}=\{(x^{0}_{1},y^{0}_{1}),\ldots,(x^{0}_{r},y^{0}_{r})\}$ i.i.d. from $\mathcal{U}_{\Omega}$;
  $(x_{best},y_{best})=\arg\min\{y\colon(x,y)\in\mathcal{S}\}$;
  for $t=1,\ldots,T/r$ do
     Classify: $(\mathcal{S}_{positive},\mathcal{S}_{negative})\leftarrow\mathcal{S}$;
     Train: $h_{t}\leftarrow\mathcal{T}(\mathcal{S}_{positive},\mathcal{S}_{negative})$;
     Sample: $\{(x^{t}_{1},y^{t}_{1}),\ldots,(x^{t}_{r},y^{t}_{r})\}$ i.i.d. with $\mathbf{Y}_{t}$;
     Select: $\mathcal{S}\leftarrow\mathcal{S}\cup\{(x^{t}_{1},y^{t}_{1}),\ldots,(x^{t}_{r},y^{t}_{r})\}$;
     $(x_{best},y_{best})=\arg\min\{y\colon(x,y)\in\mathcal{S}\}$.
  end for
  Return: $(x_{best},y_{best})$
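A minimal Python sketch of this batch-mode loop, with `train` and `sample` left as pluggable sub-procedures since the framework abstracts over them, might read:

```python
import random

def batch_mode_dfo(f, bounds, budget, r, m, train, sample):
    """Sketch of Algorithm 1: maintain a sample set S, split it into the m
    best (positive) and the rest (negative), train a hypothesis, and draw
    r new candidates per round until the evaluation budget is spent."""
    xs = [tuple(random.uniform(lo, hi) for lo, hi in bounds) for _ in range(r)]
    S = [(x, f(x)) for x in xs]
    for _ in range(budget // r):
        S.sort(key=lambda pair: pair[1])
        positive, negative = S[:m], S[m:]
        h = train(positive, negative)
        new_xs = [sample(h, bounds) for _ in range(r)]
        S += [(x, f(x)) for x in new_xs]
    return min(S, key=lambda pair: pair[1])
```

With `train` returning a trivial hypothesis and `sample` drawing uniformly, this degenerates to pure random search; RACOS (Algorithm 2) supplies a nontrivial `train`.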

RACOS is the abbreviation of "RAndomized COordinate Shrinking". Literally, it trains the hypothesis by this very means (Yu, Qian, and Hu 2016), i.e., shrinking coordinates randomly until all negative samples are excluded from the active region of the resulting hypothesis. Algorithm 2 shows a continuous version of RACOS.

Algorithm 2 RACOS
  Input: $\Omega$: Boundary; $(\mathcal{S}_{positive},\mathcal{S}_{negative})$: Binary sets; $\mathbb{I}=\{1,\ldots,n\}$: Index of dimensions.
  Output: $h$: Hypothesis.
  Randomly select: $x_{+}=(x^{1}_{+},\ldots,x^{n}_{+})\leftarrow\mathcal{S}_{positive}$;
  Initialize: $h(x)\equiv 1$;
  while $\exists x\in\mathcal{S}_{negative}$ s.t. $h(x)=1$ do
     Randomly select: $k\leftarrow\mathbb{I}$;
     Randomly select: $x_{-}=(x^{1}_{-},\ldots,x^{n}_{-})\leftarrow\mathcal{S}_{negative}$;
     if $x^{k}_{+}\leq x^{k}_{-}$ then
        $s\leftarrow random(x^{k}_{+},x^{k}_{-})$;
        Shrink: $h(x)=0,~\forall x\in\{x=(x^{1},\ldots,x^{n})\in\Omega\colon x^{k}>s\}$;
     else
        $s\leftarrow random(x^{k}_{-},x^{k}_{+})$;
        Shrink: $h(x)=0,~\forall x\in\{x=(x^{1},\ldots,x^{n})\in\Omega\colon x^{k}<s\}$;
     end if
  end while
  Return: $h$
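A Python sketch of this continuous coordinate-shrinking loop, representing the hypothesis's active region as an axis-aligned box. We deviate slightly from the pseudocode by always picking a negative sample that is still inside the box, which guarantees progress at every iteration:

```python
import random

def racos_train(x_pos, negatives, bounds):
    """Sketch of continuous RACOS: repeatedly pick a random coordinate and a
    remaining negative sample, and shrink the box between them, until every
    negative sample falls outside the active box containing x_pos."""
    lower = [lo for lo, hi in bounds]
    upper = [hi for lo, hi in bounds]
    inside = lambda x: all(lower[i] <= x[i] <= upper[i] for i in range(len(x)))
    while any(inside(x) for x in negatives):
        k = random.randrange(len(bounds))
        x_neg = random.choice([x for x in negatives if inside(x)])
        if x_pos[k] <= x_neg[k]:
            s = random.uniform(x_pos[k], x_neg[k])
            upper[k] = min(upper[k], s)   # exclude everything with x^k > s
        else:
            s = random.uniform(x_neg[k], x_pos[k])
            lower[k] = max(lower[k], s)   # exclude everything with x^k < s
    return lower, upper
```

The returned `(lower, upper)` box is the active region of the hypothesis: $h(x)=1$ iff $x$ lies inside it, and the randomly selected positive sample is kept inside by construction.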

Aside from sampling only once per iteration, another difference between sequential- and batch-mode classification-based DFO is that the sequential version replaces elements of the training set with new ones under certain rules to finish step $t$ (see Algorithm 3). In the rest of this paper, we omit the details of the $Replacing$ sub-procedure, which can be found in (Hu, Qian, and Yu 2017).

Algorithm 3 Sequential-Mode Classification-Based Optimization Algorithm
  Input: $T$: Budget; $r$: Training size; $m$: Positive size; $Replacing$: Replacing sub-procedure.
  Output: $(x_{best},y_{best})$.
  Collect $\mathcal{S}=\{(x_{1},y_{1}),\ldots,(x_{r},y_{r})\}$ i.i.d. from $\mathcal{U}_{\Omega}$;
  $(x_{best},y_{best})=\arg\min\{y\colon(x,y)\in\mathcal{S}\}$;
  for $t=r+1,\ldots,T$ do
     Classify: $(\mathcal{S}_{positive},\mathcal{S}_{negative})\leftarrow\mathcal{S}$;
     Train: $h_{t}\leftarrow\mathcal{T}(\mathcal{S}_{positive},\mathcal{S}_{negative})$;
     Sample: $(x_{t},y_{t})\sim\mathbf{Y}_{t}$;
     Replace: $\mathcal{S}\leftarrow Replacing((x_{t},y_{t}),\mathcal{S})$;
     $(x_{best},y_{best})=\arg\min\{y\colon(x,y)\in\mathcal{S}\}$;
  end for
  Return: $(x_{best},y_{best})$

Classification-based DFO algorithms admit a bound on the query complexity (Yu and Qian 2014), quantifying the total number of function evaluations required to identify a solution that achieves an approximation level of $\epsilon$ with probability at least $1-\delta$.

Definition 1 ($(\epsilon,\delta)$-Query Complexity).

Given $f$, $0<\delta<1$ and $\epsilon>0$, the $(\epsilon,\delta)$-query complexity of an algorithm $\mathcal{A}$ is the number of calls to $f$ such that, with probability at least $1-\delta$, $\mathcal{A}$ finds at least one solution $\tilde{x}\in\Omega$ satisfying

f(\tilde{x})-f^{*}\leq\epsilon.

Definitions 2 and 3 are given by (Yu, Qian, and Hu 2016). The first characterizes the so-called dependence between classification error and target region, which is expected to be small to ensure efficiency. The second characterizes the proportion of the hypothesis's active region, which is also expected to be as small as possible.

Definition 2 (Error-Target $\theta$-Dependence).

The error-target dependence $\theta\geq 0$ of a classification-based optimization algorithm is its infimum such that, for any $\epsilon>0$ and any $t=1,\ldots,T$,

\bigl||\Omega_{\epsilon}|\cdot\mathbb{P}(\mathcal{R}_{t})-\mathbb{P}(\Omega_{\epsilon}\cap\mathcal{R}_{t})\bigr|\leq\theta|\Omega_{\epsilon}|,

where $\mathcal{R}_{t}:=\Omega_{\alpha_{t}}\,\Delta\,\{x\in\Omega\colon h_{t}(x)=1\}$ denotes the relative error, $\Omega_{\alpha_{t}}$ is the sub-level set at step $t$ with $\alpha_{t}:=\min_{1\leq i\leq t}f(x_{i})-f^{*}$, and the operator $\Delta$ is the symmetric difference of two sets, defined as $A_{1}\Delta A_{2}=(A_{1}\cup A_{2})-(A_{1}\cap A_{2})$.

Definition 3 ($\gamma$-Shrinking Rate).

The shrinking rate $\gamma>0$ of a classification-based optimization algorithm is its infimum such that $\mathbb{P}(x\in\Omega\colon h_{t}(x)=1)\leq\gamma|\Omega_{\alpha_{t}}|$ for all $t=1,\ldots,T$.

Theoretical Study

Previous studies gave a general bound on the query complexity of Algorithm 3 under mild error-target $\theta$-dependence and $\gamma$-shrinking rate assumptions:

Theorem 1.

(Hu, Qian, and Yu 2017) Given $0<\delta<1$ and $\epsilon>0$, if a sequential classification-based optimization algorithm has error-target $\theta$-dependence and $\gamma$-shrinking rate, then its $(\epsilon,\delta)$-query complexity is upper bounded by

\mathcal{O}\biggl(\max\Bigl\{\frac{1}{|\Omega_{\epsilon}|}\Bigl(\lambda+\frac{1-\lambda}{\gamma(T-r)}\sum^{T}_{t=r+1}\Phi_{t}\Bigr)^{-1}\ln\frac{1}{\delta},\;T\Bigr\}\biggr),

where $\Phi_{t}=\Bigl(1-\theta-\mathbb{P}(\mathcal{R}_{D_{t}})-m(\Omega)\sqrt{\tfrac{1}{2}D_{\mathrm{KL}}(D_{t}\|\mathcal{U}_{\Omega})}\Bigr)\cdot|\Omega_{\alpha_{t}}|^{-1}$, with the notations $D_{t}:=\lambda D_{h_{t}}+(1-\lambda)\mathcal{U}_{\Omega}$ and $\mathbb{P}(\mathcal{R}_{D_{t}}):=\int_{\mathcal{R}_{t}}D_{t}\,d\mathbb{P}$.

Issues Introduced by Error-Target Dependence

  • (1)

    Overflow of the upper bound

    Per the assumptions entailed in (Yu, Qian, and Hu 2016; Hu, Qian, and Yu 2017), it can be observed that lower values of $\theta$ or $\gamma$ correlate with improved query complexity. However, this is not a hard-and-fast rule: even with small values of these parameters, we can encounter scenarios where the expected performance does not materialize. Following the lemma given by (Yu, Qian, and Hu 2016), $\mathbb{P}(\mathcal{R}_{t})\leq\mathbb{P}(\mathcal{R}_{D_{t}})+m(\Omega)\sqrt{\frac{1}{2}D_{\mathrm{KL}}(D_{t}\|\mathcal{U}_{\Omega})}$, we have the following inequality:

    \Phi_{t}=\Bigl(1-\theta-\mathbb{P}(\mathcal{R}_{D_{t}})-m(\Omega)\sqrt{\tfrac{1}{2}D_{\mathrm{KL}}(D_{t}\|\mathcal{U}_{\Omega})}\Bigr)\cdot|\Omega_{\alpha_{t}}|^{-1}\leq(1-\theta-\mathbb{P}(\mathcal{R}_{t}))\cdot|\Omega_{\alpha_{t}}|^{-1}.

    The concept of error-target $\theta$-dependence reveals that a small $\theta$ does not guarantee a small relative error $\mathbb{P}(\mathcal{R}_{t})$. On the contrary, a small $\theta$ coupled with a large $\mathbb{P}(\Omega_{\epsilon}\cap\mathcal{R}_{t})/|\Omega_{\epsilon}|$ can result in a significant $\mathbb{P}(\mathcal{R}_{t})$, which can even equal $1$ when $\Omega_{\alpha_{t}}$ lies entirely outside the active region of $h_{t}$, i.e., when the hypothesis $h_{t}$ is completely wrong. Problematically, $\Phi_{t}\leq(1-\theta-\mathbb{P}(\mathcal{R}_{t}))\cdot|\Omega_{\alpha_{t}}|^{-1}$ appears as a divisor in the proof of Theorem 1 and becomes nonpositive in this case, disrupting the established inequality. Moreover, since $\theta$ is inherently nonnegative, $\mathbb{P}(\mathcal{R}_{t})$ need not equal $1$ for the inequality to break down: a series of inaccurate hypotheses suffices to undermine the validity of the upper bound. This finding challenges the principle of SRACOS's radical training strategy.

  • (2)

    Tightness of the upper bound

    Consider an extreme but plausible situation where the hypotheses generated at each step are defined as follows:

    h_{t}(x)=\begin{cases}1,&x\in\Omega_{\epsilon}\\ 0,&x\notin\Omega_{\epsilon}.\end{cases} \qquad (3)

    In the context of sequential-mode classification-based optimization algorithms, where the training sets are not only small but also potentially biased, it is reasonable to expect large relative errors $\mathbb{P}(\mathcal{R}_{t})$. This scenario could lead to a series of hypotheses that are inaccurate with respect to $\Omega_{\alpha_{t}}$ but, by chance, accurate with respect to $\Omega_{\epsilon}$. Consequently, the error-target dependence $\theta=\max_{1\leq t\leq T}\mathbb{P}(\mathcal{R}_{t})$ can be unexpectedly large. Even with all $\Phi_{t}$ values positive, the query complexity bound given in Theorem 1 is therefore not optimistic. However, the probability of failing to identify an $\epsilon$-minimum,

    \Pr\bigl(\min_{1\leq t\leq T}f(x_{t})-f^{*}\geq\epsilon\bigr)=(1-|\Omega_{\epsilon}|)^{r}\bigl((1-\lambda)(1-|\Omega_{\epsilon}|)\bigr)^{T-r},

    is less than δ\delta for a reasonably sized TT. This indicates that the upper bound may not be as tight as initially thought.

Revisit of Query Complexity Upper Bound

It is evident that even minimal error-target dependence cannot encapsulate issues arising from substantial relative error, because error-target dependence alone is insufficient to fully account for relative error. Intuitively, one might consider introducing an additional assumption to cap the relative error. However, such an assumption would be impractical, given the inherently small and biased nature of the training datasets in the process. To this end, we give a new concept that stands apart from the constraints of relative error:

Definition 4 (Hypothesis-Target $\eta$-Shattering Rate).

Given $\eta\in[0,1]$, for a family of hypotheses $\mathcal{H}$ defined on $\Omega$, we say $\Omega_{\epsilon}$ is $\eta$-shattered by $h\in\mathcal{H}$ if

\mathbb{P}(\Omega_{\epsilon}\cap\{x\in\Omega\colon h(x)=1\})\geq\eta|\Omega_{\epsilon}|,

and $\eta$ is called the hypothesis-target shattering rate.

The hypothesis-target shattering rate mirrors the error-target dependence in its relation to the hypothesis's target-accuracy. Importantly, it bounds the error-target dependence in conjunction with the relative error:

\theta\leq\max\bigl\{\mathbb{P}(\mathcal{R}_{t}),\;|1-\mathbb{P}(\mathcal{R}_{t})-\eta|\bigr\}.

This rate $\eta$ measures the overlap between the target set $\Omega_{\epsilon}$ and the active region of a hypothesis. Crucially, it also mitigates the influence of relative error on error-target dependence. Utilizing the hypothesis-target shattering rate, we reexamine the upper bound of the $(\epsilon,\delta)$-query complexity in the subsequent theorem.

Theorem 2.

For the sequential-mode classification-based DFO Algorithm 3, let $\mathbf{X}_{t}=\mathbf{X}_{h_{t}}$, $\epsilon>0$ and $0<\delta<1$. When $\Omega_{\epsilon}$ is $\eta$-shattered by $h_{t}$ for all $t=r+1,\ldots,T$ and $\max_{t=r+1,\ldots,T}\mathbb{P}(\{x\in\Omega\colon h_{t}(x)=1\})\leq p\leq 1$, the $(\epsilon,\delta)$-query complexity is upper bounded by

\mathcal{O}\biggl(\max\Bigl\{\bigl(\lambda\tfrac{\eta}{p}+(1-\lambda)\bigr)^{-1}\bigl(\tfrac{1}{|\Omega_{\epsilon}|}\ln\tfrac{1}{\delta}-r\bigr)+r,\;T\Bigr\}\biggr).

The Region-Shrinking Acceleration

In the analysis of $(\epsilon,\delta)$-query complexity for classification-based optimization, the focus shifts away from minimizing the relative error $\mathbb{P}(\mathcal{R}_{t})$, as our goal is to identify optima, not to develop a sequence of accurate hypotheses. The counter-example in equation (3), despite a potentially high relative error, represents an optimal hypothesis scenario. This realization allows for the consideration of more radical hypotheses, directing our attention to the overlap between $\Omega_{\epsilon}$ and the active region of the hypotheses, which is quantified by the hypothesis-target shattering rate.

The $\gamma$-shrinking rate, as defined in Definition 3, measures the decay of $\mathbb{P}(x\in\Omega\colon h_{t}(x)=1)$. However, the rapid decrease of $|\Omega_{\alpha_{t}}|$ as $\alpha_{t}$ approaches zero makes it impractical to sustain a series of hypotheses with a small $\gamma$ through our training process. Thus, the $\gamma$-shrinking assumption is often not feasible for minimal $\gamma$.

Moving beyond the pursuit of minimal relative error and $\gamma$-shrinking relative to $|\Omega_{\alpha_{t}}|$, we introduce Algorithm 4, which adaptively shrinks the active region of the sampling random vector $\mathbf{Y}_{t}$ through a $Projection$ sub-procedure.

Algorithm 4 Accelerated Sequential-Mode Classification-Based Optimization Algorithm
  Input: $\Omega$: Boundary; $T\in\mathbb{N}^{+}$: Budget; $r=m+k$; $Replacing$: Replacing sub-procedure; $\gamma$: Region shrinking rate; $\rho$: Region shrinking frequency.
  Output: $(x_{best},y_{best})$.
  Collect $\mathcal{S}=\{(x_{1},y_{1}),\ldots,(x_{r},y_{r})\}$ i.i.d. from $\mathcal{U}_{\Omega}$;
  $(x_{best},y_{best})=\arg\min\{y\colon(x,y)\in\mathcal{S}\}$;
  Initialize $k=1$, $\tilde{\Omega}=\Omega$;
  for $t=r+1,\ldots,T$ do
     Train: $h_{t}\leftarrow\mathcal{T}(\mathcal{S}_{positive},\mathcal{S}_{negative})$;
     $s\leftarrow random(0,1)$;
     if $s\leq\rho$ then
        Shrink region: $\tilde{\Omega}=\Omega\cap[x_{best}-\frac{1}{2}\gamma^{k}\|\Omega\|,\,x_{best}+\frac{1}{2}\gamma^{k}\|\Omega\|]$;
        $k=k+1$;
     end if
     Project: $\mathbf{Y}_{t}\leftarrow Proj(h_{t},\tilde{\Omega})$;
     Sample: $(x_{t},y_{t})\sim\mathbf{Y}_{t}$;
     Replace: $\mathcal{S}\leftarrow Replacing((x_{t},y_{t}),\mathcal{S})$;
     $(x_{best},y_{best})=\arg\min\{y\colon(x,y)\in\mathcal{S}\}$;
  end for
  Return: $(x_{best},y_{best})$
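The Shrink step of Algorithm 4 is easy to state on its own in Python (a sketch; here `bounds` is the original box $\Omega$ and `k` counts how many shrinks have occurred so far):

```python
def shrink_region(bounds, x_best, gamma, k):
    """Intersect Omega with a box centered at the incumbent x_best whose side
    in each dimension is gamma**k times that dimension's diameter."""
    shrunk = []
    for (lo, hi), xb in zip(bounds, x_best):
        half = 0.5 * (gamma ** k) * (hi - lo)
        shrunk.append((max(lo, xb - half), min(hi, xb + half)))
    return shrunk
```

Each call halves nothing by itself; it is the geometric factor $\gamma^{k}$, increasing $k$ on a random $\rho$-fraction of iterations, that concentrates sampling around the incumbent.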

The operator $\|\cdot\|$ returns a tuple representing the diameter of each dimension of the region. For instance, when $\Omega=[\omega^{1}_{1},\omega^{1}_{2}]\times[\omega^{2}_{1},\omega^{2}_{2}]$, we have $\|\Omega\|=(\omega^{1}_{2}-\omega^{1}_{1},\,\omega^{2}_{2}-\omega^{2}_{1})$. The projection operator $Proj(h_{t},\tilde{\Omega})$ generates a random vector $\mathbf{X}_{t}$ with probability distribution $D_{\tilde{h}_{t}}:=\tilde{h}_{t}/\mathbb{P}(\{x\in\Omega\colon\tilde{h}_{t}(x)=1\})$, with $\tilde{h}_{t}(x)=1$ whenever $h_{t}(x)=1$ for $x\in\tilde{\Omega}$. The sampling random vector $\mathbf{Y}_{t}$ is induced by $\mathbf{X}_{t}$. The subsequent theorem presents the upper bound of query complexity for Algorithm 4.

Theorem 3.

For Algorithm 4 with region shrinking rate $0<\gamma<1$ and region shrinking frequency $0<\rho<1$, let $\epsilon>0$ and $0<\delta<1$. When $\Omega_{\epsilon}$ is $\eta$-shattered by $\tilde{h}_{t}$ for all $t=r+1,\ldots,T$, the $(\epsilon,\delta)$-query complexity is upper bounded by

\mathcal{O}\biggl(\max\Bigl\{\Bigl(\frac{\gamma^{-\rho}+\gamma^{-(T-r)\rho}}{2}\lambda\eta+(1-\lambda)\Bigr)^{-1}\bigl(\tfrac{1}{|\Omega_{\epsilon}|}\ln\tfrac{1}{\delta}-r\bigr)+r,\;T\Bigr\}\biggr).

The condition $\gamma\in(0,1)$ ensures that the term $2p/(\gamma^{-\rho}+\gamma^{-(T-r)\rho})$ is significantly less than 1. According to Theorem 3, the $(\epsilon,\delta)$-query complexity of Algorithm 4 is lower than that of Algorithm 3, provided $\eta>0$.
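To make the comparison concrete, the denominators of the two bounds can be evaluated numerically. The parameter values used below ($\lambda$, $\eta$, $p$, $\gamma$, $\rho$, and the horizon $T-r$) are illustrative assumptions of our own, not constants from the theorems:

```python
def factor_theorem2(lam, eta, p):
    # Denominator of the Theorem 2 bound: lam * eta / p + (1 - lam).
    return lam * eta / p + (1 - lam)

def factor_theorem3(lam, eta, gamma, rho, horizon):
    # Denominator of the Theorem 3 bound:
    # (gamma**(-rho) + gamma**(-horizon * rho)) / 2 * lam * eta + (1 - lam).
    boost = 0.5 * (gamma ** (-rho) + gamma ** (-horizon * rho))
    return boost * lam * eta + (1 - lam)
```

A larger denominator means a smaller query-complexity bound. For example, with `lam=0.95`, `eta=0.1`, `p=1.0`, `gamma=0.9`, `rho=0.01` and a horizon of `5000`, the Theorem 3 factor exceeds the Theorem 2 factor by more than an order of magnitude, reflecting the claimed acceleration.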

Theorem 3 establishes an upper bound on the $(\epsilon,\delta)$-query complexity that is applicable to a wide range of scenarios, assuming only that the objective function $f$ is lower bounded. Building on this, we identify a sufficient condition for acceleration that applies to dimensionally locally Hölder continuous functions (Definition 5), detailed in the appendix. Within this context, the SRACOS algorithm, which utilizes RACOS for the $Training$ phase in Algorithm 3, exhibits polynomial convergence (Hu, Qian, and Yu 2017). We adopt the same RACOS approach for the $Training$ sub-procedure in Algorithm 4, introducing the "RAndomized CoordinatE Classifying And Region Shrinking" (RACE-CARS) algorithm.

Definition 5 (Dimensionally Local Hölder Continuity).

Assume that $x_{*}=(x^{1}_{*},\ldots,x^{n}_{*})$ is the unique global minimum such that $f(x_{*})=f^{*}$. We call $f$ dimensionally locally Hölder continuous if, for all $i=1,\ldots,n$,

L^{i}_{1}|x^{i}-x^{i}_{*}|^{\beta^{i}_{1}}\leq|f(x^{1}_{*},\ldots,x^{i},\ldots,x^{n}_{*})-f^{*}|\leq L^{i}_{2}|x^{i}-x^{i}_{*}|^{\beta^{i}_{2}},

for all $x=(x^{1},\ldots,x^{n})$ in a neighborhood of $x_{*}$, where $\beta^{i}_{1}$, $\beta^{i}_{2}$, $L^{i}_{1}$, $L^{i}_{2}$ are positive constants for $i=1,\ldots,n$.

Experiments

In this section, we design experiments to test RACE-CARS on synthetic functions and language model tasks, respectively. We use the same budget to compare RACE-CARS with a selection of DFO algorithms, including SRACOS (Hu, Qian, and Yu 2017), the zeroth-order adaptive momentum method (ZO-Adam) (Chen et al. 2019), differential evolution (DE) (Opara and Arabas 2019) and covariance matrix adaptation evolution strategies (CMA-ES) (Hansen 2016). All the baseline algorithms are fine-tuned, and the essential hyper-parameters of RACE-CARS can be found in the Appendix.

On Synthetic Functions

We commence our empirical experiments with four benchmark functions: Ackley, Levy, Rastrigin and Sphere. Their analytic forms and 2-dimensional illustrations are detailed in the Appendix. Characterized by extreme non-convexity and numerous local minima and saddle points (with the exception of the Sphere function), each is minimized within the boundary $\Omega=[-10,10]^{n}$, with a global minimum value of 0. We choose the dimension of the solution space $n$ to be $50$ and $500$, with corresponding function evaluation budgets of $5000$ and $50000$. Notably, as the results indicate, the convergence of RACE-CARS requires only a fraction of this budget.
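For reference, three of the four benchmarks can be written in their standard forms (assuming the usual parameterizations; each attains its minimum of 0 at the origin):

```python
import math

def sphere(x):
    # Sphere: strongly convex, minimum 0 at the origin.
    return sum(v * v for v in x)

def rastrigin(x):
    # Rastrigin: highly multimodal, minimum 0 at the origin.
    return 10 * len(x) + sum(v * v - 10 * math.cos(2 * math.pi * v) for v in x)

def ackley(x):
    # Ackley: nearly flat outer region riddled with local minima,
    # minimum 0 at the origin.
    n = len(x)
    s1 = sum(v * v for v in x) / n
    s2 = sum(math.cos(2 * math.pi * v) for v in x) / n
    return -20 * math.exp(-0.2 * math.sqrt(s1)) - math.exp(s2) + 20 + math.e
```

These closed forms make the contrast visible: Sphere rewards any local descent, while Rastrigin and Ackley punish over-exploitation, which is where the region-shrinking exploration trade-off matters.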

Figure 1: Comparison on synthetic functions with $n=50$: (a) Ackley, (b) Levy, (c) Rastrigin, (d) Sphere.

The region-shrinking rate is configured to be γ=0.9 and 0.95, with shrinking frequencies of ρ=0.01 and 0.001 for n=50 and n=500, respectively. Each algorithm is executed over 30 trials, and the mean convergence trajectories of the best-so-far values are depicted in Figure 1 and Figure 2. The numerical values adjacent to the algorithm names in the legends are the means of the attained minima. It is evident that RACE-CARS performs best in both convergence speed and optimal value, with a slight edge to CMA-ES on the strongly convex Sphere function. That edge, however, comes at the cost of scalability: CMA-ES maintains an n×n covariance matrix, which is significantly more computationally intensive than the other algorithms.
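To make the roles of γ and ρ concrete, the following is a minimal sketch of a random region-shrinking step; it is an illustrative reading of the hyper-parameters (each coordinate of the search box shrinks by factor γ with probability ρ per query, so the box volume contracts by γ^{nρ} per query in expectation), not the exact sub-procedure of RACE-CARS.

```python
import numpy as np

# Hedged sketch of the random region-shrinking idea (illustrative reading
# of gamma and rho, not the authors' implementation): with probability rho,
# each coordinate of the current search box is shrunk by factor gamma
# around the best-so-far point.
def shrink_region(lower, upper, x_best, gamma=0.9, rho=0.01, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    lower, upper = lower.copy(), upper.copy()
    for i in range(len(lower)):
        if rng.random() < rho:                       # shrink dimension i
            half = gamma * (upper[i] - lower[i]) / 2.0
            lower[i] = max(lower[i], x_best[i] - half)
            upper[i] = min(upper[i], x_best[i] + half)
    return lower, upper

# The box contracts steadily while always containing the incumbent point.
lower, upper = np.full(50, -10.0), np.full(50, 10.0)
x_best = np.zeros(50)
for _ in range(1000):
    lower, upper = shrink_region(lower, upper, x_best)
assert np.all(upper >= lower)
assert np.all(upper - lower <= 20.0)
```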

Figure 2: Comparison of synthetic functions with n=500; panels (a) Ackley, (b) Levy, (c) Rastrigin, (d) Sphere.

On Black-Box Tuning for LMaaS

Prompt tuning for extremely large pre-trained language models (PTMs) has shown great power. PTMs such as GPT-3 (Brown et al. 2020) are usually released as a service, owing to commercial considerations and the potential risk of misuse, and users design individual prompts to query the PTMs through black-box APIs. This scenario is called Language-Model-as-a-Service (LMaaS) (Diao et al. 2022; Sun et al. 2022). We follow the experiments designed by Sun et al. (2022) (code: https://github.com/txsun1997/Black-Box-Tuning), where a language understanding task is formulated as a classification task: for a batch of PTM-modified input texts X, the labels Y lie in the PTM vocabulary, and we need to tune a continuous prompt p such that the black-box PTM inference API f satisfies Y=f(p;X). Moreover, to handle the high-dimensional prompt p, Sun et al. (2022) proposed to randomly embed the D-dimensional prompt into a lower-dimensional space ℝ^d via a random projection matrix A∈ℝ^{D×d}. The objective therefore becomes:

\min_{z\in\mathcal{Z}}\mathcal{L}\bigl(f(\mathbf{A}z+\mathbf{p}_{0};X),Y\bigr),

where \mathcal{Z}=[-50,50]^{d} is the search space and \mathcal{L}(\cdot) is the cross-entropy loss.
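The random-embedding objective can be sketched as follows; the PTM inference API f is a black box in practice, so a toy softmax classifier stands in for it here, and the dimensions, seed, and helper names are illustrative.

```python
import numpy as np

# Sketch of the random-embedding objective for black-box prompt tuning
# (Sun et al. 2022): the d-dimensional variable z is mapped into the
# D-dimensional prompt space via a fixed random matrix A. The real PTM
# API f is a black box; a toy softmax classifier stands in for it here.
D, d, n_classes = 5000, 500, 2
rng = np.random.default_rng(0)
A = rng.normal(0.0, 1.0 / d, size=(D, d))   # random projection matrix
p0 = np.zeros(D)                            # initial prompt
W = rng.normal(size=(n_classes, D))         # toy stand-in for the PTM API
y = np.array([1])                           # label for one example

def api(prompt):
    logits = W @ prompt
    e = np.exp(logits - logits.max())
    return e / e.sum()                      # class probabilities

def loss(z):
    probs = api(A @ z + p0)                 # query with the embedded prompt
    return -np.log(probs[y[0]])             # cross-entropy loss

z = rng.uniform(-50, 50, size=d)            # a point in Z = [-50, 50]^d
assert loss(z) >= 0.0
```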

In our experimental setup, we set the search-space dimension to d=500 and the prompt length to 50, with RoBERTa (Liu et al. 2019a) serving as the backbone model. We evaluate performance on SST-2 (Socher et al. 2013), Yelp Polarity and AG's News (Zhang, Zhao, and LeCun 2015), and RTE (Wang et al. 2018a). With a fixed API-call budget of T=8000, we pit RACE-CARS against SRACOS and CMA-ES, the default DFO algorithm used in (Sun et al. 2022). We exclude ZO-Adam and DE owing to their suboptimal performance on high-dimensional nonconvex black-box functions, as demonstrated in the previous section.

For these tests, the shrinking rate is γ=0.7 and the shrinking frequency is ρ=0.002. Each algorithm is repeated 5 times independently with distinct seeds. We assess the algorithms by the mean and standard deviation of the training loss, training accuracy, development loss, and development accuracy. The SST-2 results are highlighted in Figure 3, with additional findings for Yelp Polarity, AG's News, and RTE detailed in the appendix. The results indicate that RACE-CARS consistently accelerates the convergence of SRACOS. While CMA-ES shows superior performance on Yelp Polarity, AG's News, and RTE, it also exhibits signs of overfitting. RACE-CARS achieves comparable performance to CMA-ES, despite the latter's hyperparameters being finely tuned. Notably, the hyperparameters of RACE-CARS were empirically adjusted on the SST-2 dataset and then applied to the other three datasets without further tuning.

Figure 3: Comparisons on SST-2; panels (a) training loss, (b) training accuracy, (c) development loss, (d) development accuracy.

Discussion

Beyond continuity

  • (i)

    For discontinuous objective functions.

    Dimensionally local Hölder continuity (Definition 5) constrains the objective through a set of continuous envelopes, but the objective itself need not be continuous. Beyond the continuous cases discussed in the previous section, RACE-CARS therefore remains applicable to discontinuous objectives; see the appendix for details.

  • (ii)

    For discrete optimization.

    Like SRACOS, RACE-CARS retains the capability to tackle discrete optimization problems. The convergence theorems, presented in Theorem 2 and Theorem 3, cover this situation once the measure of the probability space is altered, for example to one induced by the counting measure. We extend the experiments to mixed-integer programming problems, substantiating the acceleration of RACE-CARS empirically; see the appendix for details.

On the Concept Hypothesis-Target Shattering

The concept of hypothesis-target shattering, central to our discussion, draws inspiration from the established learning-theoretic notion of shattering and its deep ties to Vapnik-Chervonenkis (VC) theory (Vapnik et al. 1998). At the heart of VC theory lies the VC dimension, a measure of a hypothesis family's capacity to distinguish among data points based on their labels. Specifically, for a collection S of data points with binary labels, we say a subset S′⊆S is shattered by a hypothesis family ℋ if there exists a hypothesis h∈ℋ that perfectly aligns with the labels of points in S′ and contrasts with those outside:

h(x)=\begin{cases}1,&x\in S^{\prime}\\ 0,&x\notin S^{\prime}.\end{cases}

The shattering coefficient 𝒮(ℋ,n) counts the distinct labelings that ℋ can produce on n points. The VC dimension is then defined as VC(ℋ) := sup{n : 𝒮(ℋ,n) = 2^n}, the maximum number of points that ℋ can label in every possible way.
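A brute-force shattering check makes the definition concrete; the threshold family below is standard VC-theory material, not a hypothesis class used by RACE-CARS.

```python
# Brute-force shattering check (standard VC-theory illustration): the
# threshold family h_t(x) = 1[x >= t] on the real line shatters any
# single point but no two-point set, so its VC dimension is 1.
def shatters(points, hypotheses):
    labelings = {tuple(h(x) for x in points) for h in hypotheses}
    return len(labelings) == 2 ** len(points)

thresholds = [lambda x, t=t: int(x >= t) for t in (-1.0, 0.5, 1.5, 3.0)]
assert shatters([1.0], thresholds)           # one point: both labels reachable
assert not shatters([1.0, 2.0], thresholds)  # labeling (1, 0) is unattainable
```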

In the context of classification-based DFO, we represent the target-representative capability of a family of hypotheses through hypothesis-target shattering, which measures the overlap between a hypothesis's active region and the target. The quintessence of algorithm design therefore hinges on maximizing this quantity; yet discerning the target-representative capacity within the intricate landscape of nonconvex, black-box optimization problems is nontrivial. Nonetheless, the hypothesis family generated by RACOS, although established under the previous framework, empirically exhibits sufficient efficacy in scenarios where the objective function is locally Hölder continuous. Looking ahead, altering the Training and Replacing sub-procedures inherited from SRACOS, ideally leading to a larger shattering rate while maintaining the easy-to-sample characterization, is another direction for extending the current study.

Ablation Experiments

While it is an appealing goal to develop a universally effective DFO algorithm for black-box functions that requires no hyperparameter tuning, this remains unrealistic. Our proposed algorithm, RACE-CARS, introduces two hyperparameters: the shrinking rate γ and the shrinking frequency ρ. For an n-dimensional optimization problem, we call γ^{nρ} the shrinking factor of RACE-CARS. We take Ackley as a case study and design ablation experiments on the two hyperparameters to reveal the mechanism of RACE-CARS. We do not aim to identify an optimal combination of hyperparameters maximizing the overlap with the target hypothesis; instead, our aim is to provide empirical guidance for tuning these hyperparameters effectively. For further details, the reader is directed to the appendix.

Conclusion

In this paper, we refine the framework of classification-based DFO as a stochastic process and propose a novel learning-theoretic concept named the hypothesis-target shattering rate. Our research delves into the convergence properties of sequential-mode classification-based DFO algorithms and provides a fresh perspective on their query-complexity upper bound. In light of the computational complexity upper bound under the new framework, we propose a theoretically grounded region-shrinking technique to accelerate convergence. Empirically, we study the scalability of RACE-CARS on both synthetic functions and black-box tuning for LMaaS, showing its superiority over SRACOS.

References

  • Akay and Karaboga (2012) Akay, B.; and Karaboga, D. 2012. Artificial bee colony algorithm for large-scale problems and engineering design optimization. Journal of intelligent manufacturing, 23: 1001–1014.
  • Akiba et al. (2019) Akiba, T.; Sano, S.; Yanase, T.; Ohta, T.; and Koyama, M. 2019. Optuna: A next-generation hyperparameter optimization framework. In Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining, 2623–2631.
  • Bäck and Schwefel (1993) Bäck, T.; and Schwefel, H.-P. 1993. An overview of evolutionary algorithms for parameter optimization. Evolutionary computation, 1(1): 1–23.
  • Bickel and Levina (2004) Bickel, P. J.; and Levina, E. 2004. Some theory for Fisher’s linear discriminant function,naive Bayes’, and some alternatives when there are many more variables than observations. Bernoulli, 10(6): 989–1010.
  • Bogunovic et al. (2016) Bogunovic, I.; Scarlett, J.; Krause, A.; and Cevher, V. 2016. Truncated variance reduction: A unified approach to bayesian optimization and level-set estimation. Advances in neural information processing systems, 29.
  • Brown et al. (2020) Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J. D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. 2020. Language models are few-shot learners. Advances in neural information processing systems, 33: 1877–1901.
  • Céa (1971) Céa, J. 1971. Optimisation: théorie et algorithmes. Jean Céa. Dunod.
  • Chen et al. (2019) Chen, X.; Liu, S.; Xu, K.; Li, X.; Lin, X.; Hong, M.; and Cox, D. 2019. Zo-adamm: Zeroth-order adaptive momentum method for black-box optimization. Advances in neural information processing systems, 32.
  • Conn, Scheinberg, and Vicente (2009) Conn, A. R.; Scheinberg, K.; and Vicente, L. N. 2009. Introduction to derivative-free optimization. SIAM.
  • Diao et al. (2022) Diao, S.; Huang, Z.; Xu, R.; Li, X.; Lin, Y.; Zhou, X.; and Zhang, T. 2022. Black-box prompt learning for pre-trained language models. arXiv preprint arXiv:2201.08531.
  • El-Yaniv and Wiener (2012) El-Yaniv, R.; and Wiener, Y. 2012. Active Learning via Perfect Selective Classification. Journal of Machine Learning Research, 13(2).
  • Eriksson et al. (2019) Eriksson, D.; Pearce, M.; Gardner, J.; Turner, R. D.; and Poloczek, M. 2019. Scalable global optimization via local Bayesian optimization. Advances in neural information processing systems, 32.
  • Falkner, Klein, and Hutter (2018) Falkner, S.; Klein, A.; and Hutter, F. 2018. BOHB: Robust and efficient hyperparameter optimization at scale. In International conference on machine learning, 1437–1446. PMLR.
  • Fan and Fan (2008) Fan, J.; and Fan, Y. 2008. High dimensional classification using features annealed independence rules. Annals of statistics, 36(6): 2605.
  • Fortin et al. (2012) Fortin, F.-A.; De Rainville, F.-M.; Gardner, M.-A. G.; Parizeau, M.; and Gagné, C. 2012. DEAP: Evolutionary algorithms made easy. The Journal of Machine Learning Research, 13(1): 2171–2175.
  • Frazier (2018) Frazier, P. I. 2018. A tutorial on Bayesian optimization. arXiv preprint arXiv:1807.02811.
  • Ge et al. (2022) Ge, D.; Liu, T.; Liu, J.; Tan, J.; and Ye, Y. 2022. SOLNP+: A Derivative-Free Solver for Constrained Nonlinear Optimization. arXiv preprint arXiv:2210.07160.
  • Gotovos (2013) Gotovos, A. 2013. Active learning for level set estimation. Master’s thesis, Eidgenössische Technische Hochschule Zürich, Department of Computer Science,.
  • Hall, Marron, and Neeman (2005) Hall, P.; Marron, J. S.; and Neeman, A. 2005. Geometric representation of high dimension, low sample size data. Journal of the Royal Statistical Society Series B: Statistical Methodology, 67(3): 427–444.
  • Hansen (2016) Hansen, N. 2016. The CMA evolution strategy: A tutorial. arXiv preprint arXiv:1604.00772.
  • Hashimoto, Yadlowsky, and Duchi (2018) Hashimoto, T.; Yadlowsky, S.; and Duchi, J. 2018. Derivative free optimization via repeated classification. In International Conference on Artificial Intelligence and Statistics, 2027–2036. PMLR.
  • Hensman, Fusi, and Lawrence (2013) Hensman, J.; Fusi, N.; and Lawrence, N. D. 2013. Gaussian processes for big data. arXiv preprint arXiv:1309.6835.
  • Hu, Qian, and Yu (2017) Hu, Y.-Q.; Qian, H.; and Yu, Y. 2017. Sequential classification-based optimization for direct policy search. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 31.
  • Larson, Menickelly, and Wild (2019) Larson, J.; Menickelly, M.; and Wild, S. M. 2019. Derivative-free optimization methods. Acta Numerica, 28: 287–404.
  • Liao (2010) Liao, T. W. 2010. Two hybrid differential evolution algorithms for engineering design optimization. Applied Soft Computing, 10(4): 1188–1199.
  • Liu et al. (2019a) Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; and Stoyanov, V. 2019a. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692.
  • Liu et al. (2017) Liu, Y.-R.; Hu, Y.-Q.; Qian, H.; Qian, C.; and Yu, Y. 2017. Zoopt: Toolbox for derivative-free optimization. arXiv preprint arXiv:1801.00329.
  • Liu et al. (2019b) Liu, Y.-R.; Hu, Y.-Q.; Qian, H.; and Yu, Y. 2019b. Asynchronous classification-based optimization. In Proceedings of the First International Conference on Distributed Artificial Intelligence, 1–8.
  • Michalski (2000) Michalski, R. S. 2000. Learnable evolution model: Evolutionary processes guided by machine learning. Machine learning, 38: 9–40.
  • Nayebi, Munteanu, and Poloczek (2019) Nayebi, A.; Munteanu, A.; and Poloczek, M. 2019. A framework for Bayesian optimization in embedded subspaces. In International Conference on Machine Learning, 4752–4761. PMLR.
  • Nelder and Mead (1965) Nelder, J. A.; and Mead, R. 1965. A simplex method for function minimization. The computer journal, 7(4): 308–313.
  • Nesterov and Spokoiny (2017) Nesterov, Y.; and Spokoiny, V. 2017. Random gradient-free minimization of convex functions. Foundations of Computational Mathematics, 17: 527–566.
  • Opara and Arabas (2019) Opara, K. R.; and Arabas, J. 2019. Differential Evolution: A survey of theoretical analyses. Swarm and evolutionary computation, 44: 546–558.
  • Ragonneau and Zhang (2023) Ragonneau, T. M.; and Zhang, Z. 2023. PDFO–A Cross-Platform Package for Powell’s Derivative-Free Optimization Solver. arXiv preprint arXiv:2302.13246.
  • Ray and Saini (2001) Ray, T.; and Saini, P. 2001. Engineering design optimization using a swarm with an intelligent information sharing among individuals. Engineering Optimization, 33(6): 735–748.
  • Ros and Hansen (2008) Ros, R.; and Hansen, N. 2008. A simple modification in CMA-ES achieving linear time and space complexity. In International conference on parallel problem solving from nature, 296–305. Springer.
  • Scheinberg (2022) Scheinberg, K. 2022. Finite Difference Gradient Approximation: To Randomize or Not? INFORMS Journal on Computing, 34(5): 2384–2388.
  • Shahriari et al. (2015) Shahriari, B.; Swersky, K.; Wang, Z.; Adams, R. P.; and De Freitas, N. 2015. Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE, 104(1): 148–175.
  • Shi et al. (2021) Shi, H.-J. M.; Xuan, M. Q.; Oztoprak, F.; and Nocedal, J. 2021. On the numerical performance of derivative-free optimization methods based on finite-difference approximations. arXiv preprint arXiv:2102.09762.
  • Snoek, Larochelle, and Adams (2012) Snoek, J.; Larochelle, H.; and Adams, R. P. 2012. Practical bayesian optimization of machine learning algorithms. Advances in neural information processing systems, 25.
  • Sobol’ (1967) Sobol’, I. M. 1967. On the distribution of points in a cube and the approximate evaluation of integrals. Zhurnal Vychislitel’noi Matematiki i Matematicheskoi Fiziki, 7(4): 784–802.
  • Socher et al. (2013) Socher, R.; Perelygin, A.; Wu, J.; Chuang, J.; Manning, C. D.; Ng, A. Y.; and Potts, C. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 conference on empirical methods in natural language processing, 1631–1642.
  • Sun et al. (2022) Sun, T.; Shao, Y.; Qian, H.; Huang, X.; and Qiu, X. 2022. Black-box tuning for language-model-as-a-service. In International Conference on Machine Learning, 20841–20855. PMLR.
  • Vapnik et al. (1998) Vapnik, V.; et al. 1998. Statistical learning theory. Wiley.
  • Wang et al. (2018a) Wang, A.; Singh, A.; Michael, J.; Hill, F.; Levy, O.; and Bowman, S. R. 2018a. GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461.
  • Wang, Fonseca, and Tian (2020) Wang, L.; Fonseca, R.; and Tian, Y. 2020. Learning search space partition for black-box optimization using monte carlo tree search. Advances in Neural Information Processing Systems, 33: 19511–19522.
  • Wang et al. (2018b) Wang, Z.; Gehring, C.; Kohli, P.; and Jegelka, S. 2018b. Batched large-scale Bayesian optimization in high-dimensional spaces. In International Conference on Artificial Intelligence and Statistics, 745–754. PMLR.
  • Wang et al. (2016) Wang, Z.; Hutter, F.; Zoghi, M.; Matheson, D.; and De Feitas, N. 2016. Bayesian optimization in a billion dimensions via random embeddings. Journal of Artificial Intelligence Research, 55: 361–387.
  • Yang and Shami (2020) Yang, L.; and Shami, A. 2020. On hyperparameter optimization of machine learning algorithms: Theory and practice. Neurocomputing, 415: 295–316.
  • Yi et al. (2024) Yi, Z.; Wei, Y.; Cheng, C. X.; He, K.; and Sui, Y. 2024. Improving sample efficiency of high dimensional Bayesian optimization with MCMC. arXiv preprint arXiv:2401.02650.
  • Yu (1979) Yu, W.-C. 1979. Positive basis and a class of direct search techniques. Scientia Sinica, Special Issue of Mathematics, 1(26): 53–68.
  • Yu and Qian (2014) Yu, Y.; and Qian, H. 2014. The sampling-and-learning framework: A statistical view of evolutionary algorithms. In 2014 IEEE Congress on Evolutionary Computation (CEC), 149–158. IEEE.
  • Yu, Qian, and Hu (2016) Yu, Y.; Qian, H.; and Hu, Y.-Q. 2016. Derivative-free optimization via classification. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 30.
  • Yue et al. (2023) Yue, P.; Yang, L.; Fang, C.; and Lin, Z. 2023. Zeroth-order Optimization with Weak Dimension Dependency. In The Thirty Sixth Annual Conference on Learning Theory, 4429–4472. PMLR.
  • Zhang, Zhao, and LeCun (2015) Zhang, X.; Zhao, J.; and LeCun, Y. 2015. Character-level convolutional networks for text classification. Advances in neural information processing systems, 28.

Appendix A Appendix

Synthetic functions

  • Ackley:

    f(x)=-20\exp\Bigl(-0.2\sqrt{\sum_{i=1}^{n}(x_{i}-0.2)^{2}/n}\Bigr)-\exp\Bigl(\sum_{i=1}^{n}\cos(2\pi x_{i})/n\Bigr)+e+20.
  • Levy:

    f(x)=\sin^{2}(\pi\omega_{1})+\sum_{i=1}^{n-1}(\omega_{i}-1)^{2}\bigl(1+10\sin^{2}(\pi\omega_{i}+1)\bigr)+(\omega_{n}-1)^{2}\bigl(1+\sin^{2}(2\pi\omega_{n})\bigr),

    where \omega_{i}=1+\frac{x_{i}-1}{4}.

  • Rastrigin:

    f(x)=10n+\sum_{i=1}^{n}\bigl(x_{i}^{2}-10\cos(2\pi x_{i})\bigr).
  • Sphere:

    f(x)=\sum_{i=1}^{n}(x_{i}-0.2)^{2}.
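For reproduction, the four benchmarks can be transcribed directly into NumPy; this is a sketch following the formulas as printed above (note the 0.2-shifted Ackley and Sphere).

```python
import numpy as np

# Direct NumPy transcriptions of the four benchmarks as printed above
# (the experiments use them over Omega = [-10, 10]^n).
def ackley(x):
    n = len(x)
    return (-20 * np.exp(-0.2 * np.sqrt(np.sum((x - 0.2) ** 2) / n))
            - np.exp(np.sum(np.cos(2 * np.pi * x)) / n) + np.e + 20)

def levy(x):
    w = 1 + (x - 1) / 4
    return (np.sin(np.pi * w[0]) ** 2
            + np.sum((w[:-1] - 1) ** 2 * (1 + 10 * np.sin(np.pi * w[:-1] + 1) ** 2))
            + (w[-1] - 1) ** 2 * (1 + np.sin(2 * np.pi * w[-1]) ** 2))

def rastrigin(x):
    return 10 * len(x) + np.sum(x ** 2 - 10 * np.cos(2 * np.pi * x))

def sphere(x):
    return np.sum((x - 0.2) ** 2)

assert sphere(np.full(50, 0.2)) == 0.0          # Sphere minimum at x = 0.2
assert abs(levy(np.ones(50))) < 1e-12           # Levy minimum at x = 1
assert abs(rastrigin(np.zeros(50))) < 1e-12     # Rastrigin minimum at x = 0
```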
Figure 4: Synthetic functions with n=2; panels (a) Ackley, (b) Levy, (c) Rastrigin, (d) Sphere.

Beyond continuity

For discontinuous objective functions

We design experiments on discontinuous objective functions by adding a random perturbation to the synthetic functions. The perturbation is set to:

P(x)=\sum_{i=1}^{m}\epsilon_{i}\cdot\delta_{\mathcal{B}(x_{i},0.5)}(x).

Here \mathcal{B}(x_{i},0.5) is the open ball centered at x_{i} with radius 0.5, where the centers x_{i}, i=1,\ldots,m, are randomly generated within the solution region. \delta_{\mathcal{B}(x_{i},0.5)}(x) is the indicator function, equal to 1 when x\in\mathcal{B}(x_{i},0.5) and 0 otherwise. The perturbations \epsilon_{i} are sampled uniformly from [0,1] for each ball center x_{i}, i=1,\ldots,m. The objective function is set to

\tilde{f}(x):=f(x)+P(x),

which is lower semi-continuous. We use the same settings as in the body sections, with dimension n=50, perturbation size m=5n, and budget T=100n. As before, the region-shrinking rate is γ=0.9 and the shrinking frequency is ρ=0.01. Each algorithm is repeated for 30 runs, and the mean convergence trajectories of the best-so-far values are presented in Figure 5. The numbers attached to the algorithm names in the legends are the mean obtained minima. The acceleration of RACE-CARS over SRACOS remains valid: compared with the baselines, RACE-CARS converges fastest and obtains the best optimal value. As anticipated, the performance of SRACOS and RACE-CARS is almost impervious to discontinuity, whereas the other three baselines, whose convergence relies on continuity, suffer from oscillation or early stopping to different extents.
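The perturbed objective can be sketched as follows, using the Sphere function as the base f for concreteness; the sampling ranges follow the setup above, while the seed and helper names are illustrative.

```python
import numpy as np

# Sketch of the discontinuous test objective f~(x) = f(x) + P(x): the
# perturbation P adds a random jump eps_i on each open ball B(x_i, 0.5),
# with centers drawn inside the solution region [-10, 10]^n.
def make_perturbation(n, m, radius=0.5, rng=None):
    if rng is None:
        rng = np.random.default_rng(0)
    centers = rng.uniform(-10, 10, size=(m, n))   # x_1, ..., x_m
    eps = rng.uniform(0, 1, size=m)               # jump heights
    def P(x):
        inside = np.linalg.norm(centers - x, axis=1) < radius
        return float(np.sum(eps[inside]))
    return P

n, m = 50, 5 * 50
P = make_perturbation(n, m)
f_tilde = lambda x: np.sum((x - 0.2) ** 2) + P(x)   # Sphere + perturbation
x = np.zeros(n)
assert f_tilde(x) >= np.sum((x - 0.2) ** 2)         # P is non-negative
```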

Figure 5: Comparison on discontinuous objectives; panels (a) Ackley, (b) Levy, (c) Rastrigin, (d) Sphere.

For discrete optimization

To transfer RACE-CARS to discrete optimization, the Training and Projection sub-procedures in Algorithm 4 must be modified. In all cases, we employ the discrete version of RACOS for Training (Yu, Qian, and Hu 2016). Furthermore, we take the counting measure # as the inducing measure of the probability space (Ω,ℱ,ℙ), where ℙ(B):=#(B)/#(Ω) for all B∈ℱ. The Projection step is similar, except that the operator ∥·∥ returns the point count along each dimension of the region.

We design experiments on the following formulation:

\min f(x,y) \qquad (4)
\text{s.t.}~x\in\Omega_{c},~y\in\Omega_{d},

where \Omega_{c} is the continuous solution subspace and \Omega_{d} is discrete. Equation (4) encompasses a wide range of continuous, discrete, and mixed-integer programming problems. In our experiments, we specify equation (4) as a mixed-integer programming problem:

\min \mathrm{Ackley}(x)+L^{T}\mathrm{abs}(y)
\text{s.t.}~x\in[-1,1]^{n_{1}},~y\in\{-10,-9,\ldots,9,10\}^{n_{2}},

where L\in\mathbb{R}^{n_{2}} is sampled uniformly from [1,2]^{n_{2}}, so the global optimal value is 0. We choose the dimensions of the solution space as n_{1}=n_{2}=50 and 250, with function-evaluation budgets of 3000 and 10000, respectively. The region-shrinking rate is γ=0.95, with shrinking frequencies of ρ=0.01 and 0.005, respectively. Each algorithm is repeated for 30 runs, and the mean convergence trajectories of the best-so-far values are presented in Figure 6. As the results show, RACE-CARS maintains its acceleration over SRACOS in the discrete setting.
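A sketch of the mixed-integer objective above; the Ackley form repeats the appendix formula for self-containedness, and the seed is illustrative.

```python
import numpy as np

# Sketch of the mixed-integer test problem: Ackley(x) + L^T |y|, with
# continuous x in [-1, 1]^{n1} and integer y in {-10, ..., 10}^{n2}.
def ackley(x):
    n = len(x)
    return (-20 * np.exp(-0.2 * np.sqrt(np.sum((x - 0.2) ** 2) / n))
            - np.exp(np.sum(np.cos(2 * np.pi * x)) / n) + np.e + 20)

rng = np.random.default_rng(0)
n1 = n2 = 50
L = rng.uniform(1, 2, size=n2)          # fixed positive weights on |y|

def objective(x, y):
    return ackley(x) + L @ np.abs(y)

x = rng.uniform(-1, 1, size=n1)         # continuous block
y = rng.integers(-10, 11, size=n2)      # integer block
assert objective(x, y) >= 0.0           # both summands are non-negative
```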

Figure 6: Mixed-integer programming; panels (a) n_1=n_2=50, (b) n_1=n_2=250.

Ablation experiments

  • (i)

    Relationship between shrinking frequency ρ and dimension n.

    For Ackley on Ω=[-10,10]^n, we fix the shrinking rate γ=0.95 and compare the performance of RACE-CARS across shrinking frequencies ρ and dimensions n. The shrinking frequency ρ ranges from 0.002 to 0.2 and the dimension n from 50 to 500. The function-call budget is set to T=30n for fairness. Experiments are repeated 5 times for each hyperparameter setting; the results are recorded in Table 1, and the normalized results are shown as a heatmap, in which the horizontal axis is the dimension, the vertical axis is the shrinking frequency, and the black curve traces the best shrinking frequency with respect to dimension. The results indicate that the best ρ is inversely proportional to n; maintaining nρ constant is therefore preferred.

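Reading the best shrinking frequency off Table 1 for each dimension illustrates this guidance numerically; the best-ρ values below are read from the bold entries of Table 1 (borderline cases could be read slightly differently).

```python
# Best shrinking frequency rho per dimension n, read from Table 1:
# the product n * rho stays roughly constant (about 1.4 to 2.0),
# supporting the guidance to keep n * rho fixed as n varies.
best_rho = {50: 0.028, 100: 0.016, 150: 0.012, 200: 0.008,
            250: 0.008, 300: 0.006, 350: 0.004, 400: 0.004,
            450: 0.004, 500: 0.004}
products = {n: n * r for n, r in best_rho.items()}
assert all(1.3 <= p <= 2.1 for p in products.values())
```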
  • (ii)

    Relationship between shrinking factor γ^{nρ} and dimension n of the solution space.

    For Ackley on Ω=[-r,r]^n, we compare the performance of RACE-CARS across shrinking factors and dimensions n. Different shrinking factors are generated by varying the shrinking rate γ and the product nρ of dimension and shrinking frequency. We design experiments on 4 dimensions n with 4 radii r. The function-call budget is set to T=30n. Experiments are repeated 5 times for each setting, and the results are presented as heatmaps in Figure 7. According to the results, the best shrinking factor is insensitive to the variation of dimension. Since the best nρ remains constant as n varies, only a slight variation of the corresponding best γ is needed. This observation is in line with the settings used in the Experiments section.

    Figure 7: Comparison of shrinking factor γ^{nρ} and dimension n of the solution space Ω=[-r,r]^n; panels (a) r=1, (b) r=5, (c) r=10, (d) r=25. In each subfigure, the horizontal axis is the dimension and the vertical axis is the shrinking factor; each pixel shows the y-wise normalized mean function value at the 30n-th step, and the black curve traces the best shrinking factor in each dimension.
  • (iii)

    Relationship between shrinking factor γ^{nρ} and radius r of the solution space.

    For Ackley on Ω=[-r,r]^n, we compare the performance of RACE-CARS across shrinking factors and radii r. Different shrinking factors are generated by varying the shrinking rate γ and the product nρ. We design experiments on 4 radii r with 4 dimensions n. The function-call budget is set to T=30n. Experiments are repeated 5 times for each setting, and the results are presented as heatmaps in Figure 8. According to the results, the best shrinking factor γ^{nρ} should be decreased as the radius r increases.

    Figure 8: Comparison of shrinking factor γ^{nρ} and radius r of the solution space Ω=[-r,r]^n; panels (a) n=50, (b) n=100, (c) n=250, (d) n=500. In each subfigure, the horizontal axis is the radius of the solution space and the vertical axis is the shrinking factor; each pixel shows the y-wise normalized mean function value at the 30n-th step, and the black curve traces the best shrinking factor for each radius.
Table 1: Comparison of shrinking frequencies ρ for Ackley on Ω=[-10,10]^n with shrinking rate γ=0.95. Mean and standard deviation of the function value at the 30n-th step are listed. The first row (ρ=0) gives the results of RACOS for reference. Results for ρ greater than 0.1 are omitted for concision. Bold entries (marked **) are the relatively better results in each dimension.
ρ \ n | 50 | 100 | 150 | 200 | 250 | 300 | 350 | 400 | 450 | 500
0 | 3.8±0.2 | 3.9±0.2 | 5.9±0.1 | 5.7±0.1 | 5.8±0.2 | 5.8±0.1 | 5.9±0.0 | 5.8±0.0 | 5.9±0.1 | 5.8±0.1
0.002 | 3.7±0.2 | 3.5±0.2 | 4.3±0.3 | 4.4±0.3 | 4.0±0.6 | 3.9±0.5 | 3.7±0.4 | 3.3±0.4 | 3.3±0.3 | 2.6±0.4
0.004 | 3.4±0.3 | 3.2±0.1 | 3.8±0.6 | 3.7±0.2 | 2.8±0.5 | 2.1±0.5 | **1.9±0.2** | **1.9±0.2** | **1.8±0.6** | **1.7±0.4**
0.006 | 3.3±0.3 | 2.9±0.2 | 2.9±0.4 | 2.0±0.1 | 2.1±0.4 | **1.8±0.2** | **1.9±0.2** | 2.4±0.3 | 2.7±0.3 | 3.1±1.5
0.008 | 3.0±0.3 | 2.4±0.4 | 2.3±0.5 | **1.7±0.2** | **1.8±0.4** | 2.3±0.7 | 2.5±0.4 | 3.9±0.8 | 4.5±0.9 | 5.5±1.0
0.010 | 2.8±0.4 | **1.8±0.5** | **1.9±0.4** | 2.1±0.2 | 2.3±0.6 | 3.6±0.8 | 4.1±0.6 | 4.9±0.9 | 7.5±0.7 | 6.3±0.8
0.012 | 2.5±0.1 | **1.4±0.2** | **1.6±0.4** | **1.8±0.2** | 3.1±0.7 | 4.4±1.1 | 5.4±0.5 | 5.6±1.4 | 6.6±0.8 | 7.5±0.7
0.014 | 2.6±0.3 | **1.5±0.3** | 2.5±0.5 | 2.2±0.5 | 3.9±0.6 | 5.0±0.8 | 7.1±1.2 | 6.3±0.6 | 8.3±0.7 | 8.3±0.5
0.016 | 2.6±0.4 | **1.3±0.4** | 2.0±0.3 | 3.4±1.3 | 4.4±0.9 | 6.1±1.0 | 7.0±1.1 | 6.9±0.8 | 8.0±0.6 | 8.9±0.8
0.018 | 2.3±0.2 | **1.3±0.4** | 2.8±0.8 | 4.1±0.6 | 5.1±0.9 | 6.9±1.0 | 7.1±1.3 | 8.3±0.5 | 9.1±0.9 | 9.3±0.4
0.020 | 2.0±0.6 | 2.0±0.5 | 3.2±1.2 | 4.4±0.8 | 6.3±1.3 | 7.2±1.1 | 7.4±0.6 | 9.1±0.8 | 10.2±0.9 | 10.0±0.6
0.022 | **1.7±0.3** | 2.0±1.0 | 3.9±1.3 | 5.3±1.0 | 6.8±1.0 | 7.5±1.1 | 8.8±0.6 | 9.1±0.8 | 10.7±0.5 | 10.0±0.3
0.024 | **1.9±0.2** | 3.3±0.9 | 4.3±1.1 | 6.0±0.9 | 7.0±0.3 | 8.5±0.7 | 9.3±0.4 | 10.1±0.7 | 10.7±0.5 | 10.8±0.6
0.026 | **1.5±0.4** | 2.7±1.2 | 4.3±0.6 | 7.0±0.7 | 8.3±0.4 | 9.0±0.4 | 9.8±0.7 | 10.1±0.7 | 10.7±0.6 | 11.6±0.4
0.028 | **1.3±0.2** | 3.8±0.6 | 4.9±0.7 | 7.2±0.7 | 8.8±1.0 | 8.8±0.9 | 9.5±0.5 | 10.7±0.4 | 10.9±0.2 | 11.3±0.4
0.030 | **1.3±0.4** | 4.0±0.4 | 5.0±0.4 | 7.1±0.9 | 8.2±1.0 | 9.2±0.6 | 10.3±0.5 | 10.6±1.0 | 10.9±0.4 | 11.8±0.4
0.032 | **1.5±0.5** | 6.1±1.2 | 6.2±1.1 | 7.3±0.6 | 9.1±0.8 | 9.9±0.5 | 10.6±0.4 | 10.5±0.8 | 11.2±0.6 | 12.0±0.3
0.034 | **1.9±0.6** | 5.1±0.8 | 5.9±0.7 | 8.2±0.7 | 9.0±0.3 | 10.4±0.4 | 10.6±0.6 | 11.2±0.4 | 11.9±0.5 | 12.1±0.4
0.036 | **1.5±0.2** | 6.4±0.5 | 6.8±0.6 | 7.9±1.0 | 9.3±1.0 | 10.4±0.4 | 10.9±0.3 | 11.3±0.5 | 11.9±0.3 | 12.0±0.3
0.038 | 2.2±1.6 | 5.1±0.8 | 6.1±0.6 | 8.2±1.0 | 9.6±0.8 | 10.8±0.5 | 10.9±0.5 | 11.6±0.3 | 11.8±0.4 | 12.3±0.5
0.040 | 2.2±1.1 | 7.3±1.6 | 7.6±0.4 | 8.7±0.4 | 9.3±0.7 | 10.4±0.5 | 11.4±0.4 | 11.7±0.6 | 12.3±0.2 | 12.2±0.3
0.042 | 2.2±0.8 | 6.4±1.2 | 7.6±0.8 | 9.3±0.8 | 10.0±0.8 | 10.4±0.5 | 11.6±0.3 | 12.3±0.4 | 12.0±0.5 | 12.5±0.5
0.044 | **1.8±0.6** | 6.8±1.0 | 6.9±0.9 | 9.6±0.5 | 10.3±0.4 | 11.0±1.0 | 11.1±0.3 | 12.0±0.3 | 12.4±0.4 | 12.6±0.3
0.046 | 2.1±0.3 | 7.4±0.7 | 8.1±0.8 | 9.4±0.3 | 10.4±0.9 | 11.3±0.4 | 11.6±0.5 | 12.1±0.4 | 12.4±0.2 | 12.7±0.2
0.048 | 3.2±1.1 | 6.6±1.0 | 7.5±1.3 | 9.9±0.4 | 10.1±0.4 | 11.3±0.2 | 12.2±0.4 | 12.0±0.5 | 12.3±0.3 | 12.5±0.3
0.050 | 3.5±1.0 | 7.7±0.8 | 8.6±0.3 | 10.0±0.3 | 10.6±0.6 | 11.1±0.5 | 12.4±0.2 | 12.5±0.4 | 12.6±0.3 | 12.8±0.2
0.052 3.6±0.73.6\pm 0.7 8.8±0.98.8\pm 0.9 8.1±0.68.1\pm 0.6 10.0±0.810.0\pm 0.8 11.0±0.411.0\pm 0.4 11.6±0.711.6\pm 0.7 12.7±0.212.7\pm 0.2 12.5±0.112.5\pm 0.1 12.8±0.412.8\pm 0.4 13.3±0.113.3\pm 0.1
0.054 3.8±1.53.8\pm 1.5 7.3±0.97.3\pm 0.9 8.5±0.38.5\pm 0.3 10.1±0.910.1\pm 0.9 11.3±0.311.3\pm 0.3 11.7±0.211.7\pm 0.2 12.7±0.312.7\pm 0.3 12.9±0.212.9\pm 0.2 12.8±0.212.8\pm 0.2 13.0±0.313.0\pm 0.3
0.056 3.8±1.33.8\pm 1.3 8.8±1.08.8\pm 1.0 9.1±0.89.1\pm 0.8 10.5±0.710.5\pm 0.7 11.1±0.711.1\pm 0.7 11.8±0.411.8\pm 0.4 12.4±0.212.4\pm 0.2 12.6±0.412.6\pm 0.4 13.1±0.113.1\pm 0.1 13.0±0.213.0\pm 0.2
0.058 4.1±1.04.1\pm 1.0 9.1±1.19.1\pm 1.1 9.3±1.39.3\pm 1.3 10.7±0.510.7\pm 0.5 10.9±0.210.9\pm 0.2 11.9±0.311.9\pm 0.3 12.1±0.412.1\pm 0.4 12.7±0.412.7\pm 0.4 13.1±0.213.1\pm 0.2 13.2±0.313.2\pm 0.3
0.060 4.1±1.04.1\pm 1.0 8.8±0.48.8\pm 0.4 9.1±0.99.1\pm 0.9 10.8±0.510.8\pm 0.5 11.2±0.511.2\pm 0.5 11.8±0.311.8\pm 0.3 12.5±0.212.5\pm 0.2 12.8±0.212.8\pm 0.2 13.0±0.313.0\pm 0.3 13.2±0.213.2\pm 0.2
0.062 4.1±1.74.1\pm 1.7 9.2±1.09.2\pm 1.0 9.1±0.69.1\pm 0.6 10.9±0.510.9\pm 0.5 11.9±0.511.9\pm 0.5 12.1±0.312.1\pm 0.3 12.4±0.212.4\pm 0.2 12.9±0.412.9\pm 0.4 13.0±0.313.0\pm 0.3 13.3±0.113.3\pm 0.1
0.064 4.5±1.24.5\pm 1.2 8.6±0.58.6\pm 0.5 9.7±0.49.7\pm 0.4 11.1±0.811.1\pm 0.8 11.7±0.211.7\pm 0.2 12.3±0.512.3\pm 0.5 12.6±0.312.6\pm 0.3 13.1±0.213.1\pm 0.2 13.3±0.213.3\pm 0.2 13.4±0.213.4\pm 0.2
0.066 4.7±0.34.7\pm 0.3 9.5±0.99.5\pm 0.9 9.2±0.49.2\pm 0.4 11.0±0.511.0\pm 0.5 12.0±0.312.0\pm 0.3 12.1±0.312.1\pm 0.3 12.8±0.412.8\pm 0.4 12.9±0.212.9\pm 0.2 13.3±0.213.3\pm 0.2 13.3±0.213.3\pm 0.2
0.068 4.7±0.74.7\pm 0.7 9.2±1.09.2\pm 1.0 9.7±0.79.7\pm 0.7 11.0±0.711.0\pm 0.7 11.7±0.411.7\pm 0.4 12.8±0.412.8\pm 0.4 12.5±0.612.5\pm 0.6 13.0±0.213.0\pm 0.2 13.4±0.113.4\pm 0.1 13.4±0.213.4\pm 0.2
0.070 5.4±1.55.4\pm 1.5 9.5±0.99.5\pm 0.9 10.1±0.610.1\pm 0.6 10.8±0.410.8\pm 0.4 12.3±0.312.3\pm 0.3 12.4±0.412.4\pm 0.4 12.5±0.712.5\pm 0.7 13.1±0.413.1\pm 0.4 13.3±0.213.3\pm 0.2 13.5±0.113.5\pm 0.1
0.072 5.3±1.25.3\pm 1.2 9.1±0.89.1\pm 0.8 10.1±0.410.1\pm 0.4 11.7±0.511.7\pm 0.5 12.2±0.612.2\pm 0.6 12.5±0.412.5\pm 0.4 13.0±0.413.0\pm 0.4 13.3±0.213.3\pm 0.2 13.2±0.213.2\pm 0.2 13.7±0.213.7\pm 0.2
0.074 5.8±1.05.8\pm 1.0 10.1±0.810.1\pm 0.8 10.3±0.410.3\pm 0.4 11.4±0.311.4\pm 0.3 12.0±0.412.0\pm 0.4 12.5±0.412.5\pm 0.4 12.7±0.312.7\pm 0.3 13.0±0.413.0\pm 0.4 13.3±0.313.3\pm 0.3 13.7±0.113.7\pm 0.1
0.076 5.2±1.35.2\pm 1.3 9.8±0.59.8\pm 0.5 10.6±0.510.6\pm 0.5 11.3±0.911.3\pm 0.9 12.3±0.312.3\pm 0.3 12.7±0.512.7\pm 0.5 13.0±0.213.0\pm 0.2 13.2±0.213.2\pm 0.2 13.5±0.313.5\pm 0.3 13.7±0.213.7\pm 0.2
0.078 5.9±0.55.9\pm 0.5 10.3±0.510.3\pm 0.5 9.8±0.69.8\pm 0.6 11.8±0.111.8\pm 0.1 12.1±0.112.1\pm 0.1 12.7±0.412.7\pm 0.4 13.2±0.313.2\pm 0.3 13.3±0.313.3\pm 0.3 13.6±0.113.6\pm 0.1 13.7±0.213.7\pm 0.2
0.080 5.6±0.75.6\pm 0.7 10.1±0.110.1\pm 0.1 10.6±0.410.6\pm 0.4 11.4±0.411.4\pm 0.4 12.3±0.512.3\pm 0.5 13.0±0.313.0\pm 0.3 13.1±0.113.1\pm 0.1 13.4±0.313.4\pm 0.3 13.5±0.413.5\pm 0.4 13.7±0.213.7\pm 0.2
0.082 4.3±1.34.3\pm 1.3 10.3±0.810.3\pm 0.8 10.2±0.710.2\pm 0.7 11.8±0.511.8\pm 0.5 12.5±0.412.5\pm 0.4 12.8±0.312.8\pm 0.3 13.1±0.313.1\pm 0.3 13.4±0.213.4\pm 0.2 13.5±0.313.5\pm 0.3 13.8±0.213.8\pm 0.2
0.084 6.7±0.96.7\pm 0.9 10.6±0.310.6\pm 0.3 10.9±0.210.9\pm 0.2 11.8±0.311.8\pm 0.3 12.5±0.312.5\pm 0.3 13.0±0.313.0\pm 0.3 13.3±0.213.3\pm 0.2 13.4±0.313.4\pm 0.3 13.6±0.213.6\pm 0.2 13.8±0.213.8\pm 0.2
0.086 4.9±0.64.9\pm 0.6 10.2±0.610.2\pm 0.6 11.0±0.411.0\pm 0.4 11.9±0.311.9\pm 0.3 12.4±0.212.4\pm 0.2 12.6±0.512.6\pm 0.5 13.0±0.313.0\pm 0.3 13.5±0.213.5\pm 0.2 13.7±0.213.7\pm 0.2 13.9±0.213.9\pm 0.2
0.088 5.8±1.05.8\pm 1.0 10.7±0.610.7\pm 0.6 10.9±0.210.9\pm 0.2 11.7±0.611.7\pm 0.6 12.5±0.212.5\pm 0.2 13.0±0.513.0\pm 0.5 13.3±0.213.3\pm 0.2 13.6±0.313.6\pm 0.3 13.6±0.113.6\pm 0.1 13.9±0.113.9\pm 0.1
0.090 6.6±1.46.6\pm 1.4 10.2±0.610.2\pm 0.6 11.1±0.411.1\pm 0.4 12.1±0.312.1\pm 0.3 12.6±0.512.6\pm 0.5 13.0±0.213.0\pm 0.2 13.5±0.113.5\pm 0.1 13.4±0.213.4\pm 0.2 13.6±0.213.6\pm 0.2 13.8±0.213.8\pm 0.2
0.092 7.0±1.07.0\pm 1.0 10.4±0.610.4\pm 0.6 11.1±0.711.1\pm 0.7 12.2±0.312.2\pm 0.3 12.8±0.312.8\pm 0.3 13.0±0.213.0\pm 0.2 13.3±0.213.3\pm 0.2 13.5±0.313.5\pm 0.3 13.7±0.213.7\pm 0.2 13.8±0.213.8\pm 0.2
0.094 7.9±0.57.9\pm 0.5 10.2±0.210.2\pm 0.2 11.2±0.611.2\pm 0.6 12.3±0.212.3\pm 0.2 12.5±0.312.5\pm 0.3 12.8±0.312.8\pm 0.3 13.2±0.213.2\pm 0.2 13.5±0.213.5\pm 0.2 13.6±0.213.6\pm 0.2 13.9±0.213.9\pm 0.2
0.096 6.7±0.56.7\pm 0.5 10.9±0.610.9\pm 0.6 11.1±0.211.1\pm 0.2 12.2±0.512.2\pm 0.5 12.8±0.212.8\pm 0.2 13.1±0.313.1\pm 0.3 13.4±0.313.4\pm 0.3 13.4±0.213.4\pm 0.2 13.8±0.313.8\pm 0.3 13.9±0.113.9\pm 0.1
0.098 7.6±0.57.6\pm 0.5 10.7±0.510.7\pm 0.5 11.1±0.211.1\pm 0.2 12.2±0.312.2\pm 0.3 12.6±0.312.6\pm 0.3 13.0±0.413.0\pm 0.4 13.3±0.213.3\pm 0.2 13.6±0.213.6\pm 0.2 13.7±0.213.7\pm 0.2 14.0±0.114.0\pm 0.1
0.100 8.2±1.08.2\pm 1.0 10.8±0.310.8\pm 0.3 11.3±0.811.3\pm 0.8 11.9±0.411.9\pm 0.4 13.0±0.213.0\pm 0.2 13.3±0.313.3\pm 0.3 13.4±0.113.4\pm 0.1 13.6±0.213.6\pm 0.2 13.8±0.113.8\pm 0.1 13.8±0.213.8\pm 0.2
Figure 9: Black-Box Tuning for LMaaS. Results of Training Loss, Training Accuracy, Development Loss and Development Accuracy on Yelp P, AG’s News and RTE.

Appendix B: Supplementary Theory

Proofs of Theorems

Theorem 2.

For the sequential-mode classification-based DFO Algorithm 3, let $\mathbf{X}_{t}=\mathbf{X}_{h_{t}}$, and fix $\epsilon>0$ and $0<\delta<1$. When $\Omega_{\epsilon}$ is $\eta$-shattered by $h_{t}$ for all $t=r+1,\ldots,T$ and $\max_{t=r+1,\ldots,T}\mathbb{P}(\{x\in\Omega\colon h_{t}(x)=1\})\leq p\leq 1$, the $(\epsilon,\delta)$-query complexity is upper bounded by

$$\mathcal{O}\left(\max\left\{\left(\lambda\frac{\eta}{p}+(1-\lambda)\right)^{-1}\left(\frac{1}{|\Omega_{\epsilon}|}\ln\frac{1}{\delta}-r\right)+r,\;T\right\}\right).$$
Proof.

Let $\tilde{x}:=\arg\min_{t=1,\ldots,T}f(x_{t})$. Then

$$\begin{aligned}
\Pr\left(f(\tilde{x})-f^{*}>\epsilon\right)
&=\mathbb{E}\left[\mathbb{I}_{\{\mathbf{Y}_{1},\ldots,\mathbf{Y}_{T-1}\in\Omega_{\epsilon}^{c}\}}\,\mathbb{E}[\mathbb{I}_{\{\mathbf{Y}_{T}\in\Omega_{\epsilon}^{c}\}}\mid\mathcal{F}_{T-1}]\right]\\
&=\mathbb{E}\left[\mathbb{I}_{\{\mathbf{Y}_{1},\ldots,\mathbf{Y}_{T-2}\in\Omega_{\epsilon}^{c}\}}\,\mathbb{E}\left[\mathbb{I}_{\{\mathbf{Y}_{T-1}\in\Omega_{\epsilon}^{c}\}}\,\mathbb{E}[\mathbb{I}_{\{\mathbf{Y}_{T}\in\Omega_{\epsilon}^{c}\}}\mid\mathcal{F}_{T-1}]\mid\mathcal{F}_{T-2}\right]\right]\\
&\;\;\vdots\\
&=\mathbb{E}\left[\mathbb{I}_{\{\mathbf{Y}_{1},\ldots,\mathbf{Y}_{r}\in\Omega_{\epsilon}^{c}\}}\,\mathbb{E}\left[\mathbb{I}_{\{\mathbf{Y}_{r+1}\in\Omega_{\epsilon}^{c}\}}\cdots\mathbb{E}\left[\mathbb{I}_{\{\mathbf{Y}_{T-1}\in\Omega_{\epsilon}^{c}\}}\,\mathbb{E}[\mathbb{I}_{\{\mathbf{Y}_{T}\in\Omega_{\epsilon}^{c}\}}\mid\mathcal{F}_{T-1}]\mid\mathcal{F}_{T-2}\right]\cdots\mid\mathcal{F}_{r}\right]\right].
\end{aligned}$$

Here $\mathbb{I}_{B}$ denotes the indicator function of $B\in\mathcal{F}$, i.e. $\mathbb{I}_{B}(x)=1$ for all $x\in B$ and $\mathbb{I}_{B}(x)=0$ otherwise. At step $t\geq r+1$, since $\mathbf{X}_{\Omega}$ is independent of $\mathcal{F}_{t-1}$, it holds that

$$\begin{aligned}
\mathbb{E}[\mathbb{I}_{\{\mathbf{Y}_{t}\in\Omega_{\epsilon}^{c}\}}\mid\mathcal{F}_{t-1}]
&=\mathbb{E}[\mathbb{I}_{\{\lambda\mathbf{X}_{t}+(1-\lambda)\mathbf{X}_{\Omega}\in\Omega_{\epsilon}^{c}\}}\mid\mathcal{F}_{t-1}]\\
&=\lambda\left(1-\mathbb{E}[\mathbb{I}_{\{\mathbf{X}_{h_{t}}\in\Omega_{\epsilon}\}}\mid\mathcal{F}_{t-1}]\right)+(1-\lambda)(1-|\Omega_{\epsilon}|).
\end{aligned}$$

Under the assumption that $\Omega_{\epsilon}$ is $\eta$-shattered by $h_{t}$, we have

$$\mathbb{E}[\mathbb{I}_{\{\mathbf{X}_{h_{t}}\in\Omega_{\epsilon}\}}\mid\mathcal{F}_{t-1}]=\frac{\mathbb{P}(\{x\in\Omega_{\epsilon}\colon h_{t}(x)=1\})}{\mathbb{P}(\{x\in\Omega\colon h_{t}(x)=1\})}\geq\frac{\eta}{p}|\Omega_{\epsilon}|.$$

Therefore,

$$\begin{aligned}
\mathbb{E}[\mathbb{I}_{\{\mathbf{Y}_{t}\in\Omega_{\epsilon}^{c}\}}\mid\mathcal{F}_{t-1}]
&=\lambda\left(1-\mathbb{E}[\mathbb{I}_{\{\mathbf{X}_{h_{t}}\in\Omega_{\epsilon}\}}\mid\mathcal{F}_{t-1}]\right)+(1-\lambda)(1-|\Omega_{\epsilon}|)\\
&\leq 1-\left(\lambda\frac{\eta}{p}+(1-\lambda)\right)|\Omega_{\epsilon}|.
\end{aligned}$$

Since this upper bound satisfies $0<1-\left(\lambda\frac{\eta}{p}+(1-\lambda)\right)|\Omega_{\epsilon}|<1$, it follows that

$$\begin{aligned}
\mathbb{E}\left[\mathbb{I}_{\{\mathbf{Y}_{t}\in\Omega_{\epsilon}^{c}\}}\,\mathbb{E}[\mathbb{I}_{\{\mathbf{Y}_{t+1}\in\Omega_{\epsilon}^{c}\}}\mid\mathcal{F}_{t}]\mid\mathcal{F}_{t-1}\right]
&\leq\left(1-\left(\lambda\frac{\eta}{p}+(1-\lambda)\right)|\Omega_{\epsilon}|\right)\mathbb{E}[\mathbb{I}_{\{\mathbf{Y}_{t}\in\Omega_{\epsilon}^{c}\}}\mid\mathcal{F}_{t-1}]\\
&\leq\left(1-\left(\lambda\frac{\eta}{p}+(1-\lambda)\right)|\Omega_{\epsilon}|\right)^{2}.
\end{aligned}$$

Moreover,

$$\begin{aligned}
\Pr\left(f(\tilde{x})-f^{*}>\epsilon\right)
&=\mathbb{E}\left[\mathbb{I}_{\{\mathbf{Y}_{1},\ldots,\mathbf{Y}_{r}\in\Omega_{\epsilon}^{c}\}}\,\mathbb{E}\left[\mathbb{I}_{\{\mathbf{Y}_{r+1}\in\Omega_{\epsilon}^{c}\}}\cdots\mathbb{E}\left[\mathbb{I}_{\{\mathbf{Y}_{T-1}\in\Omega_{\epsilon}^{c}\}}\,\mathbb{E}[\mathbb{I}_{\{\mathbf{Y}_{T}\in\Omega_{\epsilon}^{c}\}}\mid\mathcal{F}_{T-1}]\mid\mathcal{F}_{T-2}\right]\cdots\mid\mathcal{F}_{r}\right]\right]\\
&\leq\left(1-\left(\lambda\frac{\eta}{p}+(1-\lambda)\right)|\Omega_{\epsilon}|\right)^{T-r}\Pr\left(\mathbf{Y}_{1},\ldots,\mathbf{Y}_{r}\in\Omega_{\epsilon}^{c}\right)\\
&=\left(1-\left(\lambda\frac{\eta}{p}+(1-\lambda)\right)|\Omega_{\epsilon}|\right)^{T-r}(1-|\Omega_{\epsilon}|)^{r}\\
&\leq\exp\left\{-\left((T-r)\left(\lambda\frac{\eta}{p}+(1-\lambda)\right)+r\right)|\Omega_{\epsilon}|\right\}.
\end{aligned}$$

In order that $\Pr\left(f(\tilde{x})-f^{*}>\epsilon\right)\leq\delta$, it suffices that

$$\exp\left\{-\left((T-r)\left(\lambda\frac{\eta}{p}+(1-\lambda)\right)+r\right)|\Omega_{\epsilon}|\right\}\leq\delta,$$

hence the $(\epsilon,\delta)$-query complexity is upper bounded by

$$\mathcal{O}\left(\max\left\{\left(\lambda\frac{\eta}{p}+(1-\lambda)\right)^{-1}\left(\frac{1}{|\Omega_{\epsilon}|}\ln\frac{1}{\delta}-r\right)+r,\;T\right\}\right).$$
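As a quick numerical illustration (not part of the original analysis), the first term of this bound can be evaluated directly. The helper below and all parameter values are hypothetical; $|\Omega_{\epsilon}|$ is passed as `eps_region`:

```python
import math

def query_complexity_bound(eps_region, delta, lam, eta, p, r):
    """Evaluate the first term of the Theorem 2 bound:
    (lam*eta/p + (1-lam))^{-1} * (ln(1/delta)/|Omega_eps| - r) + r.
    All arguments are hypothetical sample values, not values from the paper."""
    rate = lam * eta / p + (1.0 - lam)
    return rate ** -1 * (math.log(1.0 / delta) / eps_region - r) + r

# A classifier that concentrates on Omega_eps (larger eta/p) shrinks the bound.
loose = query_complexity_bound(1e-3, 0.05, lam=0.5, eta=0.1, p=1.0, r=10)
tight = query_complexity_bound(1e-3, 0.05, lam=0.5, eta=0.9, p=1.0, r=10)
assert tight < loose
```

This mirrors the role of $\eta/p$ in the bound: the better the hypothesis concentrates its positive region on $\Omega_{\epsilon}$, the fewer queries the bound requires.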

Theorem 3.

Consider Algorithm 4 with region-shrinking rate $0<\gamma<1$ and region-shrinking frequency $0<\rho<1$, and fix $\epsilon>0$ and $0<\delta<1$. When $\Omega_{\epsilon}$ is $\eta$-shattered by $\tilde{h}_{t}$ for all $t=r+1,\ldots,T$, the $(\epsilon,\delta)$-query complexity is upper bounded by

$$\mathcal{O}\left(\max\left\{\left(\frac{\gamma^{-\rho}+\gamma^{-(T-r)\rho}}{2}\lambda\eta+(1-\lambda)\right)^{-1}\left(\frac{1}{|\Omega_{\epsilon}|}\ln\frac{1}{\delta}-r\right)+r,\;T\right\}\right).$$
Proof.

Let $\tilde{x}:=\arg\min_{t=1,\ldots,T}f(x_{t})$. Then

$$\begin{aligned}
\Pr\left(f(\tilde{x})-f^{*}>\epsilon\right)
&=\mathbb{E}\left[\mathbb{I}_{\{\mathbf{Y}_{1},\ldots,\mathbf{Y}_{T-1}\in\Omega_{\epsilon}^{c}\}}\,\mathbb{E}[\mathbb{I}_{\{\mathbf{Y}_{T}\in\Omega_{\epsilon}^{c}\}}\mid\mathcal{F}_{T-1}]\right]\\
&=\mathbb{E}\left[\mathbb{I}_{\{\mathbf{Y}_{1},\ldots,\mathbf{Y}_{T-2}\in\Omega_{\epsilon}^{c}\}}\,\mathbb{E}\left[\mathbb{I}_{\{\mathbf{Y}_{T-1}\in\Omega_{\epsilon}^{c}\}}\,\mathbb{E}[\mathbb{I}_{\{\mathbf{Y}_{T}\in\Omega_{\epsilon}^{c}\}}\mid\mathcal{F}_{T-1}]\mid\mathcal{F}_{T-2}\right]\right]\\
&\;\;\vdots\\
&=\mathbb{E}\left[\mathbb{I}_{\{\mathbf{Y}_{1},\ldots,\mathbf{Y}_{r}\in\Omega_{\epsilon}^{c}\}}\,\mathbb{E}\left[\mathbb{I}_{\{\mathbf{Y}_{r+1}\in\Omega_{\epsilon}^{c}\}}\cdots\mathbb{E}\left[\mathbb{I}_{\{\mathbf{Y}_{T-1}\in\Omega_{\epsilon}^{c}\}}\,\mathbb{E}[\mathbb{I}_{\{\mathbf{Y}_{T}\in\Omega_{\epsilon}^{c}\}}\mid\mathcal{F}_{T-1}]\mid\mathcal{F}_{T-2}\right]\cdots\mid\mathcal{F}_{r}\right]\right].
\end{aligned}$$

At step $t\geq r+1$, since $\mathbf{X}_{\Omega}$ is independent of $\mathcal{F}_{t-1}$, it holds that

$$\begin{aligned}
\mathbb{E}[\mathbb{I}_{\{\mathbf{Y}_{t}\in\Omega_{\epsilon}^{c}\}}\mid\mathcal{F}_{t-1}]
&=\mathbb{E}[\mathbb{I}_{\{\lambda\mathbf{X}_{t}+(1-\lambda)\mathbf{X}_{\Omega}\in\Omega_{\epsilon}^{c}\}}\mid\mathcal{F}_{t-1}]\\
&=\lambda\left(1-\mathbb{E}[\mathbb{I}_{\{\mathbf{X}_{t}\in\Omega_{\epsilon}\}}\mid\mathcal{F}_{t-1}]\right)+(1-\lambda)(1-|\Omega_{\epsilon}|).
\end{aligned}$$

The expected measure of the active region of $\tilde{h}_{t}$ is upper bounded by

$$\mathbb{E}\left[\mathbb{P}(\{x\in\Omega\colon\tilde{h}_{t}(x)=1\})\mid\mathcal{F}_{t-1}\right]\leq\gamma^{(t-r)\rho}\,\mathbb{P}[\Omega]=\gamma^{(t-r)\rho}.$$

Under the assumption that $\Omega_{\epsilon}$ is $\eta$-shattered by $\tilde{h}_{t}$, we have

$$\mathbb{E}\left[\mathbb{I}_{\{\mathbf{X}_{t}\in\Omega_{\epsilon}\}}\mid\mathcal{F}_{t-1}\right]=\frac{\mathbb{P}\left(\{x\in\Omega_{\epsilon}\colon\tilde{h}_{t}(x)=1\}\right)}{\mathbb{E}\left[\mathbb{P}(\{x\in\Omega\colon\tilde{h}_{t}(x)=1\})\mid\mathcal{F}_{t-1}\right]}\geq\gamma^{-(t-r)\rho}\eta|\Omega_{\epsilon}|.$$

Therefore,

$$\mathbb{E}[\mathbb{I}_{\{\mathbf{Y}_{t}\in\Omega_{\epsilon}^{c}\}}\mid\mathcal{F}_{t-1}]\leq 1-\left(\lambda\gamma^{-(t-r)\rho}\eta+(1-\lambda)\right)|\Omega_{\epsilon}|.$$

Moreover,

$$\begin{aligned}
\Pr\left(f(\tilde{x})-f^{*}>\epsilon\right)
&=\mathbb{E}\left[\mathbb{I}_{\{\mathbf{Y}_{1},\ldots,\mathbf{Y}_{r}\in\Omega_{\epsilon}^{c}\}}\,\mathbb{E}\left[\mathbb{I}_{\{\mathbf{Y}_{r+1}\in\Omega_{\epsilon}^{c}\}}\cdots\mathbb{E}\left[\mathbb{I}_{\{\mathbf{Y}_{T-1}\in\Omega_{\epsilon}^{c}\}}\,\mathbb{E}[\mathbb{I}_{\{\mathbf{Y}_{T}\in\Omega_{\epsilon}^{c}\}}\mid\mathcal{F}_{T-1}]\mid\mathcal{F}_{T-2}\right]\cdots\mid\mathcal{F}_{r}\right]\right]\\
&\leq\prod_{t=r+1}^{T}\left(1-\left(\lambda\gamma^{-(t-r)\rho}\eta+(1-\lambda)\right)|\Omega_{\epsilon}|\right)(1-|\Omega_{\epsilon}|)^{r}\\
&\leq\exp\left\{-\left(\sum_{t=r+1}^{T}\lambda\gamma^{-(t-r)\rho}\eta+(T-r)(1-\lambda)+r\right)|\Omega_{\epsilon}|\right\}\\
&=\exp\left\{-\left((T-r)\left(\frac{\gamma^{-\rho}+\gamma^{-(T-r)\rho}}{2}\lambda\eta+(1-\lambda)\right)+r\right)|\Omega_{\epsilon}|\right\}.
\end{aligned}$$

In order that $\Pr\left(f(\tilde{x})-f^{*}>\epsilon\right)\leq\delta$, it suffices that

$$\exp\left\{-\left((T-r)\left(\frac{\gamma^{-\rho}+\gamma^{-(T-r)\rho}}{2}\lambda\eta+(1-\lambda)\right)+r\right)|\Omega_{\epsilon}|\right\}\leq\delta,$$

hence the $(\epsilon,\delta)$-query complexity is upper bounded by

$$\mathcal{O}\left(\max\left\{\left(\frac{\gamma^{-\rho}+\gamma^{-(T-r)\rho}}{2}\lambda\eta+(1-\lambda)\right)^{-1}\left(\frac{1}{|\Omega_{\epsilon}|}\ln\frac{1}{\delta}-r\right)+r,\;T\right\}\right).$$
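To see the acceleration mechanism numerically, one can compare the effective per-step rates that the bounds of Theorems 2 and 3 invert: a larger rate yields a smaller first term in the bound. The sketch below uses illustrative parameter values only (they do not come from the paper's experiments):

```python
def rate_sracos(lam, eta, p):
    # Effective per-step rate in Theorem 2: lam*eta/p + (1 - lam).
    return lam * eta / p + (1.0 - lam)

def rate_race_cars(lam, eta, gamma, rho, T, r):
    # Effective per-step rate in Theorem 3:
    # ((gamma^-rho + gamma^(-(T-r)rho)) / 2) * lam * eta + (1 - lam).
    boost = (gamma ** (-rho) + gamma ** (-(T - r) * rho)) / 2.0
    return boost * lam * eta + (1.0 - lam)

# With 0 < gamma < 1 every factor gamma^(-k*rho) exceeds 1, so the region
# shrinking boosts the rate relative to SRACOS (here with p = 1), which in
# turn shrinks the query-complexity bound.
r2 = rate_sracos(lam=0.5, eta=0.1, p=1.0)
r3 = rate_race_cars(lam=0.5, eta=0.1, gamma=0.9, rho=0.01, T=1000, r=10)
assert r3 > r2
```

The comparison isolates the factor $\tfrac{\gamma^{-\rho}+\gamma^{-(T-r)\rho}}{2}>1$, which is exactly the multiplicative gain of RACE-CARS over SRACOS in the revisited bound.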

Sufficient Condition for Acceleration

Under the assumption that $f$ is dimensionally locally Hölder continuous, it is clear that

$$\Omega_{\epsilon}\subseteq\prod_{i=1}^{n}\left[x^{i}_{*}-\left(\frac{\epsilon}{L^{i}_{1}}\right)^{-\beta^{i}_{1}},\;x^{i}_{*}+\left(\frac{\epsilon}{L^{i}_{1}}\right)^{-\beta^{i}_{1}}\right].$$

Denote by $\tilde{x}_{t}=(\tilde{x}^{1}_{t},\ldots,\tilde{x}^{n}_{t}):=\arg\min_{j=1,\ldots,t}f(x_{j})$ the best solution found up to step $t$. The following sufficient condition gives a lower bound on the region-shrinking rate $\gamma$ and shrinking frequency $\rho$ such that RACE-CARS achieves acceleration over SRACOS.

Proposition 1.

Let $f$ be a dimensionally locally Hölder continuous objective, and assume that for $\epsilon>0$, $\Omega_{\epsilon}$ is $\eta$-shattered by $h_{t}$ for all $t=r+1,\ldots,T$. In order that $\Omega_{\epsilon}$ is also $\eta$-shattered by $\tilde{h}_{t}$, it is sufficient that the region-shrinking rate $\gamma$ and shrinking frequency $\rho$ satisfy, componentwise,

$$\frac{1}{2}\gamma^{t\rho}\|\Omega\|\geq\left(\tilde{x}^{1}_{t}-x^{1}_{*}+\left(\frac{\epsilon}{L^{1}_{1}}\right)^{-\beta^{1}_{1}},\;\ldots,\;\tilde{x}^{n}_{t}-x^{n}_{*}+\left(\frac{\epsilon}{L^{n}_{1}}\right)^{-\beta^{n}_{1}}\right).$$
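Reading the inequality componentwise, the sufficient condition can be checked numerically during a run. The sketch below uses entirely hypothetical values for the quantities that are unknown in practice ($x_{*}$, $L^{i}_{1}$, $\beta^{i}_{1}$); it only illustrates the shape of the test, not a practical stopping rule:

```python
def shrinking_condition_holds(t, gamma, rho, omega_widths,
                              x_tilde, x_star, eps, L, beta):
    """Componentwise check of the Proposition 1 condition (hypothetical values):
    (1/2) * gamma**(t*rho) * ||Omega||_i
        >= (x_tilde_i - x_star_i) + (eps / L_i)**(-beta_i)  for every i."""
    lhs = [0.5 * gamma ** (t * rho) * w for w in omega_widths]
    rhs = [(xt - xs) + (eps / Li) ** (-bi)
           for xt, xs, Li, bi in zip(x_tilde, x_star, L, beta)]
    return all(left >= right for left, right in zip(lhs, rhs))

# Hypothetical 2-dimensional instance: early in the run the shrunk region is
# still wide enough, but after many shrinking steps the condition fails.
kwargs = dict(gamma=0.9, rho=0.1, omega_widths=[10.0, 10.0],
              x_tilde=[0.2, 0.1], x_star=[0.0, 0.0],
              eps=0.09, L=[0.1, 0.1], beta=[0.5, 0.5])
assert shrinking_condition_holds(t=10, **kwargs)
assert not shrinking_condition_holds(t=400, **kwargs)
```

The failure at large $t$ reflects the trade-off the ablation study probes: too aggressive a shrinking schedule eventually excludes $\Omega_{\epsilon}$ from the sampling region.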