
Scalable Acceleration for Classification-Based Derivative-Free Optimization

Tianyi Han*, Jingya Li, Zhipeng Guo, Yuan Jin*
Abstract

Derivative-free optimization algorithms play an important role in scientific and engineering design optimization problems, especially when derivative information is not accessible. In this paper, we study the framework of sequential classification-based derivative-free optimization algorithms. By introducing the learning-theoretic concept of hypothesis-target shattering rate, we revisit the computational complexity upper bound of SRACOS (Hu, Qian, and Yu 2017). Inspired by the revisited upper bound, we propose an algorithm named RACE-CARS, which adds a random region-shrinking step to SRACOS. We further establish theorems showing the acceleration afforded by region shrinking. Experiments on synthetic functions as well as black-box tuning for language-model-as-a-service demonstrate empirically the efficiency of RACE-CARS. An ablation experiment on the introduced hyper-parameters is also conducted, revealing the mechanism of RACE-CARS and offering empirical guidance for hyper-parameter tuning.

Introduction

In recent years, there has been a growing interest in the field of derivative-free optimization (DFO) algorithms, also known as zeroth-order optimization. These algorithms aim to optimize objective functions without relying on explicit gradient information, making them suitable for scenarios where obtaining derivatives is either infeasible or computationally expensive (Conn, Scheinberg, and Vicente 2009; Larson, Menickelly, and Wild 2019). For example, DFO techniques can be applied to hyperparameter tuning, which involves optimizing complex objective functions with unavailable first-order information (Falkner, Klein, and Hutter 2018; Akiba et al. 2019; Yang and Shami 2020). Moreover, DFO algorithms find applications in engineering design optimization, where the objective functions are computationally expensive to evaluate and lack explicit derivatives (Ray and Saini 2001; Liao 2010; Akay and Karaboga 2012).

Classical DFO methods such as the Nelder-Mead method (Nelder and Mead 1965) or the directional direct-search (DDS) method (Céa 1971; Yu 1979) have shown great success on convex problems. However, their performance degrades when the objective is nonconvex, which is the commonly faced situation for black-box problems. In this paper, we focus on optimization problems of the form

\min_{x\in\Omega}~f(x), \qquad (1)

where the $n$-dimensional compact cube $\Omega\subseteq\mathbb{R}^{n}$ is the solution space. In addition, we will not stipulate any convexity, smoothness or separability assumption on $f$.

In recent decades, progress has been made in the extensive exploration of DFO algorithms for nonconvex problems, and various kinds of algorithms have been established: for example, evolutionary algorithms (Bäck and Schwefel 1993; Fortin et al. 2012; Hansen 2016; Opara and Arabas 2019), Bayesian optimization (BO) methods (Snoek, Larochelle, and Adams 2012; Shahriari et al. 2015; Frazier 2018) and gradient-approximation surrogate modeling algorithms (Nesterov and Spokoiny 2017; Chen et al. 2019; Ge et al. 2022; Ragonneau and Zhang 2023).

Given the nonconvex and black-box nature of the problem, the quest for enhanced sample efficiency inevitably presents the classic trade-off between exploration and exploitation. Similar to typical nonlinear/nonconvex numerical issues, the aforementioned algorithms are also susceptible to the curse of dimensionality (Bickel and Levina 2004; Hall, Marron, and Neeman 2005; Fan and Fan 2008; Shi et al. 2021; Scheinberg 2022; Yue et al. 2023). Improving scalability and efficiency can be distilled into a few key areas, such as function decomposition (Wang et al. 2018b), dimension reduction (Wang et al. 2016; Nayebi, Munteanu, and Poloczek 2019) and sample strategy refinement. Regarding the latter, previous algorithms have focused on preventing over-exploitation through, for instance, trying more efficient sampling dynamics (Ros and Hansen 2008; Hensman, Fusi, and Lawrence 2013; Yi et al. 2024), search space discretization (Sobol’ 1967) or search region restriction (Eriksson et al. 2019; Wang, Fonseca, and Tian 2020).

Regardless of these improvements, many model-based DFO algorithms share the mechanism of fitting the objective function, locally or globally. The motivation of classification-based DFO algorithms is different: they learn a hypothesis (classifier) to fit the sub-level set $\Omega_{\alpha_{t}}:=\{x\in\Omega\colon f(x)-f^{*}\leq\alpha_{t}\}$ at each iteration $t=1,\ldots,T$, with which new candidates are generated for the successive round (Michalski 2000; Gotovos 2013; Bogunovic et al. 2016; Yu, Qian, and Hu 2016; Hashimoto, Yadlowsky, and Duchi 2018). Sub-level sets are considered to contain less information than objective functions, since they are already obtained whenever the objective function is known. By this means, sample usage is more efficient. Moreover, this approach is less sensitive to function oscillation or noise (Liu et al. 2017) and easy to parallelize (Liu et al. 2017, 2019b; Hashimoto, Yadlowsky, and Duchi 2018).

Training strategies for hypotheses have branched out in various directions. Hashimoto, Yadlowsky, and Duchi pursue accurate hypotheses using a conservative strategy raised by El-Yaniv and Wiener. They demonstrate that for a hypothesis class $\mathcal{H}$ with VC-dimension $VC(\mathcal{H})$, an $\epsilon$-accurate hypothesis requires $\mathcal{O}(\epsilon^{-1}(VC(\mathcal{H})\log(\epsilon^{-1})))$ samples per batch. While a computationally efficient approximation has indeed been developed, its success was demonstrated only in the context of low-dimensional problems; the extent of its effectiveness in high-dimensional scenarios is yet to be determined. However, a contrasting approach successfully led to improvements in sample efficiency. Yu, Qian, and Hu propose to "RAndomly COordinate-wise Shrink" the solution region (RACOS), a more radical approach that diverges from the goal of producing accurate hypotheses. They prove RACOS converges to the global minimum in polynomial time when the objective is locally Hölder continuous, and experiments indicate success on high-dimensional problems ranging from hundreds to thousands of dimensions. An added benefit is the ease of sampling through the generated hypotheses, as their active regions are coordinate-orthogonal cubes. Building on this, the sequential counterpart of RACOS, known as SRACOS, takes an even more radical stance by sampling only once per iteration and learning from biased data. Despite this, SRACOS has been shown to outperform RACOS under certain mild restrictions (Hu, Qian, and Yu 2017).

Outline and Contributions

  • (1)

    Our research extends the groundwork laid by (Yu, Qian, and Hu 2016; Hu, Qian, and Yu 2017), yet we discover that the upper bound they proposed for SRACOS does not fully encapsulate its behavior, potentially allowing for an overflow. On the other hand, we construct a counter-example (eq. 3) illustrating that the upper bound is not tight enough to delineate the convergence accurately.

  • (2)

    We also identify a contentious assumption regarding the concept of error-target dependence that seems to underpin these limitations. In this paper, we introduce a novel learning-theoretic concept termed the hypothesis-target $\eta$-shattering rate, which serves as a foundation for our reassessment of the upper bound on the computational complexity of SRACOS.

  • (3)

    Inspired by the reanalysis, we recognize that the accuracy of the trained hypotheses is even less critical than previously thought. With this insight, we formulate a novel algorithm, RACE-CARS (an acronym for "RAndomized CoordinatE Classifying And Region Shrinking"), which perpetuates the essence of SRACOS while promising a theoretical enhancement in convergence. We design experiments on both synthetic functions and black-box tuning for LMaaS. These experiments juxtapose RACE-CARS against a selection of DFO algorithms, empirically highlighting its superior efficacy.

  • (4)

    In the discussion and appendix, we further substantiate the empirical prowess of RACE-CARS beyond the confines of continuity, supported by a theoretical framework. An ablation study is also conducted to demystify the selection of hyper-parameters for our proposed algorithm.

The rest of the paper is organized in five sections, sequentially presenting the background, theoretical study, experiments, discussion and conclusion.

Background

Assumption 1.

In eq. 1, we presume $f(x)$ is lower bounded within $\Omega$, with $f^{*}:=\min_{x\in\Omega} f(x)$.

In our theoretical analysis, we formalize the procedure as a stochastic process. Let $\mathcal{F}$ denote the Borel $\sigma$-algebra on $\Omega$, and $\mathbb{P}$ the probability measure defined on $\mathcal{F}$. For instance, if $\Omega$ is a continuous space, $\mathbb{P}$ is derived from the Lebesgue measure $m$, such that $\mathbb{P}(B):=m(B)/m(\Omega)$ for all $B\in\mathcal{F}$.

Assumption 2.

Define $\Omega_{\epsilon}:=\{x\in\Omega\colon f(x)-f^{*}\leq\epsilon\}$, with the assumption $|\Omega_{\epsilon}|:=\mathbb{P}(\Omega_{\epsilon})>0$ for all $\epsilon>0$.

A hypothesis (or classifier) $h$ is a function mapping the solution space $\Omega$ to $\{0,1\}$. Define

D_{h}(x):=\begin{cases}\mathbb{P}(\{x\in\Omega\colon h(x)=1\})^{-1},&h(x)=1\\ 0,&\text{otherwise},\end{cases} \qquad (2)

a probability distribution on $\Omega$. Let $\mathbf{X}_{h}$ be the random vector in $(\Omega,\mathcal{F},\mathbb{P})$ drawn from $D_{h}$, implying that $\Pr(\mathbf{X}_{h}\in B)=\int_{x\in B}D_{h}(x)\,d\mathbb{P}$ for all $B\in\mathcal{F}$. Denote $\mathbb{T}:=\{1,2,\ldots,T\}$, and let $\mathbb{F}:=(\mathcal{F}_{t})_{t\in\mathbb{T}}$ be a filtration of $\sigma$-algebras on $\Omega$ indexed by $\mathbb{T}$, satisfying $\mathcal{F}_{1}\subseteq\mathcal{F}_{2}\subseteq\cdots\subseteq\mathcal{F}_{T}\subseteq\mathcal{F}$. A typical classification-based optimization algorithm learns an $\mathbb{F}$-adapted stochastic process $\mathbf{X}:=(\mathbf{X}_{t})_{t\in\mathbb{T}}$, where each $\mathbf{X}_{t}$ is induced by $\mathbf{X}_{h_{t}}$ and $h_{t}$ is the hypothesis updated at step $t$. Subsequently, it samples new data with a stochastic process $\mathbf{Y}:=(\mathbf{Y}_{t})_{t\in\mathbb{T}}$ generated by $\mathbf{X}$. Typically, the new candidates at step $t\geq r+1$ are sampled from

\mathbf{Y}_{t}:=\begin{cases}\mathbf{X}_{t},&\text{with probability }\lambda\\ \mathbf{X}_{\Omega},&\text{with probability }1-\lambda,\end{cases}

where $\mathbf{X}_{\Omega}$ is a random vector drawn from the uniform distribution $\mathcal{U}_{\Omega}$ and $\lambda\in[0,1]$ is the exploitation rate.
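As a concrete illustration, the $\lambda$-mixture sampling of $\mathbf{Y}_{t}$ can be sketched in Python; the helper names below are our own, not from the paper:

```python
import random

def sample_candidate(sample_from_hypothesis, bounds, lam):
    """Draw Y_t: exploit the current hypothesis with probability lam,
    otherwise explore by sampling uniformly from the box Omega."""
    if random.random() < lam:
        return sample_from_hypothesis()
    return tuple(random.uniform(lo, hi) for lo, hi in bounds)
```

With `lam=1` the sampler always exploits the hypothesis; with `lam=0` it degenerates to uniform random search over $\Omega$.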

A simplified batch-mode classification-based optimization algorithm is outlined in Algorithm 1. At each step $t$, it selects a positive set $\mathcal{S}_{positive}$ from $\mathcal{S}$ containing the best $m$ samples, with the remainder in the negative set $\mathcal{S}_{negative}$. Then it trains a hypothesis $h_{t}$ distinguishing the positive set from the negative set, ensuring $h_{t}(x_{j})=0$ for all $(x_{j},y_{j})\in\mathcal{S}_{negative}$. Finally, it samples $r$ new solutions with the sampling random vector $\mathbf{Y}_{t}$. The sub-procedure $h_{t}\leftarrow\mathcal{T}(\mathcal{S}_{positive},\mathcal{S}_{negative})$ denotes hypothesis training.

Algorithm 1 Batch-Mode Classification-Based Optimization Algorithm
  Input: $T$: Budget; $r$: Training size; $m$: Positive size.
  Output: $(x_{best},y_{best})$.
  Collect: $\mathcal{S}=\{(x^{0}_{1},y^{0}_{1}),\ldots,(x^{0}_{r},y^{0}_{r})\}$ i.i.d. from $\mathcal{U}_{\Omega}$;
  $(x_{best},y_{best})=\arg\min\{y\colon(x,y)\in\mathcal{S}\}$;
  for $t=1,\ldots,T/r$ do
     Classify: $(\mathcal{S}_{positive},\mathcal{S}_{negative})\leftarrow\mathcal{S}$;
     Train: $h_{t}\leftarrow\mathcal{T}(\mathcal{S}_{positive},\mathcal{S}_{negative})$;
     Sample: $\{(x^{t}_{1},y^{t}_{1}),\ldots,(x^{t}_{r},y^{t}_{r})\}$ i.i.d. with $\mathbf{Y}_{t}$;
     Select: $\mathcal{S}\leftarrow\mathcal{S}\cup\{(x^{t}_{1},y^{t}_{1}),\ldots,(x^{t}_{r},y^{t}_{r})\}$;
     $(x_{best},y_{best})=\arg\min\{y\colon(x,y)\in\mathcal{S}\}$.
  end for
  Return: $(x_{best},y_{best})$
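A minimal Python sketch of this batch-mode loop, with `train` and `sample` left as pluggable sub-procedures since the framework abstracts over them, might read:

```python
import random

def batch_mode_dfo(f, bounds, budget, r, m, train, sample):
    """Sketch of Algorithm 1: maintain a sample set S, split it into the m
    best (positive) and the rest (negative), train a hypothesis, and draw
    r new candidates per round until the evaluation budget is spent."""
    xs = [tuple(random.uniform(lo, hi) for lo, hi in bounds) for _ in range(r)]
    S = [(x, f(x)) for x in xs]
    for _ in range(budget // r):
        S.sort(key=lambda pair: pair[1])
        positive, negative = S[:m], S[m:]
        h = train(positive, negative)
        new_xs = [sample(h, bounds) for _ in range(r)]
        S += [(x, f(x)) for x in new_xs]
    return min(S, key=lambda pair: pair[1])
```

With `train` returning a trivial hypothesis and `sample` drawing uniformly, this degenerates to pure random search; RACOS (Algorithm 2) supplies a nontrivial `train`.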

RACOS is the abbreviation of "RAndomized COordinate Shrinking". Literally, it trains the hypothesis by this very means (Yu, Qian, and Hu 2016), i.e., shrinking coordinates randomly until all negative samples are excluded from the active region of the resulting hypothesis. Algorithm 2 shows a continuous version of RACOS.

Algorithm 2 RACOS
  Input: $\Omega$: Boundary; $(\mathcal{S}_{positive},\mathcal{S}_{negative})$: Binary sets; $\mathbb{I}=\{1,\ldots,n\}$: Index of dimensions.
  Output: $h$: Hypothesis.
  Randomly select: $x_{+}=(x^{1}_{+},\ldots,x^{n}_{+})\leftarrow\mathcal{S}_{positive}$;
  Initialize: $h(x)\equiv 1$;
  while $\exists x\in\mathcal{S}_{negative}$ s.t. $h(x)=1$ do
     Randomly select: $k\leftarrow\mathbb{I}$;
     Randomly select: $x_{-}=(x^{1}_{-},\ldots,x^{n}_{-})\leftarrow\mathcal{S}_{negative}$;
     if $x^{k}_{+}\leq x^{k}_{-}$ then
        $s\leftarrow random(x^{k}_{+},x^{k}_{-})$;
        Shrink: $h(x)=0,~\forall x\in\{x=(x^{1},\ldots,x^{n})\in\Omega\colon x^{k}>s\}$;
     else
        $s\leftarrow random(x^{k}_{-},x^{k}_{+})$;
        Shrink: $h(x)=0,~\forall x\in\{x=(x^{1},\ldots,x^{n})\in\Omega\colon x^{k}<s\}$;
     end if
  end while
  Return: $h$
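A Python sketch of this continuous coordinate-shrinking loop, representing the hypothesis's active region as an axis-aligned box. We deviate slightly from the pseudocode by always picking a negative sample that is still inside the box, which guarantees progress at every iteration:

```python
import random

def racos_train(x_pos, negatives, bounds):
    """Sketch of continuous RACOS: repeatedly pick a random coordinate and a
    remaining negative sample, and shrink the box between them, until every
    negative sample falls outside the active box containing x_pos."""
    lower = [lo for lo, hi in bounds]
    upper = [hi for lo, hi in bounds]
    inside = lambda x: all(lower[i] <= x[i] <= upper[i] for i in range(len(x)))
    while any(inside(x) for x in negatives):
        k = random.randrange(len(bounds))
        x_neg = random.choice([x for x in negatives if inside(x)])
        if x_pos[k] <= x_neg[k]:
            s = random.uniform(x_pos[k], x_neg[k])
            upper[k] = min(upper[k], s)   # exclude everything with x^k > s
        else:
            s = random.uniform(x_neg[k], x_pos[k])
            lower[k] = max(lower[k], s)   # exclude everything with x^k < s
    return lower, upper
```

The returned `(lower, upper)` box is the active region of the hypothesis: $h(x)=1$ iff $x$ lies inside it, and the randomly selected positive sample is kept inside by construction.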

Aside from sampling only once per iteration, another difference between sequential- and batch-mode classification-based DFO is that the sequential version replaces elements of the training set with new ones under certain rules to finish step $t$ (see Algorithm 3). In the rest of this paper, we omit the details of the $Replacing$ sub-procedure, which can be found in (Hu, Qian, and Yu 2017).

Algorithm 3 Sequential-Mode Classification-Based Optimization Algorithm
  Input: $T$: Budget; $r$: Training size; $m$: Positive size; $Replacing$: Replacing sub-procedure.
  Output: $(x_{best},y_{best})$.
  Collect $\mathcal{S}=\{(x_{1},y_{1}),\ldots,(x_{r},y_{r})\}$ i.i.d. from $\mathcal{U}_{\Omega}$;
  $(x_{best},y_{best})=\arg\min\{y\colon(x,y)\in\mathcal{S}\}$;
  for $t=r+1,\ldots,T$ do
     Classify: $(\mathcal{S}_{positive},\mathcal{S}_{negative})\leftarrow\mathcal{S}$;
     Train: $h_{t}\leftarrow\mathcal{T}(\mathcal{S}_{positive},\mathcal{S}_{negative})$;
     Sample: $(x_{t},y_{t})\sim\mathbf{Y}_{t}$;
     Replace: $\mathcal{S}\leftarrow Replacing((x_{t},y_{t}),\mathcal{S})$;
     $(x_{best},y_{best})=\arg\min\{y\colon(x,y)\in\mathcal{S}\}$;
  end for
  Return: $(x_{best},y_{best})$

Classification-based DFO algorithms admit a bound on the query complexity (Yu and Qian 2014), quantifying the total number of function evaluations required to identify a solution that achieves an approximation level of $\epsilon$ with probability at least $1-\delta$.

Definition 1 ($(\epsilon,\delta)$-Query Complexity).

Given $f$, $0<\delta<1$ and $\epsilon>0$, the $(\epsilon,\delta)$-query complexity of an algorithm $\mathcal{A}$ is the number of calls to $f$ such that, with probability at least $1-\delta$, $\mathcal{A}$ finds at least one solution $\tilde{x}\in\Omega$ satisfying

f(\tilde{x})-f^{*}\leq\epsilon.

Definitions 2 and 3 are given by (Yu, Qian, and Hu 2016). The first characterizes the so-called dependence between classification error and target region, which is expected to be small to ensure efficiency. The second characterizes the proportion of the hypothesis's active region, which is also expected to be as small as possible.

Definition 2 (Error-Target $\theta$-Dependence).

The error-target dependence $\theta\geq 0$ of a classification-based optimization algorithm is its infimum such that, for any $\epsilon>0$ and any $t=1,\ldots,T$,

\bigl||\Omega_{\epsilon}|\cdot\mathbb{P}(\mathcal{R}_{t})-\mathbb{P}(\Omega_{\epsilon}\cap\mathcal{R}_{t})\bigr|\leq\theta|\Omega_{\epsilon}|,

where $\mathcal{R}_{t}:=\Omega_{\alpha_{t}}\,\Delta\,\{x\in\Omega\colon h_{t}(x)=1\}$ denotes the relative error, $\Omega_{\alpha_{t}}$ is the sub-level set at step $t$ with $\alpha_{t}:=\min_{1\leq i\leq t}f(x_{i})-f^{*}$, and the operator $\Delta$ is the symmetric difference of two sets, defined as $A_{1}\Delta A_{2}=(A_{1}\cup A_{2})-(A_{1}\cap A_{2})$.

Definition 3 ($\gamma$-Shrinking Rate).

The shrinking rate $\gamma>0$ of a classification-based optimization algorithm is its infimum such that $\mathbb{P}(x\in\Omega\colon h_{t}(x)=1)\leq\gamma|\Omega_{\alpha_{t}}|$ for all $t=1,\ldots,T$.

Theoretical Study

Previous studies gave a general bound on the query complexity of Algorithm 3 under mild error-target $\theta$-dependence and $\gamma$-shrinking rate assumptions:

Theorem 1.

(Hu, Qian, and Yu 2017) Given $0<\delta<1$ and $\epsilon>0$, if a sequential classification-based optimization algorithm has error-target $\theta$-dependence and $\gamma$-shrinking rate, then its $(\epsilon,\delta)$-query complexity is upper bounded by

\mathcal{O}\biggl(\max\Bigl\{\frac{1}{|\Omega_{\epsilon}|}\Bigl(\lambda+\frac{1-\lambda}{\gamma(T-r)}\sum^{T}_{t=r+1}\Phi_{t}\Bigr)^{-1}\ln\frac{1}{\delta},\;T\Bigr\}\biggr),

where $\Phi_{t}=\Bigl(1-\theta-\mathbb{P}(\mathcal{R}_{D_{t}})-m(\Omega)\sqrt{\tfrac{1}{2}D_{\mathrm{KL}}(D_{t}\|\mathcal{U}_{\Omega})}\Bigr)\cdot|\Omega_{\alpha_{t}}|^{-1}$, with the notations $D_{t}:=\lambda D_{h_{t}}+(1-\lambda)\mathcal{U}_{\Omega}$ and $\mathbb{P}(\mathcal{R}_{D_{t}}):=\int_{\mathcal{R}_{t}}D_{t}\,d\mathbb{P}$.

Issues Introduced by Error-Target Dependence

  • (1)

    Overflow of the upper bound

    Per the assumptions entailed in (Yu, Qian, and Hu 2016; Hu, Qian, and Yu 2017), it can be observed that lower values of $\theta$ or $\gamma$ correlate with improved query complexity. However, this is not a hard-and-fast rule: even with small values of these parameters, we can encounter scenarios where the expected performance does not materialize. Following the lemma given by (Yu, Qian, and Hu 2016), $\mathbb{P}(\mathcal{R}_{t})\leq\mathbb{P}(\mathcal{R}_{D_{t}})+m(\Omega)\sqrt{\frac{1}{2}D_{\mathrm{KL}}(D_{t}\|\mathcal{U}_{\Omega})}$, we have the following inequality:

    \Phi_{t}=\Bigl(1-\theta-\mathbb{P}(\mathcal{R}_{D_{t}})-m(\Omega)\sqrt{\tfrac{1}{2}D_{\mathrm{KL}}(D_{t}\|\mathcal{U}_{\Omega})}\Bigr)\cdot|\Omega_{\alpha_{t}}|^{-1}\leq(1-\theta-\mathbb{P}(\mathcal{R}_{t}))\cdot|\Omega_{\alpha_{t}}|^{-1}.

    The concept of error-target $\theta$-dependence reveals that a small $\theta$ does not guarantee a small relative error $\mathbb{P}(\mathcal{R}_{t})$. On the contrary, a small $\theta$ coupled with a large $\mathbb{P}(\Omega_{\epsilon}\cap\mathcal{R}_{t})/|\Omega_{\epsilon}|$ can result in a significant $\mathbb{P}(\mathcal{R}_{t})$, which can even equal $1$ when $\Omega_{\alpha_{t}}$ lies entirely outside the active region of $h_{t}$, i.e., when the hypothesis $h_{t}$ is completely wrong. Problematically, $\Phi_{t}\leq(1-\theta-\mathbb{P}(\mathcal{R}_{t}))\cdot|\Omega_{\alpha_{t}}|^{-1}$ appears as a divisor in the proof of Theorem 1 and becomes nonpositive in this case, disrupting the established inequality. Moreover, since $\theta$ is inherently nonnegative, $\mathbb{P}(\mathcal{R}_{t})$ need not equal $1$ for the inequality to break down: a series of inaccurate hypotheses suffices to undermine the validity of the upper bound. This finding challenges the principle of SRACOS's radical training strategy.

  • (2)

    Tightness of the upper bound

    Consider an extreme but plausible situation where the hypotheses generated at each step are defined as follows:

    h_{t}(x)=\begin{cases}1,&x\in\Omega_{\epsilon}\\ 0,&x\notin\Omega_{\epsilon}.\end{cases} \qquad (3)

    In the context of sequential-mode classification-based optimization algorithms, where the training sets are not only small but also potentially biased, it is reasonable to expect large relative errors $\mathbb{P}(\mathcal{R}_{t})$. This scenario could lead to a series of hypotheses that are inaccurate with respect to $\Omega_{\alpha_{t}}$ but, by chance, accurate with respect to $\Omega_{\epsilon}$. Consequently, the error-target dependence $\theta=\max_{1\leq t\leq T}\mathbb{P}(\mathcal{R}_{t})$ can be unexpectedly large. Even with all $\Phi_{t}$ values positive, the query complexity bound given in Theorem 1 is therefore not optimistic. However, the probability of failing to identify an $\epsilon$-minimum,

    \Pr\bigl(\min_{1\leq t\leq T}f(x_{t})-f^{*}\geq\epsilon\bigr)=(1-|\Omega_{\epsilon}|)^{r}\bigl((1-\lambda)(1-|\Omega_{\epsilon}|)\bigr)^{T-r},

    is less than δ\delta for a reasonably sized TT. This indicates that the upper bound may not be as tight as initially thought.

Revisit of Query Complexity Upper Bound

It is evident that even minimal error-target dependence cannot encapsulate issues arising from substantial relative error, because error-target dependence alone is insufficient to fully account for relative error. Intuitively, one might consider introducing an additional assumption to cap the relative error. However, such an assumption would be impractical, given the inherently small and biased nature of the training datasets in the process. To this end, we give a new concept that stands apart from the constraints of relative error:

Definition 4 (Hypothesis-Target $\eta$-Shattering Rate).

Given $\eta\in[0,1]$, for a family of hypotheses $\mathcal{H}$ defined on $\Omega$, we say $\Omega_{\epsilon}$ is $\eta$-shattered by $h\in\mathcal{H}$ if

\mathbb{P}(\Omega_{\epsilon}\cap\{x\in\Omega\colon h(x)=1\})\geq\eta|\Omega_{\epsilon}|,

and $\eta$ is called the hypothesis-target shattering rate.

The hypothesis-target shattering rate mirrors the error-target dependence in its relation to the hypothesis's target-accuracy. Importantly, it bounds the error-target dependence in conjunction with the relative error:

\theta\leq\max\bigl\{\mathbb{P}(\mathcal{R}_{t}),\;|1-\mathbb{P}(\mathcal{R}_{t})-\eta|\bigr\}.

This rate $\eta$ measures the overlap between the target set $\Omega_{\epsilon}$ and the active region of a hypothesis. Crucially, it also mitigates the influence of relative error on error-target dependence. Utilizing the hypothesis-target shattering rate, we reexamine the upper bound of the $(\epsilon,\delta)$-query complexity in the subsequent theorem.

Theorem 2.

For the sequential-mode classification-based DFO Algorithm 3, let $\mathbf{X}_{t}=\mathbf{X}_{h_{t}}$, $\epsilon>0$ and $0<\delta<1$. When $\Omega_{\epsilon}$ is $\eta$-shattered by $h_{t}$ for all $t=r+1,\ldots,T$ and $\max_{t=r+1,\ldots,T}\mathbb{P}(\{x\in\Omega\colon h_{t}(x)=1\})\leq p\leq 1$, the $(\epsilon,\delta)$-query complexity is upper bounded by

\mathcal{O}\biggl(\max\Bigl\{\bigl(\lambda\tfrac{\eta}{p}+(1-\lambda)\bigr)^{-1}\bigl(\tfrac{1}{|\Omega_{\epsilon}|}\ln\tfrac{1}{\delta}-r\bigr)+r,\;T\Bigr\}\biggr).

The Region-Shrinking Acceleration

In the analysis of $(\epsilon,\delta)$-query complexity for classification-based optimization, the focus shifts away from minimizing the relative error $\mathbb{P}(\mathcal{R}_{t})$, as our goal is to identify optima, not to develop a sequence of accurate hypotheses. The counter-example in equation (3), despite a potentially high relative error, represents an optimal hypothesis scenario. This realization allows for the consideration of more radical hypotheses, directing our attention to the overlap between $\Omega_{\epsilon}$ and the active region of the hypotheses, which is quantified by the hypothesis-target shattering rate.

The $\gamma$-shrinking rate, as defined in Definition 3, measures the decay of $\mathbb{P}(x\in\Omega\colon h_{t}(x)=1)$. However, the rapid decrease of $|\Omega_{\alpha_{t}}|$ as $\alpha_{t}$ approaches zero makes it impractical to sustain a series of hypotheses with a small $\gamma$ through our training process. Thus, the $\gamma$-shrinking assumption is often not feasible for minimal $\gamma$.

Moving beyond the pursuit of minimal relative error and $\gamma$-shrinking relative to $|\Omega_{\alpha_{t}}|$, we introduce Algorithm 4, which adaptively shrinks the active region of the sampling random vector $\mathbf{Y}_{t}$ through a $Projection$ sub-procedure.

Algorithm 4 Accelerated Sequential-Mode Classification-Based Optimization Algorithm
  Input: $\Omega$: Boundary; $T\in\mathbb{N}^{+}$: Budget; $r=m+k$; $Replacing$: Replacing sub-procedure; $\gamma$: Region shrinking rate; $\rho$: Region shrinking frequency.
  Output: $(x_{best},y_{best})$.
  Collect $\mathcal{S}=\{(x_{1},y_{1}),\ldots,(x_{r},y_{r})\}$ i.i.d. from $\mathcal{U}_{\Omega}$;
  $(x_{best},y_{best})=\arg\min\{y\colon(x,y)\in\mathcal{S}\}$;
  Initialize $k=1$, $\tilde{\Omega}=\Omega$;
  for $t=r+1,\ldots,T$ do
     Train: $h_{t}\leftarrow\mathcal{T}(\mathcal{S}_{positive},\mathcal{S}_{negative})$;
     $s\leftarrow random(0,1)$;
     if $s\leq\rho$ then
        Shrink region: $\tilde{\Omega}=\Omega\cap[x_{best}-\frac{1}{2}\gamma^{k}\|\Omega\|,\,x_{best}+\frac{1}{2}\gamma^{k}\|\Omega\|]$;
        $k=k+1$;
     end if
     Project: $\mathbf{Y}_{t}\leftarrow Proj(h_{t},\tilde{\Omega})$;
     Sample: $(x_{t},y_{t})\sim\mathbf{Y}_{t}$;
     Replace: $\mathcal{S}\leftarrow Replacing((x_{t},y_{t}),\mathcal{S})$;
     $(x_{best},y_{best})=\arg\min\{y\colon(x,y)\in\mathcal{S}\}$;
  end for
  Return: $(x_{best},y_{best})$
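The Shrink step of Algorithm 4 is easy to state on its own in Python (a sketch; here `bounds` is the original box $\Omega$ and `k` counts how many shrinks have occurred so far):

```python
def shrink_region(bounds, x_best, gamma, k):
    """Intersect Omega with a box centered at the incumbent x_best whose side
    in each dimension is gamma**k times that dimension's diameter."""
    shrunk = []
    for (lo, hi), xb in zip(bounds, x_best):
        half = 0.5 * (gamma ** k) * (hi - lo)
        shrunk.append((max(lo, xb - half), min(hi, xb + half)))
    return shrunk
```

Each call halves nothing by itself; it is the geometric factor $\gamma^{k}$, increasing $k$ on a random $\rho$-fraction of iterations, that concentrates sampling around the incumbent.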

The operator $\|\cdot\|$ returns a tuple representing the diameter of each dimension of the region. For instance, when $\Omega=[\omega^{1}_{1},\omega^{1}_{2}]\times[\omega^{2}_{1},\omega^{2}_{2}]$, we have $\|\Omega\|=(\omega^{1}_{2}-\omega^{1}_{1},\,\omega^{2}_{2}-\omega^{2}_{1})$. The projection operator $Proj(h_{t},\tilde{\Omega})$ generates a random vector $\mathbf{X}_{t}$ with probability distribution $D_{\tilde{h}_{t}}:=\tilde{h}_{t}/\mathbb{P}(\{x\in\Omega\colon\tilde{h}_{t}(x)=1\})$, with $\tilde{h}_{t}(x)=1$ whenever $h_{t}(x)=1$ for $x\in\tilde{\Omega}$. The sampling random vector $\mathbf{Y}_{t}$ is induced by $\mathbf{X}_{t}$. The subsequent theorem presents the upper bound of query complexity for Algorithm 4.

Theorem 3.

For Algorithm 4 with region shrinking rate $0<\gamma<1$ and region shrinking frequency $0<\rho<1$, let $\epsilon>0$ and $0<\delta<1$. When $\Omega_{\epsilon}$ is $\eta$-shattered by $\tilde{h}_{t}$ for all $t=r+1,\ldots,T$, the $(\epsilon,\delta)$-query complexity is upper bounded by

\mathcal{O}\biggl(\max\Bigl\{\Bigl(\frac{\gamma^{-\rho}+\gamma^{-(T-r)\rho}}{2}\lambda\eta+(1-\lambda)\Bigr)^{-1}\bigl(\tfrac{1}{|\Omega_{\epsilon}|}\ln\tfrac{1}{\delta}-r\bigr)+r,\;T\Bigr\}\biggr).

The condition $\gamma\in(0,1)$ ensures that the term $2p/(\gamma^{-\rho}+\gamma^{-(T-r)\rho})$ is significantly less than 1. According to Theorem 3, the $(\epsilon,\delta)$-query complexity of Algorithm 4 is lower than that of Algorithm 3, provided $\eta>0$.
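To make the comparison concrete, the denominators of the two bounds can be evaluated numerically. The parameter values used below ($\lambda$, $\eta$, $p$, $\gamma$, $\rho$, and the horizon $T-r$) are illustrative assumptions of our own, not constants from the theorems:

```python
def factor_theorem2(lam, eta, p):
    # Denominator of the Theorem 2 bound: lam * eta / p + (1 - lam).
    return lam * eta / p + (1 - lam)

def factor_theorem3(lam, eta, gamma, rho, horizon):
    # Denominator of the Theorem 3 bound:
    # (gamma**(-rho) + gamma**(-horizon * rho)) / 2 * lam * eta + (1 - lam).
    boost = 0.5 * (gamma ** (-rho) + gamma ** (-horizon * rho))
    return boost * lam * eta + (1 - lam)
```

A larger denominator means a smaller query-complexity bound. For example, with `lam=0.95`, `eta=0.1`, `p=1.0`, `gamma=0.9`, `rho=0.01` and a horizon of `5000`, the Theorem 3 factor exceeds the Theorem 2 factor by more than an order of magnitude, reflecting the claimed acceleration.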

Theorem 3 establishes an upper bound on the $(\epsilon,\delta)$-query complexity that is applicable to a wide range of scenarios, assuming only that the objective function $f$ is lower bounded. Building on this, we identify a sufficient condition for acceleration that applies to dimensionally locally Hölder continuous functions (Definition 5), detailed in the appendix. Within this context, the SRACOS algorithm, which utilizes RACOS for the $Training$ phase in Algorithm 3, exhibits polynomial convergence (Hu, Qian, and Yu 2017). We adopt the same RACOS approach for the $Training$ sub-procedure in Algorithm 4, introducing the "RAndomized CoordinatE Classifying And Region Shrinking" (RACE-CARS) algorithm.

Definition 5 (Dimensionally Local Hölder Continuity).

Assume that $x_{*}=(x^{1}_{*},\ldots,x^{n}_{*})$ is the unique global minimum such that $f(x_{*})=f^{*}$. We call $f$ dimensionally locally Hölder continuous if, for all $i=1,\ldots,n$,

L^{i}_{1}|x^{i}-x^{i}_{*}|^{\beta^{i}_{1}}\leq|f(x^{1}_{*},\ldots,x^{i},\ldots,x^{n}_{*})-f^{*}|\leq L^{i}_{2}|x^{i}-x^{i}_{*}|^{\beta^{i}_{2}},

for all $x=(x^{1},\ldots,x^{n})$ in a neighborhood of $x_{*}$, where $\beta^{i}_{1}$, $\beta^{i}_{2}$, $L^{i}_{1}$, $L^{i}_{2}$ are positive constants for $i=1,\ldots,n$.

Experiments

In this section, we design experiments to test RACE-CARS on synthetic functions and language model tasks, respectively. We use the same budget to compare RACE-CARS with a selection of DFO algorithms, including SRACOS (Hu, Qian, and Yu 2017), the zeroth-order adaptive momentum method (ZO-Adam) (Chen et al. 2019), differential evolution (DE) (Opara and Arabas 2019) and covariance matrix adaptation evolution strategies (CMA-ES) (Hansen 2016). All the baseline algorithms are fine-tuned, and the essential hyper-parameters of RACE-CARS can be found in the Appendix.

On Synthetic Functions

We commence our empirical experiments with four benchmark functions: Ackley, Levy, Rastrigin and Sphere. Their analytic forms and 2-dimensional illustrations are detailed in the Appendix. Characterized by extreme non-convexity and numerous local minima and saddle points (with the exception of the Sphere function), each is minimized within the boundary $\Omega=[-10,10]^{n}$, with a global minimum value of 0. We choose the dimension of the solution space $n$ to be $50$ and $500$, with corresponding function evaluation budgets of $5000$ and $50000$. Notably, as the results indicate, the convergence of RACE-CARS requires only a fraction of this budget.
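For reference, three of the four benchmarks can be written in their standard forms (assuming the usual parameterizations; each attains its minimum of 0 at the origin):

```python
import math

def sphere(x):
    # Sphere: strongly convex, minimum 0 at the origin.
    return sum(v * v for v in x)

def rastrigin(x):
    # Rastrigin: highly multimodal, minimum 0 at the origin.
    return 10 * len(x) + sum(v * v - 10 * math.cos(2 * math.pi * v) for v in x)

def ackley(x):
    # Ackley: nearly flat outer region riddled with local minima,
    # minimum 0 at the origin.
    n = len(x)
    s1 = sum(v * v for v in x) / n
    s2 = sum(math.cos(2 * math.pi * v) for v in x) / n
    return -20 * math.exp(-0.2 * math.sqrt(s1)) - math.exp(s2) + 20 + math.e
```

These closed forms make the contrast visible: Sphere rewards any local descent, while Rastrigin and Ackley punish over-exploitation, which is where the region-shrinking exploration trade-off matters.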

Figure 1: Comparison on synthetic functions with $n=50$: (a) Ackley, (b) Levy, (c) Rastrigin, (d) Sphere.

The region-shrinking rate is configured to be γ=0.9 and 0.95, with shrinking frequencies of ρ=0.01 and 0.001 for n=50 and n=500, respectively. Each algorithm is executed over 30 trials, and the mean convergence trajectories of the best-so-far values are depicted in Figure 1 and Figure 2. The numerical values adjacent to the algorithm names in the legends are the means of the attained minima. It is evident that RACE-CARS performs best in both convergence speed and optimal value, with a slight edge to CMA-ES on the strongly convex Sphere function. That edge, however, comes at the cost of scalability: CMA-ES maintains an n×n covariance matrix, which is significantly more computationally intensive than the other algorithms.
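To make the roles of γ and ρ concrete, the following is a minimal sketch of a random region-shrinking step; it is an illustrative reading of the hyper-parameters (each coordinate of the search box shrinks by factor γ with probability ρ per query, so the box volume contracts by γ^{nρ} per query in expectation), not the exact sub-procedure of RACE-CARS.

```python
import numpy as np

# Hedged sketch of the random region-shrinking idea (illustrative reading
# of gamma and rho, not the authors' implementation): with probability rho,
# each coordinate of the current search box is shrunk by factor gamma
# around the best-so-far point.
def shrink_region(lower, upper, x_best, gamma=0.9, rho=0.01, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    lower, upper = lower.copy(), upper.copy()
    for i in range(len(lower)):
        if rng.random() < rho:                       # shrink dimension i
            half = gamma * (upper[i] - lower[i]) / 2.0
            lower[i] = max(lower[i], x_best[i] - half)
            upper[i] = min(upper[i], x_best[i] + half)
    return lower, upper

# The box contracts steadily while always containing the incumbent point.
lower, upper = np.full(50, -10.0), np.full(50, 10.0)
x_best = np.zeros(50)
for _ in range(1000):
    lower, upper = shrink_region(lower, upper, x_best)
assert np.all(upper >= lower)
assert np.all(upper - lower <= 20.0)
```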

Figure 2: Comparison of synthetic functions with n=500; panels (a) Ackley, (b) Levy, (c) Rastrigin, (d) Sphere.

On Black-Box Tuning for LMaaS

Prompt tuning for extremely large pre-trained language models (PTMs) has shown great power. PTMs such as GPT-3 (Brown et al. 2020) are usually released as a service, owing to commercial considerations and the potential risk of misuse, and users design individual prompts to query the PTMs through black-box APIs. This scenario is called Language-Model-as-a-Service (LMaaS) (Diao et al. 2022; Sun et al. 2022). We follow the experiments designed by Sun et al. (2022) (code: https://github.com/txsun1997/Black-Box-Tuning), where a language understanding task is formulated as a classification task: for a batch of PTM-modified input texts X, the labels Y lie in the PTM vocabulary, and we need to tune a continuous prompt p such that the black-box PTM inference API f satisfies Y=f(p;X). Moreover, to handle the high-dimensional prompt p, Sun et al. (2022) proposed to randomly embed the D-dimensional prompt into a lower-dimensional space ℝ^d via a random projection matrix A∈ℝ^{D×d}. The objective therefore becomes:

\min_{z\in\mathcal{Z}}\mathcal{L}\bigl(f(\mathbf{A}z+\mathbf{p}_{0};X),Y\bigr),

where \mathcal{Z}=[-50,50]^{d} is the search space and \mathcal{L}(\cdot) is the cross-entropy loss.
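The random-embedding objective can be sketched as follows; the PTM inference API f is a black box in practice, so a toy softmax classifier stands in for it here, and the dimensions, seed, and helper names are illustrative.

```python
import numpy as np

# Sketch of the random-embedding objective for black-box prompt tuning
# (Sun et al. 2022): the d-dimensional variable z is mapped into the
# D-dimensional prompt space via a fixed random matrix A. The real PTM
# API f is a black box; a toy softmax classifier stands in for it here.
D, d, n_classes = 5000, 500, 2
rng = np.random.default_rng(0)
A = rng.normal(0.0, 1.0 / d, size=(D, d))   # random projection matrix
p0 = np.zeros(D)                            # initial prompt
W = rng.normal(size=(n_classes, D))         # toy stand-in for the PTM API
y = np.array([1])                           # label for one example

def api(prompt):
    logits = W @ prompt
    e = np.exp(logits - logits.max())
    return e / e.sum()                      # class probabilities

def loss(z):
    probs = api(A @ z + p0)                 # query with the embedded prompt
    return -np.log(probs[y[0]])             # cross-entropy loss

z = rng.uniform(-50, 50, size=d)            # a point in Z = [-50, 50]^d
assert loss(z) >= 0.0
```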

In our experimental setup, we set the search-space dimension to d=500 and the prompt length to 50, with RoBERTa (Liu et al. 2019a) serving as the backbone model. We evaluate performance on SST-2 (Socher et al. 2013), Yelp Polarity and AG's News (Zhang, Zhao, and LeCun 2015), and RTE (Wang et al. 2018a). With a fixed API-call budget of T=8000, we pit RACE-CARS against SRACOS and CMA-ES, the default DFO algorithm used in (Sun et al. 2022). We exclude ZO-Adam and DE owing to their suboptimal performance on high-dimensional nonconvex black-box functions, as demonstrated in the previous section.

For these tests, the shrinking rate is γ=0.7 and the shrinking frequency is ρ=0.002. Each algorithm is repeated 5 times independently with distinct seeds. We assess the algorithms by the mean and standard deviation of the training loss, training accuracy, development loss, and development accuracy. The SST-2 results are highlighted in Figure 3, with additional findings for Yelp Polarity, AG's News, and RTE detailed in the appendix. The results indicate that RACE-CARS consistently accelerates the convergence of SRACOS. While CMA-ES shows superior performance on Yelp Polarity, AG's News, and RTE, it also exhibits signs of overfitting. RACE-CARS achieves comparable performance to CMA-ES, despite the latter's hyperparameters being finely tuned. Notably, the hyperparameters of RACE-CARS were empirically adjusted on the SST-2 dataset and then applied to the other three datasets without further tuning.

Figure 3: Comparisons on SST-2; panels (a) training loss, (b) training accuracy, (c) development loss, (d) development accuracy.

Discussion

Beyond continuity

  • (i)

    For discontinuous objective functions.

    Dimensionally local Hölder continuity (Definition 5) constrains the objective through a set of continuous envelopes, but the objective itself need not be continuous. Beyond the continuous cases discussed in the previous section, RACE-CARS therefore remains applicable to discontinuous objectives; see the appendix for details.

  • (ii)

    For discrete optimization.

    Like SRACOS, RACE-CARS retains the capability to tackle discrete optimization problems. The convergence theorems, presented in Theorem 2 and Theorem 3, cover this situation once the measure of the probability space is altered, for example to one induced by the counting measure. We extend the experiments to mixed-integer programming problems, substantiating the acceleration of RACE-CARS empirically; see the appendix for details.

On the Concept Hypothesis-Target Shattering

The concept of hypothesis-target shattering, central to our discussion, draws inspiration from the established learning-theoretic notion of shattering and its deep ties to Vapnik-Chervonenkis (VC) theory (Vapnik et al. 1998). At the heart of VC theory lies the VC dimension, a measure of a hypothesis family's capacity to distinguish among data points based on their labels. Specifically, for a collection S of data points with binary labels, we say a subset S′⊆S is shattered by a hypothesis family ℋ if there exists a hypothesis h∈ℋ that perfectly aligns with the labels of points in S′ and contrasts with those outside:

h(x)=\begin{cases}1,&x\in S^{\prime}\\ 0,&x\notin S^{\prime}.\end{cases}

The shattering coefficient 𝒮(ℋ,n) counts the distinct labelings that ℋ can produce on n points. The VC dimension is then defined as VC(ℋ) := sup{n : 𝒮(ℋ,n) = 2^n}, the maximum number of points that ℋ can label in every possible way.
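A brute-force shattering check makes the definition concrete; the threshold family below is standard VC-theory material, not a hypothesis class used by RACE-CARS.

```python
# Brute-force shattering check (standard VC-theory illustration): the
# threshold family h_t(x) = 1[x >= t] on the real line shatters any
# single point but no two-point set, so its VC dimension is 1.
def shatters(points, hypotheses):
    labelings = {tuple(h(x) for x in points) for h in hypotheses}
    return len(labelings) == 2 ** len(points)

thresholds = [lambda x, t=t: int(x >= t) for t in (-1.0, 0.5, 1.5, 3.0)]
assert shatters([1.0], thresholds)           # one point: both labels reachable
assert not shatters([1.0, 2.0], thresholds)  # labeling (1, 0) is unattainable
```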

In the context of classification-based DFO, we represent the target-representative capability of a family of hypotheses through hypothesis-target shattering, which measures the overlap between a hypothesis's active region and the target. The quintessence of algorithm design therefore hinges on maximizing this quantity; yet discerning the target-representative capacity within the intricate landscape of nonconvex, black-box optimization problems is nontrivial. Nonetheless, the hypothesis family generated by RACOS, although established under the previous framework, empirically exhibits sufficient efficacy in scenarios where the objective function is locally Hölder continuous. Looking ahead, altering the Training and Replacing sub-procedures inherited from SRACOS, ideally leading to a larger shattering rate while maintaining the easy-to-sample characterization, is another direction for extending the current study.

Ablation Experiments

While it is an appealing goal to develop a universally effective DFO algorithm for black-box functions that requires no hyperparameter tuning, this remains unrealistic. Our proposed algorithm, RACE-CARS, introduces two hyperparameters: the shrinking rate γ and the shrinking frequency ρ. For an n-dimensional optimization problem, we call γ^{nρ} the shrinking factor of RACE-CARS. We take Ackley as a case study and design ablation experiments on the two hyperparameters to reveal the mechanism of RACE-CARS. We do not aim to identify an optimal combination of hyperparameters maximizing the overlap with the target hypothesis; instead, our aim is to provide empirical guidance for tuning these hyperparameters effectively. For further details, the reader is directed to the appendix.

Conclusion

In this paper, we refine the framework of classification-based DFO as a stochastic process and propose a novel learning-theoretic concept named the hypothesis-target shattering rate. Our research delves into the convergence properties of sequential-mode classification-based DFO algorithms and provides a fresh perspective on their query-complexity upper bound. In light of the computational complexity upper bound under the new framework, we propose a theoretically grounded region-shrinking technique to accelerate convergence. Empirically, we study the scalability of RACE-CARS on both synthetic functions and black-box tuning for LMaaS, showing its superiority over SRACOS.

References

  • Akay and Karaboga (2012) Akay, B.; and Karaboga, D. 2012. Artificial bee colony algorithm for large-scale problems and engineering design optimization. Journal of intelligent manufacturing, 23: 1001–1014.
  • Akiba et al. (2019) Akiba, T.; Sano, S.; Yanase, T.; Ohta, T.; and Koyama, M. 2019. Optuna: A next-generation hyperparameter optimization framework. In Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining, 2623–2631.
  • Bäck and Schwefel (1993) Bäck, T.; and Schwefel, H.-P. 1993. An overview of evolutionary algorithms for parameter optimization. Evolutionary computation, 1(1): 1–23.
  • Bickel and Levina (2004) Bickel, P. J.; and Levina, E. 2004. Some theory for Fisher’s linear discriminant function,naive Bayes’, and some alternatives when there are many more variables than observations. Bernoulli, 10(6): 989–1010.
  • Bogunovic et al. (2016) Bogunovic, I.; Scarlett, J.; Krause, A.; and Cevher, V. 2016. Truncated variance reduction: A unified approach to bayesian optimization and level-set estimation. Advances in neural information processing systems, 29.
  • Brown et al. (2020) Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J. D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. 2020. Language models are few-shot learners. Advances in neural information processing systems, 33: 1877–1901.
  • Céa (1971) Céa, J. 1971. Optimisation: théorie et algorithmes. Jean Céa. Dunod.
  • Chen et al. (2019) Chen, X.; Liu, S.; Xu, K.; Li, X.; Lin, X.; Hong, M.; and Cox, D. 2019. Zo-adamm: Zeroth-order adaptive momentum method for black-box optimization. Advances in neural information processing systems, 32.
  • Conn, Scheinberg, and Vicente (2009) Conn, A. R.; Scheinberg, K.; and Vicente, L. N. 2009. Introduction to derivative-free optimization. SIAM.
  • Diao et al. (2022) Diao, S.; Huang, Z.; Xu, R.; Li, X.; Lin, Y.; Zhou, X.; and Zhang, T. 2022. Black-box prompt learning for pre-trained language models. arXiv preprint arXiv:2201.08531.
  • El-Yaniv and Wiener (2012) El-Yaniv, R.; and Wiener, Y. 2012. Active Learning via Perfect Selective Classification. Journal of Machine Learning Research, 13(2).
  • Eriksson et al. (2019) Eriksson, D.; Pearce, M.; Gardner, J.; Turner, R. D.; and Poloczek, M. 2019. Scalable global optimization via local Bayesian optimization. Advances in neural information processing systems, 32.
  • Falkner, Klein, and Hutter (2018) Falkner, S.; Klein, A.; and Hutter, F. 2018. BOHB: Robust and efficient hyperparameter optimization at scale. In International conference on machine learning, 1437–1446. PMLR.
  • Fan and Fan (2008) Fan, J.; and Fan, Y. 2008. High dimensional classification using features annealed independence rules. Annals of statistics, 36(6): 2605.
  • Fortin et al. (2012) Fortin, F.-A.; De Rainville, F.-M.; Gardner, M.-A. G.; Parizeau, M.; and Gagné, C. 2012. DEAP: Evolutionary algorithms made easy. The Journal of Machine Learning Research, 13(1): 2171–2175.
  • Frazier (2018) Frazier, P. I. 2018. A tutorial on Bayesian optimization. arXiv preprint arXiv:1807.02811.
  • Ge et al. (2022) Ge, D.; Liu, T.; Liu, J.; Tan, J.; and Ye, Y. 2022. SOLNP+: A Derivative-Free Solver for Constrained Nonlinear Optimization. arXiv preprint arXiv:2210.07160.
  • Gotovos (2013) Gotovos, A. 2013. Active learning for level set estimation. Master’s thesis, Eidgenössische Technische Hochschule Zürich, Department of Computer Science,.
  • Hall, Marron, and Neeman (2005) Hall, P.; Marron, J. S.; and Neeman, A. 2005. Geometric representation of high dimension, low sample size data. Journal of the Royal Statistical Society Series B: Statistical Methodology, 67(3): 427–444.
  • Hansen (2016) Hansen, N. 2016. The CMA evolution strategy: A tutorial. arXiv preprint arXiv:1604.00772.
  • Hashimoto, Yadlowsky, and Duchi (2018) Hashimoto, T.; Yadlowsky, S.; and Duchi, J. 2018. Derivative free optimization via repeated classification. In International Conference on Artificial Intelligence and Statistics, 2027–2036. PMLR.
  • Hensman, Fusi, and Lawrence (2013) Hensman, J.; Fusi, N.; and Lawrence, N. D. 2013. Gaussian processes for big data. arXiv preprint arXiv:1309.6835.
  • Hu, Qian, and Yu (2017) Hu, Y.-Q.; Qian, H.; and Yu, Y. 2017. Sequential classification-based optimization for direct policy search. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 31.
  • Larson, Menickelly, and Wild (2019) Larson, J.; Menickelly, M.; and Wild, S. M. 2019. Derivative-free optimization methods. Acta Numerica, 28: 287–404.
  • Liao (2010) Liao, T. W. 2010. Two hybrid differential evolution algorithms for engineering design optimization. Applied Soft Computing, 10(4): 1188–1199.
  • Liu et al. (2019a) Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; and Stoyanov, V. 2019a. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692.
  • Liu et al. (2017) Liu, Y.-R.; Hu, Y.-Q.; Qian, H.; Qian, C.; and Yu, Y. 2017. Zoopt: Toolbox for derivative-free optimization. arXiv preprint arXiv:1801.00329.
  • Liu et al. (2019b) Liu, Y.-R.; Hu, Y.-Q.; Qian, H.; and Yu, Y. 2019b. Asynchronous classification-based optimization. In Proceedings of the First International Conference on Distributed Artificial Intelligence, 1–8.
  • Michalski (2000) Michalski, R. S. 2000. Learnable evolution model: Evolutionary processes guided by machine learning. Machine learning, 38: 9–40.
  • Nayebi, Munteanu, and Poloczek (2019) Nayebi, A.; Munteanu, A.; and Poloczek, M. 2019. A framework for Bayesian optimization in embedded subspaces. In International Conference on Machine Learning, 4752–4761. PMLR.
  • Nelder and Mead (1965) Nelder, J. A.; and Mead, R. 1965. A simplex method for function minimization. The computer journal, 7(4): 308–313.
  • Nesterov and Spokoiny (2017) Nesterov, Y.; and Spokoiny, V. 2017. Random gradient-free minimization of convex functions. Foundations of Computational Mathematics, 17: 527–566.
  • Opara and Arabas (2019) Opara, K. R.; and Arabas, J. 2019. Differential Evolution: A survey of theoretical analyses. Swarm and evolutionary computation, 44: 546–558.
  • Ragonneau and Zhang (2023) Ragonneau, T. M.; and Zhang, Z. 2023. PDFO–A Cross-Platform Package for Powell’s Derivative-Free Optimization Solver. arXiv preprint arXiv:2302.13246.
  • Ray and Saini (2001) Ray, T.; and Saini, P. 2001. Engineering design optimization using a swarm with an intelligent information sharing among individuals. Engineering Optimization, 33(6): 735–748.
  • Ros and Hansen (2008) Ros, R.; and Hansen, N. 2008. A simple modification in CMA-ES achieving linear time and space complexity. In International conference on parallel problem solving from nature, 296–305. Springer.
  • Scheinberg (2022) Scheinberg, K. 2022. Finite Difference Gradient Approximation: To Randomize or Not? INFORMS Journal on Computing, 34(5): 2384–2388.
  • Shahriari et al. (2015) Shahriari, B.; Swersky, K.; Wang, Z.; Adams, R. P.; and De Freitas, N. 2015. Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE, 104(1): 148–175.
  • Shi et al. (2021) Shi, H.-J. M.; Xuan, M. Q.; Oztoprak, F.; and Nocedal, J. 2021. On the numerical performance of derivative-free optimization methods based on finite-difference approximations. arXiv preprint arXiv:2102.09762.
  • Snoek, Larochelle, and Adams (2012) Snoek, J.; Larochelle, H.; and Adams, R. P. 2012. Practical bayesian optimization of machine learning algorithms. Advances in neural information processing systems, 25.
  • Sobol’ (1967) Sobol’, I. M. 1967. On the distribution of points in a cube and the approximate evaluation of integrals. Zhurnal Vychislitel’noi Matematiki i Matematicheskoi Fiziki, 7(4): 784–802.
  • Socher et al. (2013) Socher, R.; Perelygin, A.; Wu, J.; Chuang, J.; Manning, C. D.; Ng, A. Y.; and Potts, C. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 conference on empirical methods in natural language processing, 1631–1642.
  • Sun et al. (2022) Sun, T.; Shao, Y.; Qian, H.; Huang, X.; and Qiu, X. 2022. Black-box tuning for language-model-as-a-service. In International Conference on Machine Learning, 20841–20855. PMLR.
  • Vapnik et al. (1998) Vapnik, V.; et al. 1998. Statistical learning theory. Wiley.
  • Wang et al. (2018a) Wang, A.; Singh, A.; Michael, J.; Hill, F.; Levy, O.; and Bowman, S. R. 2018a. GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461.
  • Wang, Fonseca, and Tian (2020) Wang, L.; Fonseca, R.; and Tian, Y. 2020. Learning search space partition for black-box optimization using monte carlo tree search. Advances in Neural Information Processing Systems, 33: 19511–19522.
  • Wang et al. (2018b) Wang, Z.; Gehring, C.; Kohli, P.; and Jegelka, S. 2018b. Batched large-scale Bayesian optimization in high-dimensional spaces. In International Conference on Artificial Intelligence and Statistics, 745–754. PMLR.
  • Wang et al. (2016) Wang, Z.; Hutter, F.; Zoghi, M.; Matheson, D.; and De Feitas, N. 2016. Bayesian optimization in a billion dimensions via random embeddings. Journal of Artificial Intelligence Research, 55: 361–387.
  • Yang and Shami (2020) Yang, L.; and Shami, A. 2020. On hyperparameter optimization of machine learning algorithms: Theory and practice. Neurocomputing, 415: 295–316.
  • Yi et al. (2024) Yi, Z.; Wei, Y.; Cheng, C. X.; He, K.; and Sui, Y. 2024. Improving sample efficiency of high dimensional Bayesian optimization with MCMC. arXiv preprint arXiv:2401.02650.
  • Yu (1979) Yu, W.-C. 1979. Positive basis and a class of direct search techniques. Scientia Sinica, Special Issue of Mathematics, 1(26): 53–68.
  • Yu and Qian (2014) Yu, Y.; and Qian, H. 2014. The sampling-and-learning framework: A statistical view of evolutionary algorithms. In 2014 IEEE Congress on Evolutionary Computation (CEC), 149–158. IEEE.
  • Yu, Qian, and Hu (2016) Yu, Y.; Qian, H.; and Hu, Y.-Q. 2016. Derivative-free optimization via classification. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 30.
  • Yue et al. (2023) Yue, P.; Yang, L.; Fang, C.; and Lin, Z. 2023. Zeroth-order Optimization with Weak Dimension Dependency. In The Thirty Sixth Annual Conference on Learning Theory, 4429–4472. PMLR.
  • Zhang, Zhao, and LeCun (2015) Zhang, X.; Zhao, J.; and LeCun, Y. 2015. Character-level convolutional networks for text classification. Advances in neural information processing systems, 28.

Appendix A Appendix

Synthetic functions

  • Ackley:

    f(x)=-20\exp\Bigl(-0.2\sqrt{\sum_{i=1}^{n}(x_{i}-0.2)^{2}/n}\Bigr)-\exp\Bigl(\sum_{i=1}^{n}\cos(2\pi x_{i})/n\Bigr)+e+20.
  • Levy:

    f(x)=\sin^{2}(\pi\omega_{1})+\sum_{i=1}^{n-1}(\omega_{i}-1)^{2}\bigl(1+10\sin^{2}(\pi\omega_{i}+1)\bigr)+(\omega_{n}-1)^{2}\bigl(1+\sin^{2}(2\pi\omega_{n})\bigr),

    where \omega_{i}=1+\frac{x_{i}-1}{4}.

  • Rastrigin:

    f(x)=10n+\sum_{i=1}^{n}\bigl(x_{i}^{2}-10\cos(2\pi x_{i})\bigr).
  • Sphere:

    f(x)=\sum_{i=1}^{n}(x_{i}-0.2)^{2}.
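For reproduction, the four benchmarks can be transcribed directly into NumPy; this is a sketch following the formulas as printed above (note the 0.2-shifted Ackley and Sphere).

```python
import numpy as np

# Direct NumPy transcriptions of the four benchmarks as printed above
# (the experiments use them over Omega = [-10, 10]^n).
def ackley(x):
    n = len(x)
    return (-20 * np.exp(-0.2 * np.sqrt(np.sum((x - 0.2) ** 2) / n))
            - np.exp(np.sum(np.cos(2 * np.pi * x)) / n) + np.e + 20)

def levy(x):
    w = 1 + (x - 1) / 4
    return (np.sin(np.pi * w[0]) ** 2
            + np.sum((w[:-1] - 1) ** 2 * (1 + 10 * np.sin(np.pi * w[:-1] + 1) ** 2))
            + (w[-1] - 1) ** 2 * (1 + np.sin(2 * np.pi * w[-1]) ** 2))

def rastrigin(x):
    return 10 * len(x) + np.sum(x ** 2 - 10 * np.cos(2 * np.pi * x))

def sphere(x):
    return np.sum((x - 0.2) ** 2)

assert sphere(np.full(50, 0.2)) == 0.0          # Sphere minimum at x = 0.2
assert abs(levy(np.ones(50))) < 1e-12           # Levy minimum at x = 1
assert abs(rastrigin(np.zeros(50))) < 1e-12     # Rastrigin minimum at x = 0
```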
Figure 4: Synthetic functions with n=2; panels (a) Ackley, (b) Levy, (c) Rastrigin, (d) Sphere.

Beyond continuity

For discontinuous objective functions

We design experiments on discontinuous objective functions by adding a random perturbation to the synthetic functions. The perturbation is set to:

P(x)=\sum_{i=1}^{m}\epsilon_{i}\cdot\delta_{\mathcal{B}(x_{i},0.5)}(x).

Here \mathcal{B}(x_{i},0.5) is the open ball centered at x_{i} with radius 0.5, where the centers x_{i}, i=1,\ldots,m, are randomly generated within the solution region. \delta_{\mathcal{B}(x_{i},0.5)}(x) is the indicator function, equal to 1 when x\in\mathcal{B}(x_{i},0.5) and 0 otherwise. The perturbations \epsilon_{i} are sampled uniformly from [0,1] for each ball center x_{i}, i=1,\ldots,m. The objective function is set to

\tilde{f}(x):=f(x)+P(x),

which is lower semi-continuous. We use the same settings as in the body sections, with dimension n=50, perturbation size m=5n, and budget T=100n. As before, the region-shrinking rate is γ=0.9 and the shrinking frequency is ρ=0.01. Each algorithm is repeated for 30 runs, and the mean convergence trajectories of the best-so-far values are presented in Figure 5. The numbers attached to the algorithm names in the legends are the mean obtained minima. The acceleration of RACE-CARS over SRACOS remains valid: compared with the baselines, RACE-CARS converges fastest and obtains the best optimal value. As anticipated, the performance of SRACOS and RACE-CARS is almost impervious to discontinuity, whereas the other three baselines, whose convergence relies on continuity, suffer from oscillation or early stopping to different extents.
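The perturbed objective can be sketched as follows, using the Sphere function as the base f for concreteness; the sampling ranges follow the setup above, while the seed and helper names are illustrative.

```python
import numpy as np

# Sketch of the discontinuous test objective f~(x) = f(x) + P(x): the
# perturbation P adds a random jump eps_i on each open ball B(x_i, 0.5),
# with centers drawn inside the solution region [-10, 10]^n.
def make_perturbation(n, m, radius=0.5, rng=None):
    if rng is None:
        rng = np.random.default_rng(0)
    centers = rng.uniform(-10, 10, size=(m, n))   # x_1, ..., x_m
    eps = rng.uniform(0, 1, size=m)               # jump heights
    def P(x):
        inside = np.linalg.norm(centers - x, axis=1) < radius
        return float(np.sum(eps[inside]))
    return P

n, m = 50, 5 * 50
P = make_perturbation(n, m)
f_tilde = lambda x: np.sum((x - 0.2) ** 2) + P(x)   # Sphere + perturbation
x = np.zeros(n)
assert f_tilde(x) >= np.sum((x - 0.2) ** 2)         # P is non-negative
```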

Figure 5: Comparison on discontinuous objectives; panels (a) Ackley, (b) Levy, (c) Rastrigin, (d) Sphere.

For discrete optimization

To transfer RACE-CARS to discrete optimization, the Training and Projection sub-procedures in Algorithm 4 must be modified. In all cases, we employ the discrete version of RACOS for Training (Yu, Qian, and Hu 2016). Furthermore, we take the counting measure # as the inducing measure of the probability space (Ω,ℱ,ℙ), where ℙ(B):=#(B)/#(Ω) for all B∈ℱ. The Projection step is similar, except that the operator ∥·∥ returns the point count along each dimension of the region.

We design experiments on the following formulation:

\min f(x,y) \qquad (4)
\text{s.t.}~x\in\Omega_{c},~y\in\Omega_{d},

where \Omega_{c} is the continuous solution subspace and \Omega_{d} is discrete. Equation (4) encompasses a wide range of continuous, discrete, and mixed-integer programming problems. In our experiments, we specify equation (4) as a mixed-integer programming problem:

\min \mathrm{Ackley}(x)+L^{T}\mathrm{abs}(y)
\text{s.t.}~x\in[-1,1]^{n_{1}},~y\in\{-10,-9,\ldots,9,10\}^{n_{2}},

where L\in\mathbb{R}^{n_{2}} is sampled uniformly from [1,2]^{n_{2}}, so the global optimal value is 0. We choose the dimensions of the solution space as n_{1}=n_{2}=50 and 250, with function-evaluation budgets of 3000 and 10000, respectively. The region-shrinking rate is γ=0.95, with shrinking frequencies of ρ=0.01 and 0.005, respectively. Each algorithm is repeated for 30 runs, and the mean convergence trajectories of the best-so-far values are presented in Figure 6. As the results show, RACE-CARS maintains its acceleration over SRACOS in the discrete setting.
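A sketch of the mixed-integer objective above; the Ackley form repeats the appendix formula for self-containedness, and the seed is illustrative.

```python
import numpy as np

# Sketch of the mixed-integer test problem: Ackley(x) + L^T |y|, with
# continuous x in [-1, 1]^{n1} and integer y in {-10, ..., 10}^{n2}.
def ackley(x):
    n = len(x)
    return (-20 * np.exp(-0.2 * np.sqrt(np.sum((x - 0.2) ** 2) / n))
            - np.exp(np.sum(np.cos(2 * np.pi * x)) / n) + np.e + 20)

rng = np.random.default_rng(0)
n1 = n2 = 50
L = rng.uniform(1, 2, size=n2)          # fixed positive weights on |y|

def objective(x, y):
    return ackley(x) + L @ np.abs(y)

x = rng.uniform(-1, 1, size=n1)         # continuous block
y = rng.integers(-10, 11, size=n2)      # integer block
assert objective(x, y) >= 0.0           # both summands are non-negative
```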

Figure 6: Mixed-integer programming; panels (a) n_1=n_2=50, (b) n_1=n_2=250.

Ablation experiments

  • (i)

    Relationship between shrinking frequency ρ and dimension n.

    For Ackley on Ω=[-10,10]^n, we fix the shrinking rate γ=0.95 and compare the performance of RACE-CARS across shrinking frequencies ρ and dimensions n. The shrinking frequency ρ ranges from 0.002 to 0.2 and the dimension n from 50 to 500. The function-call budget is set to T=30n for fairness. Experiments are repeated 5 times for each hyperparameter setting; the results are recorded in Table 1, and the normalized results are shown as a heatmap, in which the horizontal axis is the dimension, the vertical axis is the shrinking frequency, and the black curve traces the best shrinking frequency with respect to dimension. The results indicate that the best ρ is inversely proportional to n; maintaining nρ constant is therefore preferred.

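Reading the best shrinking frequency off Table 1 for each dimension illustrates this guidance numerically; the best-ρ values below are read from the bold entries of Table 1 (borderline cases could be read slightly differently).

```python
# Best shrinking frequency rho per dimension n, read from Table 1:
# the product n * rho stays roughly constant (about 1.4 to 2.0),
# supporting the guidance to keep n * rho fixed as n varies.
best_rho = {50: 0.028, 100: 0.016, 150: 0.012, 200: 0.008,
            250: 0.008, 300: 0.006, 350: 0.004, 400: 0.004,
            450: 0.004, 500: 0.004}
products = {n: n * r for n, r in best_rho.items()}
assert all(1.3 <= p <= 2.1 for p in products.values())
```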
  • (ii)

    Relationship between shrinking factor γ^{nρ} and dimension n of the solution space.

    For Ackley on Ω=[-r,r]^n, we compare the performance of RACE-CARS across shrinking factors and dimensions n. Different shrinking factors are generated by varying the shrinking rate γ and the product nρ of dimension and shrinking frequency. We design experiments on 4 dimensions n with 4 radii r. The function-call budget is set to T=30n. Experiments are repeated 5 times for each setting, and the results are presented as heatmaps in Figure 7. According to the results, the best shrinking factor is insensitive to the variation of dimension. Since the best nρ remains constant as n varies, only a slight variation of the corresponding best γ is needed. This observation is in line with the settings used in the Experiments section.

    Figure 7: Comparison of shrinking factor γ^{nρ} and dimension n of the solution space Ω=[-r,r]^n; panels (a) r=1, (b) r=5, (c) r=10, (d) r=25. In each subfigure, the horizontal axis is the dimension and the vertical axis is the shrinking factor; each pixel shows the y-wise normalized mean function value at the 30n-th step, and the black curve traces the best shrinking factor in each dimension.
  • (iii)

    Relationship between shrinking factor γ^{nρ} and radius r of the solution space.

    For Ackley on Ω=[-r,r]^n, we compare the performance of RACE-CARS across shrinking factors and radii r. Different shrinking factors are generated by varying the shrinking rate γ and the product nρ. We design experiments on 4 radii r with 4 dimensions n. The function-call budget is set to T=30n. Experiments are repeated 5 times for each setting, and the results are presented as heatmaps in Figure 8. According to the results, the best shrinking factor γ^{nρ} should be decreased as the radius r increases.

    Figure 8: Comparison of shrinking factor γ^{nρ} and radius r of the solution space Ω=[-r,r]^n; panels (a) n=50, (b) n=100, (c) n=250, (d) n=500. In each subfigure, the horizontal axis is the radius of the solution space and the vertical axis is the shrinking factor; each pixel shows the y-wise normalized mean function value at the 30n-th step, and the black curve traces the best shrinking factor for each radius.
Table 1: Comparison of shrinking frequencies ρ for Ackley on Ω=[-10,10]^n with shrinking rate γ=0.95. Mean and standard deviation of the function value at the 30n-th step are listed. The first row (ρ=0) gives the results of RACOS for reference. Results for ρ greater than 0.1 are omitted for concision. Bold entries (marked **) are the relatively better results in each dimension.
ρ \ n | 50 | 100 | 150 | 200 | 250 | 300 | 350 | 400 | 450 | 500
0 | 3.8±0.2 | 3.9±0.2 | 5.9±0.1 | 5.7±0.1 | 5.8±0.2 | 5.8±0.1 | 5.9±0.0 | 5.8±0.0 | 5.9±0.1 | 5.8±0.1
0.002 | 3.7±0.2 | 3.5±0.2 | 4.3±0.3 | 4.4±0.3 | 4.0±0.6 | 3.9±0.5 | 3.7±0.4 | 3.3±0.4 | 3.3±0.3 | 2.6±0.4
0.004 | 3.4±0.3 | 3.2±0.1 | 3.8±0.6 | 3.7±0.2 | 2.8±0.5 | 2.1±0.5 | **1.9±0.2** | **1.9±0.2** | **1.8±0.6** | **1.7±0.4**
0.006 | 3.3±0.3 | 2.9±0.2 | 2.9±0.4 | 2.0±0.1 | 2.1±0.4 | **1.8±0.2** | **1.9±0.2** | 2.4±0.3 | 2.7±0.3 | 3.1±1.5
0.008 | 3.0±0.3 | 2.4±0.4 | 2.3±0.5 | **1.7±0.2** | **1.8±0.4** | 2.3±0.7 | 2.5±0.4 | 3.9±0.8 | 4.5±0.9 | 5.5±1.0
0.010 | 2.8±0.4 | **1.8±0.5** | **1.9±0.4** | 2.1±0.2 | 2.3±0.6 | 3.6±0.8 | 4.1±0.6 | 4.9±0.9 | 7.5±0.7 | 6.3±0.8
0.012 | 2.5±0.1 | **1.4±0.2** | **1.6±0.4** | **1.8±0.2** | 3.1±0.7 | 4.4±1.1 | 5.4±0.5 | 5.6±1.4 | 6.6±0.8 | 7.5±0.7
0.014 | 2.6±0.3 | **1.5±0.3** | 2.5±0.5 | 2.2±0.5 | 3.9±0.6 | 5.0±0.8 | 7.1±1.2 | 6.3±0.6 | 8.3±0.7 | 8.3±0.5
0.016 | 2.6±0.4 | **1.3±0.4** | 2.0±0.3 | 3.4±1.3 | 4.4±0.9 | 6.1±1.0 | 7.0±1.1 | 6.9±0.8 | 8.0±0.6 | 8.9±0.8
0.018 | 2.3±0.2 | **1.3±0.4** | 2.8±0.8 | 4.1±0.6 | 5.1±0.9 | 6.9±1.0 | 7.1±1.3 | 8.3±0.5 | 9.1±0.9 | 9.3±0.4
0.020 | 2.0±0.6 | 2.0±0.5 | 3.2±1.2 | 4.4±0.8 | 6.3±1.3 | 7.2±1.1 | 7.4±0.6 | 9.1±0.8 | 10.2±0.9 | 10.0±0.6
0.022 | **1.7±0.3** | 2.0±1.0 | 3.9±1.3 | 5.3±1.0 | 6.8±1.0 | 7.5±1.1 | 8.8±0.6 | 9.1±0.8 | 10.7±0.5 | 10.0±0.3
0.024 | **1.9±0.2** | 3.3±0.9 | 4.3±1.1 | 6.0±0.9 | 7.0±0.3 | 8.5±0.7 | 9.3±0.4 | 10.1±0.7 | 10.7±0.5 | 10.8±0.6
0.026 | **1.5±0.4** | 2.7±1.2 | 4.3±0.6 | 7.0±0.7 | 8.3±0.4 | 9.0±0.4 | 9.8±0.7 | 10.1±0.7 | 10.7±0.6 | 11.6±0.4
0.028 | **1.3±0.2** | 3.8±0.6 | 4.9±0.7 | 7.2±0.7 | 8.8±1.0 | 8.8±0.9 | 9.5±0.5 | 10.7±0.4 | 10.9±0.2 | 11.3±0.4
0.030 | **1.3±0.4** | 4.0±0.4 | 5.0±0.4 | 7.1±0.9 | 8.2±1.0 | 9.2±0.6 | 10.3±0.5 | 10.6±1.0 | 10.9±0.4 | 11.8±0.4
0.032 | **1.5±0.5** | 6.1±1.2 | 6.2±1.1 | 7.3±0.6 | 9.1±0.8 | 9.9±0.5 | 10.6±0.4 | 10.5±0.8 | 11.2±0.6 | 12.0±0.3
0.034 | **1.9±0.6** | 5.1±0.8 | 5.9±0.7 | 8.2±0.7 | 9.0±0.3 | 10.4±0.4 | 10.6±0.6 | 11.2±0.4 | 11.9±0.5 | 12.1±0.4
0.036 | **1.5±0.2** | 6.4±0.5 | 6.8±0.6 | 7.9±1.0 | 9.3±1.0 | 10.4±0.4 | 10.9±0.3 | 11.3±0.5 | 11.9±0.3 | 12.0±0.3
0.038 | 2.2±1.6 | 5.1±0.8 | 6.1±0.6 | 8.2±1.0 | 9.6±0.8 | 10.8±0.5 | 10.9±0.5 | 11.6±0.3 | 11.8±0.4 | 12.3±0.5
0.040 | 2.2±1.1 | 7.3±1.6 | 7.6±0.4 | 8.7±0.4 | 9.3±0.7 | 10.4±0.5 | 11.4±0.4 | 11.7±0.6 | 12.3±0.2 | 12.2±0.3
0.042 | 2.2±0.8 | 6.4±1.2 | 7.6±0.8 | 9.3±0.8 | 10.0±0.8 | 10.4±0.5 | 11.6±0.3 | 12.3±0.4 | 12.0±0.5 | 12.5±0.5
0.044 | **1.8±0.6** | 6.8±1.0 | 6.9±0.9 | 9.6±0.5 | 10.3±0.4 | 11.0±1.0 | 11.1±0.3 | 12.0±0.3 | 12.4±0.4 | 12.6±0.3
0.046 | 2.1±0.3 | 7.4±0.7 | 8.1±0.8 | 9.4±0.3 | 10.4±0.9 | 11.3±0.4 | 11.6±0.5 | 12.1±0.4 | 12.4±0.2 | 12.7±0.2
0.048 | 3.2±1.1 | 6.6±1.0 | 7.5±1.3 | 9.9±0.4 | 10.1±0.4 | 11.3±0.2 | 12.2±0.4 | 12.0±0.5 | 12.3±0.3 | 12.5±0.3
0.050 | 3.5±1.0 | 7.7±0.8 | 8.6±0.3 | 10.0±0.3 | 10.6±0.6 | 11.1±0.5 | 12.4±0.2 | 12.5±0.4 | 12.6±0.3 | 12.8±0.2
0.052 3.6±0.73.6\pm 0.7 8.8±0.98.8\pm 0.9 8.1±0.68.1\pm 0.6 10.0±0.810.0\pm 0.8 11.0±0.411.0\pm 0.4 11.6±0.711.6\pm 0.7 12.7±0.212.7\pm 0.2 12.5±0.112.5\pm 0.1 12.8±0.412.8\pm 0.4 13.3±0.113.3\pm 0.1
0.054 3.8±1.53.8\pm 1.5 7.3±0.97.3\pm 0.9 8.5±0.38.5\pm 0.3 10.1±0.910.1\pm 0.9 11.3±0.311.3\pm 0.3 11.7±0.211.7\pm 0.2 12.7±0.312.7\pm 0.3 12.9±0.212.9\pm 0.2 12.8±0.212.8\pm 0.2 13.0±0.313.0\pm 0.3
0.056 3.8±1.33.8\pm 1.3 8.8±1.08.8\pm 1.0 9.1±0.89.1\pm 0.8 10.5±0.710.5\pm 0.7 11.1±0.711.1\pm 0.7 11.8±0.411.8\pm 0.4 12.4±0.212.4\pm 0.2 12.6±0.412.6\pm 0.4 13.1±0.113.1\pm 0.1 13.0±0.213.0\pm 0.2
0.058 4.1±1.04.1\pm 1.0 9.1±1.19.1\pm 1.1 9.3±1.39.3\pm 1.3 10.7±0.510.7\pm 0.5 10.9±0.210.9\pm 0.2 11.9±0.311.9\pm 0.3 12.1±0.412.1\pm 0.4 12.7±0.412.7\pm 0.4 13.1±0.213.1\pm 0.2 13.2±0.313.2\pm 0.3
0.060 4.1±1.04.1\pm 1.0 8.8±0.48.8\pm 0.4 9.1±0.99.1\pm 0.9 10.8±0.510.8\pm 0.5 11.2±0.511.2\pm 0.5 11.8±0.311.8\pm 0.3 12.5±0.212.5\pm 0.2 12.8±0.212.8\pm 0.2 13.0±0.313.0\pm 0.3 13.2±0.213.2\pm 0.2
0.062 4.1±1.74.1\pm 1.7 9.2±1.09.2\pm 1.0 9.1±0.69.1\pm 0.6 10.9±0.510.9\pm 0.5 11.9±0.511.9\pm 0.5 12.1±0.312.1\pm 0.3 12.4±0.212.4\pm 0.2 12.9±0.412.9\pm 0.4 13.0±0.313.0\pm 0.3 13.3±0.113.3\pm 0.1
0.064 4.5±1.24.5\pm 1.2 8.6±0.58.6\pm 0.5 9.7±0.49.7\pm 0.4 11.1±0.811.1\pm 0.8 11.7±0.211.7\pm 0.2 12.3±0.512.3\pm 0.5 12.6±0.312.6\pm 0.3 13.1±0.213.1\pm 0.2 13.3±0.213.3\pm 0.2 13.4±0.213.4\pm 0.2
0.066 4.7±0.34.7\pm 0.3 9.5±0.99.5\pm 0.9 9.2±0.49.2\pm 0.4 11.0±0.511.0\pm 0.5 12.0±0.312.0\pm 0.3 12.1±0.312.1\pm 0.3 12.8±0.412.8\pm 0.4 12.9±0.212.9\pm 0.2 13.3±0.213.3\pm 0.2 13.3±0.213.3\pm 0.2
0.068 4.7±0.74.7\pm 0.7 9.2±1.09.2\pm 1.0 9.7±0.79.7\pm 0.7 11.0±0.711.0\pm 0.7 11.7±0.411.7\pm 0.4 12.8±0.412.8\pm 0.4 12.5±0.612.5\pm 0.6 13.0±0.213.0\pm 0.2 13.4±0.113.4\pm 0.1 13.4±0.213.4\pm 0.2
0.070 5.4±1.55.4\pm 1.5 9.5±0.99.5\pm 0.9 10.1±0.610.1\pm 0.6 10.8±0.410.8\pm 0.4 12.3±0.312.3\pm 0.3 12.4±0.412.4\pm 0.4 12.5±0.712.5\pm 0.7 13.1±0.413.1\pm 0.4 13.3±0.213.3\pm 0.2 13.5±0.113.5\pm 0.1
0.072 5.3±1.25.3\pm 1.2 9.1±0.89.1\pm 0.8 10.1±0.410.1\pm 0.4 11.7±0.511.7\pm 0.5 12.2±0.612.2\pm 0.6 12.5±0.412.5\pm 0.4 13.0±0.413.0\pm 0.4 13.3±0.213.3\pm 0.2 13.2±0.213.2\pm 0.2 13.7±0.213.7\pm 0.2
0.074 5.8±1.05.8\pm 1.0 10.1±0.810.1\pm 0.8 10.3±0.410.3\pm 0.4 11.4±0.311.4\pm 0.3 12.0±0.412.0\pm 0.4 12.5±0.412.5\pm 0.4 12.7±0.312.7\pm 0.3 13.0±0.413.0\pm 0.4 13.3±0.313.3\pm 0.3 13.7±0.113.7\pm 0.1
0.076 5.2±1.35.2\pm 1.3 9.8±0.59.8\pm 0.5 10.6±0.510.6\pm 0.5 11.3±0.911.3\pm 0.9 12.3±0.312.3\pm 0.3 12.7±0.512.7\pm 0.5 13.0±0.213.0\pm 0.2 13.2±0.213.2\pm 0.2 13.5±0.313.5\pm 0.3 13.7±0.213.7\pm 0.2
0.078 5.9±0.55.9\pm 0.5 10.3±0.510.3\pm 0.5 9.8±0.69.8\pm 0.6 11.8±0.111.8\pm 0.1 12.1±0.112.1\pm 0.1 12.7±0.412.7\pm 0.4 13.2±0.313.2\pm 0.3 13.3±0.313.3\pm 0.3 13.6±0.113.6\pm 0.1 13.7±0.213.7\pm 0.2
0.080 5.6±0.75.6\pm 0.7 10.1±0.110.1\pm 0.1 10.6±0.410.6\pm 0.4 11.4±0.411.4\pm 0.4 12.3±0.512.3\pm 0.5 13.0±0.313.0\pm 0.3 13.1±0.113.1\pm 0.1 13.4±0.313.4\pm 0.3 13.5±0.413.5\pm 0.4 13.7±0.213.7\pm 0.2
0.082 4.3±1.34.3\pm 1.3 10.3±0.810.3\pm 0.8 10.2±0.710.2\pm 0.7 11.8±0.511.8\pm 0.5 12.5±0.412.5\pm 0.4 12.8±0.312.8\pm 0.3 13.1±0.313.1\pm 0.3 13.4±0.213.4\pm 0.2 13.5±0.313.5\pm 0.3 13.8±0.213.8\pm 0.2
0.084 6.7±0.96.7\pm 0.9 10.6±0.310.6\pm 0.3 10.9±0.210.9\pm 0.2 11.8±0.311.8\pm 0.3 12.5±0.312.5\pm 0.3 13.0±0.313.0\pm 0.3 13.3±0.213.3\pm 0.2 13.4±0.313.4\pm 0.3 13.6±0.213.6\pm 0.2 13.8±0.213.8\pm 0.2
0.086 4.9±0.64.9\pm 0.6 10.2±0.610.2\pm 0.6 11.0±0.411.0\pm 0.4 11.9±0.311.9\pm 0.3 12.4±0.212.4\pm 0.2 12.6±0.512.6\pm 0.5 13.0±0.313.0\pm 0.3 13.5±0.213.5\pm 0.2 13.7±0.213.7\pm 0.2 13.9±0.213.9\pm 0.2
0.088 5.8±1.05.8\pm 1.0 10.7±0.610.7\pm 0.6 10.9±0.210.9\pm 0.2 11.7±0.611.7\pm 0.6 12.5±0.212.5\pm 0.2 13.0±0.513.0\pm 0.5 13.3±0.213.3\pm 0.2 13.6±0.313.6\pm 0.3 13.6±0.113.6\pm 0.1 13.9±0.113.9\pm 0.1
0.090 6.6±1.46.6\pm 1.4 10.2±0.610.2\pm 0.6 11.1±0.411.1\pm 0.4 12.1±0.312.1\pm 0.3 12.6±0.512.6\pm 0.5 13.0±0.213.0\pm 0.2 13.5±0.113.5\pm 0.1 13.4±0.213.4\pm 0.2 13.6±0.213.6\pm 0.2 13.8±0.213.8\pm 0.2
0.092 7.0±1.07.0\pm 1.0 10.4±0.610.4\pm 0.6 11.1±0.711.1\pm 0.7 12.2±0.312.2\pm 0.3 12.8±0.312.8\pm 0.3 13.0±0.213.0\pm 0.2 13.3±0.213.3\pm 0.2 13.5±0.313.5\pm 0.3 13.7±0.213.7\pm 0.2 13.8±0.213.8\pm 0.2
0.094 7.9±0.57.9\pm 0.5 10.2±0.210.2\pm 0.2 11.2±0.611.2\pm 0.6 12.3±0.212.3\pm 0.2 12.5±0.312.5\pm 0.3 12.8±0.312.8\pm 0.3 13.2±0.213.2\pm 0.2 13.5±0.213.5\pm 0.2 13.6±0.213.6\pm 0.2 13.9±0.213.9\pm 0.2
0.096 6.7±0.56.7\pm 0.5 10.9±0.610.9\pm 0.6 11.1±0.211.1\pm 0.2 12.2±0.512.2\pm 0.5 12.8±0.212.8\pm 0.2 13.1±0.313.1\pm 0.3 13.4±0.313.4\pm 0.3 13.4±0.213.4\pm 0.2 13.8±0.313.8\pm 0.3 13.9±0.113.9\pm 0.1
0.098 7.6±0.57.6\pm 0.5 10.7±0.510.7\pm 0.5 11.1±0.211.1\pm 0.2 12.2±0.312.2\pm 0.3 12.6±0.312.6\pm 0.3 13.0±0.413.0\pm 0.4 13.3±0.213.3\pm 0.2 13.6±0.213.6\pm 0.2 13.7±0.213.7\pm 0.2 14.0±0.114.0\pm 0.1
0.100 8.2±1.08.2\pm 1.0 10.8±0.310.8\pm 0.3 11.3±0.811.3\pm 0.8 11.9±0.411.9\pm 0.4 13.0±0.213.0\pm 0.2 13.3±0.313.3\pm 0.3 13.4±0.113.4\pm 0.1 13.6±0.213.6\pm 0.2 13.8±0.113.8\pm 0.1 13.8±0.213.8\pm 0.2
Figure 9: Black-Box Tuning for LMaaS. Results of Training Loss, Training Accuracy, Development Loss and Development Accuracy on Yelp P, AG’s News and RTE.

Appendix B: Supplementary Theory

Proofs of Theorems

Theorem 2.

For the sequential-mode classification-based DFO Algorithm 3, let $\mathbf{X}_{t}=\mathbf{X}_{h_{t}}$, and fix $\epsilon>0$ and $0<\delta<1$. When $\Omega_{\epsilon}$ is $\eta$-shattered by $h_{t}$ for all $t=r+1,\ldots,T$ and $\max_{t=r+1,\ldots,T}\mathbb{P}(\{x\in\Omega\colon h_{t}(x)=1\})\leq p\leq 1$, the $(\epsilon,\delta)$-query complexity is upper bounded by

$$\mathcal{O}\left(\max\left\{\left(\lambda\frac{\eta}{p}+(1-\lambda)\right)^{-1}\left(\frac{1}{|\Omega_{\epsilon}|}\ln\frac{1}{\delta}-r\right)+r,\;T\right\}\right).$$
Proof.

Let $\tilde{x}:=\arg\min_{t=1,\ldots,T}f(x_{t})$. Then

$$\begin{aligned}
\Pr\left(f(\tilde{x})-f^{*}>\epsilon\right)
&=\mathbb{E}\left[\mathbb{I}_{\{\mathbf{Y}_{1},\ldots,\mathbf{Y}_{T-1}\in\Omega_{\epsilon}^{c}\}}\,\mathbb{E}[\mathbb{I}_{\{\mathbf{Y}_{T}\in\Omega_{\epsilon}^{c}\}}\mid\mathcal{F}_{T-1}]\right]\\
&=\mathbb{E}\left[\mathbb{I}_{\{\mathbf{Y}_{1},\ldots,\mathbf{Y}_{T-2}\in\Omega_{\epsilon}^{c}\}}\,\mathbb{E}\left[\mathbb{I}_{\{\mathbf{Y}_{T-1}\in\Omega_{\epsilon}^{c}\}}\,\mathbb{E}[\mathbb{I}_{\{\mathbf{Y}_{T}\in\Omega_{\epsilon}^{c}\}}\mid\mathcal{F}_{T-1}]\mid\mathcal{F}_{T-2}\right]\right]\\
&\;\;\vdots\\
&=\mathbb{E}\left[\mathbb{I}_{\{\mathbf{Y}_{1},\ldots,\mathbf{Y}_{r}\in\Omega_{\epsilon}^{c}\}}\,\mathbb{E}\left[\mathbb{I}_{\{\mathbf{Y}_{r+1}\in\Omega_{\epsilon}^{c}\}}\cdots\mathbb{E}\left[\mathbb{I}_{\{\mathbf{Y}_{T-1}\in\Omega_{\epsilon}^{c}\}}\,\mathbb{E}[\mathbb{I}_{\{\mathbf{Y}_{T}\in\Omega_{\epsilon}^{c}\}}\mid\mathcal{F}_{T-1}]\mid\mathcal{F}_{T-2}\right]\cdots\mid\mathcal{F}_{r}\right]\right].
\end{aligned}$$

Here $\mathbb{I}_{B}$ denotes the indicator function of $B\in\mathcal{F}$, i.e. $\mathbb{I}_{B}(x)=1$ for all $x\in B$ and $\mathbb{I}_{B}(x)=0$ otherwise. At step $t\geq r+1$, since $\mathbf{X}_{\Omega}$ is independent of $\mathcal{F}_{t-1}$, it holds that

$$\begin{aligned}
\mathbb{E}[\mathbb{I}_{\{\mathbf{Y}_{t}\in\Omega_{\epsilon}^{c}\}}\mid\mathcal{F}_{t-1}]
&=\mathbb{E}[\mathbb{I}_{\{\lambda\mathbf{X}_{t}+(1-\lambda)\mathbf{X}_{\Omega}\in\Omega_{\epsilon}^{c}\}}\mid\mathcal{F}_{t-1}]\\
&=\lambda\left(1-\mathbb{E}[\mathbb{I}_{\{\mathbf{X}_{h_{t}}\in\Omega_{\epsilon}\}}\mid\mathcal{F}_{t-1}]\right)+(1-\lambda)(1-|\Omega_{\epsilon}|).
\end{aligned}$$

Under the assumption that $\Omega_{\epsilon}$ is $\eta$-shattered by $h_{t}$, we have

$$\mathbb{E}[\mathbb{I}_{\{\mathbf{X}_{h_{t}}\in\Omega_{\epsilon}\}}\mid\mathcal{F}_{t-1}]=\frac{\mathbb{P}(\{x\in\Omega_{\epsilon}\colon h_{t}(x)=1\})}{\mathbb{P}(\{x\in\Omega\colon h_{t}(x)=1\})}\geq\frac{\eta}{p}|\Omega_{\epsilon}|.$$

Therefore,

$$\begin{aligned}
\mathbb{E}[\mathbb{I}_{\{\mathbf{Y}_{t}\in\Omega_{\epsilon}^{c}\}}\mid\mathcal{F}_{t-1}]
&=\lambda\left(1-\mathbb{E}[\mathbb{I}_{\{\mathbf{X}_{h_{t}}\in\Omega_{\epsilon}\}}\mid\mathcal{F}_{t-1}]\right)+(1-\lambda)(1-|\Omega_{\epsilon}|)\\
&\leq 1-\left(\lambda\frac{\eta}{p}+(1-\lambda)\right)|\Omega_{\epsilon}|.
\end{aligned}$$

Since this upper bound satisfies $0<1-\left(\lambda\frac{\eta}{p}+(1-\lambda)\right)|\Omega_{\epsilon}|<1$, it follows that

$$\begin{aligned}
\mathbb{E}\left[\mathbb{I}_{\{\mathbf{Y}_{t}\in\Omega_{\epsilon}^{c}\}}\,\mathbb{E}[\mathbb{I}_{\{\mathbf{Y}_{t+1}\in\Omega_{\epsilon}^{c}\}}\mid\mathcal{F}_{t}]\mid\mathcal{F}_{t-1}\right]
&\leq\left(1-\left(\lambda\frac{\eta}{p}+(1-\lambda)\right)|\Omega_{\epsilon}|\right)\mathbb{E}[\mathbb{I}_{\{\mathbf{Y}_{t}\in\Omega_{\epsilon}^{c}\}}\mid\mathcal{F}_{t-1}]\\
&\leq\left(1-\left(\lambda\frac{\eta}{p}+(1-\lambda)\right)|\Omega_{\epsilon}|\right)^{2}.
\end{aligned}$$

Moreover,

$$\begin{aligned}
\Pr\left(f(\tilde{x})-f^{*}>\epsilon\right)
&=\mathbb{E}\left[\mathbb{I}_{\{\mathbf{Y}_{1},\ldots,\mathbf{Y}_{r}\in\Omega_{\epsilon}^{c}\}}\,\mathbb{E}\left[\mathbb{I}_{\{\mathbf{Y}_{r+1}\in\Omega_{\epsilon}^{c}\}}\cdots\mathbb{E}\left[\mathbb{I}_{\{\mathbf{Y}_{T-1}\in\Omega_{\epsilon}^{c}\}}\,\mathbb{E}[\mathbb{I}_{\{\mathbf{Y}_{T}\in\Omega_{\epsilon}^{c}\}}\mid\mathcal{F}_{T-1}]\mid\mathcal{F}_{T-2}\right]\cdots\mid\mathcal{F}_{r}\right]\right]\\
&\leq\left(1-\left(\lambda\frac{\eta}{p}+(1-\lambda)\right)|\Omega_{\epsilon}|\right)^{T-r}\Pr\left(\mathbf{Y}_{1},\ldots,\mathbf{Y}_{r}\in\Omega_{\epsilon}^{c}\right)\\
&=\left(1-\left(\lambda\frac{\eta}{p}+(1-\lambda)\right)|\Omega_{\epsilon}|\right)^{T-r}(1-|\Omega_{\epsilon}|)^{r}\\
&\leq\exp\left\{-\left((T-r)\left(\lambda\frac{\eta}{p}+(1-\lambda)\right)+r\right)|\Omega_{\epsilon}|\right\}.
\end{aligned}$$

In order that $\Pr\left(f(\tilde{x})-f^{*}>\epsilon\right)\leq\delta$, it suffices that

$$\exp\left\{-\left((T-r)\left(\lambda\frac{\eta}{p}+(1-\lambda)\right)+r\right)|\Omega_{\epsilon}|\right\}\leq\delta,$$

hence the $(\epsilon,\delta)$-query complexity is upper bounded by

$$\mathcal{O}\left(\max\left\{\left(\lambda\frac{\eta}{p}+(1-\lambda)\right)^{-1}\left(\frac{1}{|\Omega_{\epsilon}|}\ln\frac{1}{\delta}-r\right)+r,\;T\right\}\right).$$
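As a quick numerical illustration (not part of the original analysis), the first term of this bound can be evaluated directly. The helper below and all parameter values are hypothetical; $|\Omega_{\epsilon}|$ is passed as `eps_region`:

```python
import math

def query_complexity_bound(eps_region, delta, lam, eta, p, r):
    """Evaluate the first term of the Theorem 2 bound:
    (lam*eta/p + (1-lam))^{-1} * (ln(1/delta)/|Omega_eps| - r) + r.
    All arguments are hypothetical sample values, not values from the paper."""
    rate = lam * eta / p + (1.0 - lam)
    return rate ** -1 * (math.log(1.0 / delta) / eps_region - r) + r

# A classifier that concentrates on Omega_eps (larger eta/p) shrinks the bound.
loose = query_complexity_bound(1e-3, 0.05, lam=0.5, eta=0.1, p=1.0, r=10)
tight = query_complexity_bound(1e-3, 0.05, lam=0.5, eta=0.9, p=1.0, r=10)
assert tight < loose
```

This mirrors the role of $\eta/p$ in the bound: the better the hypothesis concentrates its positive region on $\Omega_{\epsilon}$, the fewer queries the bound requires.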

Theorem 3.

Consider Algorithm 4 with region-shrinking rate $0<\gamma<1$ and region-shrinking frequency $0<\rho<1$, and fix $\epsilon>0$ and $0<\delta<1$. When $\Omega_{\epsilon}$ is $\eta$-shattered by $\tilde{h}_{t}$ for all $t=r+1,\ldots,T$, the $(\epsilon,\delta)$-query complexity is upper bounded by

$$\mathcal{O}\left(\max\left\{\left(\frac{\gamma^{-\rho}+\gamma^{-(T-r)\rho}}{2}\lambda\eta+(1-\lambda)\right)^{-1}\left(\frac{1}{|\Omega_{\epsilon}|}\ln\frac{1}{\delta}-r\right)+r,\;T\right\}\right).$$
Proof.

Let $\tilde{x}:=\arg\min_{t=1,\ldots,T}f(x_{t})$. Then

$$\begin{aligned}
\Pr\left(f(\tilde{x})-f^{*}>\epsilon\right)
&=\mathbb{E}\left[\mathbb{I}_{\{\mathbf{Y}_{1},\ldots,\mathbf{Y}_{T-1}\in\Omega_{\epsilon}^{c}\}}\,\mathbb{E}[\mathbb{I}_{\{\mathbf{Y}_{T}\in\Omega_{\epsilon}^{c}\}}\mid\mathcal{F}_{T-1}]\right]\\
&=\mathbb{E}\left[\mathbb{I}_{\{\mathbf{Y}_{1},\ldots,\mathbf{Y}_{T-2}\in\Omega_{\epsilon}^{c}\}}\,\mathbb{E}\left[\mathbb{I}_{\{\mathbf{Y}_{T-1}\in\Omega_{\epsilon}^{c}\}}\,\mathbb{E}[\mathbb{I}_{\{\mathbf{Y}_{T}\in\Omega_{\epsilon}^{c}\}}\mid\mathcal{F}_{T-1}]\mid\mathcal{F}_{T-2}\right]\right]\\
&\;\;\vdots\\
&=\mathbb{E}\left[\mathbb{I}_{\{\mathbf{Y}_{1},\ldots,\mathbf{Y}_{r}\in\Omega_{\epsilon}^{c}\}}\,\mathbb{E}\left[\mathbb{I}_{\{\mathbf{Y}_{r+1}\in\Omega_{\epsilon}^{c}\}}\cdots\mathbb{E}\left[\mathbb{I}_{\{\mathbf{Y}_{T-1}\in\Omega_{\epsilon}^{c}\}}\,\mathbb{E}[\mathbb{I}_{\{\mathbf{Y}_{T}\in\Omega_{\epsilon}^{c}\}}\mid\mathcal{F}_{T-1}]\mid\mathcal{F}_{T-2}\right]\cdots\mid\mathcal{F}_{r}\right]\right].
\end{aligned}$$

At step $t\geq r+1$, since $\mathbf{X}_{\Omega}$ is independent of $\mathcal{F}_{t-1}$, it holds that

$$\begin{aligned}
\mathbb{E}[\mathbb{I}_{\{\mathbf{Y}_{t}\in\Omega_{\epsilon}^{c}\}}\mid\mathcal{F}_{t-1}]
&=\mathbb{E}[\mathbb{I}_{\{\lambda\mathbf{X}_{t}+(1-\lambda)\mathbf{X}_{\Omega}\in\Omega_{\epsilon}^{c}\}}\mid\mathcal{F}_{t-1}]\\
&=\lambda\left(1-\mathbb{E}[\mathbb{I}_{\{\mathbf{X}_{t}\in\Omega_{\epsilon}\}}\mid\mathcal{F}_{t-1}]\right)+(1-\lambda)(1-|\Omega_{\epsilon}|).
\end{aligned}$$

The expected measure of the active region of $\tilde{h}_{t}$ is upper bounded by

$$\mathbb{E}\left[\mathbb{P}(\{x\in\Omega\colon\tilde{h}_{t}(x)=1\})\mid\mathcal{F}_{t-1}\right]\leq\gamma^{(t-r)\rho}\,\mathbb{P}[\Omega]=\gamma^{(t-r)\rho}.$$

Under the assumption that $\Omega_{\epsilon}$ is $\eta$-shattered by $\tilde{h}_{t}$, we have

$$\mathbb{E}\left[\mathbb{I}_{\{\mathbf{X}_{t}\in\Omega_{\epsilon}\}}\mid\mathcal{F}_{t-1}\right]=\frac{\mathbb{P}\left(\{x\in\Omega_{\epsilon}\colon\tilde{h}_{t}(x)=1\}\right)}{\mathbb{E}\left[\mathbb{P}(\{x\in\Omega\colon\tilde{h}_{t}(x)=1\})\mid\mathcal{F}_{t-1}\right]}\geq\gamma^{-(t-r)\rho}\eta|\Omega_{\epsilon}|.$$

Therefore,

$$\mathbb{E}[\mathbb{I}_{\{\mathbf{Y}_{t}\in\Omega_{\epsilon}^{c}\}}\mid\mathcal{F}_{t-1}]\leq 1-\left(\lambda\gamma^{-(t-r)\rho}\eta+(1-\lambda)\right)|\Omega_{\epsilon}|.$$

Moreover,

$$\begin{aligned}
\Pr\left(f(\tilde{x})-f^{*}>\epsilon\right)
&=\mathbb{E}\left[\mathbb{I}_{\{\mathbf{Y}_{1},\ldots,\mathbf{Y}_{r}\in\Omega_{\epsilon}^{c}\}}\,\mathbb{E}\left[\mathbb{I}_{\{\mathbf{Y}_{r+1}\in\Omega_{\epsilon}^{c}\}}\cdots\mathbb{E}\left[\mathbb{I}_{\{\mathbf{Y}_{T-1}\in\Omega_{\epsilon}^{c}\}}\,\mathbb{E}[\mathbb{I}_{\{\mathbf{Y}_{T}\in\Omega_{\epsilon}^{c}\}}\mid\mathcal{F}_{T-1}]\mid\mathcal{F}_{T-2}\right]\cdots\mid\mathcal{F}_{r}\right]\right]\\
&\leq\prod_{t=r+1}^{T}\left(1-\left(\lambda\gamma^{-(t-r)\rho}\eta+(1-\lambda)\right)|\Omega_{\epsilon}|\right)(1-|\Omega_{\epsilon}|)^{r}\\
&\leq\exp\left\{-\left(\sum_{t=r+1}^{T}\lambda\gamma^{-(t-r)\rho}\eta+(T-r)(1-\lambda)+r\right)|\Omega_{\epsilon}|\right\}\\
&=\exp\left\{-\left((T-r)\left(\frac{\gamma^{-\rho}+\gamma^{-(T-r)\rho}}{2}\lambda\eta+(1-\lambda)\right)+r\right)|\Omega_{\epsilon}|\right\}.
\end{aligned}$$

In order that $\Pr\left(f(\tilde{x})-f^{*}>\epsilon\right)\leq\delta$, it suffices that

$$\exp\left\{-\left((T-r)\left(\frac{\gamma^{-\rho}+\gamma^{-(T-r)\rho}}{2}\lambda\eta+(1-\lambda)\right)+r\right)|\Omega_{\epsilon}|\right\}\leq\delta,$$

hence the $(\epsilon,\delta)$-query complexity is upper bounded by

$$\mathcal{O}\left(\max\left\{\left(\frac{\gamma^{-\rho}+\gamma^{-(T-r)\rho}}{2}\lambda\eta+(1-\lambda)\right)^{-1}\left(\frac{1}{|\Omega_{\epsilon}|}\ln\frac{1}{\delta}-r\right)+r,\;T\right\}\right).$$
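To see the acceleration mechanism numerically, one can compare the effective per-step rates that the bounds of Theorems 2 and 3 invert: a larger rate yields a smaller first term in the bound. The sketch below uses illustrative parameter values only (they do not come from the paper's experiments):

```python
def rate_sracos(lam, eta, p):
    # Effective per-step rate in Theorem 2: lam*eta/p + (1 - lam).
    return lam * eta / p + (1.0 - lam)

def rate_race_cars(lam, eta, gamma, rho, T, r):
    # Effective per-step rate in Theorem 3:
    # ((gamma^-rho + gamma^(-(T-r)rho)) / 2) * lam * eta + (1 - lam).
    boost = (gamma ** (-rho) + gamma ** (-(T - r) * rho)) / 2.0
    return boost * lam * eta + (1.0 - lam)

# With 0 < gamma < 1 every factor gamma^(-k*rho) exceeds 1, so the region
# shrinking boosts the rate relative to SRACOS (here with p = 1), which in
# turn shrinks the query-complexity bound.
r2 = rate_sracos(lam=0.5, eta=0.1, p=1.0)
r3 = rate_race_cars(lam=0.5, eta=0.1, gamma=0.9, rho=0.01, T=1000, r=10)
assert r3 > r2
```

The comparison isolates the factor $\tfrac{\gamma^{-\rho}+\gamma^{-(T-r)\rho}}{2}>1$, which is exactly the multiplicative gain of RACE-CARS over SRACOS in the revisited bound.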

Sufficient Condition for Acceleration

Under the assumption that $f$ is dimensionally locally Hölder continuous, it is clear that

$$\Omega_{\epsilon}\subseteq\prod_{i=1}^{n}\left[x^{i}_{*}-\left(\frac{\epsilon}{L^{i}_{1}}\right)^{-\beta^{i}_{1}},\;x^{i}_{*}+\left(\frac{\epsilon}{L^{i}_{1}}\right)^{-\beta^{i}_{1}}\right].$$

Denote by $\tilde{x}_{t}=(\tilde{x}^{1}_{t},\ldots,\tilde{x}^{n}_{t}):=\arg\min_{j=1,\ldots,t}f(x_{j})$ the best solution found up to step $t$. The following sufficient condition gives a lower bound on the region-shrinking rate $\gamma$ and shrinking frequency $\rho$ such that RACE-CARS achieves acceleration over SRACOS.

Proposition 1.

Let $f$ be a dimensionally locally Hölder continuous objective, and assume that for $\epsilon>0$, $\Omega_{\epsilon}$ is $\eta$-shattered by $h_{t}$ for all $t=r+1,\ldots,T$. In order that $\Omega_{\epsilon}$ is also $\eta$-shattered by $\tilde{h}_{t}$, it is sufficient that the region-shrinking rate $\gamma$ and shrinking frequency $\rho$ satisfy, componentwise,

$$\frac{1}{2}\gamma^{t\rho}\|\Omega\|\geq\left(\tilde{x}^{1}_{t}-x^{1}_{*}+\left(\frac{\epsilon}{L^{1}_{1}}\right)^{-\beta^{1}_{1}},\;\ldots,\;\tilde{x}^{n}_{t}-x^{n}_{*}+\left(\frac{\epsilon}{L^{n}_{1}}\right)^{-\beta^{n}_{1}}\right).$$
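Reading the inequality componentwise, the sufficient condition can be checked numerically during a run. The sketch below uses entirely hypothetical values for the quantities that are unknown in practice ($x_{*}$, $L^{i}_{1}$, $\beta^{i}_{1}$); it only illustrates the shape of the test, not a practical stopping rule:

```python
def shrinking_condition_holds(t, gamma, rho, omega_widths,
                              x_tilde, x_star, eps, L, beta):
    """Componentwise check of the Proposition 1 condition (hypothetical values):
    (1/2) * gamma**(t*rho) * ||Omega||_i
        >= (x_tilde_i - x_star_i) + (eps / L_i)**(-beta_i)  for every i."""
    lhs = [0.5 * gamma ** (t * rho) * w for w in omega_widths]
    rhs = [(xt - xs) + (eps / Li) ** (-bi)
           for xt, xs, Li, bi in zip(x_tilde, x_star, L, beta)]
    return all(left >= right for left, right in zip(lhs, rhs))

# Hypothetical 2-dimensional instance: early in the run the shrunk region is
# still wide enough, but after many shrinking steps the condition fails.
kwargs = dict(gamma=0.9, rho=0.1, omega_widths=[10.0, 10.0],
              x_tilde=[0.2, 0.1], x_star=[0.0, 0.0],
              eps=0.09, L=[0.1, 0.1], beta=[0.5, 0.5])
assert shrinking_condition_holds(t=10, **kwargs)
assert not shrinking_condition_holds(t=400, **kwargs)
```

The failure at large $t$ reflects the trade-off the ablation study probes: too aggressive a shrinking schedule eventually excludes $\Omega_{\epsilon}$ from the sampling region.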