
Dynamic Exploration-Exploitation Trade-Off in Active Learning Regression with Bayesian Hierarchical Modeling

Upala Junaida Islam^a, Kamran Paynabar^b, George Runger^a, and Ashif Sikandar Iquebal^{a,*}
*Corresponding authors: [email protected], [email protected]
aSchool of Computing and Augmented Intelligence
Arizona State University USA
bH. Milton Stewart School of Industrial & Systems Engineering
Georgia Institute of Technology USA
Abstract

Active learning provides a framework to adaptively query the most informative experiments towards learning an unknown black-box function. Various active learning approaches have been proposed in the literature; however, they focus on either exploration or exploitation in the design space. Methods that do consider exploration and exploitation simultaneously employ fixed or ad-hoc measures to control the trade-off, which may not be optimal. In this paper, we develop a Bayesian hierarchical approach, referred to as BHEEM, to dynamically balance the exploration-exploitation trade-off as more data points are queried. To sample from the posterior distribution of the trade-off parameter, we formulate an approximate Bayesian computation approach based on the linear dependence of queried data in the feature space. Simulated and real-world examples show that the proposed approach achieves at least 21% and 11% average improvement when compared to pure exploration and pure exploitation strategies, respectively. More importantly, we note that by optimally balancing the trade-off between exploration and exploitation, BHEEM performs better than, or at least as well as, either pure exploration or pure exploitation.

Keywords: Active learning regression; Exploration-exploitation trade-off; Bayesian hierarchical model; Approximate Bayesian Computation.

1 Introduction

The past decade has witnessed a widespread adoption of machine learning techniques across a range of applications such as design and discovery of novel materials [1], monitoring and control of advanced manufacturing [2], and decision-making in healthcare [3]. However, a majority of the success stories of machine learning methods have been limited to fitting large experimental and simulated datasets [4]. With rising concerns about resource availability and increasing costs of conducting physical experiments and obtaining labeled datasets, there is a need to develop methodologies that will guide experimental efforts towards gathering the most informative experiments.

Active learning provides a framework to address these challenges by allowing the learning algorithm to adaptively select the most informative experiments in learning a concept (a classification or a regression model), reducing the need for obtaining large datasets [5]. Here, the informativeness of data is usually assessed via an acquisition function. Following the seminal work of Angluin on “Queries and Concept Learning” [6] where the acquisition function was a majority vote among a set of candidate hypotheses, several acquisition functions were introduced that are based on entropy [7], prediction uncertainty [8], etc. However, recent studies have shown that they may have exploration or exploitation bias [9]. For instance, a greedy sampling-based acquisition function proposed in [10] (see Equation (11)), tends to acquire the data points uniformly in the unexplored search space and therefore favors exploration, irrespective of the underlying target concept. An example of this is shown in Figure 3(a). In contrast, a Query-by-Committee (QBC) approach for active learning [11] tends to exploit the search space in the neighborhood where the target concept violates the continuity and, in some cases, uniform continuity assumptions. An example is shown in Figure 3(b).

This raises the classic, albeit important, question: how can we encode the exploration-exploitation trade-off in active learning for regression problems? Based on our survey of the literature, we note two prominent schools of thought. The first is based on constructing new acquisition functions, or appropriate modifications thereof, to balance exploration and exploitation [12]. An example of the latter is the hierarchical expected improvement [13], introduced in an attempt to achieve a non-myopic search in the context of Bayesian optimization. This was accomplished by modifying expected improvement [14] to encourage more exploration. However, the most commonly investigated approach is based on optimizing a combination of different acquisition functions for exploration and exploitation through a trade-off parameter [9] as,

\mathbf{x}^{*}=\underset{\mathbf{x}}{\arg\!\max}\left(\eta\mathcal{F}_{1}(\mathbf{x})+(1-\eta)\mathcal{F}_{2}(\mathbf{x})\right) \qquad (1)

Here, $\eta\in[0,1]$ is a parameter controlling the trade-off between $\mathcal{F}_{1}(\cdot)$ and $\mathcal{F}_{2}(\cdot)$, the acquisition functions corresponding to exploration and exploitation respectively, and $\mathbf{x}^{*}$ is the optimal data point to query according to the combined strategy. Nonetheless, the challenge with this approach is that $\eta$ is unknown and needs to be estimated during the learning process. Since it is difficult to estimate $\eta$ without observing future data points, existing approaches have relied on trial and error [15] or predefined ad-hoc measures depending on the number of queried data points [16]. The exploration-exploitation bias has received attention in the reinforcement learning literature as well. However, unlike active learning, the agent in reinforcement learning is only concerned with maximizing the long-term reward by finding an optimal sequence of actions via exploration-exploitation. Conversely, in active learning, we are concerned with maximizing the reward over a finite (often very small) number of experiments that we can perform. The active learner, therefore, needs to iteratively and effectively update the trade-off parameter as new data are queried.
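
To make Equation (1) concrete, the following sketch (in Python, with placeholder acquisition functions) selects the next query from a finite candidate pool by maximizing the $\eta$-weighted combination; the min-max normalization used to put the two scores on a comparable scale is our own assumption and is not prescribed by the equation.

```python
import numpy as np

def normalize(scores):
    # Min-max scaling so the two acquisition scores are comparable
    # (our own assumption; Equation (1) only specifies the weighted sum).
    rng = scores.max() - scores.min()
    return (scores - scores.min()) / rng if rng > 0 else np.zeros_like(scores)

def select_next(candidates, explore_score, exploit_score, eta):
    """x* = argmax_x [ eta * F1(x) + (1 - eta) * F2(x) ]   (Equation (1))."""
    f1 = normalize(explore_score(candidates))   # exploration acquisition F1
    f2 = normalize(exploit_score(candidates))   # exploitation acquisition F2
    return candidates[np.argmax(eta * f1 + (1.0 - eta) * f2)]

# Toy usage with random stand-ins for the two acquisition functions.
cands = np.linspace(0.0, 1.0, 101).reshape(-1, 1)
explore = lambda X: np.random.rand(len(X))
exploit = lambda X: np.random.rand(len(X))
x_next = select_next(cands, explore, exploit, eta=0.5)
```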

During active learning, the trade-off parameter is influenced by the noise in the labeled data at each query stage and by the variability in the labeled data across different query stages. This bi-level structure allows us to use a Bayesian hierarchical model to find the optimal trade-off parameter during the learning process. Here, the dependency of parameters related to the hierarchy is reflected in a joint probability model. The posterior density and uncertainty quantification of the parameters are obtained via Bayesian inference [17]. In the absence of closed-form expressions for the posterior distribution, a sampling-based Markov chain Monte Carlo (MCMC) method is used to obtain the marginal posterior distributions [18].

To this end, we present a Bayesian Hierarchical model to automatically balance the Exploration-Exploitation trade-off (referred to as BHEEM) in actively learning an unknown black-box function. In particular, BHEEM iteratively updates the trade-off parameter as data points are queried to minimize the generalization error. By imposing a hierarchical structure, we account for the variability in $\eta$ arising within each query iteration as well as across subsequent queries. We present an Approximate Bayesian Computation (ABC) approach along with a Metropolis-within-Gibbs algorithm to sample from the posterior distribution of $\eta$. We subscribe to Gaussian process regression to obtain the best fit for the underlying black-box function. It is critical to note that the proposed methodology is generalizable and not contingent on the choice of exploration and exploitation acquisition functions or the choice of the regression function. We evaluate the efficacy of BHEEM on six simulated and benchmark datasets and one real-world example in materials discovery. We also provide sensitivity analysis and numerical convergence results to establish the consistency of BHEEM.

The rest of the paper is organized as follows. Section 2 presents a brief review of the relevant literature. Section 3.1 develops the Bayesian hierarchical model for estimating the trade-off parameter to balance exploration and exploitation. Section 3.2 explains the Gaussian process regression model we employ for active learning. The exploration and exploitation approaches adopted in this work are presented in Section 3.3, followed by the trade-off strategies commonly used in the literature, summarized in Section 3.4. The experimental setup, results from simulated and real-world experiments, and sensitivity and convergence analyses are presented in Section 4. The paper is concluded in Section 5.

2 Literature Review

The roots of active learning can be traced back to the early 1960s with initial works focused on space-filling designs via sequential experimentation [19]. Later studies developed model-based sequential experimental designs where an underlying data-generating model is considered to guide the search for the subsequent experiments [20, 21, 22]. Model-based sequential design is commonly referred to as active learning in the machine learning community [5]. Over the past two decades, active learning has witnessed significant developments in intelligently querying and labeling data, both in the classification [23, 24, 25] and regression tasks [11, 26, 10]. In this section, we investigate active learning in regression problems with a particular emphasis on the exploration-exploitation trade-off.

Two pioneering works in active learning for regression problems are Active Learning - MacKay (ALM), based on maximizing the expected information gain [27], and Active Learning - Cohn (ALC), based on minimizing the generalization error [8]. MacKay defined the information gain as the change in Shannon’s entropy [28] before and after labeling data. Entropy has also been formulated as a measure for querying data in classification problems within the framework of “uncertainty sampling” [29]. Cohn et al. [8], on the other hand, demonstrated that the expected generalization error is composed of data noise, model bias, and variance. Data noise is independent of the model, and model bias is invariant given a fixed model. Thus, the criterion for querying data was simplified to the minimization of the total variance. Meka et al. [30] extended the ALC strategy by adding a regularizing term to integrate the information of both queried and unqueried data.

A more theoretically-motivated active learning strategy called Query-by-Committee (QBC) [31] was first developed for regression by Krogh and Vedelsby [32]. QBC maintains a committee of hypotheses that are simultaneously trained on the labeled data. The data point that maximizes the disagreement between the committee members is deemed the most informative. A variation of QBC, Expected Model Change Maximization (EMCM), was also developed in a regression setting where the expected model change was constructed using the gradient of the error with respect to candidate data [26]. O’Neil et al. [33] compared QBC and EMCM along with a few model-free strategies that select the unlabeled data based on density or diversity, and through an acquisition function named Exploration Guided Active Learning (EGAL) combining both density and diversity [34]. Results from O’Neil et al. [33] indicated that the integral properties of the dataset, such as its geometry, can be promising candidates for active learning strategies. A similar diversity-based passive sampling strategy, Greedy Sampling (GS), was adapted as improved Greedy Sampling (iGS) for active learning by incorporating the response information, ensuring exploration in both input and output space [10].

While significant developments have taken place in the active learning literature, only a handful of efforts have considered the problem of the exploration-exploitation trade-off, the majority of which are limited to classification problems. For example, past studies have used maximum entropy [7] and distance- and similarity-based metrics for exploration [35], while subscribing mostly to uncertainty and redundancy to exploit near the current decision boundary [36]. Approaches to handle the exploration-exploitation bias include switching between exploration and exploitation based on their performance at each iteration [35], an explore-then-exploit approach [37], a random probabilistic measure [16], or combining exploration and exploitation in a predefined ratio [9, 33, 38]. Combining the acquisition functions based on cross-validation, sensitivity analysis, or trial-and-error [15] can only be performed retrospectively and does not help with prospective data queries.

Beyond active learning, the exploration-exploitation problem has also been investigated in the reinforcement learning literature. Here, the exploitation of the greedy approach has been confronted with various exploratory approaches. The Upper Confidence Bound (UCB) is one of the most celebrated approaches against this bias: uncertainty is used to balance exploration and exploitation instead of choosing the greedy action according to current knowledge [39]. The $\epsilon$-greedy approach decays the randomness, or the rate of exploration, over the learning process to encourage more exploitation of the observed knowledge [40]. The Boltzmann exploration approach uses exponential weighting schemes to balance exploration and exploitation in a probabilistic fashion [41]. The balancing parameter has also been obtained using a trial-and-error process [42] or controlled based on the variation of action results and the perception of environmental change [43].

Based on the literature review, we notice that existing methods have considered either a predefined ratio, a probabilistic function, or trial-and-error methods to control the exploration-exploitation trade-off. None of these efforts has focused on adaptively updating the trade-off parameter as more data are queried. This study advances the research on dynamically updating the trade-off between exploration and exploitation as more knowledge is gathered about the black-box function from the queried data.

3 Methodology

In this section, we present the general schema of active learning along with the models and strategies we employ in this study. Let $X=\{\mathbf{x}_{1},\mathbf{x}_{2},\ldots\}$ and $\mathbf{y}=\{y_{1},y_{2},\ldots\}$ represent the collection of all possible design points and their corresponding responses, respectively. In the generic problem of active learning, we consider that the actively growing set of queried (i.e., labeled) data $\mathcal{D}^{N}=\{(\mathbf{x}_{o_{i}},y_{o_{i}})\}_{i=1}^{N}$ is accessed by the underlying model. Here, $\mathbf{x}_{o_{i}}\in\mathbb{R}^{d}$ and $y_{o_{i}}\in\mathbb{R}$ are the queried data points and their corresponding output responses, respectively. The objective of active learning in a regression setting is to select the next data point $\mathbf{x}_{o_{N+1}}$ from the search space $X$ that will be the most informative towards learning the unknown black-box function $f(\mathbf{x})$ given the current knowledge. The response $y_{o_{N+1}}$ is obtained via simulation or physical experiments. By actively selecting the data to learn from, the learner can rapidly reduce the generalization error, i.e., the prediction error of the algorithm at the data unobserved so far. The process is iterated until a stopping criterion, e.g., a maximum number of labeled data points or a minimum predicted change, has been satisfied. See Figure 1 for a flow chart of the active learning approach.

Refer to caption
Figure 1: General schema of an active learner.

3.1 Exploration-exploitation trade-off

In machine learning, exploration refers to generating novel information in the search space while exploitation focuses on improving decisions (e.g., maximizing reward) based on the available information. Although this is a generally accepted definition of exploration and exploitation [44], its usage varies depending on the context. Particularly, in the regression setting, exploration refers to sampling data from the unobserved regions of the search space to gather more (potentially novel) information about the black-box function, such as local minima/maxima. In contrast, exploitation aims at accurately capturing the black-box function in the regions where it is highly unpredictable and has a sharp change or discontinuity. As mentioned in the foregoing, it is possible to encode exploration and exploitation separately in an acquisition function [34]; however, it is challenging to dynamically balance between the two [45]. We consider a general framework for sampling data points that maximizes a linear combination of acquisition functions aimed at simultaneous exploration and exploitation, as presented in Equation (1). The next data point is queried through pure exploitation when $\eta=0$, and pure exploration when $\eta=1$. By increasing $\eta$ from 0 to 1, the degree of exploration in the constructed acquisition function increases.

To estimate $\eta$ dynamically, we consider two levels of variability. The first is the variability in the queried data emerging from the noise and measurement errors in $y$ at each querying stage. The second level of variability is associated with the dynamic nature of $\eta$ across querying stages due to the nature of the unknown black-box function (e.g., jumps or discontinuities). This bi-level nature of variability motivates the need for a hierarchical model. A Bayesian hierarchical model allows us to capture this hierarchy while simultaneously encoding the uncertainty at each of the levels. The trade-off parameter $\eta_{j}$ captures the first level of variability at querying stage $j$ given the set of hyper-parameters $\bm{\theta}$. We model $\{\eta_{j}|\bm{\theta}_{j}\}\stackrel{iid}{\sim}p(\eta|\bm{\theta}_{j})$. Here, the conditional independence of $\{\eta_{j}|\bm{\theta}_{j}\}$ tells us that the trade-off parameters at one sampling stage are exchangeable (de Finetti's theorem [46]), but not independent. To capture the variability in $\eta$ across multiple sampling stages, we model $\{\bm{\theta}_{j}|\bm{\psi}\}\stackrel{iid}{\sim}p(\bm{\theta}|\bm{\psi})$, where $\bm{\psi}$ is another set of hyper-parameters fixed from a priori assumptions. By de Finetti's theorem, this conditional independence implies that the trade-off parameters at subsequent querying stages are exchangeable as well. Indeed, this aligns with our assumption that the trade-off parameter at any querying stage is not completely independent of the previous actions (exploration or exploitation). In the next two subsections, we discuss the prior and the sampling distributions employed in this study.

3.1.1 Prior distribution

Towards optimally estimating $\eta$ using a Bayesian hierarchical framework, we begin by considering a prior distribution based on the knowledge that $\eta$ lies between 0 (pure exploitation) and 1 (pure exploration). With this knowledge, we let $\eta$ follow a beta distribution, $\eta\sim\text{Beta}(\alpha,\beta)$, with hyperparameters $\alpha$ and $\beta$. The combined selection of $\alpha$ and $\beta$ defines the shape of the prior distribution of $\eta$, as observed in Figure 2.

Refer to caption
Figure 2: Probability density function for the beta distribution.

A larger value of $\alpha$ shifts the bulk of the probability towards 1 and emphasizes exploration (e.g., $\alpha=5,\beta=0.1$), whereas an increase in $\beta$ moves the distribution towards 0 and encourages exploitation (e.g., $\alpha=0.1,\beta=5$). Setting $\alpha=\beta<1$ creates a U-shaped distribution with maximum probability near 0 and 1, imposing a similarly high probability on both pure exploration and pure exploitation (e.g., $\alpha=\beta=0.1$). On the other hand, $\alpha=\beta>1$ generates a bell-shaped distribution with the maximum probability at the middle $(\eta=0.5)$, imposing a lower probability on pure exploration and exploitation and a higher probability on a mixture of them (e.g., $\alpha=\beta=5$). Finally, $\alpha=\beta=1$ creates a uniform distribution with equal probability everywhere and is indifferent towards the selection of $\eta$. Based on extensive simulations, we note that changing the range of $\alpha$ and $\beta$ beyond the interval $[a,b]=[0.1,5.0]$ has no significant impact on the exploration-exploitation trade-off. Without any information about the nature of the black-box function, we impose a uniform hyperprior of the form $\alpha,\beta\sim\text{Uniform}(0.1,5.0)$.
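
As a quick illustration of how $(\alpha,\beta)$ shapes the prior on $\eta$, the short sketch below evaluates the Beta density (via scipy.stats.beta) at the representative settings mentioned above; it is only intended to reproduce the qualitative shapes shown in Figure 2.

```python
import numpy as np
from scipy.stats import beta

eta_grid = np.linspace(0.01, 0.99, 99)

# Hyperparameter pairs discussed in the text: exploration-heavy, exploitation-heavy,
# U-shaped, bell-shaped, and uniform priors on the trade-off parameter eta.
settings = {
    "exploration-heavy": (5.0, 0.1),
    "exploitation-heavy": (0.1, 5.0),
    "U-shaped": (0.1, 0.1),
    "bell-shaped": (5.0, 5.0),
    "uniform": (1.0, 1.0),
}

for name, (a, b) in settings.items():
    density = beta.pdf(eta_grid, a, b)
    print(f"{name:20s} highest density near eta = {eta_grid[np.argmax(density)]:.2f}")
```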

3.1.2 Posterior and sampling distribution

The joint posterior distribution of $\eta$ and the hyper-parameters is,

p(\eta,\alpha,\beta|\mathbf{y},X)\propto p(\alpha)p(\beta)p(\eta|\alpha,\beta)p(\mathbf{y},X|\eta) \qquad (2)

where $(X,\mathbf{y})$ represents the collection of both labeled and unlabeled data. In Equation (2), $p(\mathbf{y},X|\eta)$ is the likelihood of $\eta$, denoted by $\mathcal{L}(\eta;\mathbf{y},X)$, and $p(\eta|\alpha,\beta)$ is the prior. The hyperprior factorizes as $p(\alpha)p(\beta)=p(\alpha|a)p(\beta|b)$ due to the independence of $\alpha$ and $\beta$ as well as the a priori selection of $a$ and $b$. The joint posterior distribution in Equation (2) is not available in closed form due to the intractable likelihood function $p(\mathbf{y},X|\eta)$. Therefore, the unknown parameters are sampled from their respective full conditional distributions using Gibbs sampling. The individual full conditional distributions of $\alpha,\beta,\eta$ are factorized as,

p(\alpha|\beta,\eta)\propto p(\alpha)\,p(\eta|\alpha,\beta)
p(\beta|\alpha,\eta)\propto p(\beta)\,p(\eta|\alpha,\beta)
p(\eta|\alpha,\beta,\mathbf{y},X)\propto p(\eta|\alpha,\beta)\,p(\mathbf{y},X|\eta) \qquad (3)

The Gibbs sampler proceeds by iteratively constructing a dependent sequence of parameter values whose distribution converges to the target joint posterior distribution [47]. However, unlike those of $\alpha$ and $\beta$, the likelihood of $\eta$ is intractable since the generation of new data from an observed $\eta$ does not follow any known distribution. In the absence of a tractable likelihood, we subscribe to Approximate Bayesian Computation (ABC), which has been used in numerous previous applications where evaluation of the likelihood is expensive or infeasible [48], to approximate the posterior distribution of $\eta$.

The traditional ABC algorithm [49], also known as ABC rejection sampler, follows a two-step procedure to sample from the posterior distribution. First, it draws samples of the unknown parameter from its prior distribution. Second, the sampled parameter values are accepted based on the similarity between the data simulated under some model specified by the sampled parameter values and the queried data, or between the summary statistics thereof. In other words, ABC uses summary statistics to filter samples that do not agree with the queried data. Though efficient in simple probability models, ABC rejection is limited in complex ones where the posterior is greatly different from the prior since it involves sampling from the prior [50]. Different computational methods have been presented in the literature to overcome this limitation of ABC inference, among which ABC-MCMC and its variants have been widely used and accepted [50]. The ABC-MCMC algorithm combines ABC with Metropolis-Hastings or Metropolis algorithm for iterative approximate Bayesian computation. In the following, we will discuss the Metropolis sampling followed by a new approach to implement ABC and summary statistics into BHEEM.
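
For concreteness, the generic ABC rejection sampler described above can be sketched as follows; this is a toy Python example with a placeholder simulator and summary statistic, not part of BHEEM itself.

```python
import numpy as np

def abc_rejection(prior_sampler, simulator, summary, observed, tol, n_samples, seed=None):
    """Generic ABC rejection sampler.

    A parameter draw is accepted when the distance between the summary statistic
    of data simulated under that draw and that of the observed data is below `tol`.
    """
    rng = np.random.default_rng(seed)
    s_obs = summary(observed)
    accepted = []
    while len(accepted) < n_samples:
        theta = prior_sampler(rng)                        # step 1: draw from the prior
        simulated = simulator(theta, rng)                 # simulate data under theta
        if np.abs(summary(simulated) - s_obs) <= tol:     # step 2: filter by similarity
            accepted.append(theta)
    return np.array(accepted)

# Toy usage: infer the mean of a Gaussian with known unit variance.
observed = np.random.default_rng(0).normal(loc=2.0, scale=1.0, size=50)
posterior = abc_rejection(
    prior_sampler=lambda rng: rng.uniform(-5.0, 5.0),
    simulator=lambda mu, rng: rng.normal(mu, 1.0, size=50),
    summary=np.mean,
    observed=observed,
    tol=0.1,
    n_samples=200,
    seed=1,
)
```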

Metropolis sampling is a simplified form of the Metropolis-Hastings sampling algorithm. It uses a symmetric proposal distribution $J(\theta|\theta_{old})$ for the parameter vector $\theta$ given the most recently accepted sample $\theta_{old}$ at each sampling stage. Each proposed sample is either accepted or rejected depending on an acceptance ratio computed from the priors and likelihoods of the proposed and old samples. The ABC-MCMC algorithm shadows the steps of a typical Metropolis algorithm: the initial sample is drawn from the prior, and subsequent ones are sampled from a proposal distribution centered at the most recently accepted sample. However, the acceptance ratio constructed by ABC uses summary statistics, thereby replacing the likelihood ratio used in the regular Metropolis algorithm.

We initiate the Gibbs sampler with $\bm{\theta}^{(0)}=\{\alpha^{(0)},\beta^{(0)},\eta^{(0)}\}$ and generate a dependent sequence of the parameter vector, $\{\bm{\theta}^{(1)},\bm{\theta}^{(2)},\ldots,\bm{\theta}^{(s)}\}$. Given a current state of the parameters $\bm{\theta}^{(s)}=\{\alpha^{(s)},\beta^{(s)},\eta^{(s)}\}$, the next state of the parameter vector is generated as follows,

Step 1: Sample $\alpha^{(s+1)}\sim p(\alpha|\beta^{(s)},\eta^{(s)})$

Step 2: Sample $\beta^{(s+1)}\sim p(\beta|\alpha^{(s+1)},\eta^{(s)})$

Step 3: Sample $\eta^{(s+1)}\sim p(\eta|\alpha^{(s+1)},\beta^{(s+1)},\mathbf{y},X)$: As mentioned before, we employ ABC-MCMC at this stage. Similar to a typical Metropolis algorithm, we sample $\eta^{(s+1)}$ until the chain converges. We propose $\hat{\eta}$ from a symmetric proposal distribution centered at $\eta_{old}^{(s+1)}$. The proposed sample is accepted $\left(\eta_{new}^{(s+1)}=\hat{\eta}\right)$ with probability $\min(1,r_{\eta})$ or rejected $\left(\eta_{new}^{(s+1)}=\eta_{old}^{(s+1)}\right)$ with probability $1-r_{\eta}$, where the acceptance ratio $r_{\eta}$ is constructed as,

r_{\eta}=\frac{p\left(\hat{\eta}|\alpha,\beta,\mathbf{y},X\right)}{p\left(\eta_{old}^{(s+1)}|\alpha,\beta,\mathbf{y},X\right)}=\frac{p\left(\hat{\eta}|\alpha_{new}^{(s+1)},\beta_{new}^{(s+1)}\right)p\left(\mathbf{y},X|\hat{\eta}\right)}{p\left(\eta_{old}^{(s+1)}|\alpha_{new}^{(s+1)},\beta_{new}^{(s+1)}\right)p\left(\mathbf{y},X|\eta_{old}^{(s+1)}\right)}

However, we do not have access to an explicit expression for the likelihood $p(\mathbf{y},X|\eta)$. To check the fitness of the proposed sample $\hat{\eta}$ and to compare it with the current sample $\eta_{old}^{(s+1)}$ in the absence of the likelihood, ABC reconstructs the acceptance ratio $r_{\eta}$ as,

r_{\eta}=\frac{p\left(\hat{\eta}|\alpha_{new}^{(s+1)},\beta_{new}^{(s+1)}\right)}{p\left(\eta_{old}^{(s+1)}|\alpha_{new}^{(s+1)},\beta_{new}^{(s+1)}\right)}\,\mathbb{I}\{d(\mathcal{S}(\hat{\mathbf{x}}),\mathcal{S}(X_{o}))\geq\nu\} \qquad (4)

where $\mathcal{S}(\hat{\mathbf{x}})$ and $\mathcal{S}(X_{o})$ are the summary statistics for the data nominated by $\hat{\eta}$ and the queried data respectively, $d(\cdot,\cdot)$ is a distance measure between the two statistics, and $\mathbb{I}\{\cdot\}$ is an indicator function. In traditional ABC, the fitness of a candidate sample is measured in terms of its similarity to the observed data. However, in active learning, it is measured in terms of the new information generated by each new query. To measure the fitness of candidate queries, we utilize their linear dependency on the already queried data in a Hilbert space. A data point $\hat{\mathbf{x}}$ is considered approximately linearly dependent (ALD) on the already queried set $X_{o}$ if,

\delta=\underset{\mathbf{a}}{\min}{\left\|\sum_{i=1}^{N-1}a_{i}\phi(\mathbf{x}_{o_{i}})-\phi(\hat{\mathbf{x}})\right\|}^{2}\leq\nu \qquad (5)

where $\nu$ is a threshold parameter determining the level of sparsity, and $\phi$ is a finite-dimensional mapping of the feature vectors $\mathbf{x}$ to a Hilbert space. If the ALD condition in Equation (5) holds, $\phi(\hat{\mathbf{x}})$ can be inferred from the already queried data $X_{o}$ as $\phi(\hat{\mathbf{x}})\approx\sum_{i}a_{i}\phi(\mathbf{x}_{o_{i}})$. In this case, no new information is gained and we reject the candidate query. In contrast, if the ALD criterion is not satisfied, i.e., $\delta>\nu$, then we select the candidate query and collect its response.

However, to use the ALD criterion to measure the fitness of candidate queries, we need to define the mapping $\phi$. In the absence of an explicit form of $\phi$, we invoke Mercer's Theorem [51], which guarantees the existence of a kernel function $k(\mathbf{x},\mathbf{x}^{\prime})$ such that $k(\mathbf{x},\mathbf{x}^{\prime})=\langle\phi(\mathbf{x}),\phi(\mathbf{x}^{\prime})\rangle$. As a result, all the calculations in the feature space can be performed by defining the kernel function instead of the mapping $\phi$. Substituting the kernel function for $\phi$, we obtain,

\mathbf{a}^{*}=K(X_{o},X_{o})^{-1}K(X_{o},\hat{\mathbf{x}})\quad\text{ and }\quad\delta=K(\hat{\mathbf{x}},\hat{\mathbf{x}})-K(X_{o},\hat{\mathbf{x}})^{T}\mathbf{a}^{*} \qquad (6)

For each candidate $\hat{\eta}$ proposed by the Metropolis algorithm, we check the linear dependency of the nominated query $\hat{\mathbf{x}}$. In particular, $\hat{\mathbf{x}}$ is accepted for querying if $\delta\geq\nu$, so that $\mathbb{I}\{\delta\geq\nu\}=1$. In that case, we compare the priors of $\hat{\eta}$ and $\eta_{old}^{(s+1)}$ before accepting one of them as $\eta_{new}^{(s+1)}$. On the other hand, if $\delta<\nu$, then $\mathbb{I}\{\delta\geq\nu\}=0$, implying $\eta_{new}^{(s+1)}=\eta_{old}^{(s+1)}$ irrespective of the priors. The authors in [52] experimented with different values of $\nu$ using cross-validation. Through the sensitivity analysis of $\nu$ discussed in Section 4.4, we fixed $\nu=0.001$ for all our experiments.
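
A minimal sketch of the ALD check in Equations (5)-(6), assuming a squared exponential kernel for $k(\cdot,\cdot)$ (any valid Mercer kernel could be substituted):

```python
import numpy as np

def rbf_kernel(A, B, length_scale=1.0):
    # Squared exponential kernel; any valid Mercer kernel could be used instead.
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq_dists / length_scale**2)

def ald_delta(X_queried, x_candidate, length_scale=1.0, jitter=1e-8):
    """delta = k(x,x) - K(X_o,x)^T K(X_o,X_o)^{-1} K(X_o,x)   (Equation (6))."""
    K_oo = rbf_kernel(X_queried, X_queried, length_scale)
    K_ox = rbf_kernel(X_queried, x_candidate[None, :], length_scale)
    a_star = np.linalg.solve(K_oo + jitter * np.eye(len(X_queried)), K_ox)
    k_xx = rbf_kernel(x_candidate[None, :], x_candidate[None, :], length_scale)
    return (k_xx - K_ox.T @ a_star).item()

# A candidate close to the queried set is nearly linearly dependent (small delta),
# while a distant candidate yields a larger delta and can pass the threshold nu.
X_o = np.array([[0.0], [0.1], [0.2]])
print(ald_delta(X_o, np.array([0.15])) >= 0.001)
print(ald_delta(X_o, np.array([3.00])) >= 0.001)
```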

Step 4: Let $\bm{\theta}^{(s+1)}=\{\alpha^{(s+1)},\beta^{(s+1)},\eta^{(s+1)}\}$. After generating $\{\bm{\theta}^{(1)},\bm{\theta}^{(2)},\ldots,\bm{\theta}^{(s)}\}$, the approximation of $\eta_{j}$ is obtained by averaging the observed $\{\eta_{j}^{(1)},\eta_{j}^{(2)},\ldots,\eta_{j}^{(s)}\}$.

In this work, we define the proposal distribution as $\mathcal{N}(\eta_{old},\tau^{2})$, where $\eta_{old}$ is the sample accepted in the last iteration and $\tau^{2}$ is the variance. The performance of the MCMC chain depends on the variance of the proposal distribution, $\tau^{2}$. If $\tau$ is too small, e.g., 0.001, the Metropolis sampler can get stuck at one sample and the chain may take a long time to converge. On the other hand, if $\tau$ is too large, e.g., 0.25, it is possible to sample highly diverse and sometimes infeasible values, increasing the rejection rate as well as hampering the rate of convergence. A detailed sensitivity analysis of the effect of $\tau$ on the convergence of the algorithm is discussed in Section 4.4.
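
The following sketch puts Steps 1-4 together as one Metropolis-within-Gibbs sweep. It is a simplified illustration under several assumptions of ours: the draws from the full conditionals of $\alpha$ and $\beta$ in Steps 1-2 are approximated by short Metropolis runs, `nominate_query` stands in for the data point nominated by Equation (1) at a given $\eta$, and `ald_delta` is the ALD computation of Equation (6) (e.g., the function sketched earlier).

```python
import numpy as np
from scipy.stats import beta as beta_dist

def bheem_gibbs_sweep(eta, a, b, nominate_query, X_queried, ald_delta,
                      nu=0.001, tau=0.1, n_metropolis=1000, seed=None):
    """One Gibbs sweep over (alpha, beta, eta) with ABC-MCMC for eta (Steps 1-4)."""
    rng = np.random.default_rng(seed)

    def metropolis_hyper(current, other, is_alpha, n_steps=50):
        # Approximate draw from p(alpha | beta, eta) or p(beta | alpha, eta):
        # Uniform(0.1, 5) hyperprior times the Beta(alpha, beta) density of eta.
        for _ in range(n_steps):
            prop = current + tau * rng.standard_normal()
            if not 0.1 <= prop <= 5.0:
                continue
            num = beta_dist.pdf(eta, prop, other) if is_alpha else beta_dist.pdf(eta, other, prop)
            den = beta_dist.pdf(eta, current, other) if is_alpha else beta_dist.pdf(eta, other, current)
            if rng.random() < num / max(den, 1e-300):
                current = prop
        return current

    a = metropolis_hyper(a, b, is_alpha=True)    # Step 1
    b = metropolis_hyper(b, a, is_alpha=False)   # Step 2

    # Step 3: ABC-MCMC for eta with a N(eta_old, tau^2) proposal (Equation (4)).
    samples = []
    for _ in range(n_metropolis):
        eta_hat = eta + tau * rng.standard_normal()
        if 0.0 <= eta_hat <= 1.0:
            x_hat = nominate_query(eta_hat)                  # query nominated by eta_hat
            informative = ald_delta(X_queried, x_hat) >= nu  # indicator in Equation (4)
            r = beta_dist.pdf(eta_hat, a, b) / max(beta_dist.pdf(eta, a, b), 1e-300)
            if informative and rng.random() < r:
                eta = eta_hat
        samples.append(eta)

    # Step 4: summarize the eta chain by its mean.
    return float(np.mean(samples)), a, b

# Toy usage with stand-in components (not the actual BHEEM acquisition step):
X_o = np.array([[0.0], [0.5], [1.0]])
toy_ald = lambda X, x: float(np.min(np.linalg.norm(X - x, axis=1)))  # crude novelty proxy
toy_nominate = lambda eta: np.array([eta])                            # placeholder query map
eta_hat, a_hat, b_hat = bheem_gibbs_sweep(0.5, 1.0, 1.0, toy_nominate, X_o, toy_ald, seed=0)
```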

3.2 Gaussian Process Regression

In the absence of any known functional form of $f(\mathbf{x})$, we consider Gaussian process regression (GPR) as our underlying learning model. GPR considers a distribution over the underlying function $f(\mathbf{x})$ and aims to specify the black-box function by its mean and covariance. Due to the noise inherent in the labeled data without the learner's knowledge, the output $y$ is modeled as $y=f(\mathbf{x})+\varepsilon$, where $\varepsilon\sim\mathcal{N}\left(0,\sigma_{n}^{2}\right)$. Observing the noisy outputs $\mathbf{y}_{o}$, GPR attempts to reconstruct the underlying function $f(\mathbf{x})$ by removing the contaminating noise $\varepsilon$ [53]. We denote the queried dataset as $(X_{o},\mathbf{y}_{o})$. GPR imposes a zero-mean Gaussian process prior over the noisy outputs such that,

\mathbf{y}_{o}\sim\mathcal{N}\left(\mathbf{0},K(X_{o},X_{o})+\sigma_{n}^{2}I\right) \qquad (7)

The joint prior distribution between the training output set $\mathbf{y}$ and the test output set $\hat{\mathbf{f}}$ is as follows,

\begin{bmatrix}\mathbf{y}\\ \hat{\mathbf{f}}\end{bmatrix}\sim\mathcal{N}_{m}\left(\mathbf{0},\begin{bmatrix}K(X_{o},X_{o})+\sigma_{n}^{2}I&K(X_{o},X_{u})\\ K(X_{u},X_{o})&K(X_{u},X_{u})\end{bmatrix}\right) \qquad (8)

where $X_{u}$ is the set of unlabeled test points. The posterior distribution at the test data is given as $\{\hat{\mathbf{f}}|X_{o},\mathbf{y}_{o},X_{u}\}\sim\mathcal{N}\left(\bar{\mathbf{f}},\text{cov}(\hat{\mathbf{f}})\right)$, where,

\bar{\mathbf{f}}=K(X_{u},X_{o})[K(X_{o},X_{o})+\sigma_{n}^{2}I]^{-1}\mathbf{y}_{o} \qquad (9)
\text{cov}(\hat{\mathbf{f}})=K(X_{u},X_{u})-K(X_{u},X_{o})[K(X_{o},X_{o})+\sigma_{n}^{2}I]^{-1}K(X_{o},X_{u}) \qquad (10)
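
A compact sketch of the posterior computations in Equations (9)-(10), shown with a squared exponential kernel for brevity (the experiments in Section 4 use a Matérn 3/2 kernel instead):

```python
import numpy as np

def sq_exp_kernel(A, B, length_scale=1.0, signal_var=1.0):
    # Squared exponential kernel; swap in any other covariance function as needed.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return signal_var * np.exp(-0.5 * d2 / length_scale**2)

def gpr_posterior(X_o, y_o, X_u, noise_var=0.01, **kern):
    """Posterior mean and covariance of Equations (9)-(10)."""
    K_oo = sq_exp_kernel(X_o, X_o, **kern) + noise_var * np.eye(len(X_o))
    K_uo = sq_exp_kernel(X_u, X_o, **kern)
    K_uu = sq_exp_kernel(X_u, X_u, **kern)
    mean = K_uo @ np.linalg.solve(K_oo, y_o)             # Equation (9)
    cov = K_uu - K_uo @ np.linalg.solve(K_oo, K_uo.T)    # Equation (10)
    return mean, cov

# Toy usage on a noisy 1-D function.
X_o = np.linspace(0, 5, 8)[:, None]
y_o = np.sin(X_o).ravel() + 0.1 * np.random.default_rng(0).standard_normal(8)
X_u = np.linspace(0, 5, 50)[:, None]
f_bar, f_cov = gpr_posterior(X_o, y_o, X_u)
```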

3.3 Acquisition functions

In this section, we present some of the most commonly used acquisition functions and characterize them either as an exploration or an exploitation strategy based on the literature and our extensive simulation studies.

• Improved Greedy Sampling: As mentioned in Section 2, the concept of improved greedy sampling (iGS) was introduced to ensure diversity in the queried data [10]. At every iteration, iGS determines the unexplored region by searching over the entire region in both the input and output spaces. It queries the next data point that is located farthest from its nearest training point, or in an unobserved region according to the current prediction by the regression model. The acquisition function at $\mathbf{x}_{j}$ is calculated as $\min\left(u(\mathbf{x}_{j})v(\mathbf{x}_{j})\right)$, where $u(\mathbf{x}_{j})=\|\mathbf{x}_{j}-X_{o}\|_{2}$ and $v(\mathbf{x}_{j})=\|\bar{f}_{j}-\mathbf{y}_{o}\|_{2}$ refer to the distances of $\mathbf{x}_{j}$ from the training points in the input and output space respectively. Here $\mathbf{y}_{o}$ is the set of observed outputs, or labels, at the training points $X_{o}$, $\bar{f}_{j}$ is the predicted output at the test point $\mathbf{x}_{j}$, and $\|\cdot\|_{2}$ is the L2 norm. iGS defines the distance at $\mathbf{x}_{j}$ as $u(\mathbf{x}_{j})v(\mathbf{x}_{j})$ instead of $u(\mathbf{x}_{j})+v(\mathbf{x}_{j})$ or $\left(u(\mathbf{x}_{j})\right)^{2}+\left(v(\mathbf{x}_{j})\right)^{2}$ because the possibly very different scales of $u$ and $v$ can hamper the latter two formulas through the dominance of one measure over the other. It then selects the next data point, $\mathbf{x}_{\text{iGS}}^{*}$, such that,

\mathbf{x}_{\text{iGS}}^{*}=\underset{\mathbf{x}_{j}}{\arg\!\max}{\left(\min{\left(\|\mathbf{x}_{j}-X_{o}\|_{2}\,\|\bar{f}_{j}-\mathbf{y}_{o}\|_{2}\right)}\right)} \qquad (11)

By selecting data from unexplored regions in both input and output spaces, iGS avoids sampling from any concentrated region and hence satisfies the requirement for an exploration strategy.
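
A small sketch of the iGS rule in Equation (11) over a finite candidate pool; the GPR prediction `f_bar_cand` is assumed to come from Equation (9):

```python
import numpy as np

def igs_select(X_cand, f_bar_cand, X_o, y_o):
    """Improved greedy sampling (Equation (11)): maximize the minimum product of
    input- and output-space distances to the already queried data."""
    d_x = np.linalg.norm(X_cand[:, None, :] - X_o[None, :, :], axis=-1)  # input distances
    d_y = np.abs(f_bar_cand[:, None] - y_o[None, :])                     # output distances
    score = (d_x * d_y).min(axis=1)      # min over queried points of u*v
    return int(np.argmax(score))         # index of the next candidate to query

# Toy usage with a stand-in GPR posterior mean over the candidates.
X_o = np.array([[0.0], [1.0]]); y_o = np.array([0.0, 0.8])
X_cand = np.linspace(0, 2, 21)[:, None]
f_bar_cand = np.sin(X_cand).ravel()
next_idx = igs_select(X_cand, f_bar_cand, X_o, y_o)
```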

• Query by Committee: Successfully applied in both classification and regression problems, Query by Committee (QBC) is a framework rooted in the concept of utilizing an ensemble of hypotheses [31]. Maintaining a committee of models, QBC queries the data point where the committee members disagree the most about a measure of interest. The committee members are all trained on the same set of training points but with competing hypotheses or regression models. Considering a committee of $Q$ models denoted as $h_{1},h_{2},\ldots,h_{Q}$, we define the measure of disagreement between two models $h_{l}$ and $h_{p}$ at $\mathbf{x}_{j}$ as the absolute difference of the predictions by the respective models at that point, i.e., $|h_{l}(\mathbf{x}_{j})-h_{p}(\mathbf{x}_{j})|$. The acquisition function at $\mathbf{x}_{j}$ is then defined as $\max_{l,p}(|h_{l}(\mathbf{x}_{j})-h_{p}(\mathbf{x}_{j})|)$, representing the maximum disagreement at $\mathbf{x}_{j}$, where $l,p=1,2,\ldots,Q$. The next data point, $\mathbf{x}_{\text{QBC}}^{*}$, is selected such that,

\mathbf{x}_{\text{QBC}}^{*}=\underset{\mathbf{x}_{j}}{\arg\!\max}{\left(\underset{l,p}{\max}\left(|h_{l}(\mathbf{x}_{j})-h_{p}(\mathbf{x}_{j})|\right)\right)} \qquad (12)

A QBC approach based on GPR allows us to exploit the regions in the search space where the function's behavior is uncertain, i.e., near discontinuities or change points. As indicated in [53], the mean square (MS) continuity and differentiability of kernel functions control their flexibility. For example, let us compare GPR models with exponential, Matérn 3/2, Matérn 5/2, and squared exponential kernel functions. The non-differentiable exponential kernel generates the roughest predictions, whereas the infinitely differentiable squared exponential kernel produces the smoothest ones. An intermediate level of smoothness can be observed in the Matérn 3/2 and Matérn 5/2 kernels, which are one and two times MS differentiable respectively [53]. Sharp changes in the underlying function are captured by the rougher predictions via dense queries, whereas the smoother predictions deviate from the underlying model in such regions. Therefore, kernel functions with different MS continuity and differentiability behave differently where the functional form is unpredictable due to a discontinuity or sharp change. When QBC selects the data point at which the committee members differ the most in their predictions, it keeps exploiting that very region.
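
A minimal sketch of the QBC rule in Equation (12), given the predictions of an arbitrary committee (e.g., GPR models with different kernels) on a shared candidate pool; note that the maximum pairwise disagreement equals the spread between the largest and smallest committee predictions:

```python
import numpy as np

def qbc_select(committee_predictions):
    """Query-by-committee (Equation (12)): maximize the largest pairwise
    disagreement |h_l(x) - h_p(x)| among committee predictions at each candidate."""
    preds = np.asarray(committee_predictions)              # shape (Q, n_candidates)
    disagreement = preds.max(axis=0) - preds.min(axis=0)   # equals max_{l,p} |h_l - h_p|
    return int(np.argmax(disagreement))

# Toy usage with three hypothetical committee members on the same candidates.
X_cand = np.linspace(-1, 1, 41)
committee = [np.sin(3 * X_cand), np.sin(3 * X_cand) + 0.2 * X_cand, np.tanh(3 * X_cand)]
next_idx = qbc_select(committee)
```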

• Maximum variance: This strategy defines the acquisition function as the variance predicted by the regression model [54]. A learner's expected error can be decomposed into noise, bias, and variance [8]. Since the noise is independent of the model or data, and the bias is invariant given a fixed model, minimizing the variance is intuitively guaranteed to minimize the future generalization error of the model [5]. Equation (10) provides the Gaussian process posterior covariance, and the diagonal elements of $\text{cov}(\hat{\mathbf{f}})$ provide the predicted variance, $\mathbb{V}$. The next data point is queried following,

\mathbf{x}_{\text{VAR}}^{*}=\underset{\mathbf{x}_{j}}{\arg\!\max}\left(\mathbb{V}(\mathbf{x}_{j})\right) \qquad (13)

Since predicted variance is higher mostly in less-explored regions, it can be implemented as an adequate exploration acquisition function to query the next data [54].

• Maximum entropy: Another active learning strategy formulated based on uncertainty is the maximum entropy strategy [5]. Entropy is an information-theoretic measure that is often defined as the amount of information needed to “encode” a distribution and has been related to the uncertainty of the underlying model [5]. Thus, reducing the model entropy can reasonably be expected to reduce the model uncertainty. Shannon's entropy has been used as an acquisition function under the names maximum entropy sampling [7] and uncertainty sampling [5]. Since the posterior prediction of the Gaussian process follows a multivariate normal distribution, Shannon's entropy of the distribution is,

H[\hat{\mathbf{f}},\text{cov}(\hat{\mathbf{f}})]=\frac{1}{2}\log|\text{cov}(\hat{\mathbf{f}})|+\frac{D}{2}\log(2\pi e)

where $D$ is the dimension of the variable, and $\hat{\mathbf{f}}$ and $\text{cov}(\hat{\mathbf{f}})$ are the predicted mean and covariance according to Eqs. (9) and (10) respectively [53]. By minimizing the learning model's uncertainty, this strategy is also more prone to exploration than exploitation. The maximum entropy strategy queries the data point where the entropy is the highest, following,

\mathbf{x}_{\text{ENT}}^{*}=\underset{\mathbf{x}_{j}}{\arg\!\max}\left(H[\hat{\mathbf{f}},\text{cov}(\hat{\mathbf{f}})]\right) \qquad (14)
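
Evaluated one candidate at a time ($D=1$), the Gaussian entropy above reduces to a monotone function of the predictive variance; a small sketch of the resulting acquisition rule in Equation (14) is given below, using the diagonal of $\text{cov}(\hat{\mathbf{f}})$ from Equation (10):

```python
import numpy as np

def max_entropy_select(var_cand, jitter=1e-12):
    """Maximum entropy acquisition (Equation (14)) evaluated per candidate.

    For a single candidate (D = 1), H = 0.5*log(var) + 0.5*log(2*pi*e), which is
    monotone in the predictive variance, so the argmax coincides with the
    maximum variance rule in this per-candidate form.
    """
    H = 0.5 * np.log(var_cand + jitter) + 0.5 * np.log(2 * np.pi * np.e)
    return int(np.argmax(H))

# Toy usage with the diagonal of the GP posterior covariance.
var_cand = np.array([0.05, 0.30, 0.12, 0.02])
next_idx = max_entropy_select(var_cand)    # selects the candidate with index 1
```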

3.4 Exploration-exploitation trade-off functions

As touched on briefly in the introduction, existing approaches have tried to achieve the trade-off between exploration and exploitation using different methodologies [38, 55, 56, 16]. Here we list a few that we will compare BHEEM with.

• Static trade-off: Many previous studies have held on to static trade-offs between exploration and exploitation [38]. Prior information can be used to decide on the trade-off (e.g., equal importance to both, or more importance to one based on the nature of the function). Existing approaches have conducted trial and error with different trade-off values between appropriate exploration and exploitation acquisition functions in simulated experiments, and then selected the one that promises better overall accuracy [55]. By fixing the exploration-exploitation trade-off throughout the whole learning process, this method reduces the computational complexity to a great extent. Nevertheless, the same trade-off cannot be expected to demonstrate similar performance for every function (as can be seen in the experimental results).

• Probabilistic trade-off: Inspired by simulated annealing [57] and built on the $\epsilon$-decreasing greedy algorithm [56], this algorithm updates the exploration-exploitation combination in a probabilistic manner [16]. For instance, the exploration probability is defined as $p_{R}=\alpha^{t-1}$, where $\alpha$ is less than 1 and $t$ is the current time step or iteration number. To query new data, a uniform random variable $Z$ is generated; if $Z\leq p_{R}$, we set $\eta=1$ and exploration is performed, otherwise we set $\eta=0$ and exploitation is applied [16]. Hence the strategy starts with pure exploration. The exploration probability decays gradually, and the learning model becomes more prone to pure exploitation over time. Intuitively, this transition is practical since the initial queried data should focus more on exploration, and after learning the overall trend of the function, we can identify the irregular regions via exploitation. However, the transition rate depends on the choice of $\alpha$, which we fixed at 0.7 following Elreedy et al. [16]. But again, the transition rate should depend on the nature of the function, and fixing it will decrease the efficiency of the learning process.
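
A minimal sketch of this probabilistic trade-off, with $\alpha=0.7$ as in Elreedy et al. [16]:

```python
import numpy as np

def probabilistic_eta(t, alpha=0.7, rng=None):
    """Probabilistic trade-off: explore (eta = 1) with probability p_R = alpha**(t-1)
    at iteration t, otherwise exploit (eta = 0)."""
    rng = np.random.default_rng(rng)
    p_explore = alpha ** (t - 1)
    return 1.0 if rng.random() <= p_explore else 0.0

# The exploration probability decays geometrically over the querying stages.
etas = [probabilistic_eta(t, rng=t) for t in range(1, 11)]
```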

4 Experimental results

In this section, we present the experimental setup and focus on the performance evaluation of the proposed methodology. We compare the performance of BHEEM with other active learning strategies over six simulated experiments and one real-world case study for predicting the property of MAX phase materials. Later on, we discuss convergence and sensitivity analysis of the process.

4.1 Experimental setup

To demonstrate the efficacy of BHEEM, we employ the Matérn 3/2 kernel function in the GPR model to avoid too rough (exponential kernel) or too wavy (squared exponential kernel) prediction. The kernel function is defined as,

K_{\nu=3/2}(z)=\sigma_{f}^{2}\left(1+\sqrt{3}z/l\right)\exp\left(-\sqrt{3}z/l\right) \qquad (15)

where $\sigma_{f}^{2}$ is the signal variance, which we fixed at 1, $z=\|\mathbf{x}-\mathbf{x}^{\prime}\|_{2}$, and $l$ is the length-scale parameter optimized by maximizing the log marginal likelihood of the Gaussian process regression model.
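
For reference, the Matérn 3/2 kernel of Equation (15) can be computed as follows (length-scale optimization by marginal likelihood is not shown):

```python
import numpy as np

def matern32(X, Xp, length_scale=1.0, signal_var=1.0):
    """Matern 3/2 kernel of Equation (15), with z = ||x - x'||_2."""
    z = np.linalg.norm(X[:, None, :] - Xp[None, :, :], axis=-1)
    s = np.sqrt(3.0) * z / length_scale
    return signal_var * (1.0 + s) * np.exp(-s)

# Toy usage: kernel matrix over a small 1-D design.
X = np.linspace(0, 1, 5)[:, None]
K = matern32(X, X, length_scale=0.5)
```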

While applying the QBC strategy, we maintained a committee of ten Gaussian process models, each with a different kernel function: (i) squared exponential, (ii) exponential, (iii) Matérn 3/2, (iv) Matérn 5/2, (v) rational quadratic, (vi) product of dot product and constant kernels, (vii) product of (i) and (iii), (viii) product of (i) and (iv), (ix) product of (ii) and (iii), (x) product of (ii) and (iv) [53]. For the generation of the data, we fixed the signal-to-noise (SN) ratio to 10, which we define as the ratio of the signal power to the power of the data noise in decibels. For the performance evaluation, we calculated the root mean squared error as,

\text{RMSE}=\sqrt{\sum_{k=1}^{1000}{\frac{(\bar{f}(\mathbf{x}_{k})-f(\mathbf{x}_{k}))^{2}}{1000}}} \qquad (16)

where $f(\cdot)$ and $\bar{f}(\cdot)$ represent the true and predicted responses respectively at the set of equidistant test points $\{\mathbf{x}_{k}\}_{k=1}^{1000}$. We repeated all simulated experiments 100 times to achieve a consistent estimate of the performance.

4.2 Simulated experiments

To demonstrate the performance of BHEEM against the existing strategies, we considered the following six functions.

  • F_{1}(x)=\begin{cases}3.5\exp\left(-\frac{(x-10)^{2}}{200}\right)+\epsilon,&\text{if }x\leq 25\\ 8-3.5\exp\left(-\frac{(x-35)^{2}}{200}\right)+\epsilon,&\text{otherwise}\end{cases} \qquad (17)

  • F_{2}(x)=\sin(x)+2\exp(-30x^{2})+\epsilon,\qquad x\in[-2,2] \qquad (18)

  • Three-hump camel function:
    F_{3}(\mathbf{x})=2x_{1}^{2}-1.05x_{1}^{4}+x_{1}^{6}/6+x_{1}x_{2}+x_{2}^{2}+\epsilon,\qquad x_{1},x_{2}\in[-5,5] \qquad (19)

  • Six-hump camel function:
    F_{4}(\mathbf{x})=(4-2.1x_{1}^{2}+x_{1}^{4}/3)x_{1}^{2}+x_{1}x_{2}+(4x_{2}^{2}-4)x_{2}^{2}+\epsilon,\qquad x_{1},x_{2}\in[-2,2] \qquad (20)

  • Hartmann 3D function:
    F_{5}(\mathbf{x})=-\sum_{i=1}^{4}\alpha_{i}\exp{\left(-\sum_{j=1}^{3}A_{ij}(x_{j}-P_{ij})^{2}\right)}+\epsilon,\qquad x_{i}\in[0,1]\;\forall i=1,2,3 \qquad (21)

    where \alpha=(1.0,1.2,3.0,3.2)^{T}, A=\begin{bmatrix}3.0&10&30\\ 0.1&10&35\\ 3.0&10&30\\ 0.1&10&35\end{bmatrix}, and P=10^{-4}\begin{bmatrix}3689&1170&2673\\ 4699&4387&7470\\ 1091&8732&5547\\ 381&5743&8828\end{bmatrix}

  • F_{6}(\mathbf{x})=\frac{1}{2}\sum_{i=1}^{10}\left(x_{i}^{2}-\cos\left(2\pi x_{i}\right)\right)+\epsilon,\qquad x_{i}\in[-5,5]\;\forall i=1,2,\ldots,10 \qquad (22)

First, we implement BHEEM by employing iGS and QBC as the pure exploration and pure exploitation strategies respectively. Starting with three randomly queried initial data points, Figure 3 shows the results of iGS, QBC, and BHEEM after the 12th iteration for the underlying function from Equation (18). Unsurprisingly, iGS spreads out its queried data in both the $x$ and $y$ directions, predicting the smooth portions of the function competently, but it fails to fit the peak of the function due to the sudden change near $x=0$. QBC exploits this uncertainty in the predictions of the different committee members, which disagree near the sharp peak. Therefore, QBC queries most of the data in this region while ignoring other regions in the process. BHEEM queries data near $x=0$ enough times to fit the peak there, but it has also explored and provided satisfactory predictions in other regions, achieving an overall lower error than the other two acquisition functions.

Refer to caption
Figure 3: (a) Queried data and fitted function using exploration, (b) queried data and fitted function using exploitation, (c) queried data and fitted function using BHEEM with a dynamic trade-off between exploration and exploitation.

Continuing to employ iGS and QBC as the pure exploration and exploitation strategies respectively, Figure 4 (a-f) compares the root mean square error (RMSE) of the different methodologies while learning Equations (17)-(22) and summarizes the results using boxplots. For each of the examples, we have set the upper limit on the total number of experiments to 100. The key observation in the boxplots is that our proposed methodology, BHEEM, consistently achieves a lower generalization error with a smaller number of queried data points than any other employed strategy. Between iGS and QBC, we observe the latter to be more efficient in some cases (Figure 4(c)), and iGS in others (Figure 4(e)).

Refer to caption
Refer to caption
Refer to caption
Figure 4: RMSE for BHEEM, iGS, QBC, maximum variance, maximum entropy, and random sampling strategy for (a) Equation (17), (b) Equation (18), (c) Equation (19), (d) Equation (20), (e) Equation (21), (f) Equation (22) as well as average computational time for all strategies.

Nevertheless, BHEEM is either better (Figure 4(b,e)) or at least as accurate as one of the approaches (Figure 4(c)).

Among the one-dimensional problems, BHEEM achieved the lowest RMSE compared to the others after querying the first 10 data points in the case of $F_{1}(x)$, as observed in Figure 4(a). iGS was the closest in this case, with only about 5% higher RMSE on average. While predicting $F_{2}(x)$, BHEEM converged the fastest and performed the best, as shown in Figure 4(b). The second-best strategy was iGS, which scored about 60% higher RMSE on average than BHEEM after querying 25 data points. Among the two-dimensional problems, for $F_{3}(\mathbf{x})$, BHEEM surpassed the others most of the time and, at the 50th added data point, achieved about 8.5% lower RMSE than QBC, which scored the closest to BHEEM for the corresponding function. For $F_{4}(\mathbf{x})$, the consistently lower generalization error and fast convergence persisted for BHEEM. For the three-dimensional $F_{5}(\mathbf{x})$, BHEEM and QBC were the two strategies achieving the lowest RMSE across the querying stages, and BHEEM was able to achieve significantly lower RMSE than the pure exploration strategy, iGS. In the ten-dimensional problem of $F_{6}(\mathbf{x})$ (Figure 4(f)), all the active learning strategies are observed to compete with each other over the learning process: the pure exploration strategy, iGS, performs better than all other acquisition functions at the 50th added data point, whereas QBC seems to achieve the lowest RMSE at the 100th addition. In most of these cases, all active learning strategies performed better than the random sampling strategy. Our results were affected by a few decisions, including the choice of kernel, the sampling algorithm and proposal distribution, and the number of points in the initial set of queried data. However, the overall results demonstrate that our proposed approach tends to perform at least on par with pure exploration or pure exploitation.

To compare the computational time taken by the applied methodologies, Figure 4 also presents the average time for each methodology to query 100 data points during the learning process. Our proposed methodology, BHEEM, takes a significantly larger processing time, about 2.5 and 2 times higher than the durations of iGS and QBC, respectively. Hence, BHEEM is more suitable for offline applications than online ones, which is the case for most applications in manufacturing and materials that involve conducting time-consuming physical experiments.

Figure 5 presents the percentage improvement of BHEEM over pure exploration (iGS) and pure exploitation (QBC), calculated from the average RMSE obtained in the simulated examples. Corresponding to each function, the four bars of each color represent the percentage improvement over exploration using iGS (blue) and exploitation using QBC (yellow) after querying the 5th, 10th, 25th, and 50th data points. Here, the positive and negative bars indicate increased and decreased accuracy of BHEEM respectively, compared to pure exploration and pure exploitation. We observe that among the twenty-four cases of displayed results, only one showed deterioration, where BHEEM had lower accuracy than both pure strategies. In all other stages, BHEEM showed significant improvement and achieved lower RMSE compared to at least one of them, while surpassing the results of both pure strategies in sixteen of them.

Refer to caption
Figure 5: Percentage improvement of BHEEM from iGS and QBC acquisition function for the six simulated studies.

We also compare our proposed methodology with static and probabilistic updates of $\eta$. For the static trade-off, we considered $\eta=0.25,0.5,0.75$ individually during the learning process and calculated the RMSE for each of them. Note that $\eta=0$ and $\eta=1$ refer to pure exploitation and pure exploration respectively, which we have already compared with BHEEM in Figure 5 and Figure 6. For the probabilistic update of $\eta$, we followed the strategy described in Section 3.4. A heatmap of the average RMSE, scaled for each of the comparative trade-off methodologies, is presented in Figure 6, where blue and red cells represent more and less accurate models respectively. The heatmap clearly shows that no single value of $\eta$ works well for every function, or even for every iteration within one function. BHEEM distinctly provides better results than all the static and probabilistic trade-offs.

Refer to caption
Figure 6: Average RMSE for BHEEM, static trade-off, and probabilistic trade-off after adding the 5th, 10th, and 25th data for the simulated functions.

Finally, Figure 7 presents the average improvement achieved by BHEEM over the pure exploration and exploitation strategies across the functions. Here, we have used iGS and maximum variance as the exploration strategy in Figure 7(a) and Figure 7(b) respectively, and QBC as the exploitation strategy in both cases. In Figure 7(a), we observe about 7% and 21% lower average RMSE than pure exploration (iGS) and exploitation (QBC) respectively. In Figure 7(b), we observe about 5.7% and 2.3% lower average RMSE than pure exploration (maximum variance) and exploitation (QBC) respectively. Overall, our proposed trade-off methodology consistently promises better accuracy than pure exploration and exploitation, irrespective of the choice of strategies.

Refer to caption
Figure 7: (a) Average percentage improvement of BHEEM from pure exploration (iGS) and pure exploitation (QBC) acquisition function across simulated studies, (b) Average percentage improvement of BHEEM from pure exploration (maximum variance) and pure exploitation (QBC) acquisition function across simulated studies.

4.3 Case study: MAX phase materials

The layered ternary carbides and nitrides with the general formula $M_{n+1}AX_{n}$ are called MAX phase materials, where $M$ and $A$ are an early transition metal and an A-group element respectively, $X$ is nitrogen or carbon, and $n$ is an integer between 1 and 4 [58]. Their layered structures kink and delaminate during deformation, resulting in an unusual and unique combination of both ceramic and metallic properties, which makes them attractive candidates for structural and fuel-coating applications. In this study, we aim to predict the lattice constants of MAX phase materials by analyzing their compositions and elastic constants, using the data from Aryal et al. [59]. The lattice constant is an important piece of information for defining the overall lattice structure, which helps model the microstructure evolution. To represent the discrete categorical composition features as numeric inputs to the models, we used one-hot encoding, a popular approach that replaces a categorical variable with as many binary variables as categories [60]. Suppose we wish to differentiate between two materials with the same elements in M and A and the same value of $n$, where one of them is a carbide and the other a nitride. It is possible to represent this information with two binary variables using one-hot encoding, where 0 and 1 represent the absence and presence of that element in the material respectively.

Each category value is represented as a 2-dimensional, sparse vector with 1 in one of the dimensions and 0 in the other. In general, for a variable of cardinality $d$, one-hot encoding transforms it into $d$ binary variables, where each observation indicates the presence (1) or absence (0) of the corresponding category [60]. In the MAX phase problem, we have ten elements in M, twelve elements in A, and two elements in X. Hence, after using one-hot encoding and adding the numerical variable for $n$, we have a total of 25 variables representing the composition of the materials. Including the composition and the elastic constants, there is a total of 30 predictors for predicting the lattice constant. Figure 8 provides a comparative analysis of the RMSE of the different methodologies employed in this case study. Our proposed methodology (combining iGS and QBC) outperformed the other strategies and surpassed at least one of pure exploration or exploitation in almost every iteration. BHEEM achieved about 10.4% and 11.3% average improvement over the pure exploration and exploitation strategies respectively.
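
A small sketch of the one-hot encoding described above, using illustrative (not exhaustive) element lists; the actual dataset of [59] contains ten M-site, twelve A-site, and two X-site elements, giving the 25 composition variables mentioned above.

```python
import numpy as np

# Illustrative element lists (the study uses ten M, twelve A, and two X elements).
M_ELEMENTS = ["Ti", "V", "Cr", "Nb"]
A_ELEMENTS = ["Al", "Si", "Ge"]
X_ELEMENTS = ["C", "N"]

def one_hot(value, categories):
    """Binary indicator vector: 1 for the present category, 0 elsewhere."""
    vec = np.zeros(len(categories))
    vec[categories.index(value)] = 1.0
    return vec

def encode_max_phase(m, a, x, n):
    """Composition features for an M_{n+1}AX_n phase: one-hot blocks for the M, A,
    and X sites plus the integer n (elastic constants would be appended separately)."""
    return np.concatenate([one_hot(m, M_ELEMENTS), one_hot(a, A_ELEMENTS),
                           one_hot(x, X_ELEMENTS), [n]])

# Ti2AlC and Ti2AlN differ only in the two X-site indicator variables.
print(encode_max_phase("Ti", "Al", "C", 1))
print(encode_max_phase("Ti", "Al", "N", 1))
```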

Figure 8: Comparison of RMSE for different strategies in the MAX phase materials case study.

4.4 Convergence

In this section, we discuss the convergence of the sampled $\eta$ at one querying stage, its relation to the choice of hyperparameters, and the dynamics of $\eta$ during the active learning process. The results in this section are all based on the learning process of $F_1(x)$.

• Convergence of Metropolis: We visualize the progression of the Metropolis algorithm in Figure 9 for a representative simulation example of $F_1(x)$. It plots the accepted $\hat{\eta}$ at one stage of the querying process. The chains were initiated with different ranges of values for $\hat{\eta}$ and continued until 10,000 samples were accepted. The green, orange, and blue lines represent MCMC chains initiated with sampled values of $\eta$ in the ranges $(0, 0.2)$, $(0.4, 0.6)$, and $(0.8, 1)$ respectively. For each range of initial values, we plot only three MCMC chains for better visualization. Figures 9(a) and 9(b) show the chains up to the first 50 and 10,000 accepted samples, respectively. It is evident from the figures that, irrespective of the initial sample, the MCMC chains at one stage of learning converge to a close neighborhood of $\eta$ very quickly.
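The skeleton of such a sampler is sketched below. It is a generic ABC-Metropolis routine for $\eta \in [0,1]$ with a Gaussian random-walk proposal of standard deviation $\tau$ and acceptance threshold $\nu$; the discrepancy function is a user-supplied placeholder standing in for the feature-space criterion of BHEEM (defined in Section 3), and the toy discrepancy used here is purely illustrative, assuming a uniform prior on $[0,1]$.

```python
import numpy as np

def abc_metropolis(discrepancy, eta0, tau=0.1, nu=0.001, n_accept=10_000, rng=None):
    """Generic ABC-Metropolis sketch for a trade-off parameter eta in [0, 1].

    `discrepancy(eta)` stands in for the feature-space criterion of BHEEM;
    a proposal is accepted when its discrepancy falls below the threshold nu.
    With a uniform prior on [0, 1], acceptance reduces to the ABC rejection
    step plus the support check."""
    rng = np.random.default_rng(rng)
    chain = [eta0]
    while len(chain) < n_accept:
        eta_prop = rng.normal(chain[-1], tau)   # Gaussian random-walk proposal
        if not 0.0 <= eta_prop <= 1.0:          # outside the prior support: reject
            continue
        if discrepancy(eta_prop) < nu:          # ABC acceptance step
            chain.append(eta_prop)
    return np.array(chain)

# Three chains started at different values, loosely mimicking Figure 9.
toy_discrepancy = lambda eta: (eta - 0.6) ** 2   # placeholder, not the BHEEM criterion
chains = [abc_metropolis(toy_discrepancy, eta0, tau=0.1, nu=0.05, n_accept=1_000, rng=i)
          for i, eta0 in enumerate([0.2, 0.5, 0.8])]
```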

Figure 9: (a) Metropolis sampling chains showing the first 50 accepted $\hat{\eta}$ in one querying iteration for different initial samples; (b) Metropolis sampling chains showing the first 10,000 accepted $\hat{\eta}$ in one querying iteration for different initial samples. Different colors in each panel represent chains initiated in different ranges. (c) Progression of the posterior mean of $\eta$ while learning a function.

• Dynamics of $\eta$: The progression of the posterior mean of $\eta$ during successive query stages while learning $F_1(x)$ is presented in Figure 9(c). The dotted line is used only to visualize the sample path and has no physical meaning. Values of $\eta$ closer to 1 and 0 encourage exploration and exploitation, respectively; every intermediate value combines exploration and exploitation in a specific ratio. Based on this representative sample path, we first note that BHEEM never gets stuck in pure exploration or pure exploitation, which can happen when the trade-off parameter is pre-defined. A closer investigation of the sampled $\eta$ shows that BHEEM continues to either explore, exploit, or combine both towards optimally learning the function as more data points are queried. Note that the sample path of the mean $\eta$ depends on the noise level and may therefore differ across runs.
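One simple way to read the role of $\eta$ is as the weight in a convex combination of normalized exploration and exploitation scores, as in the sketch below. This is an illustrative reading of how $\eta$ trades off the two criteria, assuming min-max normalized scores; it is not necessarily the exact acquisition function used by BHEEM, and the score vectors shown are made up.

```python
import numpy as np

def combined_acquisition(explore_score, exploit_score, eta):
    """Convex combination of exploration and exploitation scores.

    eta near 1 weights exploration, eta near 0 weights exploitation.  Both
    score vectors are min-max normalized so that neither dominates by scale."""
    def minmax(s):
        s = np.asarray(s, dtype=float)
        span = s.max() - s.min()
        return (s - s.min()) / span if span > 0 else np.zeros_like(s)
    return eta * minmax(explore_score) + (1.0 - eta) * minmax(exploit_score)

# Pick the candidate point that maximizes the combined score (toy values).
explore = [0.2, 0.9, 0.4]   # e.g., iGS distances to already-queried points
exploit = [0.7, 0.1, 0.8]   # e.g., QBC committee disagreement
best = int(np.argmax(combined_acquisition(explore, exploit, eta=0.35)))
print(best)
```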

Figure 10: (a) Boxplot of root mean square error for different values of $\nu$; (b) boxplot of $\eta$ for different values of $\nu$; (c) boxplot of root mean square error for different values of $\tau$; (d) boxplot of $\eta$ for different values of $\tau$.

• Sensitivity of hyperparameters: Through the sensitivity analysis of the proposal distribution standard deviation $\tau$ and the threshold parameter $\nu$, visualized in Figure 10, we discuss how the convergence of RMSE and the selection of $\eta$ depend on these two hyperparameters. We repeated our experiments with $\nu \in \{0.0001, 0.001, 0.01, 0.1, 0.25\}$ while keeping $\tau$ fixed. The boxplot of root mean square error (RMSE) across different querying stages in Figure 10(a) indicates that the value of $\nu$ does not have a significant effect on RMSE, since we run the MCMC sampling long enough to ensure convergence. However, the noise in the data forces the points to be queried differently each time, and the values of $\eta$ at previous stages affect $\eta$ at subsequent stages. Therefore, the selected values of $\eta$, presented as a boxplot in Figure 10(b), have a large variance across the learning process. Since the results are nearly indifferent to this choice, we set $\nu = 0.001$ for all experiments. Figures 10(c) and 10(d) present the RMSE and the selected $\eta$, respectively, for $\tau \in \{0.001, 0.01, 0.1, 0.25\}$ while keeping $\nu = 0.001$. Here also, we observe that the behavior of RMSE and $\eta$ is indifferent to the value of $\tau$. However, based on our earlier discussion of the relation between $\tau$ and the speed of convergence of the Metropolis algorithm in the last paragraph of Section 3.1.2, we chose $\tau = 0.1$ throughout all experiments. A schematic of this sensitivity sweep is sketched below.
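The sweep itself amounts to re-running the active learning loop over a grid of hyperparameter values and collecting the per-stage RMSE and sampled $\eta$ that feed the boxplots. A schematic harness is given below; `run_bheem` is a hypothetical placeholder for one full BHEEM run and here returns synthetic values only so that the sketch executes.

```python
import numpy as np

def run_bheem(nu, tau, n_stages=30, seed=0):
    """Placeholder for one full BHEEM active learning run.  Returns synthetic
    per-stage RMSE and sampled eta (the stub ignores nu and tau); replace with
    the actual learning loop."""
    rng = np.random.default_rng(seed)
    rmse = np.exp(-0.1 * np.arange(n_stages)) + 0.05 * rng.random(n_stages)
    eta = rng.uniform(0.0, 1.0, n_stages)
    return rmse, eta

# Sweep nu with tau fixed, then tau with nu fixed, collecting the quantities
# summarized as boxplots in Figure 10.
nu_grid = [0.0001, 0.001, 0.01, 0.1, 0.25]
tau_grid = [0.001, 0.01, 0.1, 0.25]
results_nu = {nu: run_bheem(nu=nu, tau=0.1) for nu in nu_grid}
results_tau = {tau: run_bheem(nu=0.001, tau=tau) for tau in tau_grid}
```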

5 Conclusion

In this work, we address the exploration-exploitation problem in active learning for regression. Our approach, BHEEM, dynamically balances the trade-off between exploration and exploitation during the learning process using a Bayesian hierarchical model. BHEEM captures the variability in the trade-off parameter both within each query stage and across successive query stages. In the absence of a likelihood function, we devised an approximate Bayesian computation scheme based on the linear dependence of the data in the Hilbert space to sample from the posterior distribution of the trade-off parameter. We demonstrated and compared BHEEM against exploration, exploitation, and other well-known strategies for balancing the two on six simulated functions and one real-world case study. In terms of average percentage improvement in the simulated experiments, BHEEM achieved about 21% and 24% lower average RMSE than pure exploration and pure exploitation respectively, irrespective of the chosen strategies. Overall, when compared to existing active learning approaches, BHEEM converged in fewer iterations in all cases and performed better than, or at least as well as, the more effective of the two strategies it combines. The proposed approach to dynamically balancing exploration and exploitation has wide applications in domains where conducting experiments and labeling data is costly, such as materials characterization, mechanical testing, manufacturing, and medical sciences.

Some of the limitations of the current study include the assumption on the filtering distance threshold $\nu$ and, more importantly, a restricted prior on the trade-off parameter $\eta$. In future work, we plan to consider a more expressive and flexible Dirichlet process prior to ensure consistency and faster convergence of the hierarchical model.

References

  • [1] Turab Lookman, Prasanna V. Balachandran, Dezhen Xue and Ruihao Yuan “Active learning in materials science with emphasis on adaptive sampling using uncertainties for targeted design” In npj Computational Materials 5.1, 2019, pp. 2057–3960
  • [2] Thorsten Wuest, Daniel Weimer, Christopher Irgens and Klaus-Dieter Thoben “Machine learning in manufacturing: advantages, challenges, and applications” In Production & Manufacturing Research 4.1 Taylor & Francis, 2016, pp. 23–45
  • [3] Danton S Char, Nigam H Shah and David Magnus “Implementing machine learning in health care—addressing ethical challenges” In The New England Journal of Medicine 378.11 NIH Public Access, 2018, pp. 981
  • [4] T. Liao and Guoqiang Li “Metaheuristic-based inverse design of materials – A survey” In Journal of Materiomics 6.2 Elsevier, 2020, pp. 414–430 DOI: 10.1016/j.jmat.2020.02.011
  • [5] Burr Settles “Active Learning Literature Survey”, 2009
  • [6] Dana Angluin “Queries and concept learning” In Machine Learning 2.4 Springer, 1988, pp. 319–342
  • [7] Alex Holub, Pietro Perona and Michael C Burl “Entropy-based active learning for object recognition” In 2008 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, 2008, pp. 1–8 IEEE
  • [8] David A Cohn, Zoubin Ghahramani and Michael I Jordan “Active learning with statistical models” In Journal of Artificial Intelligence Research 4, 1996, pp. 129–145
  • [9] Nicolas Cebron and Michael R. Berthold “Active learning for object classification: from exploration to exploitation” In Data Mining and Knowledge Discovery 18, 2009, pp. 283–299
  • [10] Dongrui Wu, Chin-Teng Lin and Jian Huang “Active Learning for Regression Using Greedy Sampling” In Information Sciences 474, 2019, pp. 90–105
  • [11] Robert Burbidge, Jem J. Rowland and Ross D. King “Active Learning for Regression Based on Query by Committee” In Lecture Notes in Computer Science, 2007
  • [12] Chen Change Loy, Timothy M Hospedales, Tao Xiang and Shaogang Gong “Stream-based joint exploration-exploitation active learning” In 2012 IEEE Conference on Computer Vision and Pattern Recognition, 2012, pp. 1560–1567 IEEE
  • [13] Zhehui Chen, Simon Mak and CF Jeff Wu “A hierarchical expected improvement method for Bayesian optimization” In arXiv preprint arXiv:1911.07285, 2019
  • [14] Donald R. Jones, Matthias Schonlau and William J. Welch “Efficient Global Optimization of Expensive Black-Box Functions” In Journal of Global Optimization 13.4 Kluwer Academic Publishers, 1998, pp. 455–492 DOI: 10.1023/A:1008306431147
  • [15] Ali Ajdari and Hashem Mahlooji “An Adaptive Exploration-Exploitation Algorithm for Constructing Metamodels in Random Simulation Using a Novel Sequential Experimental Design” In Communications in Statistics - Simulation and Computation 43.5, 2014, pp. 947–968
  • [16] Dina Elreedy, Amir F. Atiya and Samir I. Shaheen “A Novel Active Learning Regression Framework for Balancing the Exploration-Exploitation Trade-Off” In Entropy 21.7, 2019
  • [17] Andrew Gelman, John B Carlin, Hal S Stern and Donald B Rubin “Bayesian data analysis” Chapman & Hall/CRC, 1995
  • [18] Feng Guo, Dipak K Dey and Kent E Holsinger “A Bayesian hierarchical model for analysis of single-nucleotide polymorphisms diversity in multilocus, multipopulation samples” In Journal of the American Statistical Association 104.485 Taylor & Francis, 2009, pp. 142–154
  • [19] Herman Chernoff “Sequential design of experiments” In The Annals of Mathematical Statistics 30.3 JSTOR, 1959, pp. 755–770
  • [20] V Roshan Joseph, Tirthankar Dasgupta, Rui Tuo and CF Jeff Wu “Sequential exploration of complex surfaces using minimum energy designs” In Technometrics 57.1 Taylor & Francis, 2015, pp. 64–74
  • [21] Ray-Bing Chen, Weichung Wang and CF Jeff Wu “Sequential designs based on Bayesian uncertainty quantification in sparse representation surrogate modeling” In Technometrics 59.2 Taylor & Francis, 2017, pp. 139–152
  • [22] Lu Lu and Christine M Anderson-Cook “Strategies for sequential design of experiments and augmentation” In Quality and Reliability Engineering International 37.5 Wiley Online Library, 2021, pp. 1740–1757
  • [23] Sanjoy Dasgupta and Daniel Hsu “Hierarchical sampling for active learning” In Proceedings of the 25th International Conference on Machine Learning, 2008, pp. 208–215
  • [24] Xin Li and Yuhong Guo “Adaptive active learning for image classification” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 859–866
  • [25] William H Beluch, Tim Genewein, Andreas Nürnberger and Jan M Köhler “The power of ensembles for active learning in image classification” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 9368–9377
  • [26] Wenbin Cai, Ya Zhang and Jun Zhou “Maximizing expected model change for active learning in regression” In 2013 IEEE 13th International Conference on Data Mining IEEE, 2013
  • [27] David JC MacKay “Information-based objective functions for active data selection” In Neural Computation 4.4 MIT Press, 1992, pp. 590–604
  • [28] Claude E. Shannon “A mathematical theory of communication” The Bell system technical journal, 1948
  • [29] David D Lewis and Jason Catlett “Heterogeneous uncertainty sampling for supervised learning” In Machine Learning Proceedings 1994 Elsevier, 1994, pp. 148–156
  • [30] Rajitha Meka, Adel Alaeddini, Sakiko Oyama and Kristina Langer “An Active Learning Methodology for Efficient Estimation of Expensive Noisy Black-Box Functions Using Gaussian Process Regression” In IEEE Access 8, 2020, pp. 111460–111474
  • [31] H.. Seung, M. Opper and H. Sompolinsky “Query by committee” In Proceedings of the fifth annual workshop on Computational learning theory New York, NY, USA: Association for Computing Machinery, 1992, pp. 287–294 DOI: 10.1145/130385.130417
  • [32] Anders Krogh and Jesper Vedelsby “Neural Network Ensembles, Cross Validation, and Active Learning” In Advances in Neural Information Processing Systems 7.7, 1995
  • [33] Jack O’Neill, Sarah Jane Delany and Brian MacNamee “Model-free and model-based active learning for regression” In Advances in Computational Intelligence Systems Springer, 2017, pp. 375–386
  • [34] Rong Hu, Sarah Jane Delany and Brian Mac Namee “EGAL: Exploration guided active learning for TCBR” In International Conference on Case-Based Reasoning, 2010, pp. 156–170 Springer
  • [35] Yoram Baram, Ran El Yaniv and Kobi Luz “Online choice of active learning algorithms” In Journal of Machine Learning Research 5.Mar, 2004, pp. 255–291
  • [36] Changchang Yin et al. “Deep similarity-based batch mode active learning with exploration-exploitation” In 2017 IEEE International Conference on Data Mining (ICDM), 2017, pp. 575–584 IEEE
  • [37] Ryan Smith et al. “Imprecise action selection in substance use disorder: Evidence for active learning impairments when solving the explore-exploit dilemma” In Drug and alcohol dependence 215 Elsevier, 2020, pp. 108208
  • [38] Seho Kee, Enrique Del Castillo and George Runger “Query-by-committee improvement with diversity and density in batch active learning” In Information Sciences 454 Elsevier, 2018, pp. 401–418
  • [39] Peter Auer “Using confidence bounds for exploitation-exploration trade-offs” In Journal of Machine Learning Research 3.Nov, 2002, pp. 397–422
  • [40] Haitham Afifi and Holger Karl “Reinforcement Learning for Virtual Network Embedding in Wireless Sensor Networks” In 2020 16th International Conference on Wireless and Mobile Computing, Networking and Communications (WiMob)(50308), 2020, pp. 123–128 IEEE
  • [41] Richard S Sutton and Andrew G Barto “Reinforcement learning: An introduction” MIT press, 2018
  • [42] Dimitris E Koulouriotis and A Xanthopoulos “Reinforcement learning and evolutionary algorithms for non-stationary multi-armed bandit problems” In Applied Mathematics and Computation 196.2 Elsevier, 2008, pp. 913–922
  • [43] Shin Ishii, Wako Yoshida and Junichiro Yoshimoto “Control of exploitation–exploration meta-parameter in reinforcement learning” In Neural Networks 15.4-6 Elsevier, 2002, pp. 665–687
  • [44] Jonathan D Cohen, Samuel M McClure and Angela J Yu “Should I stay or should I go? How the human brain manages the trade-off between exploitation and exploration” In Philosophical Transactions of the Royal Society B 362.1481, 2007, pp. 933–942
  • [45] PM A Rabbitt “Errors and error correction in choice-response tasks.” In Journal of Experimental Psychology 71.2 American Psychological Association, 1966, pp. 264
  • [46] José M Bernardo “The concept of exchangeability and its applications” In Far East Journal of Mathematical Sciences 4 SHIROCH REPROGRAPHICS, 1996, pp. 111–122
  • [47] Peter D Hoff “A first course in Bayesian statistical methods” Springer, 2009
  • [48] Alberto Giovanni Busetto and Joachim M Buhmann “Stable Bayesian parameter estimation for biological dynamical systems” In 2009 International Conference on Computational Science and Engineering 1, 2009, pp. 148–157 IEEE
  • [49] Jonathan K Pritchard, Mark T Seielstad, Anna Perez-Lezaun and Marcus W Feldman “Population growth of human Y chromosomes: a study of Y chromosome microsatellites.” In Molecular Biology and Evolution 16.12 Oxford University Press, 1999, pp. 1791–1798
  • [50] Paul Marjoram, John Molitor, Vincent Plagnol and Simon Tavaré “Markov chain Monte Carlo without likelihoods” In Proceedings of the National Academy of Sciences 100.26 National Acad Sciences, 2003, pp. 15324–15328
  • [51] James Mercer “XVI. Functions of positive and negative type, and their connection with the theory of integral equations” In Philosophical Transactions of the Royal Society of London. Series A, containing papers of a mathematical or physical character 209.441-458 The Royal Society London, 1909, pp. 415–446
  • [52] Yaakov Engel, Shie Mannor and Ron Meir “The kernel recursive least-squares algorithm” In IEEE Transactions on Signal Processing 52.8 IEEE, 2004, pp. 2275–2285
  • [53] Christopher KI Williams and Carl Edward Rasmussen “Gaussian Processes for Machine Learning” The MIT Press, 2006
  • [54] Yazhou Yang and Marco Loog “A variance maximization criterion for active learning” In Pattern Recognition 78, 2018, pp. 358–370
  • [55] Maicon Pierre Lourenço et al. “A new active learning approach for adsorbate–substrate structural elucidation in silico” In Journal of Molecular Modeling 28.6 Springer, 2022, pp. 1–11
  • [56] Volodymyr Kuleshov and Doina Precup “Algorithms for multi-armed bandit problems” In arXiv preprint arXiv:1402.6028, 2014
  • [57] Peter JM Van Laarhoven and Emile HL Aarts “Simulated annealing” In Simulated annealing: Theory and applications Springer, 1987, pp. 7–15
  • [58] Per Eklund et al. “Homoepitaxial growth of Ti–Si–C MAX-phase thin films on bulk Ti3SiC2 substrates” In Journal of Crystal Growth 304.1, 2007, pp. 264–269
  • [59] Sitaram Aryal, Ridwan Sakidja, Michel W Barsoum and Wai-Yim Ching “A genomic approach to the stability, elastic, and electronic properties of the MAX phases” In Physica Status Solidi (b) 251.8 Wiley Online Library, 2014, pp. 1480–1497
  • [60] Kedar Potdar, Taher S Pardawala and Chinmay D Pai “A comparative study of categorical variable encoding techniques for neural network classifiers” In International Journal of Computer Applications 175.4, 2017, pp. 7–9