Active Learning by Query by Committee with Robust Divergences

Hideitsu Hino
The Institute of Statistical Mathematics, Tokyo 190-8565, Japan
RIKEN AIP, Tokyo 103-0027, Japan
Shinto Eguchi
The Institute of Statistical Mathematics, Tokyo 190-8565, Japan

Abstract

Active learning is a widely used methodology for various problems with high measurement costs. In active learning, the next object to be measured is selected by an acquisition function, and measurements are performed sequentially. The query by committee is a well-known acquisition function. In conventional methods, committee disagreement is quantified by the Kullback–Leibler divergence. In this paper, the measure of disagreement is defined by the Bregman divergence, which includes the Kullback–Leibler divergence as an instance, and the dual $\gamma$-power divergence. As a particular class of the Bregman divergence, the $\beta$-divergence is considered. By deriving the influence function, we show that the proposed methods using the $\beta$-divergence and the dual $\gamma$-power divergence are more robust than the conventional method in which the measure of disagreement is defined by the Kullback–Leibler divergence. Experimental results show that the proposed methods perform as well as or better than the conventional method.

Keywords Information Geometry, Bregman Divergence, Power Divergence, Active Learning, Query By Committee, Robust Statistics

1 Introduction

Supervised learning, a typical machine learning problem set-up, is the problem of approximating the input–output correspondence given a large number of input and output data pairs in advance. The greater the number of input-output data available for model construction, the more accurately the input-output relationship can be approximated, but the cost of obtaining appropriate outputs for the inputs can be significant. For example, if the environment in which crops are grown (e.g., the average temperature for each month, type and amount of fertilizer to be administered, and weather conditions) is used as the input and property (e.g., sugar content) of a particular crop as the output, it will take several months to several years to obtain the output for a particular input. In another example, researchers have very limited time to use the synchrotron radiation experimental facilities where the X-ray spectrum measurement experiments described below are conducted. Although it is possible to plan particular measurements, which are considered as the input in this case, the cost of obtaining the corresponding output is high, and it is necessary to consider ways to extract the maximum amount of information with as few measurements as possible. There is a methodology called experimental design [7], which is a technique to carefully design the types and values of input variables and the number of measurements required before conducting an experiment. On the other hand, when a certain amount of data (input–output pairs) has already been observed and a predictive model has been built using it, the methodology to automatically select the next sample to annotate to maximally improve the predictive accuracy is called active learning [31, 18]. 
It is theoretically known that the probability of an incorrect prediction of the response variable for an unknown input (generalization error) can be reduced by appropriately selecting the examples (samples) to be used to train the predictor through active learning. Active learning is widely used in practical applications [36, 35], and theoretical analysis has also been conducted [5, 10, 4, 20].

Information geometry is a methodology that treats parametric models of probability distributions with a geometric approach [1]. It enables the analysis of statistical inference problems using the tools of differential geometry, and it is used to elucidate the structure of information not only in statistics, but also in various other fields [2]. Divergence functions, which quantify the degree of discrepancy between probability distributions, play an essential role in information geometry [12], and remarkable results in robust statistics have been obtained, for example, through information geometric analysis with a particular class of divergence functions [14]. As a complementary approach to conventional theoretical development and practical algorithms, we consider an active learning algorithm from the viewpoint of information geometry. We consider a sequential active learning problem with a particular acquisition function and a generalized linear model. An intuitively comprehensible picture of the sequential sample selection procedure is provided by considering an input vector as an element of the (algebraic) dual space to the parameter space. On the basis of the geometric formulation of sample selection for an active learning algorithm, robust variants of the selection procedure with the $\beta$-divergence and the dual $\gamma$-power divergence are proposed. Proofs of theorems and propositions are deferred to the appendix for the sake of readability.

2 Active Learning

In this section, the setup of active learning is introduced, and an information geometric perspective of the procedure of selecting a new sample to be measured is considered.

Let $X\in\mathcal{X}\subseteq\mathbb{R}^{d}$ be an input variable and $Y\in\mathcal{Y}$ be an output variable, where $\mathcal{Y}$ is a subset of $\mathbb{R}$ for regression problems and is $\{+1,-1\}$ for discrimination problems. The realizations of the random variables $X$ and $Y$ are denoted as $\bm{x}$ and $y$ , respectively. The function $h:\mathcal{X}\to\mathcal{Y}$ , which predicts the response variable from the explanatory variable, is called a hypothesis or predictor. The set of all possible hypotheses is represented by $\mathcal{H}$ .

The probability density function $p(y|\xi)$ or $p(y|\bm{x};\bm{\theta})$ is sometimes abbreviated by its parameter $\xi$ or $\bm{\theta}$ . Let $\mathcal{P}$ be the space of all Radon–Nikodým derivatives with a common support, which are dominated by a $\sigma$ -finite measure $\Lambda$ on $\mathbb{R}^{d}$ . We typically consider the case where $\Lambda$ is the Lebesgue measure or the counting measure, so that $\mathcal{P}$ is the space of all probability density functions or that of all probability mass functions, respectively.

2.1 Sequential Observation for Generalized Linear Model

Suppose we have a small initial training dataset $S_{0}=\{(\bm{x}_{i},y_{i})\}$ , and the initial predictive model $y=h_{0}(\bm{x})$ is trained with $S_{0}$ . In active learning, a learner selects, according to some criterion, a sample $\bm{x}$ for which the value of the corresponding output variable $y$ is unknown, and then obtains the value of $y$ . The function that returns the value of the response variable for $\bm{x}$ is often called the oracle. In statistics, for most sequential designs, a setting is assumed in which observation points can be freely selected according to some standard; this is called membership query synthesis in the context of active learning [3]. On the other hand, in the literature on active learning, it is often assumed that the learner has access to a set of pooled unlabeled samples denoted by $\mathcal{X}_{p}$ and selects one sample from $\mathcal{X}_{p}$ in one iteration of the active learning procedure on the basis of the value of an acquisition function $a(\bm{x})$ . Then, the output value $y$ for the chosen sample $\bm{x}$ is measured, and the dataset for learning the predictive model is updated as $S_{t+1}=S_{t}\cup\{(\bm{x},y)\}$ . We follow this problem setting.

Given a set of input and output pairs $S_{t}=\{(\bm{x}_{i},y_{i})\}_{i=1,\dots,n_{t}}$ , we define $X_{t}=(\bm{x}_{1},\dots,\bm{x}_{n_{t}})^{\top}\in\mathbb{R}^{n_{t}\times d}$ as the design matrix and a vector of realizations of output $\bm{y}=(y_{1},\dots,y_{n_{t}})$ . We consider the joint distribution $p(\bm{y}|X_{t},\bm{\xi})$ of $\bm{y}$ parameterized by $\bm{\xi}\in\mathbb{R}^{m}$ and estimate $\bm{\xi}$ by the maximum likelihood method. An exponential family is a broad class of statistical models, which includes Gaussian distribution and Poisson distribution, for example. A generalized linear model (GLM; [16]) considers that each output $y$ is assumed to be generated from a particular distribution in an exponential family. Statistical inference on the GLM is investigated from the viewpoint of information geometry [19, 11]. There are several equivalent representations for GLM, and we adopt the following form:

\displaystyle p(y|\xi(\bm{x}))=\exp\left(\frac{y\xi(\bm{x})-\psi(\xi(\bm{x}))}{\phi}+c(y,\phi)\right),

(1)

where $\phi$ is a dispersion parameter and is assumed to be known in this paper. The function $\psi(\xi(\bm{x}))$ is the cumulant generating function. Here, $\xi\in\mathbb{R}$ is the canonical parameter of the distribution, but in the framework of the generalized linear model, we further introduce the linear predictor $\xi(\bm{x})=h(\bm{x};\bm{\theta})=\langle\bm{\theta},\bm{x}\rangle$ and consider $\bm{\theta}$ as the target of estimation, where $\langle\;\cdot\;,\;\cdot\;\rangle$ is the Euclidean inner product, and accordingly, $p(y|\xi(\bm{x}))=p(y|\bm{x};\bm{\theta})$ . As a basic requirement, $\xi$ is in $(-\infty,\infty)$ , so that the linear model $\xi=\langle\bm{\theta},\bm{x}\rangle$ is always well-defined for any regression parameter $\bm{\theta}\in\mathbb{R}^{d}$ .

Example 1

Multiple Regression
In the standard multiple linear regression in which the output $y$ follows a Gaussian distribution with the mean $\xi$ and variance $\sigma^{2}$ , we have $\phi=\sigma^{2}$ and $c(y,\phi)=-\frac{y^{2}}{2\phi}-\log(2\pi\phi)^{1/2}$ . The cumulant generating function is $\psi(\xi)=\frac{1}{2}\xi^{2}$ , and the probability density function is

\displaystyle\begin{aligned} p(y|\xi(\bm{x}))=&\frac{1}{\sqrt{2\pi\sigma^{2}}}\exp\left(-\frac{1}{2\sigma^{2}}(y-\xi)^{2}\right)\\ =&\exp\left(\frac{y\langle\bm{\theta},\bm{x}\rangle-\frac{1}{2}\langle\bm{\theta},\bm{x}\rangle^{2}}{\phi}-\frac{y^{2}}{2\phi}-\log(2\pi\phi)^{1/2}\right).\end{aligned}

(2)

Example 2

Logistic Regression
In logistic regression, the output $y$ follows a binomial distribution with parameters $\xi$ , $\phi=1$ , and $c(y,\phi)=0$ . The cumulant generating function is $\psi(\xi)=\log(1+e^{\xi})$ ; hence, the probability function is

\displaystyle p(y|\xi(\bm{x}))=\exp(y\xi-\log(1+e^{\xi})).

(3)
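The two examples can be checked numerically against the common form of Eq. (1). The following is a minimal sketch, not the authors' implementation: the helper names (`glm_log_density`, `psi_gauss`, `c_gauss`, `psi_logistic`, `linear_predictor`) and the numerical values of $\bm{\theta}$ and $\bm{x}$ are assumptions chosen for illustration, and the logistic case assumes the response is coded as $y\in\{0,1\}$ so that Eq. (3) sums to one.

```python
import math

def glm_log_density(y, xi, psi, phi=1.0, c=lambda y, phi: 0.0):
    """Log of Eq. (1): (y * xi - psi(xi)) / phi + c(y, phi)."""
    return (y * xi - psi(xi)) / phi + c(y, phi)

# Example 1 (Gaussian): psi(xi) = xi^2 / 2 and phi = sigma^2.
def psi_gauss(xi):
    return 0.5 * xi * xi

def c_gauss(y, phi):
    return -y * y / (2.0 * phi) - 0.5 * math.log(2.0 * math.pi * phi)

# Example 2 (logistic): psi(xi) = log(1 + e^xi), written in a stable form.
def psi_logistic(xi):
    return max(xi, 0.0) + math.log1p(math.exp(-abs(xi)))

def linear_predictor(theta, x):
    """Canonical parameter xi(x) = <theta, x>."""
    return sum(t * v for t, v in zip(theta, x))

theta = [0.5, -1.0]   # hypothetical regression coefficients
x = [2.0, 0.5]        # hypothetical input point
xi = linear_predictor(theta, x)

# Gaussian case: the GLM form agrees with the usual normal density.
sigma2, y = 2.0, 1.0
gauss_glm = math.exp(glm_log_density(y, xi, psi_gauss, phi=sigma2, c=c_gauss))
gauss_direct = math.exp(-(y - xi) ** 2 / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)

# Logistic case: probabilities of y = 1 and y = 0 under Eq. (3).
p1 = math.exp(glm_log_density(1, xi, psi_logistic))
p0 = math.exp(glm_log_density(0, xi, psi_logistic))
```

Only the cumulant function $\psi$ and the term $c(y,\phi)$ change between the two models, which is what lets the later sections treat both with the same formulas.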

Since the generalized linear model (1) is determined by design $X_{t-1}$ and parameter $\bm{\theta}$ , the model manifold is denoted as

\displaystyle M_{X_{t-1}}=\{p(\bm{y}|X_{t-1};\bm{\theta})\}.

(4)

In the framework of active learning, we assume, by using $\bm{y}$ and $X_{t-1}$ , that an estimate of parameter $\hat{\bm{\theta}}_{t-1}$ has been obtained. Then, by using the information contained in $\bm{y},X_{t-1}$ and $\hat{\bm{\theta}}_{t-1}$ , we explore a point $\bm{x}_{t}\in\mathcal{X}_{p}$ . Considering the fact that the canonical parameter for the generalized linear model $\xi$ is expressed by $\langle\bm{\theta},\bm{x}\rangle=\xi(\bm{x})$ , we consider fixing $\bm{\theta}=\hat{\bm{\theta}}_{t-1}$ and $\bm{x}$ as the element of the algebraic dual space of $\bm{\theta}\in\Theta$ . Then, the problem of exploring the sample point $\bm{x}\in\mathcal{X}_{p}$ in active learning is considered as the problem of finding $\bm{x}$ by maximizing an acquisition function characterized by $p(y|\bm{x};\hat{\bm{\theta}}_{t-1})$ . The dimensionality of joint distribution is $t-1$ in $M_{X_{t-1}}$ and $t$ in $M_{X_{t}}$ , while the dimensionality of parameter $\bm{\theta}$ remains the same. The parameter $\bm{\theta}$ specifies a point of the model manifold $M_{X_{t}}$ or $M_{X_{t-1}}$ . From this perspective, the updated dataset $S_{t}=S_{t-1}\cup\{(\bm{x}_{t},y_{t})\}$ in the process of active learning is regarded as the extension of the model space $M_{X_{t-1}}$ to $M_{X_{t}}$ as schematically depicted in Figure 1. We consider an active learning problem that explores an additional observation point $\bm{x}_{t}$ based on the current model parameter $\hat{\bm{\theta}}_{t-1}$ . In Figure 1, the dotted arrow connects the different model spaces, and the solid arrow connects model parameters within the same model space $M_{X_{t}}$ . The cross mark specifies a point in $M_{X_{t-1}}$ while open circles specify points in $M_{X_{t}}$ . Note that the parameter $\hat{\bm{\theta}}_{t-1}$ corresponds to a point obtained by MLE with $S_{t-1}$ in $M_{X_{t-1}}$ . The additional observation $(\bm{x}_{t},y_{t})$ defines an updated model space $M_{X_{t}}$ .
The parameter $\hat{\bm{\theta}}_{t-1}$ specifies a point in the updated model space $M_{X_{t}}$ , but it is not the parameter $\hat{\bm{\theta}}_{t}$ obtained by MLE using the updated dataset $S_{t}$ in general.

Figure 1: The model space $M_{X_{t-1}}$ corresponding to the design $X_{t-1}$ up to the step $t-1$ of active learning, and the extended space $M_{X_{t}}$ based on the new observation $\bm{x}_{t}$ .

2.2 Acquisition Function

The design of the acquisition function is one of the central issues in active learning studies. Several acquisition functions aim to quantify the difficulty of label prediction in some way and actively incorporate difficult-to-predict samples into learning. This approach of designing acquisition functions is called the uncertainty-based approach. As another approach, it is reasonable to sample inputs so as to reflect the distribution of explanatory variables, and methods that are based on the idea of annotating representative samples [27, 30] have been proposed. In addition, several methods have recently been proposed to learn acquisition functions according to the environment and data [24, 17, 33].

In this work, we consider the uncertainty-based approach. In particular, we adopt a simple and intuitive method for quantifying the uncertainty called the query by committee (QBC).

2.3 Query by Committee

One of the criteria for selecting new observations in active learning is the query by committee (QBC; [32]). This approach selects the sample on which multiple predictive models, called the committee, disagree the most relative to their consensus. Each time a new sample $\bm{x}\in\mathcal{X}_{p}$ , or query, is issued, committee members vote on the response $y$ for the query $\bm{x}$ . Various methods have been proposed and discussed in relation to ensemble learning [13] and the resulting reduction of version space [15]. As a simple representative method, the following procedure is proposed in [25]:

1. Learn $C$ predictive models with different parameters $\bm{\theta}_{c,t-1},c=1,\dots,C$ by, for example, Bagging [9].

2. Select a sample from the pool as $\bm{x}_{t}=\mathop{\rm arg~{}max}\limits_{\bm{x}\in\mathcal{X}_{p}}a_{0}(\bm{x})$ by using the acquisition function

$\displaystyle a_{0}(\bm{x})=\sum_{c=1}^{C}w_{c}D_{0}(p(Y|\xi_{c,t-1}(\bm{x})),p(Y|\bar{\xi}(\bm{x}))),$ (5)

where $D_{0}$ is the Kullback–Leibler (KL) divergence, $\bar{\xi}(\bm{x})$ is the consensus model parameter defined later, and $\xi_{c,t-1}(\bm{x})=\langle\bm{\theta}_{c,t-1},\bm{x}\rangle$ . Measure the response $y$ corresponding to the selected $\bm{x}_{t}$ and denote it as $y_{t}$ .

3. Update the training dataset $S_{t}=S_{t-1}\cup\{(\bm{x}_{t},y_{t})\}$ .

In Eq. (5), the divergences are mixed with the mixing weight $w_{c},c=1,\dots,C$ where $w_{c}\geq 0$ and $\sum_{c=1}^{C}w_{c}=1$ . The weight $w=(w_{1},\dots,w_{C})$ reflects the reliability of committee members, and can be fixed in advance or determined during the learning procedure for a committee member. In this work, we consider $w_{c}=1/C$ and fixed throughout the active learning process.

The consensus model parameter $\bar{\xi}(\bm{x})$ is defined by the minimizer

\displaystyle\bar{\xi}(\bm{x})=\mathop{\rm arg~{}min}\limits_{\xi(\bm{x})}\sum_{c=1}^{C}w_{c}D_{0}(p(Y|\xi(\bm{x})),p(Y|\xi_{c}(\bm{x})))

(6)

of the weighted sum of $e$ -projections from committee models $p(y|\xi_{c}(\bm{x}))=p(y|\bm{x};\bm{\theta}_{c})$ as shown in Figure 2. The minimizer (6) is called the $e$ -mixture of models [21, 34].

The consensus model is explicitly obtained by minimizing

\displaystyle\begin{aligned} J(\xi)=&\sum_{c=1}^{C}w_{c}D_{0}(p(Y|\xi(\bm{x})),p(Y|\xi_{c}(\bm{x})))=\sum_{c=1}^{C}w_{c}\mathbb{E}_{\xi}\left[\frac{Y\xi-Y\xi_{c}-\psi(\xi)+\psi(\xi_{c})}{\phi}\right]\\ =&\frac{1}{\phi}\sum_{c=1}^{C}w_{c}\{\psi^{\prime}(\xi)\xi-\psi^{\prime}(\xi)\xi_{c}-\psi(\xi)+\psi(\xi_{c})\},\end{aligned}

(7)

where we used $\psi^{\prime}(\xi)=\mathbb{E}_{\xi}[Y]$ . Then, solving $\frac{\partial J(\xi)}{\partial\xi}=\frac{1}{\phi}\psi^{\prime\prime}(\xi)\sum_{c=1}^{C}w_{c}(\xi-\xi_{c})=0$ with $\psi^{\prime\prime}(\xi)>0$ and $\sum_{c=1}^{C}w_{c}=1$ , we have $\bar{\xi}=\sum_{c=1}^{C}w_{c}\xi_{c}$ .
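The closed-form consensus parameter can be sanity-checked by evaluating the objective of Eq. (7) on a grid. The sketch below assumes the logistic cumulant function $\psi(\xi)=\log(1+e^{\xi})$ with $\phi=1$; the function names and the committee values in `xis` are hypothetical and serve only as an illustration.

```python
import math

def softplus(xi):
    """psi(xi) = log(1 + e^xi) for the logistic model (stable form)."""
    return max(xi, 0.0) + math.log1p(math.exp(-abs(xi)))

def sigmoid(xi):
    """psi'(xi) = E_xi[Y] for the logistic model (stable form)."""
    if xi >= 0:
        return 1.0 / (1.0 + math.exp(-xi))
    e = math.exp(xi)
    return e / (1.0 + e)

def consensus_objective(xi, xis, weights, psi=softplus, dpsi=sigmoid, phi=1.0):
    """J(xi) of Eq. (7): weighted KL divergences from p(.|xi) to p(.|xi_c)."""
    return sum(
        w * (dpsi(xi) * (xi - xc) - psi(xi) + psi(xc))
        for w, xc in zip(weights, xis)
    ) / phi

xis = [-1.0, 0.5, 2.0]              # hypothetical committee parameters xi_c
weights = [1.0 / len(xis)] * len(xis)

# Closed-form minimizer: the weighted average of the committee parameters.
xi_bar = sum(w * xc for w, xc in zip(weights, xis))

# Brute-force minimizer of J over a grid with step 0.01 on [-3, 3].
grid = [i / 100.0 for i in range(-300, 301)]
xi_grid = min(grid, key=lambda v: consensus_objective(v, xis, weights))
```

Because $\partial J/\partial\xi$ changes sign only at $\bar{\xi}$, the grid search and the closed form should agree up to the grid resolution.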

The acquisition function in QBC is defined by the summation of KL divergences from models $p(y|\xi_{c}(\bm{x}))$ to the consensus model $p(y|\bar{\xi}(\bm{x}))$ as depicted in Figure 3. Intuitively, the query with the most split votes would be worth querying the oracle because of the high uncertainty. To quantify the diversity of this vote, the divergence from the average model is considered. When the sum of the divergences from the mean is large, the individual committee member’s disagreement is considered to be large. In the original work [25], the acquisition function (5) is defined as the sum of the KL-divergences from committee members to the consensus model (6). It is also possible to consider the sum of the divergences from the consensus model to the committee members, which is used for defining the consensus model. In this paper, we only consider the case (6) following the original definition in [25], but the sum of the divergences from the consensus model to the committee members gives similar results.

Note that the maximization of Eq. (5) is often not a well-defined optimization problem when $\bm{x}$ freely takes any value in $\mathcal{X}$ . In fact, the domain of $\bm{x}$ is a set of candidate measurement points $\mathcal{X}_{p}$ given in advance, the so-called pool, and maximization is carried out by evaluating all the elements of this set. Therefore, there is no need for mathematical optimization, and the maximization is simply carried out by evaluating the acquisition function $a_{0}(\bm{x})$ .
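The pool-based selection with the KL acquisition function (5) can be sketched as follows. For the GLM (1), the same calculation as in Eq. (7) gives $D_{0}(p(Y|\xi_{c}),p(Y|\bar{\xi}))=\{\psi^{\prime}(\xi_{c})(\xi_{c}-\bar{\xi})-\psi(\xi_{c})+\psi(\bar{\xi})\}/\phi$, which is what `kl_glm` computes. This is an illustrative sketch for the logistic model with uniform weights; the function names, the committee coefficients `thetas`, and the pool are assumptions, and the committee is taken as given rather than trained by Bagging.

```python
import math

def softplus(xi):
    return max(xi, 0.0) + math.log1p(math.exp(-abs(xi)))

def sigmoid(xi):
    if xi >= 0:
        return 1.0 / (1.0 + math.exp(-xi))
    e = math.exp(xi)
    return e / (1.0 + e)

def kl_glm(xi_from, xi_to, psi, dpsi, phi=1.0):
    """KL divergence from p(.|xi_from) to p(.|xi_to) within the GLM (1)."""
    return (dpsi(xi_from) * (xi_from - xi_to) - psi(xi_from) + psi(xi_to)) / phi

def qbc_acquisition(x, thetas, psi=softplus, dpsi=sigmoid, phi=1.0):
    """a_0(x) of Eq. (5) with uniform weights w_c = 1 / C."""
    w = 1.0 / len(thetas)
    xis = [sum(t * v for t, v in zip(theta, x)) for theta in thetas]
    xi_bar = sum(w * xc for xc in xis)          # consensus parameter
    return sum(w * kl_glm(xc, xi_bar, psi, dpsi, phi) for xc in xis)

def select_query(pool, thetas):
    """Evaluate a_0 on every pool element and return the arg max index."""
    scores = [qbc_acquisition(x, thetas) for x in pool]
    best = max(range(len(pool)), key=scores.__getitem__)
    return best, scores

# Hypothetical committee of C = 3 coefficient vectors and a pool of 3 inputs.
thetas = [[1.0, 0.0], [-1.0, 0.0], [0.5, 0.0]]
pool = [[0.0, 1.0], [1.0, 0.0], [3.0, 0.0]]
best, scores = select_query(pool, thetas)
```

In this toy pool the first input gives identical predictions for all members, so its score is zero, while the third input spreads the canonical parameters the most and is selected.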

The consensus model and the acquisition function are originally defined on the basis of the KL divergence, which is vulnerable to outliers. In the QBC procedure explained above, committee models are trained by using a small amount of training data. If we use Bagging to construct various committee members, the situation would be worse, and it is highly possible that some of the committee members behave as outliers. To alleviate this problem, we consider using robust divergence measures instead of a standard KL divergence for computing the consensus model and defining the acquisition function.

3 Divergence Functions

A divergence function is an index that measures the discrepancy between two probability density functions. It plays a central role in integrating statistics, information theory, statistical physics, and machine learning with many other fields. The most popular divergence function is the KL divergence. In this section, we consider two classes of alternative divergences.

3.1 Bregman divergence

Let $U$ be a monotonically increasing convex function on $\mathbb{R}$ and $u$ be the derivative of $U$ . We define $U^{\ast}(\zeta)=\sup_{z\in\mathbb{R}}\{z\zeta-U(z)\}$ , that is, the Legendre transform of $U$ , and $u^{\ast}=u^{-1}$ as the derivative of $U^{\ast}$ . We consider transforming the function $f$ by $u^{\ast}(f)$ , and denote the transformed function as $\breve{f}=u^{\ast}(f)$ , which is called the $u$ -representation of the function $f$ . Then, the Bregman potential between two functions $f$ and $g$ is defined as

\displaystyle d_{U}(f,g)=U^{\ast}(f)+U(\breve{g})-f\breve{g},

(8)

and the Bregman divergence [8] is defined as

\displaystyle D_{U}(p,q)=\int d_{U}(p(y),q(y))\mathrm{d}\Lambda(y)=\int d_{U}(p,q)\mathrm{d}\Lambda,

(9)

where $p$ and $q$ are the probability density or probability mass functions. Note that we omit the integral variable $y$ for notational simplicity. Then, the $u$ -cross entropy and $u$ -entropy are defined as

\displaystyle C_{U}(p,q)=\int U(\breve{q})\mathrm{d}\Lambda-\int p\breve{q}\mathrm{d}\Lambda,

(10)

\displaystyle H_{U}(p)=\int U(\breve{p})\mathrm{d}\Lambda-\int p\breve{p}\mathrm{d}\Lambda,

(11)

respectively. Using these entropies, we define the Bregman divergence or the $u$ -divergence from $p$ to $q$ as

\displaystyle D_{U}(p,q)=C_{U}(p,q)-H_{U}(p).

(12)

The most popular convex function $U$ and its related functions for the Bregman divergence would be the exponential function, which leads to the Kullback–Leibler divergence where

\displaystyle\begin{aligned} U(z)&=\exp(z),&U^{\ast}(\zeta)&=\zeta(\log\zeta-1),\\ u(z)&=\exp(z),&u^{\ast}(\zeta)&=\log\zeta.\end{aligned}

(13)

The Euclidean distance is recovered with

\displaystyle\begin{aligned} U(z)&=\frac{1}{2}z^{2},&U^{\ast}(\zeta)&=\frac{1}{2}\zeta^{2},\\ u(z)&=z,&u^{\ast}(\zeta)&=\zeta.\end{aligned}

(14)

Other important examples include the $\eta$ -type with $\eta\geq 0$

\displaystyle\begin{aligned} U(z)&=\exp(z)+\eta z,&U^{\ast}(\zeta)&=(\zeta-\eta)\{\log(\zeta-\eta)-1\},\\ u(z)&=\exp(z)+\eta,&u^{\ast}(\zeta)&=\log(\zeta-\eta),\end{aligned}

(15)

and the $\beta$ -type with $\beta\geq 0$

\displaystyle\begin{aligned} U(z)&=\frac{1}{\beta+1}(\beta z+1)^{\frac{\beta+1}{\beta}},&U^{\ast}(\zeta)&=\frac{\zeta^{\beta+1}}{\beta(\beta+1)}-\frac{\zeta}{\beta},\\ u(z)&=(\beta z+1)^{1/\beta},&u^{\ast}(\zeta)&=\frac{\zeta^{\beta}-1}{\beta}.\end{aligned}

(16)

Both the $\eta$ -type and $\beta$ -type functions lead to robust estimators. In this work, we concentrate on the $\beta$ -type and only consider the $\beta$ -divergence

\displaystyle\begin{aligned} D_{\beta}(p,q)=&\frac{1}{\beta+1}\int q^{\beta+1}\mathrm{d}\Lambda-\frac{1}{\beta+1}\int p^{\beta+1}\mathrm{d}\Lambda-\frac{1}{\beta}\int p(q^{\beta}-p^{\beta})\mathrm{d}\Lambda\\ =&\frac{1}{\beta(1+\beta)}\int p^{\beta+1}\mathrm{d}\Lambda-\frac{1}{\beta}\int pq^{\beta}\mathrm{d}\Lambda+\frac{1}{\beta+1}\int q^{\beta+1}\mathrm{d}\Lambda\end{aligned}

(17)

as an instance of the Bregman divergence.
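For probability mass functions on a finite support (counting measure $\Lambda$), Eq. (17) reduces to finite sums and its two limiting cases are easy to check: $\beta=1$ gives half the squared Euclidean distance, and $\beta\to 0$ approaches the KL divergence for normalized $p$ and $q$. The sketch below is illustrative only; the function names and the example vectors `p` and `q` are assumptions.

```python
import math

def beta_divergence(p, q, beta):
    """Second line of Eq. (17) for mass functions on a finite support."""
    s_pp = sum(pi ** (beta + 1) for pi in p)
    s_pq = sum(pi * qi ** beta for pi, qi in zip(p, q))
    s_qq = sum(qi ** (beta + 1) for qi in q)
    return s_pp / (beta * (beta + 1)) - s_pq / beta + s_qq / (beta + 1)

def kl_divergence(p, q):
    """KL divergence D_0(p, q) for mass functions on a finite support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.2, 0.5, 0.3]   # hypothetical mass functions on three points
q = [0.4, 0.4, 0.2]

# beta = 1 recovers half the squared Euclidean distance.
half_sq_euclid = 0.5 * sum((pi - qi) ** 2 for pi, qi in zip(p, q))
d_beta_one = beta_divergence(p, q, 1.0)

# A small beta is close to the KL divergence.
d_beta_small = beta_divergence(p, q, 1e-4)
d_kl = kl_divergence(p, q)
```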

3.2 Dual $\gamma$ -power Divergence

A class of power divergences was proposed in [6], together with robust parameter estimation based on this class. It is shown to be robust to outliers, and its relationship with the pseudo-spherical score was investigated in [14, 22, 23].

The standard $\gamma$ -power divergence is given by

\displaystyle D_{\gamma}(p,q)=-\frac{\int pq^{\gamma}\mathrm{d}\Lambda}{(\int q^{\gamma+1}\mathrm{d}\Lambda)^{\frac{\gamma}{\gamma+1}}}+\Big{(}\int p^{\gamma+1}\mathrm{d}\Lambda\Big{)}^{\frac{1}{\gamma+1}},

(18)

which satisfies the scale invariance for the second argument as $D_{\gamma}(p,zq)=D_{\gamma}(p,q)$ for any constant $z>0$ .

Alternatively, we consider a dual $\gamma$-power divergence as

\displaystyle D^{*}_{\gamma}(p,q)=-\frac{\int pq^{\gamma}{\rm d}\Lambda}{(\int p^{\gamma+1}\mathrm{d}\Lambda)^{\frac{1}{\gamma+1}}}+\Big{(}\int q^{\gamma+1}\mathrm{d}\Lambda\Big{)}^{\frac{\gamma}{\gamma+1}}

(19)

for $p$ and $q$ of $\mathcal{P}$ . Note that $D^{*}_{\gamma}(zp,q)=D^{*}_{\gamma}(p,q)$ for any constant $z>0$ . This implies that, if $p$ and $q$ are density functions with a finite mass, then $D^{*}_{\gamma}(p,q)\geq 0$ with equality if and only if $p=zq$ for some constant $z>0$ . On the other hand, if $p$ and $q$ are in $\mathcal{P}$ , then $D^{*}_{\gamma}(p,q)=0$ means $p=q$ . It is worth noting that the dual $\gamma$-power divergence is closely related to the $\beta$-divergence defined in Eq. (17). Consider a scale adjustment as $\min_{v>0}D_{\beta}(vp,q)$ . We observe that

\displaystyle\frac{\partial}{\partial v}D_{\beta}(vp,q)={\frac{v^{\beta}}{\beta}}\int p^{\beta+1}\mathrm{d}\Lambda-\frac{1}{\beta}{\int pq^{\beta}\mathrm{d}\Lambda},

(20)

in which the minimizer is given by $v^{*}=\big{(}\frac{\int pq^{\beta}\mathrm{d}\Lambda}{\int p^{\beta+1}\mathrm{d}\Lambda}\big{)}^{\frac{1}{\beta}},$ so that we have the minimum

\displaystyle D_{\beta}(v^{*}p,q)=-\frac{1}{\beta+1}\bigg{[}\frac{\int pq^{\beta}\mathrm{d}\Lambda}{\big{(}\int p^{\beta+1}\mathrm{d}\Lambda\big{)}^{\frac{1}{\beta+1}}}\bigg{]}^{\frac{\beta+1}{\beta}}+{\frac{1}{\beta+1}}\int q^{\beta+1}\mathrm{d}\Lambda.

(21)

We note that this divergence is scale-invariant for the first argument. Since $D_{\beta}(v^{*}p,q)\geq 0$ , applying the power transformation $t\mapsto t^{\frac{\beta}{\beta+1}}$ to the two terms on the right side of Eq. (21) yields

\displaystyle\bigg{[}\int q^{\beta+1}\mathrm{d}\Lambda\bigg{]}^{\frac{\beta}{\beta+1}}\geq\frac{\int pq^{\beta}\mathrm{d}\Lambda}{\big{(}\int p^{\beta+1}\mathrm{d}\Lambda\big{)}^{\frac{1}{\beta+1}}}.

(22)

Accordingly, we get $D^{*}_{\gamma}(p,q)$ by the difference between both sides of (22) if $\beta=\gamma$ .
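The relationship between the scale-adjusted $\beta$-divergence and the dual $\gamma$-power divergence can be checked numerically on a finite support. The following sketch assumes the counting measure and uses hypothetical helper names (`_moments`, `scale_adjusted_beta`) and example vectors; it compares the closed form of Eq. (21) with a direct evaluation of $D_{\beta}(v^{*}p,q)$ and checks the scale invariance of Eq. (19) in its first argument.

```python
def _moments(p, q, power):
    """Sums of p^(power+1), p * q^power and q^(power+1) over the support."""
    s_pp = sum(pi ** (power + 1) for pi in p)
    s_pq = sum(pi * qi ** power for pi, qi in zip(p, q))
    s_qq = sum(qi ** (power + 1) for qi in q)
    return s_pp, s_pq, s_qq

def beta_divergence(p, q, beta):
    """Eq. (17); p need not be normalized here."""
    s_pp, s_pq, s_qq = _moments(p, q, beta)
    return s_pp / (beta * (beta + 1)) - s_pq / beta + s_qq / (beta + 1)

def dual_gamma_divergence(p, q, gamma):
    """Eq. (19)."""
    s_pp, s_pq, s_qq = _moments(p, q, gamma)
    return -s_pq / s_pp ** (1.0 / (gamma + 1)) + s_qq ** (gamma / (gamma + 1))

def scale_adjusted_beta(p, q, beta):
    """Minimizer v* of D_beta(v p, q) over v > 0 and the minimum of Eq. (21)."""
    s_pp, s_pq, s_qq = _moments(p, q, beta)
    v_star = (s_pq / s_pp) ** (1.0 / beta)
    ratio = s_pq / s_pp ** (1.0 / (beta + 1))
    minimum = (s_qq - ratio ** ((beta + 1) / beta)) / (beta + 1)
    return v_star, minimum

p = [0.2, 0.5, 0.3]   # hypothetical mass functions
q = [0.4, 0.4, 0.2]
beta = 0.5

v_star, minimum = scale_adjusted_beta(p, q, beta)
direct = beta_divergence([v_star * pi for pi in p], q, beta)
d_dual = dual_gamma_divergence(p, q, beta)
d_dual_scaled = dual_gamma_divergence([3.0 * pi for pi in p], q, beta)
```

Here `d_dual` is exactly the gap between the two sides of inequality (22) with $\gamma=\beta$, so it is nonnegative and unchanged when `p` is rescaled.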

4 Consensus model defined by robust divergences

The consensus model used in QBC is the model with parameter $\bar{\xi}$ defined in Eq. (6), which gives the minimum sum of the KL divergences from the consensus model $\bar{\xi}$ to the committee members $\xi_{c},c=1,\dots,C$ . The acquisition function (5) is also defined in terms of the KL divergence.

In the problem setting of active learning, the number of samples given initially can be very small. Predictive models fitted using a very small number of samples, which are possibly further reduced by splitting, are likely to be very inaccurate. In the case of a linear model, the fitted coefficients $\bm{\theta}_{c}$ can take extremely large values, so that $|\xi_{c}(\bm{x})|=|\langle\bm{\theta}_{c},\bm{x}\rangle|$ can be very large. This means that the parameter $\xi_{c}(\bm{x})$ behaves as an outlier in the construction of the consensus model or in the calculation of the acquisition function. Therefore, instead of the KL divergence, it is reasonable to consider a consensus model and an acquisition function using robust divergences such as the $\beta$ -divergence (Fig. 4), which has a limited effect on the inclusion of outliers. Stable behavior can be expected even in situations where the committee members are not reliable. Unfortunately, the consensus model based on the Bregman divergence is not well-defined in general due to the impossibility of normalization. For such cases, we also consider a dual $\gamma$ -power divergence, which provides an explicit consensus model that is well defined irrespective of the distributions to be mixed.

4.1 Consensus model with Bregman divergence

It is nontrivial how to calculate the consensus model with the Bregman divergence defined by using a $\beta$ -type convex function. However, the following theorem provides a way to achieve the minimizer of the sum of Bregman divergences.

Theorem 1 (Characterization of $u$ -mixture [26])

Let $w=(w_{1},\dots,w_{C})\in\Delta_{C}$ be an element of the probability simplex $\Delta_{C}$ . For probability density functions or probability mass functions $p_{c}(y),c=1,\dots,C$ , consider

\displaystyle A_{U}(q;w)=\sum_{c=1}^{C}w_{c}D_{U}(q,p_{c}).

(23)

Then,

\displaystyle\mathop{\rm arg~{}min}\limits_{q}A_{U}(q;w)=p_{u}(y;w),

(24)

where

\displaystyle p_{u}(y;w)=u\left(\sum_{c=1}^{C}w_{c}\breve{p}_{c}(y)-b\right).

(25)

Proof 1

Proof of this theorem is shown in Appendix A.1.

The model $p_{u}(y;w)$ is called the $u$ -mixture of $p_{c}(y),c=1,\dots,C$ associated with weight $w$ . The constant $b$ is a normalizing factor so that $p_{u}$ is a valid probability density or mass function.

Example 3 (KL-divergence and geometric mean)

When we consider the KL-divergence as an instance of the $u$ -divergence, the geometric mean of $p_{c}(y)$ with the weights $w_{c}$ defined by

\displaystyle\bar{p}_{G}(y;w)=e^{-b(w)}\prod_{c=1}^{C}p_{c}(y)^{w_{c}},

(26)

where $b(w)=\log\int\prod_{c=1}^{C}p_{c}(y)^{w_{c}}dy$ , minimizes the weighted average $A_{0}(p)=\sum_{c=1}^{C}w_{c}D_{0}(p,p_{c})$ because $A_{0}(p)-A_{0}(\bar{p}_{G})=D_{0}(p,\bar{p}_{G})\geq 0.$

Example 4 (Euclidean distance and arithmetic mean)

It is also straightforward to show that the arithmetic mean

\displaystyle\bar{p}_{A}(y;w)=\sum_{c=1}^{C}w_{c}p_{c}(y)

(27)

is the minimizer of the weighted sum corresponding to the Euclidean distance.

To be concrete, for the $\beta$ -divergence,

\displaystyle\begin{aligned} &\sum_{c=1}^{C}w_{c}\breve{p}(y|\xi_{c}(\bm{x}))-b(\bm{x})\\ =&\frac{1}{\beta}\sum_{c=1}^{C}w_{c}\exp\left\{\beta\frac{y\xi_{c}(\bm{x})-\psi(\xi_{c}(\bm{x}))}{\phi}+\beta c(y,\phi)\right\}-\frac{1}{\beta}-b(\bm{x}),\end{aligned}

(28)

hence we have

\displaystyle\begin{aligned} p_{u}(y|\bm{x};w)=&u\left(\sum_{c=1}^{C}w_{c}\breve{p}(y|\xi_{c}(\bm{x}))-b(\bm{x})\right)\\ =&\left[\sum_{c=1}^{C}w_{c}\exp\left\{\beta\frac{y\xi_{c}(\bm{x})-\psi(\xi_{c}(\bm{x}))}{\phi}+\beta c(y,\phi)\right\}-\beta b(\bm{x})\right]^{\frac{1}{\beta}}.\end{aligned}

(29)

We note that in this expression, the consensus model does not have an explicit consensus parameter $\bar{\xi}$ ; therefore, we will denote the consensus model as $p_{u}(y|\bm{x};w)$ instead of $p(y|\bar{\xi}(\bm{x});w)$ henceforth. This expression contains a normalization factor $b(\bm{x})$ , which depends on the input variable $\bm{x}$ ; therefore, the analytical calculation is prohibitive.
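Although $b(\bm{x})$ has no closed form, it can be found numerically when the response has a finite support. The sketch below assumes a logistic committee with $y\in\{0,1\}$ and solves for the constant $\beta b(\bm{x})$ in Eq. (29) by bisection; the function names, the bracketing interval, and the committee values (including one extreme member) are assumptions made for illustration, not the authors' procedure.

```python
import math

def sigmoid(xi):
    if xi >= 0:
        return 1.0 / (1.0 + math.exp(-xi))
    e = math.exp(xi)
    return e / (1.0 + e)

def bernoulli_pmf(xi):
    """[p(y=0 | xi), p(y=1 | xi)] for the logistic model."""
    p1 = sigmoid(xi)
    return [1.0 - p1, p1]

def beta_mixture(pmfs, weights, beta, iters=200):
    """u-mixture of Eq. (29) on the two-point support {0, 1}.

    Returns the normalized mixture and the constant t = beta * b(x).
    """
    s = [sum(w * pm[y] ** beta for w, pm in zip(weights, pmfs)) for y in range(2)]

    def excess(t):
        # Total mass minus one; decreasing in t.
        return sum((sy - t) ** (1.0 / beta) for sy in s) - 1.0

    # On a two-point support, excess(-1) > 0 and excess(min(s)) < 0
    # unless every committee member is degenerate.
    lo, hi = -1.0, min(s)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if excess(mid) > 0:
            lo = mid
        else:
            hi = mid
    t = 0.5 * (lo + hi)
    return [(sy - t) ** (1.0 / beta) for sy in s], t

xis = [0.5, 1.0, -0.5, 50.0]          # hypothetical xi_c(x); the last is extreme
weights = [0.25] * 4
mixture, beta_b = beta_mixture([bernoulli_pmf(v) for v in xis], weights, beta=0.5)

# KL consensus for comparison: p(y=1) at the averaged canonical parameter.
kl_consensus_p1 = sigmoid(sum(w * v for w, v in zip(weights, xis)))
```

In this toy committee the KL consensus is dragged almost entirely to $y=1$ by the single extreme member, whereas the $\beta$-mixture stays close to the opinion of the other three members.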

We characterize the robustness of the $u$ -mixture in terms of the influence function [29]. Consider the mixture of models $p_{c}(y)=p(y;\xi_{c})$ with respect to the $u$ -divergences with a weight $w_{c}$ . We will omit $\bm{x}$ for notational simplicity. Then, we define an $\epsilon$ -contamination weight operator $W_{\epsilon}$

\displaystyle W_{\epsilon}[p]=\int\left\{(1-\epsilon)\sum_{c=1}^{C}w_{c}\delta(\xi-\xi_{c})+\epsilon\delta(\xi-\xi_{\mathrm{out}})\right\}p(y;\xi)\mathrm{d}\xi,

(30)

where $\delta$ denotes the Dirac measure degenerated at zero. For notational simplicity, we write $p(y;w(\epsilon))=W_{\epsilon}[p]$ .

The outlier $\xi_{\mathrm{out}}$ is a parameter value far from other committee members $\xi_{c}$ . In GLM, the parameter $\xi$ is composed of $\bm{\theta}$ and $\bm{x}$ ; hence there are two possible reasons for the outlying point. One is the outlying input point $\bm{x}$ , and the other is the outlying regression coefficient $\bm{\theta}$ . In this work, the pool $\mathcal{X}_{p}$ is fixed and $\|\bm{x}\|$ is bounded, but it is meaningful to consider the situation that the parameter $\xi=\langle\bm{\theta},\bm{x}\rangle$ takes a very large value. The regression coefficient $\bm{\theta}$ is, in principle, not bounded, so $\xi$ can take an arbitrarily large value.

With this $\epsilon$ -contamination weight operator, we replace the weighted sum operation such as $(1-\epsilon)\sum_{c=1}^{C}w_{c}p(y;\xi_{c})+\epsilon p(y;\xi_{\mathrm{out}})$ by $W_{\epsilon}[p]=p(y;w(\epsilon))$ , which is reduced to $\sum_{c=1}^{C}w_{c}p(y;\xi_{c})$ when $\epsilon=0$ .

We focus on the contamination of the outlier model $p(y;\xi_{\mathrm{out}})$ in the mixture and consider its influence on the minimizer $p_{u}(y;w)$ of $A_{U}(q;w)$ .

Definition 1 (Influence function of $u$ -mixture)

The influence function of the minimizer $p_{u}(y;w)$ of the weighted sum of the $u$ -divergences $A_{U}(q;w)=\sum_{c=1}^{C}w_{c}D_{U}(q,p_{c})$ for the outlier $\xi_{\mathrm{out}}$ is defined as

\displaystyle\mathrm{IF}(p_{u}(y;w),\xi_{\mathrm{out}})=\lim_{\epsilon\to 0}\frac{p_{u}(y;w(\epsilon))-p_{u}(y;w)}{\epsilon}.

(31)

With this influence function, we can characterize the robustness of the $u$ -mixture $p_{u}(y;w)$ of distributions in the exponential family.

Proposition 2

The influence function of the $u$ -mixture of exponential family distributions is unbounded when the $u$ -divergence is the Kullback–Leibler divergence. The influence function of the $u$ -mixture of exponential family distributions is bounded when the $u$ -divergence is the $\beta$ -divergence with $\beta>0$ .

Proof 2

Proof of this proposition is shown in Appendix A.2.
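The contrast stated in Proposition 2 can be illustrated numerically. The sketch below assumes a Bernoulli model with natural parameter $\xi$ (so that $p(1;\xi)$ is the logistic function) and the special case $\beta=1$, for which the $\beta$-mixture of normalized Bernoulli distributions is the arithmetic mean with $b=0$. The influence function is approximated by a forward finite difference; the committee parameters, the step size `eps`, and all function names are illustrative assumptions.

```python
import math


def sigmoid(xi):
    """Numerically stable Bernoulli mean p(y=1; xi) for natural parameter xi."""
    if xi >= 0:
        return 1.0 / (1.0 + math.exp(-xi))
    e = math.exp(xi)
    return e / (1.0 + e)


def kl_consensus(xis, weights, xi_out, eps):
    """p(y=1) of the KL consensus: the model at the contaminated mean parameter."""
    xi_bar = (1.0 - eps) * sum(w * x for w, x in zip(weights, xis)) + eps * xi_out
    return sigmoid(xi_bar)


def beta1_consensus(xis, weights, xi_out, eps):
    """p(y=1) of the beta = 1 mixture: the contaminated arithmetic mean of p_c."""
    clean = sum(w * sigmoid(x) for w, x in zip(weights, xis))
    return (1.0 - eps) * clean + eps * sigmoid(xi_out)


def influence(consensus, xis, weights, xi_out, eps=1e-6):
    """Forward finite-difference approximation of the influence function."""
    base = consensus(xis, weights, xi_out, 0.0)
    return (consensus(xis, weights, xi_out, eps) - base) / eps


xis = [-0.5, 0.0, 0.5]          # hypothetical committee parameters
weights = [1.0 / 3.0] * 3

for xi_out in (10.0, 100.0, 1000.0):
    if_kl = influence(kl_consensus, xis, weights, xi_out)
    if_b1 = influence(beta1_consensus, xis, weights, xi_out)
    print(f"xi_out={xi_out:7.1f}  IF_KL={if_kl:9.3f}  IF_beta1={if_b1:6.3f}")
```

In this toy setting the KL-based value keeps growing with $\xi_{\mathrm{out}}$, whereas the $\beta=1$ value stays within $[-1,1]$ because it is a difference of two probabilities.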

4.2 Consensus model with dual $\gamma$ -power divergence

If the domain of $u^{\ast}$ is restricted to a subset of $\mathbb{R}$ , then there does not exist such a normalizing constant $b$ in Eq. (25). For example, when $D_{U}$ is the $\beta$ -power divergence, the minimizer is written as

\displaystyle p_{u}(y;w)=\Big{(}\sum_{c=1}^{C}w_{c}p_{c}(y)^{\beta}-b\Big{)}^{\frac{1}{\beta}}.

(32)

Suppose that $p_{c}(y)>0$ for all $y\in\mathbb{R}$, so that $\lim_{|y|\rightarrow\infty}p_{c}(y)=0$. Then $\lim_{|y|\rightarrow\infty}p_{u}(y;w)=(-b)^{\frac{1}{\beta}}$, which requires $\beta>0$ and $b\leq 0$. If $b<0$, then $p_{u}(y;w)$ is not integrable because $p_{u}(y;w)\geq(-b)^{\frac{1}{\beta}}>0$ on all of $\mathbb{R}$. Note that this problem occurs unless the support of $p_{c}(y)$ is finite-discrete.

To solve the problem above, we consider the dual $\gamma$ -power divergence $D^{\ast}_{\gamma}(p,q)$ defined in (19). We introduce a simple result for the minimum dual $\gamma$ -mixture.

Proposition 3

Let $A_{\gamma}(q)=\sum_{c}w_{c}D^{*}_{\gamma}(q,p_{c})$, where $D_{\gamma}^{*}(q,p_{c})$ is defined in (19). Then the minimizer of $A_{\gamma}(q)$ over $q\in\mathcal{P}$ is given by

\displaystyle p_{\gamma}(y;w)=\frac{1}{z(w)}\Big{(}\sum_{c=1}^{C}w_{c}p_{c}(y)^{\gamma}\Big{)}^{\frac{1}{\gamma}},

(33)

where

\displaystyle{z(w)}=\int\Big{(}\sum_{c=1}^{C}w_{c}p_{c}(y)^{\gamma}\Big{)}^{\frac{1}{\gamma}}d\Lambda(y).

(34)

Proof 3

Proof of this proposition is shown in Appendix A.3.

The model $p_{\gamma}(y;w)$ is called the dual $\gamma$ -mixture of $p_{c}(y),c=1,\dots,C$ associated with weight $w$ .
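For distributions on a finite set, the dual $\gamma$-mixture of Eqs. (33) and (34) can be computed directly, since the normalizing factor $z(w)$ is a finite sum. The following sketch assumes the counting measure for $\Lambda$ and $\gamma>0$; the function name and the example committee are hypothetical.

```python
def dual_gamma_mixture(probs, weights, gamma):
    """Dual gamma-mixture of discrete distributions, Eq. (33) with counting measure.

    probs[c][y] is p_c(y); the result is a list indexed by y.
    """
    if gamma <= 0:
        raise ValueError("this sketch assumes gamma > 0")
    n_outcomes = len(probs[0])
    unnormalized = [
        sum(w * p[y] ** gamma for w, p in zip(weights, probs)) ** (1.0 / gamma)
        for y in range(n_outcomes)
    ]
    z = sum(unnormalized)  # normalizing factor z(w), Eq. (34)
    return [u / z for u in unnormalized]


# Hypothetical committee of three Bernoulli models, listed as [p(y=0), p(y=1)].
probs = [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8]]
weights = [0.5, 0.3, 0.2]

mix_half = dual_gamma_mixture(probs, weights, gamma=0.5)
mix_one = dual_gamma_mixture(probs, weights, gamma=1.0)  # arithmetic mean of the p_c
```

For $\gamma=1$ the power and the root cancel, so the mixture reduces to the weighted arithmetic mean of the committee distributions.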

We now focus on the behaviors of the minimizers discussed above. For this, it is assumed that $p_{c}(y)=p(y;\xi_{c})$ , where $p(y;\xi)=\exp\Big{\{}\frac{y\xi-\psi(\xi)}{\phi}+c(y,\phi)\Big{\}}.$

Definition 2 (Influence function of dual $\gamma$ -mixture)

The influence function of the minimizer $p_{\gamma}(y;w)$ of the weighted sum of the dual $\gamma$ -divergences $A_{\gamma}(q;w)=\sum_{c=1}^{C}w_{c}D_{\gamma}^{\ast}(q,p_{c})$ for the outlier $\xi_{\mathrm{out}}$ is defined as

\displaystyle\mathrm{IF}(p_{\gamma}(y;w),\xi_{\mathrm{out}})=\lim_{\epsilon\to 0}\frac{p_{\gamma}(y;w(\epsilon))-p_{\gamma}(y;w)}{\epsilon}.

(35)

Proposition 4

The influence function of the dual $\gamma$ -mixture of exponential family distributions is bounded when $\gamma>0$ .

Proof 4

Proof of this proposition is shown in Appendix A.4.
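A numerical illustration of Proposition 4 is sketched below for a Bernoulli committee. The contaminated dual $\gamma$-mixture is evaluated as in Eq. (49) with the counting measure, and the influence on $p_{\gamma}(y=1;w)$ is approximated by a forward finite difference. The committee parameters, $\gamma=0.5$, the step size, and the function names are assumptions made only for this example.

```python
import math


def bernoulli(xi):
    """Return [p(y=0; xi), p(y=1; xi)] for the Bernoulli natural parameter xi."""
    p1 = 1.0 / (1.0 + math.exp(-xi)) if xi >= 0 else math.exp(xi) / (1.0 + math.exp(xi))
    return [1.0 - p1, p1]


def contaminated_gamma_mixture(xis, weights, xi_out, eps, gamma):
    """Dual gamma-mixture under the eps-contaminated weight, cf. Eq. (49)."""
    members = [bernoulli(x) for x in xis]
    outlier = bernoulli(xi_out)
    raw = []
    for y in (0, 1):
        clean = sum(w * p[y] ** gamma for w, p in zip(weights, members))
        raw.append(((1.0 - eps) * clean + eps * outlier[y] ** gamma) ** (1.0 / gamma))
    z = sum(raw)
    return [r / z for r in raw]


def influence_y1(xis, weights, xi_out, gamma, eps=1e-6):
    """Finite-difference influence on p_gamma(y=1; w)."""
    base = contaminated_gamma_mixture(xis, weights, xi_out, 0.0, gamma)[1]
    pert = contaminated_gamma_mixture(xis, weights, xi_out, eps, gamma)[1]
    return (pert - base) / eps


xis = [-0.5, 0.0, 0.5]          # hypothetical committee parameters
weights = [1.0 / 3.0] * 3
values = [influence_y1(xis, weights, xi_out, gamma=0.5) for xi_out in (10.0, 100.0, 1000.0)]
```

In this example the entries of `values` level off as $\xi_{\mathrm{out}}$ grows, in line with the existence of a limit stated in the proposition.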

5 Acquisition function with robust divergences

In [25], the acquisition function is defined as

\displaystyle a_{0}(\bm{x};w)=\sum_{c=1}^{C}w_{c}D_{0}(p(Y;\xi_{c}(\bm{x})),p(Y;\bar{\xi}(\bm{x}))),

(36)

where the weight $w$ is explicitly denoted to consider the effect of an outlying committee member. Here, we also consider acquisition functions based on the $\beta$ -divergence and the dual $\gamma$ -power divergence given by

\displaystyle a_{\beta}(\bm{x};w)=\sum_{c=1}^{C}w_{c}D_{\beta}(p(Y;\xi_{c}(\bm{x})),p_{\beta}(Y|\bm{x}))

(37)

and

\displaystyle a_{\gamma}(\bm{x};w)=\sum_{c=1}^{C}w_{c}D_{\gamma}^{\ast}(p(Y;\xi_{c}(\bm{x})),p_{\gamma}(Y|\bm{x})),

(38)

where $p_{\beta}$ and $p_{\gamma}$ are the consensus models obtained with respect to the $\beta$ -divergence and the dual $\gamma$ -power divergence, respectively.
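The acquisition functions $a_{0}$ and $a_{\gamma}$ can be sketched for a Bernoulli committee at a single input point as follows. The code assumes the counting measure, takes the KL consensus to be the model at the mean parameter $\bar{\xi}=\sum_{c}w_{c}\xi_{c}$, and writes the dual $\gamma$-power divergence in the form that appears as the summand of Eq. (58). All function names and parameter values are hypothetical.

```python
import math


def bernoulli(xi):
    """Return [p(y=0; xi), p(y=1; xi)] for the Bernoulli natural parameter xi."""
    p1 = 1.0 / (1.0 + math.exp(-xi)) if xi >= 0 else math.exp(xi) / (1.0 + math.exp(xi))
    return [1.0 - p1, p1]


def kl(p, q):
    """Kullback-Leibler divergence D_0(p, q) between discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def dual_gamma_div(p, q, gamma):
    """Dual gamma-power divergence, written as in the summand of Eq. (58)."""
    cross = sum(pi * qi ** gamma for pi, qi in zip(p, q))
    norm_p = sum(pi ** (gamma + 1) for pi in p) ** (1.0 / (gamma + 1))
    norm_q = sum(qi ** (gamma + 1) for qi in q) ** (gamma / (gamma + 1))
    return -cross / norm_p + norm_q


def acquisition_kl(xis, weights):
    """a_0: weighted KL from each member to the model at the mean parameter."""
    xi_bar = sum(w * x for w, x in zip(weights, xis))
    consensus = bernoulli(xi_bar)
    return sum(w * kl(bernoulli(x), consensus) for w, x in zip(weights, xis))


def acquisition_gamma(xis, weights, gamma):
    """a_gamma: weighted dual gamma-power divergence to the dual gamma-mixture."""
    members = [bernoulli(x) for x in xis]
    raw = [sum(w * p[y] ** gamma for w, p in zip(weights, members)) ** (1.0 / gamma)
           for y in (0, 1)]
    consensus = [r / sum(raw) for r in raw]
    return sum(w * dual_gamma_div(p, consensus, gamma) for w, p in zip(weights, members))


weights = [1.0 / 3.0] * 3
agree = [0.3, 0.3, 0.3]        # members predict identically at this input
disagree = [-2.0, 0.0, 2.0]    # members disagree at this input
```

Both acquisition values vanish, up to rounding, when the members agree, and are positive when they disagree, which is the behavior the committee-based selection relies on.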

Denoting the $\epsilon$-contaminated acquisition function by $a(\bm{x};w(\epsilon))$, we define the influence function for the acquisition function $a(\bm{x};w)$ as follows.

Definition 3 (Influence function of acquisition function)

The influence function of the acquisition function for the outlier $\xi_{\mathrm{out}}$ is defined as

\displaystyle\mathrm{IF}(a(\bm{x};w),\xi_{\mathrm{out}})=\lim_{\epsilon\to 0}\frac{a(\bm{x};w(\epsilon))-a(\bm{x};w)}{\epsilon},

(39)

where $a(\bm{x};w(\epsilon))=W_{\epsilon}[D(p(y;\xi),p(y;W_{\epsilon}[\xi]))]$.

Then we have the following proposition.

Proposition 5

The influence function for $a_{0}$ is not bounded, whereas those for $a_{\beta}$ and $a_{\gamma}$ are bounded with respect to an outlier $\xi_{\mathrm{out}}$ .

Proof 5

Proof of this proposition is shown in Appendix A.5.

This proposition claims that, as is the case for the consensus model, the acquisition function based on the KL divergence is vulnerable to an outlier in the committee, whereas those based on the $\beta$-divergence and the dual $\gamma$-power divergence are robust.

6 Experiments

Logistic regression is used as the predictive model. The model is first fitted using the initial training dataset $S_{0}$. Then, 10 logistic regression models are trained on 10 partitions of the labeled data at hand and used as committee members with weights $w_{c}=1/10,\;c=1,\dots,10$. Consensus models are constructed on the basis of the KL divergence, the $\beta$-divergence with $\beta=1.0$, and the dual $\gamma$-power divergence with $\gamma=1.0$, and the corresponding acquisition function is used to select one datum from the pool dataset $\mathcal{X}_{p}$. The correct label is assigned to the selected sample, which is added to the training data, and the predictive model is retrained using the extended training dataset. As a baseline, we also compare the results with those obtained by replacing the selection by the acquisition function with random sampling.
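The query-selection step of this procedure can be sketched as follows for the $\beta$-divergence with $\beta=1$. In that case $D_{\beta}(p,q)=\frac{1}{2}\int(p-q)^{2}\mathrm{d}\Lambda$ and the consensus model is the arithmetic mean of the committee predictions, so for two classes the acquisition value is the weighted variance of the predicted probabilities. The committee coefficients, pool points, and function names below are hypothetical; training the committee is outside the scope of this sketch.

```python
import math


def predict_proba(theta, x):
    """Logistic regression probability p(y=1 | x) with xi = <theta, x>."""
    xi = sum(t * v for t, v in zip(theta, x))
    return 1.0 / (1.0 + math.exp(-xi)) if xi >= 0 else math.exp(xi) / (1.0 + math.exp(xi))


def acquisition_beta1(thetas, weights, x):
    """a_beta at beta = 1: weighted half squared distance to the mean prediction."""
    p1 = [predict_proba(theta, x) for theta in thetas]
    mean_p1 = sum(w * p for w, p in zip(weights, p1))
    # For two classes, (1/2) * sum_y (p_c(y) - mean(y))^2 = (p_c(1) - mean(1))^2.
    return sum(w * (p - mean_p1) ** 2 for w, p in zip(weights, p1))


def select_query(thetas, weights, pool):
    """Return the index of the pool point with the largest acquisition value."""
    scores = [acquisition_beta1(thetas, weights, x) for x in pool]
    return max(range(len(pool)), key=scores.__getitem__)


# Hypothetical committee of three coefficient vectors and a pool of three points.
thetas = [[1.0, 0.0, 0.0], [-1.0, 0.0, 0.0], [0.5, 0.0, 0.0]]
weights = [1.0 / 3.0] * 3
pool = [[0.0, 0.0, 1.0], [2.0, 0.0, 0.0], [0.1, 5.0, 0.0]]

query_index = select_query(thetas, weights, pool)
```

The selected point is the one on which the committee members disagree most; in an active learning loop its label would then be queried and the models retrained.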

The following one artificial dataset and six real-world datasets from the LIBSVM datasets (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/) are considered.

1. artificial: It is an artificially generated three-dimensional dataset for two-class classification. The samples are drawn from $\mathcal{N}(\bm{\mu}_{i},\Sigma),i=0,1$, where $\bm{\mu}_{0}=(1,1,1)$ and $\bm{\mu}_{1}=-\bm{\mu}_{0}$, and $\Sigma=I_{3}$ (unit matrix). The pooled dataset is composed of $1,000$ data points for each class. The prediction error is evaluated by using $50,000$ data points for each class generated in the same manner as the training dataset. By three sampling methods, we sequentially selected $100$ samples.

2. adult: The original number of attributes is 123 and is reduced to three by PCA. 100 samples are sequentially selected by three sampling methods. The original sample has 48,842 data points with 37,155 positives and 11,687 negatives.

3. breast-cancer: The original number of attributes is 9 and is reduced to three by PCA. 100 samples are sequentially selected by three sampling methods. The original sample has 4,000 data points with 2,839 positives and 1,161 negatives.

4. diabetis: The original number of attributes is 8 and is reduced to three by PCA. 100 samples are sequentially selected by three sampling methods. The original sample has 9,360 data points with 6,078 positives and 3,282 negatives.

5. mushrooms: The original number of attributes is 112 and is reduced to three by PCA. 100 samples are sequentially selected by three sampling methods. The original sample has 8,124 data points with 4,208 positives and 3,916 negatives.

6. ijcnn: The original number of attributes is 22 and is reduced to three by PCA. 100 samples are sequentially selected by three sampling methods. The original sample has 35,000 data points with 31,585 positives and 3,415 negatives.

7. titanic: The original number of attributes is three. 100 samples are sequentially selected by three sampling methods. The original sample has 3,000 data points with 2,009 positives and 991 negatives.
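The artificial dataset described in the first item can be generated along the following lines. This is a sketch using only the Python standard library; the seed, the function name, and the list-of-pairs data layout are assumptions, and the same function could be called with a larger `n_per_class` to produce an evaluation set.

```python
import random


def make_artificial(n_per_class, seed=0):
    """Draw n_per_class points per class from N(mu_i, I_3), mu_0 = (1,1,1) = -mu_1."""
    rng = random.Random(seed)
    mu0 = (1.0, 1.0, 1.0)
    data = []
    for label, sign in ((0, 1.0), (1, -1.0)):
        for _ in range(n_per_class):
            # Identity covariance: each coordinate is an independent unit Gaussian.
            x = [sign * m + rng.gauss(0.0, 1.0) for m in mu0]
            data.append((x, label))
    rng.shuffle(data)
    return data


pool = make_artificial(1000)  # pool with 1,000 points per class
```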

The initial number of data points is 50 for each class. For the real-world datasets, the pool and test datasets are generated as follows. Keeping the initial training dataset with 100 data points, let the number of data points in the minority class be $m$. As the pool dataset, $m$ data points are randomly drawn from each of the positive and negative datasets, and the remaining data points are used for evaluating the prediction error. Averages and standard deviations over 10 trials, each with a random draw of the initial data, random sampling, and random splitting of the labeled data for training the committee members, are shown in Figure 5.

From the left panels of Figs. 5 and 6, it is seen that the consensus models and the acquisition functions derived from the $\beta$-divergence and the dual $\gamma$-power divergence provide smaller prediction errors in many cases. The conventional method with the KL divergence is not stable in the early stage of active learning, whereas the proposed methods select better samples to measure, which can be attributed to their robustness to a poor committee member in the early stage of active learning.

We also show the values of the influence function at queried points in the right panels of Figs. 5 and 6. As an outlying point, we take the input point $\bm{x}\in\mathcal{X}_{p}$ from the pool that maximizes the absolute value of $\xi=\langle\bm{\theta},\bm{x}\rangle$, where $\bm{\theta}$ is the regression coefficient of the current predictive model. From these figures, it is seen that the value of the influence function for the method based on the KL divergence is higher than those for the other two methods, which is consistent with the theoretical result. It is also seen that, for the adult data, the method based on the $\beta$-divergence does not perform well in terms of error rate, and its influence function values are as high as those of the KL divergence-based method.
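The choice of the outlying point described above amounts to an arg-max over the pool. A minimal sketch, with a hypothetical coefficient vector and pool, is:

```python
def most_extreme_point(theta, pool):
    """Index of the pool point maximizing |<theta, x>|."""
    def abs_xi(x):
        return abs(sum(t * v for t, v in zip(theta, x)))
    return max(range(len(pool)), key=lambda i: abs_xi(pool[i]))


# Hypothetical coefficient of the current model and a small pool.
theta = [1.0, -2.0, 0.5]
pool = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0], [-3.0, 2.0, 0.0], [0.5, 0.0, 4.0]]
outlier_index = most_extreme_point(theta, pool)
```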

7 Conclusion and Discussion

In this paper, we revisited query by committee (QBC), a classical active learning method. By replacing the KL divergence with the $\beta$-divergence or the dual $\gamma$-power divergence, we obtained favorable performance both theoretically and experimentally.

The $u$ -mixture for distributions in an exponential family does not have a closed form solution in general; hence, we used an implicit characterization of the $u$ -mixture proved in [26]. We also considered the dual $\gamma$ -power mixture based on a scale-invariant divergence. It is proven that it has a similar form to the $u$ -mixture, but it has an advantage that the normalization factor is always computable; hence, the dual $\gamma$ -power mixture is always defined unlike the $u$ -mixture. When we fix a divergence measure for probability distributions and impose a constant mean condition, we obtain a class of the maximum entropy distributions [12]. For the KL-divergence, the exponential family is derived as a class of maximum entropy distributions. On the other hand, when we adopt the $u$ -divergence, the associated maximum $u$ -entropy distribution is of a different form compared with the case of the exponential family, and the consensus model derived using the $u$ -mixture of such a class of distributions is easy to obtain by the arithmetic average of the model parameters, as in the case of the consensus model parameter $\bar{\bm{\theta}}$ for the conventional QBC. In other words, the statistical model and the estimation procedure have a dualistic structure in combination, and the use of the $u$ -mixture of the exponential family distributions is in this sense an unnatural procedure. However, this inconsistency of the model and estimation is a source of robustness [14]. In our future work, we will explore a more detailed geometric characterization of active learning based on, for example, Pythagoras foliation associated with the Bregman divergence.

The problem of model selection, namely, the appropriate choice of the parameter $\beta$ or $\gamma$, is an interesting open problem. In the literature on robust regression, the optimization of the parameter $\gamma$ for the $\gamma$-power divergence is considered in [28] via the notions of asymptotic efficiency and breakdown point using the theory of S-estimation. Methods to select an appropriate parameter for the robust divergence used in active learning should be explored in future work. Another important issue is the adaptive selection of divergence measures. After a sufficient sample has been collected, the KL divergence-based method will work well. Even robust divergence-based methods may fail, as in the case of the $\beta$-divergence-based method on the adult data. Furthermore, since a robust divergence has a tuning parameter, its adjustment also affects the performance of active learning. One possible strategy is to collect samples using highly robust methods and parameters in the early stages of active learning and then switch to the KL divergence-based method at a later stage. In this case, criteria regarding the number of samples to be collected and the selection of the divergence are necessary and should be considered in conjunction with the parameter selection issue.

Acknowledgments

Part of this work is supported by JSPS KAKENHI Nos. JP18H03211 and JP22H03653, NEDO JPNP18002, and JST CREST No. JPMJCR2015.

Data Availability

Source code is available upon reasonable request.

Appendix A Proofs

A.1 Proof of Theorem 1

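For the sake of notational simplicity, we write the integral of a function $f$ as $\langle f\rangle=\int f(y)\mathrm{d}\Lambda(y)$ and, similarly, $\langle f,g\rangle=\int f(y)g(y)\mathrm{d}\Lambda(y)$ for a pair of functions. Then, using $\breve{p}_{u}=\sum_{c}w_{c}\breve{p}_{c}-b$ and $\langle q\rangle=1$, we have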

\displaystyle\begin{aligned}
A_{U}(q;w)&=\sum_{c}w_{c}\{\langle U(\breve{p}_{c})\rangle-\langle U(\breve{q})\rangle-\langle q,\breve{p}_{c}-\breve{q}\rangle\}\\
&=\sum_{c}w_{c}\langle U(\breve{p}_{c})\rangle-\langle U(\breve{q})\rangle-\Big\langle q,\sum_{c}w_{c}\breve{p}_{c}-\breve{q}\Big\rangle\\
&=\langle U(\breve{p}_{u})\rangle-\langle U(\breve{q})\rangle-\langle q,\breve{p}_{u}-\breve{q}\rangle-\langle U(\breve{p}_{u})\rangle+\sum_{c}w_{c}\langle U(\breve{p}_{c})\rangle-b\langle q\rangle\\
&=D_{U}(q,p_{u})-b-\langle U(\breve{p}_{u})\rangle+\sum_{c}w_{c}\langle U(\breve{p}_{c})\rangle.
\end{aligned}

(40)

In the last equation, only the first term depends on $q$ ; hence, $q=p_{u}$ minimizes the weighted sum of the Bregman divergences.

A.2 Proof of proposition 2

Consider the KL-divergence with $u(z)=\exp z$ and $u^{\ast}(\zeta)=\log\zeta$. Then, the consensus model is given by the model $p(y;\bar{\xi},w)$ with the parameter $\bar{\xi}=\sum_{c}w_{c}\xi_{c}$. Denoting the action of the $\epsilon$-contaminated weight $w(\epsilon)$ on a function $f(\xi)$ as

\displaystyle W_{\epsilon}[f]=\int\Big((1-\epsilon)\sum_{c=1}^{C}w_{c}\delta(\xi-\xi_{c})+\epsilon\delta(\xi-\xi_{\mathrm{out}})\Big)f(\xi)\mathrm{d}\xi=(1-\epsilon)\sum_{c=1}^{C}w_{c}f(\xi_{c})+\epsilon f(\xi_{\mathrm{out}}),

(41)

we have

\displaystyle p(y;\bar{\xi},w(\epsilon))=\exp\left(\phi^{-1}(yW_{\epsilon}[\xi]-\psi(W_{\epsilon}[\xi]))+c(y,\phi)\right).

(42)

Taking the derivative with respect to $\epsilon$ , we have the influence function as

\displaystyle\mathrm{IF}(p(y;\bar{\xi},w),\xi_{\mathrm{out}})=p(y;\bar{\xi},w)\,\frac{\{y-\psi^{\prime}(\bar{\xi})\}(\xi_{\mathrm{out}}-\bar{\xi})}{\phi},

(43)

which follows from Eq. (42) by the chain rule, since $\partial W_{\epsilon}[\xi]/\partial\epsilon=\xi_{\mathrm{out}}-\bar{\xi}$.

Hence, the influence function is unbounded for $\xi_{\mathrm{out}}$ . On the other hand, consider the $\beta$ -divergence with $u(z)=(\beta z+1)^{1/\beta}$ and $u^{\ast}(\zeta)=\frac{\zeta^{\beta}-1}{\beta}$ . Then, the minimizer $p_{u}(y;w(\epsilon))$ of $A_{U}(q;w(\epsilon))$ is

\displaystyle p_{u}(y;w(\epsilon))=\left[W_{\epsilon}\left[\exp\left\{\phi^{-1}\beta(y\xi-\psi(\xi))+\beta c(y,\phi)\right\}\right]-\beta b(w(\epsilon))\right]^{1/\beta}.

(44)

Taking the derivative with respect to $\epsilon$ , we have the influence function as

\displaystyle\begin{aligned}
\mathrm{IF}(p_{u}(y;w),\xi_{\mathrm{out}})&=\frac{1}{\beta p_{u}^{1-\frac{1}{\beta}}}\left[\int\Big\{\delta(\xi-\xi_{\mathrm{out}})-\sum_{c}w_{c}\delta(\xi-\xi_{c})\Big\}\exp\{\phi^{-1}\beta(y\xi-\psi(\xi))+\beta c(y,\phi)\}\mathrm{d}\xi-\beta b\right]\\
&=\frac{1}{\beta p_{u}^{1-\frac{1}{\beta}}}\{p_{\mathrm{out}}^{\beta}-\overline{p^{\beta}}\},
\end{aligned}

(45)

where we used the simplified notations $p_{c}=p(y;\xi_{c})$, $p_{\mathrm{out}}=p(y;\xi_{\mathrm{out}})$, and

\displaystyle\overline{p^{\beta}}=\sum_{c}w_{c}p_{c}^{\beta}.

(46)

We conclude that if $\beta>0$ , then there exists a limit of $\mathrm{IF}(p_{u}(y;w),\xi_{\mathrm{out}})$ when $|\xi_{\mathrm{out}}|\to\infty$ .

A.3 Proof of proposition 3

A direct observation gives that

\displaystyle\begin{aligned}
A_{\gamma}(q)-A_{\gamma}(p_{\gamma}(\cdot;w))&=-\frac{\int q\sum_{c}w_{c}p_{c}^{\gamma}\,\mathrm{d}\Lambda}{(\int q^{\gamma+1}\mathrm{d}\Lambda)^{\frac{1}{\gamma+1}}}+\frac{\int p_{\gamma}(\cdot;w)\sum_{c}w_{c}p_{c}^{\gamma}\,\mathrm{d}\Lambda}{(\int p_{\gamma}(\cdot;w)^{\gamma+1}\mathrm{d}\Lambda)^{\frac{1}{\gamma+1}}}\\
&=-\frac{z(w)^{\gamma}\int q\,p_{\gamma}(\cdot;w)^{\gamma}\mathrm{d}\Lambda}{(\int q^{\gamma+1}\mathrm{d}\Lambda)^{\frac{1}{\gamma+1}}}+z(w)^{\gamma}\bigg(\int p_{\gamma}(\cdot;w)^{\gamma+1}\mathrm{d}\Lambda\bigg)^{\frac{\gamma}{\gamma+1}},
\end{aligned}

which is equal to $z(w)^{\gamma}D^{\ast}_{\gamma}(q,p_{\gamma}(\cdot;w))$. This leads to the conclusion that $A_{\gamma}(q)\geq A_{\gamma}(p_{\gamma}(\cdot;w))$, and the equality holds if and only if $q=p_{\gamma}(\cdot;w)$.
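The identity used in this proof can be checked numerically on a finite outcome space. The sketch below assumes the counting measure and writes the dual $\gamma$-power divergence as $D^{\ast}_{\gamma}(q,p)=-\int q\,p^{\gamma}\mathrm{d}\Lambda/(\int q^{\gamma+1}\mathrm{d}\Lambda)^{\frac{1}{\gamma+1}}+(\int p^{\gamma+1}\mathrm{d}\Lambda)^{\frac{\gamma}{\gamma+1}}$, matching the two displayed lines above. The committee distributions, weights, and competitor `q` are hypothetical.

```python
def dual_gamma_div(q, p, gamma):
    """D*_gamma(q, p) for discrete distributions under the counting measure."""
    cross = sum(qi * pi ** gamma for qi, pi in zip(q, p))
    norm_q = sum(qi ** (gamma + 1) for qi in q) ** (1.0 / (gamma + 1))
    norm_p = sum(pi ** (gamma + 1) for pi in p) ** (gamma / (gamma + 1))
    return -cross / norm_q + norm_p


def weighted_sum(q, members, weights, gamma):
    """A_gamma(q) = sum_c w_c D*_gamma(q, p_c)."""
    return sum(w * dual_gamma_div(q, p, gamma) for w, p in zip(weights, members))


gamma = 0.5
members = [[0.7, 0.2, 0.1], [0.3, 0.4, 0.3], [0.1, 0.3, 0.6]]  # hypothetical p_c
weights = [0.2, 0.5, 0.3]

# Dual gamma-mixture and its normalizing factor, Eqs. (33) and (34).
raw = [sum(w * p[y] ** gamma for w, p in zip(weights, members)) ** (1.0 / gamma)
       for y in range(3)]
z = sum(raw)
p_gamma = [r / z for r in raw]

q = [0.5, 0.25, 0.25]  # an arbitrary competitor distribution
gap = weighted_sum(q, members, weights, gamma) - weighted_sum(p_gamma, members, weights, gamma)
identity_rhs = z ** gamma * dual_gamma_div(q, p_gamma, gamma)
```

Here `gap` and `identity_rhs` agree up to rounding error and are nonnegative, as the proof asserts.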

A.4 Proof of proposition 4

The consensus model with the dual $\gamma$ -power divergence is given by

\displaystyle p_{\gamma}(y;w)=\frac{1}{z(w)}\left(\sum_{c}w_{c}p_{c}(y)^{\gamma}\right)^{\frac{1}{\gamma}}=\frac{1}{z(w)}\left(\overline{p(y)^{\gamma}}\right)^{\frac{1}{\gamma}},

(47)

\displaystyle z(w)=\int\left(\overline{p(y)^{\gamma}}\right)^{\frac{1}{\gamma}}\mathrm{d}\Lambda(y).

(48)

The $\epsilon$ -contamination model is

\displaystyle p_{\gamma}(y;w(\epsilon))=\frac{\left(W_{\epsilon}[p^{\gamma}(y;\xi)]\right)^{\frac{1}{\gamma}}}{\int\left(W_{\epsilon}[p^{\gamma}(y;\xi)]\right)^{\frac{1}{\gamma}}\mathrm{d}\Lambda(y)}.

(49)

The derivative of $W_{\epsilon}[p^{\gamma}(y;\xi)]$ with respect to $\epsilon$ is

\displaystyle\frac{\partial W_{\epsilon}[p^{\gamma}(y;\xi)]}{\partial\epsilon}=p(y;\xi_{\mathrm{out}})^{\gamma}-\sum_{c}w_{c}p_{c}(y)^{\gamma},

(50)

and the derivative of the normalizing factor is

\displaystyle\left.\frac{\partial z(w(\epsilon))}{\partial\epsilon}\right|_{\epsilon=0}=\int\left\{p_{\mathrm{out}}^{\gamma}-\overline{p^{\gamma}}\right\}\mathrm{d}\Lambda.

(51)

Then,

\displaystyle\begin{aligned}
\mathrm{IF}(p_{\gamma}(y;w),\xi_{\mathrm{out}})&=\frac{1}{\gamma z(w)}(p_{\mathrm{out}}^{\gamma}-\overline{p^{\gamma}})\,\overline{p^{\gamma}}^{\frac{1}{\gamma}-1}-\frac{1}{z^{2}(w)}\,\overline{p^{\gamma}}^{\frac{1}{\gamma}}\int(p_{\mathrm{out}}^{\gamma}-\overline{p^{\gamma}})\mathrm{d}\Lambda\\
&=\frac{1}{\gamma}\frac{p_{\gamma}(y;w)}{\overline{p(y)^{\gamma}}}(p_{\mathrm{out}}(y)^{\gamma}-\overline{p(y)^{\gamma}})-\frac{p_{\gamma}(y;w)}{z(w)}\int(p_{\mathrm{out}}(y)^{\gamma}-\overline{p(y)^{\gamma}})\mathrm{d}\Lambda(y).
\end{aligned}

(52)

We conclude that if $\gamma>0$, then there exists a limit of ${\rm IF}(p_{\gamma}(y;w),\xi_{\rm out})$ when $\xi_{\rm out}$ goes to $\infty$ or $-\infty$. This is because $p(y;\xi_{\rm out})^{\gamma}$ converges to one of the singular density functions $p_{\pm\infty}(y)^{\gamma}$ as $\xi_{\rm out}$ goes to $\infty$ or $-\infty$. On the other hand, if $\gamma\leq 0$, the influence function is unbounded in the outlier.

A.5 Proof of proposition 5

In the proof, we assume $\phi=1$ and $c(y,\phi)=0$ for the exponential family for the sake of simplicity. We also omit $\bm{x}$ from the acquisition function. We first consider the case with the KL-divergence. The $\epsilon$ -contamination for the acquisition function is denoted as

\displaystyle a_{0}(\bm{x},w(\epsilon))=W_{\epsilon}[D_{0}(p(y;\xi),p(y;W_{\epsilon}[\xi]))].

(53)

Taking the derivative of $a_{0}(\bm{x},w(\epsilon))$ with respect to $\epsilon$ and setting $\epsilon\to 0$ , we have

\displaystyle\begin{aligned}
\mathrm{IF}(a_{0}(\bm{x};w),\xi_{\mathrm{out}})&=\int p_{\mathrm{out}}(y\xi_{\mathrm{out}}-\psi(\xi_{\mathrm{out}}))\mathrm{d}\Lambda-\int p_{\mathrm{out}}(y\bar{\xi}-\psi(\bar{\xi}))\mathrm{d}\Lambda\\
&\quad-\sum_{c}w_{c}\int p_{c}(y\xi_{c}-\psi(\xi_{c}))\mathrm{d}\Lambda+\sum_{c}w_{c}\int p_{c}(y\bar{\xi}-\psi(\bar{\xi}))\mathrm{d}\Lambda\\
&\quad-\sum_{c}w_{c}\int p_{c}(y\xi_{\mathrm{out}}-\psi^{\prime}(\xi_{\mathrm{out}})-y\bar{\xi}+\psi^{\prime}(\bar{\xi}))\mathrm{d}\Lambda\\
&=\Big\{\psi^{\prime}(\xi_{\mathrm{out}})-\bar{\xi}-\sum_{c}w_{c}\psi^{\prime}(\xi_{c})\Big\}\xi_{\mathrm{out}}-\psi(\xi_{\mathrm{out}})+\psi^{\prime}(\xi_{\mathrm{out}})+2\sum_{c}w_{c}\psi^{\prime}(\xi_{c})\bar{\xi}\\
&\quad-\sum_{c}w_{c}\psi^{\prime}(\xi_{c})\xi_{c}+\sum_{c}w_{c}\psi(\xi_{c})-\psi^{\prime}(\bar{\xi}).
\end{aligned}

(54)

Hence, the influence function is unbounded for $\xi_{\mathrm{out}}$ .

Consider the influence function for

\displaystyle\begin{aligned}
a_{\beta}(\bm{x};w)&=\sum_{c}w_{c}D_{\beta}(p_{c},p_{u})\\
&=\frac{1}{\beta+1}\int p_{u}^{\beta+1}\mathrm{d}\Lambda+\sum_{c}w_{c}\left[\frac{1}{\beta(\beta+1)}\int p^{\beta+1}_{c}\mathrm{d}\Lambda-\frac{1}{\beta}\int p_{c}p_{u}^{\beta}\mathrm{d}\Lambda\right],
\end{aligned}

(55)

where

\displaystyle p_{u}(y;w)=\left[\sum_{c}w_{c}\exp\left(\beta(y\xi_{c}-\psi(\xi_{c}))\right)-\beta b\right]^{\frac{1}{\beta}}.

(56)

Taking the derivative of $a_{\beta}(\bm{x},w(\epsilon))$ with respect to $\epsilon$ and setting $\epsilon\to 0$ , we have

\displaystyle\begin{aligned}
\mathrm{IF}(a_{\beta}(\bm{x};w),\xi_{\mathrm{out}})&=\int\frac{1}{\beta}p_{u}^{\beta+\frac{1}{\beta}-1}\left\{p_{\mathrm{out}}^{\beta}-\overline{p^{\beta}}\right\}\mathrm{d}\Lambda-\frac{1}{\beta}\int\overline{p^{1}}p_{u}^{\beta+\frac{1}{\beta}-2}(p_{\mathrm{out}}^{\beta}-\overline{p^{\beta}})\mathrm{d}\Lambda\\
&\quad+\frac{1}{\beta(\beta+1)}\int p_{\mathrm{out}}^{\beta+1}\mathrm{d}\Lambda-\frac{1}{\beta}\int p_{\mathrm{out}}p_{u}^{\beta}\mathrm{d}\Lambda\\
&\quad-\frac{1}{\beta(\beta+1)}\int\overline{p^{\beta+1}}\mathrm{d}\Lambda+\frac{1}{\beta}\int\overline{p^{1}}p_{u}^{\beta}\mathrm{d}\Lambda.
\end{aligned}

(57)

We conclude that if $\beta>0$, then there exists a limit of $\mathrm{IF}(a_{\beta}(\bm{x};w),\xi_{\mathrm{out}})$ when $|\xi_{\mathrm{out}}|\to\infty$.

For the dual $\gamma$ -power divergence, the acquisition function is of the form

\displaystyle\begin{aligned}
a_{\gamma}(\bm{x};w)&=\sum_{c}w_{c}D^{\ast}_{\gamma}(p_{c},p_{\gamma})\\
&=\sum_{c}w_{c}\left\{-\frac{\int p_{c}p_{\gamma}^{\gamma}\mathrm{d}\Lambda}{\left(\int p_{c}^{\gamma+1}\mathrm{d}\Lambda\right)^{\frac{1}{\gamma+1}}}+\left(\int p_{\gamma}^{\gamma+1}\mathrm{d}\Lambda\right)^{\frac{\gamma}{\gamma+1}}\right\}.
\end{aligned}

(58)

The derivative of the $\epsilon$ -contaminated model is

\displaystyle\left.\frac{\partial p_{\gamma}}{\partial\epsilon}\right|_{\epsilon=0}=\frac{1}{\gamma}\frac{p_{\gamma}}{\overline{p^{\gamma}}}(p_{\mathrm{out}}^{\gamma}-\overline{p^{\gamma}})-\frac{1}{z(w)}p_{\gamma}\int(p_{\mathrm{out}}^{\gamma}-\overline{p^{\gamma}})\mathrm{d}\Lambda.

(59)

Then, the influence function is

\displaystyle\begin{aligned}
\mathrm{IF}(a_{\gamma}(\bm{x};w),\xi_{\mathrm{out}})&=-\frac{\int p_{\mathrm{out}}p^{\gamma}_{\gamma}\mathrm{d}\Lambda}{\left(\int p_{\mathrm{out}}^{\gamma+1}\mathrm{d}\Lambda\right)^{\frac{1}{\gamma+1}}}+2\sum_{c}w_{c}\frac{\int p_{c}p^{\gamma}_{\gamma}\mathrm{d}\Lambda}{\left(\int p_{c}^{\gamma+1}\mathrm{d}\Lambda\right)^{\frac{1}{\gamma+1}}}\\
&\quad-\sum_{c}w_{c}\left(\int p_{c}^{\gamma+1}\mathrm{d}\Lambda\right)^{-\frac{1}{\gamma+1}}\int\frac{p_{c}p_{\gamma}^{\gamma}}{\overline{p^{\gamma}}}p_{\mathrm{out}}^{\gamma}\mathrm{d}\Lambda\\
&\quad+\frac{1}{z(w)}p_{\gamma}\int(p^{\gamma}_{\mathrm{out}}-\overline{p^{\gamma}})\mathrm{d}\Lambda\sum_{c}w_{c}\left(\int p_{c}^{\gamma+1}\mathrm{d}\Lambda\right)^{-\frac{1}{\gamma+1}}\\
&\quad+\int\frac{p_{\gamma}^{\gamma+1}}{\overline{p^{\gamma}}}(p_{\mathrm{out}}^{\gamma}-\overline{p^{\gamma}})\mathrm{d}\Lambda-\frac{\gamma}{z(w)}p_{\gamma}\int(p_{\mathrm{out}}^{\gamma}-\overline{p^{\gamma}})\mathrm{d}\Lambda.
\end{aligned}

(60)

We conclude that if $\gamma>0$ , then there exists a limit of ${\rm IF}(a_{\gamma}(\bm{x};w),\xi_{\rm out})$ when $\xi_{\rm out}$ goes to $\infty$ or $-\infty$ .

References

[1] Shun-ichi Amari. Differential-Geometrical Methods in Statistics. Lecture Notes in Statistics. Springer New York, 1985.
[2] Shun-ichi Amari. Information Geometry and Its Applications. Springer Publishing Company, Incorporated, 1st edition, 2016.
[3] Dana Angluin. Queries and Concept Learning. Machine Learning, 2(4):319–342, 1988.
[4] Pranjal Awasthi, Maria Florina Balcan, and Philip M. Long. The power of localization for efficiently learning linear separators with noise. J. ACM, 63(6), January 2017.
[5] Maria-Florina Balcan, Alina Beygelzimer, and John Langford. Agnostic active learning. Journal of Computer and System Sciences, 75(1):78 – 89, 2009. Learning Theory 2006.
[6] Ayanendranath Basu, Ian R. Harris, Nils L. Hjort, and M. C. Jones. Robust and efficient estimation by minimising a density power divergence. Biometrika, 85(3):549–559, 09 1998.
[7] George E. P. Box, J. Stuart Hunter, and William G. Hunter. Statistics for Experimenters: Design, Innovation, and Discovery. Wiley Series in Probability and Statistics. Wiley, 2005.
[8] L.M. Bregman. The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Computational Mathematics and Mathematical Physics, 7(3):200–217, 1967.
[9] Leo Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996.
[10] Sanjoy Dasgupta. Analysis of a greedy active learning strategy. In Advances in Neural Information Processing Systems, 2005.
[11] Shinto Eguchi. Chapter 2 - pythagoras theorem in information geometry and applications to generalized linear models. In Angelo Plastino, Arni S.R. Srinivasa Rao, and C.R. Rao, editors, Information Geometry, volume 45 of Handbook of Statistics, pages 15–42. Elsevier, 2021.
[12] Shinto Eguchi and Osamu Komori. Minimum Divergence Methods in Statistical Machine Learning: From an Information Geometric Viewpoint. Springer Publishing Company, Incorporated, 1st edition, 2022.
[13] Yoav Freund, H Sebastian Seung, Eli Shamir, and Naftali Tishby. Selective Sampling Using the Query by Committee Algorithm. Machine Learning, 28(2-3):133–168, 1997.
[14] Hironori Fujisawa and Shinto Eguchi. Robust parameter estimation with a small bias against heavy contamination. J. Multivar. Anal., 99(9):2053–2081, Oct. 2008.
[15] Ran Gilad-Bachrach, Amir Navot, and Naftali Tishby. Query by Committee made real. In Advances in Neural Information Processing Systems, NIPS 2005, pages 443–450, 2005.
[16] Trevor Hastie and Robert Tibshirani. Generalized Additive Models. Statistical Science, 1(3):297 – 310, 1986.
[17] Manuel Haußmann, Fred Hamprecht, and Melih Kandemir. Deep active learning with adaptive acquisition. In International Joint Conference on Artificial Intelligence, IJCAI 2019, pages 2470–2476, 2019.
[18] Hideitsu Hino. Active learning: Problem settings and recent developments. CoRR, abs/2012.04225, 2020.
[19] Yoshihiro Hirose and Fumiyasu Komaki. An extension of least angle regression based on the information geometry of dually flat spaces. Journal of Computational and Graphical Statistics, 19(4):1007–1023, December 2010.
[20] Hideaki Ishibashi and Hideitsu Hino. Stopping criterion for active learning based on deterministic generalization bounds. In International Conference on Artificial Intelligence and Statistics, AISTATS 2020, pages 386–397, 2020.
[21] M.I. Jordan and R.A. Jacobs. Hierarchical mixtures of experts and the em algorithm. In Proceedings of 1993 International Conference on Neural Networks (IJCNN-93-Nagoya, Japan), volume 2, pages 1339–1344 vol.2, 1993.
[22] Takafumi Kanamori and Hironori Fujisawa. Affine invariant divergences associated with proper composite scoring rules and their applications. Bernoulli, 20(4):2278 – 2304, 2014.
[23] Takafumi Kanamori and Hironori Fujisawa. Robust estimation under heavy contamination using unnormalized models. Biometrika, 102(3):559–572, 05 2015.
[24] Ksenia Konyushkova, Raphael Sznitman, and Pascal Fua. Learning active learning from data. In Advances in Neural Information Processing Systems, NIPS 2017, volume 2017-Decem, pages 4226–4236, 2017.
[25] Andrew McCallum and Kamal Nigam. Employing EM and pool-based active learning for text classification. In Proceedings of the Fifteenth International Conference on Machine Learning, ICML ’98, pages 350–358, San Francisco, CA, USA, 1998. Morgan Kaufmann Publishers Inc.
[26] Noboru Murata and Yu Fujimoto. Bregman divergence and density integration. Journal of Math for Industry, 1:97–104, 2009.
[27] Hieu T. Nguyen and Arnold Smeulders. Active learning using pre-clustering. In International Conference on Machine Learning, ICML 2004, pages 623–630, 2004.
[28] Marco Riani, Anthony C. Atkinson, Aldo Corbellini, and Domenico Perrotta. Robust regression with density power divergence: Theory, comparisons, and data analysis. Entropy, 22(4), 2020.
[29] P.J. Rousseeuw, F.R. Hampel, E.M. Ronchetti, and W.A. Stahel. Robust Statistics: The Approach Based on Influence Functions. Wiley Series in Probability and Statistics. Wiley, 2011.
[30] Ozan Sener and Silvio Savarese. Active learning for convolutional neural networks: A core-set approach. In International Conference on Learning Representations, ICLR 2018, 2018.
[31] Burr Settles. Active Learning Literature Survey. Machine Learning, 15(2):201–221, 2010.
[32] H. Sebastian Seung, Manfred Opper, and Haim Sompolinsky. Query by committee. In Annual ACM Workshop on Computational Learning Theory, COLT 1992, pages 287–294, 1992.
[33] Yusuke Taguchi, Hideitsu Hino, and Keisuke Kameyama. Pre-training acquisition functions by deep reinforcement learning for fixed budget active learning. Neural Process. Lett., 53(3):1945–1962, 2021.
[34] Ken Takano, Hideitsu Hino, Shotaro Akaho, and Noboru Murata. Nonparametric e-mixture estimation. Neural Comput., 28(12):2687–2725, 2016.
[35] Kei Terayama, Ryo Tamura, Yoshitaro Nose, Hidenori Hiramatsu, Hideo Hosono, Yasushi Okuno, and Koji Tsuda. Efficient construction method for phase diagrams using uncertainty sampling. Physical Review Materials, 3(3):33802, 2019.
[36] Tetsuro Ueno, Hideitsu Hino, Ai Hashimoto, Yasuo Takeichi, Yasuhiro Sawada, and Kanta Ono. Adaptive design of an X-ray magnetic circular dichroism spectroscopy experiment with Gaussian process modeling. npj Computational Materials, 4(1), 2018.
