
Decomposable Probability-of-Success Metrics in Algorithmic Search

Tyler Sam1, Jake Williams1, Abel Tadesse2, Huey Sun3, and George Montañez1
1 Harvey Mudd College, California, USA
2 Claremont McKenna College, California, USA
3 Pomona College, California, USA
{tsam, jlwilliams, gmontanez}@g.hmc.edu, [email protected], [email protected]
ORCID: https://orcid.org/0000-0001-7974-3226, https://orcid.org/0000-0001-9714-1851, https://orcid.org/0000-0002-3337-9454, https://orcid.org/0000-0002-0949-3169, https://orcid.org/0000-0002-1333-4611
Abstract

Previous studies have used a specific success metric within an algorithmic search framework to prove machine learning impossibility results. However, this success metric prevents us from applying these results to other forms of machine learning, e.g., transfer learning. To address this, we define decomposable metrics, a category of success metrics for search problems that can be expressed as a linear operation on a probability distribution. Using an arbitrary decomposable metric to measure the success of a search, we prove theorems that bound success in various ways, generalizing several existing results in the literature.

1 INTRODUCTION

Many machine learning tasks, such as classification, regression, and clustering, can be reduced to search problems [Montañez, 2017]. Through this reduction, one can apply concepts from information theory to derive results about machine learning. To compare the success of different algorithms, measured as the expected probability of finding a desired element, Montañez defined a metric of success that averages the probability of success over all iterations of an algorithm [Montañez, 2017]. While this metric has many applications, it is not appropriate for cases where the probability of success at a given iteration of an algorithm is required. An example is transfer learning, where the probability of success at the final step of the algorithm is more relevant than the average probability of success.

Building on this work, we define decomposability as a property of probability-of-success metrics and show that the expected per-query probability of success [Montañez, 2017] and more general probability-of-success metrics are decomposable. We then show that the results previously proven for the expected per-query probability of success hold for all decomposable probability-of-success metrics. Under this generalization, we can prove results related to the probability of success at specific iterations of a search rather than just uniformly averaged over the entire search, giving the results much broader applicability.

2 RELATED WORK

Several decades ago, Mitchell proposed that classification could be viewed as search, and reduced the problem of learning generalizations to a search problem within a hypothesis space [Mitchell, 1980, Mitchell, 1982]. Montañez subsequently expanded this idea into a formal search framework [Montañez, 2017].

Montañez showed that for a given algorithm with a fixed information resource, favorable target sets, or the target sets on which the algorithm would perform better than uniform random sampling, are rare. He did this by proving that the proportion of $b$-bit favorable problems has an exponentially decaying restrictive bound [Montañez, 2017]. He further showed that this scarcity of favorable problems exists even for $k$-sparse target sets.

Montañez et al. later defined bias, the degree to which an algorithm is predisposed to a fixed target, with respect to the expected per-query probability of success metric, and proved that there were a limited number of favorable information resources for a given bias [Montañez et al., 2019]. Using the search framework, they proved that an algorithm cannot be favorably biased towards many distinct targets simultaneously.

As machine learning grew in prominence, researchers began to probe what was possible within machine learning. Valiant considered learnability of a task as the ability to generate a program for performing the task without explicit programming of the task [Valiant, 1984]. By restricting the tasks to a specific context, Valiant demonstrated a set of tasks which were provably learnable.

Schaffer provided an early foundation to the idea of bounding universal performance of an algorithm [Schaffer, 1994]. Schaffer analyzed generalization performance, the ability of a learner to classify objects outside of its training set, in a classification task. Using a baseline of uniform sampling from the classifiers, he showed that, over the set of all learning situations, a learner’s generalization performance sums to zero, which makes generalization performance a conserved quantity.

Wolpert and Macready demonstrated that the historical performance of a deterministic optimization algorithm provides no a priori justification whatsoever for its continued use over any other alternative going forward [Wolpert and Macready, 1997], implying that a history of better performance gives no rational basis for preferring one algorithm over another. Furthermore, just as there does not exist a single algorithm that performs better than random on all possible optimization problems, they proved that there also does not exist an optimization problem on which all algorithms perform better than average.

Continuing the application of prior knowledge to learning and optimization, Gülçehre and Bengio showed that the worse-than-chance performance of certain machine learning algorithms can be improved through learning with hints, namely, guidance using a curriculum [Gülçehre and Bengio, 2016]. Thus, while Wolpert's results might make certain tasks seem futile, Gülçehre's empirical results show that prior knowledge can be leveraged to obtain better results in both learning and optimization.

Others have worked towards meaningful bounds on algorithmic success through different approaches. Sterkenburg approached this concept from the perspective of Putnam, who originally claimed that a universal learning machine is impossible through the use of a diagonalization argument [Sterkenburg, 2019]. Sterkenburg follows up on this claim, attempting to find a universal inductive rule by exploring a measure which cannot be diagonalized. Even when attempting to evade Putnam’s original diagonalization, Sterkenburg is able to apply a new diagonalization that reinforces Putnam’s original claim of the impossibility of a universal learning machine.

There has also been work on proving learning bounds for specific problems. Kumagai and Kanamori analyzed the theoretical bounds of parameter transfer algorithms and self-taught learning [Kumagai and Kanamori, 2019]. By looking at the local stability, or the degree to which a feature is affected by shifting parameters, they developed a definition for parameter transfer learnability, which describes the probability of effective transfer.

2.1 Distinctions from Prior Work

The expected per-query probability of success metric previously defined in the algorithmic search framework [Montañez, 2017] tells us, for a given information resource, algorithm, and target set, how often (in expectation) our algorithm will successfully locate elements of the target set. While this metric is useful when making general claims about the performance of an algorithm or the favorability of an algorithm and information resource to the target set, it lacks the specificity to make claims about similar performance and favorability on a per-iteration basis. This trade-off calls for a more general metric that can be used to make both general and specific (per iteration) claims. For instance, in transfer learning tasks, the performance and favorability of the last pre-transfer iteration is more relevant than the overall expected per-query probability of success. The general probability of success, which we will define as a particular decomposable probability-of-success metric, is a tool through which we can make claims at such specific and relevant steps.

3 BACKGROUND

In this section, we will present definitions for the main framework that we will use throughout this paper.

3.1 The Search Framework

Montañez describes a framework which formalizes search problems in order to analyze search and learning algorithms [Montañez, 2017]. There are three components to a search problem. The first is the finite discrete search space, $\Omega$, which is the set of elements to be examined. Next is the target set, $T$, which is a nonempty subset of the search space that we are trying to find. Finally, we have an external information resource, $F$, which provides an evaluation of elements of the search space. Typically, there is a tight relationship between the target set and the external information resource, as the resource is expected to lead to or describe the target set in some way, such as the target set being the elements which meet a certain threshold under the external information resource.

Within the framework, we have an iterative algorithm which seeks to find elements of the target set, shown in Figure 1. The algorithm is a black-box that has access to a search history and produces a probability distribution over the search space. At each step, the algorithm samples over the search space using the probability distribution, evaluates that element using the information resource, adds the result to the search history, and determines the next probability distribution. The abstraction of finding the next probability distribution as a black-box algorithm allows the search framework to work with all types of search problems.

Figure 1: Black-box search algorithm. We iteratively populate the history with samples from a distribution determined by the black box at each iteration, using the history [Montañez, 2017].
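
To make the loop of Figure 1 concrete, the following is a minimal sketch of the framework in Python. This is our own toy illustration rather than code from the framework itself: the black box `uniform_black_box` and the modular information resource `f` below are stand-in assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def uniform_black_box(history, n):
    # A trivial black box: ignores the history, returns the uniform distribution.
    return np.full(n, 1.0 / n)

def search(n, target, f, next_distribution, n_steps):
    # Run the iterative black-box search loop over Omega = {0, ..., n-1}.
    history = []
    for _ in range(n_steps):
        P = next_distribution(history, n)  # black box produces a distribution
        x = rng.choice(n, p=P)             # sample the next element to query
        history.append((x, f(x)))          # evaluate it and record the result
        if x in target:
            return history, True           # an element of the target set was found
    return history, False

history, success = search(n=100, target={3, 7}, f=lambda x: x % 10,
                          next_distribution=uniform_black_box, n_steps=25)
```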

The ML-as-search framework is valuable because it provides a structure to understand and reason about different machine learning problems within the same formalism. For example, we can understand regression as a search through a space of possible regression functions, or parameter estimation as a search through possible vectors for a black-box process [Montañez, 2017]. Therefore, we can apply results about search problems to any machine learning problem we can cast into the search framework.

3.2 Expected Per-Query Probability of Success

In order to compare search algorithms, Montañez defined the expected per-query probability of success,

q(t,f) = \mathbb{E}_{\tilde{P},H}\left[\frac{1}{|\tilde{P}|}\sum_{i=1}^{|\tilde{P}|}P_{i}(w\in t)\,\bigg|\,f\right] = \overline{P}(X\in t\mid f)    (3.1)

where $\tilde{P}$ is the sequence of probability distributions generated by the black box, $H$ is the search history, and $t$ and $f$ are the target set and information resource of the search problem, respectively [Montañez, 2017]. This metric of success is particularly useful because it can be shown that $q(t,f) = \mathbf{t}^{\top}\overline{\mathbf{P}}_{f}$, where $\overline{\mathbf{P}}_{f}$ is the average of the vector representations of the probability distributions from the search algorithm at each step, conditioned on an information resource $f$.
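
As a quick numeric sketch (with toy distributions of our own choosing, not drawn from any actual search), the identity $q(t,f) = \mathbf{t}^{\top}\overline{\mathbf{P}}_{f}$ can be computed directly:

```python
import numpy as np

# Toy sequence of distributions produced by a search over |Omega| = 4 (assumed values).
P_seq = np.array([[0.25, 0.25, 0.25, 0.25],   # P_1
                  [0.10, 0.40, 0.40, 0.10],   # P_2
                  [0.05, 0.45, 0.45, 0.05]])  # P_3
P_bar = P_seq.mean(axis=0)                    # averaged distribution

t = np.array([0, 1, 1, 0])                    # target function for t = {1, 2}
q = t @ P_bar                                 # q(t,f) = t^T P_bar
print(q)                                      # ~0.733
```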

Measuring success using the expected per-query probability of success, Montañez demonstrated bounds on the success of any search algorithm [Montañez, 2017]. The Famine of Forte states that for a given algorithm, the proportion of target-information resource pairs yielding a success level above a given threshold is inversely related to the threshold. Thus, the greater the threshold for success, the fewer problems you can be successful on, regardless of the algorithm. The expected per-query probability of success can also be used to prove a version of the No Free Lunch theorems, demonstrating that all algorithms perform the same averaged over all target sets and information resources, as is done in Theorem 1 of the current manuscript.

3.3 Bias

Using the search framework, Montañez et al. defined a measure of bias between a distribution over information resources and a fixed target [Montañez et al., 2019]. For a distribution $\mathcal{D}$ over a collection of possible information resources $\mathcal{F}$, with $F \sim \mathcal{D}$, and a fixed $k$-hot target $\mathbf{t}$ ($k$-hot vectors are binary vectors of length $|\Omega|$ with exactly $k$ ones), the bias between the distribution and the target is defined as

\text{Bias}(\mathcal{D},\mathbf{t}) = \mathbb{E}_{\mathcal{D}}\left[\mathbf{t}^{\top}\overline{\mathbf{P}}_{F}\right] - \frac{k}{|\Omega|}    (3.2)
= \mathbf{t}^{\top}\mathbb{E}_{\mathcal{D}}\left[\overline{\mathbf{P}}_{F}\right] - \frac{\|\mathbf{t}\|^{2}}{|\Omega|}    (3.3)
= \mathbf{t}^{\top}\int_{\mathcal{F}}\overline{\mathbf{P}}_{f}\,\mathcal{D}(f)\,\mathrm{d}f - \frac{\|\mathbf{t}\|^{2}}{|\Omega|}.    (3.4)

Recall from above that $\overline{\mathbf{P}}_{f}$ is the averaged probability distribution over $\Omega$ from a search.

The bias term measures the performance of an algorithm in expectation (over a given distribution of information resources) compared to uniform sampling. Mathematically, this is computed by taking the difference between the expected value of the average performance of an algorithm and the performance of uniform sampling. The distribution $\mathcal{D}$ captures what information resources (e.g., datasets) one is likely to encounter.

For a non-mathematical example of the effect of bias, suppose we are searching for a parking space within a parking lot. If we randomly choose parking spaces to check, we are searching without bias. However, if we consider the location of the parking spaces, we may find that parking spaces furthest from the entrance are usually free, and could find an open parking space with higher probability. Here, the information resource telling us the distance of each parking space from the entrance, together with our belief that parking spaces further from the entrance tend to be open, creates a distribution over possible parking spaces, favoring those that are further away to be checked first.
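
As a small numeric sketch of Eq. (3.2), with toy averaged distributions of our own choosing, the bias is the expected performance over information resources minus the uniform baseline $k/|\Omega|$:

```python
import numpy as np

t = np.array([0, 0, 1, 0])                     # 1-hot target, |Omega| = 4, k = 1
P_bar = {                                      # assumed averaged distributions
    "f1": np.array([0.10, 0.10, 0.70, 0.10]),  # resource favoring the target
    "f2": np.array([0.25, 0.25, 0.25, 0.25]),  # uninformative resource
}
D = {"f1": 0.5, "f2": 0.5}                     # distribution D over resources

expected_perf = sum(D[f] * (t @ P_bar[f]) for f in D)
bias = expected_perf - (t @ t) / len(t)        # subtract k/|Omega| = ||t||^2/|Omega|
print(bias)                                    # 0.475 - 0.25 = 0.225
```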

4 PRELIMINARIES

In this section, we introduce a new property of success metrics called decomposability, which allows us to generalize concepts of success and bias. We provide a number of preliminary lemmata, with full proofs given in the Appendix.

4.1 Decomposability

We now give a formal definition for a decomposable probability-of-success metric, which will be used throughout the rest of the paper.

Definition 4.1.

A probability-of-success metric $\phi$ is decomposable if and only if there exists a $\mathbf{P}_{\phi,f}$ such that

\phi(t,f) = \mathbf{t}^{\top}\mathbf{P}_{\phi,f} = P_{\phi}(X\in t\mid f),    (4.1)

where $\mathbf{P}_{\phi,f}$ is not a function of $t$, being conditionally independent of it given $f$.

As we stated previously, what makes the expected per-query probability of success particularly useful is that it can be represented as a linear function of a probability distribution. This definition allows us to reference any probability-of-success metric having this property.

As a first example, we show that the expected per-query probability of success is a decomposable probability-of-success metric.

Lemma 4.2 (Decomposability of the Expected Per-Query Probability of Success).

The expected per-query probability of success is decomposable, namely,

q(t,f) = \mathbf{t}^{\top}\overline{\mathbf{P}}_{f}.    (4.2)

Our goal is to show that the theorems proved for the expected per-query probability of success hold for all decomposable metrics. Showing that the expected per-query probability of success is decomposable suggests that these theorems may be generalizable to any metrics sharing that property.

4.1.1 The General Probability of Success

While the expected per-query probability of success averages the probability of success over each of the queries in a search history, we may care more about a specific query in the search history, e.g., the final query of a sequence. Thus, we can generalize the expected per-query probability of success by replacing the averaging with an arbitrary distribution $\alpha$ over the probability distributions in the search history. We define the general probability of success as

q_{\alpha}(t,f) = \mathbb{E}_{\tilde{P},H}\left[\sum_{i=1}^{|\tilde{P}|}\alpha_{i}P_{i}(w\in t)\,\bigg|\,f\right] = P_{\alpha}(X\in t\mid f)    (4.3)

where $P_{\alpha}$ is a valid probability distribution on the search space $\Omega$ and $\alpha_{i}$ is the weight allocated to the $i$-th probability distribution in our sequence. This formula allows us to consider a wide variety of success metrics as instances of the general probability of success metric. For example, the expected per-query probability of success is equivalent to setting $\mathbf{P}_{\alpha,f} = \overline{\mathbf{P}}_{f}$, with $\alpha_{i} = 1/|\tilde{P}|$. Similarly, a metric of success which only cares about the final query can be represented by letting $\mathbf{P}_{\alpha,f} = \overline{\mathbf{P}}_{n,f}$, where $n$ is the length of the sequence of queries and $\overline{\mathbf{P}}_{n,f}$ is the average of the distributions at the $n$-th iteration of our search.

It should be noted that $\alpha$ within the expectation will be random, being defined over the random number of steps within $\tilde{P}$. Our operative definition of the $\alpha$ distribution, however, will allow us to generate the corresponding distribution for the needed number of steps, such as when we place all mass on the $n$-th iteration of the search. With a slight abuse of notation, we thus let $\alpha$ signify both the process by which the distribution is generated and the particular distribution produced for a given number of steps.
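
As a sketch using the same toy sequence of distributions as before (assumed values), two choices of $\alpha$ recover the expected per-query metric and a final-query metric:

```python
import numpy as np

P_seq = np.array([[0.25, 0.25, 0.25, 0.25],
                  [0.10, 0.40, 0.40, 0.10],
                  [0.05, 0.45, 0.45, 0.05]])
t = np.array([0, 1, 1, 0])

alpha_avg = np.full(3, 1 / 3)            # alpha_i = 1/|P~|: expected per-query metric
alpha_final = np.array([0.0, 0.0, 1.0])  # all mass on the final query

q_avg = t @ (alpha_avg @ P_seq)          # equals t^T P_bar_f
q_final = t @ (alpha_final @ P_seq)      # success probability at the last step
print(q_avg, q_final)                    # ~0.733, 0.9
```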

As the general probability of success provides a layer of abstraction above the expected per-query probability of success, if we prove that results about the expected per-query probability of success also hold for the general probability of success, we gain a more powerful tool set. To do so, we must first demonstrate that the general probability of success is a decomposable probability-of-success metric.

Lemma 4.3 (Decomposability of the General Probability of Success Metric).

The general probability of success is decomposable, namely,

q_{\alpha}(t,f) = \mathbf{t}^{\top}\mathbf{P}_{\alpha,f}.    (4.4)

These lemmata allow us to apply later theorems about decomposable metrics to these two useful metrics. Given a metric of interest, performing a similar proof of decomposability will allow for the application of the subsequent theorems.

Lemma 4.4 (Decomposability Closed Under Expectation).

Given a set $S = \{\phi_{i}\}$ of decomposable probability-of-success metrics and a distribution $\mathcal{D}$ over $S$, it holds that

ϕ(t,f)=𝔼𝒟[ϕ(t,f)]\phi^{\prime}(t,f)=\mathbb{E}_{\mathcal{D}}[\phi(t,f)] (4.5)

is also a decomposable probability-of-success metric.

Lemma 4.4 gives us an easy way to construct a new decomposable metric from a set of known decomposable metrics. Note that not every success metric is decomposable; we can create non-decomposable success metrics by taking non-convex combinations of decomposable probability-of-success metrics.
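
A quick numeric check of Lemma 4.4 (with toy vectors assumed for illustration): a convex combination of decomposable metrics is itself decomposable, with $\mathbf{P}_{\phi',f} = \mathbb{E}_{\mathcal{D}}[\mathbf{P}_{\phi,f}]$:

```python
import numpy as np

t = np.array([0, 1, 1, 0])
P_phi1 = np.array([0.10, 0.40, 0.40, 0.10])      # assumed P for metric phi_1
P_phi2 = np.array([0.05, 0.45, 0.45, 0.05])      # assumed P for metric phi_2
w = np.array([0.3, 0.7])                         # distribution D over {phi_1, phi_2}

lhs = w[0] * (t @ P_phi1) + w[1] * (t @ P_phi2)  # E_D[phi(t,f)]
rhs = t @ (w[0] * P_phi1 + w[1] * P_phi2)        # t^T E_D[P_{phi,f}]
assert np.isclose(lhs, rhs)                      # phi' is decomposable
```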

4.2 Generalization of Bias

Our definition of decomposability allows us to redefine bias in terms of any decomposable metric, $\phi(T,F)$. We replace $\overline{\mathbf{P}}_{F}$ with $\mathbf{P}_{\phi,F}$ and obtain

\text{Bias}(\mathcal{D},\mathbf{t}) = \mathbb{E}_{\mathcal{D}}\left[\mathbf{t}^{\top}\mathbf{P}_{\phi,F}\right] - \frac{k}{|\Omega|}    (4.6)
= \mathbf{t}^{\top}\mathbb{E}_{\mathcal{D}}\left[\mathbf{P}_{\phi,F}\right] - \frac{\|\mathbf{t}\|^{2}}{|\Omega|}    (4.7)
= \mathbf{t}^{\top}\int_{\mathcal{F}}\mathbf{P}_{\phi,f}\,\mathcal{D}(f)\,\mathrm{d}f - \frac{\|\mathbf{t}\|^{2}}{|\Omega|}.    (4.8)

Because $\phi(t,f)$ is decomposable, it is equal to $\mathbf{t}^{\top}\mathbf{P}_{\phi,f}$. This makes results about the bias particularly interesting, since they relate directly to any probability-of-success metric we create, so long as the metric is decomposable.

5 RESULTS

Montañez proved a number of results and bounds on the success of machine learning algorithms relative to the expected per-query probability of success, along with its corresponding definition of bias [Montañez, 2017, Montañez et al., 2019]. We now generalize these to apply to any decomposable probability-of-success metric, with full proofs given in the Appendix (available online).

5.1 No Free Lunch for Search

First, we prove a version of the No Free Lunch Theorems for any decomposable probability-of-success metric within the search framework.

Theorem 1 (No Free Lunch for Search and Machine Learning).

For any pair of search/learning algorithms $\mathcal{A}_{1}$, $\mathcal{A}_{2}$ operating on a discrete finite search space $\Omega$, any set of target sets $\uptau$ closed under permutation, any set of information resources $\mathcal{B}$, and any decomposable probability-of-success metric $\phi$,

\sum_{t\in\uptau}\sum_{f\in\mathcal{B}}\phi_{\mathcal{A}_{1}}(t,f) = \sum_{t\in\uptau}\sum_{f\in\mathcal{B}}\phi_{\mathcal{A}_{2}}(t,f).    (5.1)

This means that performance, in terms of our decomposable probability-of-success metric, is conserved in the sense that increased performance of one algorithm over another on some information resource-target pair comes at the cost of a loss in performance elsewhere.
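
This conservation can be checked numerically: for a permutation-closed collection of targets (below, all 2-element subsets of a 5-element $\Omega$), $\sum_{\mathbf{t}}\mathbf{t}$ is a constant vector, so the double sum agrees for any two distribution vectors standing in for two algorithms. A sketch with assumed random vectors:

```python
import numpy as np
from itertools import combinations

n, k = 5, 2
targets = [np.bincount(c, minlength=n) for c in combinations(range(n), k)]

rng = np.random.default_rng(1)
P1 = rng.dirichlet(np.ones(n))     # stand-in for P_{phi,f} of algorithm A1
P2 = rng.dirichlet(np.ones(n))     # stand-in for P_{phi,f} of algorithm A2

total1 = sum(t @ P1 for t in targets)
total2 = sum(t @ P2 for t in targets)
assert np.isclose(total1, total2)  # both equal C(n-1, k-1) = 4
```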

5.2 The Fraction of Favorable Targets

Montañez proved that for a fixed information resource, a given algorithm $\mathcal{A}$ will perform favorably relative to uniform random sampling on only a few target sets, under the expected per-query probability of success [Montañez, 2017]. We generalize this result to any decomposable probability-of-success metric and define a version of active information of expectations for decomposable metrics, $I_{\phi(t,f)} := -\log_{2}\frac{p}{\phi(t,f)}$, which transforms the ratio of success probabilities into bits, where $p = |t|/|\Omega|$ is the per-query probability of success for uniform random sampling with replacement. $I_{\phi(t,f)}$ denotes the advantage $\mathcal{A}$ has over uniform random sampling with replacement, in bits.
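
For instance, with assumed numbers, a target of size $|t| = 2$ in a search space of size $|\Omega| = 100$ gives $p = 0.02$; an algorithm achieving $\phi(t,f) = 0.16$ then has

I_{\phi(t,f)} = -\log_{2}\frac{0.02}{0.16} = 3 \text{ bits}

of advantage, and Theorem 2 below bounds the proportion of such targets by $2^{-3}$.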

Theorem 2 (The Fraction of Favorable Targets).

Let $\uptau = \{t \mid t \subseteq \Omega\}$, let $\uptau_{b} = \{t \mid \emptyset \neq t \subseteq \Omega, I_{\phi(t,f)} \geq b\}$, and let $\phi$ be a decomposable probability-of-success metric. Then for $b \geq 3$,

\frac{|\uptau_{b}|}{|\uptau|} \leq 2^{-b}.    (5.2)

Thus, the scarcity of $b$-bit favorable targets still holds for any decomposable probability-of-success metric.

5.3 The Famine of Favorable Targets

Following up on the previous result, we can show a similar bound in terms of the success of a given algorithm, for targets of a fixed size.

Theorem 3 (The Famine of Favorable Targets).

For fixed $k \in \mathbb{N}$, a fixed information resource $f$, and a decomposable probability-of-success metric $\phi$, define

\uptau = \{T \mid T \subseteq \Omega, |T| = k\}, \text{ and}
\uptau_{q_{\text{min}}} = \{T \mid T \subseteq \Omega, |T| = k, \phi(T,f) \geq q_{\text{min}}\}.

Then,

\frac{|\uptau_{q_{\text{min}}}|}{|\uptau|} \leq \frac{p}{q_{\text{min}}}    (5.3)

where $p = \frac{k}{|\Omega|}$.

Here, we compare success not against uniform sampling but against a fixed constant $q_{\text{min}}$. This theorem thus upper bounds the proportion of target sets for which the probability of success of the search is at least $q_{\text{min}}$.

5.4 Famine of Forte

We generalize the Famine of Forte [Montañez, 2017], showing a bound that holds in the $k$-sparse case using any decomposable probability-of-success metric.

Theorem 4 (The Famine of Forte).

Define

\uptau_{k} = \{T \mid T \subseteq \Omega, |T| = k \in \mathbb{N}\}

and let $\mathcal{B}_{m}$ denote any set of binary strings, such that the strings are of length $m$ or less. Let

R = \{(T,F) \mid T \in \uptau_{k}, F \in \mathcal{B}_{m}\}, \text{ and}
R_{q_{\text{min}}} = \{(T,F) \mid T \in \uptau_{k}, F \in \mathcal{B}_{m}, \phi(T,F) \geq q_{\text{min}}\},

where $\phi(T,F)$ is the decomposable probability-of-success metric for algorithm $\mathcal{A}$ on problem $(\Omega,T,F)$. Then for any $m \in \mathbb{N}$,

\frac{|R_{q_{\text{min}}}|}{|R|} \leq \frac{p}{q_{\text{min}}}.    (5.4)

This demonstrates that for any decomposable metric there is an upper bound on the proportion of problems an algorithm is successful on. Here, we measure success as being above a certain threshold with respect to a decomposable metric, and the upper bound is inversely related to this threshold.

5.5 Learning Under Dependence

While the previous theorems highlight cases where an algorithm is unlikely to succeed, we now consider the conditions that make an algorithm likely to succeed. To begin, we consider how the target and information resource can influence an algorithm’s success by generalizing the Learning Under Dependence theorem [Montañez, 2017].

Theorem 5 (Learning Under Dependence).

Define $\uptau_{k} = \{T \mid T \subseteq \Omega, |T| = k \in \mathbb{N}\}$ and let $\mathcal{B}_{m}$ denote any set of binary strings (information resources), such that the strings are of length $m$ or less. Define $q$ as the expected decomposable probability of success under the joint distribution on $T \in \uptau_{k}$ and $F \in \mathcal{B}_{m}$ for any fixed algorithm $\mathcal{A}$, such that $q := \mathbb{E}_{T,F}[\phi(T,F)]$, namely,

q = \mathbb{E}_{T,F}\left[P_{\phi}(\omega\in T\mid F)\right] = \Pr(\omega\in T;\mathcal{A}).

Then,

q \leq \frac{I(T;F) + D(P_{T}\|\mathcal{U}_{T}) + 1}{I_{\Omega}}    (5.5)

where $I_{\Omega} = -\log k/|\Omega|$, $D(P_{T}\|\mathcal{U}_{T})$ is the Kullback-Leibler divergence between the marginal distribution on $T$ and the uniform distribution on $T$, and $I(T;F)$ is the mutual information. Alternatively, we can write

\Pr(\omega\in T;\mathcal{A}) \leq \frac{H(\mathcal{U}_{T}) - H(T\mid F) + 1}{I_{\Omega}}    (5.6)

where $H(\mathcal{U}_{T}) = \log\binom{|\Omega|}{k}$.

The value of $q$ defined here represents the expected single-query probability of success of an algorithm relative to a randomly selected target and information resource, distributed according to some joint distribution. The probability of success for a single query (marginalized over information resources) is equivalent to the expectation of the conditional probability of success, conditioned on the random information resource. Upper bounding this value shows that, regardless of the choice of decomposable probability-of-success metric, the probability of success depends on the amount of information regarding the target contained within the information resource, as measured by the mutual information.
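
As a numeric sketch of the bound in Eq. (5.5), with all quantities assumed purely for illustration:

```python
import numpy as np

omega, k = 1024, 1                # |Omega| and target size (assumed)
I_TF = 4.0                        # mutual information I(T;F) in bits (assumed)
D_kl = 0.0                        # D(P_T || U_T) = 0 if T is marginally uniform
I_omega = -np.log2(k / omega)     # I_Omega = -log(k/|Omega|) = 10 bits

bound = (I_TF + D_kl + 1) / I_omega
print(bound)                      # 0.5: q can be no better than 1/2
```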

5.6 Famine of Favorable Information Resources

We now demonstrate the effect of the general bias term defined earlier on the probability of success of an algorithm. We begin with a generalization of the Famine of Favorable Information Resources [Montañez et al., 2019].

Theorem 6 (Famine of Favorable Information Resources).

Let $\mathcal{B}$ be a finite set of information resources and let $t \subseteq \Omega$ be an arbitrary fixed $k$-size target set with corresponding target function $\mathbf{t}$. Define

\mathcal{B}_{q_{\mathrm{min}}} = \{f \mid f \in \mathcal{B}, \phi(t,f) \geq q_{\mathrm{min}}\},

where $\phi(t,f)$ is an arbitrary decomposable probability-of-success metric for algorithm $\mathcal{A}$ on search problem $(\Omega,t,f)$ and $q_{\mathrm{min}} \in (0,1]$ represents the minimally acceptable probability of success. Then,

\frac{|\mathcal{B}_{q_{\mathrm{min}}}|}{|\mathcal{B}|} \leq \frac{p + \mathrm{Bias}(\mathcal{B},\mathbf{t})}{q_{\mathrm{min}}}    (5.7)

where $p = \frac{k}{|\Omega|}$.

This result demonstrates the mathematical effect of bias, of which we have previously provided one hypothetical example (car parking). Now, we can show that the bias of our expected information resources towards the target upper bounds the probability of a given information resource leading to a successful search.

5.7 Futility of Bias-Free Search

We can also use our definition of bias to generalize the Futility of Bias-Free Search [Montañez, 2017], which demonstrates the inability of an algorithm to perform better than uniform random sampling without bias, defined with respect to the expected per-query probability of success. Our generalization proves that the theorem holds for bias defined with respect to any decomposable probability-of-success metric.

Theorem 7 (Futility of Bias-Free Search).

For any fixed algorithm $\mathcal{A}$, fixed target $t \subseteq \Omega$ with corresponding target function $\mathbf{t}$, and distribution over information resources $\mathcal{D}$, if $\mathrm{Bias}(\mathcal{D},\mathbf{t}) = 0$, then

\Pr(\omega\in t;\mathcal{A}) = p    (5.8)

where $\Pr(\omega\in t;\mathcal{A})$ represents the expected decomposable probability of successfully sampling an element of $t$ using $\mathcal{A}$, marginalized over information resources $F \sim \mathcal{D}$, and $p$ is the single-query probability of success under uniform random sampling.

This result demonstrates that, regardless of how we measure the success of an algorithm with respect to a decomposable metric, it cannot perform better than uniform random sampling without bias.
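
A small numeric sketch of Theorem 7 (toy vectors assumed): when the distributions $\mathbf{P}_{\phi,f}$ average to uniform under $\mathcal{D}$, the bias is zero and the marginalized success probability collapses to $p = k/|\Omega|$:

```python
import numpy as np

t = np.array([1, 0, 0, 0])                  # 1-hot target, p = k/|Omega| = 0.25
P_f1 = np.array([0.40, 0.20, 0.20, 0.20])   # assumed P_{phi,f1}
P_f2 = np.array([0.10, 0.30, 0.30, 0.30])   # assumed P_{phi,f2}
D = np.array([0.5, 0.5])                    # distribution D over {f1, f2}

expected_P = D[0] * P_f1 + D[1] * P_f2      # E_D[P_{phi,F}] is uniform here
assert np.allclose(expected_P, 0.25)        # zero bias
print(t @ expected_P)                       # 0.25 == p, as Theorem 7 states
```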

5.8 Famine of Favorable Biasing Distributions

Montañez proved that the percentage of minimally favorable distributions (biased over some threshold towards some specific target) is inversely proportional to the threshold value and directly proportional to the bias between the information resource and target function [Montañez, 2017]. We will show that this scarcity of favorable biasing distributions holds, in general, for bias under any decomposable probability-of-success metric.

Theorem 8 (Famine of Favorable Biasing Distributions).

Given a fixed target function $\mathbf{t}$, a finite set of information resources $\mathcal{B}$, a distribution over information resources $\mathcal{D}$, and the set $\mathcal{P} = \{\mathcal{D} \mid \mathcal{D} \in \mathbb{R}^{|\mathcal{B}|}, \sum_{f\in\mathcal{B}}\mathcal{D}(f) = 1\}$ of all discrete $|\mathcal{B}|$-dimensional simplex vectors,

\frac{\mu(\mathcal{G}_{\mathbf{t},q_{\mathrm{min}}})}{\mu(\mathcal{P})} \leq \frac{p + \mathrm{Bias}(\mathcal{B},\mathbf{t})}{q_{\mathrm{min}}}    (5.9)

where $\mathcal{G}_{\mathbf{t},q_{\mathrm{min}}} = \{\mathcal{D} \mid \mathcal{D} \in \mathcal{P}, \mathrm{Bias}(\mathcal{D},\mathbf{t}) \geq q_{\mathrm{min}}\}$, $p = \frac{k}{|\Omega|}$, and $\mu$ is the Lebesgue measure.

This result shows that the more bias there is between our set of information resources \mathcal{B} and the target function 𝐭\mathbf{t}, the easier it is to find a minimally favorable distribution, and the higher the threshold for what qualifies as a minimally favorable distribution, the harder our search becomes. Thus, unless we want to suppose that we begin with a set of information resources already favorable towards our fixed target, finding a highly favorable distribution is difficult.

6 CONCLUSION

Casting machine learning problems as search provides a common formalism within which to prove bounds and impossibility results for a wide variety of learning algorithms and tasks. In this paper, we introduce a property of probability-of-success metrics called decomposability, and show that the expected per-query probability of success and the general probability of success are decomposable. To demonstrate the value of this property, we prove that a number of existing algorithmic search framework results continue to hold for all decomposable probability-of-success metrics. These results provide a number of useful insights: algorithmic performance is conserved with respect to all decomposable probability-of-success metrics, favorable targets are scarce no matter the decomposable probability-of-success metric, and without the generalized bias defined here, an algorithm will not perform better than uniform random sampling.

The goal of this work is to offer additional machinery within the search framework, allowing for more general application. Concretely, we can develop decomposable probability-of-success metrics for problems concerned with the state of an algorithm at specific steps, and leverage existing results as a foundation for additional insight into those problems.

ACKNOWLEDGEMENTS

This work was supported by the Walter Bradley Center for Natural and Artificial Intelligence. We thank Dr. Robert J. Marks II (Baylor University) for providing support and feedback. We also thank Harvey Mudd College’s Department of Computer Science for their continued resources and support.

REFERENCES

  • Fano and Hawkins, 1961 Fano, R. M. and Hawkins, D. (1961). Transmission of information: A statistical theory of communications. American Journal of Physics, 29(11):793–794.
  • Gülçehre and Bengio, 2016 Gülçehre, Ç. and Bengio, Y. (2016). Knowledge matters: Importance of prior information for optimization. The Journal of Machine Learning Research, 17(1):226–257.
  • Kumagai and Kanamori, 2019 Kumagai, W. and Kanamori, T. (2019). Risk bound of transfer learning using parametric feature mapping and its application to sparse coding. Machine Learning, 108:1975–2008.
  • Mitchell, 1980 Mitchell, T. M. (1980). The need for biases in learning generalizations. Technical report, Computer Science Department, Rutgers University, New Brunswick, NJ.
  • Mitchell, 1982 Mitchell, T. M. (1982). Generalization as Search. Artificial intelligence, 18(2):203–226.
  • Montañez, 2017 Montañez, G. D. (2017). The Famine of Forte: Few Search Problems Greatly Favor Your Algorithm. In Systems, Man, and Cybernetics (SMC), 2017 IEEE International Conference on, pages 477–482. IEEE.
  • Montañez, 2017 Montañez, G. D. (2017). Why Machine Learning Works. PhD thesis, Carnegie Mellon University.
  • Montañez et al., 2019 Montañez, G. D., Hayase, J., Lauw, J., Macias, D., Trikha, A., and Vendemiatti, J. (2019). The Futility of Bias-Free Learning and Search. CoRR, abs/1907.06010.
  • Sauer, 1972 Sauer, N. (1972). On the density of families of sets. Journal of Combinatorial Theory, Series A, 13(1):145–147.
  • Schaffer, 1994 Schaffer, C. (1994). A Conservation Law for Generalization Performance. Machine Learning Proceedings 1994, 1:259–265.
  • Sterkenburg, 2019 Sterkenburg, T. F. (2019). Putnam’s Diagonal Argument and the Impossibility of a Universal Learning Machine. Erkenntnis, 84(3):633–656.
  • Valiant, 1984 Valiant, L. (1984). A Theory of the Learnable. Communications of the ACM, 27:1134–1142.
  • Wolpert and Macready, 1997 Wolpert, D. H. and Macready, W. G. (1997). No free lunch theorems for optimization. IEEE Trans. Evolutionary Computation, 1:67–82.

7 Appendix

Lemma 7.1, Lemma 7.2, Lemma 7.3, and Lemma 7.4 with their proofs are taken from [Montañez, 2017], with Lemma 7.4 being adapted for decomposable probability-of-success metrics.

7.1 Lemmata

Lemma 7.1 (Sauer-Shelah Inequality).

For $d \leq n$, $\sum_{j=0}^{d}\binom{n}{j} \leq \left(\frac{en}{d}\right)^{d}$.

Proof.

We reproduce a simple proof of the Sauer-Shelah inequality [Sauer, 1972] for completeness.

\sum_{j=0}^{d}\binom{n}{j} \leq \left(\frac{n}{d}\right)^{d}\sum_{j=0}^{d}\binom{n}{j}\left(\frac{d}{n}\right)^{j}
\leq \left(\frac{n}{d}\right)^{d}\sum_{j=0}^{n}\binom{n}{j}\left(\frac{d}{n}\right)^{j}
= \left(\frac{n}{d}\right)^{d}\left(1+\frac{d}{n}\right)^{n}
\leq \left(\frac{n}{d}\right)^{d}\lim_{n\rightarrow\infty}\left(1+\frac{d}{n}\right)^{n}
= \left(\frac{en}{d}\right)^{d}. ∎

Lemma 7.2 (Binomial Approximation).

$\sum_{j=0}^{\lfloor n/2^{b}\rfloor}\binom{n}{j} \leq 2^{n-b}$ for $b \geq 3$ and $n \geq 2^{b}$.

Proof.

By the condition $b \geq 3$, we have

2b + \log_{2}e \leq 2^{b},

which implies $2^{-b}(2b + \log_{2}e) \leq 1$. Therefore,

1 \geq 2^{-b}(2b+\log_{2}e)
= \frac{b}{2^{b}} + \frac{b+\log_{2}e}{2^{b}}
\geq \frac{b}{n} + \frac{b+\log_{2}e}{2^{b}},

using the condition $n \geq 2^{b}$, which implies

n \geq b + \frac{n}{2^{b}}(b+\log_{2}e).

Thus,

2^{n} \geq 2^{b+\frac{n}{2^{b}}(b+\log_{2}e)}
= 2^{b}\,2^{\frac{n}{2^{b}}(b+\log_{2}e)}
= 2^{b}\left(2^{b}2^{\log_{2}e}\right)^{\frac{n}{2^{b}}}
= 2^{b}\left(2^{b}e\right)^{\frac{n}{2^{b}}}
= 2^{b}\left(\frac{en}{n/2^{b}}\right)^{\frac{n}{2^{b}}}
\geq 2^{b}\sum_{j=0}^{n/2^{b}}\binom{n}{j}
\geq 2^{b}\sum_{j=0}^{\lfloor n/2^{b}\rfloor}\binom{n}{j},

where the penultimate inequality follows from the Sauer-Shelah inequality [Sauer, 1972]. Dividing through by $2^{b}$ gives the desired result. ∎

Lemma 7.3 (Maximum Number of Satisfying Vectors).

Given an integer $1 \leq k \leq n$, a set $\mathcal{S} = \{\mathbf{s} : \mathbf{s} \in \{0,1\}^{n}, \|\mathbf{s}\| = \sqrt{k}\}$ of all $n$-length $k$-hot binary vectors, a set $\mathcal{P} = \{\mathbf{P} : \mathbf{P} \in \mathbb{R}^{n}, \sum_{j}\mathbf{P}_{j} = 1\}$ of discrete $n$-dimensional simplex vectors, and a fixed scalar threshold $\epsilon \in [0,1]$, for any fixed $\mathbf{P} \in \mathcal{P}$,

\sum_{\mathbf{s}\in\mathcal{S}}\mathds{1}_{\mathbf{s}^{\top}\mathbf{P}\geq\epsilon} \leq \frac{1}{\epsilon}\binom{n-1}{k-1}

where $\mathbf{s}^{\top}\mathbf{P}$ denotes the vector dot product between $\mathbf{s}$ and $\mathbf{P}$.

Proof.

For $\epsilon = 0$, the bound holds trivially. For $\epsilon > 0$, let $\mathbf{S}$ be a random quantity that takes values $\mathbf{s}$ uniformly in the set $\mathcal{S}$. Then, for any fixed $\mathbf{P} \in \mathcal{P}$,

\sum_{\mathbf{s}\in\mathcal{S}}\mathds{1}_{\mathbf{s}^{\top}\mathbf{P}\geq\epsilon} = \binom{n}{k}\mathbb{E}\left[\mathds{1}_{\mathbf{S}^{\top}\mathbf{P}\geq\epsilon}\right]
= \binom{n}{k}\Pr\left(\mathbf{S}^{\top}\mathbf{P}\geq\epsilon\right).

Let $\mathbf{1}$ denote the all-ones vector. Under a uniform distribution on the random quantity $\mathbf{S}$, and because $\mathbf{P}$ does not change with respect to $\mathbf{s}$, we have

\mathbb{E}\left[\mathbf{S}^{\top}\mathbf{P}\right] = \binom{n}{k}^{-1}\sum_{\mathbf{s}\in\mathcal{S}}\mathbf{s}^{\top}\mathbf{P}
= \mathbf{P}^{\top}\binom{n}{k}^{-1}\sum_{\mathbf{s}\in\mathcal{S}}\mathbf{s}
= \mathbf{P}^{\top}\frac{\mathbf{1}\binom{n-1}{k-1}}{\binom{n}{k}}
= \mathbf{P}^{\top}\frac{\mathbf{1}\binom{n-1}{k-1}}{\frac{n}{k}\binom{n-1}{k-1}}
= \frac{k}{n}\mathbf{P}^{\top}\mathbf{1}
= \frac{k}{n},

since $\mathbf{P}$ must sum to $1$.

Noting that $\mathbf{S}^{\top}\mathbf{P} \geq 0$, we use Markov's inequality to obtain

\sum_{\mathbf{s}\in\mathcal{S}}\mathds{1}_{\mathbf{s}^{\top}\mathbf{P}\geq\epsilon} = \binom{n}{k}\Pr\left(\mathbf{S}^{\top}\mathbf{P}\geq\epsilon\right)
\leq \binom{n}{k}\frac{1}{\epsilon}\mathbb{E}\left[\mathbf{S}^{\top}\mathbf{P}\right]
= \binom{n}{k}\frac{1}{\epsilon}\frac{k}{n}
= \frac{1}{\epsilon}\frac{k}{n}\cdot\frac{n}{k}\binom{n-1}{k-1}
= \frac{1}{\epsilon}\binom{n-1}{k-1}. ∎

Lemma 7.4.

If $X \perp T \mid F$, then

\Pr(X\in T;\mathcal{A}) = \mathbb{E}_{T,F}[\phi(T,F)].
Proof.

$\Pr(X\in T;\mathcal{A})$ is the probability that random variable $X$ is in target $T$ over all values of $F$, for random $T$ and for $X$ drawn from $P_{\phi}(X\mid F)$. Then,

\Pr(X\in T;\mathcal{A}) = \mathbb{E}_{T,X}[\mathds{1}_{X\in T}\mid\mathcal{A}]
= \mathbb{E}_{T}[\mathbb{E}_{X}[\mathds{1}_{X\in T}\mid T,\mathcal{A}]]
= \mathbb{E}_{T}[\mathbb{E}_{F}[\mathbb{E}_{X}[\mathds{1}_{X\in T}\mid T,F,\mathcal{A}]\mid T]]
= \mathbb{E}_{T}[\mathbb{E}_{F}[\mathbb{E}_{X}[\mathds{1}_{X\in T}\mid F,\mathcal{A}]\mid T]]
= \mathbb{E}_{T,F}[\mathbb{E}_{X}[\mathds{1}_{X\in T}\mid F,\mathcal{A}]]
= \mathbb{E}_{T,F}[P_{\phi}(X\in T\mid F)]
= \mathbb{E}_{T,F}[\phi(T,F)],

where the third equality makes use of the law of iterated expectation, the fourth follows from the conditional independence assumption, and the final equality follows from the definition of decomposability. ∎

Lemma 4.2 (restated).

Proof.

By definition,

q(t,f) = \overline{P}(X\in t\mid f) = \mathbf{t}^{\top}\overline{\mathbf{P}}_{f}. ∎

Lemma 4.3 (restated).

Proof.

Observe that $P_{\alpha}(X\in t\mid f) = \mathbf{t}^{\top}\mathbf{P}_{\alpha,f}$. Since $P(x)$ is a probability distribution,

P_{\alpha}(X\in t\mid f) = \sum_{x}\mathds{1}_{x\in t}\int\mathbb{E}_{P\sim P_{\alpha}}[P(x)]\,\mathrm{d}\nu(\tilde{P},h\mid f)
= \sum_{x}\mathds{1}_{x\in t}\int\sum_{i=1}^{|\tilde{P}|}\alpha_{i}P_{i}(x)\,\mathrm{d}\nu(\tilde{P},h\mid f)
= \int\left[\sum_{i=1}^{|\tilde{P}|}\alpha_{i}\left(\sum_{x}\mathds{1}_{x\in t}P_{i}(x)\right)\right]\mathrm{d}\nu(\tilde{P},h\mid f)
= \mathbb{E}_{\tilde{P},H}\left[\sum_{i=1}^{|\tilde{P}|}\alpha_{i}P_{i}(x\in t)\,\bigg|\,f\right]
= q_{\alpha}(t,f). ∎

Lemma 4.4 (restated).

Proof.
ϕ(t,f)=𝔼𝒟[ϕ(t,f)]\displaystyle\phi^{\prime}(t,f)=\mathbb{E}_{\mathcal{D}}[\phi(t,f)] =iϕi(t,f)𝒟(ϕi)\displaystyle=\sum_{i}\phi_{i}(t,f)\mathcal{D}(\phi_{i})
=i(tPϕi,f)𝒟(ϕi)\displaystyle=\sum_{i}\left(\textbf{t}^{\top}\textbf{P}_{\phi_{i},f}\right)\mathcal{D}(\phi_{i})
=tiPϕi,f𝒟(ϕi)\displaystyle=\textbf{t}^{\top}\sum_{i}\textbf{P}_{\phi_{i},f}\mathcal{D}(\phi_{i})
=t𝔼𝒟[Pϕ,f]\displaystyle=\textbf{t}^{\top}\mathbb{E}_{\mathcal{D}}[\textbf{P}_{\phi,f}]
=tP¯ϕ,f.\displaystyle=\textbf{t}^{\top}\overline{\textbf{P}}_{\phi,f}.

7.2 Proofs of Theorems

Theorem 1 (restated).

Proof.

Note that the closed under permutation condition implies $\sum_{\mathbf{t}\in\uptau}\mathbf{t} = [c, c, \ldots, c] = \mathbf{1}\cdot c$ for some constant $c$.

\sum_{t\in\uptau}\sum_{f\in\mathcal{B}}\phi_{\mathcal{A}_{1}}(t,f) = \sum_{\mathbf{t}\in\uptau}\sum_{f\in\mathcal{B}}\mathbf{P}_{\phi,f,\mathcal{A}_{1}}^{\top}\mathbf{t}
= \sum_{f\in\mathcal{B}}\mathbf{P}_{\phi,f,\mathcal{A}_{1}}^{\top}\sum_{\mathbf{t}\in\uptau}\mathbf{t}
= \sum_{f\in\mathcal{B}}\mathbf{P}_{\phi,f,\mathcal{A}_{1}}^{\top}\mathbf{1}\cdot c
= c\sum_{f\in\mathcal{B}}\mathbf{P}_{\phi,f,\mathcal{A}_{1}}^{\top}\mathbf{1}
= c\sum_{f\in\mathcal{B}}1
= c\sum_{f\in\mathcal{B}}\mathbf{P}_{\phi,f,\mathcal{A}_{2}}^{\top}\mathbf{1}
= \sum_{\mathbf{t}\in\uptau}\sum_{f\in\mathcal{B}}\mathbf{P}_{\phi,f,\mathcal{A}_{2}}^{\top}\mathbf{t}
= \sum_{t\in\uptau}\sum_{f\in\mathcal{B}}\phi_{\mathcal{A}_{2}}(t,f)

where the first and final equalities follow from the definition of decomposability. ∎

Theorem 2 (restated).

Proof.

First, by the definition of active information of expectations, $I_{\phi(T,F)} \geq b$ implies $|T| \leq \frac{|\Omega|}{2^{b}}$, since $\phi(T,F) \leq 1$. Thus,

\uptau_{b} \subseteq \uptau'_{b} = \left\{T \mid T\subseteq\Omega, 1\leq|T|\leq\frac{|\Omega|}{2^{b}}\right\}.    (7.1)

For $|\Omega| < 2^{b}$ and $I_{\phi(T,F)} \geq b$, we have $|T| < 1$ for all elements of $\uptau'_{b}$ (making the set empty), and the theorem follows immediately. Thus, $|\Omega| \geq 2^{b}$ for the remainder.

By Lemma 7.2, we have

\frac{|\uptau_{b}|}{|\uptau|} \leq \frac{|\uptau'_{b}|}{|\uptau|}    (7.2)
= 2^{-|\Omega|}\sum_{k=0}^{\left\lfloor\frac{|\Omega|}{2^{b}}\right\rfloor}\binom{|\Omega|}{k}    (7.3)
\leq 2^{-|\Omega|}\,2^{|\Omega|-b}    (7.4)
= 2^{-b}.    (7.5) ∎

Theorem 3 (restated).

Proof.

Let $\mathcal{S} = \{\mathbf{s} : \mathbf{s} \in \{0,1\}^{|\Omega|}, \|\mathbf{s}\| = \sqrt{k}\}$. For brevity, we will allow $\mathbf{s}$ to also denote its corresponding target set, letting the context make clear whether the target set or target function is meant. Then,

\frac{|\uptau_{q_{\text{min}}}|}{|\uptau|} = \frac{\sum_{\mathbf{s}\in\mathcal{S}}\mathds{1}_{\phi(\mathbf{s},f)\geq q_{\text{min}}}}{\binom{|\Omega|}{k}}
= \binom{|\Omega|}{k}^{-1}\sum_{\mathbf{s}\in\mathcal{S}}\mathds{1}_{\phi(\mathbf{s},f)\geq q_{\text{min}}}
= \mathbb{E}_{\mathcal{U}[\mathcal{S}]}\left[\mathds{1}_{\phi(\mathbf{S},f)\geq q_{\text{min}}}\right]
= \Pr(\phi(\mathbf{S},f)\geq q_{\text{min}})
\leq \frac{\mathbb{E}_{\mathcal{U}[\mathcal{S}]}[\phi(\mathbf{S},f)]}{q_{\text{min}}},

where the final step follows from Markov's inequality. By the decomposability of $\phi$ and linearity of expectation, we have

\frac{\mathbb{E}_{\mathcal{U}[\mathcal{S}]}[\phi(\mathbf{S},f)]}{q_{\text{min}}} = \frac{\mathbb{E}_{\mathcal{U}[\mathcal{S}]}[\mathbf{S}^{\top}\mathbf{P}_{\phi,f}]}{q_{\text{min}}}
= \frac{\mathbf{P}_{\phi,f}^{\top}\mathbb{E}_{\mathcal{U}[\mathcal{S}]}[\mathbf{S}]}{q_{\text{min}}}
= \frac{\mathbf{P}_{\phi,f}^{\top}\mathbf{1}\left[\binom{|\Omega|}{k}^{-1}\binom{|\Omega|-1}{k-1}\right]}{q_{\text{min}}}
= \frac{k}{|\Omega|}\cdot\frac{\mathbf{P}_{\phi,f}^{\top}\mathbf{1}}{q_{\text{min}}}
= \frac{p}{q_{\text{min}}}. ∎

Theorem 4 (restated).

Proof.

We begin by defining the set $\mathcal{S} = \{\mathbf{s} : \mathbf{s} \in \{0,1\}^{|\Omega|}, \|\mathbf{s}\| = \sqrt{k}\}$ of all $|\Omega|$-length target functions with exactly $k$ ones. As in Theorem 3, we again allow $\mathbf{s}$ to also denote its corresponding target set. For each of these $\mathbf{s}$, we have $|\mathcal{B}_{m}|$ information resources. The total number of search problems is therefore

\binom{|\Omega|}{k}|\mathcal{B}_{m}|.    (7.6)

We seek to bound the proportion of possible search problems for which $\phi(\mathbf{s},f) \geq q_{\text{min}}$, for any threshold $q_{\text{min}} \in (0,1]$. Thus,

\frac{|R_{q_{\text{min}}}|}{|R|} \leq \frac{|\mathcal{B}_{m}|\,\sup_{f}\left[\sum_{\mathbf{s}\in\mathcal{S}}\mathds{1}_{\phi(\mathbf{s},f)\geq q_{\text{min}}}\right]}{|\mathcal{B}_{m}|\binom{|\Omega|}{k}}    (7.7)
= \binom{|\Omega|}{k}^{-1}\sum_{\mathbf{s}\in\mathcal{S}}\mathds{1}_{\phi(\mathbf{s},f^{*})\geq q_{\text{min}}}    (7.8)

where $f^{*} \in \mathcal{B}_{m}$ denotes the arg sup of the expression. Therefore,

\frac{|R_{q_{\text{min}}}|}{|R|} \leq \binom{|\Omega|}{k}^{-1}\sum_{\mathbf{s}\in\mathcal{S}}\mathds{1}_{\phi(\mathbf{s},f^{*})\geq q_{\text{min}}}
= \binom{|\Omega|}{k}^{-1}\sum_{\mathbf{s}\in\mathcal{S}}\mathds{1}_{\mathbf{s}^{\top}\mathbf{P}_{\phi,f^{*}}\geq q_{\text{min}}}

where the equality follows from the decomposability of $\phi(\mathbf{s},f^{*})$, and $\mathbf{P}_{\phi,f^{*}}$ represents the $|\Omega|$-length probability vector defined by $P_{\phi}(\cdot\mid f^{*})$. By Lemma 7.3, we have

\binom{|\Omega|}{k}^{-1}\sum_{\mathbf{s}\in\mathcal{S}}\mathds{1}_{\mathbf{s}^{\top}\mathbf{P}_{\phi,f^{*}}\geq q_{\text{min}}} \leq \binom{|\Omega|}{k}^{-1}\left[\frac{1}{q_{\text{min}}}\binom{|\Omega|-1}{k-1}\right]
= \frac{k}{|\Omega|}\cdot\frac{1}{q_{\text{min}}}
= p/q_{\text{min}},    (7.9)

proving the result for finite information resources. ∎

Theorem 5 (restated).

Proof.

This proof loosely follows that of Fano's Inequality [Fano and Hawkins, 1961], being a reversed generalization of it. Let $Z = \mathds{1}(X \in T)$. Using the chain rule for entropy to expand $H(Z,T\mid X)$ in two different ways, we get

H(Z,T\mid X) = H(Z\mid T,X) + H(T\mid X)
= H(T\mid Z,X) + H(Z\mid X).

By definition, $H(Z\mid T,X) = 0$, and by the data processing inequality, $H(T\mid F) \leq H(T\mid X)$. Thus,

H(T\mid F) \leq H(T\mid Z,X) + H(Z\mid X).

Define $P_{g} = \Pr(X\in T;\mathcal{A}) = P(Z=1)$. Then,

H(T\mid Z,X) = (1-P_{g})H(T\mid Z=0,X) + P_{g}H(T\mid Z=1,X)
\leq (1-P_{g})\log\binom{|\Omega|}{k} + P_{g}\log\binom{|\Omega|-1}{k-1}
= \log\binom{|\Omega|}{k} - P_{g}\log\frac{|\Omega|}{k}.

We let $H(\mathcal{U}_{T}) = \log\binom{|\Omega|}{k}$, being the entropy of the uniform distribution over $k$-sparse target sets in $\Omega$. Therefore,

H(T\mid F) \leq H(\mathcal{U}_{T}) - P_{g}\log\frac{|\Omega|}{k} + H(Z\mid X).

Using the definitions of conditional entropy and $I_{\Omega}$, we get

H(T) - I(T;F) \leq H(\mathcal{U}_{T}) - P_{g}I_{\Omega} + H(Z\mid X),

which implies

P_{g}I_{\Omega} \leq I(T;F) + H(\mathcal{U}_{T}) - H(T) + H(Z\mid X)
= I(T;F) + D(P_{T}\|\mathcal{U}_{T}) + H(Z\mid X).

Examining $H(Z\mid X)$, we see it captures how much of the entropy of $Z$ is due to the randomness of $T$. We upper bound this by its maximum value of $1$ and obtain

\Pr(X\in T;\mathcal{A}) \leq \frac{I(T;F) + D(P_{T}\|\mathcal{U}_{T}) + 1}{I_{\Omega}},

and substitute $q$ for $\Pr(X\in T;\mathcal{A})$ to obtain the first result, noting that $q = \mathbb{E}_{T,F}\left[P_{\phi}(\omega\in T\mid F)\right]$ specifies a proper probability distribution by the linearity and boundedness of the expectation. To obtain the second form, use the definitions $I(T;F) = H(T) - H(T\mid F)$ and $D(P_{T}\|\mathcal{U}_{T}) = H(\mathcal{U}_{T}) - H(T)$. ∎

Theorem 6 (restated).

Proof.

We seek to bound the proportion of information resources for which $\phi(t,f) \geq q_{\mathrm{min}}$, for any threshold $q_{\mathrm{min}} \in (0,1]$. Let $F \sim \mathcal{U}[\mathcal{B}]$. Then,

\frac{|\mathcal{B}_{q_{\mathrm{min}}}|}{|\mathcal{B}|} = \frac{1}{|\mathcal{B}|}\sum_{f\in\mathcal{B}}\mathds{1}_{\phi(t,f)\geq q_{\mathrm{min}}}
= \mathbb{E}_{\mathcal{U}[\mathcal{B}]}\left[\mathds{1}_{\phi(t,F)\geq q_{\mathrm{min}}}\right]
= \Pr(\phi(t,F)\geq q_{\mathrm{min}}).

By decomposability, we have

\frac{|\mathcal{B}_{q_{\mathrm{min}}}|}{|\mathcal{B}|} = \Pr(\mathbf{t}^{\top}\mathbf{P}_{\phi,F}\geq q_{\mathrm{min}}).

Applying Markov's inequality and the definition of $\mathrm{Bias}(\mathcal{B},\mathbf{t})$, we obtain

\frac{|\mathcal{B}_{q_{\mathrm{min}}}|}{|\mathcal{B}|} \leq \frac{\mathbb{E}_{\mathcal{U}[\mathcal{B}]}[\mathbf{t}^{\top}\mathbf{P}_{\phi,F}]}{q_{\mathrm{min}}}
= \frac{p + \mathrm{Bias}(\mathcal{B},\mathbf{t})}{q_{\mathrm{min}}}. ∎

Theorem 7 (restated).

Proof.

Let $\mathcal{F}$ be the space of possible information resources. Then,

\Pr(\omega\in t;\mathcal{A}) = \int_{\mathcal{F}}\Pr(\omega\in t, f;\mathcal{A})\,\mathrm{d}f
= \int_{\mathcal{F}}\Pr(\omega\in t\mid f;\mathcal{A})\Pr(f)\,\mathrm{d}f.

Since we are considering the per-query probability of success for algorithm $\mathcal{A}$ on $t$ using information resource $f$, we have

\Pr(\omega\in t\mid f;\mathcal{A}) = P_{\phi}(\omega\in t\mid f).

Also note that $\Pr(f) = \mathcal{D}(f)$ by the fact that $F \sim \mathcal{D}$. Making these substitutions, we obtain

\Pr(\omega\in t;\mathcal{A}) = \int_{\mathcal{F}}P_{\phi}(\omega\in t\mid f)\,\mathcal{D}(f)\,\mathrm{d}f
= \mathbb{E}_{\mathcal{D}}\left[P_{\phi}(\omega\in t\mid F)\right]
= \mathbb{E}_{\mathcal{D}}\left[\mathbf{t}^{\top}\mathbf{P}_{\phi,F}\right]
= \mathrm{Bias}(\mathcal{D},\mathbf{t}) + p
= p. ∎

Theorem 8 (restated).

Proof.

This result follows from the proof of the Famine of Favorable Biasing Distributions in [Montañez et al., 2019], substituting the generalized form of bias. No other changes to the proof are needed. ∎