
Decomposable Probability-of-Success Metrics in Algorithmic Search

Tyler Sam1, Jake Williams1, Abel Tadesse2, Huey Sun3, and George Montañez1
1 Harvey Mudd College, California, USA
2 Claremont McKenna College, California, USA
3 Pomona College, California, USA
{tsam, jlwilliams, gmontanez}@g.hmc.edu, [email protected], [email protected]
ORCID: https://orcid.org/0000-0001-7974-3226, https://orcid.org/0000-0001-9714-1851, https://orcid.org/0000-0002-3337-9454, https://orcid.org/0000-0002-0949-3169, https://orcid.org/0000-0002-1333-4611
Abstract

Previous studies have used a specific success metric within an algorithmic search framework to prove machine learning impossibility results. However, this success metric prevents us from applying these results to other forms of machine learning, e.g., transfer learning. To address this, we define decomposable metrics, a category of success metrics for search problems that can be expressed as a linear operation on a probability distribution. Using an arbitrary decomposable metric to measure the success of a search, we prove theorems that bound success in various ways, generalizing several existing results in the literature.

1 INTRODUCTION

Many machine learning tasks, such as classification, regression, and clustering, can be reduced to search problems [Montañez, 2017]. Through this reduction, one can apply concepts from information theory to derive results about machine learning. To compare the success of different algorithms, measured as the expected probability of finding a desired element, Montañez defined a metric of success that averages the probability of success over all iterations of an algorithm [Montañez, 2017]. While this metric has many applications, it is not appropriate for cases where the probability of success at a given iteration of an algorithm is required. An example is transfer learning, where the probability of success at the final step of the algorithm is more relevant than the average probability of success.

Building on this work, we define decomposability as a property of probability-of-success metrics and show that the expected per-query probability of success [Montañez, 2017] and more general probability-of-success metrics are decomposable. We then show that the results previously proven for the expected per-query probability of success hold for all decomposable probability-of-success metrics. Under this generalization, we can prove results related to the probability of success at specific iterations of a search rather than just uniformly averaged over the entire search, giving the results much broader applicability.

2 RELATED WORK

Several decades ago, Mitchell proposed that classification could be viewed as search, and reduced the problem of learning generalizations to a search problem within a hypothesis space [Mitchell, 1980, Mitchell, 1982]. Montañez subsequently expanded this idea into a formal search framework [Montañez, 2017].

Montañez showed that for a given algorithm with a fixed information resource, favorable target sets, or the target sets on which the algorithm would perform better than uniform random sampling, are rare. He did this by proving that the proportion of $b$-bit favorable problems has an exponentially decaying restrictive bound [Montañez, 2017]. He further showed that this scarcity of favorable problems exists even for $k$-sparse target sets.

Montañez et al. later defined bias, the degree to which an algorithm is predisposed to a fixed target, with respect to the expected per-query probability of success metric, and proved that there were a limited number of favorable information resources for a given bias [Montañez et al., 2019]. Using the search framework, they proved that an algorithm cannot be favorably biased towards many distinct targets simultaneously.

As machine learning grew in prominence, researchers began to probe what was possible within machine learning. Valiant considered learnability of a task as the ability to generate a program for performing the task without explicit programming of the task [Valiant, 1984]. By restricting the tasks to a specific context, Valiant demonstrated a set of tasks which were provably learnable.

Schaffer provided an early foundation to the idea of bounding universal performance of an algorithm [Schaffer, 1994]. Schaffer analyzed generalization performance, the ability of a learner to classify objects outside of its training set, in a classification task. Using a baseline of uniform sampling from the classifiers, he showed that, over the set of all learning situations, a learner’s generalization performance sums to zero, which makes generalization performance a conserved quantity.

Wolpert and Macready demonstrated that the historical performance of a deterministic optimization algorithm provides no a priori justification whatsoever for its continued use over any other alternative going forward [Wolpert and Macready, 1997], implying that a history of better performance gives no rational basis for preferring one algorithm over another. Furthermore, just as there does not exist a single algorithm that performs better than random on all possible optimization problems, they proved that there also does not exist an optimization problem on which all algorithms perform better than average.

Continuing the application of prior knowledge to learning and optimization, Gülçehre and Bengio showed that the worse-than-chance performance of certain machine learning algorithms can be improved through learning with hints, namely, guidance using a curriculum [Gülçehre and Bengio, 2016]. Thus, while Wolpert's results might make certain tasks seem futile, Gülçehre's empirical results show that prior knowledge can be leveraged to obtain better results in both learning and optimization.

Others have worked towards meaningful bounds on algorithmic success through different approaches. Sterkenburg approached this concept from the perspective of Putnam, who originally claimed that a universal learning machine is impossible through the use of a diagonalization argument [Sterkenburg, 2019]. Sterkenburg follows up on this claim, attempting to find a universal inductive rule by exploring a measure which cannot be diagonalized. Even when attempting to evade Putnam’s original diagonalization, Sterkenburg is able to apply a new diagonalization that reinforces Putnam’s original claim of the impossibility of a universal learning machine.

There has also been work on proving learning bounds for specific problems. Kumagai and Kanamori analyzed the theoretical bounds of parameter transfer algorithms and self-taught learning [Kumagai and Kanamori, 2019]. By looking at the local stability, or the degree to which a feature is affected by shifting parameters, they developed a definition for parameter transfer learnability, which describes the probability of effective transfer.

2.1 Distinctions from Prior Work

The expected per-query probability of success metric previously defined in the algorithmic search framework [Montañez, 2017] tells us, for a given information resource, algorithm, and target set, how often (in expectation) our algorithm will successfully locate elements of the target set. While this metric is useful when making general claims about the performance of an algorithm or the favorability of an algorithm and information resource to the target set, it lacks the specificity to make claims about similar performance and favorability on a per-iteration basis. This trade-off calls for a more general metric that can be used to make both general and specific (per iteration) claims. For instance, in transfer learning tasks, the performance and favorability of the last pre-transfer iteration is more relevant than the overall expected per-query probability of success. The general probability of success, which we will define as a particular decomposable probability-of-success metric, is a tool through which we can make claims at such specific and relevant steps.

3 BACKGROUND

In this section, we will present definitions for the main framework that we will use throughout this paper.

3.1 The Search Framework

Montañez describes a framework which formalizes search problems in order to analyze search and learning algorithms [Montañez, 2017]. There are three components to a search problem. The first is the finite discrete search space, $\Omega$, which is the set of elements to be examined. Next is the target set, $T$, which is a nonempty subset of the search space that we are trying to find. Finally, we have an external information resource, $F$, which provides an evaluation of elements of the search space. Typically, there is a tight relationship between the target set and the external information resource, as the resource is expected to lead to or describe the target set in some way, such as the target set being the elements which meet a certain threshold under the external information resource.

Within the framework, we have an iterative algorithm which seeks to find elements of the target set, shown in Figure 1. The algorithm is a black-box that has access to a search history and produces a probability distribution over the search space. At each step, the algorithm samples over the search space using the probability distribution, evaluates that element using the information resource, adds the result to the search history, and determines the next probability distribution. The abstraction of finding the next probability distribution as a black-box algorithm allows the search framework to work with all types of search problems.

Figure 1: Black-box search algorithm. We iteratively populate the history with samples from a distribution determined by the black box at each iteration, using the history [Montañez, 2017].
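
To make the loop of Figure 1 concrete, the following is a minimal sketch of the framework in Python. This is our own toy illustration rather than code from the framework itself: the black box `uniform_black_box` and the modular information resource `f` below are stand-in assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def uniform_black_box(history, n):
    # A trivial black box: ignores the history, returns the uniform distribution.
    return np.full(n, 1.0 / n)

def search(n, target, f, next_distribution, n_steps):
    # Run the iterative black-box search loop over Omega = {0, ..., n-1}.
    history = []
    for _ in range(n_steps):
        P = next_distribution(history, n)  # black box produces a distribution
        x = rng.choice(n, p=P)             # sample the next element to query
        history.append((x, f(x)))          # evaluate it and record the result
        if x in target:
            return history, True           # an element of the target set was found
    return history, False

history, success = search(n=100, target={3, 7}, f=lambda x: x % 10,
                          next_distribution=uniform_black_box, n_steps=25)
```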

The ML-as-search framework is valuable because it provides a structure to understand and reason about different machine learning problems within the same formalism. For example, we can understand regression as a search through a space of possible regression functions, or parameter estimation as a search through possible vectors for a black-box process [Montañez, 2017]. Therefore, we can apply results about search problems to any machine learning problem we can cast into the search framework.

3.2 Expected Per-Query Probability of Success

In order to compare search algorithms, Montañez defined the expected per-query probability of success,

q(t,f) = \mathbb{E}_{\tilde{P},H}\left[\frac{1}{|\tilde{P}|}\sum_{i=1}^{|\tilde{P}|}P_{i}(w\in t)\,\bigg|\,f\right] = \overline{P}(X\in t\mid f)    (3.1)

where $\tilde{P}$ is the sequence of probability distributions generated by the black box, $H$ is the search history, and $t$ and $f$ are the target set and information resource of the search problem, respectively [Montañez, 2017]. This metric of success is particularly useful because it can be shown that $q(t,f) = \mathbf{t}^{\top}\overline{\mathbf{P}}_{f}$, where $\overline{\mathbf{P}}_{f}$ is the average of the vector representations of the probability distributions from the search algorithm at each step, conditioned on an information resource $f$.
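
As a quick numeric sketch (with toy distributions of our own choosing, not drawn from any actual search), the identity $q(t,f) = \mathbf{t}^{\top}\overline{\mathbf{P}}_{f}$ can be computed directly:

```python
import numpy as np

# Toy sequence of distributions produced by a search over |Omega| = 4 (assumed values).
P_seq = np.array([[0.25, 0.25, 0.25, 0.25],   # P_1
                  [0.10, 0.40, 0.40, 0.10],   # P_2
                  [0.05, 0.45, 0.45, 0.05]])  # P_3
P_bar = P_seq.mean(axis=0)                    # averaged distribution

t = np.array([0, 1, 1, 0])                    # target function for t = {1, 2}
q = t @ P_bar                                 # q(t,f) = t^T P_bar
print(q)                                      # ~0.733
```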

Measuring success using the expected per-query probability of success, Montañez demonstrated bounds on the success of any search algorithm [Montañez, 2017]. The Famine of Forte states that for a given algorithm, the proportion of target-information resource pairs yielding a success level above a given threshold is inversely related to the threshold. Thus, the greater the threshold for success, the fewer problems you can be successful on, regardless of the algorithm. The expected per-query probability of success can also be used to prove a version of the No Free Lunch theorems, demonstrating that all algorithms perform the same averaged over all target sets and information resources, as is done in Theorem 1 of the current manuscript.

3.3 Bias

Using the search framework, Montañez et al. defined a measure of bias between a distribution over information resources and a fixed target [Montañez et al., 2019]. For a distribution $\mathcal{D}$ over a collection of possible information resources $\mathcal{F}$, with $F \sim \mathcal{D}$, and a fixed $k$-hot target $\mathbf{t}$ ($k$-hot vectors are binary vectors of length $|\Omega|$ with exactly $k$ ones), the bias between the distribution and the target is defined as

\text{Bias}(\mathcal{D},\mathbf{t}) = \mathbb{E}_{\mathcal{D}}\left[\mathbf{t}^{\top}\overline{\mathbf{P}}_{F}\right] - \frac{k}{|\Omega|}    (3.2)
= \mathbf{t}^{\top}\mathbb{E}_{\mathcal{D}}\left[\overline{\mathbf{P}}_{F}\right] - \frac{\|\mathbf{t}\|^{2}}{|\Omega|}    (3.3)
= \mathbf{t}^{\top}\int_{\mathcal{F}}\overline{\mathbf{P}}_{f}\,\mathcal{D}(f)\,\mathrm{d}f - \frac{\|\mathbf{t}\|^{2}}{|\Omega|}.    (3.4)

Recall from above that $\overline{\mathbf{P}}_{f}$ is the averaged probability distribution over $\Omega$ from a search.

The bias term measures the performance of an algorithm in expectation (over a given distribution of information resources) compared to uniform sampling. Mathematically, this is computed by taking the difference between the expected value of the average performance of an algorithm and the performance of uniform sampling. The distribution $\mathcal{D}$ captures what information resources (e.g., datasets) one is likely to encounter.

For a non-mathematical example of the effect of bias, suppose we are searching for a parking space within a parking lot. If we randomly choose parking spaces to check, we are searching without bias. However, if we consider the location of the parking spaces, we may find that parking spaces furthest from the entrance are usually free, and could find an open parking space with higher probability. Here, the information resource telling us the distance of each parking space from the entrance, together with our belief that parking spaces further from the entrance tend to be open, creates a distribution over possible parking spaces, favoring those that are further away to be checked first.
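
As a small numeric sketch of Eq. (3.2), with toy averaged distributions of our own choosing, the bias is the expected performance over information resources minus the uniform baseline $k/|\Omega|$:

```python
import numpy as np

t = np.array([0, 0, 1, 0])                     # 1-hot target, |Omega| = 4, k = 1
P_bar = {                                      # assumed averaged distributions
    "f1": np.array([0.10, 0.10, 0.70, 0.10]),  # resource favoring the target
    "f2": np.array([0.25, 0.25, 0.25, 0.25]),  # uninformative resource
}
D = {"f1": 0.5, "f2": 0.5}                     # distribution D over resources

expected_perf = sum(D[f] * (t @ P_bar[f]) for f in D)
bias = expected_perf - (t @ t) / len(t)        # subtract k/|Omega| = ||t||^2/|Omega|
print(bias)                                    # 0.475 - 0.25 = 0.225
```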

4 PRELIMINARIES

In this section, we introduce a new property of success metrics called decomposability, which allows us to generalize concepts of success and bias. We provide a number of preliminary lemmata, with full proofs given in the Appendix.

4.1 Decomposability

We now give a formal definition for a decomposable probability-of-success metric, which will be used throughout the rest of the paper.

Definition 4.1.

A probability-of-success metric $\phi$ is decomposable if and only if there exists a $\mathbf{P}_{\phi,f}$ such that

\phi(t,f) = \mathbf{t}^{\top}\mathbf{P}_{\phi,f} = P_{\phi}(X\in t\mid f),    (4.1)

where $\mathbf{P}_{\phi,f}$ is not a function of $t$, being conditionally independent of it given $f$.

As we stated previously, what makes the expected per-query probability of success particularly useful is that it can be represented as a linear function of a probability distribution. This definition allows us to reference any probability-of-success metric having this property.

As a first example, we show that the expected per-query probability of success is a decomposable probability-of-success metric.

Lemma 4.2 (Decomposability of the Expected Per-Query Probability of Success).

The expected per-query probability of success is decomposable, namely,

q(t,f) = \mathbf{t}^{\top}\overline{\mathbf{P}}_{f}.    (4.2)

Our goal is to show that the theorems proved for the expected per-query probability of success hold for all decomposable metrics. Showing that the expected per-query probability of success is decomposable suggests that these theorems may be generalizable to any metrics sharing that property.

4.1.1 The General Probability of Success

While the expected per-query probability of success averages the probability of success over each of the queries in a search history, we may care more about a specific query in the search history, e.g., the final query of a sequence. Thus, we can generalize the expected per-query probability of success by replacing the averaging with an arbitrary distribution $\alpha$ over the probability distributions in the search history. We define the general probability of success as

q_{\alpha}(t,f) = \mathbb{E}_{\tilde{P},H}\left[\sum_{i=1}^{|\tilde{P}|}\alpha_{i}P_{i}(w\in t)\,\bigg|\,f\right] = P_{\alpha}(X\in t\mid f)    (4.3)

where $P_{\alpha}$ is a valid probability distribution on the search space $\Omega$ and $\alpha_{i}$ is the weight allocated to the $i$-th probability distribution in our sequence. This formula allows us to consider a wide variety of success metrics as instances of the general probability of success metric. For example, the expected per-query probability of success is equivalent to setting $\mathbf{P}_{\alpha,f} = \overline{\mathbf{P}}_{f}$, with $\alpha_{i} = 1/|\tilde{P}|$. Similarly, a metric of success which only cares about the final query can be represented by letting $\mathbf{P}_{\alpha,f} = \overline{\mathbf{P}}_{n,f}$, where $n$ is the length of the sequence of queries and $\overline{\mathbf{P}}_{n,f}$ is the average of the distributions at the $n$-th iteration of our search.

It should be noted that $\alpha$ within the expectation will be random, being defined over the random number of steps within $\tilde{P}$. Our operative definition of the $\alpha$ distribution, however, will allow us to generate the corresponding distribution for the needed number of steps, such as when we place all mass on the $n$-th iteration of the search. With a slight abuse of notation, we thus let $\alpha$ signify both the process by which the distribution is generated and the particular distribution produced for a given number of steps.
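
As a sketch using the same toy sequence of distributions as before (assumed values), two choices of $\alpha$ recover the expected per-query metric and a final-query metric:

```python
import numpy as np

P_seq = np.array([[0.25, 0.25, 0.25, 0.25],
                  [0.10, 0.40, 0.40, 0.10],
                  [0.05, 0.45, 0.45, 0.05]])
t = np.array([0, 1, 1, 0])

alpha_avg = np.full(3, 1 / 3)            # alpha_i = 1/|P~|: expected per-query metric
alpha_final = np.array([0.0, 0.0, 1.0])  # all mass on the final query

q_avg = t @ (alpha_avg @ P_seq)          # equals t^T P_bar_f
q_final = t @ (alpha_final @ P_seq)      # success probability at the last step
print(q_avg, q_final)                    # ~0.733, 0.9
```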

As the general probability of success provides a layer of abstraction above the expected per-query probability of success, if we prove that results about the expected per-query probability of success also hold for the general probability of success, we gain a more powerful tool set. To do so, we must first demonstrate that the general probability of success is a decomposable probability-of-success metric.

Lemma 4.3 (Decomposability of the General Probability of Success Metric).

The general probability of success is decomposable, namely,

q_{\alpha}(t,f) = \mathbf{t}^{\top}\mathbf{P}_{\alpha,f}.    (4.4)

These lemmata allow us to apply later theorems about decomposable metrics to these two useful metrics. Given a metric of interest, performing a similar proof of decomposability will allow for the application of the subsequent theorems.

Lemma 4.4 (Decomposability Closed Under Expectation).

Given a set $S = \{\phi_{i}\}$ of decomposable probability-of-success metrics and a distribution $\mathcal{D}$ over $S$, it holds that

ϕ(t,f)=𝔼𝒟[ϕ(t,f)]\phi^{\prime}(t,f)=\mathbb{E}_{\mathcal{D}}[\phi(t,f)] (4.5)

is also a decomposable probability-of-success metric.

Lemma 4.4 gives us an easy way to construct a new decomposable metric from a set of known decomposable metrics. Note that not every success metric is decomposable; we can create non-decomposable success metrics by taking non-convex combinations of decomposable probability-of-success metrics.
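
A quick numeric check of Lemma 4.4 (with toy vectors assumed for illustration): a convex combination of decomposable metrics is itself decomposable, with $\mathbf{P}_{\phi',f} = \mathbb{E}_{\mathcal{D}}[\mathbf{P}_{\phi,f}]$:

```python
import numpy as np

t = np.array([0, 1, 1, 0])
P_phi1 = np.array([0.10, 0.40, 0.40, 0.10])      # assumed P for metric phi_1
P_phi2 = np.array([0.05, 0.45, 0.45, 0.05])      # assumed P for metric phi_2
w = np.array([0.3, 0.7])                         # distribution D over {phi_1, phi_2}

lhs = w[0] * (t @ P_phi1) + w[1] * (t @ P_phi2)  # E_D[phi(t,f)]
rhs = t @ (w[0] * P_phi1 + w[1] * P_phi2)        # t^T E_D[P_{phi,f}]
assert np.isclose(lhs, rhs)                      # phi' is decomposable
```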

4.2 Generalization of Bias

Our definition of decomposability allows us to redefine bias in terms of any decomposable metric, $\phi(T,F)$. We replace $\overline{\mathbf{P}}_{F}$ with $\mathbf{P}_{\phi,F}$ and obtain

\text{Bias}(\mathcal{D},\mathbf{t}) = \mathbb{E}_{\mathcal{D}}\left[\mathbf{t}^{\top}\mathbf{P}_{\phi,F}\right] - \frac{k}{|\Omega|}    (4.6)
= \mathbf{t}^{\top}\mathbb{E}_{\mathcal{D}}\left[\mathbf{P}_{\phi,F}\right] - \frac{\|\mathbf{t}\|^{2}}{|\Omega|}    (4.7)
= \mathbf{t}^{\top}\int_{\mathcal{F}}\mathbf{P}_{\phi,f}\,\mathcal{D}(f)\,\mathrm{d}f - \frac{\|\mathbf{t}\|^{2}}{|\Omega|}.    (4.8)

Because $\phi(t,f)$ is decomposable, it is equal to $\mathbf{t}^{\top}\mathbf{P}_{\phi,f}$. This makes results about the bias particularly interesting, since they relate directly to any probability-of-success metric we create, so long as the metric is decomposable.

5 RESULTS

Montañez proved a number of results and bounds on the success of machine learning algorithms relative to the expected per-query probability of success, along with its corresponding definition of bias [Montañez, 2017, Montañez et al., 2019]. We now generalize these to apply to any decomposable probability-of-success metric, with full proofs given in the Appendix (available online).

5.1 No Free Lunch for Search

First, we prove a version of the No Free Lunch Theorems for any decomposable probability-of-success metric within the search framework.

Theorem 1 (No Free Lunch for Search and Machine Learning).

For any pair of search/learning algorithms $\mathcal{A}_{1}$, $\mathcal{A}_{2}$ operating on a discrete finite search space $\Omega$, any set of target sets $\uptau$ closed under permutation, any set of information resources $\mathcal{B}$, and any decomposable probability-of-success metric $\phi$,

\sum_{t\in\uptau}\sum_{f\in\mathcal{B}}\phi_{\mathcal{A}_{1}}(t,f) = \sum_{t\in\uptau}\sum_{f\in\mathcal{B}}\phi_{\mathcal{A}_{2}}(t,f).    (5.1)

This means that performance, in terms of our decomposable probability-of-success metric, is conserved in the sense that increased performance of one algorithm over another on some information resource-target pair comes at the cost of a loss in performance elsewhere.
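
This conservation can be checked numerically: for a permutation-closed collection of targets (below, all 2-element subsets of a 5-element $\Omega$), $\sum_{\mathbf{t}}\mathbf{t}$ is a constant vector, so the double sum agrees for any two distribution vectors standing in for two algorithms. A sketch with assumed random vectors:

```python
import numpy as np
from itertools import combinations

n, k = 5, 2
targets = [np.bincount(c, minlength=n) for c in combinations(range(n), k)]

rng = np.random.default_rng(1)
P1 = rng.dirichlet(np.ones(n))     # stand-in for P_{phi,f} of algorithm A1
P2 = rng.dirichlet(np.ones(n))     # stand-in for P_{phi,f} of algorithm A2

total1 = sum(t @ P1 for t in targets)
total2 = sum(t @ P2 for t in targets)
assert np.isclose(total1, total2)  # both equal C(n-1, k-1) = 4
```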

5.2 The Fraction of Favorable Targets

Montañez proved that for a fixed information resource, a given algorithm $\mathcal{A}$ will perform favorably relative to uniform random sampling on only a few target sets, under the expected per-query probability of success [Montañez, 2017]. We generalize this result to any decomposable probability-of-success metric and define a version of active information of expectations for decomposable metrics, $I_{\phi(t,f)} := -\log_{2}\frac{p}{\phi(t,f)}$, which transforms the ratio of success probabilities into bits, where $p = |t|/|\Omega|$ is the per-query probability of success for uniform random sampling with replacement. $I_{\phi(t,f)}$ denotes the advantage $\mathcal{A}$ has over uniform random sampling with replacement, in bits.
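
For instance, with assumed numbers, a target of size $|t| = 2$ in a search space of size $|\Omega| = 100$ gives $p = 0.02$; an algorithm achieving $\phi(t,f) = 0.16$ then has

I_{\phi(t,f)} = -\log_{2}\frac{0.02}{0.16} = 3 \text{ bits}

of advantage, and Theorem 2 below bounds the proportion of such targets by $2^{-3}$.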

Theorem 2 (The Fraction of Favorable Targets).

Let $\uptau = \{t \mid t \subseteq \Omega\}$, let $\uptau_{b} = \{t \mid \emptyset \neq t \subseteq \Omega, I_{\phi(t,f)} \geq b\}$, and let $\phi$ be a decomposable probability-of-success metric. Then for $b \geq 3$,

\frac{|\uptau_{b}|}{|\uptau|} \leq 2^{-b}.    (5.2)

Thus, the scarcity of $b$-bit favorable targets still holds for any decomposable probability-of-success metric.

5.3 The Famine of Favorable Targets

Following up on the previous result, we can show a similar bound in terms of the success of a given algorithm, for targets of a fixed size.

Theorem 3 (The Famine of Favorable Targets).

For fixed $k \in \mathbb{N}$, a fixed information resource $f$, and a decomposable probability-of-success metric $\phi$, define

\uptau = \{T \mid T \subseteq \Omega, |T| = k\}, \text{ and}
\uptau_{q_{\text{min}}} = \{T \mid T \subseteq \Omega, |T| = k, \phi(T,f) \geq q_{\text{min}}\}.

Then,

\frac{|\uptau_{q_{\text{min}}}|}{|\uptau|} \leq \frac{p}{q_{\text{min}}}    (5.3)

where $p = \frac{k}{|\Omega|}$.

Here, we compare success not against uniform sampling but against a fixed constant $q_{\text{min}}$. This theorem thus upper bounds the proportion of target sets for which the probability of success of the search is at least $q_{\text{min}}$.

5.4 Famine of Forte

We generalize the Famine of Forte [Montañez, 2017], showing a bound that holds in the $k$-sparse case using any decomposable probability-of-success metric.

Theorem 4 (The Famine of Forte).

Define

\uptau_{k} = \{T \mid T \subseteq \Omega, |T| = k \in \mathbb{N}\}

and let $\mathcal{B}_{m}$ denote any set of binary strings, such that the strings are of length $m$ or less. Let

R = \{(T,F) \mid T \in \uptau_{k}, F \in \mathcal{B}_{m}\}, \text{ and}
R_{q_{\text{min}}} = \{(T,F) \mid T \in \uptau_{k}, F \in \mathcal{B}_{m}, \phi(T,F) \geq q_{\text{min}}\},

where $\phi(T,F)$ is the decomposable probability-of-success metric for algorithm $\mathcal{A}$ on problem $(\Omega,T,F)$. Then for any $m \in \mathbb{N}$,

\frac{|R_{q_{\text{min}}}|}{|R|} \leq \frac{p}{q_{\text{min}}}.    (5.4)

This demonstrates that for any decomposable metric there is an upper bound on the proportion of problems an algorithm is successful on. Here, we measure success as being above a certain threshold with respect to a decomposable metric, and the upper bound is inversely related to this threshold.

5.5 Learning Under Dependence

While the previous theorems highlight cases where an algorithm is unlikely to succeed, we now consider the conditions that make an algorithm likely to succeed. To begin, we consider how the target and information resource can influence an algorithm’s success by generalizing the Learning Under Dependence theorem [Montañez, 2017].

Theorem 5 (Learning Under Dependence).

Define $\uptau_{k} = \{T \mid T \subseteq \Omega, |T| = k \in \mathbb{N}\}$ and let $\mathcal{B}_{m}$ denote any set of binary strings (information resources), such that the strings are of length $m$ or less. Define $q$ as the expected decomposable probability of success under the joint distribution on $T \in \uptau_{k}$ and $F \in \mathcal{B}_{m}$ for any fixed algorithm $\mathcal{A}$, such that $q := \mathbb{E}_{T,F}[\phi(T,F)]$, namely,

q = \mathbb{E}_{T,F}\left[P_{\phi}(\omega\in T\mid F)\right] = \Pr(\omega\in T;\mathcal{A}).

Then,

q \leq \frac{I(T;F) + D(P_{T}\|\mathcal{U}_{T}) + 1}{I_{\Omega}}    (5.5)

where $I_{\Omega} = -\log k/|\Omega|$, $D(P_{T}\|\mathcal{U}_{T})$ is the Kullback-Leibler divergence between the marginal distribution on $T$ and the uniform distribution on $T$, and $I(T;F)$ is the mutual information. Alternatively, we can write

\Pr(\omega\in T;\mathcal{A}) \leq \frac{H(\mathcal{U}_{T}) - H(T\mid F) + 1}{I_{\Omega}}    (5.6)

where $H(\mathcal{U}_{T}) = \log\binom{|\Omega|}{k}$.

The value of $q$ defined here represents the expected single-query probability of success of an algorithm relative to a randomly selected target and information resource, distributed according to some joint distribution. The probability of success for a single query (marginalized over information resources) is equivalent to the expectation of the conditional probability of success, conditioned on the random information resource. Upper bounding this value shows that, regardless of the choice of decomposable probability-of-success metric, the probability of success depends on the amount of information regarding the target contained within the information resource, as measured by the mutual information.
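
As a numeric sketch of the bound in Eq. (5.5), with all quantities assumed purely for illustration:

```python
import numpy as np

omega, k = 1024, 1                # |Omega| and target size (assumed)
I_TF = 4.0                        # mutual information I(T;F) in bits (assumed)
D_kl = 0.0                        # D(P_T || U_T) = 0 if T is marginally uniform
I_omega = -np.log2(k / omega)     # I_Omega = -log(k/|Omega|) = 10 bits

bound = (I_TF + D_kl + 1) / I_omega
print(bound)                      # 0.5: q can be no better than 1/2
```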

5.6 Famine of Favorable Information Resources

We now demonstrate the effect of the general bias term defined earlier on the probability of success of an algorithm. We begin with a generalization of the Famine of Favorable Information Resources [Montañez et al., 2019].

Theorem 6 (Famine of Favorable Information Resources).

Let $\mathcal{B}$ be a finite set of information resources and let $t \subseteq \Omega$ be an arbitrary fixed $k$-size target set with corresponding target function $\mathbf{t}$. Define

\mathcal{B}_{q_{\mathrm{min}}} = \{f \mid f \in \mathcal{B}, \phi(t,f) \geq q_{\mathrm{min}}\},

where $\phi(t,f)$ is an arbitrary decomposable probability-of-success metric for algorithm $\mathcal{A}$ on search problem $(\Omega,t,f)$ and $q_{\mathrm{min}} \in (0,1]$ represents the minimally acceptable probability of success. Then,

\frac{|\mathcal{B}_{q_{\mathrm{min}}}|}{|\mathcal{B}|} \leq \frac{p + \mathrm{Bias}(\mathcal{B},\mathbf{t})}{q_{\mathrm{min}}}    (5.7)

where $p = \frac{k}{|\Omega|}$.

This result demonstrates the mathematical effect of bias, of which we have previously provided one hypothetical example (car parking). Now, we can show that the bias of our expected information resources towards the target upper bounds the probability of a given information resource leading to a successful search.

5.7 Futility of Bias-Free Search

We can also use our definition of bias to generalize the Futility of Bias-Free Search [Montañez, 2017], which demonstrates the inability of an algorithm to perform better than uniform random sampling without bias, defined with respect to the expected per-query probability of success. Our generalization proves that the theorem holds for bias defined with respect to any decomposable probability-of-success metric.

Theorem 7 (Futility of Bias-Free Search).

For any fixed algorithm $\mathcal{A}$, fixed target $t \subseteq \Omega$ with corresponding target function $\mathbf{t}$, and distribution over information resources $\mathcal{D}$, if $\mathrm{Bias}(\mathcal{D},\mathbf{t}) = 0$, then

\Pr(\omega\in t;\mathcal{A}) = p    (5.8)

where $\Pr(\omega\in t;\mathcal{A})$ represents the expected decomposable probability of successfully sampling an element of $t$ using $\mathcal{A}$, marginalized over information resources $F \sim \mathcal{D}$, and $p$ is the single-query probability of success under uniform random sampling.

This result demonstrates that, regardless of how we measure the success of an algorithm with respect to a decomposable metric, it cannot perform better than uniform random sampling without bias.
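
A small numeric sketch of Theorem 7 (toy vectors assumed): when the distributions $\mathbf{P}_{\phi,f}$ average to uniform under $\mathcal{D}$, the bias is zero and the marginalized success probability collapses to $p = k/|\Omega|$:

```python
import numpy as np

t = np.array([1, 0, 0, 0])                  # 1-hot target, p = k/|Omega| = 0.25
P_f1 = np.array([0.40, 0.20, 0.20, 0.20])   # assumed P_{phi,f1}
P_f2 = np.array([0.10, 0.30, 0.30, 0.30])   # assumed P_{phi,f2}
D = np.array([0.5, 0.5])                    # distribution D over {f1, f2}

expected_P = D[0] * P_f1 + D[1] * P_f2      # E_D[P_{phi,F}] is uniform here
assert np.allclose(expected_P, 0.25)        # zero bias
print(t @ expected_P)                       # 0.25 == p, as Theorem 7 states
```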

5.8 Famine of Favorable Biasing Distributions

Montañez proved that the percentage of minimally favorable distributions (biased over some threshold towards some specific target) is inversely proportional to the threshold value and directly proportional to the bias between the information resource and target function [Montañez, 2017]. We will show that this scarcity of favorable biasing distributions holds, in general, for bias under any decomposable probability-of-success metric.

Theorem 8 (Famine of Favorable Biasing Distributions).

Given a fixed target function $\mathbf{t}$, a finite set of information resources $\mathcal{B}$, a distribution over information resources $\mathcal{D}$, and the set $\mathcal{P} = \{\mathcal{D} \mid \mathcal{D} \in \mathbb{R}^{|\mathcal{B}|}, \sum_{f\in\mathcal{B}}\mathcal{D}(f) = 1\}$ of all discrete $|\mathcal{B}|$-dimensional simplex vectors,

\frac{\mu(\mathcal{G}_{\mathbf{t},q_{\mathrm{min}}})}{\mu(\mathcal{P})} \leq \frac{p + \mathrm{Bias}(\mathcal{B},\mathbf{t})}{q_{\mathrm{min}}}    (5.9)

where $\mathcal{G}_{\mathbf{t},q_{\mathrm{min}}} = \{\mathcal{D} \mid \mathcal{D} \in \mathcal{P}, \mathrm{Bias}(\mathcal{D},\mathbf{t}) \geq q_{\mathrm{min}}\}$, $p = \frac{k}{|\Omega|}$, and $\mu$ is the Lebesgue measure.

This result shows that the more bias there is between our set of information resources \mathcal{B} and the target function 𝐭\mathbf{t}, the easier it is to find a minimally favorable distribution, and the higher the threshold for what qualifies as a minimally favorable distribution, the harder our search becomes. Thus, unless we want to suppose that we begin with a set of information resources already favorable towards our fixed target, finding a highly favorable distribution is difficult.

6 CONCLUSION

Casting machine learning problems as search provides a common formalism within which to prove bounds and impossibility results for a wide variety of learning algorithms and tasks. In this paper, we introduce a property of probability-of-success metrics called decomposability, and show that the expected per-query probability of success and the general probability of success are decomposable. To demonstrate the value of this property, we prove that a number of existing algorithmic search framework results continue to hold for all decomposable probability-of-success metrics. These results provide a number of useful insights: algorithmic performance is conserved with respect to all decomposable probability-of-success metrics, favorable targets are scarce no matter the decomposable probability-of-success metric, and without the generalized bias defined here, an algorithm will not perform better than uniform random sampling.

The goal of this work is to offer additional machinery within the search framework, allowing for more general application. Concretely, we can develop decomposable probability-of-success metrics for problems concerned with the state of an algorithm at specific steps, and leverage existing results as a foundation for additional insight into those problems.

ACKNOWLEDGEMENTS

This work was supported by the Walter Bradley Center for Natural and Artificial Intelligence. We thank Dr. Robert J. Marks II (Baylor University) for providing support and feedback. We also thank Harvey Mudd College’s Department of Computer Science for their continued resources and support.

REFERENCES

  • Fano and Hawkins, 1961 Fano, R. M. and Hawkins, D. (1961). Transmission of information: A statistical theory of communications. American Journal of Physics, 29(11):793–794.
  • Gülçehre and Bengio, 2016 Gülçehre, Ç. and Bengio, Y. (2016). Knowledge matters: Importance of prior information for optimization. The Journal of Machine Learning Research, 17(1):226–257.
  • Kumagai and Kanamori, 2019 Kumagai, W. and Kanamori, T. (2019). Risk bound of transfer learning using parametric feature mapping and its application to sparse coding. Machine Learning, 108:1975–2008.
  • Mitchell, 1980 Mitchell, T. M. (1980). The need for biases in learning generalizations. Technical report, Computer Science Department, Rutgers University, New Brunswick, NJ.
  • Mitchell, 1982 Mitchell, T. M. (1982). Generalization as Search. Artificial intelligence, 18(2):203–226.
  • Montañez, 2017 Montañez, G. D. (2017). The Famine of Forte: Few Search Problems Greatly Favor Your Algorithm. In Systems, Man, and Cybernetics (SMC), 2017 IEEE International Conference on, pages 477–482. IEEE.
  • Montañez, 2017 Montañez, G. D. (2017). Why Machine Learning Works. PhD thesis, Carnegie Mellon University.
  • Montañez et al., 2019 Montañez, G. D., Hayase, J., Lauw, J., Macias, D., Trikha, A., and Vendemiatti, J. (2019). The Futility of Bias-Free Learning and Search. CoRR, abs/1907.06010.
  • Sauer, 1972 Sauer, N. (1972). On the density of families of sets. Journal of Combinatorial Theory, Series A, 13(1):145–147.
  • Schaffer, 1994 Schaffer, C. (1994). A Conservation Law for Generalization Performance. Machine Learning Proceedings 1994, 1:259–265.
  • Sterkenburg, 2019 Sterkenburg, T. F. (2019). Putnam’s Diagonal Argument and the Impossibility of a Universal Learning Machine. Erkenntnis, 84(3):633–656.
  • Valiant, 1984 Valiant, L. (1984). A Theory of the Learnable. Communications of the ACM, 27:1134–1142.
  • Wolpert and Macready, 1997 Wolpert, D. H. and Macready, W. G. (1997). No free lunch theorems for optimization. IEEE Trans. Evolutionary Computation, 1:67–82.

7 Appendix

Lemma 7.1, Lemma 7.2, Lemma 7.3, and Lemma 7.4 with their proofs are taken from [Montañez, 2017], with Lemma 7.4 being adapted for decomposable probability-of-success metrics.

7.1 Lemmata

Lemma 7.1 (Sauer-Shelah Inequality).

For $d \leq n$, $\sum_{j=0}^{d}\binom{n}{j} \leq \left(\frac{en}{d}\right)^{d}$.

Proof.

We reproduce a simple proof of the Sauer-Shelah inequality [Sauer, 1972] for completeness.

\sum_{j=0}^{d}\binom{n}{j} \leq \left(\frac{n}{d}\right)^{d}\sum_{j=0}^{d}\binom{n}{j}\left(\frac{d}{n}\right)^{j}
\leq \left(\frac{n}{d}\right)^{d}\sum_{j=0}^{n}\binom{n}{j}\left(\frac{d}{n}\right)^{j}
= \left(\frac{n}{d}\right)^{d}\left(1+\frac{d}{n}\right)^{n}
\leq \left(\frac{n}{d}\right)^{d}\lim_{n\rightarrow\infty}\left(1+\frac{d}{n}\right)^{n}
= \left(\frac{en}{d}\right)^{d}. ∎

Lemma 7.2 (Binomial Approximation).

$\sum_{j=0}^{\lfloor n/2^{b}\rfloor}\binom{n}{j} \leq 2^{n-b}$ for $b \geq 3$ and $n \geq 2^{b}$.

Proof.

By the condition $b \geq 3$, we have

2b + \log_{2}e \leq 2^{b},

which implies $2^{-b}(2b + \log_{2}e) \leq 1$. Therefore,

1 \geq 2^{-b}(2b+\log_{2}e)
= \frac{b}{2^{b}} + \frac{b+\log_{2}e}{2^{b}}
\geq \frac{b}{n} + \frac{b+\log_{2}e}{2^{b}},

using the condition $n \geq 2^{b}$, which implies

n \geq b + \frac{n}{2^{b}}(b+\log_{2}e).

Thus,

2^{n} \geq 2^{b+\frac{n}{2^{b}}(b+\log_{2}e)}
= 2^{b}\,2^{\frac{n}{2^{b}}(b+\log_{2}e)}
= 2^{b}\left(2^{b}2^{\log_{2}e}\right)^{\frac{n}{2^{b}}}
= 2^{b}\left(2^{b}e\right)^{\frac{n}{2^{b}}}
= 2^{b}\left(\frac{en}{n/2^{b}}\right)^{\frac{n}{2^{b}}}
\geq 2^{b}\sum_{j=0}^{n/2^{b}}\binom{n}{j}
\geq 2^{b}\sum_{j=0}^{\lfloor n/2^{b}\rfloor}\binom{n}{j},

where the penultimate inequality follows from the Sauer-Shelah inequality [Sauer, 1972]. Dividing through by $2^{b}$ gives the desired result. ∎

Lemma 7.3 (Maximum Number of Satisfying Vectors).

Given an integer $1 \leq k \leq n$, a set $\mathcal{S} = \{\mathbf{s} : \mathbf{s} \in \{0,1\}^{n}, \|\mathbf{s}\| = \sqrt{k}\}$ of all $n$-length $k$-hot binary vectors, a set $\mathcal{P} = \{\mathbf{P} : \mathbf{P} \in \mathbb{R}^{n}, \sum_{j}\mathbf{P}_{j} = 1\}$ of discrete $n$-dimensional simplex vectors, and a fixed scalar threshold $\epsilon \in [0,1]$, for any fixed $\mathbf{P} \in \mathcal{P}$,

\sum_{\mathbf{s}\in\mathcal{S}}\mathds{1}_{\mathbf{s}^{\top}\mathbf{P}\geq\epsilon} \leq \frac{1}{\epsilon}\binom{n-1}{k-1}

where $\mathbf{s}^{\top}\mathbf{P}$ denotes the vector dot product between $\mathbf{s}$ and $\mathbf{P}$.

Proof.

For $\epsilon = 0$, the bound holds trivially. For $\epsilon > 0$, let $\mathbf{S}$ be a random quantity that takes values $\mathbf{s}$ uniformly in the set $\mathcal{S}$. Then, for any fixed $\mathbf{P} \in \mathcal{P}$,

\sum_{\mathbf{s}\in\mathcal{S}}\mathds{1}_{\mathbf{s}^{\top}\mathbf{P}\geq\epsilon} = \binom{n}{k}\mathbb{E}\left[\mathds{1}_{\mathbf{S}^{\top}\mathbf{P}\geq\epsilon}\right]
= \binom{n}{k}\Pr\left(\mathbf{S}^{\top}\mathbf{P}\geq\epsilon\right).

Let $\mathbf{1}$ denote the all-ones vector. Under a uniform distribution on the random quantity $\mathbf{S}$, and because $\mathbf{P}$ does not change with respect to $\mathbf{s}$, we have

\mathbb{E}\left[\mathbf{S}^{\top}\mathbf{P}\right] = \binom{n}{k}^{-1}\sum_{\mathbf{s}\in\mathcal{S}}\mathbf{s}^{\top}\mathbf{P}
= \mathbf{P}^{\top}\binom{n}{k}^{-1}\sum_{\mathbf{s}\in\mathcal{S}}\mathbf{s}
= \mathbf{P}^{\top}\frac{\mathbf{1}\binom{n-1}{k-1}}{\binom{n}{k}}
= \mathbf{P}^{\top}\frac{\mathbf{1}\binom{n-1}{k-1}}{\frac{n}{k}\binom{n-1}{k-1}}
= \frac{k}{n}\mathbf{P}^{\top}\mathbf{1}
= \frac{k}{n},

since $\mathbf{P}$ must sum to $1$.

Noting that $\mathbf{S}^{\top}\mathbf{P} \geq 0$, we use Markov's inequality to obtain

\sum_{\mathbf{s}\in\mathcal{S}}\mathds{1}_{\mathbf{s}^{\top}\mathbf{P}\geq\epsilon} = \binom{n}{k}\Pr\left(\mathbf{S}^{\top}\mathbf{P}\geq\epsilon\right)
\leq \binom{n}{k}\frac{1}{\epsilon}\mathbb{E}\left[\mathbf{S}^{\top}\mathbf{P}\right]
= \binom{n}{k}\frac{1}{\epsilon}\frac{k}{n}
= \frac{1}{\epsilon}\frac{k}{n}\cdot\frac{n}{k}\binom{n-1}{k-1}
= \frac{1}{\epsilon}\binom{n-1}{k-1}. ∎

Lemma 7.4.

If $X \perp T \mid F$, then

\Pr(X\in T;\mathcal{A}) = \mathbb{E}_{T,F}[\phi(T,F)].
Proof.

$\Pr(X\in T;\mathcal{A})$ is the probability that random variable $X$ is in target $T$ over all values of $F$, for random $T$ and for $X$ drawn from $P_{\phi}(X\mid F)$. Then,

\Pr(X\in T;\mathcal{A}) = \mathbb{E}_{T,X}[\mathds{1}_{X\in T}\mid\mathcal{A}]
= \mathbb{E}_{T}[\mathbb{E}_{X}[\mathds{1}_{X\in T}\mid T,\mathcal{A}]]
= \mathbb{E}_{T}[\mathbb{E}_{F}[\mathbb{E}_{X}[\mathds{1}_{X\in T}\mid T,F,\mathcal{A}]\mid T]]
= \mathbb{E}_{T}[\mathbb{E}_{F}[\mathbb{E}_{X}[\mathds{1}_{X\in T}\mid F,\mathcal{A}]\mid T]]
= \mathbb{E}_{T,F}[\mathbb{E}_{X}[\mathds{1}_{X\in T}\mid F,\mathcal{A}]]
= \mathbb{E}_{T,F}[P_{\phi}(X\in T\mid F)]
= \mathbb{E}_{T,F}[\phi(T,F)],

where the third equality makes use of the law of iterated expectation, the fourth follows from the conditional independence assumption, and the final equality follows from the definition of decomposability. ∎

Lemma 4.2 (restated).

Proof.

By definition,

q(t,f) = \overline{P}(X\in t\mid f) = \mathbf{t}^{\top}\overline{\mathbf{P}}_{f}. ∎

Lemma 4.3 (restated).

Proof.

Observe that $P_{\alpha}(X\in t\mid f) = \mathbf{t}^{\top}\mathbf{P}_{\alpha,f}$. Since $P(x)$ is a probability distribution,

P_{\alpha}(X\in t\mid f) = \sum_{x}\mathds{1}_{x\in t}\int\mathbb{E}_{P\sim P_{\alpha}}[P(x)]\,\mathrm{d}\nu(\tilde{P},h\mid f)
= \sum_{x}\mathds{1}_{x\in t}\int\sum_{i=1}^{|\tilde{P}|}\alpha_{i}P_{i}(x)\,\mathrm{d}\nu(\tilde{P},h\mid f)
= \int\left[\sum_{i=1}^{|\tilde{P}|}\alpha_{i}\left(\sum_{x}\mathds{1}_{x\in t}P_{i}(x)\right)\right]\mathrm{d}\nu(\tilde{P},h\mid f)
= \mathbb{E}_{\tilde{P},H}\left[\sum_{i=1}^{|\tilde{P}|}\alpha_{i}P_{i}(x\in t)\,\bigg|\,f\right]
= q_{\alpha}(t,f). ∎

Lemma 4.4 (restated).

Proof.
ϕ(t,f)=𝔼𝒟[ϕ(t,f)]\displaystyle\phi^{\prime}(t,f)=\mathbb{E}_{\mathcal{D}}[\phi(t,f)] =iϕi(t,f)𝒟(ϕi)\displaystyle=\sum_{i}\phi_{i}(t,f)\mathcal{D}(\phi_{i})
=i(tPϕi,f)𝒟(ϕi)\displaystyle=\sum_{i}\left(\textbf{t}^{\top}\textbf{P}_{\phi_{i},f}\right)\mathcal{D}(\phi_{i})
=tiPϕi,f𝒟(ϕi)\displaystyle=\textbf{t}^{\top}\sum_{i}\textbf{P}_{\phi_{i},f}\mathcal{D}(\phi_{i})
=t𝔼𝒟[Pϕ,f]\displaystyle=\textbf{t}^{\top}\mathbb{E}_{\mathcal{D}}[\textbf{P}_{\phi,f}]
=tP¯ϕ,f.\displaystyle=\textbf{t}^{\top}\overline{\textbf{P}}_{\phi,f}.

7.2 Proofs of Theorems

Theorem 1 (restated).

Proof.

Note that the closed under permutation condition implies $\sum_{\mathbf{t}\in\uptau}\mathbf{t} = [c, c, \ldots, c] = \mathbf{1}\cdot c$ for some constant $c$.

\sum_{t\in\uptau}\sum_{f\in\mathcal{B}}\phi_{\mathcal{A}_{1}}(t,f) = \sum_{\mathbf{t}\in\uptau}\sum_{f\in\mathcal{B}}\mathbf{P}_{\phi,f,\mathcal{A}_{1}}^{\top}\mathbf{t}
= \sum_{f\in\mathcal{B}}\mathbf{P}_{\phi,f,\mathcal{A}_{1}}^{\top}\sum_{\mathbf{t}\in\uptau}\mathbf{t}
= \sum_{f\in\mathcal{B}}\mathbf{P}_{\phi,f,\mathcal{A}_{1}}^{\top}\mathbf{1}\cdot c
= c\sum_{f\in\mathcal{B}}\mathbf{P}_{\phi,f,\mathcal{A}_{1}}^{\top}\mathbf{1}
= c\sum_{f\in\mathcal{B}}1
= c\sum_{f\in\mathcal{B}}\mathbf{P}_{\phi,f,\mathcal{A}_{2}}^{\top}\mathbf{1}
= \sum_{\mathbf{t}\in\uptau}\sum_{f\in\mathcal{B}}\mathbf{P}_{\phi,f,\mathcal{A}_{2}}^{\top}\mathbf{t}
= \sum_{t\in\uptau}\sum_{f\in\mathcal{B}}\phi_{\mathcal{A}_{2}}(t,f)

where the first and final equalities follow from the definition of decomposability. ∎

Theorem 2 (restated).

Proof.

First, by the definition of active information of expectations, $I_{\phi(T,F)} \geq b$ implies $|T| \leq \frac{|\Omega|}{2^{b}}$, since $\phi(T,F) \leq 1$. Thus,

\uptau_{b} \subseteq \uptau'_{b} = \left\{T \mid T\subseteq\Omega, 1\leq|T|\leq\frac{|\Omega|}{2^{b}}\right\}.    (7.1)

For $|\Omega| < 2^{b}$ and $I_{\phi(T,F)} \geq b$, we have $|T| < 1$ for all elements of $\uptau'_{b}$ (making the set empty), and the theorem follows immediately. Thus, $|\Omega| \geq 2^{b}$ for the remainder.

By Lemma 7.2, we have

\frac{|\uptau_{b}|}{|\uptau|} \leq \frac{|\uptau'_{b}|}{|\uptau|}    (7.2)
= 2^{-|\Omega|}\sum_{k=0}^{\left\lfloor\frac{|\Omega|}{2^{b}}\right\rfloor}\binom{|\Omega|}{k}    (7.3)
\leq 2^{-|\Omega|}\,2^{|\Omega|-b}    (7.4)
= 2^{-b}.    (7.5) ∎

Theorem 3 (restated).

Proof.

Let $\mathcal{S} = \{\mathbf{s} : \mathbf{s} \in \{0,1\}^{|\Omega|}, \|\mathbf{s}\| = \sqrt{k}\}$. For brevity, we will allow $\mathbf{s}$ to also denote its corresponding target set, letting the context make clear whether the target set or target function is meant. Then,

\frac{|\uptau_{q_{\text{min}}}|}{|\uptau|} = \frac{\sum_{\mathbf{s}\in\mathcal{S}}\mathds{1}_{\phi(\mathbf{s},f)\geq q_{\text{min}}}}{\binom{|\Omega|}{k}}
= \binom{|\Omega|}{k}^{-1}\sum_{\mathbf{s}\in\mathcal{S}}\mathds{1}_{\phi(\mathbf{s},f)\geq q_{\text{min}}}
= \mathbb{E}_{\mathcal{U}[\mathcal{S}]}\left[\mathds{1}_{\phi(\mathbf{S},f)\geq q_{\text{min}}}\right]
= \Pr(\phi(\mathbf{S},f)\geq q_{\text{min}})
\leq \frac{\mathbb{E}_{\mathcal{U}[\mathcal{S}]}[\phi(\mathbf{S},f)]}{q_{\text{min}}},

where the final step follows from Markov's inequality. By the decomposability of $\phi$ and linearity of expectation, we have

\frac{\mathbb{E}_{\mathcal{U}[\mathcal{S}]}[\phi(\mathbf{S},f)]}{q_{\text{min}}} = \frac{\mathbb{E}_{\mathcal{U}[\mathcal{S}]}[\mathbf{S}^{\top}\mathbf{P}_{\phi,f}]}{q_{\text{min}}}
= \frac{\mathbf{P}_{\phi,f}^{\top}\mathbb{E}_{\mathcal{U}[\mathcal{S}]}[\mathbf{S}]}{q_{\text{min}}}
= \frac{\mathbf{P}_{\phi,f}^{\top}\mathbf{1}\left[\binom{|\Omega|}{k}^{-1}\binom{|\Omega|-1}{k-1}\right]}{q_{\text{min}}}
= \frac{k}{|\Omega|}\cdot\frac{\mathbf{P}_{\phi,f}^{\top}\mathbf{1}}{q_{\text{min}}}
= \frac{p}{q_{\text{min}}}. ∎

Theorem 4 (restated).

Proof.

We begin by defining the set $\mathcal{S} = \{\mathbf{s} : \mathbf{s} \in \{0,1\}^{|\Omega|}, \|\mathbf{s}\| = \sqrt{k}\}$ of all $|\Omega|$-length target functions with exactly $k$ ones. As in Theorem 3, we again allow $\mathbf{s}$ to also denote its corresponding target set. For each of these $\mathbf{s}$, we have $|\mathcal{B}_{m}|$ information resources. The total number of search problems is therefore

\binom{|\Omega|}{k}|\mathcal{B}_{m}|.    (7.6)

We seek to bound the proportion of possible search problems for which $\phi(\mathbf{s},f) \geq q_{\text{min}}$, for any threshold $q_{\text{min}} \in (0,1]$. Thus,

\frac{|R_{q_{\text{min}}}|}{|R|} \leq \frac{|\mathcal{B}_{m}|\,\sup_{f}\left[\sum_{\mathbf{s}\in\mathcal{S}}\mathds{1}_{\phi(\mathbf{s},f)\geq q_{\text{min}}}\right]}{|\mathcal{B}_{m}|\binom{|\Omega|}{k}}    (7.7)
= \binom{|\Omega|}{k}^{-1}\sum_{\mathbf{s}\in\mathcal{S}}\mathds{1}_{\phi(\mathbf{s},f^{*})\geq q_{\text{min}}}    (7.8)

where $f^{*} \in \mathcal{B}_{m}$ denotes the arg sup of the expression. Therefore,

\frac{|R_{q_{\text{min}}}|}{|R|} \leq \binom{|\Omega|}{k}^{-1}\sum_{\mathbf{s}\in\mathcal{S}}\mathds{1}_{\phi(\mathbf{s},f^{*})\geq q_{\text{min}}}
= \binom{|\Omega|}{k}^{-1}\sum_{\mathbf{s}\in\mathcal{S}}\mathds{1}_{\mathbf{s}^{\top}\mathbf{P}_{\phi,f^{*}}\geq q_{\text{min}}}

where the equality follows from the decomposability of $\phi(\mathbf{s},f^{*})$, and $\mathbf{P}_{\phi,f^{*}}$ represents the $|\Omega|$-length probability vector defined by $P_{\phi}(\cdot\mid f^{*})$. By Lemma 7.3, we have

\binom{|\Omega|}{k}^{-1}\sum_{\mathbf{s}\in\mathcal{S}}\mathds{1}_{\mathbf{s}^{\top}\mathbf{P}_{\phi,f^{*}}\geq q_{\text{min}}} \leq \binom{|\Omega|}{k}^{-1}\left[\frac{1}{q_{\text{min}}}\binom{|\Omega|-1}{k-1}\right]
= \frac{k}{|\Omega|}\cdot\frac{1}{q_{\text{min}}}
= p/q_{\text{min}},    (7.9)

proving the result for finite information resources. ∎

Theorem 5 (restated).

Proof.

This proof loosely follows that of Fano's Inequality [Fano and Hawkins, 1961], being a reversed generalization of it. Let $Z = \mathds{1}(X \in T)$. Using the chain rule for entropy to expand $H(Z,T\mid X)$ in two different ways, we get

H(Z,T\mid X) = H(Z\mid T,X) + H(T\mid X)
= H(T\mid Z,X) + H(Z\mid X).

By definition, $H(Z\mid T,X) = 0$, and by the data processing inequality, $H(T\mid F) \leq H(T\mid X)$. Thus,

H(T\mid F) \leq H(T\mid Z,X) + H(Z\mid X).

Define $P_{g} = \Pr(X\in T;\mathcal{A}) = P(Z=1)$. Then,

H(T\mid Z,X) = (1-P_{g})H(T\mid Z=0,X) + P_{g}H(T\mid Z=1,X)
\leq (1-P_{g})\log\binom{|\Omega|}{k} + P_{g}\log\binom{|\Omega|-1}{k-1}
= \log\binom{|\Omega|}{k} - P_{g}\log\frac{|\Omega|}{k}.

We let $H(\mathcal{U}_{T}) = \log\binom{|\Omega|}{k}$, being the entropy of the uniform distribution over $k$-sparse target sets in $\Omega$. Therefore,

H(T\mid F) \leq H(\mathcal{U}_{T}) - P_{g}\log\frac{|\Omega|}{k} + H(Z\mid X).

Using the definitions of conditional entropy and $I_{\Omega}$, we get

H(T) - I(T;F) \leq H(\mathcal{U}_{T}) - P_{g}I_{\Omega} + H(Z\mid X),

which implies

P_{g}I_{\Omega} \leq I(T;F) + H(\mathcal{U}_{T}) - H(T) + H(Z\mid X)
= I(T;F) + D(P_{T}\|\mathcal{U}_{T}) + H(Z\mid X).

Examining $H(Z\mid X)$, we see it captures how much of the entropy of $Z$ is due to the randomness of $T$. We upper bound this by its maximum value of $1$ and obtain

\Pr(X\in T;\mathcal{A}) \leq \frac{I(T;F) + D(P_{T}\|\mathcal{U}_{T}) + 1}{I_{\Omega}},

and substitute $q$ for $\Pr(X\in T;\mathcal{A})$ to obtain the first result, noting that $q = \mathbb{E}_{T,F}\left[P_{\phi}(\omega\in T\mid F)\right]$ specifies a proper probability distribution by the linearity and boundedness of the expectation. To obtain the second form, use the definitions $I(T;F) = H(T) - H(T\mid F)$ and $D(P_{T}\|\mathcal{U}_{T}) = H(\mathcal{U}_{T}) - H(T)$. ∎

Theorem 6 (restated).

Proof.

We seek to bound the proportion of information resources for which $\phi(t,f) \geq q_{\mathrm{min}}$, for any threshold $q_{\mathrm{min}} \in (0,1]$. Let $F \sim \mathcal{U}[\mathcal{B}]$. Then,

\frac{|\mathcal{B}_{q_{\mathrm{min}}}|}{|\mathcal{B}|} = \frac{1}{|\mathcal{B}|}\sum_{f\in\mathcal{B}}\mathds{1}_{\phi(t,f)\geq q_{\mathrm{min}}}
= \mathbb{E}_{\mathcal{U}[\mathcal{B}]}\left[\mathds{1}_{\phi(t,F)\geq q_{\mathrm{min}}}\right]
= \Pr(\phi(t,F)\geq q_{\mathrm{min}}).

By decomposability, we have

\frac{|\mathcal{B}_{q_{\mathrm{min}}}|}{|\mathcal{B}|} = \Pr(\mathbf{t}^{\top}\mathbf{P}_{\phi,F}\geq q_{\mathrm{min}}).

Applying Markov's inequality and the definition of $\mathrm{Bias}(\mathcal{B},\mathbf{t})$, we obtain

\frac{|\mathcal{B}_{q_{\mathrm{min}}}|}{|\mathcal{B}|} \leq \frac{\mathbb{E}_{\mathcal{U}[\mathcal{B}]}[\mathbf{t}^{\top}\mathbf{P}_{\phi,F}]}{q_{\mathrm{min}}}
= \frac{p + \mathrm{Bias}(\mathcal{B},\mathbf{t})}{q_{\mathrm{min}}}. ∎

Theorem 7 (restated).

Proof.

Let $\mathcal{F}$ be the space of possible information resources. Then,

\Pr(\omega\in t;\mathcal{A}) = \int_{\mathcal{F}}\Pr(\omega\in t, f;\mathcal{A})\,\mathrm{d}f
= \int_{\mathcal{F}}\Pr(\omega\in t\mid f;\mathcal{A})\Pr(f)\,\mathrm{d}f.

Since we are considering the per-query probability of success for algorithm $\mathcal{A}$ on $t$ using information resource $f$, we have

\Pr(\omega\in t\mid f;\mathcal{A}) = P_{\phi}(\omega\in t\mid f).

Also note that $\Pr(f) = \mathcal{D}(f)$ by the fact that $F \sim \mathcal{D}$. Making these substitutions, we obtain

\Pr(\omega\in t;\mathcal{A}) = \int_{\mathcal{F}}P_{\phi}(\omega\in t\mid f)\,\mathcal{D}(f)\,\mathrm{d}f
= \mathbb{E}_{\mathcal{D}}\left[P_{\phi}(\omega\in t\mid F)\right]
= \mathbb{E}_{\mathcal{D}}\left[\mathbf{t}^{\top}\mathbf{P}_{\phi,F}\right]
= \mathrm{Bias}(\mathcal{D},\mathbf{t}) + p
= p. ∎

Theorem 8 (restated).

Proof.

This result follows from the proof of the Famine of Favorable Biasing Distributions in [Montañez et al., 2019], substituting the generalized form of bias. No other changes to the proof are needed. ∎