NESTED BANDITS

Matthieu Martin^∗ ^∗ Criteo AI Lab. [email protected] , Panayotis Mertikopoulos^⋄,∗ ^⋄ Univ. Grenoble Alpes, CNRS, Inria, Grenoble INP, LIG, 38000 Grenoble, France. [email protected] ,
Thibaud Rahier^∗ [email protected] and Houssam Zenati^∗,§ ^§ Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP, LJK, 38000 Grenoble, France. [email protected]

Abstract.

In many online decision processes, the optimizing agent is called to choose between large numbers of alternatives with many inherent similarities; in turn, these similarities imply closely correlated losses that may confound standard discrete choice models and bandit algorithms. We study this question in the context of nested bandits, a class of adversarial multi-armed bandit problems where the learner seeks to minimize their regret in the presence of a large number of distinct alternatives with a hierarchy of embedded (non-combinatorial) similarities. In this setting, optimal algorithms based on the exponential weights blueprint (like Hedge, EXP3, and their variants) may incur significant regret because they tend to spend excessive amounts of time exploring irrelevant alternatives with similar, suboptimal costs. To account for this, we propose a nested exponential weights (NEW) algorithm that performs a layered exploration of the learner’s set of alternatives based on a nested, step-by-step selection method. In so doing, we obtain a series of tight bounds for the learner’s regret showing that online learning problems with a high degree of similarity between alternatives can be resolved efficiently, without a red bus / blue bus paradox occurring.

Key words and phrases:

Online learning; nested logit choice; similarity structures; multi-armed bandits.

2020 Mathematics Subject Classification:

Primary 68Q32; secondary 91B06.

Authors appear in alphabetical order.

1. Introduction

Consider the following discrete choice problem (known as the “red bus / blue bus paradox” in the context of transportation economics). A commuter has a choice between taking a car or bus to work: commuting by car takes on average half an hour modulo random fluctuations, whereas commuting by bus takes an hour, again modulo random fluctuations (it’s a long commute). Then, under the classical multinomial logit choice model for action selection [19, 20], the commuter’s odds for selecting a car over a bus would be $\exp(-1/2)/\exp(-1)\approx 1.6:1$ . This indicates a very clear preference for taking a car to work and is commensurate with the fact that, on average, commuting by bus takes twice as long.

Consider now the same model but with a twist. The company operating the bus network purchases a fleet of new buses that are otherwise completely identical to the existing ones, except for their color: old buses are red, the new buses are blue. This change has absolutely no effect on the travel time of the bus; however, since the new set of alternatives presented to the commuter is $\{\textrm{car},\textrm{red bus},\textrm{blue bus}\}$ , the odds of selecting a car over a bus (red or blue, it doesn’t matter) now drops to $\exp(-1/2)/[\exp(-1)+\exp(-1)]\approx 0.8:1$ . Thus, by introducing an irrelevant feature (the color of the bus), the odds of selecting the alternative with the highest utility have dropped dramatically, to the extent that commuting by car is no longer the most probable outcome in this example.

Of course, the shift in choice probabilities may not always be that dramatic, but the point of this example is that the presence of an irrelevant alternative (the blue bus) would always induce such a shift – which is, of course, absurd. In fact, the red bus / blue bus paradox was originally proposed as a sharp criticism of the independence from irrelevant alternatives (IIA) axiom that underlies the multinomial logit choice model [19] and which makes it unsuitable for choice problems with inherent similarities between different alternatives. In turn, this has led to a vast corpus of literature in social choice and decision theory, with an extensive array of different axioms and models proposed to overcome the failures of the IIA assumption. For an introduction to the topic, we refer the reader to the masterful accounts of McFadden [20], Ben-Akiva & Lerman [6] and Anderson et al. [2].

Perhaps surprisingly, the implications of the red bus / blue bus paradox have not been explored in the context of online learning, despite the fact that similarities between alternatives are prevalent in the field’s application domains – for example, in recommender systems with categorized product recommendation catalogues, in the economics of transport and product differentiation, etc. What makes this gap particularly pronounced is the fact that logit choice underlies some of the most widely used algorithmic schemes for learning in multi-armed bandit problems – namely the exponential weights algorithm for exploration and exploitation (EXP3) [27, 18, 3] as well as its variants, Hedge [4], EXP3.P [5], EXP3-IX [16], EXP4 [5] / EXP4-IX [22], etc. Thus, given the vulnerability of logit choice to irrelevant alternatives, it stands to reason that said algorithms may be suboptimal when faced with a set of alternatives with many inherent similarities.

Our contributions.

Our paper examines this question in the context of repeated decision problems where a learner seeks to minimize their regret in the presence of a large number of distinct alternatives with a hierarchy of embedded (non-combinatorial) similarities. This similarity structure, which we formalize in Section 2, is defined in terms of a nested series of attributes – like “type” or “color” – and induces commensurate similarities to the losses of alternatives that lie in the same class (just as the red and blue buses have identical losses in the example described above).

Inspired by the nested logit choice model introduced by McFadden [20] to resolve the original red bus / blue bus paradox, we develop in Section 3 a nested exponential weights (NEW) algorithm for no-regret learning in decision problems of this type. Our main result is that the regret incurred by nested exponential weights (NEW) is bounded as $\operatorname{\mathcal{O}}(\sqrt{n_{\mathrm{eff}}\log n\cdot T})$ , where $n$ is the total number of alternatives and $n_{\mathrm{eff}}$ is the “effective” number when taking similarities into account (for example, in the standard red bus / blue bus paradox, $n_{\mathrm{eff}}=2$ , cf. Section 4). The gap between nested and non-nested algorithms can be quantified by the problem’s price of affinity (PoAf), defined here as the ratio $\alpha=\sqrt{n/n_{\mathrm{eff}}}$ measuring the worst-case ratio between the regret guarantees of the NEW and EXP3 algorithms (the latter scaling as $\operatorname{\mathcal{O}}(\sqrt{n\log n\cdot T})$ in the problem at hand).

In practical applications (such as the type of recommendation problems that arise in online advertising), $\alpha$ can be exponential in the number of attributes, indicating that the NEW algorithm could lead to significant performance gains in this context. We verify that this is indeed the case in a range of synthetic experiments in Section 5.

Related Work.

The problem of exploiting the structure of the loss model and/or any side information available to the learner is a staple of the bandit literature. More precisely, in the setting of contextual bandits, the learner is assumed to observe some “context-based” information and tries to learn the “context to reward” mapping underlying the model in order to make better predictions. Bandit algorithms of this type – like EXP4 – are often studied as “expert” models [5, 10] or attempt to model the agent’s loss function with a semi-parametric contextual dependency in the stochastic setting to derive optimistic action selection rules [1]; for a survey, we refer the reader to [17] and references therein. While the nested bandit model we study assumes an additional layer of information relative to standard bandit models, there are no experts or a contextual mapping conditioning the action taken, so it is not comparable to the contextual setup.

The type of feedback we consider assumes that the learner observes the “intra-class” losses of their chosen alternative, similar to the semi-bandit in the study of combinatorial bandit algorithms [11, 14]. However, the similarity with combinatorial bandit models ends there: even though the categorization of alternatives gives rise to a tree structure with losses obtained at its leaves, there is no combinatorial structure defining these costs, and modeling this as a combinatorial bandit would lead to the same number of arms and ground elements, thus invalidating the concept.

Besides these major threads in the literature, [26] recently showed that the range of losses can be exploited with an additional free observation, while [12] improves the regret guarantees by using effective loss estimates. However, both works are susceptible to the advent of irrelevant alternatives and can incur significant regret when faced with such a problem. Finally, in the Lipschitz bandit setting, [13, 15] obtain order-optimal regret bounds by building a hierarchical covering model in the spirit of [9]; the correlations induced by a Lipschitz loss model cannot be compared to our model, so there is no overlap of techniques or results.

2. The model

We begin in this section by defining our general nested choice model. Because the technical details involved can become cumbersome at times, it will help to keep in mind the running example of a music catalogue where songs are classified by, say, genre (classical music, jazz, rock,…), artist (Rachmaninov, Miles Davis, Led Zeppelin,…), and album. This is a simple – but not simplistic – use case which requires the full capacity of our model, so we will use it as our “go-to” example throughout.

2.1. Attributes, classes, and the relations between them

Let $\mathcal{A}=\{a_{i}:i=1,\dotsc,n\}$ be a set of alternatives (or atoms) indexed by $i=1,\dotsc,n$ . A similarity structure (or structure of attributes) on $\mathcal{A}$ is defined as a tower of nested similarity partitions (or attributes) $\mathcal{S}_{\ell}$ , $\ell=0,\dotsc,L$ , of $\mathcal{A}$ with $\{\mathcal{A}\}\eqqcolon\mathcal{S}_{0}\mathrel{}\succcurlyeq\mathrel{}\mathcal{S}_{1}\mathrel{}\succcurlyeq\mathrel{}\dotsm\mathrel{}\succcurlyeq\mathrel{}\mathcal{S}_{L}\coloneqq\{\{a\}:a\in\mathcal{A}\}$ . As a result of this definition, each partition $\mathcal{S}_{\ell}$ captures successively finer attributes of the elements of $\mathcal{A}$ (in our music catalogue example, these attributes would correspond to genre, artist, album, etc.).¹¹1The trivial partitions $\mathcal{S}_{0}=\{\mathcal{A}\}$ and $\mathcal{S}_{L}=\{\{a\}:a\in\mathcal{A}\}$ do not carry much information in themselves, but they are included for completeness and notational convenience later on. Accordingly, each constituent set $A$ of a partition $\mathcal{S}_{\ell}$ will be referred to as a similarity class and we assume it collects all elements of $\mathcal{A}$ that share the attribute defining $\mathcal{S}_{\ell}$ : for example, a similarity class for the attribute “artist” might consist of all Beethoven symphonies, all songs by Led Zeppelin, etc.

Collectively, a structure of attributes will be represented by the disjoint union

\mathcal{S}\coloneqq\coprod\nolimits_{\ell=0}^{L}\mathcal{S}_{\ell}\equiv\operatorname*{\bigcup}\nolimits_{\ell=0}^{L}\{(A,\ell):A\in\mathcal{S}_{\ell}\}

(1)

of all class/attribute pairs of the form $(A,\ell)$ for $A\in\mathcal{S}_{\ell}$ . In a slight abuse of terminology (and when there is no danger of confusion), the pair $S=(A,\ell)$ will also be referred to as a “class”, and we will write $S\in\mathcal{S}_{\ell}$ and $a\in S$ instead of $A\in\mathcal{S}_{\ell}$ and $a\in A$ respectively. By contrast, when we need to clearly distinguish between a class and its underlying set, we will write $A=\mathrm{elem}(S)$ for the set of atoms contained in $S$ and $\ell=\mathrm{attr}(S)$ for the attached attribute label.

Remark 1.

The reason for including the attribute label $\ell$ in the definition of $\mathcal{S}$ is that a set of alternatives may appear in different partitions of $\mathcal{A}$ in a different context. For example, if “IV” is the only album by Led Zeppelin in the catalogue, the album’s track list represents both the set of “all songs in IV” as well as the set of “all Led Zeppelin songs”. However, the focal attribute in each case is different – “artist” in the former versus “album” in the latter – and this additional information would be lost in the non-discriminating union $\operatorname*{\bigcup}_{\ell=0}^{L}\mathcal{S}_{\ell}$ (unless, of course, the partitions $\mathcal{S}_{\ell}$ happen to be mutually disjoint, in which case the distinction between “union” and “disjoint union” becomes set-theoretically superfluous). ¶

Moving forward, if a class $S\in\mathcal{S}_{\ell}$ contains the class $S^{\prime}\in\mathcal{S}_{k}$ for some $k>\ell$ , we will say that $S^{\prime}$ is a descendant of $S$ (resp. $S$ is an ancestor of $S^{\prime}$ ), and we will write “ $S^{\prime}\mathrel{}\prec\mathrel{}S$ ” (resp. “ $S\mathrel{}\succ\mathrel{}S^{\prime}$ ”).²²2More formally, we will write $S^{\prime}\mathrel{}\prec\mathrel{}S$ when $\mathrm{elem}(S^{\prime})\subseteq\mathrm{elem}(S)$ and $\mathrm{attr}(S^{\prime})>\mathrm{attr}(S)$ . The corresponding weak relation “ $\mathrel{}\preccurlyeq\mathrel{}$ ” is defined in the standard way, i.e., allowing for the case $\mathrm{attr}(S^{\prime})=\mathrm{attr}(S)$ which in turn implies that $S^{\prime}=S$ . As a special case of this relation, if $S^{\prime}\mathrel{}\prec\mathrel{}S$ and $k=\ell+1$ , we will say that $S^{\prime}$ is a child of $S$ (resp. $S$ is parent of $S^{\prime}$ ) and we will write “ $S^{\prime}\mathrel{}\vartriangleleft\mathrel{}S$ ” (resp. “ $S\mathrel{}\vartriangleright\mathrel{}S^{\prime}$ ”). For completeness, we will also say that $S^{\prime}$ and $S^{\prime\prime}$ are siblings if they are children of the same parent, and we will write $S^{\prime}\sim S^{\prime\prime}$ in this case. Finally, when we wish to focus on descendants sharing a certain attribute, we will write “ $S^{\prime}\mathrel{}\prec_{\ell}\mathrel{}S$ ” as shorthand for the predicate “ $S^{\prime}\mathrel{}\prec\mathrel{}S$ and $\mathrm{attr}(S^{\prime})=\ell$ ”.

Building on this, a similarity structure on $\mathcal{A}$ can also be represented graphically as a rooted directed tree – an arborescence – by connecting two classes $S,S^{\prime}\in\mathcal{S}$ with a directed edge $S\to S^{\prime}$ whenever $S\mathrel{}\vartriangleright\mathrel{}S^{\prime}$ . By construction, the root of this tree is $\mathcal{A}$ itself,³³3Stricto sensu, the root of the tree is $(\mathcal{A},0)$ , but since there is no danger of confusion, the attribute label “0” will be dropped. and the unique directed path $\mathcal{A}\equiv S_{0}\mathrel{}\vartriangleright\mathrel{}S_{1}\mathrel{}\vartriangleright\mathrel{}\dotsm\mathrel{}\vartriangleright\mathrel{}S_{\ell}\equiv S$ from $\mathcal{A}$ to any class $S\in\mathcal{S}$ will be referred to as the lineage of $S$ . For notational simplicity, we will not distinguish between $\mathcal{S}$ and its graphical representation, and we will use the two interchangeably; for an illustration, see Fig. 1.

Figure 1. A structure with

L=3

attributes on the set

\mathcal{A}=\{a_{1},\dotsc,a_{8}\}

; for example, the class

S_{2}^{1}

consists of

\{a_{3},a_{4}\}

2.2. The loss model

Throughout what follows, we will consider loss models in which alternatives that share a common set of attributes incur similar costs, with the degree of similarity depending on the number of shared attributes. More precisely, given a similarity class $S\in\mathcal{S}$ , we will assume that all its immediate subclasses $S^{\prime}$ share the same base cost $c_{S}$ (determined by the parent class $S$ ) plus an idiosyncratic cost increment $r_{S^{\prime}}$ (which is specific to the child $S^{\prime}\mathrel{}\vartriangleleft\mathrel{}S$ in question). Formally, starting with $c_{\mathcal{A}}=0$ (for the root class $\mathcal{A}$ ), this boils down to the recursive definition

c_{S^{\prime}}=c_{S}+r_{S^{\prime}}\quad\text{for all $S^{\prime}\mathrel{}\vartriangleleft\mathrel{}S$},

(2)

which, when unrolled over the lineage $\mathcal{A}\equiv S_{0}\mathrel{}\vartriangleright\mathrel{}S_{1}\mathrel{}\vartriangleright\mathrel{}\dotsm\mathrel{}\vartriangleright\mathrel{}S_{\ell}\equiv S$ of a target class $S\in\mathcal{S}_{\ell}$ , yields the expression

c_{S}=\sum\nolimits_{S^{\prime}\mathrel{}\succcurlyeq\mathrel{}S}r_{S^{\prime}}=r_{S_{1}}+\dotsm+r_{S_{\ell}}.

(3)

Thus, in particular, when $S\leftarrow a\in\mathcal{A}$ , the cost assigned to an individual alternative $a\in\mathcal{A}$ will be given by

c_{a}=\sum\nolimits_{\ell=1}^{L}r_{S_{\ell}}=\sum\nolimits_{S\ni a}r_{S}\quad\text{for all $a\in\mathcal{A}$}.

(4)

Finally, to quantify the “intra-class” variability of costs, we will assume throughout that the idiosyncratic cost increments within a given parent class $S$ are bounded as

r_{S^{\prime}}\in[0,R_{S}]\quad\text{for all $S^{\prime}\mathrel{}\vartriangleleft\mathrel{}S$}.

(5)

This terminology is justified by the fact that, under the loss model (2), the costs $c_{S^{\prime}},c_{S^{\prime\prime}}$ to any two sibling classes $S^{\prime},S^{\prime\prime}\mathrel{}\vartriangleleft\mathrel{}{S}$ (i.e., any two classes parented by $S$ ) differ by at most $R_{S}$ . Analogously, the costs to any two alternatives $a,a^{\prime}\in\mathcal{A}$ that share a set of common attributes $S_{1},\dotsc,S_{\ell}$ will differ by at most $\sum_{k=\ell+1}^{L}R_{S_{k}}$ .

Example 1.

To represent the original red bus / blue bus problem as an instance of the above framework, let $\mathcal{S}_{1}=\{\{\textrm{red bus},\textrm{blue bus}\},\textrm{car}\}$ be the partition of the set $\mathcal{A}=\{\textrm{red bus},\textrm{blue bus},\textrm{car}\}$ by type (“bus” or “car”), and let $\mathcal{S}_{2}$ be the corresponding sub-partition by color (“red” or “blue” for elements of the class “bus”). The fact that color does not affect travel times may then be represented succinctly by taking $R_{\textrm{color}}=0$ (representing the fact that color does not affect travel times). ¶

Remark 2.

We make no distinction here between $c_{a}$ and $c_{\{a\}}$ , i.e., between an alternative $a$ of $\mathcal{A}$ and the (unique) singleton class of $\{a\}\in\mathcal{S}_{L}$ containing it. This is done purely for reasons of notational convenience. ¶

Remark 3.

For posterity, we also note that the optimizing agent is assumed to be aware of the cost decomposition (4) after selecting an alternative $a\in\mathcal{A}$ . In the context of combinatorial bandits [11] this would correspond to the so-called “semi-bandit” setting. ¶

2.3. Sequence of events

With all this in hand, we will consider a generic online decision process that unfolds over a set of alternatives $\mathcal{A}$ endowed with a similarity structure $\mathcal{S}=\coprod_{\ell}\mathcal{S}_{\ell}$ as follows:

(1)

At each stage $t=1,2,\dotsc$ , the learner selects an alternative $a_{t}\in\mathcal{A}$ by selecting attributes from $\mathcal{S}$ one-by-one.
(2)

Concurrently, nature sets the idiosyncratic, intra-class losses $r_{S,t}$ for each similarity class $S\in\mathcal{S}$ .
(3)

The learner incurs $r_{S,t}$ for each chosen class $S\ni a_{t}$ for a total cost of $c_{t}=\sum_{S\ni a_{t}}r_{S,t}$ , and the process repeats.

To align our presentation with standard bandit models with losses in $[0,1]$ , we will assume throughout that $\sum_{S\ni a}R_{S}\leq 1$ for all $a\in\mathcal{A}$ , meaning in particular that the maximal cost incurred by any alternative $a\in\mathcal{A}$ is upper bounded by $1$ . Other than this normalization, the sequence of idiosyncratic loss vectors $r_{t}\in\mathbb{R}^{\mathcal{S}}$ , $t=1,2,\dotsc$ , is assumed arbitrary and unknown to the learner as per the standard adversarial setting [10, 24].

To avoid deterministic strategies that could be exploited by an adversary, we will assume that the learner selects an alternative $a_{t}$ at time $t$ based on a mixed strategy $X_{t}\in\operatorname{\operatorname{\Delta}}(\mathcal{A})$ , i.e., $a_{t}\sim X_{t}$ . The regret of a policy $X_{t}$ , $t=1,2,\dotsc$ , against a benchmark strategy $p\in\operatorname{\operatorname{\Delta}}(\mathcal{A})$ is then defined as the cumulative difference between the player’s mean cost under $p$ and $X_{t}$ , that is

\operatorname{Reg}_{p}(T)=\sum_{t=1}^{T}\left[\operatorname{\mathbb{E}}_{X_{t}}[c_{a_{t},t}]-\operatorname{\mathbb{E}}_{p}[c_{a_{t},t}]\right]=\sum_{t=1}^{T}\langle c_{t},X_{t}-p\rangle

(6)

where $c_{t}=(c_{a,t})_{a\in\mathcal{A}}\in\mathbb{R}^{\mathcal{A}}$ denotes the vector of costs encountered by the learner at time $t$ , i.e., $c_{a,t}=\sum_{S\ni a}r_{S,t}$ for all $a\in\mathcal{A}$ . This definition will be our main figure of merit in the sequel.

3. The nested exponential weights algorithm

Our goal in what follows will be to design a learning policy capable of exploiting the type of similarity structures introduced in the previous section. The main ingredients of our method are a nested attribute selection and cost estimation rule, which we describe in detail in Sections 3.1 and 3.2 respectively; the proposed nested exponential weights (NEW) algorithm is then developed and discussed in Section 3.3.

3.1. Probabilities, propensities, and nested logit choice

We begin by introducing the attribute selection scheme that forms the backbone of our proposed policy. Our guiding principle in this is the nested logit choice (NLC) rule of McFadden [20] which selects an alternative $a\in\mathcal{A}$ by traversing $\mathcal{S}$ one attribute at a time and prescribing the corresponding conditional choice probabilities at each level of $\mathcal{S}$ .

To set the stage for all this, if $x=(x_{1},\dotsc,x_{n})\in\operatorname{\operatorname{\Delta}}(\mathcal{A})$ is a mixed strategy on $\mathcal{A}$ we will write

	$\displaystyle x_{S}$	$\displaystyle\textstyle=\sum_{a\in S}x_{a}$	(7)
for the probability of choosing $S\in\mathcal{S}$ under $x$ , and
	$\displaystyle x_{S^{\prime}\|S}$	$\displaystyle=x_{S^{\prime}}/x_{S}$	(8)

for the conditional probability of choosing a descendant $S^{\prime}$ of $S$ assuming that $S$ has already been selected under $x$ .⁴⁴4Note here that the joint probability of selecting both $S$ and $S^{\prime}$ under $x$ is simply $x_{S^{\prime}}$ whenever $S^{\prime}\mathrel{}\preccurlyeq\mathrel{}S$ . Then the nested logit choice (NLC) rule proceeds as follows: first, it prescribes choice probabilities $x_{S_{1}}$ for all classes $S_{1}\in\mathcal{S}_{1}$ (i.e., the coarsest ones); subsequently, once a class $S_{1}\in\mathcal{S}_{1}$ has been selected, NLC prescribes the conditional choice probabilities $x_{S_{2}|S_{1}}$ for all children $S_{2}$ of $S_{1}$ and draws a class from $\mathcal{S}_{2}$ based on $x_{S_{2}|S_{1}}$ . The process then continues downwards along $\mathcal{S}$ until reaching the finest partition $\mathcal{S}_{L}$ and selecting an atom $\{a\}\equiv S_{L}\mathrel{}\vartriangleleft\mathrel{}S_{L-1}\mathrel{}\vartriangleleft\mathrel{}\dotsm\mathrel{}\vartriangleleft\mathrel{}S_{0}\equiv\mathcal{A}$ .

This step-by-step selection process captures the “nested” part of the nested logit choice rule; the “logit” part refers to the way that the conditional probabilities (8) are actually prescribed given the agent’s predisposition towards each alternative $a\in\mathcal{A}$ . To make this precise, suppose that the learner associates to each element $a\in\mathcal{A}$ a propensity score $y_{a}\in\mathbb{R}$ indicating their tendency – or propensity – to select it. The associated propensity score of a similarity class $S_{\ell-1}\in\mathcal{S}_{\ell-1}$ , $\ell=1,\dotsc,L$ , is then defined inductively as

y_{S_{\ell-1}}=\mu_{\ell}\log\sum\nolimits_{S_{\ell}\mathrel{}\vartriangleleft\mathrel{}S_{\ell-1}}\exp(y_{S_{\ell}}/\mu_{\ell})

(9)

where $\mu_{\ell}>0$ is a tunable parameter that reflects the learner’s uncertainty level regarding the $\ell$ -th attribute $\mathcal{S}_{\ell}$ of $\mathcal{S}$ . In words, this means that the score of a class is the weighted softmax of the scores of its children; thus, starting with the individual alternatives of $\mathcal{A}$ – that is, the leaves of $\mathcal{S}$ – propensity scores are propagated backwards along $\mathcal{S}$ , and this is repeated one attribute at a time until reaching the root of $\mathcal{S}$ .

Remark 4.

We should also note that Eq. 9 assigns a propensity score to any similarity class $S\in\mathcal{S}$ . However, because the primitives of this assignment are the original scores assigned to each alternative $a\in\mathcal{A}$ , we will reserve the notation $y=(y_{1},\dotsc,y_{n})\in\mathbb{R}^{\mathcal{A}}$ for the profile of propensity scores $(y_{a})_{a\in\mathcal{A}}$ that comprises the basis of the recursive definition (9). ¶

With all this in hand, given a propensity score profile $y=(y_{1},\dotsc,y_{n})\in\mathbb{R}^{\mathcal{A}}$ , the nested logit choice (NLC) rule is defined via the family of conditional selection probabilities

P_{S_{\ell}|S_{\ell-1}}(y)=\frac{\exp(y_{S_{\ell}}/\mu_{\ell})}{\exp(y_{S_{\ell-1}}/\mu_{\ell})}

(NLC)

where:

(1)

$S_{\ell}\in\mathcal{S}_{\ell}$ and $S_{\ell-1}\in\mathcal{S}_{\ell-1}$ is a child / parent pair of similarity classes of $\mathcal{S}$ .
(2)

$\mu_{1}\geq\dotsm\geq\mu_{L}>0$ is a nonincreasing sequence of uncertainty parameters (indicating a higher uncertainty level for coarser attributes; we discuss this later).

In more detail, the choice of an alternative $a\in\mathcal{A}$ under (NLC) proceeds as follows: given a propensity score $y_{a}\in\mathbb{R}$ for each $a\in\mathcal{A}$ , every similarity class $S_{L-1}\in\mathcal{S}_{L-1}$ is assigned a propensity score via the recursive softmax expression (9), and the same procedure is applied inductively up to the root $\mathcal{A}$ of $\mathcal{S}$ . Then, to select an alternative $a\in\mathcal{A}$ , the conditional logit choice rule (NLC) proceeds in a top-down manner, first by selecting a similarity class $S_{1}\mathrel{}\vartriangleleft\mathrel{}S_{0}\equiv\mathcal{A}$ , then by selecting a child $S_{2}\mathrel{}\vartriangleleft\mathrel{}S_{1}$ of $S_{1}$ , and so on until reaching a leaf $\{a\}\equiv S_{L}\mathrel{}\vartriangleleft\mathrel{}S_{L-1}\mathrel{}\vartriangleleft\mathrel{}\dotsm\mathrel{}\vartriangleleft\mathrel{}S_{0}\equiv\mathcal{A}$ of $\mathcal{S}$ .

Equivalently, unrolling (NLC) over the lineage $\mathcal{A}\equiv S_{0}\mathrel{}\vartriangleright\mathrel{}S_{1}\mathrel{}\vartriangleright\mathrel{}\dotsm\mathrel{}\vartriangleright\mathrel{}S_{\ell}\equiv S$ of a target class $S\in\mathcal{S}_{\ell}$ , we obtain the expression

P_{S}(y)=\prod\nolimits_{k=1}^{\ell}\frac{\exp(y_{S_{k}}/\mu_{k})}{\exp(y_{S_{k-1}}/\mu_{k})}

(10)

for the total probability of selecting class $S$ under the propensity score profile $y\in\mathbb{R}^{\mathcal{A}}$ . Clearly, (NLC) and (10) are mathematically equivalent, so we will refer to either one as the definition of the nested logit choice rule.

3.2. The nested importance weighted estimator

The second key ingredient of our method is how to estimate the costs of alternatives that were not chosen under (NLC). To that end, given a cost vector $c\in[0,1]^{\mathcal{A}}$ and a mixed strategy $x\in\operatorname{\operatorname{\Delta}}(\mathcal{A})$ with full support, a standard way to do this is via the importance-weighted estimator [8, 17]

\hat{c}_{a}=\frac{\operatorname{\mathds{1}}\{a=\hat{a}\}}{x_{a}}c_{a}

(IWE)

where $\hat{a}\sim x$ is the (random) element of $\mathcal{A}$ chosen under $x$ .

This estimator enjoys the following important properties:

\edefnit\selectfonta\edefnn)

It is non-negative.
\edefnit\selectfonta\edefnn)

It is unbiased, i.e.,

$\operatorname{\mathbb{E}}[\hat{c}_{a}]=c_{a}\quad\text{for all $a\in\mathcal{A}$}.$ (11)
\edefnit\selectfonta\edefnn)

Its importance-weighted mean square is bounded as

$\operatorname{\mathbb{E}}\left[\sum\nolimits_{a\in\mathcal{A}}x_{a}\hat{c}_{a}^{2}\right]\leq n$ (12)

This trifecta of properties plays a key role in establishing the no-regret guarantees of the vanilla exponential weights algorithm [4, 18, 27]; at the same time however, (IWE) fails to take into account any side information provided by similarities between different elements of $\mathcal{A}$ . This is perhaps most easily seen in the original red bus / blue bus paradox: if the commuter takes a red bus, the observed utility would be immediately translateable to the blue bus (and vice versa). However, (IWE) is treating the red and blue buses as unrelated, so $\hat{c}_{\textrm{blue bus}}$ is not updated under (IWE), even though $c_{\textrm{blue bus}}=c_{\textrm{red bus}}$ by default.

To exploit this type of similarities, we introduce below a layered estimator that shadows the step-by-step selection process of (NLC). To define it, let $x\in\operatorname{\operatorname{\Delta}}(\mathcal{A})$ be a mixed strategy on $\mathcal{A}$ with full support, and assume that an element $\hat{a}\in\mathcal{A}$ is selected progressively according to $x$ as in the case of (NLC):⁵⁵5To clarify, this process adheres to the “nested” part of (NLC); the conditional probabilities $x_{S^{\prime}|S}$ may of course differ. First, the learner chooses a similarity class $\hat{S}_{1}\in\mathcal{S}_{1}$ with probability $\operatorname{\mathbb{P}}(\hat{S}_{1}=S_{1})=x_{S_{1}}$ ; subsequently, conditioned on the choice of $\hat{S}_{1}$ , a class $\hat{S}_{2}\mathrel{}\vartriangleleft\mathrel{}\hat{S}_{1}$ is selected with probability $\operatorname{\mathbb{P}}(\hat{S}_{2}=S_{2}|\hat{S}_{1})=x_{S_{2}|\hat{S}_{1}}$ , and the process repeats until reaching a leaf $\hat{S}_{L}=\{\hat{a}\}$ of $\mathcal{S}$ (at which point the selection procedure terminates and returns $\hat{a}$ ). Then, given a loss profile $r\in[0,+\infty)^{\mathcal{S}}$ and a mixed strategy $x\in\operatorname{\operatorname{\Delta}}(\mathcal{A})$ , the nested importance weighted estimator (NIWE) is defined for all $\ell=1,\dotsc,L$ as

\hat{r}_{S_{\ell}}=\frac{\operatorname{\mathds{1}}\big{\{}S_{\ell}=\hat{S}_{\ell},\dotsc,S_{1}=\hat{S}_{1}\big{\}}}{x_{S_{\ell}|S_{\ell-1}}\!\dotsm x_{S_{2}|S_{1}}x_{S_{1}}}r_{S_{\ell}}

(NIWE)

where the chain of categorical random variables $\mathcal{A}\equiv\hat{S}_{0}\mathrel{}\vartriangleright\mathrel{}\hat{S}_{1}\mathrel{}\vartriangleright\mathrel{}\dotsm\mathrel{}\vartriangleright\mathrel{}\hat{S}_{L}=\{\hat{a}\}$ is drawn according to $x\in\operatorname{\operatorname{\Delta}}(\mathcal{A})$ as outlined above.⁶⁶6The indicator in (NIWE) is assumed to take precedence over $x_{S_{k}|S_{k-1}}$ , i.e., $\hat{c}_{S_{\ell}}=0$ if $S_{k}\neq\hat{S}_{k}$ for some $k=1,\dotsc,\ell$ .

This estimator will play a central part in our analysis, so some remarks are in order. First and foremost, the non-nested estimator (IWE) is recovered as a special case of (NIWE) when there are no similarity attributes on $\mathcal{A}$ (i.e., $L=1$ ). Second, in a bona fide nested model, we should note that $\hat{c}_{S_{\ell}}$ is $\hat{S}_{\ell}$ -measurable but not $\hat{S}_{\ell-1}$ -measurable: this property has no analogue in (IWE), and it is an intrinsic feature of the step-by-step selection process underlying (NIWE). Third, it is also important to note that (NIWE) concerns the idiosyncratic losses of each chosen class, not the base costs $c_{a}$ of each alternative $a\in\mathcal{A}$ . This distinction is again redundant in the non-nested case, but it leads to a distinct estimator for $c_{a}$ in nested environments, namely

\hat{c}_{a}=\sum\nolimits_{S\ni a}\hat{r}_{S}\quad\text{for all $a\in\mathcal{A}$}.

(13)

In particular, in the red bus / blue bus paradox, this means that an observation for the class “bus” automatically updates both $\hat{c}_{\textrm{red bus}}$ and $\hat{c}_{\textrm{blue bus}}$ , thus overcoming one of the main drawbacks of (IWE) when facing irrelevant alternatives.

To complete the comparison with the non-nested setting, we summarize below the most important properties of the layered estimator (NIWE):

Proposition 1.

Let $\mathcal{S}=\coprod_{\ell=1}^{L}\mathcal{S}_{\ell}$ be a similarity structure on $\mathcal{A}$ . Then, given a mixed strategy $x\in\operatorname{\operatorname{\Delta}}(\mathcal{A})$ and a vector of cost increments $r\in\mathbb{R}^{\mathcal{S}}$ as per (5), the estimator (NIWE) satisfies the following:

(1)

It is unbiased:

$\operatorname{\mathbb{E}}\left[\hat{r}_{S}\right]=r_{S}\quad\text{for all $S\in\mathcal{S}$}.$ (14)
(2)

It enjoys the importance-weighted mean-square bound

$\operatorname{\mathbb{E}}\left[x_{S}\hat{r}_{S}^{2}\right]\leq R_{S}^{2}\quad\text{for all $S\in\mathcal{S}$}.$ (15)

Accordingly, the loss estimator (13) is itself unbiased and enjoys the bound

\operatorname{\mathbb{E}}\left[\sum\nolimits_{a\in\mathcal{A}}x_{a}\hat{c}_{a}^{2}\right]\leq n_{\mathrm{eff}}

(16)

where $n_{\mathrm{eff}}$ is defined as

\sqrt{n_{\mathrm{eff}}}=\sum\nolimits_{\ell=1}^{L}\sqrt{n_{\ell}}\bar{R}_{\ell}

(17)

with $n_{\ell}=\lvert\mathcal{S}_{\ell}\rvert$ denoting the number of classes of attribute $\mathcal{S}_{\ell}$ , and

\bar{R}_{\ell}=\sqrt{\frac{1}{n_{\ell}}\sum\nolimits_{S_{\ell}\in\mathcal{S}_{\ell}}R_{S_{\ell}}^{2}}

(18)

denoting the “root-mean-square” range of all classes in $\mathcal{S}_{\ell}$ .

Of course, Proposition 1 yields the standard properties of (IWE) as a special case when $L=1$ (in which case there are no similarities to exploit between alternatives). To streamline our presentation, we prove this result in Appendix B.

3.3. The nested exponential weights algorithm

We are finally in a position to present the nested exponential weights (NEW) algorithm in detail. Building on the original exponential weights blueprint [18, 4, 27], the main steps of the NEW algorithm can be summed up as follows:

(1)

For each stage $t=1,2,\dotsc$ , the learner maintains and updates a propensity score profile $Y_{t}\in\mathbb{R}^{\mathcal{A}}$ .
(2)

The learner selects an action $a_{t}\in\mathcal{A}$ based on the nested logit choice rule $a_{t}\sim P(\eta_{t}Y_{t})$ where $\eta_{t}\geq 0$ is the method’s learning rate and $P$ is given by (NLC).
(3)

The learner incurs $r_{S,t}$ for each class $S\ni a_{t}$ and constructs a model $\hat{c}_{t}$ of the cost vector $c_{t}$ of stage $t$ via (NIWE).
(4)

The learner updates their propensity score profile based on $\hat{c}_{t}$ and the process repeats.

For a presentation of the algorithm in pseudocode form, see Algorithm 1; the tuning of the method’s uncertainty parameters $\mu_{1}\geq\dotsc\geq\mu_{L}>0$ and the learning rate $\eta_{t}$ is discussed in the next section, where we undertake the analysis of the NEW algorithm.

Algorithm 1 \AcfNEW

1:set of alternatives

\mathcal{A}

; attribute structure

\mathcal{S}=\coprod_{\ell=1}^{L}\mathcal{S}_{\ell}

2: Params: uncertainty levels

\mu_{1},\dotsc,\mu_{L}>0

; learning rate

\eta_{t}\geq 0

3: Input: sequence of class costs

r_{t}\in[0,1]^{\mathcal{S}}

t=1,2,\dotsc

4:initialize

Y\leftarrow 0\in\mathbb{R}^{\mathcal{A}}

S_{0}=\mathcal{A}

# initialization

5:for

t=1,2,\dotsc

do # scoring phase

6: for

\ell=L-1,\dotsc,0

and for all

S\in\mathcal{S}_{\ell}

Y_{S}\leftarrow\mu_{\ell+1}\log\sum_{S^{\prime}\mathrel{}\vartriangleleft\mathrel{}S}\exp(Y_{S^{\prime}}/\mu_{\ell+1})

# as per (9)

8: set

\hat{r}_{S}\leftarrow 0

# baseline guess

9: end for

10: for

\ell=1,\dotsc,L

do # selection phase

11: select class

S_{\ell}\mathrel{}\vartriangleleft\mathrel{}S_{\ell-1}

# class choice

S_{\ell}\sim X_{S_{\ell}|S_{\ell-1}}=\frac{\exp(\eta_{t}Y_{S_{\ell}}/\mu_{\ell})}{\exp(\eta_{t}Y_{S_{\ell-1}}/\mu_{\ell})}

# (NLC)

12: get

r_{S_{\ell},t}

# intra-class cost

13: set

\displaystyle\hat{r}_{S_{\ell}}\leftarrow\hat{r}_{S_{\ell}}+\frac{r_{S_{\ell},t}}{X_{S_{\ell}|S_{\ell-1}}\!\!\dotsm X_{S_{1}|S_{0}}}

# (NIWE)

14: end for

15: set

\hat{c}_{a}\leftarrow\sum_{S\ni a}\hat{r}_{S}

for all

a\in\mathcal{A}

# loss model

16: set

Y\leftarrow Y-\hat{c}

# update propensities

17:end for

4. Analysis and results

We are now in a position to state and discuss our main regret guarantees for the NEW algorithm. These are as follows:

Theorem 1.

Suppose that Algorithm 1 is run with a non-increasing learning rate $\eta_{t}>0$ and uncertainty parameters $\mu_{1}\geq\dotsm\geq\mu_{L}>0$ against a sequence of cost vectors $c_{t}\in[0,1]^{\mathcal{A}}$ , $t=1,2,\dotsc$ , as per (4). Then, for all $p\in\operatorname{\operatorname{\Delta}}(\mathcal{A})$ , the learner enjoys the regret bound

\displaystyle\operatorname{\mathbb{E}}[\operatorname{Reg}_{p}(T)]

\displaystyle\leq\frac{H}{\eta_{T+1}}+\frac{n_{\mathrm{eff}}}{2\mu_{L}}\sum_{t=1}^{T}\eta_{t}

(19)

with $n_{\mathrm{eff}}$ given by (17) and $H\equiv H(\mu_{1},\dotsc,\mu_{L})$ defined by setting $y=0$ in (9) and taking $H=y_{\mathcal{A}}$ , i.e.,

H=\log\left[\sum_{S_{1}\mathrel{}\vartriangleleft\mathrel{}S_{0}}\left[\sum_{S_{2}\mathrel{}\vartriangleleft\mathrel{}S_{1}}\!\dotsi\left[\sum_{S_{L}\mathrel{}\vartriangleleft\mathrel{}S_{L-1}}\!\!\!\!1\right]^{\frac{\mu_{L}}{\mu_{L-1}}}\!\!\!\!\!\dotsi\,\right]^{\frac{\mu_{2}}{\mu_{1}}}\right]^{\mu_{1}}

(20)

In particular, if Algorithm 1 is run with $\mu_{1}=\dotsm=\mu_{L}=\sqrt{n_{\mathrm{eff}}/2}$ and $\eta_{t}=\sqrt{\log n/(2t)}$ , we have

\operatorname{\mathbb{E}}[\operatorname{Reg}_{p}(T)]\leq 2\sqrt{n_{\mathrm{eff}}\log n\cdot T}.

(21)

Theorem 1 is our main regret guarantee for NEW so, before discussing its proof (which we carry out in detail in Appendices A, B and C), some remarks are in order.

The first thing of note is the comparison to the corresponding bound for EXP3, namely

\operatorname{\mathbb{E}}[\operatorname{Reg}_{p}(T)]\leq 2\sqrt{n\log n\cdot T}.

(22)

This shows that the guarantees of NEW and EXP3 differ by a factor of⁷⁷7Depending on the source, the bound (22) may differ up to a factor of $\sqrt{2}$ , compare for example [24, Corollary 4.2] and [17, Theorem 11.2]. This factor is due to the fact that (22) is usually stated for a known horizon $T$ (which saves a factor of $\sqrt{2}$ relative to anytime algorithms). Ceteris paribus, the bound (21) can be sharpened by the same factor, but we omit the details.

\alpha=\sqrt{n/n_{\mathrm{eff}}},

(23)

which, for reasons that become clear below, we call the price of affinity (PoAf).

Since the variabilities of the idiosyncratic losses within each attribute have been normalized to $1$ (recall the relevant discussion in Section 2.3), Hölder’s inequality trivially gives $n_{\mathrm{eff}}\leq n$ , no matter the underlying similarity structure. Of course, if there are no similarities to exploit ( $L=1$ ), we get $n_{\mathrm{eff}}=n$ , in which case the two bounds coincide ( $\alpha=1$ ).

At the other extreme, suppose we have a red bus / blue bus type of problem with, say, $n_{1}=2$ similarity classes, $n_{2}=100$ alternatives per class, and a negligible intra-class loss differential ( $R_{2}\approx 0$ ). In this case, EXP3 would have to wrestle with $n=n_{1}n_{2}=200$ alternatives, while NEW would only need to discriminate between $n_{\mathrm{eff}}\approx n_{1}=2$ alternatives, leading to an improvement by a factor of $\alpha\approx 10$ in terms of regret guarantee. Thus, even though the red bus / blue bus paradox could entangle EXP3 and cause the algorithm to accrue significant regret over time, this is no longer the case under the NEW method; we also explore this issue numerically in Section 5.

As another example, suppose that each non-terminal class in $\mathcal{S}$ has $m$ children and the variability of the idiosyncratic losses likewise scales down by a factor of $m$ per attribute. In this case, a straightforward calculation shows that $n_{\mathrm{eff}}$ scales as $\Theta(m)$ , so the gain in efficiency would be of the order of $\alpha=\sqrt{n/n_{\mathrm{eff}}}=\Theta(m^{(L-1)/2})$ , i.e., polynomial in $m$ and exponential in $L$ . This gain in performance can become especially pronounced when there is a very large number of atlernatives organized in categories and subcategories of geometrically decreasing impact on the end cost of each alternative. We explore this issue in practical scenarios in Sections 5 and D.

Finally, we should also note that the parameters of NEW have been tuned so as to facilitate the comparison with EXP3. This tuning is calibrated for the case where $\mathcal{S}$ is fully symmetric, i.e., all subcategories of a given attribute have the same number of children. Otherwise, in full generality, the tuning of the algorithm’s uncertainty levels would boil down to a transcedental equation involving the nested term $H(\mu_{1},\dotsc,\mu_{L})$ of (19). This can be done efficiently offline via a line search, but since the result would be structure-dependent, we do not undertake this analysis here.

Proof outline of Theorem 1.

The detailed proof of Theorem 1 is quite lengthy, so we defer it to Appendices A, B and C and only sketch here the main ideas.

The first basic step is to derive a suitable “potential function” that can be used to track the evolution of the NEW policy relative to the benchmark $p\in\operatorname{\operatorname{\Delta}}(\mathcal{A})$ . The main ingredient of this potential is the “nested” entropy function

h(x)=\sum\nolimits_{k=0}^{L}\delta_{k}\sum\nolimits_{S_{k}\in\mathcal{S}_{k}}x_{S_{k}}\log x_{S_{k}},

(24)

where $\delta_{k}=\mu_{k}-\mu_{k+1}$ for all $k=1,\dotsc,L$ (with $\mu_{L+1}=0$ by convention).⁸⁸8In the non-nested case, (24) boils down to the standard (negative) entropy $h(x)=\sum_{a}x_{a}\log x_{a}$ . However, the inverse problem of deriving the “correct” form of $h$ in a nested environment involves a technical leap of faith and a fair degree of trial-and-error. As we show in Proposition A.1 in Appendix A, the “tiers” of $h$ can be unrolled to give the “non-tiered” recursive representation

h(x)=\sum\nolimits_{S\in\mathcal{S}}h(x|S)

(25)

where $h(x|S)=\mu_{\ell+1}\sum_{S^{\prime}\mathrel{}\vartriangleleft\mathrel{}S}x_{S^{\prime}}\log(x_{S^{\prime}}/x_{S})$ denotes the “conditional” entropy of $x$ relative to class $S\in\mathcal{S}_{\ell}$ . Then, by means of this decomposition and a delicate backwards induction argument, we show in Proposition A.2 that \edefnit\selectfonta\edefnn) the recursively defined propensity score $y_{\mathcal{A}}$ of $\mathcal{A}$ can be expressed non-recursively as $y_{\mathcal{A}}=\operatorname*{arg\,max}_{x\in\operatorname{\operatorname{\Delta}}(\mathcal{A})}\{\langle y,x\rangle-h(x)\}$ ; and \edefnit\selectfonta\edefnn) that the choice rule (NLC) can be expressed itself as

P_{a}(y)=\frac{\partial y_{\mathcal{A}}}{\partial y_{a}}\quad\text{for all $y\in\mathbb{R}^{\mathcal{A}}$, $a\in\mathcal{A}$}.

(26)

This representation of (NLC) provides the first building block of our proof because, by Danskin’s theorem [7], it allows us to rewrite Algorithm 1 in more concise form as

\begin{gathered}Y_{t+1}=Y_{t}-\hat{c}_{t}\\ X_{t+1}=\operatorname*{arg\,max}_{x\in\operatorname{\operatorname{\Delta}}(\mathcal{A})}\{\langle\eta_{t+1}Y_{t+1},x\rangle-h(x)\}\end{gathered}

(NEW)

with $\hat{c}_{t}$ given by (13) appplied to $x\leftarrow X_{t}$ . Importantly, this shows that the NEW algorithm is an instance of the well-known “follow the regularized leader” (FTRL) algorithmic framework [25, 24]. Albeit interesting, this observation is not particularly helpful in itself because there is no universal, “regularizer-agnostic” analysis giving optimal (or near-optimal) regret rates for “follow the regularized leader” (FTRL) with bandit/partial information.⁹⁹9For the analysis of specific versions of FTRL with non-entropic regularizers, cf. [ABL11, ZS19] and references therein. Nonetheless, by adapting a series of techniques that are used in the analysis of FTRL algorithms, we show in Appendix C that the iterates of ( ‣ 4) satisfy the “energy inequality”

	$\displaystyle\langle\hat{c}_{t},X_{t}-p\rangle$	$\displaystyle\leq E_{t}-E_{t+1}+\frac{1}{\eta_{t}}F(X_{t},\eta_{t}Y_{t+1})$
		$\displaystyle+(\eta_{t+1}^{-1}-\eta_{t}^{-1})[h(p)-\min h]$		(27)

where $\hat{c}_{t}$ is the nested importance weighted estimator (13) for the cost vector encountered $c_{t}$ , and we have set

F(x,y)=h(x)+y_{\mathcal{A}}-\langle y,x\rangle

(28)

and $E_{t}=\eta_{t}^{-1}F(p,\eta_{t}Y_{t})$ .

Then, by Proposition 1, we obtain:

Proposition 2.

The NEW algorithm enjoys the bound

\operatorname{\mathbb{E}}[\operatorname{Reg}_{p}(T)]\leq\frac{H}{\eta_{T+1}}+\sum_{t=1}^{T}\frac{\operatorname{\mathbb{E}}[F(X_{t},\eta_{t}Y_{t+1})]}{\eta_{t}}.

(29)

Proposition 2 provides the first half of the bound (19), with the precise form of $H$ derived in Lemma C.1. The second half of (19) revolves around the term $\operatorname{\mathbb{E}}[F(X_{t},\eta_{t}Y_{t+1})]$ and boils down to estimating how propensity scores are back-propagated along $\mathcal{S}$ . In particular, the main difficulty is to bound the difference $y^{+}_{\mathcal{A}}-y_{\mathcal{A}}$ in the propensity score of the root node $\mathcal{A}$ of $\mathcal{S}$ when the underlying score profile $y\in\mathbb{R}^{\mathcal{A}}$ is incremented to $y^{+}=y+w$ for some $w\in\mathbb{R}^{\mathcal{A}}$ .

A first bound that can be obtained by convex analysis arguments is $\lvert y^{+}_{\mathcal{A}}-y_{\mathcal{A}}\rvert\leq\langle y,P(y)\rangle+\lVert w\rVert_{\infty}^{2}$ ; however, because the increments of ( ‣ 4) are unbounded in norm, this global bound is far too lax for our puposes. A similar issue arises in the analysis of EXP3, and is circumvented by deriving a bound for the log-sum-exp function using the identity $\exp(x)\leq 1+x+x^{2}/2$ for $x\leq 0$ and the fact that the estimator (IWE) is non-negative [17, 24, 10]. Extending this idea to nested environments is a very delicate affair, because each tier in $\mathcal{S}$ introduces an additional layer of error propagation in the increments $Y_{t+1}-Y_{t}$ . However, by a series of inductive arguments that traverse $\mathcal{S}$ both forward and backward, we are able to show the bound

y^{+}_{\mathcal{A}}-y_{\mathcal{A}}\leq\langle y,P(y)\rangle+\frac{1}{2\mu_{L}}\sum_{\ell=1}^{L}\sum_{S_{\ell}\in\mathcal{S}_{\ell}}P_{S_{\ell}}(y)r_{S_{\ell}}^{2}

(30)

which, after taking expecations and using the bounds of Proposition 1, finally yields the pseudo-regret bound (19).

5. Numerical experiments

Refer to caption — Figure 2. Regret of EXP3 and NEW in the red bus / blue bus problem with different numbers of buses.

In this section we present a series of numerical experiments designed to test the efficiency of our method compared to EXP3. We use a synthetic environment where we simulate nested similarity partitions with trees. While NEW exploits the similarity structure by making forward/backward passes through the associated tree with its logit choice rule (NLC), EXP3 is simply run over the leaves of the tree, i.e., $\mathcal{A}$ . All experiment details (as well as additional results) are presented in Appendix D. For every setting, we report the results of our experiments by plotting the average regret of each algorithm for $20$ seeds of randomly drawn losses. The code to reproduce the experiments can be found at https://github.com/criteo-research/Nested-Exponential-Weights.

Benefits in the red bus/blue bus problem.

We consider here a variant of the red bus/blue bus problem with $N$ different buses (the original paradox has $N=2$ ). In this experiment (see illustration in Fig. 5, Appendix D.2) we allow each bus to have non-zero intrinsic losses and illustrate in Fig. 2 how both algorithms perform when $N$ grows. We observe there that for all configurations NEW achieves better regret than EXP3. While both methods achieve sublinear regret, EXP3 requires far more steps to identify the best alternative as $N$ grows and suffers overall from worse regret while NEW achieves similar regret and does not suffer as much from the number of irrelevant alternatives. We provide additional plots in Section D.2 which show that NEW performs consistently better than EXP3 when there exists a similarity structure allowing to efficiently update scores of classes that have very similar losses.

Performance in general nested structures.

In this setting we generate symmetric trees and experiment with different values of number of levels $L$ and number of child per nodes $M=|S_{\ell}|$ for $\ell=1,\dots,L$ . Specifically, in Fig. 3 with a fixed $M$ , we see that NEW obtains better regret than EXP3 even when $L$ increases. We provide variance plots for the experiments that generated the same performance on the plots in 8 as well as additional visualisations. Finally, in Fig. 4, we can see that for a shallow tree ( $L=2$ ) NEW performs always better than EXP3, even for high values of $M$ . Indeed, when the number of children per nodes $M$ increases, the tree loses its “factorized” structure which also affects NEW due to the less "structured" tree. Thus, again, NEW performs consistently better than EXP3 when it is possible to efficiently handle classes with similar losses.

Overall, our experiments confirm that a learning algorithm based on nested logit choice can lead to significant benefits in problems with a high degree of similarity between alternatives. This leaves open the question of whether a similar approach can be applied to structures with non-nested attributes; we defer this question to future work.

Appendix A The nested entropy and its properties

Our aim in this appendix is to prove the basic properties of the series of (negative) entropy functions that fuel the regret analysis of the nested exponential weights (NEW) algorithm.

To begin with, given a similarity structure $\mathcal{S}$ on $\mathcal{A}$ and a sequence of uncertainty parameters $\mu_{1}\geq\dotsm\geq\mu_{L}>0$ (with $\mu_{L+1}=0$ by convention), we define:

(1)

The conditional entropy of $x\in\operatorname{\operatorname{\Delta}}(\mathcal{A})$ relative to a target class $S\in\mathcal{S}_{\ell}$ :

h(x|S)=\mu_{\ell+1}\sum_{S^{\prime}\mathrel{}\vartriangleleft\mathrel{}S}x_{S^{\prime}}\log\frac{x_{S^{\prime}}}{x_{S}}=\mu_{\ell+1}\,x_{S}\sum_{S^{\prime}\mathrel{}\vartriangleleft\mathrel{}S}x_{S^{\prime}|S}\log x_{S^{\prime}|S}.

(A.1)

(2)

The nested entropy of $x\in\operatorname{\operatorname{\Delta}}(\mathcal{A})$ relative to $S\in\mathcal{S}_{\ell}$ :

$h_{S}(x)=\sum_{k=\ell}^{L}\delta_{k}\;\sum_{\mathclap{S_{k}\mathrel{}\preccurlyeq_{k}\mathrel{}S}}\;x_{S_{k}}\log x_{S_{k}}$ (A.2)

where $\delta_{k}=\mu_{k}-\mu_{k+1}$ for all $k=1,\dotsc,L$ .

(3)

The restricted entropy of $x\in\operatorname{\operatorname{\Delta}}(\mathcal{A})$ relative to $S\in\mathcal{S}_{\ell}$ :

h_{|S}(x)=h_{S}(x)+\chi_{\operatorname{\operatorname{\Delta}}(S)}(x)=\begin{cases}h_{S}(x)&\quad\text{if $x\in\operatorname{\operatorname{\Delta}}(S)$},\\ \infty&\quad\text{otherwise},\end{cases}

(A.3)

where $\chi_{\operatorname{\operatorname{\Delta}}(S)}$ denotes the (convex) characteristic function of $\operatorname{\operatorname{\Delta}}(S)$ , i.e., $\chi_{\operatorname{\operatorname{\Delta}}(S)}(x)=0$ if $x\in\operatorname{\operatorname{\Delta}}(S)$ and $\chi_{\operatorname{\operatorname{\Delta}}(S)}(x)=\infty$ otherwise. [Obviously, $h_{|S}(x)=h_{S}(x)$ whenever $x\in\operatorname{\operatorname{\Delta}}(S)$ .]

Remark 1.

As per our standard conventions, we are treating $S$ interchangeably as a subset of $\mathcal{A}$ or as an element of $\mathcal{S}$ ; by analogy, to avoid notational inflation, we are also viewing $\operatorname{\operatorname{\Delta}}(S)$ as a subset of $\operatorname{\operatorname{\Delta}}(\mathcal{A})$ – more precisely, a face thereof. Finally, in all cases, the functions $h(x|S)$ , $h_{S}(x)$ and $h_{|S}(x)$ are assumed to take the value $+\infty$ for $x\in\mathbb{R}^{\mathcal{A}}\setminus\operatorname{\operatorname{\Delta}}(\mathcal{A})$ . ¶

Remark 2.

For posterity, we also note that the nested and restricted entropy functions ( $h_{S}(x)$ and $h_{|S}(x)$ respectively) are both convex – though not necessarily strictly convex – over $\operatorname{\operatorname{\Delta}}(\mathcal{A})$ . This is a consequence of the fact that each summand $x_{S}\log x_{S}$ in (A.2) is convex in $x$ and that $\delta_{k}=\mu_{k}-\mu_{k+1}\geq 0$ for all $k=1,\dotsc,L$ . Of course, any two distributions $x,x^{\prime}\in\operatorname{\operatorname{\Delta}}(\mathcal{A})$ that assign the same probabilities to elements of $S$ but not otherwise have $h_{S}(x)=h_{S}(x^{\prime})$ , so $h_{S}$ is not strictly convex over $\operatorname{\operatorname{\Delta}}(\mathcal{A})$ if $S\neq\mathcal{A}$ . However, since the function $\sum_{a\in S}x_{a}\log x_{a}$ is strictly convex over $\operatorname{\operatorname{\Delta}}(S)$ , it follows that $h_{S}$ – and hence $h_{|S}$ – is strictly convex over $\operatorname{\operatorname{\Delta}}(S)$ . ¶

Our main goal in the sequel will be to prove the following fundamental properties of the entropy functions defined above:

Proposition A.1.

For all $S\in\mathcal{S}_{\ell}$ , $\ell=1,\dotsc,L$ , and for all $x\in\operatorname{\operatorname{\Delta}}(\mathcal{A})$ , we have:

	$\displaystyle h_{S}(x)$	$\displaystyle=\sum_{S^{\prime}\mathrel{}\preccurlyeq\mathrel{}S}h(x\|S^{\prime})+\mu_{\ell}\,x_{S}\log x_{S}.$	(A.4)
Consequently, for all $x\in\operatorname{\operatorname{\Delta}}(S)$ , we have:
	$\displaystyle h_{\|S}(x)$	$\displaystyle=\sum_{S^{\prime}\mathrel{}\preccurlyeq\mathrel{}S}h(x\|S^{\prime}).$	(A.5)

Proposition A.2.

For all $S\in\mathcal{S}$ and all $y\in\mathbb{R}^{\mathcal{A}}$ , we have:

(1)

The recursively defined propensity score $y_{S}$ of $S$ as given by (9) can be equivalently expressed as

$\displaystyle y_{S}$	$\displaystyle=\max_{x\in\operatorname{\operatorname{\Delta}}(\mathcal{A})}\{\langle y,x\rangle-h_{\|S}(x)\}$	(A.6)

$\displaystyle P_{a\|S}(y)$	$\displaystyle=\frac{\partial y_{S}}{\partial y_{a}}=\operatorname*{arg\,max}_{x\in\operatorname{\operatorname{\Delta}}(\mathcal{A})}\{\langle y,x\rangle-h_{\|S}(x)\}$	(A.7)

These propositions will be the linchpin of the analysis to follow, so some remarks are in order:

Remark 3.

Note here that the maximum in (A.6) is taken over the restricted entropy function $h_{|S}$ , not the nested entropy $h_{S}$ . This distinction will play a crucial role in the sequel; in particular, since $h_{|S}$ is strictly convex over $\operatorname{\operatorname{\Delta}}(S)$ , it implies that the $\operatorname*{arg\,max}$ in (A.7) is a singleton. ¶

Remark 4.

The first part of Proposition A.2 can be rephrased more concisely (but otherwise equivalently) as

y_{S}=h^{\ast}_{|S}(y)

(A.8)

where

h^{\ast}_{|S}(y)=\max_{x\in\operatorname{\operatorname{\Delta}}(\mathcal{A})}\{\langle y,x\rangle-h_{|S}(x)\}

(A.9)

denotes the convex conjugate of $h_{|S}$ . This interpretation is conceptually important because it spells out the precise functional dependence between the (primitive) propensity score profile $y\in\mathbb{R}^{\mathcal{A}}$ and the propensity scores $y_{S}$ that are propagated to higher-tier similarity classes $S\in\mathcal{S}$ via the recursive definition (9). In particular, this observation leads to the recursive rule

\exp\left(\frac{h^{\ast}_{|S}(y)}{\mu_{\ell+1}}\right)=\sum_{S^{\prime}\mathrel{}\vartriangleleft\mathrel{}S}\exp\left(\frac{h^{\ast}_{|S^{\prime}}(y)}{\mu_{\ell+1}}\right)\quad\text{for all $S\in\mathcal{S}_{\ell}$, $\ell=0,1,\dotsc,L-1$}.

(A.10)

We will we use this representation freely in the sequel. ¶

Remark 5.

It is also worth noting that the propensity scores $y_{S}$ , $S_{\ell}\in\mathcal{S}_{\ell}$ , can also be seen as primitives for the arborescence $\mathcal{S}^{\prime}=\coprod_{k=0}^{\ell}\mathcal{S}_{k}$ obtained from $\mathcal{S}$ by excising all (proper) descendants of $\mathcal{S}_{\ell}$ . Under this interpretation, the second part of Proposition A.2 readily gives the more general expression

P_{S^{\prime}|S}(y)=\frac{\partial y_{S}}{\partial y_{S^{\prime}}}\quad\text{for all $S^{\prime}\mathrel{}\preccurlyeq\mathrel{}S$},

(A.11)

where, in the right-hand side, $y_{S}$ is to be construed as a function of $y_{S^{\prime}}$ , defined recursively via (9) applied to the truncated arborescence $\mathcal{S}^{\prime}$ . Even though we will not need this specific result, it is instructive to keep it in mind for the sequel.

The rest of this appendix is devoted to the proofs of Propositions A.1 and A.2.

Proof of Proposition A.1.

Let $\ell=\mathrm{attr}(S)$ , and fix some attribute label $k>\ell$ . We will proceed inductively by collecting all terms in (A.4) associated to the attribute $\mathcal{S}_{k}$ and then summing everything together. Indeed, we have:


$\displaystyle\mu_{k}\sum_{S^{\prime}\mathrel{}\preccurlyeq_{k}\mathrel{}S}x_{S^{\prime}}\log x_{S^{\prime}}$	$\displaystyle=\mu_{k}\sum_{S_{k-1}\mathrel{}\preccurlyeq_{k-1}\mathrel{}S}\left[\sum_{S^{\prime}\mathrel{}\vartriangleleft\mathrel{}S_{k-1}}x_{S^{\prime}}\log x_{S^{\prime}}\right]$	# collect attributes
	$\displaystyle=\mu_{k}\sum_{S_{k-1}\mathrel{}\preccurlyeq_{k-1}\mathrel{}S}\left[\sum_{S^{\prime}\mathrel{}\vartriangleleft\mathrel{}S_{k-1}}x_{S^{\prime}\|S_{k-1}}x_{S_{k-1}}\log(x_{S^{\prime}\|S_{k-1}}x_{S_{k-1}})\right]$	# by definition
	$\displaystyle=\mu_{k}\sum_{S_{k-1}\mathrel{}\preccurlyeq_{k-1}\mathrel{}S}\left[\sum_{S^{\prime}\mathrel{}\vartriangleleft\mathrel{}S_{k-1}}x_{S^{\prime}\|S_{k-1}}x_{S_{k-1}}\log x_{S^{\prime}\|S_{k-1}}\right]$	(A.12a)
	$\displaystyle+\mu_{k}\sum_{S_{k-1}\mathrel{}\preccurlyeq_{k-1}\mathrel{}S}\left[\sum_{S^{\prime}\mathrel{}\vartriangleleft\mathrel{}S_{k-1}}x_{S^{\prime}\|S_{k-1}}x_{S_{k-1}}\log x_{S_{k-1}}\right]$	(A.12b)

with the tacit understanding that any empty sum that appears above is taken equal to zero.

Now, by the definition of the nested entropy, we readily obtain that

	$\eqref{eq:ent-nest1a}=\sum_{S_{k-1}\mathrel{}\preccurlyeq_{k-1}\mathrel{}S}h(x\|S_{k-1})$	(A.13a)
whereas, by noting that $\sum_{S^{\prime}\mathrel{}\vartriangleleft\mathrel{}S_{k-1}}x_{S^{\prime}\|S_{k-1}}=1$ (by the definition of conditional class choice probabilities), Eq. A.12b becomes
	$\eqref{eq:ent-nest1b}=\mu_{k}\sum_{S_{k-1}\mathrel{}\preccurlyeq_{k-1}\mathrel{}S}x_{S_{k-1}}\log x_{S_{k-1}}.$	(A.13b)

Hence, combining Eqs. A.12, A.13a and A.13b, we get:

\mu_{k}\sum_{S^{\prime}\mathrel{}\preccurlyeq_{k}\mathrel{}S}x_{S^{\prime}}\log x_{S^{\prime}}=\sum_{S_{k-1}\mathrel{}\preccurlyeq_{k-1}\mathrel{}S}h(x|S_{k-1})+\mu_{k}\sum_{S_{k-1}\mathrel{}\preccurlyeq_{k-1}\mathrel{}S}x_{S_{k-1}}\log x_{S_{k-1}}.

(A.14)

The above expression is our basic inductive step. Indeed, summing (A.14) over all $k=L,\dotsc,\ell=\mathrm{attr}(S)$ , we obtain:

$\displaystyle h_{S}(x)$	$\displaystyle=\sum_{k=\ell}^{L}(\mu_{k}-\mu_{k+1})\;\sum_{\mathclap{S^{\prime}\mathrel{}\preccurlyeq_{k}\mathrel{}S}}\;x_{S^{\prime}}\log x_{S^{\prime}}$	# by definition
	$\displaystyle=\sum_{k=L}^{\ell+1}\left[\mu_{k}\sum_{\mathclap{S^{\prime}\mathrel{}\preccurlyeq_{k}\mathrel{}S}}x_{S^{\prime}}\log x_{S^{\prime}}-\mu_{k+1}\sum_{\mathclap{S^{\prime}\mathrel{}\preccurlyeq_{k}\mathrel{}S}}x_{S^{\prime}}\log x_{S^{\prime}}\right]+(\mu_{\ell}-\mu_{\ell+1})\,x_{S}\log x_{S}$	# isolate $S$
	$\displaystyle=\sum_{k=L}^{\ell+1}\left[\sum_{S_{k-1}\mathrel{}\preccurlyeq_{k-1}\mathrel{}S}h(x\|S_{k-1})+\mu_{k}\sum_{S_{k-1}\mathrel{}\preccurlyeq_{k-1}\mathrel{}S}x_{S_{k-1}}\log x_{S_{k-1}}-\mu_{k+1}\sum_{\mathclap{S^{\prime}\mathrel{}\preccurlyeq_{k}\mathrel{}S}}x_{S^{\prime}}\log x_{S^{\prime}}\right]$
	$\displaystyle\qquad+(\mu_{\ell}-\mu_{\ell+1})\,x_{S}\log x_{S}$	# by (A.14)
	$\displaystyle=\sum_{k=\ell}^{L-1}\sum_{S^{\prime}\mathrel{}\preccurlyeq_{k}\mathrel{}S}h(x\|S^{\prime})+\mu_{\ell}\,x_{S}\log x_{S}-\mu_{L+1}\;\sum_{\mathclap{S^{\prime}\mathrel{}\preccurlyeq_{L}\mathrel{}S}}\;x_{S^{\prime}}\log x_{S^{\prime}}$	(A.15)

with the last equality following by telescoping the terms involving $\mu_{k}$ . Now, given that $\mu_{L+1}=0$ by convention, the third sum above is zero. Finally, since the conditional entropy of $x$ relative to any childless class is zero by definition, the first sum in (A.15) can be rewritten as $\sum_{k=\ell}^{L-1}\sum_{S^{\prime}\mathrel{}\preccurlyeq_{k}\mathrel{}S}h(x|S^{\prime})=\sum_{S^{\prime}\mathrel{}\preccurlyeq\mathrel{}S}h(x|S^{\prime})$ , and our claim follows.

Finally, (A.5) is a consequence of the fact that $x_{S}=1$ whenever $x\in\operatorname{\operatorname{\Delta}}(S)$ – i.e., whenever $\operatorname{supp}(x)\subseteq S$ . ∎

Proof of Proposition A.2.

We begin by noting that the optimization problem (A.6) can be written more explicitly as

	maximize	$\displaystyle\quad\langle y,x\rangle-h_{S}(x),$		( $\operatorname{Opt}_{S}$ )
	subject to	$\displaystyle\textstyle\quad x\in\operatorname{\operatorname{\Delta}}(\mathcal{A})\text{ and }\operatorname{supp}(x)\subseteq S.$		( $\operatorname{Opt}_{S}$ )

We will proceed to show that the (unique) solution of ( $\operatorname{Opt}_{S}$ ) is given by the vector of conditional probabilities $(P_{a|S}(y))_{a\in\mathcal{A}}$ . The expression (A.6) for the maximal value of ( $\operatorname{Opt}_{S}$ ) will then be derived from Proposition A.1, and the differential representation (A.7) will follow from Legendre’s identity. We make all this precise in a series of individual steps below.

Step 1: Optimality conditions for ( $\operatorname{Opt}_{S}$ ).

For all $a\in S$ , the definition of the nested entropy gives

$\displaystyle\frac{\partial h_{S}}{\partial x_{a}}$	$\displaystyle=\sum_{k=\ell}^{L}\delta_{k}\;\sum_{\mathclap{S^{\prime}\mathrel{}\preccurlyeq_{k}\mathrel{}S}}\;\frac{\partial}{\partial x_{a}}(x_{S^{\prime}}\log x_{S^{\prime}})=\sum_{k=\ell}^{L}\delta_{k}\;\sum_{\mathclap{S^{\prime}\mathrel{}\preccurlyeq_{k}\mathrel{}S}}\;(1+\log x_{S^{\prime}})\frac{\partial x_{S^{\prime}}}{\partial x_{a}}$
	$\displaystyle=\sum_{k=\ell}^{L}\delta_{k}\;\sum_{\mathclap{S^{\prime}\mathrel{}\preccurlyeq_{k}\mathrel{}S}}\;(1+\log x_{S^{\prime}})\operatorname{\mathds{1}}\{a\in S^{\prime}\}$
	$\displaystyle=\sum_{k=\ell}^{L}\delta_{k}(1+\log x_{S_{k}})$
	$\displaystyle=\mu_{\ell}+\sum_{k=\ell}^{L}\delta_{k}\log x_{S_{k}}$	(A.16)

where $S\equiv S_{\ell}\mathrel{}\vartriangleright\mathrel{}S_{\ell+1}\mathrel{}\vartriangleright\mathrel{}\dotsm\mathrel{}\vartriangleright\mathrel{}S_{L}\equiv\{a\}$ denotes the lineage of $a$ up to $S$ (inclusive). This implies that $\partial_{a}h_{S}(x)\to-\infty$ whenever $x_{a}\to 0$ , so any solution $x$ of ( $\operatorname{Opt}_{S}$ ) must have $x_{a}>0$ for all $a\in S$ . In view of this, the first-order optimality conditions for ( $\operatorname{Opt}_{S}$ ) become

y_{a}-\frac{\partial h_{S}}{\partial x_{a}}=y_{a}-\mu_{\ell}-\sum_{k=\ell}^{L}\delta_{k}\log x_{S_{k}}=\lambda\quad\text{for all $a\in S$},

(A.17)

where $\lambda$ is the Lagrange multiplier for the equality constraint $\sum_{a\in\mathcal{A}}x_{a}=1$ .¹⁰¹⁰10Since $x_{a}>0$ for all $a\in S$ , the multipliers for the corresponding inequality constraints all vanish by complementary slackness. Thus, after rearranging terms and exponentiating, we get

x_{S_{L}}^{\delta_{L}}\cdot x_{S_{L-1}}^{\delta_{L-1}}\dotsm x_{S_{\ell}}^{\delta_{\ell}}=\frac{\exp(y_{a})}{Z},

(A.18)

for some proportionality constant $Z\equiv Z(y)>0$ .

Step 2: Solving ( $\operatorname{Opt}_{S}$ ).

The next step of our proof will focus on unrolling the chain (A.18), one attribute at a time. To start, recall that $\delta_{L}=\mu_{L}$ , so (A.18) becomes

x_{S_{L}}\cdot x_{S_{L-1}}^{\delta_{L-1}/\mu_{L}}\dotsm x_{S_{\ell}}^{\delta_{\ell}/\mu_{L}}=\frac{\exp(y_{S_{L}}/\mu_{L})}{Z^{1/\mu_{L}}},

(A.19)

where we used the fact that $S_{L}=a$ by definition. Now, since $S_{L-1}\mathrel{}\preccurlyeq\mathrel{}S_{\ell}=S$ , it follows that all children of $S_{L-1}$ are also desendants of $S$ , so (A.19) applies to all siblings of $S_{L}$ as well. Hence, summing (A.19) over $S_{L}\mathrel{}\vartriangleleft\mathrel{}S_{L-1}$ , we get

x_{S_{L-1}}\cdot x_{S_{L-1}}^{\delta_{L-1}/\mu_{L}}\dotsm x_{S_{\ell}}^{\delta_{\ell}/\mu_{L}}=\frac{\exp(y_{S_{L-1}}/\mu_{L})}{Z^{1/\mu_{L}}},

(A.20)

where we used the definition (7) of $x_{S_{L-1}}=\sum_{S_{L}\mathrel{}\vartriangleleft\mathrel{}S_{L-1}}x_{S_{L}}$ and the recursive definition (9) for $y_{S_{L-1}}$ , i.e., the fact that $\exp(y_{S_{L-1}}/\mu_{L})=\sum_{S_{L}\mathrel{}\vartriangleleft\mathrel{}S_{L-1}}\exp(y_{S_{L}}/\mu_{L})$ . Therefore, noting that

1+\frac{\delta_{L-1}}{\mu_{L}}=1+\frac{\mu_{L-1}-\mu_{L}}{\mu_{L}}=\frac{\mu_{L-1}}{\mu_{L}}

(A.21)

the product (A.20) becomes

x_{S_{L-1}}^{\mu_{L-1}}\cdot x_{S_{L-2}}^{\delta_{L-2}}\dotsm x_{S_{\ell}}^{\delta_{\ell}}=\frac{\exp(y_{S_{L-1}})}{Z}

(A.22)

or, equivalently

x_{S_{L-1}}\cdot x_{S_{L-2}}^{\delta_{L-2}/\mu_{L-1}}\dotsm x_{S_{\ell}}^{\delta_{\ell}/\mu_{L-1}}=\frac{\exp(y_{S_{L-1}}/\mu_{L-1})}{Z^{1/\mu_{L-1}}}.

(A.23)

This last equation has the same form as (A.20) applied to the chain $S_{\ell}\mathrel{}\vartriangleright\mathrel{}S_{\ell+1}\mathrel{}\vartriangleright\mathrel{}\dotsm\mathrel{}\vartriangleright\mathrel{}S_{L-1}$ instead of $S_{\ell}\mathrel{}\vartriangleright\mathrel{}S_{\ell+1}\mathrel{}\vartriangleright\mathrel{}\dotsm\mathrel{}\vartriangleright\mathrel{}S_{L}$ . Thus, proceeding inductively, we conclude that

x_{S_{k}}^{\mu_{k}}\prod_{j=k-1}^{\ell}x_{S_{j}}^{\delta_{j}}=\frac{\exp(y_{S_{k}})}{Z}\quad\text{for all $k=L,\dotsc,\ell$}

(A.24)

with the empty product $\prod_{j\in\varnothing}x_{S_{j}}^{\delta_{j}}$ taken equal to $1$ by standard convention.

Now, substituting $k\leftarrow k+1$ in (A.24), we readily get

x_{S_{k+1}}^{\mu_{k+1}}\cdot x_{S_{k}}^{\delta_{k}}\prod_{j=k-1}^{\ell}x_{S_{j}}^{\delta_{j}}=\frac{\exp(y_{S_{k+1}})}{Z}\quad\text{for all $k=L-1,\dotsc,\ell$}.

(A.25)

Consequently, recalling that $\delta_{k}=\mu_{k}-\mu_{k+1}$ and dividing (A.24) by (A.25), we get

\frac{x_{S_{k+1}}^{\mu_{k+1}}}{x_{S_{k}}^{\mu_{k+1}}}=\frac{\exp(y_{S_{k+1}})}{\exp(y_{S_{k}})},

(A.26)

and hence

\frac{x_{S_{k+1}}}{x_{S_{k}}}=\frac{\exp(y_{S_{k+1}}/\mu_{k+1})}{\exp(y_{S_{k}}/\mu_{k+1})}=P_{S_{k+1}|S_{k}}(y)

(A.27)

by the definition of the conditional logit choice model (NLC). Therefore, by unrolling the chain

x_{a|S}=\frac{x_{a}}{x_{S}}=\frac{x_{S_{L}}}{x_{S_{L-1}}}\cdot\frac{x_{S_{L-1}}}{x_{S_{L-2}}}\dotsm\frac{x_{S_{\ell+1}}}{x_{S_{\ell}}}=P_{S_{L}|S_{L-1}}(y)\times P_{S_{L-1}|S_{L-2}}(y)\times\dotsm\times P_{S_{\ell+1}|S_{\ell}}(y)

(A.28)

we obtain the nested expression

x_{a}=x_{S}\prod_{k=\ell}^{L-1}P_{S_{k+1}|S_{k}}(y)\quad\text{for all $a\in S$}.

(A.29)

Thus, with $x_{S}=1$ (by the fact that $\operatorname{supp}(x)=S$ ), we finally conclude that

x_{a}=\prod_{k=\ell}^{L-1}P_{S_{k+1}|S_{k}}(y)=P_{a|S}(y)\quad\text{for all $a\in S$}.

(A.30)

Step 3: The maximal value of ( $\operatorname{Opt}_{S}$ ).

To obtain the value of the maximization problem ( $\operatorname{Opt}_{S}$ ), we will proceed to substitute (A.30) in the expression (A.4) provided by Proposition A.1 for $h_{S}(x)$ . To that end, for all $k=\ell,\dotsc,L-1$ and all $S_{k}\mathrel{}\preccurlyeq_{k}\mathrel{}S$ , the definition (A.1) of the conditional entropy gives:

$\displaystyle h(x\|S_{k})$	$\displaystyle=\mu_{k+1}\,x_{S_{k}}\sum_{S_{k+1}\mathrel{}\vartriangleleft\mathrel{}S_{k}}x_{S_{k+1}\|S_{k}}\log x_{S_{k+1}\|S_{k}}$	# by definition
	$\displaystyle=\mu_{k+1}\,x_{S_{k}}\sum_{S_{k+1}\mathrel{}\vartriangleleft\mathrel{}S_{k}}x_{S_{k+1}\|S_{k}}\log\frac{\exp(y_{S_{k+1}}/\mu_{k+1})}{\exp(y_{S_{k}}/\mu_{k+1})}$	# by (A.27)
	$\displaystyle=x_{S_{k}}\sum_{S_{k+1}\mathrel{}\vartriangleleft\mathrel{}S_{k}}x_{S_{k+1}\|S_{k}}y_{S_{k+1}}-x_{S_{k}}y_{S_{k}}$	# since $\sum_{S_{k+1}\mathrel{}\vartriangleleft\mathrel{}S_{k}}x_{S_{k+1}\|S_{k}}=1$
	$\displaystyle=\sum_{S_{k+1}\mathrel{}\vartriangleleft\mathrel{}S_{k}}x_{S_{k+1}}y_{S_{k+1}}-x_{S_{k}}y_{S_{k}}$	(A.31)

and hence

\displaystyle\sum_{S_{k}\mathrel{}\preccurlyeq_{k}\mathrel{}S}h(x|S_{k})=\sum_{S_{k}\mathrel{}\preccurlyeq_{k}\mathrel{}S}\left[\sum_{S_{k+1}\mathrel{}\vartriangleleft\mathrel{}S_{k}}x_{S_{k+1}}y_{S_{k+1}}-x_{S_{k}}y_{S_{k}}\right]=\sum_{S_{k+1}\mathrel{}\preccurlyeq_{k+1}\mathrel{}S}x_{S_{k+1}}y_{S_{k+1}}-\sum_{S_{k}\mathrel{}\preccurlyeq_{k}\mathrel{}S}x_{S_{k}}y_{S_{k}}.

(A.32)

Thus, telescoping this last releation over $k=\ell,\dotsc,L$ and invoking Proposition A.1, we obtain:

$\displaystyle h_{S}(x)$	$\displaystyle=\sum_{S^{\prime}\mathrel{}\preccurlyeq\mathrel{}S}h(x\|S^{\prime})+\mu_{k}\,x_{S}\cancel{\log x_{S}}$	# by Proposition A.1
	$\displaystyle=\sum_{k=\ell}^{L-1}\sum_{S_{k}\mathrel{}\preccurlyeq_{k}\mathrel{}S}h(x\|S_{k})$	# collect parent classes
	$\displaystyle=\sum_{k=\ell}^{L-1}\left[\sum_{S_{k+1}\mathrel{}\preccurlyeq_{k+1}\mathrel{}S}x_{S_{k+1}}y_{S_{k+1}}-\sum_{S_{k}\mathrel{}\preccurlyeq_{k}\mathrel{}S}x_{S_{k}}y_{S_{k}}\right]$	# by (A.32)
	$\displaystyle=\langle y,x\rangle-x_{S}y_{S}$	(A.33)

where, in the second line, we used the fact that the conditional entropy $h(x|S_{L})$ relative to any childless class $S_{L}\in\mathcal{S}_{L}$ is zero by definition. Accordingly, substituting back to ( $\operatorname{Opt}_{S}$ ) we conclude that

\operatorname{val}\eqref{eq:opt-class}=\langle y,x\rangle-h_{S}(x)=x_{S}y_{S}=y_{S},

(A.34)

as claimed.

Step 4: Differential representation of conditional probabilities.

To prove the second part of the proposition, recall that the restricted entropy function $h_{|S}$ is convex, and let

h^{\ast}_{|S}(y)=\max_{x\in\operatorname{\operatorname{\Delta}}(\mathcal{A})}\{\langle y,x\rangle-h_{|S}(x)\}

(A.35)

denote its convex conjugate.¹¹¹¹11Note here that $h^{\ast}_{|S}(y)$ is bounded from above by the convex conjugate $h^{\ast}_{S}(y)$ of $h_{S}(x)$ because the latter does not include the constraint $\operatorname{supp}(x)\subseteq S$ . By standard results in convex analysis [e.g., Theorem 23.5 in 23], $h^{\ast}_{|S}$ is differentiable in $y$ and we have the Legendre identity:

x=\nabla h^{\ast}_{|S}(y)\iff y\in\partial h_{|S}(x)\iff x\in\operatorname*{arg\,max}_{x^{\prime}\in\operatorname{\operatorname{\Delta}}(\mathcal{A})}\{\langle y,x^{\prime}\rangle-h_{|S}(x^{\prime})\}

(A.36)

Now, by (A.30), we have $x_{a}=P_{a|S}(y)$ whenever $x$ solves ( $\operatorname{Opt}_{S}$ ) and hence, by Fermat’s rule, whenever $y-\partial h_{|S}(x)\ni 0$ . Our claim then follows by noting that $h^{\ast}_{|S}(y)=y_{S}$ and combining the first and third legs of the equivalence (A.36). ∎

These properties of the nested entropy function (and its restricted variant) will play a key role in deriving a suitable energy function for the nested exponential weights algorithm. We make this precise in Appendix C below.

Appendix B Auxiliary bounds and results

Throughout this appendix, we assume the following primitives:

•

A fixed sequence of real numbers $\mu_{1}\geq\mu_{2}\geq\cdots\geq\mu_{L}>0$ ; all entropy-related objects will be defined relative to this sequence as per the previous section.
•

A score vector $y\in\mathbb{R}^{\mathcal{A}}$ that defines inductively the score $y_{S}$ of any class $S\in\mathcal{S}$ using (9), as well as the associated nested choice probability $P(y)$ as per (NLC).
•

A vector of cost increments $r=(r_{S})_{S\in\mathcal{S}}\in\mathbb{R}^{\mathcal{S}}$ that defines an associated cost vector $c\in\mathbb{R}^{\mathcal{A}}$ as per (4), viz.

$c_{a}=\sum_{S\ni a}r_{S}\quad\text{for all $a\in\mathcal{A}$}.$ (B.1)

Moreover, for all $c,y\in\mathbb{R}^{\mathcal{A}}$ , we define the nested power sum function $\sigma_{c,y}\colon\mathcal{S}\backslash\mathcal{S}_{L}\rightarrow\mathbb{R}$ which, to any $S\in\mathcal{S}\backslash\mathcal{S}_{L}$ , associates the real number

\sigma_{c,y}(S)=\begin{cases}\sum\limits_{a\mathrel{}\vartriangleleft\mathrel{}S}P_{a|S}(y)\exp\left(-c_{a}/\mu_{L}\right)&\quad\text{if $\mathrm{attr}(S)=L-1$},\\ \sum\limits_{S^{\prime}\mathrel{}\vartriangleleft\mathrel{}S}P_{S^{\prime}|S}(y)\sigma_{c,y}(S^{\prime})^{\frac{\mu_{\ell+2}}{\mu_{\ell+1}}}&\quad\text{if $\mathrm{attr}(S)=\ell<L-1$}.\end{cases}

(B.2)

The following lemma links the increments of the conjugate entropy $h^{\ast}$ to the nested power sum defined above:

Lemma B.1.

For all $y\in\mathbb{R}^{\mathcal{A}}$ , $c\in\mathbb{R}^{\mathcal{A}}$ , we have

h^{\ast}\left(y-c\right)=h^{\ast}\left(y\right)+\mu_{1}\log\left(\sigma_{c,y}(\mathcal{A})\right).

(B.3)

Lemma B.1 will be proved as a corollary of the more general result below:

Lemma B.2.

Fix some $y\in\mathbb{R}^{\mathcal{A}}$ and $c\in\mathbb{R}^{\mathcal{A}}$ . Then, for all $S_{\ell}\in\mathcal{S}_{\ell}$ , $\ell<L$ ,we have

\exp\left(\frac{h^{\ast}_{|S_{\ell}}\left(y-c\right)}{\mu_{\ell+1}}\right)=\exp\left(\frac{h^{\ast}_{|S_{\ell}}\left(y\right)}{\mu_{\ell+1}}\right)\sigma_{c,y}(S_{\ell})

(B.4)

Proof of Lemma B.1.

Simply invoke Lemma B.2 with $S\leftarrow\mathcal{A}$ . ∎

Proof of Lemma B.2.

We proceed by descending induction on $\ell=\mathrm{attr}(S)$ .

Base step.

Fix some $S\in\mathcal{S}$ with $\mathrm{attr}(S)=L-1$ . We then have:

$\displaystyle\exp\left(\frac{h^{\ast}_{\|S}\left(y-c\right)}{\mu_{L}}\right)$	$\displaystyle=\sum\limits_{a\mathrel{}\vartriangleleft\mathrel{}S}\exp\left(\frac{h^{\ast}_{\|a}\left(y-c\right)}{\mu_{L}}\right)$	# by Eq. A.10
	$\displaystyle=\sum\limits_{a\mathrel{}\vartriangleleft\mathrel{}S}\exp\left(\frac{h^{\ast}_{\|a}\left(y\right)-c_{a}}{\mu_{L}}\right)$	# the a’s are leaves
	$\displaystyle=\sum\limits_{a\mathrel{}\vartriangleleft\mathrel{}S}\exp\left(\frac{h^{\ast}_{\|a}\left(y\right)}{\mu_{L}}\right)\exp\left(-\frac{c_{a}}{\mu_{L}}\right)$
	$\displaystyle=\exp\left(\frac{h^{\ast}_{\|S}\left(y\right)}{\mu_{L}}\right)\underbrace{\sum\limits_{a\mathrel{}\vartriangleleft\mathrel{}S}\left[\dfrac{\exp\left(\frac{h^{\ast}_{\|a}\left(y\right)}{\mu_{L}}\right)}{\exp\left(\frac{h^{\ast}_{\|S}\left(y\right)}{\mu_{L}}\right)}\right]\exp\left(-\frac{c_{a}}{\mu_{L}}\right)}_{=\sigma_{c,y}(S)\text{ by definition}}$
	$\displaystyle=\exp\left(\frac{h^{\ast}_{\|S}\left(y\right)}{\mu_{L}}\right)\sigma_{c,y}(S)$	(B.5)

with the last equality following from the definition of $P_{a|S}$ via (NLC) and by the definition of $\sigma_{c,y}(S)$ . This concludes the start of the induction process.

Induction step.

Fix some $S\in\mathcal{S}$ with $\mathrm{attr}(S)=\ell-1$ , $\ell<L$ , and suppose that (B.4) holds at level $\ell$ . We then have:

$\displaystyle\exp\left(\frac{h^{\ast}_{\|S}\left(y-c\right)}{\mu_{\ell}}\right)$	$\displaystyle=\sum\limits_{S^{\prime}\mathrel{}\vartriangleleft\mathrel{}S}\exp\left(\frac{h^{\ast}_{\|S^{\prime}}\left(y-c\right)}{\mu_{\ell}}\right)$
	$\displaystyle=\sum\limits_{S^{\prime}\mathrel{}\vartriangleleft\mathrel{}S}\exp\left(\frac{h^{\ast}_{\|S^{\prime}}\left(y-c\right)}{\mu_{\ell+1}}\right)^{\frac{\mu_{\ell+1}}{\mu_{\ell}}}$
	$\displaystyle=\sum\limits_{S^{\prime}\mathrel{}\vartriangleleft\mathrel{}S}\left[\exp\left(\frac{h^{\ast}_{\|S^{\prime}}\left(y\right)}{\mu_{\ell+1}}\right)\sigma_{c,y}(S^{\prime})\right]^{\frac{\mu_{\ell+1}}{\mu_{\ell}}}$	# inductive hypothesis
	$\displaystyle=\sum\limits_{S^{\prime}\mathrel{}\vartriangleleft\mathrel{}S}\exp\left(\frac{h^{\ast}_{\|S^{\prime}}\left(y\right)}{\mu_{\ell}}\right)\sigma_{c,y}(S^{\prime})^{\frac{\mu_{\ell+1}}{\mu_{\ell}}}$
	$\displaystyle=\exp\left(\frac{h^{\ast}_{\|S}\left(y\right)}{\mu_{\ell}}\right)\underbrace{\sum\limits_{S^{\prime}\mathrel{}\vartriangleleft\mathrel{}S}\left[\dfrac{\exp\left(\frac{h^{\ast}_{\|S^{\prime}}\left(y\right)}{\mu_{\ell}}\right)}{\exp\left(\frac{h^{\ast}_{\|S}\left(y\right)}{\mu_{\ell}}\right)}\right]\sigma_{c,y}(S^{\prime})^{\frac{\mu_{\ell+1}}{\mu_{\ell}}}}_{=\sigma_{c,y}(S)\text{ by definition}}$
	$\displaystyle=\exp\left(\frac{h^{\ast}_{\|S}\left(y\right)}{\mu_{L}}\right)\sigma_{c,y}(S)$	(B.6)

with the last equality following from the definition of $P_{S^{\prime}|S}$ and $\sigma_{c,y}(S)$ . This being true for all $S\in\mathcal{S}$ with $\mathrm{attr}(S)=\ell-1$ , the inductive step and – a fortiori – our proof are complete. ∎

The next lemma provides an upper bound for $\sigma_{c,y}(\mathcal{A})$ , which will in turn allow us to derive a bound for the increment of $h^{\ast}$ .

Lemma B.3.

For $y\in\mathbb{R}^{\mathcal{A}}$ and $c\in[0,+\infty)^{\mathcal{A}}$ , we have:

\sigma_{c,y}(\mathcal{A})\leq 1-\frac{1}{\mu_{1}}\left[\sum\limits_{a\in\mathcal{A}}P_{a}(y)c_{a}-\frac{1}{2\mu_{L}}\sum\limits_{a\in\mathcal{A}}P_{a}(y)c_{a}^{2}\right].

(B.7)

As in the case of B.1, Lemma B.3 will follow as a special case of the more general, class-based result below:

Lemma B.4.

Fix some $y\in\mathbb{R}^{\mathcal{A}}$ and $c\in\mathbb{R}_{+}^{\mathcal{A}}$ . Then, for all $S_{\ell}\in\mathcal{S}_{\ell}$ , $\ell<L$ ,we have

\sigma_{c,y}(S_{\ell})\leq 1-\frac{1}{\mu_{\ell+1}}\left[\sum\limits_{a\in S_{\ell}}P_{a|S_{\ell}}(y)c_{a}-\frac{1}{2\mu_{L}}\sum\limits_{a\in S_{\ell}}P_{a|S_{\ell}}(y)c_{a}^{2}\right],

(B.8)

Proof of Lemma B.3.

Simply invoke Lemma B.4 with $S\leftarrow\mathcal{A}$ . ∎

Proof of Lemma B.4.

We proceed again by descending induction on $\ell=\mathrm{attr}(S)$ .

Base step.

Fix some $S\in\mathcal{S}$ with $\mathrm{attr}(S)=L-1$ . We then have:

$\displaystyle\sigma_{c,y}(S)$	$\displaystyle=\sum\limits_{S^{\prime}\mathrel{}\vartriangleleft\mathrel{}S}P_{S^{\prime}\|S}(y)\exp(-\frac{c_{S^{\prime}}}{\mu_{L}})$
	$\displaystyle\leq\sum\limits_{S^{\prime}\mathrel{}\vartriangleleft\mathrel{}S}P_{S^{\prime}\|S}(y)(1-\frac{c_{S^{\prime}}}{\mu_{L}}\frac{c_{S^{\prime}}^{2}}{2\mu_{L}^{2}})$	# $e^{-x}\leq 1-x+x^{2}/2$ for $x\geq 0$
	$\displaystyle=1-\frac{1}{\mu_{L}}\left[\sum\limits_{S^{\prime}\mathrel{}\vartriangleleft\mathrel{}S}P_{S^{\prime}\|S}(y)c_{S^{\prime}}-\frac{1}{2\mu_{L}}\sum\limits_{S^{\prime}\mathrel{}\vartriangleleft\mathrel{}S}P_{S^{\prime}\|S}(y)c_{S^{\prime}}^{2}\right]$
	$\displaystyle=1-\frac{1}{\mu_{(L-1)+1}}\left[\sum\limits_{a\mathrel{}\vartriangleleft\mathrel{}S}P_{a\|S}(y)c_{a}-\frac{1}{2\mu_{L}}\sum\limits_{a\mathrel{}\vartriangleleft\mathrel{}S}P_{a\|S}(y)c_{a}^{2}\right]$	(B.9)

so the initialization of the induction process is complete.

Induction step.

Fix some $S\in\mathcal{S}$ with $\mathrm{attr}(S)=\ell-1$ , $\ell<L$ , and suppose that (B.8) holds at level $\ell$ . We then have:

$\displaystyle\sigma_{c,y}(S)$	$\displaystyle=\sum\limits_{S^{\prime}\mathrel{}\vartriangleleft\mathrel{}S}P_{S^{\prime}\|S}(y)\sigma_{c,y}(S^{\prime})^{\frac{\mu_{\ell+1}}{\mu_{\ell}}}$
	$\displaystyle=\sum\limits_{S^{\prime}\mathrel{}\vartriangleleft\mathrel{}S}P_{S^{\prime}\|S}(y)\left[1+\frac{1}{\mu_{\ell+1}}\left(-\sum\limits_{a\mathrel{}\vartriangleleft\mathrel{}S^{\prime}}P_{a\|S^{\prime}}(y)c_{a}+\frac{1}{2\mu_{L}}\sum\limits_{a\mathrel{}\vartriangleleft\mathrel{}S^{\prime}}P_{a\|S^{\prime}}(y)c_{a}^{2}\right)\right]^{\frac{\mu_{\ell+1}}{\mu_{\ell}}}$	# inductive hypothesis
	$\displaystyle\leq\sum\limits_{S^{\prime}\mathrel{}\vartriangleleft\mathrel{}S}P_{S^{\prime}\|S}(y)\left[1+\frac{1}{\mu_{\ell}}\left(-\sum\limits_{a\mathrel{}\vartriangleleft\mathrel{}S^{\prime}}P_{a\|S^{\prime}}(y)c_{a}+\frac{1}{2\mu_{L}}\sum\limits_{a\mathrel{}\vartriangleleft\mathrel{}S^{\prime}}P_{a\|S^{\prime}}(y)c_{a}^{2}\right)\right]$	# $(1+x)^{\beta}\leq 1+\beta x$ for $\beta\leq 1$
	$\displaystyle=1+\frac{1}{\mu_{\ell}}\left[-\sum\limits_{S^{\prime}\mathrel{}\vartriangleleft\mathrel{}S}\sum\limits_{a\mathrel{}\vartriangleleft\mathrel{}S^{\prime}}P_{a\|S^{\prime}}(y)P_{S^{\prime}\|S}(y)c_{a}+\frac{1}{2\mu_{L}}\sum\limits_{S^{\prime}\mathrel{}\vartriangleleft\mathrel{}S}\sum\limits_{a\mathrel{}\vartriangleleft\mathrel{}S^{\prime}}P_{a\|S^{\prime}}(y)P_{S^{\prime}\|S}(y)c_{a}^{2}\right]$	(B.10)
	$\displaystyle=1+\frac{1}{\mu_{(\ell-1)+1}}\left[\sum\limits_{a\mathrel{}\vartriangleleft\mathrel{}S}P_{a\|S}(y)c_{a}+\frac{1}{2\mu_{L}}\sum\limits_{a\mathrel{}\vartriangleleft\mathrel{}S}P_{a\|S}(y)c_{a}^{2}\right]$	(B.11)

This being true for all $S\in\mathcal{S}$ s.t. $\mathrm{attr}(S)=\ell-1$ , the induction step and the proof of our assertion are complete. ∎

With all this in hand, we are now in a position to upper bound the increments of the conjugate nested entropy $h^{\ast}$ .

Proposition B.1.

For $y\in\mathbb{R}^{\mathcal{A}}$ and $c\in[0,+\infty)^{\mathcal{A}}$ , we have:

h^{\ast}(y-c)-h^{\ast}(y)\leq-\langle P(y),c\rangle+\frac{1}{2\mu_{L}}\sum\limits_{a\in\mathcal{A}}P_{a}(y)c_{a}^{2}.

(B.12)

Proof.

Using Lemmas B.1 and B.3 and the concavity inequality $\log x\leq 1+x$ directly delivers our assertion. ∎

Remark 6.

It is useful to note that, given a cost increment vector $r\in\mathbb{R}^{\mathcal{S}}$ with associated aggregate costs given by $c\in\mathbb{R}^{\mathcal{A}}$ we have:

	$\displaystyle\langle P(y),c\rangle$	$\displaystyle=\sum\limits_{a\in\mathcal{A}}P_{a}(y)c_{a}$
		$\displaystyle=\sum\limits_{a\in\mathcal{A}}P_{a}(y)\sum\limits_{S\ni a}r_{S}$
		$\displaystyle=\sum\limits_{a\in\mathcal{A}}P_{a}(y)\sum\limits_{S\in\mathcal{S}}r_{S}\operatorname{\mathds{1}}_{a\in S}$
		$\displaystyle=\sum\limits_{S\in\mathcal{S}}\left[\sum\limits_{a\in\mathcal{A}}P_{a}(y)\operatorname{\mathds{1}}_{a\in S}\right]r_{S}$
		$\displaystyle=\sum\limits_{S\in\mathcal{S}}P_{S}(y)r_{S}.$

We are finally in a position to prove the basic properties of the nested importance weighted estimator (NIWE) estimator, which we restate below for convenience:

See 1

Proof.

Fix some $S\in\mathcal{S}$ with $\mathrm{attr}(S)=\ell\in\{1,\dotsc,L\}$ and lineage $\mathcal{A}\equiv S_{0}\mathrel{}\vartriangleright\mathrel{}S_{1}\mathrel{}\vartriangleright\mathrel{}\dotsm\mathrel{}\vartriangleright\mathrel{}S_{\ell}\equiv S$ . We will now prove both properties of the (NIWE) estimator.

Part 1.

We begin by showing that the estimator (NIWE) is unbiased. Indeed, we have:

	$\displaystyle\operatorname{\mathbb{E}}\left[\hat{r}_{S}\right]$	$\displaystyle=\operatorname{\mathbb{E}}\left[\frac{\operatorname{\mathds{1}}\big{\{}S_{\ell}=\hat{S}_{\ell},\dotsc,S_{1}=\hat{S}_{1}\big{\}}}{x_{S_{\ell}\|S_{\ell-1}}\!\dotsm x_{S_{2}\|S_{1}}x_{S_{1}}}r_{S_{\ell}}\right]=\operatorname{\mathbb{E}}\left[\frac{\operatorname{\mathds{1}}\big{\{}S=\hat{S}\big{\}}}{x_{S}}r_{S}\right]$		# Rewriting (NIWE)
		$\displaystyle=\frac{r_{S}}{x_{S}}\underbrace{\operatorname{\mathbb{E}}\left[\operatorname{\mathds{1}}\big{\{}S=\hat{S}\big{\}}\right]}_{x_{S}}=r_{S}.$		(B.13)

Part 2.

We now turn to the proof of the importance-weighted mean-square bound of the estimator (NIWE). In this case, for any $S\in\mathcal{S}$ , we have:

$\displaystyle\operatorname{\mathbb{E}}\left[x_{S}\hat{r}_{S}^{2}\right]$	$\displaystyle=x_{S}\operatorname{\mathbb{E}}\left[\hat{r}_{S}^{2}\right]=x_{S}\operatorname{\mathbb{E}}\left[\left(\frac{\operatorname{\mathds{1}}\big{\{}S=\hat{S}\big{\}}}{x_{S}}r_{S_{\ell}}\right)^{2}\right]$
	$\displaystyle=x_{S}\frac{r_{S_{\ell}}^{2}}{x_{S}^{2}}\operatorname{\mathbb{E}}\left[\operatorname{\mathds{1}}\big{\{}S=\hat{S}\big{\}}\right]=r_{S_{\ell}}^{2}$	# because $\operatorname{\mathbb{E}}\big{[}\operatorname{\mathds{1}}\big{\{}S=\hat{S}\big{\}}\big{]}=x_{S}$
	$\displaystyle\leq R_{S}^{2}.$	(B.14)

We are left to derive the bound for the aggregate cost estimator (13), viz.

\hat{c}_{a}=\sum\limits_{S\ni a}\hat{r}_{S}.

(B.15)

With this in mind, we can write:

$\displaystyle\sum_{a\in\mathcal{A}}x_{a}\hat{c}_{a}^{2}$	$\displaystyle=\sum_{a\in\mathcal{A}}x_{a}\left(\sum_{S\ni a}\hat{r}_{S}\right)^{2}$
	$\displaystyle=\sum_{a\in\mathcal{A}}x_{a}\left[\sum_{S\ni a}\hat{r}_{S}^{2}+2\sum_{S^{\prime}\ni a}\sum_{S\mathrel{}\succ\mathrel{}S^{\prime}}\hat{r}_{S}\hat{r}_{S^{\prime}}\right]$
	$\displaystyle=\sum_{a\in\mathcal{A}}\sum_{S\in\mathcal{S}}x_{a}\hat{r}_{S}^{2}\operatorname{\mathds{1}}_{a\in S}+2\sum_{a\in\mathcal{A}}\sum_{S^{\prime}\in\mathcal{S}}\sum_{S\mathrel{}\succ\mathrel{}S^{\prime}}x_{a}\hat{r}_{S}\hat{r}_{S^{\prime}}\operatorname{\mathds{1}}_{a\in S^{\prime}}$
	$\displaystyle=\sum_{S\in\mathcal{S}}\hat{r}_{S}^{2}\underbrace{\sum_{a\in\mathcal{A}}x_{a}\operatorname{\mathds{1}}_{a\in S}}_{x_{S}}+2\sum_{S^{\prime}\in\mathcal{S}}\sum_{S\mathrel{}\succ\mathrel{}S^{\prime}}\hat{r}_{S}\hat{r}_{S^{\prime}}\underbrace{\sum_{a\in\mathcal{A}}x_{a}\operatorname{\mathds{1}}_{a\in S^{\prime}}}_{x_{S^{\prime}}}$
	$\displaystyle=\sum_{S\in\mathcal{S}}x_{S}\hat{r}_{S}^{2}+2\sum_{S^{\prime}\in\mathcal{S}}\sum_{S\mathrel{}\succ\mathrel{}S^{\prime}}x_{S^{\prime}}\hat{r}_{S}\hat{r}_{S^{\prime}}.$	(B.16)

Now, decomposing the above sums attribute-by-attribute and taking expectations in (B.16), we get:

\operatorname{\mathbb{E}}\left[\sum_{a\in\mathcal{A}}x_{a}\hat{c}_{a}^{2}\right]=\sum_{\ell=1}^{L}\sum_{S_{\ell}\in\mathcal{S}_{\ell}}x_{S_{\ell}}\operatorname{\mathbb{E}}\left[\hat{r}_{S_{\ell}}^{2}\right]+2\sum_{1\leq\ell<\ell^{\prime}\leq L}\sum_{\begin{subarray}{c}S_{\ell}\in\mathcal{S}_{\ell}\\ S_{\ell^{\prime}}\mathrel{}\prec\mathrel{}_{\ell^{\prime}}\mathcal{S}_{\ell}\end{subarray}}x_{S_{\ell^{\prime}}}\operatorname{\mathbb{E}}\left[\hat{r}_{S_{\ell}}\hat{r}_{S_{\ell^{\prime}}}\right].

(B.17)

The first term in (B.17) can simply be bounded using (B.14). Indeed:

\sum_{\ell=1}^{L}\sum_{S_{\ell}\in\mathcal{S}_{\ell}}x_{S_{\ell}}\operatorname{\mathbb{E}}\left[\hat{r}_{S_{\ell}}^{2}\right]\leq\sum_{\ell=1}^{L}\sum_{S_{\ell}\in\mathcal{S}_{\ell}}R^{2}_{S_{\ell}}=\sum_{\ell=1}^{L}n_{\ell}\bar{R}^{2}_{\ell}.

(B.18)

with $\bar{R}_{\ell}=\sqrt{\frac{1}{n_{\ell}}\sum_{S_{\ell}\in\mathcal{S}_{\ell}}R^{2}_{S_{\ell}}}$ for any $\ell=1,\dotsc,L$ .

We now turn to the second term in (B.17). Let $\{\epsilon_{\ell,\ell^{\prime}}\}_{1\leq\ell^{\prime}<\ell\leq L}$ be any fixed sequence of positive numbers. For any $\ell,\ell^{\prime}\in\{1,\dotsc,L\}$ and any $S_{\ell}\in\mathcal{S}_{\ell}$ and $S_{\ell^{\prime}}\in\mathcal{S}_{\ell^{\prime}}$ , the Peter-Paul inequality yields:

\displaystyle 2\hat{r}_{S_{\ell^{\prime}}}\hat{r}_{S_{\ell}}\leq\frac{1}{\epsilon_{\ell,\ell^{\prime}}}\hat{r}_{S_{\ell^{\prime}}}^{2}+\epsilon_{\ell,\ell^{\prime}}\hat{r}_{S_{\ell}}^{2}

(B.19)

Injecting (B.19) into the second term of (B.17) yields:

$\displaystyle 2\sum_{1\leq\ell<\ell^{\prime}\leq L}$	$\displaystyle\sum_{\begin{subarray}{c}S_{\ell}\in\mathcal{S}_{\ell}\\ S_{\ell^{\prime}}\mathrel{}\prec\mathrel{}_{\ell^{\prime}}\mathcal{S}_{\ell}\end{subarray}}x_{S_{\ell^{\prime}}}\operatorname{\mathbb{E}}\left[\hat{r}_{S_{\ell}}\hat{r}_{S_{\ell^{\prime}}}\right]$
	$\displaystyle\leq\sum_{1\leq\ell<\ell^{\prime}\leq L}\sum_{\begin{subarray}{c}S_{\ell}\in\mathcal{S}_{\ell}\\ S_{\ell^{\prime}}\mathrel{}\prec\mathrel{}_{\ell^{\prime}}\mathcal{S}_{\ell}\end{subarray}}x_{S_{\ell^{\prime}}}\left(\frac{1}{\epsilon_{\ell,\ell^{\prime}}}\operatorname{\mathbb{E}}\left[\hat{r}_{S_{\ell^{\prime}}}^{2}\right]+\epsilon_{\ell,\ell^{\prime}}\operatorname{\mathbb{E}}\left[\hat{r}_{S_{\ell}}^{2}\right]\right)$
	$\displaystyle=\sum_{1\leq\ell<\ell^{\prime}\leq L}\frac{1}{\epsilon_{\ell,\ell^{\prime}}}\sum_{\begin{subarray}{c}S_{\ell}\in\mathcal{S}_{\ell}\\ S_{\ell^{\prime}}\mathrel{}\prec\mathrel{}_{\ell^{\prime}}\mathcal{S}_{\ell}\end{subarray}}x_{S_{\ell^{\prime}}}\operatorname{\mathbb{E}}\left[\hat{r}_{S_{\ell^{\prime}}}^{2}\right]+\sum_{1\leq\ell<\ell^{\prime}\leq L}\epsilon_{\ell,\ell^{\prime}}\sum_{\begin{subarray}{c}S_{\ell}\in\mathcal{S}_{\ell}\\ S_{\ell^{\prime}}\mathrel{}\prec\mathrel{}_{\ell^{\prime}}\mathcal{S}_{\ell}\end{subarray}}x_{S_{\ell^{\prime}}}\operatorname{\mathbb{E}}\left[\hat{r}_{S_{\ell}}^{2}\right]$
	$\displaystyle=\sum_{1\leq\ell<\ell^{\prime}\leq L}\frac{1}{\epsilon_{\ell,\ell^{\prime}}}\sum_{\begin{subarray}{c}S_{\ell}\in\mathcal{S}_{\ell}\\ S_{\ell^{\prime}}\mathrel{}\prec\mathrel{}_{\ell^{\prime}}\mathcal{S}_{\ell}\end{subarray}}x_{S_{\ell^{\prime}}}\operatorname{\mathbb{E}}\left[\hat{r}_{S_{\ell^{\prime}}}^{2}\right]+\sum_{1\leq\ell<\ell^{\prime}\leq L}\epsilon_{\ell,\ell^{\prime}}\sum_{S_{\ell}\in\mathcal{S}_{\ell}}\operatorname{\mathbb{E}}\left[\hat{r}_{S_{\ell}}^{2}\right]\underbrace{\sum_{S_{\ell^{\prime}}\mathrel{}\prec\mathrel{}_{\ell^{\prime}}\mathcal{S}_{\ell}}x_{S_{\ell^{\prime}}}}_{x_{S_{\ell}}}$
	$\displaystyle=\sum_{1\leq\ell<\ell^{\prime}\leq L}\frac{1}{\epsilon_{\ell,\ell^{\prime}}}\sum_{S_{\ell^{\prime}}\in\mathcal{S}_{\ell^{\prime}}}x_{S_{\ell^{\prime}}}\operatorname{\mathbb{E}}\left[\hat{r}_{S_{\ell^{\prime}}}^{2}\right]+\sum_{1\leq\ell<\ell^{\prime}\leq L}\epsilon_{\ell,\ell^{\prime}}\sum_{S_{\ell}\in\mathcal{S}_{\ell}}x_{S_{\ell}}\operatorname{\mathbb{E}}\left[\hat{r}_{S_{\ell}}^{2}\right]$
	$\displaystyle\leq\sum_{1\leq\ell<\ell^{\prime}\leq L}\frac{1}{\epsilon_{\ell,\ell^{\prime}}}\sum_{S_{\ell^{\prime}}\in\mathcal{S}_{\ell^{\prime}}}R_{S_{\ell^{\prime}}}^{2}+\sum_{1\leq\ell<\ell^{\prime}\leq L}\epsilon_{\ell,\ell^{\prime}}\sum_{S_{\ell}\in\mathcal{S}_{\ell}}R^{2}_{S_{\ell}}$	# by (B.14)
	$\displaystyle\leq\sum_{1\leq\ell<\ell^{\prime}\leq L}\frac{1}{\epsilon_{\ell,\ell^{\prime}}}n_{\ell^{\prime}}\bar{R}_{\ell^{\prime}}^{2}+\sum_{1\leq\ell<\ell^{\prime}\leq L}\epsilon_{\ell,\ell^{\prime}}n_{\ell}\bar{R}^{2}_{\ell}.$	(B.20)

Injecting (B.18) and (B.20) into (B.17) ensures that:

\operatorname{\mathbb{E}}\left[\sum\nolimits_{a\in\mathcal{A}}x_{a}\hat{c}_{a}^{2}\right]\leq\sum\nolimits_{\ell=1}^{L}n_{\ell}\bar{R}^{2}_{\ell}+\sum\nolimits_{1\leq\ell<\ell^{\prime}\leq L}\left(\frac{1}{\epsilon_{\ell,\ell^{\prime}}}n_{\ell^{\prime}}\bar{R}_{\ell^{\prime}}^{2}+\epsilon_{\ell,\ell^{\prime}}n_{\ell}\bar{R}^{2}_{\ell}\right)

holds for any sequence of positive numbers $\{\epsilon_{\ell,\ell^{\prime}}\}_{1\leq\ell^{\prime}<\ell\leq L}$ . As a result, taking $\epsilon_{\ell,\ell^{\prime}}=\sqrt{\frac{n_{\ell^{\prime}}}{n_{\ell}}}\frac{\bar{R}_{\ell^{\prime}}}{\bar{R}_{\ell}}$ yields the tight bound

\operatorname{\mathbb{E}}\left[\sum\nolimits_{a\in\mathcal{A}}x_{a}\hat{c}_{a}^{2}\right]\leq\sum\nolimits_{\ell=1}^{L}n_{\ell}\bar{R}^{2}_{\ell}+2\sum\nolimits_{1\leq\ell<\ell^{\prime}\leq L}\sqrt{n_{\ell^{\prime}}}\bar{R}_{\ell^{\prime}}\sqrt{n_{\ell}}\bar{R}^{2}_{\ell}=\left(\sum\nolimits_{\ell=1}^{L}\sqrt{n}_{\ell}\bar{R}_{\ell}\right)^{2},

(B.21)

which proves our original assertion. ∎

Appendix C Regret analysis

As we mentioned in the main text, the principal component of our analysis is a recursive inequality which, when telescoped over $t=1,2,\dotsc$ , will yield the desired regret bound. To establish this “template inequality”, we will first require an energy function measuring the disparity between a benchmark strategy $x\in\operatorname{\operatorname{\Delta}}(\mathcal{A})$ and a propensity score profile $y\in\mathbb{R}^{\mathcal{A}}$ . To that end, building on the notions introduced in Appendix A, let $h\colon\operatorname{\operatorname{\Delta}}(\mathcal{A})\to\mathbb{R}$ denote the total nested entropy function

	$\displaystyle h(x)$	$\displaystyle=h_{\mathcal{A}}(x)=\sum_{k=0}^{L}\delta_{k}\sum_{S_{k}\in\mathcal{S}_{k}}x_{S_{k}}\log x_{S_{k}},$	$\displaystyle\text{$x\in\operatorname{\operatorname{\Delta}}(\mathcal{A})$},$	(C.1)
and let
	$\displaystyle h^{\ast}(y)$	$\displaystyle=\max_{x\in\operatorname{\operatorname{\Delta}}(\mathcal{A})}\left\{\langle y,x\rangle-h(x)\right\},$	$\displaystyle\text{$y\in\mathbb{R}^{\mathcal{A}}$},$	(C.2)

denote the convex conjugate of $h$ so, by Proposition A.2, we have

h^{\ast}(y)=y_{\mathcal{A}}\quad\text{and}\quad P_{a}(y)=\frac{\partial h^{\ast}}{\partial y_{a}}\quad\text{for all $y\in\mathbb{R}^{\mathcal{A}}$}.

(C.3)

The Fenchel coupling between $x\in\operatorname{\operatorname{\Delta}}(\mathcal{A})$ and $y\in\mathbb{R}^{\mathcal{A}}$ is then defined as

F(x,y)=h(x)+h^{\ast}(y)-\langle y,x\rangle\quad\text{for all $x\in\operatorname{\operatorname{\Delta}}(\mathcal{A})$, $y\in\mathbb{R}^{\mathcal{A}}$},

(C.4)

and we have the following key result:

Proposition C.1.

Let $\mathcal{S}=\coprod_{\ell=0}^{L}\mathcal{S}_{\ell}$ be a similarity structure on $\mathcal{A}$ with uncertainty parameters $\mu_{1}\geq\dotsm\geq\mu_{L}>0$ . Then:

(1)

The Fenchel coupling (C.4) is positive-definite, i.e.,

F(x,y)\geq 0\qquad\text{for all $x\in\operatorname{\operatorname{\Delta}}(\mathcal{A})$ and all $y\in\mathbb{R}^{\mathcal{A}}$},

(C.5)

with equality if and only if $x$ is given by (NLC), i.e., if and only if $x=P(y)$ .

(2)

For all $x\in\mathcal{A}$ , we have

$F(x,0)=h(x)+h^{\ast}(0)=h(x)-\min h$ (C.6)

where $\min h\equiv\min_{x^{\prime}\in\operatorname{\operatorname{\Delta}}(\mathcal{A})}h(x^{\prime})$ denotes the minimum of $h$ over $\operatorname{\operatorname{\Delta}}(\mathcal{A})$ .

Proof.

Our first claim follows by setting $S\leftarrow\mathcal{A}$ in Propositions A.1 and A.2 and noting that $h_{S}=h_{|S}$ when $S=\mathcal{A}$ : indeed, by Young’s inequality, we have $h(x)+h^{\ast}(y)-\langle y,x\rangle\geq 0$ with equality if and only if $y\in\partial h(x)$ , so the equality $x=P(y)$ follows from (A.36) applied to $S\leftarrow\mathcal{A}$ and the fact that $P_{a|\mathcal{A}}(y)=P_{a}(y)$ . As for our second claim, simply note that $h^{\ast}(0)=\max_{x\in\operatorname{\operatorname{\Delta}}(\mathcal{A})}\left\{\langle 0,x\rangle-h(x)\right\}=-\min_{x\in\operatorname{\operatorname{\Delta}}(\mathcal{A})}h(x)$ and set $y\leftarrow 0$ in the definition (C.4) of the Fenchel coupling. ∎

With all this in hand, the specific energy function that we will use for our regret analysis is the “rate-deflated” Fenchel coupling

E_{t}=\frac{1}{\eta_{t}}F(p,\eta_{t}Y_{t})

(C.7)

where $p\in\operatorname{\operatorname{\Delta}}(\mathcal{A})$ is the regret comparator, $\eta_{t}$ is the algorithm’s learning rate at stage $t$ , and $Y_{t}$ is the corrsponding propensity score estimate. In words, since the mixed strategy employed by the learner at stage $t$ is $X_{t}=P(\eta_{t}Y_{t})$ , the energy $E_{t}$ essentially measures the disparity between $X_{t}$ and the target strategy $p$ (suitably rescaled by the method’s learning rate). We then have the following fundamental estimate:

Proposition C.2.

For all $p\in\operatorname{\operatorname{\Delta}}(\mathcal{A})$ and all $t=1,2,\dotsc$ , we have:

E_{t+1}\leq E_{t}+\langle\hat{c}_{t},X_{t}-p\rangle+(\eta_{t+1}^{-1}-\eta_{t}^{-1})[h(p)-\min h]+\frac{1}{\eta_{t}}F(X_{t},\eta_{t}Y_{t+1}).

(C.8)

Proof.

By the definition of $E_{t}$ , we have


$\displaystyle E_{t+1}-E_{t}=\frac{1}{\eta_{t+1}}F(p,\eta_{t+1}Y_{t+1})-\frac{1}{\eta_{t}}F(p,\eta_{t}Y_{t})$	$\displaystyle=\frac{1}{\eta_{t+1}}F(p,\eta_{t+1}Y_{t+1})-\frac{1}{\eta_{t}}F(p,\eta_{t}Y_{t+1})$	(C.9a)
	$\displaystyle+\frac{1}{\eta_{t}}F(p,\eta_{t}Y_{t+1})-\frac{1}{\eta_{t}}F(p,\eta_{t}Y_{t}).$	(C.9b)

We now proceed to upper-bound each of the two terms (C.9a) and (C.9b) separately.

For the term (C.9a), the definition of the Fenchel coupling (C.4) readily yields:

\displaystyle\eqref{eq:energy-const}

\displaystyle=\left[\frac{1}{\eta_{t+1}}-\frac{1}{\eta_{t}}\right]h(p)+\frac{1}{\eta_{t+1}}h^{\ast}(\eta_{t+1}Y_{t+1})-\frac{1}{\eta_{t}}h^{\ast}(\eta_{t}Y_{t+1}).

(C.10)

Inspired by a trick of Nesterov [21], consider the function $\varphi(\eta)=\eta^{-1}[h^{\ast}(\eta y)+\min h]$ . Then, by Proposition A.2, letting $x=P(\eta y)$ and differentiating $\varphi$ with respect to $\eta$ gives

$\displaystyle\varphi^{\prime}(\eta)$	$\displaystyle=\frac{1}{\eta}\langle y,P(\eta y)\rangle-\frac{1}{\eta^{2}}[h^{\ast}(\eta y)+\min h]$
	$\displaystyle=\frac{1}{\eta^{2}}[\langle\eta y,x\rangle-h^{\ast}(\eta y)-\min h]$
	$\displaystyle=\frac{1}{\eta^{2}}[h(x)-\min h]\geq 0.$	(C.11)

Since $\eta_{t+1}\leq\eta_{t}$ , the above shows that $\varphi(\eta_{t})\geq\varphi(\eta_{t+1})$ . Accordingly, setting $y\leftarrow Y_{t+1}$ in the definition of $\varphi$ yields

\frac{1}{\eta_{t+1}}h^{\ast}(\eta_{t+1}Y_{t+1})-\frac{1}{\eta_{t}}h^{\ast}(\eta_{t}Y_{t+1})\leq\left[\frac{1}{\eta_{t}}-\frac{1}{\eta_{t+1}}\right]\min h

(C.12)

and hence

\eqref{eq:energy-const}\leq(\eta_{t+1}^{-1}-\eta_{t}^{-1})[h(p)-\min h].

(C.13)

Now, after a straightforward rearrangement, the second term of (C.9) becomes

$\displaystyle\eqref{eq:energy-update}$	$\displaystyle=\frac{1}{\eta_{t}}\left[h(p)+h^{\ast}(\eta_{t}Y_{t+1})-\eta_{t}\langle Y_{t+1},p\rangle\right]-\frac{1}{\eta_{t}}\left[h(p)+h^{\ast}(\eta_{t}Y_{t})-\eta_{t}\langle Y_{t},p\rangle\right]$
	$\displaystyle=\frac{1}{\eta_{t}}\left[h^{\ast}(\eta_{t}Y_{t+1})-h^{\ast}(\eta_{t}Y_{t})-\eta_{t}\langle\hat{c}_{t},p\rangle\right]$	# by ( ‣ 4)
	$\displaystyle=\frac{1}{\eta_{t}}\left[h^{\ast}(\eta_{t}Y_{t+1})-h^{\ast}(\eta_{t}Y_{t})-\eta_{t}\langle\hat{c}_{t},X_{t}\rangle\right]+\langle\hat{c}_{t},X_{t}-p\rangle$	# isolate benchmark
	$\displaystyle=\frac{1}{\eta_{t}}\left[h^{\ast}(\eta_{t}Y_{t+1})-\langle\eta_{t}Y_{t},X_{t}\rangle+h(X_{t})-\eta_{t}\langle\hat{c}_{t},X_{t}\rangle\right]+\langle\hat{c}_{t},X_{t}-p\rangle$	# by Proposition A.2
	$\displaystyle=\frac{1}{\eta_{t}}F(X_{t},\eta_{t}Y_{t+1})+\langle\hat{c}_{t},X_{t}-p\rangle$	(C.14)

Thus, combining the above with (C.13), we finally obtain

	$\displaystyle E_{t+1}$	$\displaystyle=E_{t}+\eqref{eq:energy-const}+\eqref{eq:energy-update}$
		$\displaystyle\leq E_{t}+(\eta_{t+1}^{-1}-\eta_{t}^{-1})[h(p)-\min h]+\langle\hat{c}_{t},X_{t}-p\rangle+\frac{1}{\eta_{t}}F(X_{t},\eta_{t}Y_{t+1})$		(C.15)

and our proof is complete. ∎

We are now in a position to state and prove the template inequality that provides the scaffolding for our regret bounds:

See 2

Proof.

Let $Z_{t}=\hat{c}_{t}-v_{t}$ denote the error in the learner’s estimation of the $t$ -th stage payoff vector $v_{t}$ . Then, by substituting in Proposition C.2 and rearranging, we readily get:

\langle v_{t},p-X_{t}\rangle\leq E_{t}-E_{t+1}+\langle Z_{t},X_{t}-p\rangle+\left(\eta_{t+1}^{-1}-\eta_{t}^{-1}\right)[h(p)-\min h]+\eta_{t}F(p,\eta_{t}Y_{t+1})

(C.16)

Thus, telescoping over $t=1,2,\dotsc,T$ , we have

$\displaystyle\operatorname{Reg}_{p}(T)$	$\displaystyle\leq E_{1}-E_{T+1}+\left(\frac{1}{\eta_{T+1}}-\frac{1}{\eta_{1}}\right)[h(p)-\min h]$
	$\displaystyle\qquad+\sum_{t=1}^{T}\langle Z_{t},X_{t}-p\rangle+\sum_{t=1}^{T}\frac{1}{\eta_{t}}F(X_{t},\eta_{t}Y_{t+1})$
	$\displaystyle\leq\frac{h(p)-\min h}{\eta_{T+1}}+\sum_{t=1}^{T}\langle Z_{t},X_{t}-p\rangle+\sum_{t=1}^{T}\frac{1}{\eta_{t}}F(X_{t},\eta_{t}Y_{t+1})$	(C.17)

where we used the fact that \edefnit\selectfonta\edefnn) $E_{t}\geq 0$ for all $t$ (a consequence of the first part of Proposition C.1); and that \edefnit\selectfonta\edefnn) $E_{1}=\eta_{1}^{-1}[h(p)+h^{\ast}(0)]=\eta_{1}^{-1}[h(p)-\min h]$ (from the second part of the same proposition). Our claim then follows by taking expectations in (C) and noting that $\operatorname{\mathbb{E}}[Z_{t}\nonscript\,|\nonscript\,\mathopen{}\mathcal{F}_{t}]=0$ (by Proposition 1). ∎

In view of the above, our main regret bound follows by bounding the two terms in the template inequality (C.8). The second term is by far the most difficult one to bound, and is where Appendix B comes in; the first term is easier to handle, and it can be bounded as follows:

Lemma C.1.

Suppose that each class $S\in\mathcal{S}_{\ell-1}$ has at most $m_{\ell}$ children, $\ell=1,\dotsc,L$ . Then, for all $p\in\operatorname{\operatorname{\Delta}}(\mathcal{A}),$ we have

	$\displaystyle H$	$\displaystyle\leq\sum_{\ell=1}^{L}\mu_{\ell}\log m_{\ell}$	with equality iff the tree is symmetric,		(C.18)
	$\displaystyle H$	$\displaystyle=\mu\log(n)$	$\displaystyle\text{if $\mu_{1}=\mu_{2}=\dots=\mu_{L}=\mu$}.$		(C.19)

Proof.

Suppose that $y_{a}=0$ for all $a\in\mathcal{A}$ . Then, applying (9) inductively, we have:

$\displaystyle y_{S_{L}}$	$\displaystyle=0$	for all $S_{L}\in\mathcal{S}_{L}$	(C.20)
$\displaystyle y_{S_{L-1}}$	$\displaystyle=\mu_{L}\,\log\;\;\smashoperator[]{\sum_{S_{L}\mathrel{}\vartriangleleft\mathrel{}S_{L-1}}^{}}\;\;\exp(0)\leq\mu_{L}\log m_{L}$	for all $S_{L-1}\in\mathcal{S}_{L-1}$
$\displaystyle y_{S_{L-2}}$	$\displaystyle=\mu_{L-1}\,\log\;\smashoperator[]{\sum_{S_{L-1}\mathrel{}\vartriangleleft\mathrel{}S_{L-2}}^{}}\;\exp\left(\frac{y_{S_{L-1}}}{\mu_{L-1}}\right)\leq\mu_{L-1}\log m_{L-1}+\mu_{L}\log m_{L}$	for all $S_{L-2}\in\mathcal{S}_{L-2}$
	$\displaystyle\;\;\vdots$	$\displaystyle\;\;\vdots$
$\displaystyle y_{S_{\ell-1}}$	$\displaystyle=\mu_{\ell}\,\log\;\;\smashoperator[]{\sum_{S_{\ell}\mathrel{}\vartriangleleft\mathrel{}S_{\ell-1}}^{}}\;\;\exp(y_{S_{\ell}}/\mu_{\ell})\leq\sum_{k=\ell}^{L}\mu_{k}\log m_{k}$	for all $S_{\ell-1}\in\mathcal{S}_{\ell-1}$

and hence $H=h^{\ast}(0)=y_{\mathcal{A}}\leq\sum_{\ell=1}^{L}\mu_{\ell}\log m_{\ell}$ . Eq. C.18 then follows from Proposition C.1.

Now, if $\mu_{1}=\mu_{2}=\dots=\mu_{L}=\mu$ , we have

$\displaystyle H$	$\displaystyle=\log\left[\sum_{S_{1}\mathrel{}\vartriangleleft\mathrel{}S_{0}}\left[\sum_{S_{2}\mathrel{}\vartriangleleft\mathrel{}S_{1}}\!\dotsi\left[\sum_{S_{L}\mathrel{}\vartriangleleft\mathrel{}S_{L-1}}\!\!\!\!1\right]^{\frac{\mu_{L}}{\mu_{L-1}}}\!\!\!\!\!\dotsi\,\right]^{\frac{\mu_{2}}{\mu_{1}}}\right]^{\mu_{1}}$
	$\displaystyle=\mu\log\sum_{S_{1}\mathrel{}\vartriangleleft\mathrel{}S_{0}}\left[\sum_{S_{2}\mathrel{}\vartriangleleft\mathrel{}S_{1}}\dotsi\left[\sum_{S_{L}\mathrel{}\vartriangleleft\mathrel{}S_{L-1}}\!\!\!\!1\right]\dotsi\right]$
	$\displaystyle=\mu\log\left[\sum_{S_{L}\mathrel{}\vartriangleleft\mathrel{}_{L}S_{0}}1\right]=\mu\log n,$	(C.21)

which proves Eq. C.19 and completes our proof. ∎

Proposition C.3.

For all $p\in\operatorname{\operatorname{\Delta}}(\mathcal{A})$ and all $t=\{1,2,\dotsc\}$ , we have:

F(X_{t},\mu_{t}Y_{t+1})+\eta_{t}\langle\hat{c}_{t},X_{t}\rangle=h^{\ast}(\eta_{t}Y_{t}+\eta_{t}\hat{c}_{t})-h^{\ast}(\eta_{t}Y_{t}).

(C.22)

Proof.

Let $p\in\operatorname{\operatorname{\Delta}}(\mathcal{A})$ and $t\in 1,2,\dotsc$ , we simply write:

$\displaystyle F(X_{t},\eta_{t}Y_{t+1})$	$\displaystyle=h(X_{t})+h^{\ast}(\eta_{t}Y_{t+1})-\eta_{t}\langle Y_{t+1},X_{t}\rangle$
	$\displaystyle=\underbrace{h(X_{t})+h^{\ast}(\eta_{t}Y_{t})-\langle\eta_{t}Y_{t},X_{t}\rangle}_{=F(X_{t},\eta_{t}Y_{t})}+h^{\ast}(\eta_{t}Y_{t+1})-h^{\ast}(\eta_{t}Y_{t})-\eta_{t}\langle\hat{c}_{t},X_{t}\rangle$
	$\displaystyle=h^{\ast}(\eta_{t}Y_{t}+\eta_{t}\hat{c}_{t})-h^{\ast}(Y_{t})-\eta_{t}\langle\hat{c}_{t},X_{t}\rangle$	# $F(X_{t},\eta_{t}Y_{t})=0$

and our assertion follows. ∎

We are finally in a position to prove our main result (which we restate below for convenience):

See 1

Proof.

Injecting Eq. C.22 in the result of Proposition 2 and using Proposition B.1 and Eq. 16 of Proposition 1 directly yields the pseudo-regret bound (19).

Finally, if we choose $\mu_{1}=\dotsm=\mu_{L}=\sqrt{n_{\mathrm{eff}}/2}$ , Lemma C.1 gives

H=\sqrt{n_{\mathrm{eff}}/2}\log n.

(C.23)

Thus, taking $\eta_{t}=\sqrt{\log n/(2t)}$ and substituting in (19) along with (C.23) finally delivers

\operatorname{\mathbb{E}}[\operatorname{Reg}_{p}(T)]\leq 2\sqrt{n_{\mathrm{eff}}\log n\cdot T},

(C.24)

and our claim follows. ∎

Appendix D Additional Experiment Details and Discussions

In this appendix we provide additional details on the experiments as well as further discussions on the settings we presented. The code with the implementation of the algorithms as well as the code to reproduce the figures will be open-sourced and is provided along with the supplementary materials.

D.1. Experiment additional details

In the synthetic environment, at each level, the rewards are generated randomly according for each class nodes, through uniform distributions of randomly generated means and fixed bandwidth. From a level $\ell$ to the next $\ell+1$ , the rewards range are divided by a multiplicative factor $R_{\ell}/R_{\ell+1}=10$ . The implemented method of NEW uses the reward based IW. Moreover, no model selection was used in this experiment as no hyperparameter was tuned. Indeed, a decaying rate of $\frac{1}{\sqrt{t}}$ was used for the score updates for all methods, as is common in the bandit litterature [17].

D.2. Blue Bus/ Red Bus environment

We detail in Figure 5 a graphical representation of such blue bus/red bus environment, where many colors of the bus item build irrelevant alternatives. In this setting, with few arms, we run the methods up to the horizon $T=1000$ . We provide in Figure 6 the average reward of the two methods NEW and EXP3 with varying number of subclasses of the “bus”.

Figure 5. Diagram of the blue Bus/Red Bus environment.

While the NEW method ends up selecting the best alternative and having the lowest regret, the EXP3 seems to pick wrong alternative in some experiments, and ends up having higher regret and requiring more iterations to converge to higher average reward. In some of our experiments over the multiple random runs, alternatives of very low sampling probability that were sampled changed the score vector too brutally in the IPS estimator which seemed to hurt the EXP3 method much more than the NEW algorithm.

D.3. Tree structures

In this appendix we show additional results and visualisations for the second setting presented in the main paper. We start with discussions on the depth parameter $L$ and follow with the breadth parameter related to the number of child per class $M=|S|$ .

Influence of the depth parameter $L$

In Figure 7 we show the influence of the depth parameter with a fixed number of child per class. By making the tree deeper, we illustrate the effect of knowing the nested structure compared to running the logit choice to the whole alternative set. As shown in both the regret and average reward plots, the NEW method outperforms the EXP3 algorithm. While the NEW method also use an IPS estimator, it is less prone to variance issues than the EXP3 method. Indeed, due to the nested structure and the reward decay related to the ratio $R_{\ell+1}/R_{\ell}$ , the NEW estimator end up not hurting the regret by still selecting "right" parent classes.

Influence of the number of child per class (wideness) $M=|S|$

In this setting we fix the number of levels $L$ and vary the number of child per classes $M$ . In Figure 8 we can see that the NEW method outperforms the EXP3 in terms of regret and average reward. Interestingly, we see that the gap between the two methods shrinks when the number of child per class augments. This is because when the size of a class increase, the NEW method also end up having less knowledge locally and end up having a large number of alternatives to choose among.

D.4. A visualisation of the effects of NEW

In this appendix we want to show the effects of NEW through the simple setting where we assume a nested structure with $L=4$ and $M=|S|=3$ . We illustrate in Figure 9 the score vectors of the NEW method along the optimal path in the tree (path which nodes have the highest cumulated mean, i.e which generates the highest reward) along with the oracle means of the child nodes. We can see that the algorithm takes advantage of the nested structure and updates the scores vectors optimally with regards to the oracle means of all the nodes. The NEW algorithm therefore estimates correctly the rewards of the environment.

Inversely we see in Figure 10 that the EXP3 method has suffered from variance issue and selected a suboptimal alternative among the $|S|^{L}=81$ possible ones. The EXP3 did not take advantage of the nested structure and therefore did not learn as correctly as the NEW algorithm the reward values.

D.5. Cases where both algorithms perform identically

In this appendix we merely show that the implementation of the NEW and EXP3 algorithm match exactly and observe the same behavior when the number of levels $L$ is set to 1. This setting is where we have no knowledge of any nested structure, therefore both algorithms perform identically in Figure 11.

D.6. Variance plots for the synthetic experiments

We discuss here the variance of the regret at the final timestep $T=10000$ . Indeed, as shown on Figure 6 for the NEW algorithm , on Figure 7 for both algorithms EXP3 and NEW, and on Figure 8 for EXP3, some of the plots do no exhibit the monotonicity one would have expected when increasing the number of arms through $L$ or $M$ , and are even overlapping on the regret plot. This can be explained on Figures 12 for the Red Bus/Blue Bus environment, and in Figures 13 and 14 respectively for depth and wideness tree experiments. Those plots show the variances (across the 20 random seeds) of the final regret for both methods at the final step-size. In Figure 13 we see that the EXP3 arms have similar mean values with large variances, which explains why they are overlapping on the plot in Figure 3. In Figure 14 when varying $M$ we can also have a closer look on how NEW outperforms EXP3 and how the close values of NEW regrets through different $M$ can be explained by their high variance.

D.7. Reproducibility

We provide code for reproducibility of our experiments and plots, in addition to a more general implementation of both the NEW algorithm and EXP3 baseline. All experiments were run on a Mac book pro laptop, with 1 processor of 6 cores @2.6GHz (6-Core Intel Core i7). The code and all experiments can be found in the attached .zip.

Acknowledgements

P. Mertikopoulos is grateful for financial support by the French National Research Agency (ANR) in the framework of the “Investissements d’avenir” program (ANR-15-IDEX-02), the LabEx PERSYVAL (ANR-11-LABX-0025-01), MIAI@Grenoble Alpes (ANR-19-P3IA-0003), and the bilateral ANR-NRF grant ALIAS (ANR-19-CE48-0018-01).

References

Abbasi-yadkori et al. [2011] Abbasi-yadkori, Y., Pál, D., and Szepesvári, C. Improved algorithms for linear stochastic bandits. In Adv. Neural Information Processing Systems (NIPS), 2011.
Anderson et al. [1992] Anderson, S. P., de Palma, A., and Thisse, J.-F. Discrete Choice Theory of Product Differentiation. MIT Press, Cambridge, MA, 1992.
Auer et al. [1995] Auer, P., Cesa-Bianchi, N., Freund, Y., and Schapire, R. E. Gambling in a rigged casino: The adversarial multi-armed bandit problem. In Proceedings of the 36th Annual Symposium on Foundations of Computer Science, 1995.
Auer et al. [2002a] Auer, P., Cesa-Bianchi, N., and Fischer, P. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47:235–256, 2002a.
Auer et al. [2002b] Auer, P., Cesa-Bianchi, N., Freund, Y., and Schapire, R. E. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002b.
Ben-Akiva & Lerman [1985] Ben-Akiva, M. and Lerman, S. R. Discrete Choice Analysis: Theory and Application to Travel Demand. MIT Press, Cambridge, 1985.
Berge [1997] Berge, C. Topological Spaces. Dover, New York, 1997.
Bubeck & Cesa-Bianchi [2012] Bubeck, S. and Cesa-Bianchi, N. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1–122, 2012.
Bubeck et al. [2011] Bubeck, S., Munos, R., Stoltz, G., and Szepesvári, C. $\mathcal{X}$ -armed bandits. Journal of Machine Learning Research, 12:1655–1695, 2011.
Cesa-Bianchi & Lugosi [2006] Cesa-Bianchi, N. and Lugosi, G. Prediction, Learning, and Games. Cambridge University Press, 2006.
Cesa-Bianchi & Lugosi [2012] Cesa-Bianchi, N. and Lugosi, G. Combinatorial bandits. Journal of Computer and System Sciences, 78:1404–1422, 2012.
Cesa-Bianchi & Shamir [2018] Cesa-Bianchi, N. and Shamir, O. Bandit regret scaling with the effective loss range. In ALT ’18: Proceedings of the 29th International Conference on Algorithmic Learning Theory, 2018.
Cesa-Bianchi et al. [2017] Cesa-Bianchi, N., Gaillard, P., Gentile, C., and Gerchinovitz, S. Algorithmic chaining and the role of partial feedback in online nonparametric learning. In COLT ’17: Proceedings of the 30th Annual Conference on Learning Theory, 2017.
György et al. [2007] György, A., Linder, T., Lugosi, G., and Ottucsák, G. The online shortest path problem under partial monitoring. Journal of Machine Learning Research, 8:2369–2403, 2007.
Héliou et al. [2021] Héliou, A., Martin, M., Mertikopoulos, P., and Rahier, T. Zeroth-order non-convex learning via hierarchical dual averaging. In ICML ’21: Proceedings of the 38th International Conference on Machine Learning, 2021.
Kocák et al. [2014] Kocák, T., Neu, G., Valko, M., and Munos, R. Efficient learning by implicit exploration in bandit problems with side observations. In NIPS ’14: Proceedings of the 28th International Conference on Neural Information Processing Systems, 2014.
Lattimore & Szepesvári [2020] Lattimore, T. and Szepesvári, C. Bandit Algorithms. Cambridge University Press, Cambridge, UK, 2020.
Littlestone & Warmuth [1994] Littlestone, N. and Warmuth, M. K. The weighted majority algorithm. Information and Computation, 108(2):212–261, 1994.
Luce [1959] Luce, R. D. Individual Choice Behavior: A Theoretical Analysis. Wiley, New York, 1959.
McFadden [1974] McFadden, D. L. Conditional logit analysis of qualitative choice behavior. In Zarembka, P. (ed.), Frontiers in Econometrics, pp. 105–142. Academic Press, New York, NY, 1974.
Nesterov [2009] Nesterov, Y. Primal-dual subgradient methods for convex problems. Mathematical Programming, 120(1):221–259, 2009.
Neu [2015] Neu, G. Explore no more: Improved high-probability regret bounds for non-stochastic bandits. In NIPS ’15: Proceedings of the 29th International Conference on Neural Information Processing Systems, 2015.
Rockafellar [1970] Rockafellar, R. T. Convex Analysis. Princeton University Press, Princeton, NJ, 1970.
Shalev-Shwartz [2011] Shalev-Shwartz, S. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107–194, 2011.
Shalev-Shwartz & Singer [2006] Shalev-Shwartz, S. and Singer, Y. Convex repeated games and Fenchel duality. In NIPS’ 06: Proceedings of the 19th Annual Conference on Neural Information Processing Systems, pp. 1265–1272. MIT Press, 2006.
Thune & Seldin [2018] Thune, T. S. and Seldin, Y. Adaptation to easy data in prediction with limited advice. In Advances in Neural Information Processing Systems, volume 31, 2018.
Vovk [1990] Vovk, V. G. Aggregating strategies. In COLT ’90: Proceedings of the 3rd Workshop on Computational Learning Theory, pp. 371–383, 1990.

$\displaystyle h(x\|S_{k})$	$\displaystyle=\mu_{k+1}\,x_{S_{k}}\sum_{S_{k+1}\mathrel{}\vartriangleleft\mathrel{}S_{k}}x_{S_{k+1}\|S_{k}}\log x_{S_{k+1}\|S_{k}}$	# by definition
	$\displaystyle=\mu_{k+1}\,x_{S_{k}}\sum_{S_{k+1}\mathrel{}\vartriangleleft\mathrel{}S_{k}}x_{S_{k+1}\|S_{k}}\log\frac{\exp(y_{S_{k+1}}/\mu_{k+1})}{\exp(y_{S_{k}}/\mu_{k+1})}$	# by (A.27)
	$\displaystyle=x_{S_{k}}\sum_{S_{k+1}\mathrel{}\vartriangleleft\mathrel{}S_{k}}x_{S_{k+1}\|S_{k}}y_{S_{k+1}}-x_{S_{k}}y_{S_{k}}$	# since $\sum_{S_{k+1}\mathrel{}\vartriangleleft\mathrel{}S_{k}}x_{S_{k+1}\|S_{k}}=1$
	$\displaystyle=\sum_{S_{k+1}\mathrel{}\vartriangleleft\mathrel{}S_{k}}x_{S_{k+1}}y_{S_{k+1}}-x_{S_{k}}y_{S_{k}}$	(A.31)

$\displaystyle\exp\left(\frac{h^{\ast}_{\|S}\left(y-c\right)}{\mu_{L}}\right)$	$\displaystyle=\sum\limits_{a\mathrel{}\vartriangleleft\mathrel{}S}\exp\left(\frac{h^{\ast}_{\|a}\left(y-c\right)}{\mu_{L}}\right)$	# by Eq. A.10
	$\displaystyle=\sum\limits_{a\mathrel{}\vartriangleleft\mathrel{}S}\exp\left(\frac{h^{\ast}_{\|a}\left(y\right)-c_{a}}{\mu_{L}}\right)$	# the a’s are leaves
	$\displaystyle=\sum\limits_{a\mathrel{}\vartriangleleft\mathrel{}S}\exp\left(\frac{h^{\ast}_{\|a}\left(y\right)}{\mu_{L}}\right)\exp\left(-\frac{c_{a}}{\mu_{L}}\right)$
	$\displaystyle=\exp\left(\frac{h^{\ast}_{\|S}\left(y\right)}{\mu_{L}}\right)\underbrace{\sum\limits_{a\mathrel{}\vartriangleleft\mathrel{}S}\left[\dfrac{\exp\left(\frac{h^{\ast}_{\|a}\left(y\right)}{\mu_{L}}\right)}{\exp\left(\frac{h^{\ast}_{\|S}\left(y\right)}{\mu_{L}}\right)}\right]\exp\left(-\frac{c_{a}}{\mu_{L}}\right)}_{=\sigma_{c,y}(S)\text{ by definition}}$
	$\displaystyle=\exp\left(\frac{h^{\ast}_{\|S}\left(y\right)}{\mu_{L}}\right)\sigma_{c,y}(S)$	(B.5)

$\displaystyle\exp\left(\frac{h^{\ast}_{\|S}\left(y-c\right)}{\mu_{\ell}}\right)$	$\displaystyle=\sum\limits_{S^{\prime}\mathrel{}\vartriangleleft\mathrel{}S}\exp\left(\frac{h^{\ast}_{\|S^{\prime}}\left(y-c\right)}{\mu_{\ell}}\right)$
	$\displaystyle=\sum\limits_{S^{\prime}\mathrel{}\vartriangleleft\mathrel{}S}\exp\left(\frac{h^{\ast}_{\|S^{\prime}}\left(y-c\right)}{\mu_{\ell+1}}\right)^{\frac{\mu_{\ell+1}}{\mu_{\ell}}}$
	$\displaystyle=\sum\limits_{S^{\prime}\mathrel{}\vartriangleleft\mathrel{}S}\left[\exp\left(\frac{h^{\ast}_{\|S^{\prime}}\left(y\right)}{\mu_{\ell+1}}\right)\sigma_{c,y}(S^{\prime})\right]^{\frac{\mu_{\ell+1}}{\mu_{\ell}}}$	# inductive hypothesis
	$\displaystyle=\sum\limits_{S^{\prime}\mathrel{}\vartriangleleft\mathrel{}S}\exp\left(\frac{h^{\ast}_{\|S^{\prime}}\left(y\right)}{\mu_{\ell}}\right)\sigma_{c,y}(S^{\prime})^{\frac{\mu_{\ell+1}}{\mu_{\ell}}}$
	$\displaystyle=\exp\left(\frac{h^{\ast}_{\|S}\left(y\right)}{\mu_{\ell}}\right)\underbrace{\sum\limits_{S^{\prime}\mathrel{}\vartriangleleft\mathrel{}S}\left[\dfrac{\exp\left(\frac{h^{\ast}_{\|S^{\prime}}\left(y\right)}{\mu_{\ell}}\right)}{\exp\left(\frac{h^{\ast}_{\|S}\left(y\right)}{\mu_{\ell}}\right)}\right]\sigma_{c,y}(S^{\prime})^{\frac{\mu_{\ell+1}}{\mu_{\ell}}}}_{=\sigma_{c,y}(S)\text{ by definition}}$
	$\displaystyle=\exp\left(\frac{h^{\ast}_{\|S}\left(y\right)}{\mu_{L}}\right)\sigma_{c,y}(S)$	(B.6)

$\displaystyle\sigma_{c,y}(S)$	$\displaystyle=\sum\limits_{S^{\prime}\mathrel{}\vartriangleleft\mathrel{}S}P_{S^{\prime}\|S}(y)\exp(-\frac{c_{S^{\prime}}}{\mu_{L}})$
	$\displaystyle\leq\sum\limits_{S^{\prime}\mathrel{}\vartriangleleft\mathrel{}S}P_{S^{\prime}\|S}(y)(1-\frac{c_{S^{\prime}}}{\mu_{L}}\frac{c_{S^{\prime}}^{2}}{2\mu_{L}^{2}})$	# $e^{-x}\leq 1-x+x^{2}/2$ for $x\geq 0$
	$\displaystyle=1-\frac{1}{\mu_{L}}\left[\sum\limits_{S^{\prime}\mathrel{}\vartriangleleft\mathrel{}S}P_{S^{\prime}\|S}(y)c_{S^{\prime}}-\frac{1}{2\mu_{L}}\sum\limits_{S^{\prime}\mathrel{}\vartriangleleft\mathrel{}S}P_{S^{\prime}\|S}(y)c_{S^{\prime}}^{2}\right]$
	$\displaystyle=1-\frac{1}{\mu_{(L-1)+1}}\left[\sum\limits_{a\mathrel{}\vartriangleleft\mathrel{}S}P_{a\|S}(y)c_{a}-\frac{1}{2\mu_{L}}\sum\limits_{a\mathrel{}\vartriangleleft\mathrel{}S}P_{a\|S}(y)c_{a}^{2}\right]$	(B.9)

NESTED BANDITS

Abstract.

Key words and phrases:

2020 Mathematics Subject Classification:

1. Introduction

Our contributions.

Related Work.

2. The model

2.1. Attributes, classes, and the relations between them

Remark 1.

2.2. The loss model

Example 1.

Remark 2.

Remark 3.

2.3. Sequence of events

3. The nested exponential weights algorithm

3.1. Probabilities, propensities, and nested logit choice

Remark 4.

3.2. The nested importance weighted estimator

Proposition 1.

3.3. The nested exponential weights algorithm

4. Analysis and results

Theorem 1.

Proof outline of Theorem 1.

Proposition 2.

5. Numerical experiments

Benefits in the red bus/blue bus problem.

Performance in general nested structures.

Appendix A The nested entropy and its properties

Remark 1.

Remark 2.

Proposition A.1.

Proposition A.2.

Remark 3.

Remark 4.

Remark 5.

Proof of Proposition A.1.

Proof of Proposition A.2.

Step 1: Optimality conditions for (𝐎𝐩𝐭S\operatorname{Opt}_{S}).

Step 2: Solving (𝐎𝐩𝐭S\operatorname{Opt}_{S}).

Step 3: The maximal value of (𝐎𝐩𝐭S\operatorname{Opt}_{S}).

Step 4: Differential representation of conditional probabilities.

Appendix B Auxiliary bounds and results

Lemma B.1.

Lemma B.2.

Proof of Lemma B.1.

Proof of Lemma B.2.

Base step.

Induction step.

Lemma B.3.

Lemma B.4.

Proof of Lemma B.3.

Proof of Lemma B.4.

Base step.

Induction step.

Proposition B.1.

Proof.

Remark 6.

Proof.

Appendix C Regret analysis

Proposition C.1.

Proof.

Proposition C.2.

Proof.

Proof.

Lemma C.1.

Proof.

Proposition C.3.

Proof.

Proof.

Appendix D Additional Experiment Details and Discussions

D.1. Experiment additional details

D.2. Blue Bus/ Red Bus environment

D.3. Tree structures

Influence of the depth parameter LL

Influence of the number of child per class (wideness) M=|S|M=|S|

D.4. A visualisation of the effects of NEW

D.5. Cases where both algorithms perform identically

D.6. Variance plots for the synthetic experiments

D.7. Reproducibility

Step 1: Optimality conditions for ( $\operatorname{Opt}_{S}$ ).

Step 2: Solving ( $\operatorname{Opt}_{S}$ ).

Step 3: The maximal value of ( $\operatorname{Opt}_{S}$ ).

Influence of the depth parameter $L$

Influence of the number of child per class (wideness) $M=|S|$