
Distributionally Robust Batch Contextual Bandits

Nian Si ([email protected]), Department of Management Science & Engineering, Stanford University; Fan Zhang ([email protected]), Department of Management Science & Engineering, Stanford University; Zhengyuan Zhou ([email protected]), Stern School of Business, New York University; Jose Blanchet ([email protected]), Department of Management Science & Engineering, Stanford University
Abstract

Policy learning using historical observational data is an important problem that has found widespread applications. Examples include selecting offers, prices, or advertisements to send to customers, as well as selecting which medication to prescribe to a patient. However, existing literature rests on the crucial assumption that the future environment where the learned policy will be deployed is the same as the past environment that has generated the data – an assumption that is often false or too coarse an approximation. In this paper, we lift this assumption and aim to learn a distributionally robust policy with incomplete observational data. We first present a policy evaluation procedure that allows us to assess how well the policy does under worst-case environment shift. We then establish a central limit theorem type guarantee for this proposed policy evaluation scheme. Leveraging this evaluation scheme, we further propose a novel learning algorithm that is able to learn a policy that is robust to adversarial perturbations and unknown covariate shifts, with a performance guarantee based on the theory of uniform convergence. Finally, we empirically test the effectiveness of our proposed algorithm on synthetic datasets and demonstrate that it provides the robustness that is missing from standard policy learning algorithms. We conclude the paper by providing a comprehensive application of our methods in the context of a real-world voting dataset.

1 Introduction

As a result of the digitization of our economy, user-specific data has exploded across a variety of application domains: electronic medical data in health care, marketing data in product recommendation, and customer purchase/selection data in digital advertising ([10, 53, 15, 6, 68]). Such growing availability of user-specific data has ushered in an exciting era of personalized decision making, one that allows the decision maker(s) to personalize the service decisions based on each individual’s distinct features. As such, heterogeneity across individuals (i.e. best recommendation decisions vary across individuals) can be intelligently exploited to achieve better outcomes.

When abundant historical data are available, effective personalization can be achieved by learning a policy offline (i.e. from the collected data) that prescribes the right treatment/selection/recommendation based on individual characteristics. Such an approach has been fruitfully explored (see Section 1.3) and has witnessed tremendous success. However, this success is predicated on (and hence restricted to) the setting where the learned policy is deployed in the same environment from which past data has been collected. This restriction limits the applicability of the learned policy, because one would often want to deploy this learned policy in a new environment where the population characteristics are not entirely the same as before, even though the underlying personalization task is still the same. Such settings occur frequently in managerial contexts, such as when a firm wishes to enter a new market for the same business, hence facing a new, shifted environment that is similar yet different. We highlight several examples below:

\bullet Product Recommendation in a New Market. In product recommendation, different products and/or promotion offers are directed to different customers based on their covariates (e.g. age, gender, education background, income level, marital status) in order to maximize sales. Suppose the firm has enjoyed great success in the US market by deploying an effective personalized product recommendation scheme that is learned from its US operations data (such data include a database of transactions, each of which records the consumer's individual characteristics, the recommended item, and the purchase outcome), and is now looking to enter a new market in Europe. What policy should the firm use initially, given that little transaction data in the new market is available? The firm could simply reuse the same recommendation policy that is currently deployed for the US market. However, this policy could prove ineffective because the population in the new market often has idiosyncratic features that are somewhat distinct from those of the previous market. For instance, the market demographics will be different; further, even two individuals with the same observable covariates in different markets could have different preferences as a result of cultural, political, and environmental divergences. Consequently, such an environment-level “shift” renders the previously learned policy fragile. Note that in such applications, taking the standard online learning approach – gradually learning to recommend in the new market as more data in that market becomes available – is both wasteful and risky. It is wasteful because it entirely ignores the US market data even though presumably the two markets still share similarities and useful knowledge/insights can be transferred. It is risky because a “cold start” policy may be poor enough to cause the loss of customers in the initial phase, in which case little further data can be gathered. Moreover, there may be significant reputation costs associated with a poor cold start. Finally, many personalized content recommendation platforms – such as news recommendation or video recommendation – also face these problems when initiating a presence in a new market.

\bullet Feature-Based Pricing in a New Market. In feature-based pricing, a platform sells a product with features $x_{t}$ on day $t$ to a customer, and prices it at $p_{t}$, which corresponds to the action (assumed to take discrete values). The reward is the revenue collected from the customer, which is $p_{t}$ if the customer decides to purchase the product and 0 otherwise. The (generally-unknown-to-the-platform) probability of the customer purchasing this product depends on both the price $p_{t}$ and the product $x_{t}$ itself. If the platform now wishes to enter a new market to sell its products, it will need to learn a distributionally robust feature-based pricing policy (which maps $x_{t}$ to $p_{t}$) that takes into account possible distributional shifts which arise in the new market.

\bullet Loan Interest Rate Provisioning in a New Market. In loan interest rate provisioning, the loan provider (typically a bank) would gather individual information $x_{t}$ (such as personal credit history, outstanding loans, current assets and liabilities, etc.) from a potential borrower $t$, and based on that information, provision an interest rate $a_{t}$, which corresponds to our action here. In general, the interest rate $a_{t}$ will be higher for borrowers who have a larger default probability, and lower for borrowers who have little or no chance of defaulting. Of course, the default probability is not observed, and depends on both the borrower's financial situation $x_{t}$ and potentially on $a_{t}$, which determines how much payment to make in each installment. For the latter, note that a higher interest rate would translate into a larger installment payment, which may deplete the borrower's cash flow and hence make default more likely. What is often observed is the sequence of installment payments for many borrowers under a given environment over a certain horizon (say 30 years for a home loan). If the borrower defaults at any point, then all subsequent payments are zero. With such information, the reward for the bank corresponding to an individual borrower is the present value of the stream of payments made, obtained by discounting the cash flows back to the time when the loan was made (using the appropriate market discount rate). A policy here would be one that selects the best interest rate to produce the largest expected present value of future installment payment streams. When opening up a new branch in a different area (i.e. environment), the bank may want to learn a distributionally robust interest rate provisioning policy so as to take into account the environment shift. A notable feature of home loans is that they often span a long period of time. As such, even in the same market, the bank may wish to have some built-in level of robustness in case there are shifts over time.

1.1 Main Challenges

The aforementioned applications (other applications include deploying a personalized education curriculum and/or digital tutoring plan based on students' characteristics in a different school/district from where the data was collected) highlight the need to learn a personalization policy that is robust to unknown shifts in another environment, an environment of which the decision maker has little knowledge or data. Broadly speaking, there are two sources of such shifts:

  1. Covariate shift: The population composition – and hence the marginal distribution of the covariates – can change. For instance, in the product recommendation example, demographics in different markets will be different (e.g. markets in developed countries will have more people that are educated than those in developing countries), and certain segments of the population that are a majority in the old market might be a minority in the new market.

  2. Concept drift: How the outcomes depend on the covariates and prescription can also change, thereby resulting in different conditional distributions of the outcomes given the two. For instance, in product recommendation, large market segments in the US (e.g. a young population with high education level) choose to use cloud services to store personal data. In contrast, the same market segment in some emerging markets (where data privacy may be regulated differently) may prefer to buy flash drives to store at least some of the personal data.

These two shifts, often unavoidable and unknown, bring forth significant challenges in provisioning a suitable policy in the new environment, at least during the initial deployment phase. For instance, a certain subgroup (e.g. educated females in their 50s or older that live in rural areas) may be under-represented in the old environment’s population. In this case, the existing product recommendation data are insufficient to identify the optimal recommendation for this subgroup. This insufficiency is not a problem in the old environment, because the sub-optimal prescription for this subgroup will not significantly affect the overall performance of the policy given that the subgroup occurrence is not sufficiently frequent. However, with the new population, this subgroup may have a larger presence, in which case the incorrect prescription will be amplified and could translate into poor performance. In such cases, the old policy’s performance will be particularly sensitive to the marginal distribution of the subgroup in the new environment, highlighting the danger of directly deploying the same policy that worked well before.

Even if all subgroups are well-represented, the covariate shift will cause a problem if the decision maker is constrained to select a policy from a certain policy class, such as decision trees or linear policies, due to, for instance, interpretability and/or fairness considerations. In such cases, since one can only hope to learn the best policy in the policy class (rather than the absolute best policy), the optimal in-class policy will change when the underlying covariate marginal distribution shifts in the new environment, rendering the old policy potentially ineffective. Note that this could be the case even if the covariate marginal distribution is the only thing that changes.

Additionally, even more challenging is the existence of concept drift. Fundamentally, this type of shift occurs because there are hidden characteristics at a population level (cultural, political, environmental, or other factors that are beyond the decision maker's knowledge) that also influence the outcome, but are unknown, unobservable, and different across the environments. As such, the decision maker faces a challenging hurdle: because these population-level factors that may influence the outcome are unknown and unobservable, it is infeasible to explicitly model them in the first place, let alone decide what policy to deploy as a result of them. Therefore, the decision maker faces an “unknown unknown” dilemma when choosing the right policy for the new environment.

Situated in this challenging landscape, one naturally wonders if there is any hope to rigorously address the problem of policy learning in shifted environments with a significant degree of model uncertainty. This challenge leads to the following fundamental question: Using the (contextual bandits) data collected from one environment, can we learn a robust policy that would provide reliable worst-case guarantees in the presence of both types of environment shifts? If so, how can this be done in a data-efficient way? Our goal is to answer this question in the affirmative, as we shall explain.

1.2 Our Message and Managerial Insights

To answer this question we adopt a mathematical framework that allows us to formalize and quantify environmental model shifts. First, we propose a distributionally robust formulation of policy learning in batch contextual bandits that accommodates both types of environment shift. To overcome the aforementioned “unknown unknown” challenge, which presents a modelling difficulty, our formulation takes a general, fully non-parametric (and hence model-agnostic) approach to describing the shift at the distribution level: we allow the new environment to be an arbitrary distribution in a KL-neighborhood around the old environment's distribution. As such, the shift is succinctly represented by a single quantity: the KL-radius $\delta$. We then propose to learn a policy that has maximum value under the worst-case distribution, that is, a policy that is optimal for a decision maker who wishes to maximize value against an adversary who selects a worst-case distribution to minimize value. Such a distributionally robust policy – if learnable at all – would provide decision makers with the guarantee that the value of deploying this policy will never be worse – and possibly better – no matter where the new environment shifts within this KL-neighborhood.

Regarding the choice of $\delta$, we provide two complementary perspectives on its selection from a managerial viewpoint, each useful in a particular context. First, when data across different environments are available, one can estimate $\delta$ using such data. Such an approach works well (and is convenient) if the new environment differs in ways that are similar to how those other environments differ from one another. For instance, in the voting application we consider in this paper (the August 2006 primary election in Michigan [35]), voting turnout data from different cities have been collected. As such, when deploying a new policy to encourage voters to vote in a different city, one can use the $\delta$ that is estimated from data across those different cities (we describe the technical procedures for such estimation in Section 6). Second, we can view $\delta$ as a parameter that can vary and that trades off with the optimal distributionally robust value: the larger the $\delta$, the more conservative the decision maker, and the smaller the optimal distributionally robust value. We can compute the optimal distributionally robust values (and the corresponding policies) – one for each $\delta$ – for a range of $\delta$'s; see Figure 1 for an illustration. Inspecting Figure 1, we see that the difference between the optimal distributionally robust value under $\delta=0$ (i.e., no distributional shift) and the optimal distributionally robust value for a given $\delta$ represents the price of robustness. If the new environment had actually remained unchanged, then deploying a robust policy “eats” into the value. As such, the decision maker can think of this value reduction as a form of insurance premium budget paid to protect the downside, in case the new environment did shift in unexpected ways. Consequently, under a given premium budget (i.e. the amount of per-unit profit/sales that the decision maker is willing to forgo), a conservative choice would be for the decision maker to select the largest $\delta$ for which the difference is within this amount ($\delta^{*}$ in Figure 1), and obtain the maximum robustness coverage therein. Importantly, if the new environment ends up not shifting in the worst possible way, or not by as much, then the actual value will only be higher. In particular, if the new environment does not shift at all, then the insurance premium the decision maker ends up paying after using the distributionally robust policy (under $\delta^{*}$) is smaller than the budget, because its value under the old environment is larger than its value under the corresponding worst-case shift. Consequently, selecting $\delta$ this way yields the optimal worst-case policy under a given budget. (Of course, if the new environment ends up shifting by a larger amount than $\delta^{*}$ and also in the worst possible way, then the actual value could be even smaller. However, in such situations, one would be much worse off just using the old environment's optimal policy, which is not robust at all.)

Figure 1: Maximum distributionally robust values as a function of $\delta$.
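To make this selection rule concrete, the following minimal Python sketch picks the largest $\delta$ whose value reduction relative to $\delta=0$ stays within a given premium budget. The function robust_value is hypothetical: it stands for the maximum distributionally robust value at a given $\delta$ (e.g., obtained by running Algorithm 2 and evaluating the learned policy with Algorithm 1); the toy curve at the end is made up purely for illustration.

import numpy as np

def select_delta(robust_value, deltas, budget):
    # robust_value: hypothetical callable mapping delta -> maximum
    #               distributionally robust value (e.g., from Algorithm 2).
    # deltas:       candidate KL radii.
    # budget:       insurance premium the decision maker is willing to forgo.
    baseline = robust_value(0.0)  # optimal value with no distributional shift
    feasible = [d for d in deltas if baseline - robust_value(d) <= budget]
    return max(feasible) if feasible else 0.0

# Toy usage with a made-up, decreasing value curve.
toy_curve = lambda d: 1.0 - 0.5 * np.sqrt(d)
print(select_delta(toy_curve, deltas=np.linspace(0.01, 1.0, 100), budget=0.2))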

Second, we show that learning distributionally robust policies can indeed be done in a statistically efficient way. In particular, we provide an algorithmic framework that solves this problem optimally. To achieve this, we first provide a novel scheme for distributionally robust policy evaluation (Algorithm 1) that estimates the robust value of any given policy using historical data. We do so by drawing on duality theory and transforming the primal robust value estimation problem – an infinite-dimensional problem – into a dual problem that is 1-dimensional and convex, hence admitting an efficiently computable solution, which we can obtain using Newton's method. We then establish, in the form of a central limit theorem, that the proposed estimator converges to the true value at an $O_{p}(n^{-1/2})$ rate, where $n$ is the number of data points. Building upon this estimator, we devise a distributionally robust policy learning algorithm (Algorithm 2) and establish (Theorem 2) that it achieves an $O_{p}(n^{-1/2})$ finite-sample regret. Such a finite-sample regret bound informs the decision maker that in order to learn an $\epsilon$-optimal distributionally robust policy, a dataset on the order of $\frac{1}{\epsilon^{2}}$ samples suffices with high probability. Note that this result holds for any $\delta$, and the regret bound itself does not depend on $\delta$. In addition, we also characterize the fundamental limit of this problem by establishing an $\Omega(n^{-1/2})$ lower bound on the expected regret, thus making it clear that our policy learning algorithm is statistically optimal in the minimax sense. Taken together, these results highlight that we provide an optimal prescription framework for learning distributionally robust policies.

Third, we demonstrate the empirical efficacy and efficiency of our proposed algorithm by providing extensive experimental results in Section 5. We dedicate Section 6 to the voting problem mentioned previously, applying our distributionally robust policy learning framework and demonstrating its applicability on a real-world dataset. Finally, we extend our results to $f$-divergence measures and show that our framework is still applicable beyond KL-divergence (Section 7).

1.3 Related Work

As mentioned, our work is closely related to the flourishing and rapidly developing literature on offline policy learning in contextual bandits; see, e.g., [28, 85, 89, 88, 76, 61, 45, 47, 46, 91, 43, 17]. Many valuable insights have been contributed: novel policy evaluation and policy learning algorithms have been developed; sharp minimax regret guarantees have been characterized in many different settings; and extensive and illuminating experiments have been performed to offer practical advice for optimizing empirical performance. However, this line of work assumes that the environment in which the learned policy will be deployed is the same as the environment from which the training data is collected. In such settings, robustness is not a concern. Importantly, this line of work developed and used the family of doubly robust estimators for policy evaluation and learning [28, 91]. For clarification, we point out that this family of estimators, although related to robustness, does not address the robustness discussed in this paper (i.e. robustness to environment shifts). In particular, those estimators aim to stabilize statistical noise and deal with mis-specified models for rewards and propensity scores, where the underlying environment distribution is the same across test and training environments.

Correspondingly, there has also been an extensive literature on online contextual bandits, for example,  [53, 65, 31, 62, 18, 37, 3, 4, 66, 67, 44, 54, 2, 23, 54], whose focus is to develop online adaptive algorithms that effectively balance exploration and exploitation. This online setting is not the focus of our paper. See [14, 50, 74] for a few articulate expositions. Despite this, we do point out that as alluded to before and whenever possible, online learning can complement the distributionally robust policy learned and deployed initially. We leave this investigation for future work.

Additionally, there is also a rapidly growing literature on distributionally robust optimization (DRO); see, e.g., [11, 22, 40, 69, 7, 33, 57, 26, 75, 70, 49, 81, 51, 59, 83, 56, 87, 1, 86, 73, 34, 16, 36, 12, 24, 48, 27, 39]. The existing DRO literature has mostly focused on the statistical learning aspects, including supervised learning and feature selection type problems, rather than on the decision making aspects. Furthermore, much of that literature uses DRO as a tool to prevent over-fitting when making predictions, rather than to deal with distributional shifts. To the best of our knowledge, we provide the first distributionally robust formulation for policy evaluation and learning under bandit feedback and shifted environments, in a general, non-parametric space.

Some of the initial results appeared in the conference version [72], which only touched a very limited aspect of the problem: policy evaluation under shifted environments. In contrast, this paper is substantially expanded and fully addresses the entire policy learning problem. We summarize the main differences below:

  1. The conference version focused on the policy evaluation problem and only studied a non-stable version of the policy evaluation scheme, which is outperformed by the stable policy evaluation scheme analyzed here (we simply dropped the non-stable version of the policy evaluation scheme in this journal version).

  2. The conference version did not study the policy learning problem, which is our ultimate objective. Here, we provide a policy learning algorithm and establish the minimax optimal rate $O_{p}(n^{-1/2})$ for the finite-sample regret by providing the regret upper bound as well as the matching regret lower bound.

  3. We demonstrate the applicability of our policy learning algorithms and provide results on a real-world voting data set, which is missing in the conference version.

  4. We provide practical managerial insights for the choice of the critical parameter $\delta$, which governs the size of distributional shifts.

  5. We finally extend our results to $f$-divergence measures, a broader class of divergence measures that includes KL as a special case.

2 A Distributionally Robust Formulation of Batch Contextual Bandits

2.1 Batch Contextual Bandits

Let $\mathcal{A}$ be the set of $d$ actions: $\mathcal{A}=\{a^{1},a^{2},\dots,a^{d}\}$, and let $\mathcal{X}$ be the set of contexts endowed with a $\sigma$-algebra (typically a subset of $\mathbf{R}^{p}$ with the Borel $\sigma$-algebra). Following the standard contextual bandits model, we posit the existence of a fixed underlying data-generating distribution on $(X,Y(a^{1}),Y(a^{2}),\dots,Y(a^{d}))\in\mathcal{X}\times\prod_{j=1}^{d}\mathcal{Y}_{j}$, where $X\in\mathcal{X}$ denotes the context vector, and each $Y(a^{j})\in\mathcal{Y}_{j}\subset\mathbf{R}$ denotes the random reward obtained when action $a^{j}$ is selected under context $X$.

Let $\{(X_{i},A_{i},Y_{i})\}_{i=1}^{n}$ be $n$ iid observed triples that comprise the training data, where $(X_{i},Y_{i}(a^{1}),\dots,Y_{i}(a^{d}))$ are drawn iid from the fixed underlying distribution described above, which we denote by $\mathbf{P}_{0}$. Further, in the $i$-th datapoint $(X_{i},A_{i},Y_{i})$, $A_{i}$ denotes the action selected and $Y_{i}=Y_{i}(A_{i})$. In other words, $Y_{i}$ in the $i$-th datapoint is the observed reward under the context $X_{i}$ and action $A_{i}$. Note that all the other rewards $Y_{i}(a)$ (i.e. for $a\in\mathcal{A}\setminus\{A_{i}\}$), even though they exist in the model (and have been drawn according to the underlying joint distribution), are not observed.

We assume the actions in the training data are selected by some fixed underlying policy $\pi_{0}$ that is known to the decision maker, where $\pi_{0}(a\mid x)$ gives the probability of selecting action $a$ when the context is $x$. In other words, for each context $X_{i}$, a random action $A_{i}$ is selected according to the distribution $\pi_{0}(\cdot\mid X_{i})$, after which the reward $Y_{i}(A_{i})$ is observed. Finally, we use $\mathbf{P}_{0}*\pi_{0}$ to denote the product distribution on the space $\mathcal{X}\times\prod_{j=1}^{d}\mathcal{Y}_{j}\times\mathcal{A}$. We make the following assumptions on the data-generating process.

Assumption 1.

The joint distribution $(X,Y(a^{1}),Y(a^{2}),\dots,Y(a^{d}),A)$ satisfies:

  1. Unconfoundedness: $(Y(a^{1}),Y(a^{2}),\ldots,Y(a^{d}))$ is independent of $A$ conditional on $X$, i.e., $(Y(a^{1}),Y(a^{2}),\ldots,Y(a^{d}))\perp\!\!\!\perp A\mid X$.

  2. Overlap: There exists some $\eta>0$ such that $\pi_{0}(a\mid x)\geq\eta$ for all $(x,a)\in\mathcal{X}\times\mathcal{A}$.

  3. Bounded reward support: $0\leq Y(a^{i})\leq M$ for $i=1,2,\ldots,d$.

Assumption 2 (Positive densities/probabilities).

The joint distribution $(X,Y(a^{1}),Y(a^{2}),\dots,Y(a^{d}))$ satisfies one of the following assumptions:

  1. In the continuous case, for any $i=1,2,\ldots,d$, $Y(a^{i})\mid X$ has a conditional density $f_{i}(y_{i}\mid x)$ with a uniform non-zero lower bound, i.e., $f_{i}(y_{i}\mid x)\geq\underline{b}>0$ over the interval $[0,M]$ for any $x\in\mathcal{X}$.

  2. In the discrete case, for any $i=1,2,\ldots,d$, $Y(a^{i})$ is supported on a finite set $\mathbb{D}$ with cardinality greater than 1, and $Y(a^{i})\mid X$ satisfies $\mathbf{P}_{0}(Y(a^{i})=v\mid X)\geq\underline{b}>0$ almost surely for any $v\in\mathbb{D}$.

The overlap assumption ensures that some minimum positive probability of selecting each action is guaranteed regardless of what the context is. This assumption ensures sufficient exploration in collecting the training data, and indeed, many operational policies have $\epsilon$-greedy components. Assumption 1 is standard and commonly adopted in both the estimation literature ([63, 41, 42]) and the policy learning literature ([85, 89, 47, 76, 90]). Assumption 2 is made to ensure the $O_{p}(n^{-1/2})$ convergence rate.

Remark 1.

In standard contextual bandits terminology, $\mu_{a}(x)\triangleq\mathbf{E}_{\mathbf{P}_{0}}[Y_{i}(a)\mid X_{i}=x]$ is known as the mean reward function for action $a$. Depending on whether one assumes a parametric form for $\mu_{a}(x)$ or not, one needs to employ different statistical methodologies. In particular, when $\mu_{a}(x)$ is a linear function of $x$, this setting is known as linear contextual bandits, an important and extensively studied subclass of contextual bandits. In this paper, we do not make any structural assumption on $\mu_{a}(x)$: we are in the non-parametric contextual bandits regime and work with general underlying data-generating distributions $\mathbf{P}_{0}$.

2.2 Standard Policy Learning

With the aforementioned setup, the standard goal is to learn a good policy from a fixed deterministic policy class $\Pi$ using the training data; this is often known as the batch contextual bandits problem (in contrast to online contextual bandits), because all the data have already been collected before the decision maker aims to learn a policy. A policy $\pi:\mathcal{X}\rightarrow\mathcal{A}$ is a function that maps a context vector $x$ to an action, and the performance of $\pi$ is measured by the expected reward this policy generates, as characterized by the policy value function:

Definition 1.

The policy value function $Q:\Pi\rightarrow\mathbf{R}$ is defined as $Q(\pi)\triangleq\mathbf{E}_{\mathbf{P}_{0}}[Y(\pi(X))]$, where the expectation is taken with respect to the randomness in the underlying joint distribution $\mathbf{P}_{0}$ of $(X,Y(a^{1}),Y(a^{2}),\dots,Y(a^{d}))$.

With this definition, the optimal policy is a policy that maximizes the policy value function. The objective in the standard policy learning context is to learn a policy $\pi$ whose policy value is as large as possible, which is equivalent to minimizing the discrepancy between the performance of the optimal policy and that of the learned policy $\pi$.
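For reference, a standard (non-robust) inverse-propensity-weighted estimate of $Q(\pi)$ from the logged data can be sketched as follows; this is not the robust estimator developed later in the paper, and the arrays X, A, Y and the functions pi, pi0 are placeholders for the dataset, a candidate policy, and the known logging policy $\pi_{0}$.

import numpy as np

def ipw_value(X, A, Y, pi, pi0):
    # Inverse-propensity-weighted estimate of Q(pi) = E_{P0}[Y(pi(X))].
    # X, A, Y: contexts, logged actions, observed rewards.
    # pi:      candidate policy, pi(x) -> action.
    # pi0:     logging policy, pi0(a, x) -> probability of action a at context x.
    Y = np.asarray(Y, dtype=float)
    match = np.array([pi(x) == a for x, a in zip(X, A)], dtype=float)
    weights = match / np.array([pi0(a, x) for x, a in zip(X, A)])
    return float(np.mean(weights * Y))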

2.3 Distributionally Robust Policy Learning

Using the policy value function $Q(\cdot)$ defined in Definition 1 to measure the quality of a policy carries an implicit assumption on the part of the decision maker: the environment that generated the training data is the same as the environment where the policy will be deployed. This is manifested in the fact that the expectation in $Q(\cdot)$ is taken with respect to the same underlying distribution $\mathbf{P}_{0}$. However, the underlying data-generating distribution may be different in the training environment and the test environment. In such cases, a policy learned with the goal of maximizing the value under $\mathbf{P}_{0}$ may not work well in the new test environment.

To address this issue, we propose a distributionally robust formulation for policy learning, where we explicitly incorporate into the learning phase the consideration that the test distribution may not be the same as the training distribution $\mathbf{P}_{0}$. To that end, we start by introducing some terminology. First, the KL-divergence between two probability measures $\mathbf{P}$ and $\mathbf{P}_{0}$, denoted by $D(\mathbf{P}\,\|\,\mathbf{P}_{0})$, is defined as $D(\mathbf{P}\,\|\,\mathbf{P}_{0})\triangleq\int\log\left(\frac{{\rm d}\mathbf{P}}{{\rm d}\mathbf{P}_{0}}\right){\rm d}\mathbf{P}$. With KL-divergence, we can define a class of neighborhood distributions around a given distribution. Specifically, the distributional uncertainty set $\mathcal{U}_{\mathbf{P}_{0}}(\delta)$ of size $\delta$ is defined as $\mathcal{U}_{\mathbf{P}_{0}}(\delta)\triangleq\{\mathbf{P}\ll\mathbf{P}_{0}\mid D(\mathbf{P}\,\|\,\mathbf{P}_{0})\leq\delta\}$, where $\mathbf{P}\ll\mathbf{P}_{0}$ means $\mathbf{P}$ is absolutely continuous with respect to $\mathbf{P}_{0}$. When the uncertainty radius $\delta$ is clear from the context, we sometimes drop $\delta$ for notational simplicity and write $\mathcal{U}_{\mathbf{P}_{0}}$ instead. We remark that in practice, $\delta$ can be selected empirically. For example, we can collect historical distributional data from different regions and compute distances between them. Then, although the direction of the distributional shift is unclear, a reasonable shift size $\delta$ can be estimated. Furthermore, we can also check the sensitivity of the robust policy with respect to $\delta$ and choose an appropriate value according to a given insurance premium budget. We detail these two approaches in Section 6.4.
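As a rough illustration of the first, data-driven way of choosing $\delta$, one can bin a covariate observed in two regions and compute the KL divergence between the resulting empirical (discrete) distributions. The sketch below is only meant to convey this calculation under a simple binning assumption, not the full procedure of Section 6.4, and the two samples are synthetic.

import numpy as np

def empirical_kl(sample_p, sample_q, bins=20, eps=1e-12):
    # KL(P || Q) between binned empirical distributions of two samples.
    lo = min(sample_p.min(), sample_q.min())
    hi = max(sample_p.max(), sample_q.max())
    p, _ = np.histogram(sample_p, bins=bins, range=(lo, hi))
    q, _ = np.histogram(sample_q, bins=bins, range=(lo, hi))
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

# Toy usage: a scalar covariate from two hypothetical regions.
rng = np.random.default_rng(0)
cov_region_a = rng.normal(0.0, 1.0, size=5000)
cov_region_b = rng.normal(0.3, 1.1, size=5000)
print(empirical_kl(cov_region_a, cov_region_b))  # a candidate value for delta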

Definition 2.

For a given $\delta>0$, the distributionally robust value function $Q_{\rm DRO}:\Pi\rightarrow\mathbf{R}$ is defined as $Q_{\rm DRO}(\pi)\triangleq\inf_{\mathbf{P}\in\mathcal{U}_{\mathbf{P}_{0}}(\delta)}\mathbf{E}_{\mathbf{P}}[Y(\pi(X))]$.

In other words, $Q_{\rm DRO}(\pi)$ measures the performance of a policy $\pi$ by evaluating how it performs in the worst possible environment among the set of all environments that are $\delta$-close to $\mathbf{P}_{0}$. With this definition, the optimal policy $\pi^{*}_{\rm DRO}$ is a policy that maximizes the distributionally robust value function: $\pi^{*}_{\rm DRO}\in\arg\max_{\pi\in\Pi}\{Q_{\rm DRO}(\pi)\}$. If such an optimal policy does not exist, we can always construct a sequence of policies whose distributionally robust values converge to the supremum $\sup_{\pi\in\Pi}\{Q_{\rm DRO}(\pi)\}$, and all of our results generalize to this case. Therefore, for simplicity, we assume the optimal policy exists. To be robust to the changes between the test environment and the training environment, our goal is to learn a policy whose distributionally robust policy value is as large as possible, or equivalently, as close to that of the best distributionally robust policy as possible. We formalize this notion in Definition 3.

Definition 3.

The distributionally robust regret $R_{\rm DRO}(\pi)$ of a policy $\pi\in\Pi$ is defined as
\[
R_{\rm DRO}(\pi)\triangleq\max_{\pi^{\prime}\in\Pi}\inf_{\mathbf{P}\in\mathcal{U}_{\mathbf{P}_{0}}(\delta)}\mathbf{E}_{\mathbf{P}}[Y(\pi^{\prime}(X))]-\inf_{\mathbf{P}\in\mathcal{U}_{\mathbf{P}_{0}}(\delta)}\mathbf{E}_{\mathbf{P}}[Y(\pi(X))].
\]

Several things are worth noting. First, per its definition, we can rewrite the regret as $R_{\rm DRO}(\pi)=Q_{\rm DRO}(\pi^{*}_{\rm DRO})-Q_{\rm DRO}(\pi)$. Second, the underlying random policy that generated the observational data (specifically the $A_{i}$'s) can be entirely unrelated to the policy class $\Pi$. Third, when a policy $\hat{\pi}$ is learned from data and hence $R_{\rm DRO}(\hat{\pi})$ is a random variable, a regret bound in such cases is customarily a high-probability bound that highlights how the regret scales as a function of the size $n$ of the dataset, the error probability, and other important parameters of the problem, e.g. the complexity of the policy class $\Pi$.

Regarding some other definitions of regret, one may consider choices such as

\[
\sup_{\mathbf{P}\in\mathcal{U}_{\mathbf{P}_{0}}(\delta)}\max_{\pi^{\prime}\in\Pi}\left(\mathbf{E}_{\mathbf{P}}[Y(\pi^{\prime}(X))]-\mathbf{E}_{\mathbf{P}}[Y(\pi(X))]\right),
\]

where for a fixed distribution $\mathbf{P}$, one compares the learned policy with the best policy that could be obtained under perfect knowledge of $\mathbf{P}$. In this definition the adversary is very strong in the sense that it knows the test domain. Therefore, a problem with this definition is that the regret does not converge to zero as $n$ goes to infinity (even for a randomized policy $\pi$). We will not discuss this notion in this paper.

3 Distributionally Robust Policy Evaluation

3.1 Algorithm

In order to learn a distributionally robust policy – one that maximizes $Q_{\rm DRO}(\pi)$ – a key step lies in accurately estimating a given policy $\pi$'s distributionally robust value. We devote this section to tackling this problem.

Lemma 1 (Strong Duality).

For any policy $\pi\in\Pi$, we have

\begin{align}
\inf_{\mathbf{P}\in\mathcal{U}_{\mathbf{P}_{0}}(\delta)}\mathbf{E}_{\mathbf{P}}\left[Y(\pi(X))\right]
&=\sup_{\alpha\geq 0}\left\{-\alpha\log\mathbf{E}_{\mathbf{P}_{0}}\left[\exp(-Y(\pi(X))/\alpha)\right]-\alpha\delta\right\} \tag{1}\\
&=\sup_{\alpha\geq 0}\left\{-\alpha\log\mathbf{E}_{\mathbf{P}_{0}*\pi_{0}}\left[\frac{\exp(-Y(A)/\alpha)\mathbf{1}\{\pi(X)=A\}}{\pi_{0}(A\mid X)}\right]-\alpha\delta\right\}, \tag{2}
\end{align}

where $\mathbf{1}\{\cdot\}$ denotes the indicator function.

The proof of Lemma 1 is in Appendix A.2.

Remark 2.

When $\alpha=0$, following the discussion of Case 1 after Assumption 1 in Hu and Hong [40], we define

\[
\left.-\alpha\log\mathbf{E}_{\mathbf{P}_{0}}\left[\exp(-Y(\pi(X))/\alpha)\right]-\alpha\delta\right|_{\alpha=0}=\operatorname*{ess\,inf}\{Y(\pi(X))\},
\]

where $\operatorname*{ess\,inf}$ denotes the essential infimum. Therefore, $-\alpha\log\mathbf{E}_{\mathbf{P}_{0}}\left[\exp(-Y(\pi(X))/\alpha)\right]-\alpha\delta$ is right continuous at zero. In fact, Lemma A12 in Appendix A.3 shows that the optimal value is not attained at $\alpha=0$ if Assumption 2.1 is enforced.

The above strong duality allows us to transform the original problem of evaluating $\inf_{\mathbf{P}\in\mathcal{U}_{\mathbf{P}_{0}}(\delta)}\mathbf{E}_{\mathbf{P}}\left[Y(\pi(X))\right]$, where the (primal) variable is an infinite-dimensional distribution $\mathbf{P}$, into a simpler problem where the (dual) variable is a positive scalar $\alpha$. Note that in the dual problem, the expectation is taken with respect to the same underlying distribution $\mathbf{P}_{0}$. This allows us to use an easily computable plug-in estimate of the distributionally robust policy value. To ease reference in the subsequent analysis of our algorithm, we collect the important terms in the following definition.

Definition 4.

Let $\{(X_{i},A_{i},Y_{i})\}_{i=1}^{n}$ be a given dataset. We define

\[
W_{i}(\pi,\alpha)\triangleq\frac{\mathbf{1}\{\pi(X_{i})=A_{i}\}}{\pi_{0}(A_{i}\mid X_{i})}\exp(-Y_{i}(A_{i})/\alpha),\qquad S_{n}^{\pi}\triangleq\frac{1}{n}\sum_{i=1}^{n}\frac{\mathbf{1}\{\pi(X_{i})=A_{i}\}}{\pi_{0}\left(A_{i}\mid X_{i}\right)},
\]

and

\[
\hat{W}_{n}(\pi,\alpha)\triangleq\frac{1}{nS_{n}^{\pi}}\sum_{i=1}^{n}W_{i}(\pi,\alpha).
\]

We also define the dual objective function and the empirical dual objective function as

\[
\phi(\pi,\alpha)\triangleq-\alpha\log\mathbf{E}_{\mathbf{P}_{0}}\left[\exp(-Y(\pi(X))/\alpha)\right]-\alpha\delta
\]

and

\[
\hat{\phi}_{n}(\pi,\alpha)\triangleq-\alpha\log\hat{W}_{n}(\pi,\alpha)-\alpha\delta,
\]

respectively.

Then, we define the distributionally robust value estimator and the optimal dual variable using the following notation.

  1. The distributionally robust value estimator $\hat{Q}_{\rm DRO}:\Pi\rightarrow\mathbf{R}$ is defined by $\hat{Q}_{\rm DRO}(\pi)\triangleq\sup_{\alpha\geq 0}\{\hat{\phi}_{n}(\pi,\alpha)\}$.

  2. The optimal dual variable $\alpha^{\ast}(\pi)$ is defined by $\alpha^{\ast}(\pi)\triangleq\arg\max_{\alpha\geq 0}\{\phi(\pi,\alpha)\}$, and the empirical dual variable is denoted by $\alpha_{n}(\pi)\in\arg\max_{\alpha\geq 0}\{\hat{\phi}_{n}(\pi,\alpha)\}$.

$W_{i}(\pi,\alpha)$ is a realization of the random variable inside the expectation in equation (2), and we approximate $\mathbf{E}_{\mathbf{P}_{0}*\pi_{0}}\left[\frac{\exp(-Y(A)/\alpha)\mathbf{1}\{\pi(X)=A\}}{\pi_{0}(A\mid X)}\right]$ by its empirical average $\hat{W}_{n}(\pi,\alpha)$ with a normalization factor $S_{n}^{\pi}$. Note that $\mathbf{E}[S_{n}^{\pi}]=1$ and $S_{n}^{\pi}\rightarrow 1$ almost surely. Therefore, the normalized $\hat{W}_{n}(\pi,\alpha)$ is asymptotically equivalent to the unnormalized $\frac{1}{n}\sum_{i=1}^{n}W_{i}(\pi,\alpha)$. The reason for dividing by the normalization factor $S_{n}^{\pi}$ is that it makes the evaluation more stable; see the discussions in [72] and [77]. The upper bound on $\alpha^{\ast}(\pi)$ proven in Lemma A11 of Appendix A.3 establishes the validity of the definition of $\alpha^{\ast}(\pi)$, namely, that $\alpha^{\ast}(\pi)$ is attainable and unique.
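To make the quantities in Definition 4 concrete, the following minimal sketch evaluates $W_{i}(\pi,\alpha)$, $S_{n}^{\pi}$, $\hat{W}_{n}(\pi,\alpha)$, and the empirical dual objective $\hat{\phi}_{n}(\pi,\alpha)$ on a logged dataset; the data arrays and the policy functions pi and pi0 are placeholders.

import numpy as np

def dual_objective(alpha, delta, X, A, Y, pi, pi0):
    # Empirical dual objective phi_hat_n(pi, alpha) from Definition 4.
    Y = np.asarray(Y, dtype=float)
    match = np.array([pi(x) == a for x, a in zip(X, A)], dtype=float)
    prop = np.array([pi0(a, x) for x, a in zip(X, A)])
    W = (match / prop) * np.exp(-Y / alpha)        # W_i(pi, alpha)
    S_n = np.mean(match / prop)                    # normalization S_n^pi
    W_hat = np.mean(W) / S_n                       # normalized W_hat_n(pi, alpha)
    return -alpha * np.log(W_hat) - alpha * delta  # phi_hat_n(pi, alpha)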

Remark 3.

Another recent paper [30] also discusses a similar problem. The estimator they propose is equivalent to

supa0αlog(1ni=1nexp(1{π(Xi)=Ai}Yiαπ0(Ai|Xi)))αδ.\sup_{a\geq 0}-\alpha\log\left(\frac{1}{n}\sum_{i=1}^{n}\exp\left(-\frac{1\left\{\pi(X_{i})=A_{i}\right\}Y_{i}}{\alpha\pi_{0}\left(A_{i}|X_{i}\right)}\right)\right)-\alpha\delta. (3)

We remark that their estimator is not consistent, namely, the estimator (3) does not converge to $\inf_{\mathbf{P}\in\mathcal{U}_{\mathbf{P}_{0}}(\delta)}\mathbf{E}_{\mathbf{P}}\left[Y(\pi(X))\right]$ as $n$ goes to infinity.

By Hu and Hong [40, Proposition 1 and their discussion following the proposition], we provide a characterization of the worst case distribution in Proposition 1.

Proposition 1 (The Worst Case Distribution).

Suppose that Assumption 1 is imposed. For any policy $\pi\in\Pi$, when $\alpha^{*}(\pi)>0$, we define a probability measure $\mathbf{P}(\pi)$ supported on $\mathcal{X}\times\prod_{j=1}^{d}\mathcal{Y}_{j}$ such that

\[
\frac{{\rm d}\mathbf{P}(\pi)}{{\rm d}\mathbf{P}_{0}}=\frac{\exp(-Y(\pi(X))/\alpha^{*}(\pi))}{\mathbf{E}_{\mathbf{P}_{0}}[\exp(-Y(\pi(X))/\alpha^{*}(\pi))]},
\]

where ${\rm d}\mathbf{P}(\pi)/{\rm d}\mathbf{P}_{0}$ is the Radon-Nikodym derivative; when $\alpha^{*}(\pi)=0$, we define

\[
\frac{{\rm d}\mathbf{P}(\pi)}{{\rm d}\mathbf{P}_{0}}=\frac{\mathbf{1}\{Y(\pi(X))=\operatorname*{ess\,inf}\{Y(\pi(X))\}\}}{\mathbf{P}_{0}(Y(\pi(X))=\operatorname*{ess\,inf}\{Y(\pi(X))\})}.
\]

Then, we have that $\mathbf{P}(\pi)$ is the unique worst-case distribution, namely

\[
\mathbf{P}(\pi)=\mathop{\rm arg\,min}_{\mathbf{P}\in\mathcal{U}_{\mathbf{P}_{0}}(\delta)}\mathbf{E}_{\mathbf{P}}[Y(\pi(X))].
\]

Proposition 1 shows that the worst-case measure $\mathbf{P}(\pi)$ is an exponentially tilted version of the underlying measure $\mathbf{P}_{0}$, where $\mathbf{P}(\pi)$ puts more weight on the low end of the reward distribution. Since $\alpha^{*}(\pi)$ can be approximated by $\alpha_{n}(\pi)$, and $\alpha_{n}(\pi)$ is explicitly computable as we shall see in Algorithm 1, we are able to understand how the worst-case measure behaves. Moreover, the following corollary shows that the worst-case measure $\mathbf{P}(\pi)$ preserves mutual independence whenever $Y(a^{1}),\ldots,Y(a^{d})$ are mutually independent conditional on $X$ under $\mathbf{P}_{0}$.

Corollary 1.

Suppose that Assumptions 1 and 2 are imposed and that, under $\mathbf{P}_{0}$, $Y(a^{1}),\ldots,Y(a^{d})$ are mutually independent conditional on $X$. Then, for any policy $\pi\in\Pi$, under the worst-case measure $\mathbf{P}(\pi)$, $Y(a^{1}),\ldots,Y(a^{d})$ are still mutually independent conditional on $X$.

The proofs of Proposition 1 and Corollary 1 are in Appendix A.2.
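The exponential tilting in Proposition 1 can be visualized directly on a sample of rewards: the sketch below reweights an empirical reward distribution by $\exp(-y/\alpha^{*})$, normalized to sum to one, so that low rewards receive more weight. The rewards array and the value of alpha_star are illustrative placeholders.

import numpy as np

def tilted_weights(rewards, alpha_star):
    # Worst-case (exponentially tilted) weights on a sample of rewards.
    w = np.exp(-np.asarray(rewards, dtype=float) / alpha_star)
    return w / w.sum()

rewards = np.array([0.1, 0.5, 0.9, 1.0])
print(tilted_weights(rewards, alpha_star=0.5))  # low rewards get more weight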

To compute $\hat{Q}_{\rm DRO}$, one needs to solve an optimization problem to obtain the distributionally robust estimate for the policy $\pi$. As the following lemma indicates, this optimization problem is easy to solve.

Lemma 2.

The empirical dual objective function $\hat{\phi}_{n}(\pi,\alpha)$ is concave in $\alpha$, and its partial derivatives admit the expressions

\begin{align*}
\frac{\partial}{\partial\alpha}\hat{\phi}_{n}(\pi,\alpha) &= -\frac{\sum_{i=1}^{n}Y_{i}(A_{i})W_{i}(\pi,\alpha)}{\alpha\, nS_{n}^{\pi}\hat{W}_{n}(\pi,\alpha)}-\log\hat{W}_{n}(\pi,\alpha)-\delta,\\
\frac{\partial^{2}}{\partial\alpha^{2}}\hat{\phi}_{n}(\pi,\alpha) &= \frac{\left(\sum_{i=1}^{n}Y_{i}(A_{i})W_{i}(\pi,\alpha)\right)^{2}}{\alpha^{3}\left(nS_{n}^{\pi}\right)^{2}\left(\hat{W}_{n}(\pi,\alpha)\right)^{2}}-\frac{\sum_{i=1}^{n}Y^{2}_{i}(A_{i})W_{i}(\pi,\alpha)}{\alpha^{3}nS_{n}^{\pi}\hat{W}_{n}(\pi,\alpha)}.
\end{align*}

Further, if the array $\{Y_{i}(A_{i})\mathbf{1}\{\pi(X_{i})=A_{i}\}\}_{i=1}^{n}$ has at least two distinct non-zero entries, then $\hat{\phi}_{n}(\pi,\alpha)$ is strictly concave in $\alpha$.

The proof of Lemma 2 is in Appendix A.2. Since the optimization problem $\hat{Q}_{\rm DRO}(\pi)=\max_{\alpha\geq 0}\{\hat{\phi}_{n}(\pi,\alpha)\}$ maximizes a concave function, it can be solved using the Newton-Raphson method. Based on the discussions above, we formally state the distributionally robust policy evaluation procedure in Algorithm 1. By Luenberger and Ye [55, Section 8.8], $\hat{\phi}_{n}(\pi,\alpha)$ converges to the global maximum $\hat{Q}_{\rm DRO}(\pi)$ quadratically in Algorithm 1 if the initial value of $\alpha$ is sufficiently close to the optimal value.

Algorithm 1 Distributionally Robust Policy Evaluation
1:  Input: Dataset $\{(X_{i},A_{i},Y_{i})\}_{i=1}^{n}$, data-collecting policy $\pi_{0}$, policy $\pi\in\Pi$, and initial value of the dual variable $\alpha$.
2:  Output: Estimator of the distributionally robust policy value $\hat{Q}_{\rm DRO}(\pi)$.
3:  repeat
4:     Let $W_{i}(\pi,\alpha)\leftarrow\frac{\mathbf{1}\{\pi(X_{i})=A_{i}\}}{\pi_{0}(A_{i}\mid X_{i})}\exp(-Y_{i}(A_{i})/\alpha)$.
5:     Compute $S_{n}^{\pi}\leftarrow\frac{1}{n}\sum_{i=1}^{n}\frac{\mathbf{1}\{\pi(X_{i})=A_{i}\}}{\pi_{0}(A_{i}\mid X_{i})}$.
6:     Compute $\hat{W}_{n}(\pi,\alpha)\leftarrow\frac{1}{nS_{n}^{\pi}}\sum_{i=1}^{n}W_{i}(\pi,\alpha)$.
7:     Update $\alpha\leftarrow\alpha-\left(\frac{\partial}{\partial\alpha}\hat{\phi}_{n}\right)/\left(\frac{\partial^{2}}{\partial\alpha^{2}}\hat{\phi}_{n}\right)$.
8:  until $\alpha$ converges.
9:  Return $\hat{Q}_{\rm DRO}(\pi)\leftarrow\hat{\phi}_{n}(\pi,\alpha)$.
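A minimal Python sketch of Algorithm 1 is given below; it follows the Newton-Raphson update based on the derivatives in Lemma 2, written here in terms of the tilted moments of the rewards. The arrays X, A, Y, the policies pi and pi0, and the initial value alpha0 are placeholders, and no safeguards are included (e.g., for policies that never match the logged actions, or for $\alpha$ leaving the positive half-line).

import numpy as np

def robust_policy_value(X, A, Y, pi, pi0, delta, alpha0=1.0, tol=1e-8, max_iter=100):
    # Newton-Raphson on the concave empirical dual objective (Algorithm 1 sketch).
    Y = np.asarray(Y, dtype=float)
    match = np.array([pi(x) == a for x, a in zip(X, A)], dtype=float)
    prop = np.array([pi0(a, x) for x, a in zip(X, A)])
    ratio = match / prop
    S_n = ratio.mean()                              # S_n^pi
    alpha = alpha0
    for _ in range(max_iter):
        W = ratio * np.exp(-Y / alpha)              # W_i(pi, alpha)
        W_hat = W.mean() / S_n                      # normalized W_hat_n(pi, alpha)
        m1 = np.sum(Y * W) / np.sum(W)              # tilted mean of Y
        m2 = np.sum(Y ** 2 * W) / np.sum(W)         # tilted second moment of Y
        grad = -m1 / alpha - np.log(W_hat) - delta  # first derivative of phi_hat_n
        hess = (m1 ** 2 - m2) / alpha ** 3          # second derivative (non-positive)
        alpha, prev = alpha - grad / hess, alpha    # Newton-Raphson step
        if abs(alpha - prev) < tol:
            break
    W_hat = (ratio * np.exp(-Y / alpha)).mean() / S_n
    return -alpha * np.log(W_hat) - alpha * delta   # Q_hat_DRO(pi)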

3.2 Theoretical Guarantee of Distributionally Robust Policy Evaluation

In the next theorem, we demonstrate that the approximation error of the policy value estimator $\hat{Q}_{\rm DRO}(\pi)$ is $O_{p}(1/\sqrt{n})$ for a fixed policy $\pi$.

Theorem 1.

Suppose Assumptions 1 and 2 are enforced, and define

\[
\sigma^{2}(\alpha)=\frac{\alpha^{2}}{\mathbf{E}[\exp\left(-Y(\pi(X))/\alpha\right)]^{2}}\,\mathbf{E}\left[\frac{1}{\pi_{0}\left(\pi(X)\mid X\right)}\left(\exp\left(-Y(\pi(X))/\alpha\right)-\mathbf{E}\left[\exp\left(-Y(\pi(X))/\alpha\right)\right]\right)^{2}\right].
\]

Then, for any policy $\pi\in\Pi$, we have

\begin{align*}
\sqrt{n}\left(\hat{Q}_{\rm DRO}(\pi)-Q_{\rm DRO}(\pi)\right) &\Rightarrow\mathcal{N}\left(0,\sigma^{2}(\alpha^{\ast}(\pi))\right), &&\text{if }\alpha^{\ast}(\pi)>0,\text{ and}\\
\sqrt{n}\left(\hat{Q}_{\rm DRO}(\pi)-Q_{\rm DRO}(\pi)\right) &\rightarrow 0\text{ in probability}, &&\text{if }\alpha^{\ast}(\pi)=0,
\end{align*}

where $\alpha^{\ast}(\pi)$ is defined in Definition 4, $\Rightarrow$ denotes convergence in distribution, and $\mathcal{N}(0,\sigma^{2})$ is the normal distribution with mean zero and variance $\sigma^{2}$.

Theorem 1 ensures that we are able to evaluate the performance of a policy in a new environment using only the training data even if the new environment is different from the training environment. The proof of Theorem 1 is in Appendix A.3.
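Since Theorem 1 gives asymptotic normality when $\alpha^{\ast}(\pi)>0$, a standard plug-in confidence interval for $Q_{\rm DRO}(\pi)$ can be formed; this is a routine consequence of the theorem rather than a statement from the paper, with $\hat{\sigma}$ denoting an empirical (propensity-weighted) estimate of $\sigma$ evaluated at the empirical dual variable $\alpha_{n}(\pi)$:

\[
\hat{Q}_{\rm DRO}(\pi)\;\pm\;z_{1-\beta/2}\,\frac{\hat{\sigma}\left(\alpha_{n}(\pi)\right)}{\sqrt{n}},
\]

where $z_{1-\beta/2}$ is the standard normal quantile corresponding to a nominal coverage level of $1-\beta$.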

4 Distributionally Robust Policy Learning

In this section, we study the policy learning aspect of the problem and discuss both the algorithm and its corresponding finite-sample theoretical guarantee. The aim is to find a robust policy that has reasonable performance in a new environment with unknown distributional shifts. First, with the distributionally robust policy evaluation scheme discussed in the previous section, we can in principle compute the distributionally robust optimal policy $\hat{\pi}_{\rm DRO}$ by picking a policy in the given policy class $\Pi$ that maximizes the value of $\hat{Q}_{\rm DRO}$, i.e.

π^DROargmaxπΠQ^DRO(π)=argmaxπΠsupα0{αlogW^n(π,α)αδ}.\hat{\pi}_{\mathrm{DRO}}\in\mathop{\rm arg\,max}_{\pi\in\Pi}\hat{Q}_{\mathrm{DRO}}(\pi)=\mathop{\rm arg\,max}_{\pi\in\Pi}\;\sup_{\alpha\geq 0}\left\{-\alpha\log\hat{W}_{n}(\pi,\alpha)-\alpha\delta\right\}. (4)

How do we compute $\hat{\pi}_{\rm DRO}$? In general, this problem is computationally intractable since it is highly non-convex in its optimization variables ($\pi$ and $\alpha$ jointly). However, following standard practice in the machine learning and optimization literature, we can employ approximate schemes that, although they do not guarantee global convergence, are computationally efficient and practically effective (for example, greedy tree search [32, Section 9.2] for decision-tree policy classes and gradient descent [64] for linear policy classes).

A simple and quite effective scheme is alternating optimization, given in Algorithm 2, where we learn $\hat{\pi}_{\rm DRO}$ by fixing $\alpha$ and minimizing $\hat{W}_{n}$ over $\pi$, and then fixing $\pi$ and maximizing $\hat{\phi}_{n}$ over $\alpha$, in each iteration. Since the value of $\hat{Q}_{\rm DRO}(\pi)$ is non-decreasing along the iterations of Algorithm 2, the converged solution obtained from Algorithm 2 is a local maximum of $\hat{Q}_{\rm DRO}$. In practice, to accelerate the algorithm, we only iterate once for $\alpha$ (line 8) using the Newton-Raphson step $\alpha\leftarrow\alpha-\left(\frac{\partial}{\partial\alpha}\hat{\phi}_{n}\right)/\left(\frac{\partial^{2}}{\partial\alpha^{2}}\hat{\phi}_{n}\right)$. Subsequent simulations (see the next section) show that this is often sufficient.

Algorithm 2 Distributionally Robust Policy Learning
1:  Input: Dataset $\{(X_{i},A_{i},Y_{i})\}_{i=1}^{n}$, data-collecting policy $\pi_{0}$, and initial value of the dual variable $\alpha$.
2:  Output: Distributionally robust optimal policy $\hat{\pi}_{\rm DRO}$.
3:  repeat
4:     Let $W_{i}(\pi,\alpha)\leftarrow\frac{\mathbf{1}\{\pi(X_{i})=A_{i}\}}{\pi_{0}(A_{i}\mid X_{i})}\exp(-Y_{i}(A_{i})/\alpha)$.
5:     Compute $S_{n}^{\pi}\leftarrow\frac{1}{n}\sum_{i=1}^{n}\frac{\mathbf{1}\{\pi(X_{i})=A_{i}\}}{\pi_{0}(A_{i}\mid X_{i})}$.
6:     Compute $\hat{W}_{n}(\pi,\alpha)\leftarrow\frac{1}{nS_{n}^{\pi}}\sum_{i=1}^{n}W_{i}(\pi,\alpha)$.
7:     Update $\pi\leftarrow\mathop{\rm arg\,min}_{\pi\in\Pi}\hat{W}_{n}(\pi,\alpha)$.
8:     Update $\alpha\leftarrow\mathop{\rm arg\,max}_{\alpha>0}\{\hat{\phi}_{n}(\pi,\alpha)\}$.
9:  until $\alpha$ converges.
10:  Return $\pi$.
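A minimal sketch of the alternating scheme in Algorithm 2 is shown below. The $\pi$-step is written against a hypothetical finite policy class policies (a list of candidate maps from context to action), so that the argmin over $\pi$ is a simple enumeration; for richer classes one would substitute a tree search or gradient-based routine as discussed above. The $\alpha$-step uses a bounded scalar maximization of $\hat{\phi}_{n}$ in place of the single Newton step, and the data arrays and pi0 are placeholders.

import numpy as np
from scipy.optimize import minimize_scalar

def learn_robust_policy(X, A, Y, policies, pi0, delta, alpha0=1.0, n_iters=20):
    # Alternating scheme of Algorithm 2 over a finite, enumerable policy class.
    Y = np.asarray(Y, dtype=float)
    prop = np.array([pi0(a, x) for x, a in zip(X, A)])

    def w_hat(pi, alpha):
        # Normalized importance-weighted average W_hat_n(pi, alpha).
        match = np.array([pi(x) == a for x, a in zip(X, A)], dtype=float)
        ratio = match / prop
        if ratio.sum() == 0.0:  # policy never matches the logged actions
            return np.inf
        return np.mean(ratio * np.exp(-Y / alpha)) / np.mean(ratio)

    def phi_hat(pi, alpha):
        # Empirical dual objective phi_hat_n(pi, alpha).
        return -alpha * np.log(w_hat(pi, alpha)) - alpha * delta

    alpha, pi = alpha0, policies[0]
    for _ in range(n_iters):
        # pi-step: minimize W_hat_n(pi, alpha) over the policy class.
        pi = min(policies, key=lambda p: w_hat(p, alpha))
        # alpha-step: maximize phi_hat_n(pi, alpha) over alpha > 0 (bounded search).
        res = minimize_scalar(lambda a: -phi_hat(pi, a),
                              bounds=(1e-3, 1e3), method="bounded")
        if abs(res.x - alpha) < 1e-8:
            alpha = res.x
            break
        alpha = res.x
    return pi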

4.1 Statistical Performance Guarantee

We now establish the finite-sample statistical performance guarantee for the distributionally robust optimal policy $\hat{\pi}_{\rm DRO}$. Before stating the theorem, we first define the entropy integral of the policy class, which measures the class complexity.

Definition 5.

Given the feature domain $\mathcal{X}$, a policy class $\Pi$, and a set of $n$ points $\{x_{1},\ldots,x_{n}\}\subset\mathcal{X}$, define:

  1. Hamming distance between any two policies $\pi_{1}$ and $\pi_{2}$ in $\Pi$: $H(\pi_{1},\pi_{2})=\frac{1}{n}\sum_{j=1}^{n}\mathbf{1}\{\pi_{1}(x_{j})\neq\pi_{2}(x_{j})\}$.

  2. $\epsilon$-Hamming covering number of the set $\{x_{1},\ldots,x_{n}\}$: $N_{H}^{(n)}\left(\epsilon,\Pi,\{x_{1},\ldots,x_{n}\}\right)$ is the smallest number $K$ of policies $\{\pi_{1},\ldots,\pi_{K}\}$ in $\Pi$ such that $\forall\pi\in\Pi,\ \exists\pi_{i},\ H(\pi,\pi_{i})\leq\epsilon$.

  3. $\epsilon$-Hamming covering number of $\Pi$: $N_{H}^{(n)}\left(\epsilon,\Pi\right)\triangleq\sup\left\{N_{H}^{(n)}\left(\epsilon,\Pi,\{x_{1},\ldots,x_{n}\}\right)\mid x_{1},\ldots,x_{n}\in\mathcal{X}\right\}$.

  4. Entropy integral: $\kappa^{(n)}\left(\Pi\right)\triangleq\int_{0}^{1}\sqrt{\log N_{H}^{(n)}\left(\epsilon^{2},\Pi\right)}\,{\rm d}\epsilon$.

The entropy integral defined here is the same as Definition 4 in [91], which is a variant of the classical entropy integral introduced in [29], and the Hamming distance is a well-known metric for measuring the similarity between two equal-length arrays whose elements are supported on discrete sets [38]. We now discuss the entropy integral $\kappa^{(n)}\left(\Pi\right)$ for different policy classes $\Pi$.

Example 1 (Finite policy classes).

For a policy class $\Pi_{\rm Fin}$ containing a finite number of policies, we have $\kappa^{(n)}\left(\Pi_{\rm Fin}\right)\leq\sqrt{\log(|\Pi_{\rm Fin}|)}$, where $|\Pi_{\rm Fin}|$ denotes the cardinality of the set $\Pi_{\rm Fin}$.
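The bound in Example 1 follows directly from Definition 5: one can always take the covering set to be the entire class, so that $N_{H}^{(n)}(\epsilon^{2},\Pi_{\rm Fin})\leq|\Pi_{\rm Fin}|$ for every $\epsilon\in(0,1]$, and hence

\[
\kappa^{(n)}\left(\Pi_{\rm Fin}\right)=\int_{0}^{1}\sqrt{\log N_{H}^{(n)}\left(\epsilon^{2},\Pi_{\rm Fin}\right)}\,{\rm d}\epsilon\leq\int_{0}^{1}\sqrt{\log|\Pi_{\rm Fin}|}\,{\rm d}\epsilon=\sqrt{\log|\Pi_{\rm Fin}|}.
\]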

The entropy integrals for the linear policy classes and decision-tree policy classes are discussed in Section 5.2 and Section 6.2, respectively. For the special case of binary action, we have the following bound for the entropy integral by [60] (see the discussion following Definition 4).

Lemma 3.

If $d=2$, we have $\kappa^{(n)}\left(\Pi\right)\leq 2.5\sqrt{{\rm VC}(\Pi)}$, where ${\rm VC}(\cdot)$ denotes the VC dimension defined in [80].

This result can be further generalized to the multi-action policy learning setting, where $d$ is greater than 2; see the proof of Theorem 2 in [60].

Lemma 4.

We have $\kappa^{(n)}\left(\Pi\right)\leq 2.5\sqrt{\log(d)\,{\rm Graph}(\Pi)}$, where ${\rm Graph}(\cdot)$ denotes the graph dimension (see the definition in [8]).

Graph dimension is a direct generalization of VC dimension. There are many papers that discuss the graph dimension and also a closely related concept, Natarajan dimension; see, for example, [20, 21, 60, 58].

Theorem 2 demonstrates that, with high probability, the distributionally robust regret of the learned policy, $R_{\rm DRO}(\hat{\pi}_{\rm DRO})$, decays at a rate upper bounded by $O_{p}(\kappa^{(n)}/\sqrt{n})$.

Theorem 2.

Suppose Assumption 1 is enforced. Then, with probability at least $1-\varepsilon$, under Assumption 2.1, we have

\[
R_{\rm DRO}(\hat{\pi}_{\rm DRO})\leq\frac{4}{\underline{b}\eta\sqrt{n}}\left(24(\sqrt{2}+1)\kappa^{(n)}\left(\Pi\right)+\sqrt{2\log\left(\frac{2}{\varepsilon}\right)}+C\right), \tag{5}
\]

where $C$ is a universal constant; and under Assumption 2.2, when

\[
n\geq\left\{\frac{4}{\underline{b}\eta}\left(24(\sqrt{2}+1)\kappa^{(n)}\left(\Pi\right)+48\sqrt{|\mathbb{D}|\log\left(2\right)}+\sqrt{2\log\left(\frac{2}{\varepsilon}\right)}\right)\right\}^{2},
\]

we have

\[
R_{\rm DRO}(\hat{\pi}_{\rm DRO})\leq\frac{4M}{\underline{b}\eta\sqrt{n}}\left(24(\sqrt{2}+1)\kappa^{(n)}\left(\Pi\right)+48\sqrt{|\mathbb{D}|\log\left(2\right)}+\sqrt{2\log\left(\frac{2}{\varepsilon}\right)}\right), \tag{6}
\]

where $|\mathbb{D}|$ denotes the cardinality of the set $\mathbb{D}$.

The key challenge in the proof of Theorem 2 is that $Q_{\rm DRO}(\pi)$ is hard to quantify, since it is a non-linear functional of the probability measure $\mathbf{P}$. Thankfully, Lemmas 5 and 6 allow us to transform the analysis of $Q_{\rm DRO}(\pi)$ into well-studied quantities such as the quantile and the total variation distance.

Lemma 5.

For any probability measures 𝐏1,𝐏2\mathbf{P}_{1},\mathbf{P}_{2} supported on 𝐑\mathbf{R}, we have

|supα0{αlog𝐄𝐏1[exp(Y/α)]αδ}supα0{αlog𝐄𝐏2[exp(Y/α)]αδ}|supt[0,1]|q𝐏1(t)q𝐏2(t)|,\left|\sup_{\alpha\geq 0}\left\{-\alpha\log\mathbf{E}_{\mathbf{P}_{1}}\left[\exp\left(-Y/\alpha\right)\right]-\alpha\delta\right\}-\sup_{\alpha\geq 0}\left\{-\alpha\log\mathbf{E}_{\mathbf{P}_{2}}\left[\exp\left(-Y/\alpha\right)\right]-\alpha\delta\right\}\right|\leq\sup_{t\in[0,1]}\left|q_{\mathbf{P}_{1}}\left(t\right)-q_{\mathbf{P}_{2}}\left(t\right)\right|,

where q𝐏(t)q_{\mathbf{P}}\left(t\right) denotes the tt-quantile of a probability measure 𝐏\mathbf{P}, defined as

q𝐏(t)inf{x𝐑:tF𝐏(x)},q_{\mathbf{P}}\left(t\right)\triangleq\inf\left\{x\in\mathbf{R}:t\leq F_{\mathbf{P}}\left(x\right)\right\},

where F𝐏F_{\mathbf{P}} is the CDF of 𝐏.\mathbf{P.}
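As a concrete illustration of the dual objective appearing in Lemma 5, the following Python sketch (an illustration under our own assumptions, not the paper's code) evaluates supα≥0{−αlog𝐄𝐏[exp(−Y/α)]−αδ} for an empirical distribution of Y by a one-dimensional search over α, using a log-sum-exp for numerical stability; the value at α→0+ equals the sample minimum and is kept as a fallback.

import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import logsumexp

def kl_dual_value(y, delta, alpha_max=100.0):
    # sup_{alpha >= 0} { -alpha * log E[exp(-Y/alpha)] - alpha * delta }
    # under the empirical distribution of the sample y.
    y = np.asarray(y, dtype=float)

    def neg_obj(alpha):
        val = -alpha * (logsumexp(-y / alpha) - np.log(len(y))) - alpha * delta
        return -val

    res = minimize_scalar(neg_obj, bounds=(1e-6, alpha_max), method="bounded")
    # As alpha -> 0+ the objective tends to min(y); keep the better of the two.
    return max(-res.fun, float(np.min(y)))

rng = np.random.default_rng(1)
y = rng.normal(loc=0.5, scale=0.2, size=1000)
print(kl_dual_value(y, delta=0.2))  # robust value, below the sample mean of about 0.5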

Lemma 6.

Suppose 𝐏1\mathbf{P}_{1} and 𝐏2\mathbf{P}_{2} are supported on 𝔻\mathbb{D} and satisfy Assumption 1.3. We further assume 𝐏2\mathbf{P}_{2} satisfies Assumption 2.2. When TV(𝐏1,𝐏2)<b¯/2,\mathrm{TV}(\mathbf{P}_{1},\mathbf{P}_{2})<\underline{b}/2, we have

|supα0{αlog𝐄𝐏1[exp(Y/α)]αδ}supα0{αlog𝐄𝐏2[exp(Y/α)]αδ}|2Mb¯TV(𝐏1,𝐏2),\left|\sup_{\alpha\geq 0}\left\{-\alpha\log\mathbf{E}_{\mathbf{P}_{1}}\left[\exp\left(-Y/\alpha\right)\right]-\alpha\delta\right\}-\sup_{\alpha\geq 0}\left\{-\alpha\log\mathbf{E}_{\mathbf{P}_{2}}\left[\exp\left(-Y/\alpha\right)\right]-\alpha\delta\right\}\right|\leq\frac{2M}{\underline{b}}\mathrm{TV}(\mathbf{P}_{1},\mathbf{P}_{2}),

where TV\mathrm{TV} denotes the total variation distance.

The detailed proof is in Appendix A.4. We see that the bounds in (5) and (6) for the distributionally robust regret do not depend on the uncertainty size δ\delta. Furthermore, if supnκ(n)<\sup_{n}\kappa^{(n)}<\infty, which covers finite policy classes, linear policy classes, decision-tree policy classes, and any class with finite VC(Π)VC(\Pi) or Graph(Π)Graph(\Pi), we obtain the parametric convergence rate Op(1/n)O_{p}(1/\sqrt{n}). Further, if κ(n)=op(n)\kappa^{(n)}=o_{p}(\sqrt{n}), we have RDRO(π^DRO)0R_{\mathrm{DRO}}(\hat{\pi}_{\mathrm{DRO}})\rightarrow 0 in probability. In general, we may expect the complexity of parametric classes to be O(1)O(1). Theorem 2 guarantees the robustness of the policy learned from the training environment given sufficient training data and a policy class of low complexity. In other words, the test-environment performance is guaranteed as long as the test and training environments do not differ too much. We will show in Theorem 3 that this rate is optimal up to a constant.

4.2 Statistical Lower Bound

In this subsection, we provide a tight lower bound for the distributionally robust batch contextual bandit problem. First, we define 𝒫(M)\mathcal{P}(M) as the collection of all joint distributions of (X,Y(a1),Y(a2),,Y(ad),A)(X,Y(a^{1}),Y(a^{2}),\ldots,Y(a^{d}),A) satisfying Assumption 1. To emphasize the dependence on the underlying distribution 𝐏0\mathbf{P}_{0}, we write RDRO(π)=RDRO(π,𝐏0)R_{\mathrm{DRO}}(\pi)=R_{\mathrm{DRO}}(\pi,\mathbf{P}_{0}). We further denote by 𝐏0π0\mathbf{P}^{\pi_{0}}_{0} the distribution of the observed triples {X,A,Y(A)}\left\{X,A,Y(A)\right\}.

Theorem 3.

Let d=2d=2 and δ0.226\delta\leq 0.226. Then, for any policy π\pi as a function of {Xi,Ai,Yi}i=1n,\{X_{i},A_{i},Y_{i}\}_{i=1}^{n}, it holds that

sup𝐏0π0𝒫(M)𝐄(𝐏0π0)n[RDRO(π,𝐏0)]Mκ(n)(Π)160n, for nκ(n)(Π)2,\sup_{\mathbf{P}_{0}*\pi_{0}\in\mathcal{P}(M)}\mathbf{E}_{\left(\mathbf{P}^{\pi_{0}}_{0}\right)^{n}}\left[R_{\mathrm{DRO}}(\pi,\mathbf{P}_{0})\right]\geq\frac{M\kappa^{(n)}\left(\Pi\right)}{160\sqrt{n}},\text{ for }n\geq\kappa^{(n)}\left(\Pi\right)^{2},

where (𝐏0π0)n\left(\mathbf{P}^{\pi_{0}}_{0}\right)^{n} denotes the nn-times product measure of 𝐏0π0\mathbf{P}^{\pi_{0}}_{0}.

The proof of Theorem 3 is in Appendix A.5. Theorem 3 shows that the dependence of the regret on the complexity κ(n)(Π)\kappa^{(n)}(\Pi), the number of samples nn, and the reward bound MM is optimal up to a constant. In other words, it is impossible to find a good robust policy from a small amount of training data or within a relatively large policy class.

5 Simulation Studies

In this section, we present simulation studies that demonstrate the robustness of the proposed DRO policy π^DRO\hat{\pi}_{\mathrm{DRO}} in the linear policy class. Specifically, Section 5.1 discusses the notion of the Bayes DRO policy, which serves as a benchmark; Section 5.2 presents an approximation algorithm to efficiently learn a linear policy; Section 5.3 visualizes the learned DRO policy, compares it to the benchmark Bayes DRO policy, and demonstrates the performance of our proposed estimator.

5.1 Bayes DRO Policy

In this section, we give a characterization of the Bayes DRO policy π¯DRO\overline{\pi}^{*}_{\mathrm{DRO}}, which maximizes the distributionally robust value function within the class of all measurable policies, i.e.,

π¯DROargmaxπΠ¯{QDRO(π)},\overline{\pi}^{*}_{\mathrm{DRO}}\in\mathop{\rm arg\,max}_{\pi\in\overline{\Pi}}\{Q_{\mathrm{DRO}}(\pi)\},

where Π¯\overline{\Pi} denotes the class of all measurable mappings from 𝒳\mathcal{X} to the action set 𝒜\mathcal{A}. Although the Bayes DRO policy is not learnable from finitely many training samples, it serves as a benchmark in a simulation study. Proposition 2 shows how to compute π¯DRO\overline{\pi}^{*}_{\mathrm{DRO}} when the population distribution is known.

Proposition 2.

Suppose that for any α>0\alpha>0 and any a𝒜a\in\mathcal{A}, the mapping x𝐄𝐏0[exp(Y(a)/α)|X=x]x\mapsto\mathbf{E}_{\mathbf{P}_{0}}\left[\left.\exp\left(-Y(a)/\alpha\right)\right|X=x\right] is measurable. Then, the Bayes DRO policy is

π¯DRO(x)argmina𝒜{𝐄𝐏0[exp(Y(a)α(π¯DRO))|X=x]},\overline{\pi}^{*}_{\mathrm{\rm DRO}}(x)\in\mathop{\rm arg\,min}_{a\in\mathcal{A}}\left\{\mathbf{E}_{\mathbf{P}_{0}}\left[\left.\exp\left(-\frac{Y(a)}{\alpha^{*}(\overline{\pi}^{*}_{\rm DRO})}\right)\right|X=x\right]\right\},

where α(π¯DRO)\alpha^{*}(\overline{\pi}^{*}_{\rm DRO}) is an optimizer of the following optimization problem:

α(π¯DRO)argmaxα0{αlog𝐄𝐏0[mina𝒜{𝐄𝐏0[exp(Y(a)/α)|X]}]αδ}.\alpha^{*}(\overline{\pi}^{*}_{\rm DRO})\in\mathop{\rm arg\,max}_{\alpha\geq 0}\left\{-\alpha\log\mathbf{E}_{\mathbf{P}_{0}}\left[\min_{a\in\mathcal{A}}\left\{\mathbf{E}_{\mathbf{P}_{0}}\left[\left.\exp\left(-Y(a)/\alpha\right)\right|X\right]\right\}\right]-\alpha\delta\right\}. (7)

See Appendix A.6 for the proof.

Remark 4.

π¯DRO\overline{\pi}_{\rm DRO}^{*} only depends on the marginal distribution of XX and the conditional distributions of Y(ai)|X,i=1,2,,dY(a^{i})|X,i=1,2,\ldots,d. Therefore, the conditional correlation structure of Y(ai)|X,i=1,2,,dY(a^{i})|X,i=1,2,\ldots,d does not affect π¯DRO\overline{\pi}_{\rm DRO}^{*}.

5.2 Linear Policy Class and Logistic Policy Approximation

In this section, we introduce the linear policy class ΠLin\Pi_{\mathrm{Lin}}. We consider 𝒳\mathcal{X} to be a subset of 𝐑p\mathbf{R}^{p}, and the action set 𝒜={1,2,,d}\mathcal{A}=\{1,2,\ldots,d\}. To capture the intercept, it is convenient to include the constant variable 1 in X𝒳X\in\mathcal{X}, thus in the rest of Section 5.2, XX is a p+1p+1 dimensional vector and 𝒳\mathcal{X} is a subset of 𝐑p+1\mathbf{R}^{p+1}. Each policy πΠLin\pi\in\Pi_{\mathrm{Lin}} is parameterized by a set of dd vectors Θ={θa𝐑p+1:a𝒜}𝐑(p+1)×d\Theta=\{\theta_{a}\in\mathbf{R}^{p+1}:a\in\mathcal{A}\}\in\mathbf{R}^{(p+1)\times d}, and the mapping π:𝒳𝒜\pi:\mathcal{X}\rightarrow\mathcal{A} is defined as

πΘ(x)argmaxa𝒜{θax}.\pi_{\Theta}(x)\in\mathop{\rm arg\,max}_{a\in\mathcal{A}}\leavevmode\nobreak\ \left\{\theta_{a}^{\top}x\right\}.

The optimal parameter for linear policy class is characterized by the optimal solution of

maxΘ𝐑(p+1)×d𝐄𝐏0[Y(πΘ(X))].\max_{\Theta\in\mathbf{R}^{(p+1)\times d}}\mathbf{E}_{\mathbf{P}_{0}}[Y(\pi_{\Theta}(X))].

Due to the fact that 𝐄𝐏0[Y(πΘ(X))]=𝐄𝐏π0[Y(A)𝟏{πΘ(X)=A}π0(AX)]\mathbf{E}_{\mathbf{P}_{0}}[Y(\pi_{\Theta}(X))]=\mathbf{E}_{\mathbf{P}*\pi_{0}}\left[\frac{Y(A)\mathbf{1}\{\pi_{\Theta}(X)=A\}}{\pi_{0}(A\mid X)}\right], the associated sample average approximation problem for optimal parameter estimation is

maxΘ𝐑(p+1)×d1ni=1nYi(Ai)𝟏{πΘ(Xi)=Ai}π0(Ai|Xi).\max_{\Theta\in\mathbf{R}^{(p+1)\times d}}\frac{1}{n}\sum_{i=1}^{n}\frac{Y_{i}(A_{i})\mathbf{1\{}\pi_{\Theta}(X_{i})=A_{i}\mathbf{\}}}{\pi_{0}\left(A_{i}|X_{i}\right)}.

However, the objective in this optimization problem is non-differentiable and non-convex, thus we approximate the indicator function using a softmax mapping by

𝟏{πΘ(Xi)=Ai}exp(θAiXi)a=1dexp(θaXi),\mathbf{1\{}\pi_{\Theta}(X_{i})=A_{i}\mathbf{\}}\approx\frac{\exp(\theta_{A_{i}}^{\top}X_{i})}{\sum_{a=1}^{d}\exp(\theta_{a}^{\top}X_{i})},

which leads to an optimization problem with smooth objective:

maxΘ𝐑(p+1)×d1ni=1nYi(Ai)exp(θAiXi)π0(Ai|Xi)a=1dexp(θaXi).\max_{\Theta\in\mathbf{R}^{(p+1)\times d}}\frac{1}{n}\sum_{i=1}^{n}\frac{Y_{i}(A_{i})\exp(\theta_{A_{i}}^{\top}X_{i})}{\pi_{0}\left(A_{i}|X_{i}\right)\sum_{a=1}^{d}\exp(\theta_{a}^{\top}X_{i})}.

We employ the gradient descent method to solve for the optimal parameter

Θ^LinargmaxΘ𝐑(p+1)×d{1ni=1nYi(Ai)exp(θAiXi)π0(Ai|Xi)a=1dexp(θaXi)},\hat{\Theta}_{\mathrm{Lin}}\in\mathop{\rm arg\,max}_{\Theta\in\mathbf{R}^{(p+1)\times d}}\left\{\frac{1}{n}\sum_{i=1}^{n}\frac{Y_{i}(A_{i})\exp(\theta_{A_{i}}^{\top}X_{i})}{\pi_{0}\left(A_{i}|X_{i}\right)\sum_{a=1}^{d}\exp(\theta_{a}^{\top}X_{i})}\right\},

and define the policy π^LinπΘ^Lin\hat{\pi}_{\mathrm{Lin}}\triangleq\pi_{\hat{\Theta}_{\mathrm{Lin}}} as our linear policy estimator. In Section 5.3, we justify the efficacy of π^Lin\hat{\pi}_{\mathrm{Lin}} by empirically showing π^Lin\hat{\pi}_{\mathrm{Lin}} is capable of discovering the (non-robust) optimal decision boundary.
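The following Python sketch illustrates this smoothed objective and its optimization. It is a minimal illustration under our own assumptions (numpy arrays X of shape n×(p+1) with the constant 1 appended, actions A coded in {0,…,d−1}, observed rewards Y, and known propensities prop = π0(Ai|Xi)), and a generic quasi-Newton optimizer stands in for plain gradient descent.

import numpy as np
from scipy.optimize import minimize

def fit_linear_policy(X, A, Y, prop, d, seed=0):
    # Maximize the softmax-smoothed IPW objective over Theta in R^{(p+1) x d}.
    n, p1 = X.shape

    def neg_obj(theta_flat):
        Theta = theta_flat.reshape(p1, d)
        scores = X @ Theta                              # (n, d) values theta_a^T x_i
        scores -= scores.max(axis=1, keepdims=True)     # stabilize the softmax
        soft = np.exp(scores)
        soft /= soft.sum(axis=1, keepdims=True)
        # softmax probability of the logged action replaces 1{pi_Theta(X_i) = A_i}
        return -np.mean(Y * soft[np.arange(n), A] / prop)

    rng = np.random.default_rng(seed)
    theta0 = 0.01 * rng.normal(size=p1 * d)
    res = minimize(neg_obj, theta0, method="L-BFGS-B")
    Theta_hat = res.x.reshape(p1, d)
    return Theta_hat, (lambda x: int(np.argmax(np.asarray(x) @ Theta_hat)))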

As an oracle in Algorithm 2, a similar smoothing technique is adopted to solve argminπΠLinW^n(π,α)\mathop{\rm arg\,min}_{\pi\in\Pi_{\mathrm{Lin}}}\hat{W}_{n}(\pi,\alpha) for linear policy class ΠLin\Pi_{\mathrm{Lin}}. We omit the details here due to space limitations.

We present an upper bound on the entropy integral κ(n)(ΠLin)\kappa^{(n)}(\Pi_{\mathrm{Lin}}) in Lemma 7. By plugging the result of Lemma 7 into Theorem 2, one readily sees that the regret RDRO(π^DRO)R_{\mathrm{DRO}}(\hat{\pi}_{\mathrm{DRO}}) achieves the Op(1/n)O_{p}(1/\sqrt{n}) convergence rate of Theorem 2, which is optimal by Theorem 3.

Lemma 7.

There exists a universal constant CC such that κ(n)(ΠLin)Cdplog(d)log(dp).\kappa^{(n)}(\Pi_{\rm{Lin}})\leq C\sqrt{dp\log(d)\log(dp)}.

The proof of Lemma 7 proceeds by upper bounding the ϵ\epsilon-Hamming covering number NH(n)(ϵ,ΠLin)N_{H}^{\left(n\right)}\left(\epsilon,\Pi_{\mathrm{Lin}}\right) in terms of the graph dimension via Lemma 4, and then applying an upper bound on the graph dimension of the linear policy class provided in [21].

5.3 Experiment Results

In this section, we present two simple examples with an explicitly computable optimal linear DRO policy. We illustrate the behavior of distributionally robust policy learning in Section 5.3.1 and we demonstrate the effectiveness of the distributionally robust policy in Section 5.3.2.

5.3.1 A Linear Boundary Example

We consider 𝒳={x=(x(1),,x(p))𝐑p:i=1px(i)21}\mathcal{X}=\{x=(x(1),\ldots,x(p))\in\mathbf{R}^{p}:\sum_{i=1}^{p}x(i)^{2}\leq 1\} to be a pp-dimensional closed unit ball, and the action set 𝒜={1,,d}\mathcal{A}=\{1,\ldots,d\}. We assume that Y(i)Y(i)’s are mutually independent conditional on XX with conditional distribution

Y(i)|X𝒩(βiX,σi2), for i=1,,d.Y(i)|X\sim\mathcal{N}(\beta_{i}^{\top}X,\sigma_{i}^{2}),\text{ for }i=1,\ldots,d.

where the vectors {β1,,βd}𝐑p\{\beta_{1},\ldots,\beta_{d}\}\subset\mathbf{R}^{p} and variances {σ12,,σd2}𝐑+\{\sigma_{1}^{2},\ldots,\sigma_{d}^{2}\}\subset\mathbf{R}_{+} are fixed parameters. In this case, by directly computing the moment generating functions and applying Proposition 2, we have

π¯DRO(x)argmaxi{1,,d}{βixσi22α(πDRO)}.\overline{\pi}^{*}_{\mathrm{\mathrm{DRO}}}(x)\in\mathop{\rm arg\,max}_{i\in\{1,\ldots,d\}}\left\{\beta_{i}^{\top}x-\frac{\sigma^{2}_{i}}{2\alpha^{*}(\pi^{*}_{\mathrm{DRO}})}\right\}.

We consider the linear policy class ΠLin\Pi_{\mathrm{Lin}}. Clearly, the Bayes DRO policy π¯DRO(x)\overline{\pi}^{*}_{\mathrm{\mathrm{DRO}}}(x) lies in the class ΠLin\Pi_{\mathrm{Lin}}, so it is also the optimal linear DRO policy, i.e., π¯DROargmaxπΠLinQDRO(π)\overline{\pi}_{\mathrm{DRO}}^{*}\in\mathop{\rm arg\,max}_{\pi\in\Pi_{\mathrm{Lin}}}Q_{DRO}(\pi). Consequently, we can check the efficacy of the distributionally robust policy learning algorithm by comparing π^DRO\hat{\pi}_{\mathrm{DRO}} against π¯DRO\overline{\pi}_{\mathrm{DRO}}^{*}.
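Proposition 2, combined with the Gaussian moment generating function, reduces the computation of the Bayes DRO policy to a one-dimensional search for α*. The sketch below is a numerical illustration under our own assumptions (a Monte Carlo sample of X approximates the outer expectation in (7), and a bounded scalar optimizer performs the search); it returns the policy in the closed form displayed above.

import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import logsumexp

def bayes_dro_policy_gaussian(betas, sigmas, delta, X_mc):
    # Bayes DRO policy when Y(i)|X ~ N(beta_i^T X, sigma_i^2).
    # betas: (d, p), sigmas: (d,), X_mc: Monte Carlo sample of X for equation (7).
    sig2 = np.asarray(sigmas, dtype=float) ** 2

    def neg_dual(alpha):
        # log E[exp(-Y(i)/alpha) | X] = -beta_i^T X / alpha + sigma_i^2 / (2 alpha^2)
        log_mgf = -(X_mc @ betas.T) / alpha + sig2 / (2 * alpha ** 2)
        log_inner = log_mgf.min(axis=1)          # min over actions, in log scale
        val = -alpha * (logsumexp(log_inner) - np.log(len(X_mc))) - alpha * delta
        return -val

    alpha_star = minimize_scalar(neg_dual, bounds=(1e-3, 100.0), method="bounded").x
    # Closed form above: argmax_i { beta_i^T x - sigma_i^2 / (2 alpha*) }.
    return lambda x: int(np.argmax(betas @ np.asarray(x) - sig2 / (2 * alpha_star)))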

Now we describe the parameters of the experiment. We choose p=5p=5 and d=3d=3. To facilitate visualization of the decision boundary, we set all the entries of βi\beta_{i} to 0 except for the first two dimensions. Specifically, we choose

β1=(1,0,0,0,0),\displaystyle\beta_{1}=(1,0,0,0,0), β2=(1/2,3/2,0,0,0),\displaystyle\beta_{2}=(-1/2,\sqrt{3}/2,0,0,0), β3=(1/2,3/2,0,0,0).\displaystyle\beta_{3}=(-1/2,-\sqrt{3}/2,0,0,0).

and σ1=0.2,σ2=0.5,σ3=0.8.\sigma_{1}=0.2,\sigma_{2}=0.5,\sigma_{3}=0.8. We define the Bayes policy π¯\overline{\pi}^{\ast} as the policy that maximizes 𝐄𝐏0[Y(π(X))]\mathbf{E}_{\mathbf{P}_{0}}[Y(\pi(X))] within the class of all measurable policies. Under this setting, π¯(x)argmaxi=1,2,3{βix}\overline{\pi}^{\ast}(x)\in\mathop{\rm arg\,max}_{i=1,2,3}\{\beta_{i}^{\top}x\}. The feature space 𝒳\mathcal{X} is partitioned into three regions based on π¯\overline{\pi}^{\ast}: for i=1,2,3i=1,2,3, we say x𝒳x\in\mathcal{X} belongs to Region ii if π¯(x)=i\overline{\pi}^{\ast}(x)=i. Given XX, the action AA is drawn according to the underlying data collection policy π0\pi_{0}, which is described in Table 1.

Region 1 Region 2 Region 3
Action 1 0.50 0.25 0.25
Action 2 0.25 0.50 0.25
Action 3 0.25 0.25 0.50
Table 1: The probabilities of selecting an action based on π0\pi_{0} in the linear example.
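A minimal sketch of this data-generating process (our own illustration; the uniform sampling on the unit ball and the explicit propensity bookkeeping are assumptions consistent with the description above) is given below; the returned propensities π0(Ai|Xi) are what the IPW-style estimators require.

import numpy as np

def generate_linear_example(n, betas, sigmas, seed=0):
    # Generate {(X_i, A_i, Y_i)} with the logging policy pi_0 of Table 1.
    rng = np.random.default_rng(seed)
    d, p = betas.shape
    # X uniform on the p-dimensional closed unit ball.
    Z = rng.normal(size=(n, p))
    X = (rng.uniform(size=(n, 1)) ** (1.0 / p)) * Z / np.linalg.norm(Z, axis=1, keepdims=True)
    region = np.argmax(X @ betas.T, axis=1)          # region index = Bayes action
    # Row r gives the action probabilities in Region r+1 (Table 1).
    region_probs = np.array([[0.50, 0.25, 0.25],
                             [0.25, 0.50, 0.25],
                             [0.25, 0.25, 0.50]])
    A = np.array([rng.choice(d, p=region_probs[r]) for r in region])
    prop = region_probs[region, A]                   # pi_0(A_i | X_i)
    Y = rng.normal(loc=np.sum(betas[A] * X, axis=1), scale=sigmas[A])
    return X, A, Y, prop

betas = np.array([[1.0, 0, 0, 0, 0],
                  [-0.5, np.sqrt(3) / 2, 0, 0, 0],
                  [-0.5, -np.sqrt(3) / 2, 0, 0, 0]])
sigmas = np.array([0.2, 0.5, 0.8])
X, A, Y, prop = generate_linear_example(5000, betas, sigmas)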

We generate {Xi,Ai,Yi}i=1n\{X_{i},A_{i},Y_{i}\}_{i=1}^{n} according to the procedure described above as the training dataset, from which we learn the non-robust linear policy π^Lin\hat{\pi}_{\mathrm{Lin}} and the distributionally robust linear policy π^DRO\hat{\pi}_{\mathrm{DRO}}. Figure 2 presents the decision boundaries of four different policies: (a) π¯\overline{\pi}^{\ast}; (b) π^Lin\hat{\pi}_{\mathrm{Lin}}; (c) π¯DRO\overline{\pi}^{\ast}_{\mathrm{DRO}}; (d) π^DRO\hat{\pi}_{\mathrm{DRO}}, where n=5000n=5000 and δ=0.2\delta=0.2. One can readily see that the decision boundary of π^Lin\hat{\pi}_{\mathrm{Lin}} resembles that of π¯\overline{\pi}^{\ast} and the decision boundary of π^DRO\hat{\pi}_{\mathrm{DRO}} resembles that of π¯DRO\overline{\pi}^{\ast}_{\mathrm{DRO}}, which demonstrates that π^Lin\hat{\pi}_{\mathrm{Lin}} is (nearly) the optimal non-DRO policy and π^DRO\hat{\pi}_{\mathrm{DRO}} is (nearly) the optimal DRO policy.

The distinction between π¯\overline{\pi}^{\ast} and π¯DRO\overline{\pi}^{\ast}_{\mathrm{DRO}} is also apparent in Figure 2: π¯DRO\overline{\pi}^{\ast}_{\mathrm{DRO}} is less likely to choose Action 3, but more likely to choose Action 1. In other words, a distributionally robust policy prefers actions with smaller variance. We remark that this finding is consistent with [25] and [26], as they find that the DRO problem with KL-divergence is a good approximation to the variance-regularized objective when δ0\delta\rightarrow 0.

Figure 2: Comparison of decision boundaries for different policies in the linear example: (a) Bayes policy π¯\overline{\pi}^{\ast}; (b) linear policy π^Lin\hat{\pi}_{\mathrm{Lin}}; (c) Bayes distributionally robust policy π¯DRO\overline{\pi}_{\mathrm{DRO}}^{*}; (d) distributionally robust linear policy π^DRO\hat{\pi}_{\mathrm{DRO}}. We visualize the actions selected by different policies against the value of (X(1),X(2))(X(1),X(2)). Training set size n=5000n=5000; size of distributional uncertainty set δ=0.2\delta=0.2.

5.3.2 A Non-linear Boundary Example

In this section, we compare the performance of different estimators in a simulation environment where the Bayes decision boundaries are nonlinear.

We consider 𝒳=[1,1]5\mathcal{X}=[-1,1]^{5} to be a 55-dimensional cube, and the action set to be 𝒜={1,2,3}\mathcal{A}=\{1,2,3\}. We assume that Y(i)Y(i)’s are mutually independent conditional on XX with conditional distribution

Y(i)|X𝒩(μi(X),σi2), for i=1,2,3.Y(i)|X\sim\mathcal{N}(\mu_{i}(X),\sigma_{i}^{2}),\text{ for }i=1,2,3.

where μi:𝒳𝐑\mu_{i}:\mathcal{X}\rightarrow\mathbf{R} is a measurable function and σi𝐑+\sigma_{i}\in\mathbf{R}_{+} for i=1,2,3i=1,2,3. In this setting, we are still able to analytically compute the Bayes policy π¯(x)argmaxi=1,2,3{μi(x)}\overline{\pi}^{\ast}(x)\in\mathop{\rm arg\,max}_{i=1,2,3}\{\mu_{i}(x)\} and the Bayes DRO policy π¯DRO(x)argmaxi=1,2,3{μi(x)σi22α(πDRO)}\overline{\pi}^{\ast}_{\mathrm{DRO}}(x)\in\mathop{\rm arg\,max}_{i=1,2,3}\left\{\mu_{i}(x)-\frac{\sigma^{2}_{i}}{2\alpha^{*}(\pi^{*}_{\mathrm{DRO}})}\right\}.

In this section, the conditional means μi(x)\mu_{i}(x) and conditional standard deviations σi\sigma_{i} are chosen as

\begin{array}{ll}\mu_{1}(x)=0.2x(1),&\sigma_{1}=0.8,\\ \mu_{2}(x)=1-\sqrt{(x(1)+0.5)^{2}+(x(2)-1)^{2}},&\sigma_{2}=0.2,\\ \mu_{3}(x)=1-\sqrt{(x(1)+0.5)^{2}+(x(2)+1)^{2}},&\sigma_{3}=0.4.\end{array}

Given XX, the action AA is drawn according to the underlying data collection policy π0\pi_{0} described in Table 2.

Region 1 Region 2 Region 3
Action 1 0.50 0.25 0.25
Action 2 0.30 0.40 0.30
Action 3 0.30 0.30 0.40
Table 2: The probabilities of selecting an action based on π0\pi_{0} in nonlinear example.

Now we generate the training set {Xi,Ai,Yi}i=1n\{X_{i},A_{i},Y_{i}\}_{i=1}^{n} and learn the non-robust linear policy π^Lin\hat{\pi}_{\mathrm{Lin}} and the distributionally robust linear policy π^DRO\hat{\pi}_{\mathrm{DRO}} in the linear policy class ΠLin\Pi_{\mathrm{Lin}}, for n=5000n=5000 and δ=0.2\delta=0.2. Figure 3 presents the decision boundaries of four different policies: (a) π¯\overline{\pi}^{\ast}; (b) π^Lin\hat{\pi}_{\mathrm{Lin}}; (c) π¯DRO\overline{\pi}^{\ast}_{\mathrm{DRO}}; (d) π^DRO\hat{\pi}_{\mathrm{DRO}}. As π¯\overline{\pi}^{\ast} and π¯DRO\overline{\pi}^{\ast}_{\mathrm{DRO}} have nonlinear decision boundaries, no linear policy can recover the Bayes policies exactly. However, the boundaries produced by π^Lin\hat{\pi}_{\mathrm{Lin}} and π^DRO\hat{\pi}_{\mathrm{DRO}} are reasonable linear approximations of π¯\overline{\pi}^{\ast} and π¯DRO\overline{\pi}^{\ast}_{\mathrm{DRO}}, respectively. Especially noteworthy is that the robust policy prefers the action with small variance (Action 2), which is consistent with our finding in Section 5.3.1.

Figure 3: Comparison of decision boundaries for different policies in nonlinear example: (a) optimal policy under population distribution 𝐏0\mathbf{P}_{0}; (b) optimal linear policy π^Lin\hat{\pi}_{\mathrm{Lin}} learned from data; (c) Bayes distributionally robust policy π¯DRO\overline{\pi}_{\mathrm{DRO}}^{*}; (d) distributionally robust linear policy π^DRO\hat{\pi}_{\mathrm{DRO}}. We visualize the actions selected by different policies against the value of (X(1),X(2))(X(1),X(2)). Training size is 5000; size of distributional uncertainty set δ=0.2\delta=0.2.

We now introduce two evaluation metrics to quantitatively characterize the adversarial performance of different policies.

  1. 1.

    We generate a test set with n=2500n^{\prime}=2500 i.i.d. data points sampled from 𝐏0\mathbf{P}_{0} and evaluate the worst case performance of each policy using Q^DRO\hat{Q}_{\mathrm{DRO}} with a radius δtest\delta^{\mathrm{test}}. Note that δtest\delta^{\mathrm{test}} may be different from δ\delta in the training procedure. The results are reported in the first row of Tables 3 and 4.

  2. 2.

    We first generate M=100M=100 independent test sets, where each test set consists of n=2500n^{\prime}=2500 i.i.d. data points sampled from 𝐏0\mathbf{P}_{0}. We denote them by {{(Xi(j),Yi(j)(a1),,Yi(j)(ad))}i=1n}j=1M\left\{\left\{\left(X^{(j)}_{i},Y^{(j)}_{i}(a^{1}),\dots,Y^{(j)}_{i}(a^{d})\right)\right\}_{i=1}^{n^{\prime}}\right\}_{j=1}^{M}. Then, we randomly sample a new dataset around each dataset, i.e., (X~i(j),Y~i(j)(a1),,Y~i(j)(ad))\left(\tilde{X}^{(j)}_{i},\tilde{Y}^{(j)}_{i}(a^{1}),\dots,\tilde{Y}^{(j)}_{i}(a^{d})\right) is sampled on the KL-sphere centered at (Xi(j),Yi(j)(a1),,Yi(j)(ad))\left(X^{(j)}_{i},Y^{(j)}_{i}(a^{1}),\dots,Y^{(j)}_{i}(a^{d})\right) with a radius δtest\delta^{\mathrm{test}}. Then, we evaluate each policy using Q^min\hat{Q}_{\mathrm{min}}, defined by

    Q^min(π)min1jM{1ni=1nY~i(j)(π(X~i(j)))}.\hat{Q}_{\mathrm{min}}(\pi)\triangleq\min_{1\leq j\leq M}\left\{\frac{1}{n^{\prime}}\sum_{i=1}^{n^{\prime}}\tilde{Y}^{(j)}_{i}\left(\pi\left({\tilde{X}_{i}^{(j)}}\right)\right)\right\}.

The results are reported in the second row of Tables 3 and 4; a minimal sketch of this metric is given right after this list.
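Assuming the perturbed test sets have already been generated (how points are sampled on the KL-sphere is not specified here and is left to the appendix), the following sketch computes Q̂min for a given policy; the data layout (a list of (X̃, Ỹ) pairs with Ỹ holding all potential outcomes column-wise) is our own assumption.

import numpy as np

def q_min(policy, perturbed_sets):
    # Q_min(pi): minimum plug-in value over the M perturbed test sets.
    # perturbed_sets: list of (X_tilde, Y_tilde) pairs, Y_tilde of shape (n', d),
    # where column a holds the potential outcome Y~(a).
    values = []
    for X_t, Y_t in perturbed_sets:
        actions = np.array([policy(x) for x in X_t])
        values.append(float(Y_t[np.arange(len(X_t)), actions].mean()))
    return min(values)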

We compare the robust performance of π^Lin\hat{\pi}_{\mathrm{Lin}}, π^DRO\hat{\pi}_{\mathrm{DRO}}, and the POEM policy π^POEM\hat{\pi}_{\mathrm{POEM}} introduced in [76]. The regularization parameter of the POEM estimator is chosen from {0.05,0.1,0.2,0.5,1}\{0.05,0.1,0.2,0.5,1\}, and we find the results are insensitive to the regularization parameter. We fix the uncertainty radius δ=0.2\delta=0.2 used in the training procedure and the test set size n=2500n^{\prime}=2500. In Table 3, we let the training set size range from 500500 to 25002500, and we fix δtest=δ=0.2\delta^{\mathrm{test}}=\delta=0.2, while in Table 4, we fix the training set size to be n=2500n=2500, and we let the magnitude of “environment change” δtest\delta^{\mathrm{test}} range from 0.020.02 to 0.40.4. We denote by π^DRO0.2\hat{\pi}_{\mathrm{DRO}}^{0.2} the DRO policy trained with δ=0.2\delta=0.2. Tables 3 and 4 report the mean and the standard error of the mean of Q^DRO\hat{Q}_{\mathrm{DRO}} and Q^min\hat{Q}_{\mathrm{min}} computed using T=1000T=1000 i.i.d. experiments, where an independent training set and an independent test set are generated in each experiment. Figure 4 visualizes the relative differences between π^Lin/π^POEM\hat{\pi}_{\mathrm{Lin}}/\hat{\pi}_{\mathrm{POEM}} and π^DRO0.2\hat{\pi}^{0.2}_{\mathrm{DRO}} under distributionally shifted environments.

n=500n=500 n=1000n=1000 n=1500n=1500 n=2000n=2000 n=2500n=2500
Q^DRO\hat{Q}_{\mathrm{DRO}} π^Lin\hat{\pi}_{\mathrm{Lin}} 0.0852±0.00130.0852\pm 0.0013 0.1031±0.00080.1031\pm 0.0008 0.1093±0.00050.1093\pm 0.0005 0.1120±0.00050.1120\pm 0.0005 0.1135±0.00040.1135\pm 0.0004
π^POEM\hat{\pi}_{\mathrm{POEM}} 0.0621±0.00140.0621\pm 0.0014 0.0858±0.00090.0858\pm 0.0009 0.0972±0.00070.0972\pm 0.0007 0.1013±0.00060.1013\pm 0.0006 0.1057±0.00050.1057\pm 0.0005
π^DRO0.2\hat{\pi}^{0.2}_{\mathrm{DRO}} 0.0998±0.00110.0998\pm 0.0011 0.1120±0.00070.1120\pm 0.0007 0.1152±0.00050.1152\pm 0.0005 0.1166±0.00040.1166\pm 0.0004 0.1170±0.00040.1170\pm 0.0004
Q^min\hat{Q}_{\mathrm{min}} π^Lin\hat{\pi}_{\mathrm{Lin}} 0.2183±0.00110.2183\pm 0.0011 0.2347±0.00070.2347\pm 0.0007 0.2398±0.00050.2398\pm 0.0005 0.2426±0.00050.2426\pm 0.0005 0.2437±0.00050.2437\pm 0.0005
π^POEM\hat{\pi}_{\mathrm{POEM}} 0.2030±0.00110.2030\pm 0.0011 0.2230±0.00070.2230\pm 0.0007 0.2311±0.00060.2311\pm 0.0006 0.2344±0.00060.2344\pm 0.0006 0.2378±0.00050.2378\pm 0.0005
π^DRO0.2\hat{\pi}^{0.2}_{\mathrm{DRO}} 0.2249±0.00090.2249\pm 0.0009 0.2384±0.00060.2384\pm 0.0006 0.2428±0.00050.2428\pm 0.0005 0.2439±0.00050.2439\pm 0.0005 0.2460±0.00050.2460\pm 0.0005
Table 3: Comparison of robust performance for different training sizes nn when δ=δtest=0.2\delta=\delta^{\mathrm{test}}=0.2.
δtest=0.02\delta^{\rm test}=0.02 δtest=0.06\delta^{\rm test}=0.06 δtest=0.10\delta^{\rm test}=0.10 δtest=0.20\delta^{\rm test}=0.20 δtest=0.30\delta^{\rm test}=0.30 δtest=0.40\delta^{\rm test}=0.40
Q^DRO\hat{Q}_{\mathrm{DRO}} π^Lin\hat{\pi}_{\mathrm{Lin}} 0.2141±0.00030.2141\pm 0.0003 0.1783±0.00030.1783\pm 0.0003 0.1546±0.00040.1546\pm 0.0004 0.1132±0.00040.1132\pm 0.0004 0.0840±0.00050.0840\pm 0.0005 0.0601±0.00050.0601\pm 0.0005
π^POEM\hat{\pi}_{\mathrm{POEM}} 0.2097±0.00030.2097\pm 0.0003 0.1734±0.00040.1734\pm 0.0004 0.1497±0.00040.1497\pm 0.0004 0.1087±0.00050.1087\pm 0.0005 0.0787±0.00050.0787\pm 0.0005 0.0543±0.00050.0543\pm 0.0005
π^DRO0.2\hat{\pi}^{0.2}_{\mathrm{DRO}} 0.2164±0.00030.2164\pm 0.0003 0.1805±0.00030.1805\pm 0.0003 0.1574±0.00030.1574\pm 0.0003 0.1170±0.00040.1170\pm 0.0004 0.0882±0.00040.0882\pm 0.0004 0.0646±0.00050.0646\pm 0.0005
Q^min\hat{Q}_{\mathrm{min}} π^Lin\hat{\pi}_{\mathrm{Lin}} 0.2602±0.00050.2602\pm 0.0005 0.2545±0.00050.2545\pm 0.0005 0.2516±0.00050.2516\pm 0.0005 0.2443±0.00050.2443\pm 0.0005 0.2378±0.00050.2378\pm 0.0005 0.2305±0.00050.2305\pm 0.0005
π^POEM\hat{\pi}_{\mathrm{POEM}} 0.2556±0.00050.2556\pm 0.0005 0.2511±0.00050.2511\pm 0.0005 0.2472±0.00050.2472\pm 0.0005 0.2400±0.00050.2400\pm 0.0005 0.2334±0.00050.2334\pm 0.0005 0.2263±0.00050.2263\pm 0.0005
π^DRO0.2\hat{\pi}^{0.2}_{\mathrm{DRO}} 0.2613±0.00040.2613\pm 0.0004 0.2563±0.00040.2563\pm 0.0004 0.2532±0.00040.2532\pm 0.0004 0.2461±0.00050.2461\pm 0.0005 0.2397±0.00050.2397\pm 0.0005 0.2329±0.00050.2329\pm 0.0005
Table 4: Comparison of robust performance for different test environments δtest\delta_{\mathrm{test}} when δ=0.2\delta=0.2 and n=2500n=2500.
Figure 4: Difference of robust performance for different test environments δtest\delta_{\mathrm{test}} when δ=0.2\delta=0.2 and n=2500n=2500: (a) Q^DRO\hat{Q}_{\mathrm{DRO}}; (b) Q^min\hat{Q}_{\mathrm{min}}.

We can easily observe from Table 3 that π^DRO0.2\hat{\pi}_{\mathrm{DRO}}^{0.2} achieves the best robust performance among all three policies, and its superiority is significant in most cases, which implies π^DRO\hat{\pi}_{\mathrm{DRO}} is more resilient to adversarial perturbations. We also highlight that π^DRO\hat{\pi}_{\mathrm{DRO}} has a smaller standard deviation (T×\sqrt{T}\times standard error) in Table 3 and that the superiority of π^DRO\hat{\pi}_{\mathrm{DRO}} is more pronounced for smaller training sets, indicating that π^DRO\hat{\pi}_{\mathrm{DRO}} is a more stable estimator than π^Lin\hat{\pi}_{\mathrm{Lin}} and π^POEM\hat{\pi}_{\mathrm{POEM}}. In Table 4 and Figure 4, we find that π^DRO\hat{\pi}_{\mathrm{DRO}} significantly outperforms π^Lin\hat{\pi}_{\mathrm{Lin}} and π^POEM\hat{\pi}_{\mathrm{POEM}} for a wide range of δtest\delta^{\mathrm{test}} even if the model is misspecified in the sense that δtestδ\delta^{\mathrm{test}}\neq\delta, and the results for small δtest\delta^{\mathrm{test}} indicate that our method may help alleviate overfitting. These results show that our method is insensitive to the choice of the uncertainty radius δ\delta in the training procedure.

6 Real Data Experiments: Application on a Voting Dataset

In this section, we compare the empirical performance of different estimators on a voting dataset concerned with the August 2006 primary election (data available at https://github.com/gsbDBI/ExperimentData/tree/master/Social). This dataset was originally collected by [35] to study the effect of social pressure on electoral participation rates. Later, the dataset was employed by [91] to study the empirical performance of several offline policy learning algorithms. We apply different policy learning algorithms to this dataset and illustrate some interesting findings.

6.1 Dataset Description

For completeness, we borrow the description of the dataset from [91], since we use almost the same data preprocessing procedure (apart from a different reward definition). We focus only on aspects that are relevant to our current policy learning context.

The dataset contains 180002180002 data points (i.e. n=180002n=180002), each corresponding to a single voter in a different household. The voters span the entire state of Michigan. There are ten voter characteristics in the dataset: year of birth, sex, household size, city, g2000, g2002, g2004, p2000, p2002, and p2004. The first four features are self-explanatory. The next three features are outcomes for whether a voter voted in the general elections in 2000, 2002 and 2004, respectively: 11 was recorded if the voter did vote and 0 was recorded if the voter did not vote. The last three features are the analogous outcomes for whether a voter voted in the primary elections in 2000, 2002 and 2004. As [35] pointed out, these 10 features are commonly used as covariates for predicting whether an individual voter will vote.

There are five actions in total, as listed below:

Nothing: No action is performed.

Civic: A letter with “Do your civic duty” is mailed to the household before the primary election.

Monitored: A letter with “You are being studied” is mailed to the household before the primary election. Voters receiving this letter are informed that whether they vote or not in this election will be observed.

Self History: A letter with the voter’s past voting records as well as the voting records of other voters who live in the same household is mailed to the household before the primary election. The letter also indicates that, once the election is over, a follow-up letter on whether the voter voted will be sent to the household.

Neighbors: A letter with the voting records of this voter, the voters living in the same household, and the voters who are neighbors of this household is mailed to the household before the primary election. The letter also indicates that all your neighbors will be able to see your past voting records and that follow-up letters will be sent so that whether this voter voted in the upcoming election will become public knowledge among the neighbors.

In collecting this dataset, these five actions were randomly chosen independently of everything else, with probabilities equal to 1018,218,218,218,218\frac{10}{18},\frac{2}{18},\frac{2}{18},\frac{2}{18},\frac{2}{18} (in the same order as listed above). The outcome is whether a voter voted in the 2006 primary election, which is either 1 or 0. It is not hard to imagine that Neighbors is the best policy for the whole population, as it exerts the highest social pressure on people to vote. Therefore, instead of directly using the voting outcome as a reward, we define YiY_{i}, the reward associated with voter ii, as the voting outcome minus the social cost of deploying an action to this voter, namely,

Yi(a)=𝟏{voter i votes under action a}ca,a𝒜,Y_{i}(a)=\mathbf{1}\{\mbox{voter $i$ votes under action $a$}\}-c_{a},\quad\forall a\in\mathcal{A},

where cac_{a} is the cost of deploying action aa. Here, we set the cost vector to (0.3,0.32,0.34,0.36,0.38)(0.3,0.32,0.34,0.36,0.38), in the order the actions are listed above, so that each cost is close to the empirical average of the corresponding action.
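A minimal sketch of this reward construction (the array names voted and action_idx are hypothetical placeholders for the preprocessed dataset columns):

import numpy as np

# Costs in the order: Nothing, Civic, Monitored, Self History, Neighbors.
COSTS = np.array([0.30, 0.32, 0.34, 0.36, 0.38])

def build_reward(voted, action_idx):
    # Y_i = 1{voter i voted in the 2006 primary} - c_{A_i}.
    return np.asarray(voted, dtype=float) - COSTS[np.asarray(action_idx)]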

6.2 Decision Trees and Greedy Tree Search

We introduce the decision-tree policy classes. We follow the convention in [9]. A depth-LL tree has LL layers in total: branch nodes live in the first L1L-1 layers, while the leaf nodes live in the last layer. Each branch node is specified by the variable to be split on and the threshold bb. At a branch node, each component of the pp-dimensional feature vector xx can be chosen as a split variable. The set of all depth-LL trees is denoted by ΠL\Pi_{L}. Then, Lemma 4 in [91] shows that

κ(n)(ΠL)(2L1)logp+2Llogd+43L1/42L1.\kappa^{(n)}(\Pi_{L})\leq\sqrt{(2^{L}-1)\log p+2^{L}\log d}+\frac{4}{3}L^{1/4}\sqrt{2^{L}-1}.

In the voting dataset experiment, we concentrate on the policy class ΠL\Pi_{L}.

The algorithm for decision-tree learning needs to be computationally efficient, since it will be executed repeatedly in Line 7 of Algorithm 2 to compute argminπΠLW^n(π,α)\mathop{\rm arg\,min}_{\pi\in\Pi_{L}}\hat{W}_{n}(\pi,\alpha). Since finding an optimal classification tree is generally intractable (see [9]), we adopt a heuristic algorithm called greedy tree search. This procedure can be defined inductively. First, to learn a depth-2 tree, greedy tree search brute-forces over all possible splitting choices of the branch node and all possible actions of the leaf nodes. Suppose the learning procedure for depth-(L1)(L-1) trees has been defined. To learn a depth-LL tree, we first learn a depth-2 tree with the optimal branching node, which partitions the training data into two disjoint groups associated with the two leaf nodes. Then each leaf node is replaced by the depth-(L1)(L-1) tree trained on the data in the associated group.
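The following Python sketch illustrates this greedy recursion for a generic per-sample score (e.g., the IPW score Yi/π0(Ai|Xi), or the weighted terms appearing inside Ŵn); it is our own simplified illustration, not the paper's implementation, and it brute-forces over all observed thresholds at every node.

import numpy as np

def best_leaf(A, gamma, d):
    # Best single action on this subset: maximize sum_i gamma_i * 1{A_i = a}.
    vals = np.array([gamma[A == a].sum() for a in range(d)])
    return int(vals.argmax()), float(vals.max())

def greedy_tree(X, A, gamma, d, depth):
    # Greedy depth-`depth` policy tree maximizing sum_i gamma_i * 1{pi(X_i) = A_i}.
    # A depth-1 "tree" is a single leaf; the tree is returned as a nested dict.
    if depth <= 1 or len(A) == 0:
        action = best_leaf(A, gamma, d)[0] if len(A) else 0
        return {"leaf": action}
    best = None
    for j in range(X.shape[1]):                       # brute force over split variables
        for b in np.unique(X[:, j]):                  # ... and observed thresholds
            left = X[:, j] <= b
            v = best_leaf(A[left], gamma[left], d)[1] + best_leaf(A[~left], gamma[~left], d)[1]
            if best is None or v > best[0]:
                best = (v, j, b)
    _, j, b = best
    left = X[:, j] <= b
    return {"feature": j, "threshold": b,
            "left": greedy_tree(X[left], A[left], gamma[left], d, depth - 1),
            "right": greedy_tree(X[~left], A[~left], gamma[~left], d, depth - 1)}

def tree_policy(tree, x):
    # Evaluate the learned tree policy at a single point x.
    while "leaf" not in tree:
        tree = tree["left"] if x[tree["feature"]] <= tree["threshold"] else tree["right"]
    return tree["leaf"]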

6.3 Training and Evaluation Procedure

Consider a hypothetical experiment of designing a distributionally robust policy. Suppose that the training data are collected from some cities, and our goal is to learn a robust policy to be deployed in other cities. Dividing the training and test populations based on the city in which voters live creates both a covariate shift and a concept drift between the training set and the test set. Considering the covariate shift first, the distribution of year of birth generally differs across cities. As for the concept drift, it is conceivable that different groups of the population respond differently to the same action, depending on latent factors that are not reported in the dataset, such as occupation and education. The distribution of such latent factors also varies across cities, which results in concept drift. Consequently, we use the feature city to divide the training set and the test set, in order to test policy performance under “environmental change”.

The voting dataset contains 101 distinct cities. To comprehensively evaluate the out-of-sample policy performance, we adapt leave-one-out cross-validation to generate 101 pairs of training and test sets: each test set contains exactly one distinct city, and the corresponding training set is its complement. On each pair, we learn a non-robust depth-33 decision tree policy π^3\hat{\pi}_{3} and distributionally robust decision tree policies π^DRO\hat{\pi}_{\mathrm{DRO}} in Π3\Pi_{3} for δ{0.1,0.2,0.3,0.4}\delta\in\{0.1,0.2,0.3,0.4\}; then, on the test set, the policies are evaluated using the unbiased IPW estimator

Q^IPW(π)1ni=1n𝟏{π(Xi)=Ai}π0(AiXi)Yi(Ai).\hat{Q}_{\mathrm{IPW}}(\pi)\triangleq\frac{1}{n}\sum_{i=1}^{n}\frac{\mathbf{1}\{\pi(X_{i})=A_{i}\}}{\pi_{0}(A_{i}\mid X_{i})}Y_{i}(A_{i}).

Consequently, for each policy π\pi we obtain 101101 Q^IPW(π)\hat{Q}_{\mathrm{IPW}}(\pi) scores, one for each of the 101101 test sets.
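A minimal sketch of this leave-one-city-out evaluation loop (an illustration; the array names X, A, Y, prop, city and the learn_policy callback, e.g. a wrapper around the greedy tree learner sketched in Section 6.2, are our own assumptions):

import numpy as np

def q_ipw(policy, X, A, Y, prop):
    # Unbiased IPW estimate of the value of `policy` on one test set.
    match = np.array([policy(x) == a for x, a in zip(X, A)], dtype=float)
    return float(np.mean(match * Y / prop))

def leave_one_city_out(X, A, Y, prop, city, learn_policy):
    # For each city c: train on all other cities, evaluate by IPW on city c.
    scores = {}
    for c in np.unique(city):
        test, train = (city == c), (city != c)
        pi_hat = learn_policy(X[train], A[train], Y[train], prop[train])
        scores[c] = q_ipw(pi_hat, X[test], A[test], Y[test], prop[test])
    return scores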

6.4 Selection of Distributional Shift Size δ\delta

The distributional shift size δ\delta quantifies the level of robustness of the distributionally robust policy learning algorithm. The empirical performance of the algorithm substantially depends on the selection of δ\delta. On the one hand, if δ\delta is too small, the robustification effect is negligible and the algorithm learns an over-aggressive policy; on the other hand, if δ\delta is overly large, the policy is over-conservative, always choosing the action with the smallest reward variation. We remark that the selection of δ\delta is more of a managerial decision than a scientific procedure: it depends on the decision-makers’ own risk-aversion level and their own perception of the new environments. In this section, we provide a guide to help select δ\delta for this voting dataset.

A natural approach to selecting δ\delta is to empirically estimate the size of the distributional shift using the training data. From the training set, we set aside the data from 20% of the cities as a validation set, whose distribution we denote by 𝐏20\mathbf{P}^{20}, and we use 𝐏80\mathbf{P}^{80} to denote the distribution of the remaining 80% of the training set. We then estimate D(𝐏20||𝐏80)D(\mathbf{P}^{20}||\mathbf{P}^{80}), which reasonably quantifies the size of the distributional shift across different cities. To this end, we decompose the distributional shift into two parts,

D(𝐏20||𝐏80)=D(𝐏X20||𝐏X80)marginal distribution of X+𝐄𝐏X20[D(𝐏Y20|X||𝐏Y80|X)]conditional distribution of Y given X,D(\mathbf{P}^{20}||\mathbf{P}^{80})=\underbrace{D(\mathbf{P}_{X}^{20}||\mathbf{P}_{X}^{80})}_{\mbox{marginal distribution of $X$}}+\underbrace{\mathbf{E}_{\mathbf{P}_{X}^{20}}[D(\mathbf{P}_{Y}^{20}|X||\mathbf{P}_{Y}^{80}|X)]}_{\mbox{conditional distribution of $Y$ given $X$}},

where 𝐏Xi\mathbf{P}_{X}^{i} denotes the XX-marginal distribution of 𝐏i\mathbf{P}^{i}, and 𝐏Yi|X\mathbf{P}_{Y}^{i}|X denotes the conditional distribution of YY given XX under 𝐏i\mathbf{P}^{i}, for i=20,80i=20,80. To estimate the size of the marginal distributional shift D(𝐏X20||𝐏X80)D(\mathbf{P}_{X}^{20}||\mathbf{P}_{X}^{80}), we first apply grouping to features such as year of birth, in order to avoid an infinite KL-divergence. Next we focus on the conditional distributional shift D(𝐏Y20|X||𝐏Y80|X)D(\mathbf{P}_{Y}^{20}|X||\mathbf{P}_{Y}^{80}|X). Noticing that the value of Y(a)Y(a) is binary for each aa, we fit two logistic regression models separately for 𝐏20\mathbf{P}^{20} and 𝐏80\mathbf{P}^{80} to estimate the conditional distribution of Y(a)Y(a) given XX. We estimate D(𝐏Y20|X||𝐏Y80|X)D(\mathbf{P}_{Y}^{20}|X||\mathbf{P}_{Y}^{80}|X) using the fitted logistic regression models, and then take the expectation of XX over 𝐏X20\mathbf{P}_{X}^{20}. We repeat the 80%/20% random splitting 100 times and compute D(𝐏20||𝐏80)D(\mathbf{P}^{20}||\mathbf{P}^{80}) with this procedure each time. Additional experimental details are reported in Appendix B.2. The empirical CDF of the estimated δ\delta from those 100100 experiments is reported in Figure 5(a). It is easy to see that approximately 90% of the estimated δ\deltas are less than 0.20.2.
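The following rough Python sketch illustrates one way to carry out this estimation (an illustration only: the feature grouping, the additive smoothing that prevents infinite KL terms, the treatment of a single binary outcome, and the use of scikit-learn's LogisticRegression are all our own assumptions rather than the paper's exact procedure).

import numpy as np
from sklearn.linear_model import LogisticRegression

def kl_discrete(p, q, eps=1e-12):
    # KL divergence between two discrete distributions given as aligned count vectors.
    p, q = np.asarray(p, dtype=float) + eps, np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def marginal_shift(X20_binned, X80_binned):
    # KL(P_X^20 || P_X^80), treating the grouped covariate vector as a discrete variable.
    keys = sorted({tuple(r) for r in np.vstack([X20_binned, X80_binned]).tolist()})
    def hist(Xb):
        counts = dict.fromkeys(keys, 0)
        for r in Xb.tolist():
            counts[tuple(r)] = counts[tuple(r)] + 1
        return np.array([counts[k] for k in keys], dtype=float)
    return kl_discrete(hist(X20_binned), hist(X80_binned))

def conditional_shift(X20, y20, X80, y80):
    # E_{X ~ P^20}[ KL( Bern(p20(X)) || Bern(p80(X)) ) ] for one binary outcome,
    # with p20 and p80 fitted by logistic regression on the two splits.
    m20 = LogisticRegression(max_iter=1000).fit(X20, y20)
    m80 = LogisticRegression(max_iter=1000).fit(X80, y80)
    p20 = np.clip(m20.predict_proba(X20)[:, 1], 1e-6, 1 - 1e-6)
    p80 = np.clip(m80.predict_proba(X20)[:, 1], 1e-6, 1 - 1e-6)
    return float(np.mean(p20 * np.log(p20 / p80) + (1 - p20) * np.log((1 - p20) / (1 - p80))))

# delta_hat = marginal_shift(...) + conditional_shift(...)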

Figure 5: Selection of distributional shift size δ\delta: (a) empirical CDF for the estimated δ\delta; (b) sensitivity of the policy to δ\delta.

Besides explicitly estimating the uncertainty size, we also check the sensitivity of our policy to δ\delta. We present the reward profile of our distributionally robust policy in Figure 5(b), where the x-axis represents the δ\delta used in the policy training process. The top black line is the non-robust pure value function, which appears to be almost invariant among policies with different robustness levels for δ[0,0.4]\delta\in[0,0.4]. The blue line is the distributionally robust value function with δ\delta fixed at 0.20.2. We remark that the non-robust policy π^DRO0\hat{\pi}_{\mathrm{DRO}}^{0} has a deficient performance in terms of the robust value function, yet the robust value function improves as the robustness level of the policy increases and becomes insensitive to δ\delta once δ\delta is larger than 0.20.2. Finally, the red line is the distributionally robust value function evaluated at the same δ\delta used in the training process; it is therefore the actual training reward.

6.5 Experimental Result and Interpretation

We summarize some important statistics of the Q^IPW(π)\hat{Q}_{\mathrm{IPW}}(\pi) scores in Table 5, including the mean, standard deviation, minimum value, 5th percentile, 10th percentile, and 20th percentile. All statistics are calculated based on the results over the 101 test sets.

mean std min 5th percentile 10th percentile 20th percentile
Q^IPW(π^3)\hat{Q}_{\mathrm{IPW}}(\hat{\pi}_{3}) 0.0386 0.0991 -0.2844 -0.1104 -0.0686 -0.0358
Q^IPW(π^DRO)\hat{Q}_{\mathrm{IPW}}(\hat{\pi}_{\mathrm{DRO}}) δ=0.1\delta=0.1 0.0458 0.0989 -0.2321 -0.1007 -0.0489 -0.0223
δ=0.2\delta=0.2 0.0368 0.0895 -0.2314 -0.0785 -0.0518 -0.0217
δ=0.3\delta=0.3 0.0397 0.0864 -0.2313 -0.0677 -0.0407 -0.0190
δ=0.4\delta=0.4 0.0383 0.0863 -0.2312 -0.0677 -0.0429 -0.0202
Table 5: Comparison of important statistics for voting dataset.

We remark that the mean value of Q^IPW(π^DRO)\hat{Q}_{\mathrm{IPW}}(\hat{\pi}_{\mathrm{DRO}}) is comparable to that of Q^IPW(π^3)\hat{Q}_{\mathrm{IPW}}(\hat{\pi}_{3}), and it is even better when an appropriate value of δ\delta (such as δ=0.1\delta=0.1) is selected. One can also observe that Q^IPW(π^DRO)\hat{Q}_{\mathrm{IPW}}(\hat{\pi}_{\mathrm{DRO}}) has a smaller standard deviation and a larger minimum value compared with Q^IPW(π^3)\hat{Q}_{\mathrm{IPW}}(\hat{\pi}_{3}), and the difference becomes larger as δ\delta increases. The comparison of the 5th, 10th, and 20th percentiles also indicates that π^DRO\hat{\pi}_{\mathrm{DRO}} performs better than π^3\hat{\pi}_{3} in “bad” (or “adversarial”) scenarios of environmental change, which is exactly the desired behavior of π^DRO\hat{\pi}_{\mathrm{DRO}} by design.

To reinforce our observation in Table 5, we visualize and compare the distribution of Q^IPW(π^DRO)\hat{Q}_{\mathrm{IPW}}(\hat{\pi}_{\mathrm{DRO}}) and Q^IPW(π^3)\hat{Q}_{\mathrm{IPW}}(\hat{\pi}_{3}) in Figure 6, for (a) δ=0.1\delta=0.1 and (b) δ=0.4\delta=0.4. We notice that the histogram of Q^IPW(π^DRO)\hat{Q}_{\mathrm{IPW}}(\hat{\pi}_{\mathrm{DRO}}) is more concentrated than the histogram of Q^IPW(π^3)\hat{Q}_{\mathrm{IPW}}(\hat{\pi}_{3}), which supports our observation that π^DRO\hat{\pi}_{\mathrm{DRO}} is more robust.

Figure 6: Comparison of the distribution of Q^IPW\hat{Q}_{\mathrm{IPW}} between the distributionally robust decision tree and the non-robust decision tree: (a) δ=0.1\delta=0.1, (b) δ=0.4\delta=0.4.

We present two instances of distributionally robust decision trees in Figure 7: (a) is an instance of a robust tree with δ=0.1\delta=0.1, and (b) is an instance of a robust tree with δ=0.4\delta=0.4. We remark that the decision tree in (b) deploys the action Nothing to most of the potential voters, because almost all the individuals in the dataset were born after 1917 and have a household size smaller than 6. For a large value of δ\delta, the distributionally robust policy π^DRO\hat{\pi}_{\mathrm{DRO}} becomes almost degenerate, selecting only Nothing, the action with the smallest reward variation.

Figure 7: Examples of distributionally robust decision trees when (a) δ=0.1\delta=0.1, (b) δ=0.4\delta=0.4.

7 Extension to ff-divergence Uncertainty Set

In this section, we generalize the KL-divergence uncertainty set to general ff-divergence uncertainty sets. Here, we define the ff-divergence between 𝐏\mathbf{P} and 𝐏0\mathbf{P}_{0} as

Df(𝐏||𝐏0)f(d𝐏d𝐏0)d𝐏0,D_{f}(\mathbf{P}||\mathbf{P}_{0})\triangleq\int f\left(\frac{d\mathbf{P}}{d\mathbf{P}_{0}}\right)d\mathbf{P}_{0},

where f:𝐑𝐑+{+}f:\mathbf{R}\rightarrow\mathbf{R}_{+}\cup\{+\infty\} is a convex function satisfying f(1)=0f(1)=0 and f(t)=+f(t)=+\infty for any t<0t<0. Then, we define the ff-divergence uncertainty set as 𝒰𝐏0f(δ){𝐏𝐏0Df(𝐏||𝐏0)δ}.\mathcal{U}^{f}_{\mathbf{P}_{0}}(\delta)\triangleq\{\mathbf{P}\ll\mathbf{P}_{0}\mid D_{f}(\mathbf{P}||\mathbf{P}_{0})\leq\delta\}. Accordingly, the distributionally robust value function is defined below.

Definition 6.

For a given δ>0\delta>0, the distributionally robust value function QDROf:Π𝐑Q^{f}_{\mathrm{\rm DRO}}:\Pi\rightarrow\mathbf{R} is defined as: QDROf(π)inf𝐏𝒰𝐏0f(δ)𝐄𝐏[Y(π(X))]Q^{f}_{\mathrm{\rm DRO}}(\pi)\triangleq\inf_{\mathbf{P}\in\mathcal{U}^{f}_{\mathbf{P}_{0}}(\delta)}\mathbf{E}_{\mathbf{P}}[Y(\pi(X))].

We focus on the Cressie-Read family of ff-divergences, defined in [19]. For k(1,),k\in(1,\infty), the function fkf_{k} is defined as

fk(t)tkkt+k1k(k1).f_{k}(t)\triangleq\frac{t^{k}-kt+k-1}{k(k-1)}.

As k1,k\rightarrow 1, fkf1(t)=tlogtt+1,f_{k}\rightarrow f_{1}(t)=t\log t-t+1, which recovers the KL-divergence. For ease of notation, we use QDROk()Q_{\mathrm{\mathrm{DRO}}}^{k}\left(\cdot\right), 𝒰𝐏0k(δ)\mathcal{U}_{\mathbf{P}_{0}}^{k}(\delta), and Dk(||)D_{k}\left(\cdot||\cdot\right) as shorthands for QDROf()Q_{\mathrm{\mathrm{DRO}}}^{f}\left(\cdot\right), 𝒰𝐏0f(δ)\mathcal{U}_{\mathbf{P}_{0}}^{f}(\delta), and Df(||),D_{f}\left(\cdot||\cdot\right), respectively, for k[1,).k\in[1,\infty). We further define kk/(k1)k_{\ast}\triangleq k/(k-1) and ck(δ)(1+k(k1)δ)1/k.c_{k}(\delta)\triangleq(1+k(k-1)\delta)^{1/k}. Then, [24] give the following duality result.

Lemma 8.

For any Borel measure 𝐏\mathbf{P} supported on the space 𝒳×j=1d𝒴j\mathcal{X}\times\prod_{j=1}^{d}\mathcal{Y}_{j} and k(1,),k\in(1,\infty), we have

inf𝐐𝒰𝐏k(δ)𝐄𝐐[Y(π(X))]=supα𝐑{ck(δ)𝐄𝐏[(Y(π(X))+α)+k]1k+α}.\inf_{\mathbf{Q}\in\mathcal{U}^{k}_{\mathbf{P}}(\delta)}\mathbf{E}_{\mathbf{Q}}[Y(\pi(X))]=\sup_{\alpha\in\mathbf{R}}\left\{-c_{k}\left(\delta\right){\mathbf{E}}_{\mathbf{P}}\left[\left(-Y(\pi(X))+\alpha\right)_{+}^{k_{\ast}}\right]^{\frac{1}{k_{\ast}}}+\alpha\right\}.
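For an empirical distribution, this dual again reduces to a one-dimensional concave maximization over α. The following sketch is a numerical illustration under our own assumptions (the empirical plug-in and a heuristic search bracket for α), not the paper's implementation.

import numpy as np
from scipy.optimize import minimize_scalar

def cressie_read_dual(y, delta, k):
    # sup_alpha { -c_k(delta) * E[(-Y + alpha)_+^{k*}]^{1/k*} + alpha }
    # under the empirical distribution of the sample y.
    y = np.asarray(y, dtype=float)
    k_star = k / (k - 1.0)
    c_k = (1.0 + k * (k - 1.0) * delta) ** (1.0 / k)

    def neg_obj(alpha):
        moment = np.mean(np.maximum(alpha - y, 0.0) ** k_star)
        return -(-c_k * moment ** (1.0 / k_star) + alpha)

    # Heuristic bracket: the optimizer lies above min(y) and, for moderate delta,
    # within a few standard deviations of max(y).
    lo, hi = y.min() - 1.0, y.max() + 10.0 * (y.std() + 1.0)
    res = minimize_scalar(neg_obj, bounds=(lo, hi), method="bounded")
    return -res.fun

rng = np.random.default_rng(3)
y = rng.normal(loc=0.5, scale=0.2, size=1000)
print(cressie_read_dual(y, delta=0.2, k=2.0))  # k = 2 corresponds to the chi-square case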

We then generalize Lemma 5 to Lemma 9 and Lemma 6 to Lemma 10 for fkf_{k}-divergence uncertainty set.

Lemma 9.

For any probability measures 𝐏1,𝐏2\mathbf{P}_{1},\mathbf{P}_{2} supported on 𝐑\mathbf{R} and k[1,+)k\in[1,+\infty), we have

|supα𝐑{ck(δ)𝐄𝐏1[(Y+α)+k]1k+α}supα𝐑{ck(δ)𝐄𝐏2[(Y+α)+k]1k+α}|\displaystyle\left|\sup_{\alpha\in\mathbf{R}}\left\{-c_{k}\left(\delta\right)\mathbf{E}_{\mathbf{P}_{1}}\left[\left(-Y+\alpha\right)_{+}^{k_{\ast}}\right]^{\frac{1}{k_{\ast}}}+\alpha\right\}-\sup_{\alpha\in\mathbf{R}}\left\{-c_{k}\left(\delta\right)\mathbf{E}_{\mathbf{P}_{2}}\left[\left(-Y+\alpha\right)_{+}^{k_{\ast}}\right]^{\frac{1}{k_{\ast}}}+\alpha\right\}\right|
\displaystyle\leq ck(δ)supt[0,1]|q𝐏1(t)q𝐏2(t)|.\displaystyle c_{k}(\delta)\sup_{t\in[0,1]}\left|q_{\mathbf{P}_{1}}\left(t\right)-q_{\mathbf{P}_{2}}\left(t\right)\right|.
Lemma 10.

Suppose 𝐏1\mathbf{P}_{1} and 𝐏2\mathbf{P}_{2} are supported on 𝔻\mathbb{D} and satisfy Assumption 1.3. We further assume 𝐏2\mathbf{P}_{2} satisfies Assumption 2.2. When TV(𝐏1,𝐏2)<b¯/2,\mathrm{TV}(\mathbf{P}_{1},\mathbf{P}_{2})<\underline{b}/2, we have for k>1k>1

|supα𝐑{ck(δ)𝐄𝐏1[(Y+α)+k]1k+α}supα𝐑{ck(δ)𝐄𝐏2[(Y+α)+k]1k+α}|\displaystyle\left|\sup_{\alpha\in\mathbf{R}}\left\{-c_{k}\left(\delta\right)\mathbf{E}_{\mathbf{P}_{1}}\left[\left(-Y+\alpha\right)_{+}^{k_{\ast}}\right]^{\frac{1}{k_{\ast}}}+\alpha\right\}-\sup_{\alpha\in\mathbf{R}}\left\{-c_{k}\left(\delta\right)\mathbf{E}_{\mathbf{P}_{2}}\left[\left(-Y+\alpha\right)_{+}^{k_{\ast}}\right]^{\frac{1}{k_{\ast}}}+\alpha\right\}\right|
\displaystyle\leq 2ck(δ)Mb¯k(b¯/2)1/kTV(𝐏1,𝐏2).\displaystyle\frac{2c_{k}\left(\delta\right)M}{\underline{b}k^{\ast}}\left(\underline{b}/2\right)^{1/k_{\ast}}\mathrm{TV}(\mathbf{P}_{1},\mathbf{P}_{2}).

where TV\mathrm{TV} denotes the total variation distance.

The proofs of Lemmas 9 and 10 are in Appendix A.7. As an analog of Definitions 3 and 4 and Equation (4), we further make the following definition.

Definition 7.
  1. 1.

    The distributionally robust value estimator Q^DROk:Π𝐑\hat{Q}^{k}_{\mathrm{\rm DRO}}:\Pi\rightarrow\mathbf{R} is defined by

\hat{Q}^{k}_{\mathrm{\rm DRO}}(\pi)\triangleq\sup_{\alpha\in\mathbf{R}}\left\{-c_{k}\left(\delta\right)\left(\frac{1}{nS_{n}^{\pi}}\sum_{i=1}^{n}\frac{\mathbf{1}\{\pi(X_{i})=A_{i}\}}{\pi_{0}(A_{i}\mid X_{i})}\left(-Y_{i}(A_{i})+\alpha\right)_{+}^{k_{\ast}}\right)^{\frac{1}{k_{\ast}}}+\alpha\right\}.
  2. 2.

    The distributionally robust regret RDROk(π)R^{k}_{\mathrm{\rm DRO}}(\pi) of a policy πΠ\pi\in\Pi is defined as

    RDROk(π)maxπΠinf𝐏𝒰𝐏0k(δ)𝐄𝐏[Y(π(X))]inf𝐏𝒰𝐏0k(δ)𝐄𝐏[Y(π(X))].R^{k}_{\mathrm{\rm DRO}}(\pi)\triangleq\max_{\pi^{\prime}\in\Pi}\inf_{\mathbf{P}\in\mathcal{U}^{k}_{\mathbf{P}_{0}}(\delta)}\mathbf{E}_{\mathbf{P}}[Y(\pi^{\prime}(X))]-\inf_{\mathbf{P}\in\mathcal{U}^{k}_{\mathbf{P}_{0}}(\delta)}\mathbf{E}_{\mathbf{P}}[Y(\pi(X))].
  3. 3.

    The optimal policy π^DROk\hat{\pi}^{k}_{\rm DRO} which maximizes the value of Q^DROk\hat{Q}^{k}_{\rm DRO} is defined as

    π^DROkargmaxπΠQ^DROk(π).\hat{\pi}^{k}_{\rm DRO}\triangleq\arg\max_{\pi\in\Pi}\hat{Q}^{k}_{\rm DRO}(\pi).

By applying Lemmas 9 and 10 and following the same lines as the proof of Theorem 2, we obtain the analogous theorem below.

Theorem 4.

Suppose Assumption 1 is enforced and k>1k>1. With probability at least 1ε1-\varepsilon, under Assumption 2.1, we have

RDROk(π^DROk)\displaystyle R^{k}_{\rm DRO}(\hat{\pi}^{k}_{\rm DRO}) \displaystyle\leq 4ck(δ)b¯ηn(24(2+1)κ(n)(Π)+2log(2ε)+C),\displaystyle\frac{4c_{k}(\delta)}{\underline{b}\eta\sqrt{n}}\left(24(\sqrt{2}+1)\kappa^{(n)}\left(\Pi\right)+\sqrt{2\log\left(\frac{2}{\varepsilon}\right)}+C\right),

where CC is a universal constant; and under Assumption 2.2, when

n{4b¯η(24(2+1)κ(n)(Π)+48|𝔻|log(2)+2log(2ε))}2,n\geq\left\{\frac{4}{\underline{b}\eta}\left(24(\sqrt{2}+1)\kappa^{(n)}\left(\Pi\right)+48\sqrt{|\mathbb{D}|\log\left(2\right)}+\sqrt{2\log\left(\frac{2}{\varepsilon}\right)}\right)\right\}^{2},

we have

RDROk(π^DROk)4ck(δ)Mkb¯ηn(b¯/2)1/k(24(2+1)κ(n)(Π)+48|𝔻|log(2)+2log(2ε)).R^{k}_{\mathrm{DRO}}(\hat{\pi}^{k}_{\mathrm{DRO}})\leq\frac{4c_{k}\left(\delta\right)M}{k^{\ast}\underline{b}\eta\sqrt{n}}\left(\underline{b}/2\right)^{1/k_{\ast}}\left(24(\sqrt{2}+1)\kappa^{(n)}\left(\Pi\right)+48\sqrt{|\mathbb{D}|\log\left(2\right)}+\sqrt{2\log\left(\frac{2}{\varepsilon}\right)}\right).

We emphasize the importance of Assumption 2. Without Assumption 2, [24] show a minimax rate (in the supervised learning setting)

RDROk(π^DROk)=Op(n1k2logn),R^{k}_{\mathrm{\mathrm{DRO}}}(\hat{\pi}^{k}_{\mathrm{DRO}})=O_{p}\left(n^{-\frac{1}{k_{*}\vee 2}}\log n\right), (8)

which is much slower than the rate we obtain under the natural Assumption 2.

8 Conclusion

We have provided a distributionally robust formulation for policy evaluation and policy learning in batch contextual bandits. Our results focus on providing finite-sample learning guarantees. Especially interesting is that such learning is enabled by a dual optimization formulation.

A natural subsequent direction would be to extend the algorithm and results to the Wasserstein distance case for batch contextual bandits, which cannot be classified as a special case in our ff-divergence framework. We remark that the extension is non-trivial. Given a lower semicontinuous function cc, recall that the Wasserstein distance between two measures, 𝐏\mathbf{P} and 𝐐\mathbf{Q}, is defined as DWc(𝐏,𝐐)=minπΠ(𝐏,𝐐)Eπ{c(X,Y)},D_{W_{c}}(\mathbf{P},\mathbf{Q})=\min_{\pi\in\Pi(\mathbf{P},\mathbf{Q})}E_{\pi}\left\{c(X,Y)\right\}, where Π(𝐏,𝐐)\Pi(\mathbf{P},\mathbf{Q}) denotes the set of all joint distributions of the random vector (X,Y)(X,Y) with marginal distributions 𝐏\mathbf{P} and 𝐐\mathbf{Q}, respectively. The key distinguishing feature of Wasserstein distance is that unlike ff-divergence, it does not restrict the perturbed distributions to have the same support as 𝐏0\mathbf{P}_{0}, thus including more realistic scenarios. This feature, although desirable, also makes distributionally robust policy learning more challenging. To illustrate the difficulty, we consider the separable cost function family c((x,y1,y2,,yd),(x,y1,y2,,yd))=d(x,x)+αi=1dd(yi,yi)c\left(\left(x,y_{1},y_{2},\ldots,y_{d}\right),\left(x^{\prime},y_{1}^{\prime},y_{2}^{\prime},\ldots,y_{d}^{\prime}\right)\right)=d(x,x^{\prime})+\alpha\sum_{i=1}^{d}d(y_{i},y_{i}^{\prime}) with α>0\alpha>0, where dd is a metric. We aim to find πWDRO\pi_{\mathrm{W-DRO}}^{*} that maximizes QWDRO(π)infDWc(𝐏,𝐏0)δ𝐄𝐏[Y(π(X))]Q_{{\mathrm{W-DRO}}}(\pi)\triangleq\inf_{D_{W_{c}}(\mathbf{P},\mathbf{P}_{0})\leq\delta}\mathbf{E}_{\mathbf{P}}[Y(\pi(X))]. Leveraging strong duality results [12, 33, 56], we can write:

QWDRO(π)=supγ0{γδ+𝐄𝐏0[infu,v{v+γ(d(X,u)+αd(v,Y(π(u))))}]}.Q_{\mathrm{W-DRO}}(\pi)=\sup_{\gamma\geq 0}\left\{-\gamma\delta+\mathbf{E}_{\mathbf{P}_{0}}\left[\inf_{u,v}\left\{v+\gamma\left(d(X,u)+\alpha d(v,Y(\pi(u)))\right)\right\}\right]\right\}.

However, the difficulty now is that Y(π(u))Y(\pi(u)) is not observed if π0(u)π(u)\pi_{0}(u)\neq\pi(u). We leave this challenge for future work.

References

  • Abadeh et al. [2018] Soroosh Shafieezadeh Abadeh, Viet Anh Nguyen, Daniel Kuhn, and Peyman Mohajerin Esfahani. Wasserstein distributionally robust kalman filtering. In Advances in Neural Information Processing Systems, pages 8483–8492, 2018.
  • Abeille et al. [2017] Marc Abeille, Alessandro Lazaric, et al. Linear thompson sampling revisited. Electronic Journal of Statistics, 11(2):5165–5197, 2017.
  • Agrawal and Goyal [2013a] Shipra Agrawal and Navin Goyal. Further optimal regret bounds for thompson sampling. In Artificial intelligence and statistics, pages 99–107, 2013a.
  • Agrawal and Goyal [2013b] Shipra Agrawal and Navin Goyal. Thompson sampling for contextual bandits with linear payoffs. In International Conference on Machine Learning, pages 127–135, 2013b.
  • Araujo and Giné [1980] Aloisio Araujo and Evarist Giné. The central limit theorem for real and Banach valued random variables. John Wiley & Sons, 1980.
  • Bastani and Bayati [2020] Hamsa Bastani and Mohsen Bayati. Online decision making with high-dimensional covariates. Operations Research, 68(1):276–294, 2020.
  • Bayraksan and Love [2015] Güzin Bayraksan and David K Love. Data-driven stochastic programming using phi-divergences. In The Operations Research Revolution, pages 1–19. Catonsville: Institute for Operations Research and the Management Sciences, 2015.
  • Ben-david et al. [1995] Shai Ben-david, Nicolo Cesabianchi, David Haussler, and Philip M Long. Characterizations of learnability for classes of (0,…,n)-valued functions. Journal of Computer and System Sciences, 50(1):74–86, 1995.
  • Bertsimas and Dunn [2017] Dimitris Bertsimas and Jack Dunn. Optimal classification trees. Machine Learning, 106(7):1039–1082, 2017.
  • Bertsimas and Mersereau [2007] Dimitris Bertsimas and Adam J Mersereau. A learning approach for interactive marketing to a customer segment. Operations Research, 55(6):1120–1135, 2007.
  • Bertsimas and Sim [2004] Dimitris Bertsimas and Melvyn Sim. The price of robustness. Operations Research, 52(1):35–53, 2004.
  • Blanchet and Murthy [2019] Jose Blanchet and Karthyek Murthy. Quantifying distributional model risk via optimal transport. Mathematics of Operations Research, 44(2):565–600, 2019. doi: 10.1287/moor.2018.0936.
  • Bonnans and Shapiro [2013] J Frédéric Bonnans and Alexander Shapiro. Perturbation analysis of optimization problems. Springer Science & Business Media, 2013.
  • Bubeck et al. [2012] Sébastien Bubeck, Nicolo Cesa-Bianchi, et al. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends® in Machine Learning, 5(1):1–122, 2012.
  • Chapelle [2014] Olivier Chapelle. Modeling delayed feedback in display advertising. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 1097–1105. ACM, 2014.
  • Chen et al. [2018] Zhi Chen, Daniel Kuhn, and Wolfram Wiesemann. Data-driven chance constrained programs over wasserstein balls. arXiv preprint arXiv:1809.00210, 2018.
  • Chernozhukov et al. [2019] Victor Chernozhukov, Mert Demirer, Greg Lewis, and Vasilis Syrgkanis. Semi-parametric efficient policy learning with continuous actions. In Advances in Neural Information Processing Systems, pages 15039–15049, 2019.
  • Chu et al. [2011] Wei Chu, Lihong Li, Lev Reyzin, and Robert Schapire. Contextual bandits with linear payoff functions. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 208–214, 2011.
  • Cressie and Read [1984] Noel Cressie and Timothy RC Read. Multinomial goodness-of-fit tests. Journal of the Royal Statistical Society: Series B (Methodological), 46(3):440–464, 1984.
  • Daniely et al. [2011] Amit Daniely, Sivan Sabato, Shai Ben-David, and Shai Shalev-Shwartz. Multiclass learnability and the erm principle. In Proceedings of the 24th Annual Conference on Learning Theory, pages 207–232. JMLR Workshop and Conference Proceedings, 2011.
  • Daniely et al. [2012] Amit Daniely, Sivan Sabato, and Shai S Shwartz. Multiclass learning approaches: A theoretical comparison with implications. In Advances in Neural Information Processing Systems, pages 485–493, 2012.
  • Delage and Ye [2010] Erick Delage and Yinyu Ye. Distributionally robust optimization under moment uncertainty with application to data-driven problems. Operations research, 58(3):595–612, 2010.
  • Dimakopoulou et al. [2017] Maria Dimakopoulou, Susan Athey, and Guido Imbens. Estimation considerations in contextual bandits. arXiv preprint arXiv:1711.07077, 2017.
  • Duchi and Namkoong [2018] John Duchi and Hongseok Namkoong. Learning models with uniform performance via distributionally robust optimization. arXiv preprint arXiv:1810.08750, 2018.
  • Duchi and Namkoong [2019] John Duchi and Hongseok Namkoong. Variance-based regularization with convex objectives. The Journal of Machine Learning Research, 20(1):2450–2504, 2019.
  • Duchi et al. [2016] John Duchi, Peter Glynn, and Hongseok Namkoong. Statistics of robust optimization: A generalized empirical likelihood approach. arXiv preprint arXiv:1610.03425, 2016.
  • Duchi et al. [2019] John Duchi, Tatsunori Hashimoto, and Hongseok Namkoong. Distributionally robust losses against mixture covariate shifts. arXiv preprint arXiv:2007.13982, 2019.
  • Dudík et al. [2011] Miroslav Dudík, John Langford, and Lihong Li. Doubly robust policy evaluation and learning. In Proceedings of the 28th International Conference on Machine Learning, pages 1097–1104, 2011.
  • Dudley [1967] Richard M Dudley. The sizes of compact subsets of hilbert space and continuity of gaussian processes. Journal of Functional Analysis, 1(3):290–330, 1967.
  • Faury et al. [2020] Louis Faury, Ugo Tanielian, Elvis Dohmatob, Elena Smirnova, and Flavian Vasile. Distributionally robust counterfactual risk minimization. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 3850–3857, 2020.
  • Filippi et al. [2010] Sarah Filippi, Olivier Cappe, Aurélien Garivier, and Csaba Szepesvári. Parametric bandits: The generalized linear case. In Advances in Neural Information Processing Systems, pages 586–594, 2010.
  • Friedman et al. [2001] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. The elements of statistical learning, volume 1. Springer series in statistics New York, 2001.
  • Gao and Kleywegt [2016] Rui Gao and Anton J Kleywegt. Distributionally robust stochastic optimization with Wasserstein distance. arXiv preprint arXiv:1604.02199, 2016.
  • Gao et al. [2018] Rui Gao, Liyan Xie, Yao Xie, and Huan Xu. Robust hypothesis testing using wasserstein uncertainty sets. In Advances in Neural Information Processing Systems, pages 7902–7912, 2018.
  • Gerber et al. [2008] Alan S Gerber, Donald P Green, and Christopher W Larimer. Social pressure and voter turnout: Evidence from a large-scale field experiment. American Political Science Review, 102(1):33–48, 2008.
  • Ghosh and Lam [2019] Soumyadip Ghosh and Henry Lam. Robust analysis in stochastic simulation: Computation and performance guarantees. Operations Research, 2019.
  • Goldenshluger and Zeevi [2013] Alexander Goldenshluger and Assaf Zeevi. A linear response bandit problem. Stochastic Systems, 3(1):230–261, 2013.
  • Hamming [1950] Richard W Hamming. Error detecting and error correcting codes. The Bell system technical journal, 29(2):147–160, 1950.
  • Ho-Nguyen et al. [2020] Nam Ho-Nguyen, Fatma Kılınç-Karzan, Simge Küçükyavuz, and Dabeen Lee. Distributionally robust chance-constrained programs with right-hand side uncertainty under wasserstein ambiguity. arXiv preprint arXiv:2003.12685, 2020.
  • Hu and Hong [2013] Zhaolin Hu and L Jeff Hong. Kullback-leibler divergence constrained distributionally robust optimization. Available at Optimization Online, 2013.
  • Imbens [2004] Guido W Imbens. Nonparametric estimation of average treatment effects under exogeneity: A review. Review of Economics and statistics, 86(1):4–29, 2004.
  • Imbens and Rubin [2015] G.W. Imbens and D.B. Rubin. Causal Inference in Statistics, Social, and Biomedical Sciences. Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction. Cambridge University Press, 2015. ISBN 9780521885881.
  • Joachims et al. [2018] Thorsten Joachims, Adith Swaminathan, and Maarten de Rijke. Deep learning with logged bandit feedback. In International Conference on Learning Representations, May 2018.
  • Jun et al. [2017] Kwang-Sung Jun, Aniruddha Bhargava, Robert Nowak, and Rebecca Willett. Scalable generalized linear bandits: Online computation and hashing. In Advances in Neural Information Processing Systems, pages 99–109, 2017.
  • Kallus [2018] Nathan Kallus. Balanced policy evaluation and learning. Advances in Neural Information Processing Systems, pages 8895–8906, 2018.
  • Kallus and Zhou [2018] Nathan Kallus and Angela Zhou. Confounding-robust policy improvement. arXiv preprint arXiv:1805.08593, 2018.
  • Kitagawa and Tetenov [2018] Toru Kitagawa and Aleksey Tetenov. Who should be treated? empirical welfare maximization methods for treatment choice. Econometrica, 86(2):591–616, 2018.
  • Lam [2019] Henry Lam. Recovering best statistical guarantees via the empirical divergence-based distributionally robust optimization. Operations Research, 67(4):1090–1105, 2019.
  • Lam and Zhou [2017] Henry Lam and Enlu Zhou. The empirical likelihood approach to quantifying uncertainty in sample average approximation. Operations Research Letters, 45(4):301–307, 2017.
  • Lattimore and Szepesvári [2020] Tor Lattimore and Csaba Szepesvári. Bandit algorithms. Cambridge University Press, 2020.
  • Lee and Raginsky [2018] Jaeho Lee and Maxim Raginsky. Minimax statistical learning with wasserstein distances. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS’18, pages 2692–2701, USA, 2018. Curran Associates Inc.
  • Lehmann and Casella [2006] Erich L Lehmann and George Casella. Theory of point estimation. Springer Science & Business Media, 2006.
  • Li et al. [2010] Lihong Li, Wei Chu, John Langford, and Robert E Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th international conference on World wide web, pages 661–670. ACM, 2010.
  • Li et al. [2017] Lihong Li, Yu Lu, and Dengyong Zhou. Provably optimal algorithms for generalized linear contextual bandits. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 2071–2080. JMLR.org, 2017.
  • Luenberger and Ye [2010] David G Luenberger and Yinyu Ye. Linear and Nonlinear Programming, volume 228. Springer, 2010.
  • Mohajerin Esfahani and Kuhn [2018] Peyman Mohajerin Esfahani and Daniel Kuhn. Data-driven distributionally robust optimization using the Wasserstein metric: performance guarantees and tractable reformulations. Mathematical Programming, 171(1):115–166, Sep 2018. ISSN 1436-4646. doi: 10.1007/s10107-017-1172-1.
  • Namkoong and Duchi [2016] Hongseok Namkoong and John C Duchi. Stochastic gradient methods for distributionally robust optimization with f-divergences. In Proceedings of the 30th International Conference on Neural Information Processing Systems, pages 2216–2224. Red Hook: Curran Associates Inc., 2016.
  • Natarajan [1989] Balas K Natarajan. On learning sets and functions. Machine Learning, 4(1):67–97, 1989.
  • Nguyen et al. [2018] Viet Anh Nguyen, Daniel Kuhn, and Peyman Mohajerin Esfahani. Distributionally robust inverse covariance estimation: The Wasserstein shrinkage estimator. arXiv preprint arXiv:1805.07194, 2018.
  • Qu et al. [2021] Zhaonan Qu, Zhengyuan Zhou, Fang Cai, and Xia Li. Interpretable personalization via optimal linear decision boundaries. preprint, 2021.
  • Rakhlin and Sridharan [2016] Alexander Rakhlin and Karthik Sridharan. BISTRO: An efficient relaxation-based method for contextual bandits. In Proceedings of the International Conference on Machine Learning, pages 1977–1985, 2016.
  • Rigollet and Zeevi [2010] Philippe Rigollet and Assaf Zeevi. Nonparametric bandits with covariates. arXiv preprint arXiv:1003.1630, 2010.
  • Rosenbaum and Rubin [1983] Paul R Rosenbaum and Donald B Rubin. The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1):41–55, 1983.
  • Ruder [2016] Sebastian Ruder. An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747, 2016.
  • Rusmevichientong and Tsitsiklis [2010] Paat Rusmevichientong and John N Tsitsiklis. Linearly parameterized bandits. Mathematics of Operations Research, 35(2):395–411, 2010.
  • Russo and Van Roy [2014] Daniel Russo and Benjamin Van Roy. Learning to optimize via posterior sampling. Mathematics of Operations Research, 39(4):1221–1243, 2014.
  • Russo and Van Roy [2016] Daniel Russo and Benjamin Van Roy. An information-theoretic analysis of thompson sampling. The Journal of Machine Learning Research, 17(1):2442–2471, 2016.
  • Schwartz et al. [2017] Eric M Schwartz, Eric T Bradlow, and Peter S Fader. Customer acquisition via display advertising using multi-armed bandit experiments. Marketing Science, 36(4):500–522, 2017.
  • Shafieezadeh-Abadeh et al. [2015] Soroosh Shafieezadeh-Abadeh, Peyman Esfahani, and Daniel Kuhn. Distributionally robust logistic regression. In Advances in Neural Information Processing Systems 28, pages 1576–1584. 2015.
  • Shapiro [2017] Alexander Shapiro. Distributionally robust stochastic programming. SIAM Journal on Optimization, 27(4):2258–2275, 2017.
  • Shapiro et al. [2009] Alexander Shapiro, Darinka Dentcheva, and Andrzej Ruszczyński. Lectures on stochastic programming: modeling and theory. SIAM, 2009.
  • Si et al. [2020] Nian Si, Fan Zhang, Zhengyuan Zhou, and Jose Blanchet. Distributionally robust policy evaluation and learning in offline contextual bandits. In International Conference on Machine Learning, pages 8884–8894. PMLR, 2020.
  • Sinha et al. [2018] Aman Sinha, Hongseok Namkoong, and John Duchi. Certifiable distributional robustness with principled adversarial training. In International Conference on Learning Representations, 2018.
  • Slivkins et al. [2019] Aleksandrs Slivkins et al. Introduction to multi-armed bandits. Foundations and Trends® in Machine Learning, 12(1-2):1–286, 2019.
  • Staib and Jegelka [2017] Matthew Staib and Stefanie Jegelka. Distributionally robust deep learning as a generalization of adversarial training. In NIPS workshop on Machine Learning and Computer Security, 2017.
  • Swaminathan and Joachims [2015a] Adith Swaminathan and Thorsten Joachims. Batch learning from logged bandit feedback through counterfactual risk minimization. Journal of Machine Learning Research, 16:1731–1755, 2015a.
  • Swaminathan and Joachims [2015b] Adith Swaminathan and Thorsten Joachims. The self-normalized estimator for counterfactual learning. In Advances in Neural Information Processing Systems, pages 3231–3239, 2015b.
  • Tsybakov [2009] Alexandre B Tsybakov. Introduction to Nonparametric Estimation. Springer, 2009.
  • Van der Vaart [2000] Aad W. Van der Vaart. Asymptotic Statistics, volume 3. Cambridge University Press, 2000.
  • Vapnik and Chervonenkis [1971] VN Vapnik and A Ya Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16(2):264, 1971.
  • Volpi et al. [2018] Riccardo Volpi, Hongseok Namkoong, Ozan Sener, John Duchi, Vittorio Murino, and Silvio Savarese. Generalizing to unseen domains via adversarial data augmentation. arXiv preprint arXiv:1805.12018, 2018.
  • Wainwright [2019] Martin J Wainwright. High-dimensional statistics: A non-asymptotic viewpoint, volume 48. Cambridge University Press, 2019.
  • Yang [2018] Insoon Yang. Wasserstein distributionally robust stochastic control: A data-driven approach. arXiv preprint arXiv:1812.09808, 2018.
  • Yang et al. [2021] Wenhao Yang, Liangyu Zhang, and Zhihua Zhang. Towards theoretical understandings of robust markov decision processes: Sample complexity and asymptotics. arXiv preprint arXiv:2105.03863, 2021.
  • Zhang et al. [2012] Baqun Zhang, Anastasios A Tsiatis, Marie Davidian, Min Zhang, and Eric Laber. Estimating optimal treatment regimes from a classification perspective. Stat, 1(1):103–114, 2012.
  • Zhao and Guan [2018] Chaoyue Zhao and Yongpei Guan. Data-driven risk-averse stochastic optimization with Wasserstein metric. Operations Research Letters, 46(2):262 – 267, 2018. ISSN 0167-6377. doi: https://doi.org/10.1016/j.orl.2018.01.011.
  • Zhao and Jiang [2017] Chaoyue Zhao and Ruiwei Jiang. Distributionally robust contingency-constrained unit commitment. IEEE Transactions on Power Systems, 33(1):94–102, 2017.
  • Zhao et al. [2014] Ying-Qi Zhao, Donglin Zeng, Eric B Laber, Rui Song, Ming Yuan, and Michael Rene Kosorok. Doubly robust learning for estimating individualized treatment with censored data. Biometrika, 102(1):151–168, 2014.
  • Zhao et al. [2012] Yingqi Zhao, Donglin Zeng, A John Rush, and Michael R Kosorok. Estimating individualized treatment rules using outcome weighted learning. Journal of the American Statistical Association, 107(499):1106–1118, 2012.
  • Zhou et al. [2017] Xin Zhou, Nicole Mayer-Hamblett, Umer Khan, and Michael R Kosorok. Residual weighted learning for estimating individualized treatment rules. Journal of the American Statistical Association, 112(517):169–187, 2017.
  • Zhou et al. [2018] Zhengyuan Zhou, Susan Athey, and Stefan Wager. Offline multi-action policy learning: Generalization and optimization. arXiv preprint arXiv:1810.04778, 2018.

Appendix A Proofs of Main Results

Appendix A.1 Auxiliary Results

To prove Theorems 1 and 2, we first collect some results from stochastic optimization [71] and from complexity theory (Wainwright [82, Section 4]).

Definition 8 (Gâteaux and Hadamard directional differentiability).

Let B1B_{1} and B2B_{2} be Banach spaces and G:B1B2G:B_{1}\rightarrow B_{2} be a mapping. It is said that GG is directionally differentiable at a point μB1\mu\in B_{1} if the limit

Gμ(d)=limt0G(μ+td)G(μ)tG_{\mu}^{\prime}(d)=\lim_{t\downarrow 0}\frac{G\left(\mu+td\right)-G(\mu)}{t}

exists for all dB1.d\in B_{1}.

Furthermore, it is said that GG is Gâteaux directionally differentiable at μ\mu if the directional derivative Gμ(d)G_{\mu}^{\prime}(d) exists for all dB1d\in B_{1} and Gμ(d)G_{\mu}^{\prime}(d) is linear and continuous in d.d. For ease of notation, we also denote by Dμ(μ0)D_{\mu}(\mu_{0}) the operator Gμ0().G_{\mu_{0}}^{\prime}(\cdot).

Finally, it is said that GG is Hadamard directionally differentiable at μ\mu if the directional derivative Gμ(d)G_{\mu}^{\prime}(d) exists for all dB1d\in B_{1} and

Gμ(d)=limt0ddG(μ+td)G(μ)t.G_{\mu}^{\prime}(d)=\lim_{\begin{subarray}{c}t\downarrow 0\\ d^{\prime}\rightarrow d\end{subarray}}\frac{G\left(\mu+td^{\prime}\right)-G(\mu)}{t}.
Theorem A5 (Danskin theorem, Theorem 4.13 in [13]).

Let Θ𝐑d\Theta\subset\mathbf{R}^{d} be a nonempty compact set and BB be a Banach space. Suppose the mapping G:B×Θ𝐑G:B\times\Theta\rightarrow\mathbf{R} satisfies that G(μ,θ)G(\mu,\theta) and Dμ(μ,θ)D_{\mu}\left(\mu,\theta\right) are continuous on Oμ0×ΘO_{\mu_{0}}\times\Theta, where Oμ0BO_{\mu_{0}}\subset B is a neighborhood around μ0.\mu_{0}. Let ϕ:B𝐑\phi:B\rightarrow\mathbf{R} be the inf-functional ϕ(μ)=infθΘG(μ,θ)\phi(\mu)=\inf_{\theta\in\Theta}G(\mu,\theta) and Θ¯(μ)=argminθΘG(μ,θ).\bar{\Theta}(\mu)=\arg\min_{\theta\in\Theta}G(\mu,\theta). Then, the functional ϕ\phi is directionally differentiable at μ0\mu_{0} and

\phi_{\mu_{0}}^{\prime}(d)=\inf_{\theta\in\bar{\Theta}(\mu_{0})}D_{\mu}\left(\mu_{0},\theta\right)d.
Theorem A6 (Delta theorem, Theorem 7.59 in [71]).

Let B1B_{1} and B2B_{2} be Banach spaces, equipped with their Borel σ\sigma-algebras, YNY_{N} be a sequence of random elements of B1B_{1}, G:B1B2G:B_{1}\rightarrow B_{2} be a mapping, and τN\tau_{N} be a sequence of positive numbers tending to infinity as N.N\rightarrow\infty. Suppose that the space B1B_{1} is separable, the mapping GG is Hadamard directionally differentiable at a point μB1,\mu\in B_{1}, and the sequence XN=τN(YNμ)X_{N}=\tau_{N}\left(Y_{N}-\mu\right) converges in distribution to a random element YY of B1.B_{1}. Then,

τN(G(YN)G(μ))Gμ(Y) in distribution,\tau_{N}\left(G\left(Y_{N}\right)-G\left(\mu\right)\right)\Rightarrow G_{\mu}^{\prime}\left(Y\right)\text{ in distribution,}

and

τN(G(YN)G(μ))=Gμ(XN)+op(1).\tau_{N}\left(G\left(Y_{N}\right)-G\left(\mu\right)\right)=G_{\mu}^{\prime}\left(X_{N}\right)+o_{p}(1).
Proposition 3 (Proposition 7.57 in [71]).

Let B1B_{1} and B2B_{2} be Banach spaces, G:B1B2,G:B_{1}\rightarrow B_{2}, and μB1.\mu\in B_{1}. Then the following hold: (i) If G()G\left(\cdot\right) is Hadamard directionally differentiable at μ,\mu, then the directional derivative Gμ()G_{\mu}^{\prime}\left(\cdot\right) is continuous. (ii) If G()G(\cdot) is Lipschitz continuous in a neighborhood of μ\mu and directionally differentiable at μ,\mu, then G()G(\cdot) is Hadamard directionally differentiable at μ.\mu.

Definition 9 (Rademacher complexity).

Let \mathcal{F} be a family of real-valued functions f:Z𝐑.f:Z\rightarrow\mathbf{R.} Then, the Rademacher complexity of \mathcal{F} is defined as

n()𝐄z,σ[supf|1ni=1nσif(zi)|],\mathcal{R}_{n}\left(\mathcal{F}\right)\triangleq\mathbf{E}_{z,\sigma}\left[\sup_{f\in\mathcal{F}}\left|\frac{1}{n}\sum_{i=1}^{n}\sigma_{i}f(z_{i})\right|\right],

where σ1,σ2,,σn\sigma_{1},\sigma_{2},\ldots,\sigma_{n} are i.i.d. with distribution 𝐏(σi=1)=𝐏(σi=1)=1/2.\mathbf{P}\left(\sigma_{i}=1\right)=\mathbf{P}\left(\sigma_{i}=-1\right)=1/2.
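As a purely illustrative aside, the Rademacher complexity in Definition 9 can be approximated by Monte Carlo for a small function class. The Python sketch below is our own illustration: the threshold class, the sampling distribution of zz, and all parameter choices are assumptions made only for this example.

import numpy as np

rng = np.random.default_rng(0)
# An illustrative finite class of threshold indicators f_t(z) = 1{z <= t} on [0, 1].
F = [lambda z, t=t: (z <= t).astype(float) for t in np.linspace(0.1, 0.9, 9)]

def rademacher_complexity(F, n, num_draws=2000, rng=rng):
    """Monte Carlo estimate of R_n(F) = E_{z,sigma} sup_f |(1/n) sum_i sigma_i f(z_i)|."""
    sups = []
    for _ in range(num_draws):
        z = rng.uniform(0.0, 1.0, size=n)            # fresh sample z_1, ..., z_n
        sigma = rng.choice([-1.0, 1.0], size=n)      # Rademacher signs
        vals = np.column_stack([f(z) for f in F])    # (n, |F|) matrix of f(z_i)
        sups.append(np.abs(sigma @ vals / n).max())  # sup over f of |(1/n) sum sigma_i f(z_i)|
    return float(np.mean(sups))

print("estimated R_n(F) for n = 200:", rademacher_complexity(F, n=200))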

Theorem A7 (Theorem 4.10 in [82]).

If f(z)[B,B]f(z)\in[-B,B] for all ff\in\mathcal{F} and zZz\in Z, we have with probability at least 1exp(nϵ22B2)1-\exp\left(-\frac{n\epsilon^{2}}{2B^{2}}\right),

supf|1ni=1nf(zi)𝐄f(z)|2n()+ϵ.\sup_{f\in\mathcal{F}}\left|\frac{1}{n}\sum_{i=1}^{n}f(z_{i})-\mathbf{E}f(z)\right|\leq 2\mathcal{R}_{n}\left(\mathcal{F}\right)+\epsilon.
Theorem A8 (Dudley’s Theorem, (5.48) in [82]).

If f(z)[B,B]f(z)\in[-B,B] for all ff\in\mathcal{F} and zZz\in Z, the Rademacher complexity admits the bound

n()𝐄[24n02BlogN(t,,𝐏n)dt],\mathcal{R}_{n}\left(\mathcal{F}\right)\leq\mathbf{E}\left[\frac{24}{\sqrt{n}}\int_{0}^{2B}\sqrt{\log N(t,\mathcal{F}\text{,}\left\|\cdot\right\|_{\mathbf{P}_{n}})}{\rm d}t\right],

where N(t,,𝐏n)N(t,\mathcal{F},\left\|\cdot\right\|_{\mathbf{P}_{n}}) is the tt-covering number of the set \mathcal{F} with respect to the metric 𝐏n\left\|\cdot\right\|_{\mathbf{P}_{n}} defined by

\left\|f-g\right\|_{\mathbf{P}_{n}}\triangleq\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(f(z_{i})-g(z_{i})\right)^{2}}.

Appendix A.2 Proofs of Lemma 1, Corollary 1, and Lemma 2 in Section 3.1

Proof of Lemma 1.

The first equality follows from Hu and Hong [40, Theorem 1]. The second equality holds, because for any (Borel measurable) function f:𝐑𝐑f:\mathbf{R}\rightarrow\mathbf{R} and any policy πΠ\pi\in\Pi, we have

𝐄𝐏[f(Y(π(X)))]=𝐄𝐏π0[f(Y(π(X)))𝟏{π(X)=A}π0(AX)]=𝐄𝐏π0[f(Y(A))𝟏{π(X)=A}π0(AX)].\mathbf{E}_{\mathbf{P}}\left[f(Y(\pi(X)))\right]=\mathbf{E}_{\mathbf{P}*\pi_{0}}\left[\frac{f(Y(\pi(X)))\mathbf{1}\{\pi(X)=A\}}{\pi_{0}(A\mid X)}\right]=\mathbf{E}_{\mathbf{P}*\pi_{0}}\left[\frac{f(Y(A))\mathbf{1}\{\pi(X)=A\}}{\pi_{0}(A\mid X)}\right]. (A.1)

Plugging in f(x)=exp(x/α)f(x)=\exp(-x/{\alpha}) yields the result. ∎
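For intuition only, the reweighting identity (A.1) with f(x)=exp(x/α)f(x)=\exp(-x/\alpha) can be checked numerically: the inverse-propensity-weighted average of exp(Yi/α)\exp(-Y_{i}/\alpha) over observations whose logged action matches π(Xi)\pi(X_{i}) estimates the target moment. The Python sketch below is a minimal illustration; the synthetic data-generating process, the policy, and the value of α\alpha are assumptions made only for this example and are not part of the paper's method.

import numpy as np

rng = np.random.default_rng(1)
n, d = 5000, 3                                    # n logged observations, d actions
X = rng.uniform(size=n)
pi0 = np.full((n, d), 1.0 / d)                    # logging policy: uniform propensities
A = rng.integers(0, d, size=n)                    # logged actions
Y = rng.uniform(size=n) * (A + 1) / d             # observed outcomes Y_i(A_i) in [0, 1]

def ipw_mgf(pi, X, A, Y, pi0, alpha):
    """IPW estimate of E[exp(-Y(pi(X))/alpha)] from logged bandit data, as in (A.1)."""
    match = (pi(X) == A).astype(float)            # indicator 1{pi(X_i) = A_i}
    w = match / pi0[np.arange(len(A)), A]         # inverse-propensity weights
    return np.mean(w * np.exp(-Y / alpha))

policy = lambda x: (x > 0.5).astype(int)          # an illustrative deterministic policy
print(ipw_mgf(policy, X, A, Y, pi0, alpha=0.5))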

Proof of Corollary 1.

Since Y(a1),Y(a2),,Y(ad)Y(a^{1}),Y(a^{2}),\ldots,Y(a^{d}) are mutually independent conditional on XX, and Y(ai)|XY(a^{i})|X has a density if Y(ai)Y(a^{i}) is a continuous random variable, we can write the joint measure of Y(a1),Y(a2),,Y(ad),XY(a^{1}),Y(a^{2}),\ldots,Y(a^{d}),X as

(i=1dfi(yi|x)λ(dyi))μ(dx),\left(\prod_{i=1}^{d}f_{i}(y_{i}|x)\lambda({\rm d}y_{i})\right)\mu({\rm d}x),

where λ\lambda denotes the Lebesgue measure on 𝐑\mathbf{R} if Assumption 2.1 is enforced, and denotes the counting measure on 𝔻\mathbb{D} with λ(d)=1\lambda(d)=1 for each d𝔻d\in\mathbb{D} if Assumption 2.2 is enforced, and μ(dx)\mu({\rm d}x) denotes the measure induced by XX on the space 𝒳\mathcal{X}. Without loss of generality, we assume \operatorname*{ess\,inf}\{Y(\pi(X))\}=0. For simplicity of notation, when α(π)=0,\alpha^{*}(\pi)=0, we write

\exp\left(-Y(\pi(X))/\alpha^{*}(\pi)\right)=\mathbf{1}\{Y(\pi(X))=\operatorname*{ess\,inf}\{Y(\pi(X))\}\},

which means

\mathbf{E}_{\mathbf{P}_{0}}\left[\exp\left(-Y(\pi(X))/\alpha^{*}(\pi)\right)\right]=\mathbf{P}_{0}\left(Y(\pi(X))=\operatorname*{ess\,inf}\{Y(\pi(X))\}\right).

Then, by Proposition 1, we have under 𝐏(π)\mathbf{P}(\pi), Y(a1),Y(a2),,Y(ad),XY(a^{1}),Y(a^{2}),\ldots,Y(a^{d}),X have a joint measure

\displaystyle\frac{\exp\left(-Y(\pi(X))/\alpha^{*}(\pi)\right)}{\mathbf{E}_{\mathbf{P}_{0}}\left[\exp\left(-Y(\pi(X))/\alpha^{*}(\pi)\right)\right]}\left(\prod_{i=1}^{d}f_{i}(y_{i}|x)\lambda({\rm d}y_{i})\right)\mu({\rm d}x)=\left(\prod_{i=1}^{d}f^{\prime}_{i}(y_{i}|x)\lambda({\rm d}y_{i})\right)\mu^{\prime}({\rm d}x),

where

fi(yi|x)={fi(yi|x)exp(yi/α(π))fi(yi|x)exp(yi/α(π))fi(yi|x)λ(dyi)for iπ(x)for i=π(x),f_{i}^{\prime}(y_{i}|x)=\left\{\begin{array}[]{c}f_{i}(y_{i}|x)\\ \frac{\exp(-y_{i}/\alpha^{*}(\pi))f_{i}(y_{i}|x)}{\int\exp(-y_{i}/\alpha^{*}(\pi))f_{i}(y_{i}|x)\lambda({\rm d}y_{i})}\end{array}\begin{array}[]{c}\text{for }i\neq\pi(x)\\ \text{for }i=\pi(x)\end{array}\right.,

and

\mu^{\prime}({\rm d}x)=\frac{\int\exp(-y/\alpha^{*}(\pi))f_{\pi(x)}(y|x)\,\lambda({\rm d}y)}{\mathbf{E}_{\mathbf{P}_{0}}\left[\exp\left(-Y(\pi(X))/\alpha^{*}(\pi)\right)\right]}\,\mu({\rm d}x),

which completes the proof. ∎

Proof of Lemma 2.

The closed-form expressions of αϕ^n(π,α)\frac{\partial}{\partial\alpha}\hat{\phi}_{n}(\pi,\alpha) and 2α2ϕ^n(π,α)\frac{\partial^{2}}{\partial\alpha^{2}}\hat{\phi}_{n}(\pi,\alpha) follow from elementary algebra. By the Cauchy–Schwarz inequality, we have

(i=1nYi(Ai)Wi(π,α))2nSnπW^n(π,α)i=1nYi2(Ai)Wi(π,α).\Big{(}\sum_{i=1}^{n}Y_{i}(A_{i})W_{i}(\pi,\alpha)\Big{)}^{2}\leq nS_{n}^{\pi}\hat{W}_{n}(\pi,\alpha)\sum_{i=1}^{n}Y^{2}_{i}(A_{i})W_{i}(\pi,\alpha).

Therefore, it follows that 2α2ϕ^n(π,α)0\frac{\partial^{2}}{\partial\alpha^{2}}\hat{\phi}_{n}(\pi,\alpha)\leq 0. Note that the Cauchy–Schwarz inequality holds with equality if and only if

Yi2(Ai)Wi(π,α)=cWi(π,α) if Wi(π,α)0Y^{2}_{i}(A_{i})W_{i}(\pi,\alpha)=cW_{i}(\pi,\alpha)\quad\mbox{ if }\quad W_{i}(\pi,\alpha)\neq 0

for some constant cc independent of ii. Since the above condition is violated whenever {Yi(Ai)𝟏{π(Xi)=Ai}}i=1n\{Y_{i}(A_{i})\mathbf{1}\{\pi(X_{i})=A_{i}\}\}_{i=1}^{n} takes at least two different non-zero values, ϕ^n(π,α)\hat{\phi}_{n}(\pi,\alpha) is strictly concave in α\alpha in this case. ∎
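As a sanity check of Lemma 2 (not part of the proof), one can evaluate the empirical dual objective ϕ^n(π,α)=αlogW^n(π,α)αδ\hat{\phi}_{n}(\pi,\alpha)=-\alpha\log\hat{W}_{n}(\pi,\alpha)-\alpha\delta on a grid and confirm numerically that its second differences are non-positive. The sketch below uses synthetic logged data; the data-generating process, the policy, and δ\delta are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(2)
n, d, delta = 5000, 3, 0.2
X = rng.uniform(size=n)
A = rng.integers(0, d, size=n)
Y = rng.uniform(size=n)
prop = np.full(n, 1.0 / d)                        # propensity of the logged action
policy = lambda x: (x > 0.5).astype(int)

match_w = (policy(X) == A) / prop                 # 1{pi(X_i)=A_i} / pi_0(A_i|X_i)
S_n = match_w.mean()

def phi_hat(alpha):
    # self-normalized IPW estimate W_hat_n(pi, alpha) of E[exp(-Y(pi(X))/alpha)]
    W_hat = np.sum(match_w * np.exp(-Y / alpha)) / (n * S_n)
    return -alpha * np.log(W_hat) - alpha * delta

alphas = np.linspace(0.05, 5.0, 400)
vals = np.array([phi_hat(a) for a in alphas])
second_diff = vals[2:] - 2 * vals[1:-1] + vals[:-2]
print("max second difference (should be <= 0):", second_diff.max())
print("alpha maximizing phi_hat on the grid:", alphas[vals.argmax()])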

Appendix A.3 Proof of the central limit theorem (Theorem 1)

We first give the upper and lower bounds for α(π)\alpha^{\ast}(\pi) in Lemmas A11 and A12.

Lemma A11 (Uniform upper bound of α(π)\alpha^{\ast}(\pi)).

Suppose that Assumption 1.3 is imposed. Then the optimal dual solution satisfies α(π)α¯\alpha^{\ast}(\pi)\leq\overline{\alpha} and the empirical dual solution satisfies αn(π)α¯\alpha_{n}(\pi)\leq\overline{\alpha}, where α¯=M/δ\overline{\alpha}=M/\delta.

Proof.

First note that inf𝐏𝒰𝐏0(δ)𝐄𝐏[Y(π(X))]essinfY(π(X))0\inf_{\mathbf{P}\in\mathcal{U}_{\mathbf{P}_{0}(\delta)}}\mathbf{E}_{\mathbf{P}}\left[Y(\pi(X))\right]\geq\operatorname*{ess\,inf}{Y(\pi(X))}\geq 0 and

-\alpha\log\mathbf{E}_{\mathbf{P}_{0}}\left[\exp\left(-Y(\pi(X))/\alpha\right)\right]-\alpha\delta\leq M-\alpha\delta.

Since the optimal dual value equals \inf_{\mathbf{P}\in\mathcal{U}_{\mathbf{P}_{0}}(\delta)}\mathbf{E}_{\mathbf{P}}\left[Y(\pi(X))\right]\geq 0 by Lemma 1, we must have M-\alpha^{\ast}(\pi)\delta\geq 0, which gives the upper bound \alpha^{\ast}(\pi)\leq\overline{\alpha}:=M/\delta. The same argument applied to the empirical objective gives \alpha_{n}(\pi)\leq\overline{\alpha}. ∎

Lemma A12 (Lower bound of α(π)\alpha^{\ast}(\pi)).

Suppose that Assumption 2.1 is imposed. Then α(π)>0\alpha^{\ast}(\pi)>0.

Proof.

To ease the notation, we abbreviate Y(π(X))Y(\pi(X)) as Y.Y. It is easy to check that the density fYf_{Y} has a lower bound b¯>0.\underline{b}>0. Since fYf_{Y} is continuous on a compact interval, it is also bounded above; let b¯(π)=supx[0,M]fY(x).\overline{b}(\pi)=\sup_{x\in[0,M]}f_{Y}(x).

First, note that limα0ϕ(π,α)=0.\lim_{\alpha\rightarrow 0}\phi\left(\pi,\alpha\right)=0. We only need to show liminfα0ϕ(π,α)α>0.\lim\inf_{\alpha\rightarrow 0}\frac{\partial\phi\left(\pi,\alpha\right)}{\partial\alpha}>0.

The derivative of ϕ(π,α)\phi\left(\pi,\alpha\right) is given by

ϕ(π,α)α=𝐄[Y/αexp(Y/α)]𝐄[exp(Y/α)]log(𝐄[exp(Y/α)])δ.\frac{\partial\phi\left(\pi,\alpha\right)}{\partial\alpha}=-\frac{\mathbf{E}\left[Y/\alpha\exp\left(-Y/\alpha\right)\right]}{\mathbf{E}\left[\exp\left(-Y/\alpha\right)\right]}-\log\left(\mathbf{E}\left[\exp\left(-Y/\alpha\right)\right]\right)-\delta.

Since 𝐏0\mathbf{P}_{0} has a continuous density, we have \log\left(\mathbf{E}\left[\exp\left(-Y/\alpha\right)\right]\right)\rightarrow-\infty as \alpha\downarrow 0. Notice that

𝐄[Y/αexp(Y/α)]αb¯ and liminfα0𝐄[exp(Y/α)]/αb¯.\mathbf{E}\left[Y/\alpha\exp\left(-Y/\alpha\right)\right]\leq\alpha\overline{b}\text{ and }\lim\inf_{\alpha\rightarrow 0}\mathbf{E}\left[\exp\left(-Y/\alpha\right)\right]/\alpha\geq\underline{b}.

Therefore, we have

limsupα0𝐄[Y/αexp(Y/α)]𝐄[exp(Y/α)]b¯(π)b¯.\lim\sup_{\alpha\rightarrow 0}\frac{\mathbf{E}\left[Y/\alpha\exp\left(-Y/\alpha\right)\right]}{\mathbf{E}\left[\exp\left(-Y/\alpha\right)\right]}\leq\frac{\overline{b}(\pi)}{\underline{b}}.

Finally, we arrive at the desired result,

\liminf_{\alpha\downarrow 0}\frac{\partial\phi\left(\pi,\alpha\right)}{\partial\alpha}=+\infty,

which completes the proof. ∎

Lemma A13.

Suppose Assumption 1.1 is enforced. Then we have the pointwise central limit theorem,

n(W^n(π,α)𝐄[Wi(π,α)])𝒩(0,𝐄[1π0(π(X)X)(exp(Y(π(X))/α)𝐄[exp(Y(π(X))/α)])2]),\displaystyle\sqrt{n}\left(\hat{W}_{n}(\pi,\alpha)-\mathbf{E}\left[W_{i}\left(\pi,\alpha\right)\right]\right)\Rightarrow\mathcal{N}\left(0,\mathbf{E}\left[\frac{1}{\pi_{0}\left(\pi(X)\mid X\right)}\left(\exp\left(-Y(\pi(X))/\alpha\right)-\mathbf{E}\left[\exp\left(-Y(\pi(X))/\alpha\right)\right]\right)^{2}\right]\right),

for any πΠ\pi\in\Pi and α>0\alpha>0.

Proof.

After reformulation, we have

W^n(π,α)𝐄[Wi(π,α)]\displaystyle\hat{W}_{n}(\pi,\alpha)-\mathbf{E}\left[W_{i}\left(\pi,\alpha\right)\right] =\displaystyle= 1ni=1nWi(π,α)(1ni=1n𝟏{π(Xi)=Ai}π0(Ai|Xi))𝐄[Wi(π,α)]1ni=1n𝟏{π(Xi)=Ai}π0(Ai|Xi)\displaystyle\frac{\frac{1}{n}\sum_{i=1}^{n}W_{i}\left(\pi,\alpha\right)-\left(\frac{1}{n}\sum_{i=1}^{n}\frac{\mathbf{1}\left\{\pi(X_{i})=A_{i}\right\}}{\pi_{0}\left(A_{i}|X_{i}\right)}\right)\mathbf{E}\left[W_{i}\left(\pi,\alpha\right)\right]}{\frac{1}{n}\sum_{i=1}^{n}\frac{\mathbf{1}\left\{\pi(X_{i})=A_{i}\right\}}{\pi_{0}\left(A_{i}|X_{i}\right)}}
=\displaystyle= 1ni=1n(Wi(π,α)𝟏{π(Xi)=Ai}π0(Ai|Xi)𝐄[Wi(π,α)])1ni=1n𝟏{π(Xi)=Ai}π0(Ai|Xi).\displaystyle\frac{\frac{1}{n}\sum_{i=1}^{n}\left(W_{i}\left(\pi,\alpha\right)-\frac{\mathbf{1}\left\{\pi(X_{i})=A_{i}\right\}}{\pi_{0}\left(A_{i}|X_{i}\right)}\mathbf{E}\left[W_{i}\left(\pi,\alpha\right)\right]\right)}{\frac{1}{n}\sum_{i=1}^{n}\frac{\mathbf{1}\left\{\pi(X_{i})=A_{i}\right\}}{\pi_{0}\left(A_{i}|X_{i}\right)}}.

The denominator 1ni=1n𝟏{π(Xi)=Ai}π0(Ai|Xi)𝑝1\frac{1}{n}\sum_{i=1}^{n}\frac{\mathbf{1}\left\{\pi(X_{i})=A_{i}\right\}}{\pi_{0}\left(A_{i}|X_{i}\right)}\overset{p}{\rightarrow}1 and the numerator converges as

1ni=1n(Wi(π,α)𝟏{π(Xi)=Ai}π0(Ai|Xi)𝐄[Wi(π,α)])\displaystyle\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\left(W_{i}\left(\pi,\alpha\right)-\frac{\mathbf{1}\left\{\pi(X_{i})=A_{i}\right\}}{\pi_{0}\left(A_{i}|X_{i}\right)}\mathbf{E}\left[W_{i}\left(\pi,\alpha\right)\right]\right)
\displaystyle\Rightarrow\mathcal{N}\left(0,\mathbf{Var}\left(W_{i}\left(\pi,\alpha\right)-\frac{\mathbf{1}\left\{\pi(X_{i})=A_{i}\right\}}{\pi_{0}\left(A_{i}|X_{i}\right)}\mathbf{E}\left[W_{i}\left(\pi,\alpha\right)\right]\right)\right),

where

𝐕𝐚𝐫(Wi(π,α)𝟏{π(Xi)=Ai}π0(Ai|Xi)𝐄[Wi(π,α)])\displaystyle\mathbf{Var}\left(W_{i}\left(\pi,\alpha\right)-\frac{\mathbf{1}\left\{\pi(X_{i})=A_{i}\right\}}{\pi_{0}\left(A_{i}|X_{i}\right)}\mathbf{E}\left[W_{i}\left(\pi,\alpha\right)\right]\right)
=\displaystyle= 𝐄[Wi(π,α)2]2𝐄[(𝟏{π(Xi)=Ai}π0(Ai|Xi))2exp(Y(π(X))/α)]𝐄[Wi(π,α)]\displaystyle\mathbf{E}\left[W_{i}\left(\pi,\alpha\right)^{2}\right]-2\mathbf{E}\left[\left(\frac{\mathbf{1}\left\{\pi(X_{i})=A_{i}\right\}}{\pi_{0}\left(A_{i}|X_{i}\right)}\right)^{2}\exp\left(-Y(\pi(X))/\alpha\right)\right]\mathbf{E}\left[W_{i}\left(\pi,\alpha\right)\right]
+𝐄[𝟏{π(Xi)=Ai}π0(Ai|Xi)]2𝐄[Wi(π,α)]2\displaystyle+\mathbf{E}\left[\frac{\mathbf{1}\left\{\pi(X_{i})=A_{i}\right\}}{\pi_{0}\left(A_{i}|X_{i}\right)}\right]^{2}\mathbf{E}\left[W_{i}\left(\pi,\alpha\right)\right]^{2}
=\displaystyle= 𝐄[1π0(π(X)|X)𝐄[exp(2Y(π(X))/α)|X]]2𝐄[1π0(π(X)|X)𝐄[exp(Y(π(X))/α)|X]]𝐄[Wi(π,α)]\displaystyle\mathbf{E}\left[\frac{1}{\pi_{0}\left(\pi(X)|X\right)}\mathbf{E}\left[\exp\left(-2Y(\pi(X))/\alpha\right)|X\right]\right]-2\mathbf{E}\left[\frac{1}{\pi_{0}\left(\pi(X)|X\right)}\mathbf{E}\left[\exp\left(-Y(\pi(X))/\alpha\right)|X\right]\right]\mathbf{E}\left[W_{i}\left(\pi,\alpha\right)\right]
+𝐄[1π0(π(X)|X)]𝐄[exp(Y(π(X))/α)]2\displaystyle+\mathbf{E}\left[\frac{1}{\pi_{0}\left(\pi(X)|X\right)}\right]\mathbf{E}\left[\exp\left(-Y(\pi(X))/\alpha\right)\right]^{2}
=\displaystyle= 𝐄[1π0(π(X)|X)(exp(Y(π(X))/α)𝐄[exp(Y(π(X))/α)])2].\displaystyle\mathbf{E}\left[\frac{1}{\pi_{0}\left(\pi(X)|X\right)}\left(\exp\left(-Y(\pi(X))/\alpha\right)-\mathbf{E}\left[\exp\left(-Y(\pi(X))/\alpha\right)\right]\right)^{2}\right].

By Slutsky’s theorem, the desired result follows. ∎

To ease the proofs below, we define 𝐏^nπ\mathbf{\hat{P}}^{\pi}_{n} as the weighted empirical distribution by

𝐏^nπ1nSnπi=1n𝟏{π(Xi)=Ai}π0(Ai|Xi)Δ{Yi,Xi},\mathbf{\hat{P}}_{n}^{\pi}\triangleq\frac{1}{nS_{n}^{\pi}}\sum_{i=1}^{n}\frac{\mathbf{1\{}\pi(X_{i})=A_{i}\mathbf{\}}}{\pi_{0}\left(A_{i}|X_{i}\right)}\Delta\left\{Y_{i},X_{i}\right\},

where Δ{}\Delta\{\cdot\} denotes the Dirac measure.
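Concretely, 𝐏^nπ\mathbf{\hat{P}}_{n}^{\pi} places a normalized inverse-propensity mass on each matched observation. The following sketch builds these weights for synthetic logged data with a discrete outcome and compares 𝐏^nπ(Y=v)\hat{\mathbf{P}}_{n}^{\pi}(Y=v) with the data-generating frequencies, in the spirit of Lemma A14 below; the data-generating process and the policy are illustrative assumptions (the outcome here is drawn independently of XX and AA, so that the marginal frequency is the correct target).

import numpy as np

rng = np.random.default_rng(3)
n, d = 5000, 3
X = rng.uniform(size=n)
A = rng.integers(0, d, size=n)
Y = rng.integers(0, 4, size=n)                    # discrete outcomes, support {0, 1, 2, 3}
prop = np.full(n, 1.0 / d)                        # propensity of the logged action
policy = lambda x: (x > 0.5).astype(int)

w = (policy(X) == A) / prop                       # unnormalized inverse-propensity weights
w = w / w.sum()                                   # weights of the Dirac masses, summing to 1

# P_hat_n^pi(Y = v) versus the marginal frequency of Y
for v in range(4):
    print(v, float(w[Y == v].sum()), float((Y == v).mean()))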

Lemma A14.

Suppose Assumptions 1.1 and 2.2 are enforced. Then, we have

limn+supv𝔻(|𝐏^nπ(Y=v)𝐏0(Y=v)|)=0 almost surely.\lim_{n\rightarrow+\infty}\sup_{v\in\mathbb{D}}\left(|\hat{\mathbf{P}}_{n}^{\pi}(Y=v)-\mathbf{P}_{0}(Y=v)|\right)=0\text{ almost surely.}
Proof.

Note that

𝐏^nπ(Y=v)=1nSnπi=1n𝟏{π(Xi)=Ai}𝟏{Yi=v}π0(Ai|Xi),\hat{\mathbf{P}}_{n}^{\pi}(Y=v)=\frac{1}{nS_{n}^{\pi}}\sum_{i=1}^{n}\frac{\mathbf{1\{}\pi(X_{i})=A_{i}\mathbf{\}1\{}Y_{i}=v\mathbf{\}}}{\pi_{0}\left(A_{i}|X_{i}\right)},

and by Assumption 1.1,

\mathbf{E}\left[\frac{\mathbf{1}\{\pi(X_{i})=A_{i}\}\mathbf{1}\{Y_{i}=v\}}{\pi_{0}\left(A_{i}|X_{i}\right)}\right]=\mathbf{E}\left[\frac{\mathbf{E}\left[\mathbf{1}\{\pi(X_{i})=A_{i}\}\mid X_{i}\right]\mathbf{E}\left[\mathbf{1}\{Y_{i}(\pi(X_{i}))=v\}\mid X_{i}\right]}{\pi_{0}\left(\pi(X_{i})|X_{i}\right)}\right]=\mathbf{P}_{0}(Y=v),

where the second expression follows from the tower property and Assumption 1.1, and \mathbf{E}\left[\mathbf{1}\{\pi(X_{i})=A_{i}\}\mid X_{i}\right]=\pi_{0}(\pi(X_{i})|X_{i}) cancels the denominator.

By the law of large numbers, we have Snπ1S_{n}^{\pi}\rightarrow 1 almost surely, and

1ni=1n𝟏{π(Xi)=Ai}𝟏{Yi=v}π0(Ai|Xi)𝐏0(Y=v) almost surely.\frac{1}{n}\sum_{i=1}^{n}\frac{\mathbf{1\{}\pi(X_{i})=A_{i}\mathbf{\}1\{}Y_{i}=v\mathbf{\}}}{\pi_{0}\left(A_{i}|X_{i}\right)}\rightarrow\mathbf{P}_{0}(Y=v)\text{ almost surely.}

By Slutsky’s lemma [e.g. 52, Theorem 1.8.10], we have 𝐏^nπ(Y=v)𝐏0(Y=v)0\hat{\mathbf{P}}_{n}^{\pi}(Y=v)-\mathbf{P}_{0}(Y=v)\rightarrow 0 almost surely. Since 𝔻\mathbb{D} is a finite set, we arrive at the desired result. ∎

Lemma A15.

Suppose Assumptions 1.1 and 2.2 are enforced. We have that when α=0\alpha=0,

limn𝐏(ϕ^n(π,0)=ϕ(π,0))=1,\lim_{n\rightarrow\infty}\mathbf{P}_{\otimes}(\hat{\phi}_{n}(\pi,0)={\phi}(\pi,0))=1,

where 𝐏\mathbf{P}_{\otimes} denotes the product measure i=1𝐏0\prod_{i=1}^{\infty}\mathbf{P}_{0}, which is guaranteed to be unique by the Kolmogorov extension theorem, and thus

n(ϕ^n(π,0)ϕ(π,0))0 in probability.\displaystyle\sqrt{n}(\hat{\phi}_{n}(\pi,0)-{\phi}(\pi,0))\rightarrow 0\text{ in probability.}
Proof.

By Remark 2, we have

ϕ^n(π,0)=essinfY^nπ,\hat{\phi}_{n}(\pi,0)=\operatorname*{ess\,inf}\hat{Y}_{n}^{\pi},

where Y^nπ\hat{Y}_{n}^{\pi} denotes a random variable distributed as YY under the measure 𝐏^nπ\hat{\mathbf{P}}_{n}^{\pi}. Therefore,

𝐏(ϕ^n(π,0)=ϕ(π,0))=𝐏(𝐏^nπ(Y=minv𝔻v)>0)1,\mathbf{P}_{\otimes}(\hat{\phi}_{n}(\pi,0)={\phi}(\pi,0))=\mathbf{P}_{\otimes}(\hat{\mathbf{P}}_{n}^{\pi}(Y=\min_{v\in\mathbb{D}}{v})>0)\rightarrow 1,

because 𝐏^nπ(Y=minv𝔻v)𝐏0(Y=minv𝔻v)\hat{\mathbf{P}}_{n}^{\pi}(Y=\min_{v\in\mathbb{D}}{v})\rightarrow\mathbf{P}_{0}(Y=\min_{v\in\mathbb{D}}{v}). ∎

Now we are ready to show the proof of Theorem 1.

Proof.

Proof of Theorem 1. We divide the proof into two parts: the continuous case and the discrete case.

(1) Continuous case, i.e., Assumption 2.1 is satisfied: Note that

n(W^n(π,α)𝐄[Wi(π,α)])Z1(α),\sqrt{n}\left(\hat{W}_{n}(\pi,\alpha)-\mathbf{E}[W_{i}(\pi,\alpha)]\right)\Rightarrow Z_{1}\left(\alpha\right),

where by Lemma A13

Z1(α)𝒩(0,𝐄[1π0(π(X)|X)(exp(Y(π(X))/α)𝐄[exp(Y(π(X))/α)])2]).Z_{1}\left(\alpha\right)\sim\mathcal{N}\left(0,\mathbf{E}\left[\frac{1}{\pi_{0}\left(\pi(X)|X\right)}\left(\exp\left(-Y(\pi(X))/\alpha\right)-\mathbf{E}\left[\exp\left(-Y(\pi(X))/\alpha\right)\right]\right)^{2}\right]\right).

By Lemma A12, there exists α¯(π)>0\underline{\alpha}(\pi)>0 such that α(π)α¯(π)\alpha^{*}(\pi)\geq\underline{\alpha}(\pi). To ease the notation, we abbreviate α¯(π),α(π)\underline{\alpha}(\pi),\alpha^{*}(\pi) as α¯,α\underline{\alpha},\alpha^{*}. Since W^n(π,α)\hat{W}_{n}(\pi,\alpha) is Lipschitz continuous over the set α[α¯/2,2α¯]\alpha\in[\underline{\alpha}/2,2\overline{\alpha}], we have

n(W^n(π,)𝐄[Wi(π,)])Z1(),\sqrt{n}\left(\hat{W}_{n}(\pi,\cdot)-\mathbf{E}[W_{i}(\pi,\cdot)]\right)\Rightarrow Z_{1}\left(\cdot\right), (A.3)

uniformly in the Banach space 𝒞([α¯/2,2α¯])\mathcal{C}([\underline{\alpha}/2,2\overline{\alpha}]) of continuous functions ψ:[α¯/2,2α¯]𝐑\psi:[\underline{\alpha}/2,2\overline{\alpha}]\rightarrow\mathbf{R} equipped with the sup-norm \left\|\psi\right\|:=\sup_{x\in[\underline{\alpha}/2,2\overline{\alpha}]}|\psi(x)| [e.g. 5, Corollary 7.17]. Here Z1Z_{1} is a random element in 𝒞([α¯/2,2α¯])\mathcal{C}([\underline{\alpha}/2,2\overline{\alpha}]).

Define the functionals

G1(ψ,α)=αlog(ψ(α))+αδ, and V1(ψ)=infα[α¯/2,2α¯]G1(ψ,α),G_{1}(\psi,\alpha)=\alpha\log\left(\psi(\alpha)\right)+\alpha\delta,\text{ and }V_{1}(\psi)=\inf_{\alpha\in[\underline{\alpha}/2,2\overline{\alpha}]}G_{1}(\psi,\alpha),

for ψ>0\psi>0. By the Danskin theorem [13, Theorem 4.13], V1()V_{1}\left(\cdot\right) is directionally differentiable at any μ𝒞([α¯/2,2α¯])\mu\in\mathcal{C}([\underline{\alpha}/2,2\overline{\alpha}]) with μ>0\mu>0 and

V_{1,\mu}^{\prime}\left(\nu\right)=\inf_{\alpha\in\bar{X}\left(\mu\right)}\alpha\left(1/\mu(\alpha)\right)\nu(\alpha),\quad\forall\nu\in\mathcal{C}([\underline{\alpha}/2,2\overline{\alpha}]),

where X¯(μ)=argminα[α¯/2,2α¯])αlog(μ(α))+αδ\bar{X}\left(\mu\right)=\arg\min_{\alpha\in[\underline{\alpha}/2,2\overline{\alpha}])}\alpha\log\left(\mu(\alpha)\right)+\alpha\delta and V1,μ(ν)V_{1,\mu}^{\prime}\left(\nu\right) is the directional derivative of V1()V_{1}\left(\cdot\right) at μ\mu in the direction of ν.\nu. On the other hand, V1(ψ)V_{1}(\psi) is Lipschitz continuous if ψ()\psi\left(\cdot\right) is bounded away from zero. Notice that

𝐄[Wi(π,α)]=𝐄[exp(Y(π(X))/α)]exp(2M/α¯).\mathbf{E}[W_{i}(\pi,\alpha)]=\mathbf{E}[\exp\left(-Y(\pi(X))/\alpha\right)]\geq\exp\left(-2M/\underline{\alpha}\right). (A.4)

Therefore, V1()V_{1}\left(\cdot\right) is Hadamard directionally differentiable at μ=𝐄[Wi(π,)]\mu=\mathbf{E}[W_{i}(\pi,\cdot)] (see, for example, Proposition 7.57 in [71]). By the Delta theorem (Theorem 7.59 in [71]), we have

n(V1(W^n(π,))V1(𝐄[Wi(π,)]))V1,𝐄[Wi(π,)](Z1).\sqrt{n}\left(V_{1}(\hat{W}_{n}(\pi,\cdot))-V_{1}(\mathbf{E}[W_{i}(\pi,\cdot)])\right)\Rightarrow V_{1,\mathbf{E}[W_{i}(\pi,\cdot)]}^{\prime}\left(Z_{1}\right).

Furthermore, we know that log(𝐄(exp(βY)))\log\left(\mathbf{E}\left(\exp\left(-\beta Y\right)\right)\right) is strictly convex w.r.t β\beta given 𝐕𝐚𝐫(Y)>0\mathbf{Var}\left(Y\right)>0 and xf(1/x)xf(1/x) is strictly convex if f(x)f(x) is strictly convex. Therefore, αlog(𝐄[Wi(π,α)])+αδ\alpha\log\left(\mathbf{E}[W_{i}(\pi,\alpha)]\right)+\alpha\delta is strictly convex for α>0\alpha>0 and thus

V1,𝐄[Wi(π,)](Z1)=α(1/𝐄[Wi(π,α)])Z1(α)\displaystyle V_{1,\mathbf{E}[W_{i}(\pi,\cdot)]}^{\prime}\left(Z_{1}\right)=\alpha^{\ast}\left(1/\mathbf{E}[W_{i}(\pi,\alpha^{\ast})]\right)Z_{1}\left(\alpha^{\ast}\right)
\displaystyle\overset{d}{=}\;\mathcal{N}\left(0,\frac{\left(\alpha^{\ast}\right)^{2}}{\mathbf{E}\left[W_{i}(\pi,\alpha^{\ast})\right]^{2}}\mathbf{E}\left[\frac{1}{\pi_{0}\left(\pi(X)|X\right)}\left(\exp\left(-Y(\pi(X))/\alpha^{\ast}\right)-\mathbf{E}\left[\exp\left(-Y(\pi(X))/\alpha^{\ast}\right)\right]\right)^{2}\right]\right),

where =𝑑\overset{d}{=} denotes equality in distribution. By Lemma 1, we have that

Q^DRO(π)=infα0(αlog(W^n(π,α))+αδ),\hat{Q}_{\mathrm{\rm DRO}}(\pi)=-\inf_{\alpha\geq 0}\left(\alpha\log\left(\hat{W}_{n}(\pi,\alpha)\right)+\alpha\delta\right)\text{,}

and

QDRO(π)=infα0(αlog(𝐄[Wi(π,α)])+αδ)=V1(𝐄[Wi(π,α)]).Q_{\mathrm{\rm DRO}}(\pi)=-\inf_{\alpha\geq 0}\left(\alpha\log\left(\mathbf{E}[W_{i}(\pi,\alpha)]\right)+\alpha\delta\right)=-V_{1}(\mathbf{E}[W_{i}(\pi,\alpha)]).

It remains to show that \mathbf{P}\left(\hat{Q}_{\mathrm{\rm DRO}}(\pi)\neq-V_{1}(\hat{W}_{n}(\pi,\alpha))\right)\rightarrow 0 as nn\rightarrow\infty. The weak convergence (A.3) also implies the uniform convergence,

supα[α¯/2,2α¯])|W^n(π,α)𝐄[Wi(π,α)]|0 a.s..\sup_{\alpha\in[\underline{\alpha}/2,2\overline{\alpha}])}\left|\hat{W}_{n}(\pi,\alpha)-\mathbf{E}[W_{i}(\pi,\alpha)]\right|\rightarrow 0\text{ a.s..}

Therefore, we further have

supα[α¯/2,2α¯])|(αlog(W^n(π,α))+αδ)(αlog(𝐄[Wi(π,α)])+αδ)|0 a.s.\sup_{\alpha\in[\underline{\alpha}/2,2\overline{\alpha}])}\left|\left(\alpha\log\left(\hat{W}_{n}(\pi,\alpha)\right)+\alpha\delta\right)-\left(\alpha\log\left(\mathbf{E}[W_{i}(\pi,\alpha)]\right)+\alpha\delta\right)\right|\rightarrow 0\text{ a.s.}

given 𝐄[Wi(π,α)]\mathbf{E}[W_{i}(\pi,\alpha)] is bounded away from zero in (A.4). Let

ϵ=min{α¯/2log(𝐄[Wi(π,α¯/2)])+α¯δ/2,2α¯log(𝐄[Wi(π,2α¯)])+2α¯δ}(αlog(𝐄[Wi(π,α)])+αδ)>0.\epsilon=\min\left\{\underline{\alpha}/2\log\left(\mathbf{E}[W_{i}(\pi,\underline{\alpha}/2)]\right)+\underline{\alpha}\delta/2,2\overline{\alpha}\log\left(\mathbf{E}[W_{i}(\pi,2\overline{\alpha})]\right)+2\overline{\alpha}\delta\right\}-\left(\alpha^{\ast}\log\left(\mathbf{E}[W_{i}(\pi,\alpha^{\ast})]\right)+\alpha^{\ast}\delta\right)>0.

Then, given the event

{supα[α¯/2,2α¯])|(αlog(W^n(π,α))+αδ)(αlog(𝐄[Wi(π,α)])+αδ)|<ϵ/2},\left\{\sup_{\alpha\in[\underline{\alpha}/2,2\overline{\alpha}])}\left|\left(\alpha\log\left(\hat{W}_{n}(\pi,\alpha)\right)+\alpha\delta\right)-\left(\alpha\log\left(\mathbf{E}[W_{i}(\pi,\alpha)]\right)+\alpha\delta\right)\right|<\epsilon/2\right\},

we have

\alpha^{\ast}\log\left(\hat{W}_{n}(\pi,\alpha^{\ast})\right)+\alpha^{\ast}\delta<\min\left\{\underline{\alpha}/2\log\left(\hat{W}_{n}(\pi,\underline{\alpha}/2)\right)+\underline{\alpha}\delta/2,\;2\overline{\alpha}\log\left(\hat{W}_{n}(\pi,2\overline{\alpha})\right)+2\overline{\alpha}\delta\right\},

which means Q^DRO(π)=V1(W^n(π,α))\hat{Q}_{\mathrm{\rm DRO}}(\pi)=-V_{1}(\hat{W}_{n}(\pi,\alpha)) by the convexity of αlog(W^n(π,α))+αδ.\alpha\log\left(\hat{W}_{n}(\pi,\alpha)\right)+\alpha\delta.

Finally, we complete the proof by Slutsky’s lemma [e.g. 52, Theorem 1.8.10]:

n(Q^DRO(π)QDRO(π))\displaystyle\sqrt{n}\left(\hat{Q}_{\mathrm{\rm DRO}}(\pi)-Q_{\mathrm{\rm DRO}}(\pi)\right) =\displaystyle= n(Q^DRO(π)+V1(W^n(π,α)))+n(V1(𝐄[Wi(π,α)])V1(W^n(π,α)))\displaystyle\sqrt{n}\left(\hat{Q}_{\mathrm{\rm DRO}}(\pi)+V_{1}(\hat{W}_{n}(\pi,\alpha))\right)+\sqrt{n}\left(V_{1}(\mathbf{E}[W_{i}(\pi,\alpha)])-V_{1}(\hat{W}_{n}(\pi,\alpha))\right)
\displaystyle\Rightarrow 0+𝒩(0,σ2(α))=𝑑𝒩(0,σ2(α)),\displaystyle 0+\mathcal{N}\left(0,\sigma^{2}(\alpha^{*})\right)\overset{d}{=}\mathcal{N}\left(0,\sigma^{2}(\alpha^{*})\right),

where

σ2(α)=α2𝐄[Wi(π,α)]2𝐄[1π0(π(X)|X)(exp(Y(π(X))/α)𝐄[exp(Y(π(X))/α)])2].\sigma^{2}(\alpha)=\frac{\alpha^{2}}{\mathbf{E}\left[W_{i}(\pi,\alpha)\right]^{2}}\mathbf{E}\left[\frac{1}{\pi_{0}\left(\pi(X)|X\right)}\left(\exp\left(-Y(\pi(X))/\alpha\right)-\mathbf{E}\left[\exp\left(-Y(\pi(X))/\alpha\right)\right]\right)^{2}\right].

(2) Discrete case, i.e., Assumption 2.2 is satisfied: First, if 𝐕𝐚𝐫(Y(π(X)))=0\mathbf{Var}(Y(\pi(X)))=0, we have Q^DRO(π)=QDRO(π)=Y\hat{Q}_{\rm DRO}(\pi)={Q}_{\rm DRO}(\pi)=Y almost surely with α(π)=0\alpha^{*}(\pi)=0. Therefore, in the following, we focus on the case 𝐕𝐚𝐫(Y(π(X)))>0\mathbf{Var}(Y(\pi(X)))>0. Without loss of generality, we assume the smallest element in the set 𝔻\mathbb{D} is zero since we can always translate YY. Note that

n(W^n(π,α)𝐄[Wi(π,α)])Z1(α),\sqrt{n}\left(\hat{W}_{n}(\pi,\alpha)-\mathbf{E}[W_{i}(\pi,\alpha)]\right)\Rightarrow Z_{1}\left(\alpha\right),

for α>0\alpha>0, and

n(W^n(π,0)𝐄[Wi(π,0)])𝒩(0,𝐄[1π0(π(X)X)(1𝐏[Y=0])2]).\sqrt{n}\left(\hat{W}_{n}(\pi,0)-\mathbf{E}[W_{i}(\pi,0)]\right)\Rightarrow\mathcal{N}\left(0,\mathbf{E}\left[\frac{1}{\pi_{0}\left(\pi(X)\mid X\right)}\left(1-\mathbf{P}[Y=0]\right)^{2}\right]\right).

Further, in the discrete case, we have

𝐄[Wi(π,α)]=𝐄[exp(Y(π(X))/α)]b¯ for α0.\mathbf{E}[W_{i}(\pi,\alpha)]=\mathbf{E}[\exp\left(-Y(\pi(X))/\alpha\right)]\geq\underline{b}\text{ for }\alpha\geq 0. (A.5)

Since 𝐕𝐚𝐫(Y)>0\mathbf{Var}(Y)>0, the argument from the continuous case shows that ϕ(π,α)\phi(\pi,\alpha) is strictly concave for α>0\alpha>0. Then, ϕ(π,α)\phi(\pi,\alpha) has a unique maximizer in [0,α¯][0,\overline{\alpha}]. The rest of the proof is the same as in the continuous case. ∎
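Although Theorem 1 only provides the limiting distribution, a natural (and here merely illustrative) use of it is a plug-in asymptotic confidence interval: maximize the empirical dual over α\alpha and substitute the resulting α^n\hat{\alpha}_{n} and IPW moment estimates into σ2(α)\sigma^{2}(\alpha). The Python sketch below implements this plug-in construction on synthetic data; the data-generating process, the policy, δ\delta, and the specific plug-in estimator are our own assumptions, not a procedure prescribed by the paper.

import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(4)
n, d, delta, M = 20000, 3, 0.2, 1.0
X = rng.uniform(size=n)
A = rng.integers(0, d, size=n)
Y = rng.uniform(0.0, M, size=n)
prop = np.full(n, 1.0 / d)                 # propensity of the logged action (uniform logging)
policy = lambda x: (x > 0.5).astype(int)

ipw = (policy(X) == A) / prop              # 1{pi(X_i)=A_i} / pi_0(A_i|X_i)
S_n = ipw.mean()

def W_hat(alpha):
    # self-normalized IPW estimate of E[exp(-Y(pi(X))/alpha)]
    return np.sum(ipw * np.exp(-Y / alpha)) / (n * S_n)

# empirical dual: Q_hat_DRO(pi) = -inf_{alpha} { alpha log W_hat(alpha) + alpha delta }
res = minimize_scalar(lambda a: a * np.log(W_hat(a)) + a * delta,
                      bounds=(1e-3, M / delta), method="bounded")
alpha_n, Q_hat = res.x, -res.fun

# plug-in estimate of sigma^2(alpha) from Theorem 1
mean_exp = W_hat(alpha_n)
centered_sq = (np.exp(-Y / alpha_n) - mean_exp) ** 2
sigma2_hat = (alpha_n ** 2 / mean_exp ** 2) * np.mean(ipw / prop * centered_sq)

half_width = 1.96 * np.sqrt(sigma2_hat / n)
print(f"Q_hat_DRO = {Q_hat:.4f}, 95% CI ~ [{Q_hat - half_width:.4f}, {Q_hat + half_width:.4f}]")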

Appendix A.4 Proof of the statistical performance guarantee in Section 6

Proof of Lemma 6.

For any probability measure 𝐏1\mathbf{P}_{1} supported on 𝔻\mathbb{D}, we have

supα0{αlog𝐄𝐏1[exp(Y/α)]αδ}=supα0{αlog(d𝔻[exp(d/α)𝐏1(Y=d)])αδ}.\sup_{\alpha\geq 0}\left\{-\alpha\log\mathbf{E}_{\mathbf{P}_{1}}\left[\exp\left(-Y/\alpha\right)\right]-\alpha\delta\right\}=\sup_{\alpha\geq 0}\left\{-\alpha\log\left(\sum_{d\in\mathbb{D}}\left[\exp\left(-d/\alpha\right)\mathbf{P}_{1}\left(Y=d\right)\right]\right)-\alpha\delta\right\}.

Therefore, we have

|supα0{αlog𝐄𝐏1[exp(Y/α)]αδ}supα0{αlog𝐄𝐏2[exp(Y/α)]αδ}|\displaystyle\left|\sup_{\alpha\geq 0}\left\{-\alpha\log\mathbf{E}_{\mathbf{P}_{1}}\left[\exp\left(-Y/\alpha\right)\right]-\alpha\delta\right\}-\sup_{\alpha\geq 0}\left\{-\alpha\log\mathbf{E}_{\mathbf{P}_{2}}\left[\exp\left(-Y/\alpha\right)\right]-\alpha\delta\right\}\right|
\displaystyle\leq supα0|αlog(d𝔻[exp(d/α)𝐏1(Y=d)]d𝔻[exp(d/α)𝐏2(Y=d)])|\displaystyle\sup_{\alpha\geq 0}\left|\alpha\log\left(\frac{\sum_{d\in\mathbb{D}}\left[\exp\left(-d/\alpha\right)\mathbf{P}_{1}\left(Y=d\right)\right]}{\sum_{d\in\mathbb{D}}\left[\exp\left(-d/\alpha\right)\mathbf{P}_{2}\left(Y=d\right)\right]}\right)\right|
=\displaystyle= supα0|αlog(d𝔻[exp(d/α)(𝐏1(Y=d)𝐏2(Y=d))]d𝔻[exp(d/α)𝐏2(Y=d)]+1)|.\displaystyle\sup_{\alpha\geq 0}\left|\alpha\log\left(\frac{\sum_{d\in\mathbb{D}}\left[\exp\left(-d/\alpha\right)\left(\mathbf{P}_{1}\left(Y=d\right)-\mathbf{P}_{2}\left(Y=d\right)\right)\right]}{\sum_{d\in\mathbb{D}}\left[\exp\left(-d/\alpha\right)\mathbf{P}_{2}\left(Y=d\right)\right]}+1\right)\right|.

Since we can always divide the numerator and denominator of the last display by exp(mind𝔻d/α)\exp\left(-\min_{d\in\mathbb{D}}d/\alpha\right), we can assume mind𝔻d=0\min_{d\in\mathbb{D}}d=0 without loss of generality. Therefore, we have

d𝔻[exp(d/α)𝐏2(Y=d)]=𝐏2(Y=0)+d𝔻,d0[exp(d/α)𝐏2(Y=d)]b¯.\sum_{d\in\mathbb{D}}\left[\exp\left(-d/\alpha\right)\mathbf{P}_{2}\left(Y=d\right)\right]=\mathbf{P}_{2}\left(Y=0\right)+\sum_{d\in\mathbb{D},d\neq 0}\left[\exp\left(-d/\alpha\right)\mathbf{P}_{2}\left(Y=d\right)\right]\geq\underline{b}.

Then, if |d𝔻[exp(d/α)(𝐏1(Y=d)𝐏2(Y=d))]|<b¯/2,\left|\sum_{d\in\mathbb{D}}\left[\exp\left(-d/\alpha\right)\left(\mathbf{P}_{1}\left(Y=d\right)-\mathbf{P}_{2}\left(Y=d\right)\right)\right]\right|<\underline{b}/2, we have

supα0|αlog(d𝔻[exp(d/α)(𝐏1(Y=d)𝐏2(Y=d))]d𝔻[exp(d/α)𝐏2(Y=d)]+1)|\displaystyle\sup_{\alpha\geq 0}\left|\alpha\log\left(\frac{\sum_{d\in\mathbb{D}}\left[\exp\left(-d/\alpha\right)\left(\mathbf{P}_{1}\left(Y=d\right)-\mathbf{P}_{2}\left(Y=d\right)\right)\right]}{\sum_{d\in\mathbb{D}}\left[\exp\left(-d/\alpha\right)\mathbf{P}_{2}\left(Y=d\right)\right]}+1\right)\right| (A.7)
\displaystyle\leq 2supα0{α|d𝔻[exp(d/α)(𝐏1(Y=d)𝐏2(Y=d))]|d𝔻[exp(d/α)𝐏2(Y=d)]}\displaystyle 2\sup_{\alpha\geq 0}\left\{\alpha\frac{\left|\sum_{d\in\mathbb{D}}\left[\exp\left(-d/\alpha\right)\left(\mathbf{P}_{1}\left(Y=d\right)-\mathbf{P}_{2}\left(Y=d\right)\right)\right]\right|}{\sum_{d\in\mathbb{D}}\left[\exp\left(-d/\alpha\right)\mathbf{P}_{2}\left(Y=d\right)\right]}\right\}
\displaystyle\leq 2b¯supα0{α|d𝔻[exp(d/α)(𝐏1(Y=d)𝐏2(Y=d))]|}.\displaystyle\frac{2}{\underline{b}}\sup_{\alpha\geq 0}\left\{\alpha\left|\sum_{d\in\mathbb{D}}\left[\exp\left(-d/\alpha\right)\left(\mathbf{P}_{1}\left(Y=d\right)-\mathbf{P}_{2}\left(Y=d\right)\right)\right]\right|\right\}.

Then, we turn to (A.7),

|d𝔻[exp(d/α)(𝐏1(Y=d)𝐏2(Y=d))]|\displaystyle\left|\sum_{d\in\mathbb{D}}\left[\exp\left(-d/\alpha\right)\left(\mathbf{P}_{1}\left(Y=d\right)-\mathbf{P}_{2}\left(Y=d\right)\right)\right]\right|
=\displaystyle= |(d:𝐏1(Y=d)𝐏2(Y=d)[ed/α(𝐏1(Y=d)𝐏2(Y=d))])(d:𝐏1(Y=d)<𝐏2(Y=d)[ed/α(𝐏2(Y=d)𝐏1(Y=d))])|,\displaystyle\left|\left(\sum_{\begin{subarray}{c}d:\mathbf{P}_{1}\left(Y=d\right)\\ \geq\mathbf{P}_{2}\left(Y=d\right)\end{subarray}}\left[e^{-d/\alpha}\left(\mathbf{P}_{1}\left(Y=d\right)-\mathbf{P}_{2}\left(Y=d\right)\right)\right]\right)-\right.\left.\left(\sum_{\begin{subarray}{c}d:\mathbf{P}_{1}\left(Y=d\right)\\ <\mathbf{P}_{2}\left(Y=d\right)\end{subarray}}\left[e^{-d/\alpha}\left(\mathbf{P}_{2}\left(Y=d\right)-\mathbf{P}_{1}\left(Y=d\right)\right)\right]\right)\right|,

which can be bounded by

(d:𝐏1(Y=d)𝐏2(Y=d)eM/α(𝐏1(Y=d)𝐏2(Y=d)))(d:𝐏1(Y=d)<𝐏2(Y=d)(𝐏2(Y=d)𝐏1(Y=d)))\displaystyle\left(\sum_{\begin{subarray}{c}d:\mathbf{P}_{1}\left(Y=d\right)\\ \geq\mathbf{P}_{2}\left(Y=d\right)\end{subarray}}e^{-M/\alpha}\left(\mathbf{P}_{1}\left(Y=d\right)-\mathbf{P}_{2}\left(Y=d\right)\right)\right)-\left(\sum_{\begin{subarray}{c}d:\mathbf{P}_{1}\left(Y=d\right)\\ <\mathbf{P}_{2}\left(Y=d\right)\end{subarray}}\left(\mathbf{P}_{2}\left(Y=d\right)-\mathbf{P}_{1}\left(Y=d\right)\right)\right)
(d:𝐏1(Y=d)𝐏2(Y=d)[ed/α(𝐏1(Y=d)𝐏2(Y=d))])(d:𝐏1(Y=d)<𝐏2(Y=d)[ed/α(𝐏2(Y=d)𝐏1(Y=d))])\displaystyle\leq\left(\sum_{\begin{subarray}{c}d:\mathbf{P}_{1}\left(Y=d\right)\\ \geq\mathbf{P}_{2}\left(Y=d\right)\end{subarray}}\left[e^{-d/\alpha}\left(\mathbf{P}_{1}\left(Y=d\right)-\mathbf{P}_{2}\left(Y=d\right)\right)\right]\right)-\left(\sum_{\begin{subarray}{c}d:\mathbf{P}_{1}\left(Y=d\right)\\ <\mathbf{P}_{2}\left(Y=d\right)\end{subarray}}\left[e^{-d/\alpha}\left(\mathbf{P}_{2}\left(Y=d\right)-\mathbf{P}_{1}\left(Y=d\right)\right)\right]\right)
(d:𝐏1(Y=d)𝐏2(Y=d)(𝐏1(Y=d)𝐏2(Y=d)))(d:𝐏1(Y=d)<𝐏2(Y=d)[eM/α(𝐏2(Y=d)𝐏1(Y=d))]).\displaystyle\leq\left(\sum_{\begin{subarray}{c}d:\mathbf{P}_{1}\left(Y=d\right)\\ \geq\mathbf{P}_{2}\left(Y=d\right)\end{subarray}}\left(\mathbf{P}_{1}\left(Y=d\right)-\mathbf{P}_{2}\left(Y=d\right)\right)\right)-\left(\sum_{\begin{subarray}{c}d:\mathbf{P}_{1}\left(Y=d\right)\\ <\mathbf{P}_{2}\left(Y=d\right)\end{subarray}}\left[e^{-M/\alpha}\left(\mathbf{P}_{2}\left(Y=d\right)-\mathbf{P}_{1}\left(Y=d\right)\right)\right]\right).

Further, we observe that

TV(𝐏1,𝐏2)\displaystyle\mathrm{TV}(\mathbf{P}_{1},\mathbf{P}_{2}) =\displaystyle= d:𝐏1(Y=d)𝐏2(Y=d)(𝐏1(Y=d)𝐏2(Y=d))\displaystyle\sum_{d:\mathbf{P}_{1}\left(Y=d\right)\geq\mathbf{P}_{2}\left(Y=d\right)}\left(\mathbf{P}_{1}\left(Y=d\right)-\mathbf{P}_{2}\left(Y=d\right)\right)
=\displaystyle= d:𝐏1(Y=d)<𝐏2(Y=d)(𝐏2(Y=d)𝐏1(Y=d)),\displaystyle\sum_{d:\mathbf{P}_{1}\left(Y=d\right)<\mathbf{P}_{2}\left(Y=d\right)}\left(\mathbf{P}_{2}\left(Y=d\right)-\mathbf{P}_{1}\left(Y=d\right)\right),

where TV(𝐏1,𝐏2)\mathrm{TV}(\mathbf{P}_{1},\mathbf{P}_{2}) denotes the total variation distance between 𝐏1\mathbf{P}_{1} and 𝐏2.\mathbf{P}_{2}. Therefore, we have

supα0{α|d𝔻[exp(d/α)(𝐏1(Y=d)𝐏2(Y=d))]|}\displaystyle\sup_{\alpha\geq 0}\left\{\alpha\left|\sum_{d\in\mathbb{D}}\left[\exp\left(-d/\alpha\right)\left(\mathbf{P}_{1}\left(Y=d\right)-\mathbf{P}_{2}\left(Y=d\right)\right)\right]\right|\right\}
\displaystyle\leq supα0{α(1exp(M/α))}TV(𝐏1,𝐏2).\displaystyle\sup_{\alpha\geq 0}\left\{\alpha(1-\exp\left(-M/\alpha\right))\right\}\mathrm{TV}(\mathbf{P}_{1},\mathbf{P}_{2}).

And it is easy to verify that

supα0{α(1exp(M/α))}M.\sup_{\alpha\geq 0}\left\{\alpha(1-\exp\left(-M/\alpha\right))\right\}\leq M.

By combining all the above together, we have when TV(𝐏1,𝐏2)<b¯/2,\mathrm{TV}(\mathbf{P}_{1},\mathbf{P}_{2})<\underline{b}/2,

|supα0{αlog𝐄𝐏1[exp(Y/α)]αδ}supα0{αlog𝐄𝐏2[exp(Y/α)]αδ}|\displaystyle\left|\sup_{\alpha\geq 0}\left\{-\alpha\log\mathbf{E}_{\mathbf{P}_{1}}\left[\exp\left(-Y/\alpha\right)\right]-\alpha\delta\right\}-\sup_{\alpha\geq 0}\left\{-\alpha\log\mathbf{E}_{\mathbf{P}_{2}}\left[\exp\left(-Y/\alpha\right)\right]-\alpha\delta\right\}\right|
\displaystyle\leq 2Mb¯TV(𝐏1,𝐏2).\displaystyle\frac{2M}{\underline{b}}\mathrm{TV}(\mathbf{P}_{1},\mathbf{P}_{2}).

Next, we give the proof of Lemma 5.

Proof of Lemma 5.

Notice that

|supα0{αlog𝐄𝐏1{exp(Y/α)}αδ}supα0{αlog𝐄𝐏2{exp(Y/α)}αδ}|\displaystyle\left|\sup_{\alpha\geq 0}\left\{-\alpha\log\mathbf{E}_{\mathbf{P}_{1}}\left\{\exp\left(-Y/\alpha\right)\right\}-\alpha\delta\right\}-\sup_{\alpha\geq 0}\left\{-\alpha\log\mathbf{E}_{\mathbf{P}_{2}}\left\{\exp\left(-Y/\alpha\right)\right\}-\alpha\delta\right\}\right|
\displaystyle\leq supα0|αlog𝐄𝐏1[exp(Y/α)]αlog𝐄𝐏2[exp(Y/α)]|\displaystyle\sup_{\alpha\geq 0}\left|\alpha\log\mathbf{E}_{\mathbf{P}_{1}}\left[\exp\left(-Y/\alpha\right)\right]-\alpha\log\mathbf{E}_{\mathbf{P}_{2}}\left[\exp\left(-Y/\alpha\right)\right]\right|
=\displaystyle= supα0|αlog𝐄𝐏U[exp(q𝐏1(U)/α)]αlog𝐄𝐏U[exp(q𝐏2(U)/α)]|,\displaystyle\sup_{\alpha\geq 0}\left|\alpha\log\mathbf{E}_{\mathbf{P}_{U}}\left[\exp\left(-q_{\mathbf{P}_{1}}\left(U\right)/\alpha\right)\right]-\alpha\log\mathbf{E}_{\mathbf{P}_{U}}\left[\exp\left(-q_{\mathbf{P}_{2}}\left(U\right)/\alpha\right)\right]\right|,

where 𝐏U\mathbf{P}_{U} is the uniform distribution on [0,1][0,1] and the last equality is based on the fact that q𝐏(U)q_{\mathbf{P}}\left(U\right) has distribution 𝐏.\mathbf{P}.

Denote T=supt[0,1]|q𝐏1(t)q𝐏2(t)|T=\sup_{t\in[0,1]}\left|q_{\mathbf{P}_{1}}\left(t\right)-q_{\mathbf{P}_{2}}\left(t\right)\right| and we have

αlog𝐄𝐏U[exp(q𝐏1(U)/α)]αlog[𝐄𝐏Uexp(q𝐏2(U)/α)]\displaystyle\alpha\log\mathbf{E}_{\mathbf{P}_{U}}\left[\exp\left(-q_{\mathbf{P}_{1}}\left(U\right)/\alpha\right)\right]-\alpha\log\left[\mathbf{E}_{\mathbf{P}_{U}}\exp\left(-q_{\mathbf{P}_{2}}\left(U\right)/\alpha\right)\right]
=\displaystyle= αlog[01exp(q𝐏1(u)/α)du]αlog[01exp(q𝐏2(u)/α)du]\displaystyle\alpha\log\left[\int_{0}^{1}\exp\left(-q_{\mathbf{P}_{1}}\left(u\right)/\alpha\right){\rm d}u\right]-\alpha\log\left[\int_{0}^{1}\exp\left(-q_{\mathbf{P}_{2}}\left(u\right)/\alpha\right){\rm d}u\right]
\displaystyle\leq αlog[01exp(q𝐏2(u)/α)exp(T/α)du]αlog[01exp(q𝐏2(u)/α)du]\displaystyle\alpha\log\left[\int_{0}^{1}\exp\left(-q_{\mathbf{P}_{2}}\left(u\right)/\alpha\right)\exp\left(T/\alpha\right){\rm d}u\right]-\alpha\log\left[\int_{0}^{1}\exp\left(-q_{\mathbf{P}_{2}}\left(u\right)/\alpha\right){\rm d}u\right]
=\displaystyle= T.\displaystyle T.

Similarly, we have

αlog[𝐄𝐏Uexp(q𝐏2(U)/α)]αlog[𝐄𝐏Uexp(q𝐏1(U)/α)]T.\alpha\log\left[\mathbf{E}_{\mathbf{P}_{U}}\exp\left(-q_{\mathbf{P}_{2}}\left(U\right)/\alpha\right)\right]-\alpha\log\left[\mathbf{E}_{\mathbf{P}_{U}}\exp\left(-q_{\mathbf{P}_{1}}\left(U\right)/\alpha\right)\right]\leq T.

The desired result then follows. ∎

Lemma A16.

Suppose that the probability measures 𝐏1\mathbf{P}_{1} and 𝐏2\mathbf{P}_{2} are supported on [0,M][0,M] and that 𝐏1\mathbf{P}_{1} has a positive density f𝐏1()f_{\mathbf{P}_{1}}\left(\cdot\right) with lower bound f𝐏1b¯f_{\mathbf{P}_{1}}\geq\underline{b} over the interval [0,M].[0,M]. Then, we have

supt[0,1]|q𝐏1(t)q𝐏2(t)|1b¯supx[0,M]|F𝐏1(x)F𝐏2(x)|.\sup_{t\in[0,1]}\left|q_{\mathbf{P}_{1}}\left(t\right)-q_{\mathbf{P}_{2}}\left(t\right)\right|\leq\frac{1}{\underline{b}}\sup_{x\in[0,M]}\left|F_{\mathbf{P}_{1}}\left(x\right)-F_{\mathbf{P}_{2}}\left(x\right)\right|.
Proof.

For ease of notation, let x1=q𝐏1(t)x_{1}=q_{\mathbf{P}_{1}}\left(t\right) and x2=q𝐏2(t).x_{2}=q_{\mathbf{P}_{2}}\left(t\right). Since distribution functions are right-continuous with left limits and 𝐏1\mathbf{P}_{1} is continuous, we have

F𝐏2(x2)t,F𝐏2(x2)t, and F𝐏1(x1)=t.F_{\mathbf{P}_{2}}\left(x_{2}-\right)\leq t,F_{\mathbf{P}_{2}}\left(x_{2}\right)\geq t,\text{ and }F_{\mathbf{P}_{1}}\left(x_{1}\right)=t.

If x1x2,x_{1}\geq x_{2}, we have

x1x2\displaystyle x_{1}-x_{2} \displaystyle\leq 1b¯(F𝐏1(x1)F𝐏1(x2))\displaystyle\frac{1}{\underline{b}}\left(F_{\mathbf{P}_{1}}\left(x_{1}\right)-F_{\mathbf{P}_{1}}\left(x_{2}\right)\right)
\displaystyle\leq 1b¯((F𝐏1(x1)F𝐏1(x2))+(F𝐏2(x2)F𝐏1(x1)))\displaystyle\frac{1}{\underline{b}}\left(\left(F_{\mathbf{P}_{1}}\left(x_{1}\right)-F_{\mathbf{P}_{1}}\left(x_{2}\right)\right)+\left(F_{\mathbf{P}_{2}}\left(x_{2}\right)-F_{\mathbf{P}_{1}}\left(x_{1}\right)\right)\right)
=\displaystyle= 1b¯(F𝐏2(x2)F𝐏1(x2)).\displaystyle\frac{1}{\underline{b}}\left(F_{\mathbf{P}_{2}}\left(x_{2}\right)-F_{\mathbf{P}_{1}}\left(x_{2}\right)\right).

If x1<x2,x_{1}<x_{2}, we construct a monotone increasing sequence x(1),x(2),x^{(1)},x^{(2)},\ldots with x(n)x2.x^{(n)}\uparrow x_{2}. Since 𝐏1\mathbf{P}_{1} is continuous, we have F𝐏1(x(n))F𝐏1(x2).F_{\mathbf{P}_{1}}\left(x^{(n)}\right)\uparrow F_{\mathbf{P}_{1}}\left(x_{2}\right). Then, notice that

x2x1\displaystyle x_{2}-x_{1} \displaystyle\leq 1b¯(F𝐏1(x2)F𝐏1(x1))\displaystyle\frac{1}{\underline{b}}\left(F_{\mathbf{P}_{1}}\left(x_{2}\right)-F_{\mathbf{P}_{1}}\left(x_{1}\right)\right)
\displaystyle\leq 1b¯((F𝐏1(x2)F𝐏1(x1))+(F𝐏1(x1)F𝐏2(x(n))))\displaystyle\frac{1}{\underline{b}}\left(\left(F_{\mathbf{P}_{1}}\left(x_{2}\right)-F_{\mathbf{P}_{1}}\left(x_{1}\right)\right)+\left(F_{\mathbf{P}_{1}}\left(x_{1}\right)-F_{\mathbf{P}_{2}}\left(x^{(n)}\right)\right)\right)
=\displaystyle= 1b¯(F𝐏1(x2)F𝐏2(x(n)))\displaystyle\frac{1}{\underline{b}}\left(F_{\mathbf{P}_{1}}\left(x_{2}\right)-F_{\mathbf{P}_{2}}\left(x^{(n)}\right)\right)
=\displaystyle= limn1b¯(F𝐏1(x(n))F𝐏2(x(n))).\displaystyle\lim_{n\rightarrow\infty}\frac{1}{\underline{b}}\left(F_{\mathbf{P}_{1}}\left(x^{(n)}\right)-F_{\mathbf{P}_{2}}\left(x^{(n)}\right)\right).

Therefore, for every tt, we have

|q𝐏1(t)q𝐏2(t)|=|x1x2|1b¯supx[0,M]|F𝐏1(x)F𝐏2(x)|.\left|q_{\mathbf{P}_{1}}\left(t\right)-q_{\mathbf{P}_{2}}\left(t\right)\right|=\left|x_{1}-x_{2}\right|\leq\frac{1}{\underline{b}}\sup_{x\in[0,M]}\left|F_{\mathbf{P}_{1}}\left(x\right)-F_{\mathbf{P}_{2}}\left(x\right)\right|.

The desired result then follows. ∎
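A two-line numerical check of Lemma A16 (illustrative only): take 𝐏1\mathbf{P}_{1} to be the uniform distribution on [0,1][0,1] (so the density lower bound is b¯=1\underline{b}=1) and 𝐏2=Beta(2,2)\mathbf{P}_{2}=\mathrm{Beta}(2,2); the sup-distance between the quantile functions should not exceed (1/b¯)(1/\underline{b}) times the Kolmogorov distance. The choice of distributions and grids is an assumption made for this example.

import numpy as np
from scipy.stats import beta

t = np.linspace(1e-4, 1 - 1e-4, 10001)   # quantile levels
x = np.linspace(0.0, 1.0, 10001)         # evaluation points for the CDFs

quantile_gap = np.max(np.abs(t - beta.ppf(t, 2, 2)))   # sup_t |q_{P1}(t) - q_{P2}(t)|, with q_{P1}(t) = t
cdf_gap = np.max(np.abs(x - beta.cdf(x, 2, 2)))        # sup_x |F_{P1}(x) - F_{P2}(x)|, with F_{P1}(x) = x
print(quantile_gap, "<=", cdf_gap / 1.0)               # Lemma A16 with b = 1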

By utilizing Lemmas 5, 6 and A16, we are ready to prove Theorem 2.

Proof of Theorem 2.

Recall RDRO(π^DRO)=QDRO(πDRO)QDRO(π^DRO).R_{\rm DRO}(\hat{\pi}_{\rm DRO})=Q_{\rm DRO}(\pi^{\ast}_{\rm DRO})-Q_{\rm DRO}(\hat{\pi}_{\rm DRO}). Then,

RDRO(π^DRO)\displaystyle R_{\rm DRO}(\hat{\pi}_{\rm DRO})
=\displaystyle= QDRO(πDRO)Q^DRO(π^DRO)+Q^DRO(π^DRO)QDRO(π^DRO)\displaystyle Q_{\rm DRO}(\pi^{\ast}_{\rm DRO})-\hat{Q}_{\rm DRO}(\hat{\pi}_{\rm DRO})+\hat{Q}_{\rm DRO}(\hat{\pi}_{\rm DRO})-Q_{\rm DRO}(\hat{\pi}_{\rm DRO})
\displaystyle\leq (QDRO(πDRO)Q^DRO(πDRO))+(Q^DRO(π^DRO)QDRO(π^DRO))\displaystyle\left(Q_{\rm DRO}(\pi^{\ast}_{\rm DRO})-\hat{Q}_{\rm DRO}(\pi^{\ast}_{\rm DRO})\right)+\left(\hat{Q}_{\rm DRO}(\hat{\pi}_{\rm DRO})-Q_{\rm DRO}(\hat{\pi}_{\rm DRO})\right)
\displaystyle\leq 2supπΠ|Q^DRO(π)QDRO(π)|.\displaystyle 2\sup_{\pi\in\Pi}\left|\hat{Q}_{\rm DRO}(\pi)-Q_{\rm DRO}(\pi)\right|.

Note that

\hat{Q}_{\rm DRO}(\pi)=\sup_{\alpha\geq 0}\left\{-\alpha\log\left[\mathbf{E}_{\mathbf{\hat{P}}_{n}^{\pi}}\exp\left(-Y_{i}(\pi(X_{i}))/\alpha\right)\right]-\alpha\delta\right\}. (A.8)

Therefore, we have

RDRO(π^DRO)\displaystyle R_{\rm DRO}(\hat{\pi}_{\rm DRO}) (A.9)
\displaystyle\leq 2supπΠ|supα0{αlog[𝐄𝐏^nπexp(Yi(π(Xi))/α)]αδ}supα0{αlog[𝐄𝐏0exp(Y(π(X))/α)]αδ}|\displaystyle 2\sup_{\pi\in\Pi}\left|\sup_{\alpha\geq 0}\left\{-\alpha\log\left[\mathbf{E}_{\mathbf{\hat{P}}_{n}^{\pi}}\exp\left(-Y_{i}(\pi(X_{i})\right)/\alpha)\right]-\alpha\delta\right\}-\sup_{\alpha\geq 0}\left\{-\alpha\log\left[\mathbf{E}_{\mathbf{P}_{0}}\exp\left(-Y(\pi(X)\right)/\alpha)\right]-\alpha\delta\right\}\right|

We then divide the proof into two parts: continuous case and discrete case.

(1) Continuous case, i.e., Assumption 2.1 is satisfied: By Lemmas 5, A16 and Assumption 1, we have

supπΠ|supα0{αlog[𝐄𝐏^nπexp(Yi(π(Xi)))/α)]αδ}supα0{αlog[𝐄𝐏0exp(Y(π(X)))/α)]αδ}|\displaystyle\sup_{\pi\in\Pi}\left|\sup_{\alpha\geq 0}\left\{-\alpha\log\left[\mathbf{E}_{\mathbf{\hat{P}}_{n}^{\pi}}\exp\left(-Y_{i}(\pi(X_{i}))\right)/\alpha)\right]-\alpha\delta\right\}-\sup_{\alpha\geq 0}\left\{-\alpha\log\left[\mathbf{E}_{\mathbf{P}_{0}}\exp\left(-Y(\pi(X))\right)/\alpha)\right]-\alpha\delta\right\}\right| (A.11)
\displaystyle\leq supπΠsupx[0,M]1b¯|𝐄𝐏^nπ[𝟏{Yi(π(Xi))x}]𝐄𝐏0[𝟏{Yi(π(Xi))x}]|\displaystyle\sup_{\pi\in\Pi}\sup_{x\in[0,M]}\frac{1}{\underline{b}}\left|\mathbf{E}_{\mathbf{\hat{P}}_{n}^{\pi}}\left[\mathbf{1}\left\{Y_{i}(\pi(X_{i}))\leq x\right\}\right]-\mathbf{E}_{{}_{\mathbf{P}_{0}}}\left[\mathbf{1}\left\{Y_{i}(\pi(X_{i}))\leq x\right\}\right]\right|
=\displaystyle= supπΠ,x[0,M]1b¯|(1nSnπi=1n𝟏{π(Xi))=Ai}π0(Ai|Xi)𝟏{Yi(π(Xi))x})𝐄𝐏0[𝟏{Y(π(X))x}]|\displaystyle\sup_{\pi\in\Pi,x\in[0,M]}\frac{1}{\underline{b}}\left|\left(\frac{1}{nS_{n}^{\pi}}\sum_{i=1}^{n}\frac{\mathbf{1\{}\pi(X_{i}))=A_{i}\mathbf{\}}}{\pi_{0}\left(A_{i}|X_{i}\right)}\mathbf{1}\left\{Y_{i}(\pi(X_{i}))\leq x\right\}\right)-\mathbf{E}_{{}_{\mathbf{P}_{0}}}\left[\mathbf{1}\left\{Y(\pi(X))\leq x\right\}\right]\right|
\displaystyle\leq supπΠ,x[0,M]1b¯|(1ni=1n𝟏{π(Xi))=Ai}π0(Ai|Xi)𝟏{Yi(π(Xi))x})𝐄𝐏0[𝟏{π(X)=Ai}π0(A|X)𝟏{Y(π(X))x}]|\displaystyle\sup_{\pi\in\Pi,x\in[0,M]}\frac{1}{\underline{b}}\left|\left(\frac{1}{n}\sum_{i=1}^{n}\frac{\mathbf{1\{}\pi(X_{i}))=A_{i}\mathbf{\}}}{\pi_{0}\left(A_{i}|X_{i}\right)}\mathbf{1}\left\{Y_{i}(\pi(X_{i}))\leq x\right\}\right)-\mathbf{E}_{{}_{\mathbf{P}_{0}}}\left[\frac{\mathbf{1\{}\pi(X)=A_{i}\mathbf{\}}}{\pi_{0}\left(A|X\right)}\mathbf{1}\left\{Y(\pi(X))\leq x\right\}\right]\right|
+supπΠ,x[0,M]1b¯|(Snπ1nSnπi=1n𝟏{π(Xi)=Ai}π0(Ai|Xi)𝟏{Y(π(Xi))x})|.\displaystyle+\sup_{\pi\in\Pi,x\in[0,M]}\frac{1}{\underline{b}}\left|\left(\frac{S_{n}^{\pi}-1}{nS_{n}^{\pi}}\sum_{i=1}^{n}\frac{\mathbf{1\{}\pi(X_{i})=A_{i}\mathbf{\}}}{\pi_{0}\left(A_{i}|X_{i}\right)}\mathbf{1}\left\{Y(\pi(X_{i}))\leq x\right\}\right)\right|.

For the first term on the right-hand side of (A.11), by Wainwright [82, Theorem 4.10], we have, with probability at least 1-\exp\left(-n\epsilon^{2}\eta^{2}/2\right),

supπΠ,x[0,M]1b¯|(1ni=1n𝟏{π(Xi)=Ai}π0(Ai|Xi)𝟏{Yi(π(Xi)x})𝐄𝐏0[𝟏{π(X)=A}π0(A|X)𝟏{Y(π(X)x}]|\displaystyle\sup_{\pi\in\Pi,x\in[0,M]}\frac{1}{\underline{b}}\left|\left(\frac{1}{n}\sum_{i=1}^{n}\frac{\mathbf{1\{}\pi(X_{i})=A_{i}\mathbf{\}}}{\pi_{0}\left(A_{i}|X_{i}\right)}\mathbf{1}\left\{Y_{i}(\pi(X_{i})\leq x\right\}\right)-\mathbf{E}_{{}_{\mathbf{P}_{0}}}\left[\frac{\mathbf{1\{}\pi(X)=A\mathbf{\}}}{\pi_{0}\left(A|X\right)}\mathbf{1}\left\{Y(\pi(X)\leq x\right\}\right]\right| (A.12)
\displaystyle\leq 1b¯(2n(Π,x)+ϵ),\displaystyle\frac{1}{\underline{b}}\left(2\mathcal{R}_{n}\left(\mathcal{F}_{\Pi,x}\right)+\epsilon\right),

where the function class Π,x\mathcal{F}_{\Pi,x} is defined as

Π,x{fπ,x(X,Y,A)=𝟏{π(X)=A}𝟏{Y(π(X))x}π0(A|X)|πΠ,x[0,M]}.\mathcal{F}_{\Pi,x}\triangleq\left\{f_{\pi,x}(X,Y,A)=\left.\frac{\mathbf{1\{}\pi(X)=A\mathbf{\}1\{}Y(\pi(X))\leq x\}}{\pi_{0}(A|X)}\right|\pi\in\Pi,x\in[0,M]\right\}.

For the second term on the right-hand side of (A.11), we have

\frac{1}{\underline{b}}\left|\frac{S_{n}^{\pi}-1}{nS_{n}^{\pi}}\sum_{i=1}^{n}\frac{\mathbf{1}\{\pi(X_{i})=A_{i}\}}{\pi_{0}\left(A_{i}|X_{i}\right)}\mathbf{1}\left\{Y_{i}(\pi(X_{i}))\leq x\right\}\right|\leq\frac{1}{\underline{b}}\left|S_{n}^{\pi}-1\right|. (A.13)

Further, by Wainwright [82, Theorem 4.10] and the facts that \frac{1}{\pi_{0}(A|X)}\leq\frac{1}{\eta} and \mathbf{E}\left[\frac{\mathbf{1}\{\pi(X)=A\}}{\pi_{0}(A|X)}\right]=1, we have, with probability at least 1-\exp\left(-{n\epsilon^{2}\eta^{2}}/{2}\right),

supπΠ|Snπ1|2n(Π)+ϵ,\sup_{\pi\in\Pi}\left|S_{n}^{\pi}-1\right|\leq 2\mathcal{R}_{n}\left(\mathcal{F}_{\Pi}\right)+\epsilon, (A.14)

where Π{fπ(X,A)=𝟏{π(X)=A}π0(A|X)|πΠ}.\mathcal{F}_{\Pi}\triangleq\left\{\left.f_{\pi}(X,A)=\frac{\mathbf{1}\{\pi(X)=A\}}{\pi_{0}(A|X)}\right|\pi\in\Pi\right\}.

By combining (A.12), (A.13), and (A.14), we have, with probability at least 1-2\exp\left(-n\epsilon^{2}\eta^{2}/2\right),

R_{\rm DRO}(\hat{\pi}_{\rm DRO})\leq\frac{2}{\underline{b}}\left(2\mathcal{R}_{n}\left(\mathcal{F}_{\Pi,x}\right)+\epsilon\right)+\frac{2}{\underline{b}}\left(2\mathcal{R}_{n}\left(\mathcal{F}_{\Pi}\right)+\epsilon\right).

Now, we turn to the Rademacher complexities of the classes \mathcal{F}_{\Pi,x} and \mathcal{F}_{\Pi}. First, for the class \mathcal{F}_{\Pi}, notice that

i=1n1n(𝟏{π1(Xi)=Ai}π0(Ai|Xi)𝟏{π2(Xi)=Ai}π0(Ai|Xi))2\displaystyle\sqrt{\sum_{i=1}^{n}\frac{1}{n}\left(\frac{\mathbf{1}\{\pi_{1}(X_{i})=A_{i}\}}{\pi_{0}(A_{i}|X_{i})}-\frac{\mathbf{1}\{\pi_{2}(X_{i})=A_{i}\}}{\pi_{0}(A_{i}|X_{i})}\right)^{2}}
=\displaystyle= 1ni=1n(𝟏{π1(Xi)π2(Xi)}π0(Ai|Xi))2\displaystyle\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\frac{\mathbf{1\{}\pi_{1}(X_{i})\neq\pi_{2}(X_{i})\mathbf{\}}}{\pi_{0}(A_{i}|X_{i})}\right)^{2}}
\displaystyle\leq 1ηH(π1,π2).\displaystyle\frac{1}{\eta}\sqrt{H\left(\pi_{1},\pi_{2}\right)}.

Therefore, the covering number

N(t,Π,𝐏n)N(t,Π,H(,)/η)=NH(n)(η2t2,Π,{x1,,xn})NH(n)(η2t2,Π).N\left(t,\mathcal{F}_{\Pi},\left\|\cdot\right\|_{\mathbf{P}_{n}}\right)\leq N\left(t,\Pi,\sqrt{H\left(\cdot,\cdot\right)}/\eta\right)=N_{H}^{\left(n\right)}\left(\eta^{2}t^{2},\Pi,\{x_{1},\ldots,x_{n}\}\right)\leq N_{H}^{\left(n\right)}\left(\eta^{2}t^{2},\Pi\right).

For the class Π,x\mathcal{F}_{\Pi,x}, we claim

N(t,Π,x,𝐏n)NH(n)(η2t2/2,Π)sup𝐏N(ηt/2,I,𝐏),N\left(t,\mathcal{F}_{\Pi,x},\left\|\cdot\right\|_{\mathbf{P}_{n}}\right)\leq N_{H}^{\left(n\right)}\left(\eta^{2}t^{2}/2,\Pi\right)\sup_{\mathbf{P}}N\left(\eta t/\sqrt{2},\mathcal{F}_{I},\left\|\cdot\right\|_{\mathbf{P}}\right),

where I={f(t)𝟏{tx}|x[0,M]}.\mathcal{F}_{I}=\{f(t)\triangleq\mathbf{1}\left\{t\leq x\right\}|x\in[0,M]\}. For ease of notation, let

NΠ(t)=NH(n)(η2t2/2,Π), and NI(t)=sup𝐏N(ηt/2,I,𝐏).N_{\Pi}(t)=N_{H}^{(n)}\left(\eta^{2}t^{2}/2,\Pi\right),\text{ and }N_{I}(t)=\sup_{\mathbf{P}}N\left(\eta t/\sqrt{2},\mathcal{F}_{I},\left\|\cdot\right\|_{\mathbf{P}}\right).

Suppose {π1,π2,,πNΠ(t)}\{\pi_{1},\pi_{2},\ldots,\pi_{N_{\Pi}(t)}\} is a cover for Π\Pi and {𝟏π{tx1},𝟏π{tx2},,𝟏π{txNI(t)}}\left\{\mathbf{1}^{\pi}\left\{t\leq x_{1}\right\},\mathbf{1}^{\pi}\left\{t\leq x_{2}\right\},\ldots,\mathbf{1}^{\pi}\left\{t\leq x_{N_{I}(t)}\right\}\right\} is a cover for I\mathcal{F}_{I} under the distance 𝐏^π,\left\|\cdot\right\|_{\hat{\mathbf{P}}_{\pi}}, defined by

𝐏^π1ni=1nΔ{Yi(π(Xi))}.\hat{\mathbf{P}}_{\pi}\triangleq\frac{1}{n}\sum_{i=1}^{n}\Delta\left\{Y_{i}(\pi(X_{i}))\right\}.

Then, we claim that Π,xt\mathcal{F}_{\Pi,x}^{t} is a tt-cover set for Π,x,\mathcal{F}_{\Pi,x}, where Π,xt\mathcal{F}_{\Pi,x}^{t} is defined as

Π,xt{𝟏{πi(X)=A}𝟏πi{Y(πi(X))xj}π0(A|X)|iNΠ(t),jNI(t)}.\mathcal{F}_{\Pi,x}^{t}\triangleq\left\{\left.\frac{\mathbf{1\{}\pi_{i}(X)=A\}\mathbf{1}^{\pi_{i}}\{Y(\pi_{i}(X))\leq x_{j}\}}{\pi_{0}(A|X)}\right|i\leq N_{\Pi}(t),j\leq N_{I}(t)\right\}.

For any f_{\pi,x}(X,Y,A)\in\mathcal{F}_{\Pi,x}, we can pick \tilde{\pi},\tilde{x} such that f_{\tilde{\pi},\tilde{x}}(X,Y,A)\in\mathcal{F}_{\Pi,x}^{t}, H\left(\pi,\tilde{\pi}\right)\leq\eta^{2}t^{2}/2, and \left\|\mathbf{1}\left\{Y\leq x\right\}-\mathbf{1}\left\{Y\leq\tilde{x}\right\}\right\|_{\hat{P}_{\tilde{\pi}}}\leq\eta t/\sqrt{2}. Then, we have

i=1n1n(𝟏{π(Xi)=Ai}𝟏{Y(π(Xi))x1}π0(Ai|Xi)𝟏{π~(Xi)=Ai}𝟏{Y(π~(Xi))x~}π0(Ai|Xi))2\displaystyle\sqrt{\sum_{i=1}^{n}\frac{1}{n}\left(\frac{\mathbf{1\{}\pi(X_{i})=A_{i}\mathbf{\}1\{}Y(\pi(X_{i}))\leq x_{1}\}}{\pi_{0}(A_{i}|X_{i})}-\frac{\mathbf{1\{}\tilde{\pi}(X_{i})=A_{i}\mathbf{\}1\{}Y(\tilde{\pi}(X_{i}))\leq\tilde{x}\}}{\pi_{0}(A_{i}|X_{i})}\right)^{2}}
\displaystyle\leq 1ηH(π,π~)+1ni=1n𝟏{π(Xi)=π~(Xi)}(𝟏{Y(π~(Xi))x}𝟏{Y(π~(Xi))x~})2\displaystyle\frac{1}{\eta}\sqrt{H\left(\pi,\tilde{\pi}\right)+\frac{1}{n}\sum_{i=1}^{n}\mathbf{1\{}\pi(X_{i})=\tilde{\pi}(X_{i})\mathbf{\}}\left(\mathbf{1\{}Y(\tilde{\pi}(X_{i}))\leq x\}-\mathbf{1\{}Y(\tilde{\pi}(X_{i}))\leq\tilde{x}\}\right)^{2}}
\displaystyle\leq 1ηη2t2/2+η2t2/2=t.\displaystyle\frac{1}{\eta}\sqrt{\eta^{2}t^{2}/2+\eta^{2}t^{2}/2}=t.

From Lemma 19.15 and Example 19.16 in [79], we know

sup𝐏N(t,I,𝐏)K(1t)2,\sup_{\mathbf{P}}N\left(t,\mathcal{F}_{I},\left\|\cdot\right\|_{\mathbf{P}}\right)\leq K\left(\frac{1}{t}\right)^{2},

where K is a universal constant. Finally, by Dudley's theorem [e.g. 82, (5.48)], we have

n(Π)24𝐄[02/ηlogN(t,Π,𝐏n)ndt]24κ(n)(Π)ηn,\mathcal{R}_{n}\left(\mathcal{F}_{\Pi}\right)\leq 24\mathbf{E}\left[\int_{0}^{2/\eta}\sqrt{\frac{\log N\left(t,\mathcal{F}_{\Pi},\left\|\cdot\right\|_{\mathbf{P}_{n}}\right)}{n}}{\rm d}t\right]\leq\frac{24\kappa^{(n)}\left(\Pi\right)}{\eta\sqrt{n}},

and

n(Π,x)\displaystyle\mathcal{R}_{n}\left(\mathcal{F}_{\Pi,x}\right) \displaystyle\leq 24n02/ηlog(NH(n)(η2t2/2,Π))+log(sup𝐏N(ηt/2,I,𝐏))dt\displaystyle\frac{24}{\sqrt{n}}\int_{0}^{2/\eta}\sqrt{\log\left(N_{H}^{\left(n\right)}\left(\eta^{2}t^{2}/2,\Pi\right)\right)+\log\left(\sup_{\mathbf{P}}N\left(\eta t/\sqrt{2},\mathcal{F}_{I},\left\|\cdot\right\|_{\mathbf{P}}\right)\right)}{\rm d}t
=\displaystyle= 24ηn02log(NH(n)(s2/2,Π))+log(sup𝐏N(s/2,I,𝐏))ds\displaystyle\frac{24}{\eta\sqrt{n}}\int_{0}^{2}\sqrt{\log\left(N_{H}^{\left(n\right)}\left(s^{2}/2,\Pi\right)\right)+\log\left(\sup_{\mathbf{P}}N\left(s/\sqrt{2},\mathcal{F}_{I},\left\|\cdot\right\|_{\mathbf{P}}\right)\right)}{\rm d}s
\displaystyle\leq 24ηn02(log(NH(n)(s2/2,Π))+logK+4log(1/s))ds\displaystyle\frac{24}{\eta\sqrt{n}}\int_{0}^{\sqrt{2}}\left(\sqrt{\log\left(N_{H}^{\left(n\right)}\left(s^{2}/2,\Pi\right)\right)}+\sqrt{\log K}+\sqrt{4\log\left(1/s\right)}\right){\rm d}s
\displaystyle\leq 242κ(n)(Π)ηn+C/n,\displaystyle\frac{24\sqrt{2}\kappa^{(n)}\left(\Pi\right)}{\eta\sqrt{n}}+C/\sqrt{n},

where C is a universal constant. Therefore, by picking \varepsilon^{\prime}=2\exp\left(-n\epsilon^{2}\eta^{2}/2\right), we have, with probability at least 1-\varepsilon^{\prime},

R_{\rm DRO}(\hat{\pi}_{\rm DRO})\leq\frac{4}{\underline{b}\eta\sqrt{n}}\left(24(\sqrt{2}+1)\kappa^{(n)}\left(\Pi\right)+\sqrt{2\log\left(\frac{2}{\varepsilon^{\prime}}\right)}+C\right). (A.15)

(2) Discrete case, i.e., Assumption 2.2 is satisfied: By Lemma 6 and Assumption 1, when \sup_{\pi\in\Pi}\mathrm{TV}\left(\mathbf{\hat{P}}_{n}^{\pi},\mathbf{P}_{0}\right)\leq\underline{b}/2, we have

supπΠ|supα0{αlog[𝐄𝐏^nπexp(Yi(π(Xi)))/α)]αδ}supα0{αlog[𝐄𝐏0exp(Y(π(X)))/α)]αδ}|\displaystyle\sup_{\pi\in\Pi}\left|\sup_{\alpha\geq 0}\left\{-\alpha\log\left[\mathbf{E}_{\mathbf{\hat{P}}_{n}^{\pi}}\exp\left(-Y_{i}(\pi(X_{i}))\right)/\alpha)\right]-\alpha\delta\right\}-\sup_{\alpha\geq 0}\left\{-\alpha\log\left[\mathbf{E}_{\mathbf{P}_{0}}\exp\left(-Y(\pi(X))\right)/\alpha)\right]-\alpha\delta\right\}\right|
\displaystyle\leq supπΠ2Mb¯TV(𝐏^nπ,𝐏0)\displaystyle\sup_{\pi\in\Pi}\frac{2M}{\underline{b}}\mathrm{TV}\left(\mathbf{\hat{P}}_{n}^{\pi},\mathbf{P}_{0}\right)
=\displaystyle= supπΠsupS𝒮𝔻2Mb¯|𝐏^nπ[S]𝐏0[S]|,\displaystyle\sup_{\pi\in\Pi}\sup_{S\in\mathcal{S}_{\mathbb{D}}}\frac{2M}{\underline{b}}\left|\mathbf{\hat{P}}_{n}^{\pi}\left[S\right]-\mathbf{P}_{0}\left[S\right]\right|,

where 𝒮𝔻={S:S𝔻}\mathcal{S}_{\mathbb{D}}=\left\{S:S\subset\mathbb{D}\right\} contains all subsets of 𝔻.\mathbb{D}. Then

supπΠsupS𝒮𝔻2Mb¯|𝐏^nπ[S]𝐏0[S]|\displaystyle\sup_{\pi\in\Pi}\sup_{S\in\mathcal{S}_{\mathbb{D}}}\frac{2M}{\underline{b}}\left|\mathbf{\hat{P}}_{n}^{\pi}\left[S\right]-\mathbf{P}_{0}\left[S\right]\right| (A.17)
=\displaystyle= supπΠ,S𝒮𝔻2Mb¯(1nSnπi=1n𝟏{π(Xi)=Ai}𝟏{YiS}π0(Ai|Xi)𝐄𝐏0[𝟏{YiS}])\displaystyle\sup_{\pi\in\Pi,S\in\mathcal{S}_{\mathbb{D}}}\frac{2M}{\underline{b}}\left(\frac{1}{nS_{n}^{\pi}}\sum_{i=1}^{n}\frac{\mathbf{1\{}\pi(X_{i})=A_{i}\mathbf{\}1\{}Y_{i}\in S\mathbf{\}}}{\pi_{0}\left(A_{i}|X_{i}\right)}-\mathbf{E}_{\mathbf{P}_{0}}\left[\mathbf{1\{}Y_{i}\in S\mathbf{\}}\right]\right)
\displaystyle\leq supπΠ,S𝒮𝔻2Mb¯(1ni=1n𝟏{π(Xi)=Ai}𝟏{YiS}π0(Ai|Xi)𝐄𝐏0[𝟏{YiS}])\displaystyle\sup_{\pi\in\Pi,S\in\mathcal{S}_{\mathbb{D}}}\frac{2M}{\underline{b}}\left(\frac{1}{n}\sum_{i=1}^{n}\frac{\mathbf{1\{}\pi(X_{i})=A_{i}\mathbf{\}1\{}Y_{i}\in S\mathbf{\}}}{\pi_{0}\left(A_{i}|X_{i}\right)}-\mathbf{E}_{\mathbf{P}_{0}}\left[\mathbf{1\{}Y_{i}\in S\mathbf{\}}\right]\right)
+supπΠ,S𝒮𝔻2Mb¯(Snπ1nSnπi=1n𝟏{π(Xi)=Ai}𝟏{YiS}π0(Ai|Xi))\displaystyle+\sup_{\pi\in\Pi,S\in\mathcal{S}_{\mathbb{D}}}\frac{2M}{\underline{b}}\left(\frac{S_{n}^{\pi}-1}{nS_{n}^{\pi}}\sum_{i=1}^{n}\frac{\mathbf{1\{}\pi(X_{i})=A_{i}\mathbf{\}1\{}Y_{i}\in S\mathbf{\}}}{\pi_{0}\left(A_{i}|X_{i}\right)}\right)

For the first term on the right-hand side of (A.17), by Wainwright [82, Theorem 4.10], we have, with probability at least 1-\exp\left(-n\epsilon^{2}\eta^{2}/2\right),

supπΠ,S𝒮𝔻2Mb¯(1ni=1n𝟏{π(Xi)=Ai}𝟏{YiS}π0(Ai|Xi)𝐄𝐏0[𝟏{YiS}])\displaystyle\sup_{\pi\in\Pi,S\in\mathcal{S}_{\mathbb{D}}}\frac{2M}{\underline{b}}\left(\frac{1}{n}\sum_{i=1}^{n}\frac{\mathbf{1\{}\pi(X_{i})=A_{i}\mathbf{\}1\{}Y_{i}\in S\mathbf{\}}}{\pi_{0}\left(A_{i}|X_{i}\right)}-\mathbf{E}_{\mathbf{P}_{0}}\left[\mathbf{1\{}Y_{i}\in S\mathbf{\}}\right]\right) (A.18)
\displaystyle\leq 2Mb¯(2n(Π,𝔻)+ϵ),\displaystyle\frac{2M}{\underline{b}}\left(2\mathcal{R}_{n}\left(\mathcal{F}_{\Pi,\mathbb{D}}\right)+\epsilon\right),

where the function class Π,𝔻\mathcal{F}_{\Pi,\mathbb{D}} is defined as

\mathcal{F}_{\Pi,\mathbb{D}}\triangleq\left\{f_{\pi,S}(X,Y,A)=\left.\frac{\mathbf{1}\{\pi(X)=A\}\mathbf{1}\{Y\in S\}}{\pi_{0}(A|X)}\right|\pi\in\Pi,S\subset\mathbb{D}\right\}.

For the second term on the right-hand side of (A.17), we have

2Mb¯(Snπ1nSnπi=1n𝟏{π(Xi)=Ai}𝟏{YiS}π0(Ai|Xi))2Mb¯|Snπ1|.\frac{2M}{\underline{b}}\left(\frac{S_{n}^{\pi}-1}{nS_{n}^{\pi}}\sum_{i=1}^{n}\frac{\mathbf{1\{}\pi(X_{i})=A_{i}\mathbf{\}1\{}Y_{i}\in S\mathbf{\}}}{\pi_{0}\left(A_{i}|X_{i}\right)}\right)\leq\frac{2M}{\underline{b}}\left|S_{n}^{\pi}-1\right|. (A.19)

By combining (A.18), (A.19), and (A.14), we have, with probability at least 1-2\exp\left(-n\epsilon^{2}\eta^{2}/2\right),

R_{\mathrm{DRO}}(\hat{\pi}_{\mathrm{DRO}})\leq\frac{2M}{\underline{b}}\left(2\mathcal{R}_{n}\left(\mathcal{F}_{\Pi,\mathbb{D}}\right)+\epsilon\right)+\frac{2M}{\underline{b}}\left(2\mathcal{R}_{n}\left(\mathcal{F}_{\Pi}\right)+\epsilon\right).

Now, we turn to the Rademacher complexity of the class \mathcal{F}_{\Pi,\mathbb{D}}. We claim

N(t,Π,𝔻,𝐏n)NH(n)(η2t2/2,Π)2|𝔻|,N\left(t,\mathcal{F}_{\Pi,\mathbb{D}},\left\|\cdot\right\|_{\mathbf{P}_{n}}\right)\leq N_{H}^{\left(n\right)}\left(\eta^{2}t^{2}/2,\Pi\right)2^{|\mathbb{D}|},

where |\mathbb{D}| denotes the cardinality of the set \mathbb{D}. By an argument similar to the continuous case, \mathcal{F}_{\Pi,\mathbb{D}}^{t} is a t-cover set for \mathcal{F}_{\Pi,\mathbb{D}}, where \mathcal{F}_{\Pi,\mathbb{D}}^{t} is defined as

\mathcal{F}_{\Pi,\mathbb{D}}^{t}\triangleq\left\{\left.\frac{\mathbf{1}\{\pi_{i}(X)=A\}\mathbf{1}\{Y(\pi_{i}(X))\in S\}}{\pi_{0}(A|X)}\right|i\leq N_{\Pi}(t),S\subset\mathbb{D}\right\}.

Finally, by Dudley’s theorem [e.g. 82, (5.48)]

n(Π,𝔻)\displaystyle\mathcal{R}_{n}\left(\mathcal{F}_{\Pi,\mathbb{D}}\right) \displaystyle\leq 24n02/ηlog(NH(n)(η2t2/2,Π))+|𝔻|log(2)dt\displaystyle\frac{24}{\sqrt{n}}\int_{0}^{2/\eta}\sqrt{\log\left(N_{H}^{\left(n\right)}\left(\eta^{2}t^{2}/2,\Pi\right)\right)+|\mathbb{D}|\log\left(2\right)}\mathrm{d}t
=\displaystyle= 24ηn02log(NH(n)(s2/2,Π))+|𝔻|log(2)ds\displaystyle\frac{24}{\eta\sqrt{n}}\int_{0}^{2}\sqrt{\log\left(N_{H}^{\left(n\right)}\left(s^{2}/2,\Pi\right)\right)+|\mathbb{D}|\log\left(2\right)}\mathrm{d}s
\displaystyle\leq 24ηn(02(log(NH(n)(s2/2,Π)))ds+2|𝔻|log(2))\displaystyle\frac{24}{\eta\sqrt{n}}\left(\int_{0}^{\sqrt{2}}\left(\sqrt{\log\left(N_{H}^{\left(n\right)}\left(s^{2}/2,\Pi\right)\right)}\right)\mathrm{d}s+2\sqrt{|\mathbb{D}|\log\left(2\right)}\right)
\displaystyle\leq 242κ(n)(Π)+48|𝔻|log(2)ηn\displaystyle\frac{24\sqrt{2}\kappa^{(n)}\left(\Pi\right)+48\sqrt{|\mathbb{D}|\log\left(2\right)}}{\eta\sqrt{n}}

Therefore, by picking \varepsilon^{\prime}=2\exp\left(-n\epsilon^{2}\eta^{2}/2\right), we have, with probability at least 1-\varepsilon^{\prime},

R_{\mathrm{DRO}}(\hat{\pi}_{\mathrm{DRO}})\leq\frac{4M}{\underline{b}\eta\sqrt{n}}\left(24(\sqrt{2}+1)\kappa^{(n)}\left(\Pi\right)+48\sqrt{|\mathbb{D}|\log\left(2\right)}+\sqrt{2\log\left(\frac{2}{\varepsilon^{\prime}}\right)}\right).

Appendix A.5 Proof of the statistical lower bound in Section 4.2

We first define some useful notions. For p,q\in[0,1], let

D_{\mathrm{KL}}(p||q)=p\log(p/q)+(1-p)\log((1-p)/(1-q)).

Let g(p)=\inf_{q:\,D_{\mathrm{KL}}(p||q)\leq\delta}q.

Lemma A17.

For \delta\leq 0.226, g(p) is differentiable and g^{\prime}(p)\geq 1/2 for p\in[0.4,0.6].

Proof.

Since \delta\leq 0.226 and p\geq 0.4, we have p\geq g(p)\geq 0.1. By Yang et al. [84, Lemma B.12], we have

g^{\prime}(p)=\frac{\log(p/g(p))-\log((1-p)/(1-g(p)))}{p/g(p)-(1-p)/(1-g(p))},

and g^{\prime}(p) is increasing. Therefore, g^{\prime}(p)\geq g^{\prime}(0.4)\geq 0.5. ∎
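Since g(p) has no closed form, the bounds in Lemma A17 are also easy to check numerically. The following Python sketch (our own illustrative code, not part of the original experiments) evaluates g(p) by bisection and approximates the derivative bound by finite differences.

```python
import numpy as np

def kl_bernoulli(p, q):
    """D_KL(p||q) between Bernoulli(p) and Bernoulli(q)."""
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def g(p, delta, tol=1e-10):
    """g(p) = inf{q : D_KL(p||q) <= delta}, found by bisection on q in (0, p]."""
    lo, hi = 1e-12, p  # D_KL(p||q) is decreasing in q on (0, p]
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if kl_bernoulli(p, mid) <= delta:
            hi = mid   # feasible; try a smaller q
        else:
            lo = mid
    return hi

if __name__ == "__main__":
    delta = 0.226
    for p in np.linspace(0.4, 0.6, 5):
        eps = 1e-6
        deriv = (g(p + eps, delta) - g(p - eps, delta)) / (2 * eps)
        print(f"p={p:.2f}  g(p)={g(p, delta):.4f}  g'(p)~{deriv:.3f}")  # derivative stays above 1/2
```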

Proof of Theorem 3.

Since we consider the two-action scenario, we denote the actions by 0 and 1. We first follow the lines of the proof of Kitagawa and Tetenov [47, Theorem 2.2]. By Lemma 3, the VC dimension of the policy class \Pi is at least v\triangleq\lceil 4/25\kappa^{(n)}(\Pi)^{2}\rceil. Let x_{1},x_{2},\ldots,x_{v} be v points that are shattered by the policy class \Pi, and let \mathbf{b}=\left\{b_{1},b_{2},\ldots,b_{v}\right\}\in\left\{0,1\right\}^{v}. By definition, for each \mathbf{b}\in\left\{0,1\right\}^{v}, there exists a \pi\in\Pi such that \pi(x_{i})=b_{i} for i=1,2,\ldots,v. We use \pi=\left\{\pi_{1},\pi_{2},\ldots,\pi_{v}\right\}\in\left\{0,1\right\}^{v} to denote the policy \pi restricted to \left\{x_{1},x_{2},\ldots,x_{v}\right\}. Now, we consider the following two distributions

Y0={M0with prob. 1pγwith prob. p+γ, Y1={M0with prob. 1p+γwith prob. pγ,Y_{0}=\left\{\begin{array}[]{c}M\\ 0\end{array}\right.\begin{array}[]{l}\text{with prob. }1-p-\gamma\\ \text{with prob. }p+\gamma\end{array},\text{ }Y_{1}=\left\{\begin{array}[]{c}M\\ 0\end{array}\right.\begin{array}[]{l}\text{with prob. }1-p+\gamma\\ \text{with prob. }p-\gamma\end{array},

for p,\gamma>0, which will be determined later. Then, we construct \mathbf{P}_{\mathbf{b}}*\pi_{\mathbf{b},0}\in\mathcal{P}(M) for every \mathbf{b}\in\left\{0,1\right\}^{v}. The marginal distribution of X under \mathbf{P}_{\mathbf{b}} is supported uniformly on \left\{x_{1},x_{2},\ldots,x_{v}\right\}, with equal mass 1/v. Further,

π𝐛,0(A=bi|X=xi)=π𝐛,0(A=1bi|X=xi)=12,\pi_{\mathbf{b},0}(A=b_{i}|X=x_{i})=\pi_{\mathbf{b},0}(A=1-b_{i}|X=x_{i})=\frac{1}{2},

and, conditional on x_{i}, Y(b_{i}) follows the distribution of Y_{1} and Y(1-b_{i}) follows the distribution of Y_{0}. We also define d_{H}(\mathbf{b,b}^{\prime})=\sum_{i=1}^{v}\left|b_{i}-b_{i}^{\prime}\right|.

To emphasize the dependence on the underlying distribution \mathbf{P}_{\mathbf{b}}, we rewrite Q_{\mathrm{DRO}}(\pi)=Q_{\mathrm{DRO}}(\pi,\mathbf{P}_{\mathbf{b}}). Now, by Lemma 1, we have

supπΠQDRO(π,𝐏𝐛)\displaystyle\sup_{\pi\in\Pi}Q_{\mathrm{DRO}}(\pi,\mathbf{P}_{\mathbf{b}}) =\displaystyle= supπΠsupα0{αlog(𝐄𝐏𝐛[exp(Y(π(X))/α)])αδ}\displaystyle\sup_{\pi\in\Pi}\sup_{\alpha\geq 0}\left\{-\alpha\log\left(\mathbf{E}_{\mathbf{P}_{\mathbf{b}}}\left[\exp(-Y(\pi(X))/\alpha)\right]\right)-\alpha\delta\right\}
=\displaystyle= supα0supπΠ{αlog(𝐄𝐏𝐛[exp(Y(π(X))/α)])αδ}.\displaystyle\sup_{\alpha\geq 0}\sup_{\pi\in\Pi}\left\{-\alpha\log\left(\mathbf{E}_{\mathbf{P}_{\mathbf{b}}}\left[\exp(-Y(\pi(X))/\alpha)\right]\right)-\alpha\delta\right\}.

Since \mathbf{E}\left[\exp(-Y_{1}/\alpha)\right]\leq\mathbf{E}\left[\exp(-Y_{0}/\alpha)\right] for every \alpha\geq 0, the optimal policy for the distribution \mathbf{P}_{\mathbf{b}} is \pi_{\mathbf{b}}^{*}=\mathbf{b}. Further,

supπΠQDRO(π,𝐏𝐛)\displaystyle\sup_{\pi\in\Pi}Q_{\mathrm{DRO}}(\pi,\mathbf{P}_{\mathbf{b}}) =\displaystyle= QDRO(π𝐛,𝐏𝐛)=supα0{αlog(𝐄[exp(Y1/α)])αδ}\displaystyle Q_{\mathrm{DRO}}(\pi_{\mathbf{b}}^{*},\mathbf{P}_{\mathbf{b}})=\sup_{\alpha\geq 0}\left\{-\alpha\log\left(\mathbf{E}\left[\exp(-Y_{1}/\alpha)\right]\right)-\alpha\delta\right\}
=\displaystyle= infD(𝐏||Y1)δ𝐄Y𝐏[Y]\displaystyle\inf_{D(\mathbf{P}||Y_{1})\leq\delta}\mathbf{E}_{Y\sim\mathbf{P}}\left[Y\right]
=\displaystyle= Mg(1p+γ).\displaystyle Mg(1-p+\gamma).

Then, for any \pi\in\Pi and any \mathbf{P}_{\mathbf{b}}, we have

QDRO(π,𝐏𝐛)\displaystyle Q_{\mathrm{DRO}}(\pi,\mathbf{P}_{\mathbf{b}}) =\displaystyle= supα0{αlog(1vi=1vexp(Y(πi))/α))αδ}\displaystyle\sup_{\alpha\geq 0}\left\{-\alpha\log\left(\frac{1}{v}\sum_{i=1}^{v}\exp(-Y(\pi_{i}))/\alpha)\right)-\alpha\delta\right\}
=\displaystyle= supα0{αlog(1v(i=1,πi=biv𝐄[exp(Y1/α)]+i=1,πibiv𝐄[exp(Y0/α)]))αδ}\displaystyle\sup_{\alpha\geq 0}\left\{-\alpha\log\left(\frac{1}{v}\left(\sum_{i=1,\pi_{i}=b_{i}}^{v}\mathbf{E}\left[\exp(-Y_{1}/\alpha)\right]+\sum_{i=1,\pi_{i}\neq b_{i}}^{v}\mathbf{E}\left[\exp(-Y_{0}/\alpha)\right]\right)\right)-\alpha\delta\right\}
=\displaystyle= supα0{αlog(vdH(𝐛,π)v𝐄[exp(Y1/α)]+dH(𝐛,π)v𝐄[exp(Y0/α)])αδ}.\displaystyle\sup_{\alpha\geq 0}\left\{-\alpha\log\left(\frac{v-d_{H}(\mathbf{b},\pi)}{v}\mathbf{E}\left[\exp(-Y_{1}/\alpha)\right]+\frac{d_{H}(\mathbf{b},\pi)}{v}\mathbf{E}\left[\exp(-Y_{0}/\alpha)\right]\right)-\alpha\delta\right\}.

To simplify the notation, let m=d_{H}(\mathbf{b},\pi). Then, we have

QDRO(π,𝐏𝐛)\displaystyle Q_{\mathrm{DRO}}(\pi,\mathbf{P}_{\mathbf{b}})
=\displaystyle= supα0{αlog(vmv𝐄[exp(Y1/α)]+mv𝐄[exp(Y0/α)])αδ}\displaystyle\sup_{\alpha\geq 0}\left\{-\alpha\log\left(\frac{v-m}{v}\mathbf{E}\left[\exp(-Y_{1}/\alpha)\right]+\frac{m}{v}\mathbf{E}\left[\exp(-Y_{0}/\alpha)\right]\right)-\alpha\delta\right\}
=\displaystyle= supα0{αlog((vmv(1p+γ)+mv(1pγ))𝐄[exp(M/α)]+(mv(p+γ)+vmv(pγ)))αδ}\displaystyle\sup_{\alpha\geq 0}\left\{-\alpha\log\left(\left(\frac{v-m}{v}\left(1-p+\gamma\right)+\frac{m}{v}\left(1-p-\gamma\right)\right)\mathbf{E}\left[\exp(-M/\alpha)\right]+\left(\frac{m}{v}\left(p+\gamma\right)+\frac{v-m}{v}\left(p-\gamma\right)\right)\right)-\alpha\delta\right\}
=\displaystyle= supα0{αlog((1p+v2mvγ)𝐄[exp(M/α)]+(pv2mvγ))αδ}\displaystyle\sup_{\alpha\geq 0}\left\{-\alpha\log\left(\left(1-p+\frac{v-2m}{v}\gamma\right)\mathbf{E}\left[\exp(-M/\alpha)\right]+\left(p-\frac{v-2m}{v}\gamma\right)\right)-\alpha\delta\right\}

We construct the distribution of \tilde{Y}_{m}:

Y~m={M0with prob. 1p+v2mvγwith prob. pv2mvγ.\tilde{Y}_{m}=\left\{\begin{array}[]{c}M\\ 0\end{array}\right.\begin{array}[]{l}\text{with prob. }1-p+\frac{v-2m}{v}\gamma\\ \text{with prob. }p-\frac{v-2m}{v}\gamma.\end{array}

Then, Q_{\mathrm{DRO}}(\pi,\mathbf{P}_{\mathbf{b}}) becomes

Q_{\mathrm{DRO}}(\pi,\mathbf{P}_{\mathbf{b}})=\inf_{D(\mathbf{P}||\tilde{Y}_{m})\leq\delta}\mathbf{E}_{Y\sim\mathbf{P}}\left[Y\right]=Mg\left(1-p+\gamma-\frac{2m}{v}\gamma\right).

By Lemma A17, if p=1/2 and \gamma\leq 0.1, we have

RDRO(π,𝐏𝐛)\displaystyle R_{\mathrm{DRO}}(\pi,\mathbf{P}_{\mathbf{b}}) =\displaystyle= Mg(1p+γ)Mg(1p+γ2mvγ)\displaystyle Mg(1-p+\gamma)-Mg\left(1-p+\gamma-\frac{2m}{v}\gamma\right)
\displaystyle\geq 2mvMγminx[1pγ,1p+γ]g(x)dH(𝐛,π)Mγv.\displaystyle\frac{2m}{v}M\gamma\min_{x\in\left[1-p-\gamma,1-p+\gamma\right]}g^{\prime}(x)\geq\frac{d_{H}(\mathbf{b},\pi)M\gamma}{v}.

Then, by Assouad's lemma [78, Theorem 2.12 (ii)], we have

max𝐏0π0𝒫(M)𝐄(𝐏π00)n[RDRO(π,𝐏0)]\displaystyle\max_{\mathbf{P}_{0}*\pi_{0}\in\mathcal{P}(M)}\mathbf{E}_{\left(\mathbf{P}^{\pi_{0}}_{0}\right)^{n}}\left[R_{\mathrm{DRO}}(\pi,\mathbf{P}_{0})\right]
\displaystyle\geq Mγvmax𝐏0π0𝒫(M)𝐄(𝐏π00)n[dH(𝐛,π)]\displaystyle\frac{M\gamma}{v}\max_{\mathbf{P}_{0}*\pi_{0}\in\mathcal{P}(M)}\mathbf{E}_{\left(\mathbf{P}^{\pi_{0}}_{0}\right)^{n}}\left[d_{H}(\mathbf{b},\pi)\right]
\displaystyle\geq Mγ2(1maxdH(𝐛,𝐛)=1TV((𝐏𝐛π𝐛,0)n,(𝐏𝐛π𝐛,0)n)),\displaystyle\frac{M\gamma}{2}\left(1-\max_{d_{H}(\mathbf{b,b}^{\prime})=1}\mathrm{TV}\left(\left(\mathbf{P}_{\mathbf{b}}^{\pi_{\mathbf{b},0}}\right)^{n},\left(\mathbf{P}_{\mathbf{b}^{\prime}}^{\pi_{\mathbf{b}^{\prime},0}}\right)^{n}\right)\right),

where \mathrm{TV}(\cdot,\cdot) denotes the total variation distance between two measures. By Pinsker's inequality ([78, Lemma 2.5]), we have

TV((𝐏𝐛π𝐛,0)n,(𝐏𝐛π𝐛,0)n)D((𝐏𝐛π𝐛,0)n||(𝐏𝐛π𝐛,0)n)/2=nD(𝐏𝐛π𝐛,0||𝐏𝐛π𝐛,0)/2.\mathrm{TV}\left(\left(\mathbf{P}_{\mathbf{b}}^{\pi_{\mathbf{b},0}}\right)^{n},\left(\mathbf{P}_{\mathbf{b}^{\prime}}^{\pi_{\mathbf{b}^{\prime},0}}\right)^{n}\right)\leq\sqrt{D\left(\left(\mathbf{P}_{\mathbf{b}}^{\pi_{\mathbf{b},0}}\right)^{n}||\left(\mathbf{P}_{\mathbf{b}^{\prime}}^{\pi_{\mathbf{b}^{\prime},0}}\right)^{n}\right)/2}=\sqrt{nD\left(\mathbf{P}_{\mathbf{b}}^{\pi_{\mathbf{b},0}}||\mathbf{P}_{\mathbf{b}^{\prime}}^{\pi_{\mathbf{b}^{\prime},0}}\right)/2}.

For \mathbf{b,b}^{\prime} such that d_{H}(\mathbf{b,b}^{\prime})=1, let b_{l}\neq b_{l}^{\prime}, and without loss of generality, assume b_{l}=1. Then, we have

D(𝐏𝐛π𝐛,0||𝐏𝐛π𝐛,0)\displaystyle D\left(\mathbf{P}_{\mathbf{b}}^{\pi_{\mathbf{b},0}}||\mathbf{P}_{\mathbf{b}^{\prime}}^{\pi_{\mathbf{b}^{\prime},0}}\right) =\displaystyle= i=1vj=01k=01𝐏𝐛π𝐛,0(X=xi,A=j,Y=Mk)log(𝐏𝐛π𝐛,0(X=xi,A=j,Y=Mk)𝐏𝐛π𝐛,0(X=xi,A=j,Y=Mk))\displaystyle\sum_{i=1}^{v}\sum_{j=0}^{1}\sum_{k=0}^{1}\mathbf{P}_{\mathbf{b}}^{\pi_{\mathbf{b},0}}\left(X=x_{i},A=j,Y=Mk\right)\log\left(\frac{\mathbf{P}_{\mathbf{b}}^{\pi_{\mathbf{b},0}}\left(X=x_{i},A=j,Y=Mk\right)}{\mathbf{P}_{\mathbf{b}^{\prime}}^{\pi_{\mathbf{b}^{\prime},0}}\left(X=x_{i},A=j,Y=Mk\right)}\right)
=\displaystyle= j=01k=01𝐏𝐛π𝐛,0(X=xl,A=j,Y=Mk)log(𝐏𝐛π𝐛,0(X=xl,A=j,Y=Mk)𝐏𝐛π𝐛,0(X=xl,A=j,Y=Mk))\displaystyle\sum_{j=0}^{1}\sum_{k=0}^{1}\mathbf{P}_{\mathbf{b}}^{\pi_{\mathbf{b},0}}\left(X=x_{l},A=j,Y=Mk\right)\log\left(\frac{\mathbf{P}_{\mathbf{b}}^{\pi_{\mathbf{b},0}}\left(X=x_{l},A=j,Y=Mk\right)}{\mathbf{P}_{\mathbf{b}^{\prime}}^{\pi_{\mathbf{b}^{\prime},0}}\left(X=x_{l},A=j,Y=Mk\right)}\right)
=\displaystyle= 12v(p+γ)log(p+γpγ)+12v(1pγ)log(1pγ1p+γ)\displaystyle\frac{1}{2v}\left(p+\gamma\right)\log\left(\frac{p+\gamma}{p-\gamma}\right)+\frac{1}{2v}\left(1-p-\gamma\right)\log\left(\frac{1-p-\gamma}{1-p+\gamma}\right)
+12v(pγ)log(pγp+γ)+12v(1p+γ)log(1p+γ1pγ)\displaystyle+\frac{1}{2v}\left(p-\gamma\right)\log\left(\frac{p-\gamma}{p+\gamma}\right)+\frac{1}{2v}\left(1-p+\gamma\right)\log\left(\frac{1-p+\gamma}{1-p-\gamma}\right)
=\displaystyle= 1vDKL(p+γ||pγ).\displaystyle\frac{1}{v}D_{\mathrm{KL}}(p+\gamma||p-\gamma).

For p=1/2 and \gamma\leq 0.1, by Tsybakov [78, Lemma 2.7], we have

DKL(p+γ||pγ)(2γ)2/(p2γ2).D_{\mathrm{KL}}(p+\gamma||p-\gamma)\leq(2\gamma)^{2}/\left(p^{2}-\gamma^{2}\right).

By picking \gamma=\frac{1}{4}\sqrt{\frac{v}{n}}\leq 0.1, which requires n\geq\kappa^{(n)}(\Pi)^{2}, we have

\max_{\mathbf{P}_{0}*\pi_{0}\in\mathcal{P}(M)}\mathbf{E}_{\left(\mathbf{P}^{\pi_{0}}_{0}\right)^{n}}\left[R_{\mathrm{DRO}}(\pi,\mathbf{P}_{0})\right]\geq\frac{M}{40}\sqrt{\frac{v}{n}}.

Recall that v=\lceil 4/25\kappa^{(n)}(\Pi)^{2}\rceil. Therefore, we have

\max_{\mathbf{P}_{0}*\pi_{0}\in\mathcal{P}(M)}\mathbf{E}_{\left(\mathbf{P}^{\pi_{0}}_{0}\right)^{n}}\left[R_{\mathrm{DRO}}(\pi,\mathbf{P}_{0})\right]\geq\frac{M\kappa^{(n)}(\Pi)}{100\sqrt{n}}.

Appendix A.6 Proof of the Bayes DRO policy result in Section 5.1

Proof of Proposition 2.

By Lemma 1, we have

QDRO(πDRO)\displaystyle Q_{\rm DRO}(\pi^{*}_{\rm DRO}) =\displaystyle= supπΠ¯supα0{αlog𝐄𝐏0[exp(Y(π(X))/α)]αδ}\displaystyle\sup_{\pi\in\overline{\Pi}}\sup_{\alpha\geq 0}\left\{-\alpha\log\mathbf{E}_{\mathbf{P}_{0}}\left[\exp(-Y(\pi(X))/\alpha)\right]-\alpha\delta\right\} (A.20)
=\displaystyle= supα0supπΠ¯{αlog𝐄𝐏0[exp(Y(π(X))/α)]αδ}.\displaystyle\sup_{\alpha\geq 0}\sup_{\pi\in\overline{\Pi}}\left\{-\alpha\log\mathbf{E}_{\mathbf{P}_{0}}\left[\exp(-Y(\pi(X))/\alpha)\right]-\alpha\delta\right\}.

The inner maximization (A.20) can be further simplified as

supπΠ¯{αlog𝐄𝐏0[exp(Y(π(X))/α)]αδ}\displaystyle\sup_{\pi\in\overline{\Pi}}\left\{-\alpha\log\mathbf{E}_{\mathbf{P}_{0}}\left[\exp(-Y(\pi(X))/\alpha)\right]-\alpha\delta\right\}
=\displaystyle= supπΠ¯{αlog𝐄𝐏0[𝐄[exp(Y(π(X))/α)|X]]αδ}\displaystyle\sup_{\pi\in\overline{\Pi}}\left\{-\alpha\log\mathbf{E}_{\mathbf{P}_{0}}[\mathbf{E}\left[\exp(-Y(\pi(X))/\alpha)|X]\right]-\alpha\delta\right\}
=\displaystyle= αlog𝐄𝐏0[infπΠ¯𝐄[exp(Y(π(X))/α)|X]]αδ.\displaystyle-\alpha\log\mathbf{E}_{\mathbf{P}_{0}}\left[\inf_{\pi\in\overline{\Pi}}\mathbf{E}\left[\exp(-Y(\pi(X))/\alpha)|X\right]\right]-\alpha\delta.

Since \overline{\Pi} contains all measurable policies, we have

infπΠ¯𝐄[exp(Y(π(X))/α)|X]=mina𝒜{𝐄[exp(Y(a)/α)|X]},\displaystyle\inf_{\pi\in\overline{\Pi}}\mathbf{E}\left[\exp(-Y(\pi(X))/\alpha)|X\right]=\min_{a\in\mathcal{A}}\left\{\mathbf{E}\left[\exp(-Y(a)/\alpha)|X\right]\right\},

and the optimal dual variable is

\alpha^{*}(\pi^{*}_{\rm DRO})=\mathop{\rm arg\,max}_{\alpha\geq 0}\left\{-\alpha\log\mathbf{E}_{\mathbf{P}_{0}}\left[\min_{a\in\mathcal{A}}\left\{\mathbf{E}_{\mathbf{P}_{0}}\left[\left.\exp\left(-Y(a)/\alpha\right)\right|X\right]\right\}\right]-\alpha\delta\right\}.

Finally, for any a\in\mathcal{A}, the set

\{x\in\mathcal{X}:\pi^{*}_{\rm DRO}(x)=a\}=\left\{x\in\mathcal{X}:\mathbf{E}_{\mathbf{P}_{0}}\left[\left.\exp\left(-Y(a)/\alpha\right)\right|X=x\right]\leq\mathbf{E}_{\mathbf{P}_{0}}\left[\left.\exp\left(-Y(a^{\prime})/\alpha\right)\right|X=x\right]\text{ for all }a^{\prime}\in\mathcal{A}\setminus\{a\}\right\}

is measurable. ∎
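To illustrate the structure of the Bayes DRO policy in Proposition 2, the following toy sketch (the two-context, two-action Bernoulli reward model and the grid search over the dual variable are our own hypothetical choices, not part of the paper's experiments) maximizes the dual objective over α and then selects, for each context, the action with the smallest conditional exponential moment.

```python
import numpy as np

# Hypothetical toy setup: two contexts, two actions, rewards in {0, 1}.
# p[x, a] = P(Y(a) = 1 | X = x); the marginal of X is uniform.
p = np.array([[0.6, 0.7],
              [0.8, 0.5]])
delta = 0.1
alphas = np.linspace(1e-3, 50.0, 5000)   # grid for the dual variable alpha

def cond_mgf(alpha):
    # E[exp(-Y(a)/alpha) | X = x] for binary rewards
    return p * np.exp(-1.0 / alpha) + (1 - p)

def dual_objective(alpha):
    inner = cond_mgf(alpha).min(axis=1)                 # min over actions, per context
    return -alpha * np.log(inner.mean()) - alpha * delta  # outer expectation over uniform X

best_alpha = max(alphas, key=dual_objective)
pi_dro = cond_mgf(best_alpha).argmin(axis=1)            # Bayes DRO action for each context
print("alpha* ~", best_alpha, " Q_DRO ~", dual_objective(best_alpha), " policy:", pi_dro)
```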

Appendix A.7 Proof of the extension results in Section 7

Proof of Lemma 9.

If k=1, we have c_{k}(\delta)=1, and thus Lemma 9 recovers Lemma 5. For k\in(1,+\infty), notice that

|supα𝐑{ck(δ)𝐄𝐏1[(Y+α)+k]1k+α}supα𝐑{ck(δ)𝐄𝐏2[(Y+α)+k]1k+α}|\displaystyle\left|\sup_{\alpha\in\mathbf{R}}\left\{-c_{k}\left(\delta\right)\mathbf{E}_{\mathbf{P}_{1}}\left[\left(-Y+\alpha\right)_{+}^{k_{\ast}}\right]^{\frac{1}{k_{\ast}}}+\alpha\right\}-\sup_{\alpha\in\mathbf{R}}\left\{-c_{k}\left(\delta\right)\mathbf{E}_{\mathbf{P}_{2}}\left[\left(-Y+\alpha\right)_{+}^{k_{\ast}}\right]^{\frac{1}{k_{\ast}}}+\alpha\right\}\right|
\displaystyle\leq ck(δ)supα𝐑|𝐄𝐏1[(Y+α)+k]1k𝐄𝐏2[(Y+α)+k]1k|\displaystyle c_{k}\left(\delta\right)\sup_{\alpha\in\mathbf{R}}\left|\mathbf{E}_{\mathbf{P}_{1}}\left[\left(-Y+\alpha\right)_{+}^{k_{\ast}}\right]^{\frac{1}{k_{\ast}}}-\mathbf{E}_{\mathbf{P}_{2}}\left[\left(-Y+\alpha\right)_{+}^{k_{\ast}}\right]^{\frac{1}{k_{\ast}}}\right|
=\displaystyle= ck(δ)supα𝐑|𝐄𝐏U[(q𝐏1(U)+α)+k]1k𝐄𝐏U[(q𝐏2(U)+α)+k]1k|,\displaystyle c_{k}\left(\delta\right)\sup_{\alpha\in\mathbf{R}}\left|\mathbf{E}_{{}_{\mathbf{P}_{U}}}\left[\left(-q_{\mathbf{P}_{1}}\left(U\right)+\alpha\right)_{+}^{k_{\ast}}\right]^{\frac{1}{k_{\ast}}}-\mathbf{E}_{\mathbf{P}_{U}}\left[\left(-q_{\mathbf{P}_{2}}\left(U\right)+\alpha\right)_{+}^{k_{\ast}}\right]^{\frac{1}{k_{\ast}}}\right|,

where \mathbf{P}_{U} denotes the uniform distribution U([0,1]) and the last equality is based on the fact that q_{\mathbf{P}}\left(U\right)\overset{d}{=}\mathbf{P}.

By the triangle inequality in the L^{k_{*}}\left(U\right) space, we have

|𝐄𝐏U[(q𝐏1(U)+α)+k]1k𝐄𝐏U[(q𝐏2(U)+α)+k]1k|\displaystyle\left|\mathbf{E}_{{}_{\mathbf{P}_{U}}}\left[\left(-q_{\mathbf{P}_{1}}\left(U\right)+\alpha\right)_{+}^{k_{\ast}}\right]^{\frac{1}{k_{\ast}}}-\mathbf{E}_{\mathbf{P}_{U}}\left[\left(-q_{\mathbf{P}_{2}}\left(U\right)+\alpha\right)_{+}^{k_{\ast}}\right]^{\frac{1}{k_{\ast}}}\right|
\displaystyle\leq 𝐄𝐏U[|q𝐏1(U)q𝐏2(U)|k]1k\displaystyle\mathbf{E}_{{}_{\mathbf{P}_{U}}}\left[\left|q_{\mathbf{P}_{1}}\left(U\right)-q_{\mathbf{P}_{2}}\left(U\right)\right|^{{}^{k_{\ast}}}\right]^{\frac{1}{k_{\ast}}}
\displaystyle\leq supt[0,1]|q𝐏1(t)q𝐏2(t)|.\displaystyle\sup_{t\in[0,1]}\left|q_{\mathbf{P}_{1}}\left(t\right)-q_{\mathbf{P}_{2}}\left(t\right)\right|.

Proof of Lemma 10.

We begin with

|supα𝐑{ck(δ)(d𝔻[(d+α)+k𝐏1(d)])1k+α}supα𝐑{ck(δ)(d𝔻[(d+α)+k𝐏2(d)])1k+α}|\displaystyle\left|\sup_{\alpha\in\mathbf{R}}\left\{-c_{k}\left(\delta\right)\left(\sum_{d\in\mathbb{D}}\left[\left(-d+\alpha\right)_{+}^{k_{\ast}}\mathbf{P}_{1}\left(d\right)\right]\right)^{\frac{1}{k_{\ast}}}+\alpha\right\}-\sup_{\alpha\in\mathbf{R}}\left\{-c_{k}\left(\delta\right)\left(\sum_{d\in\mathbb{D}}\left[\left(-d+\alpha\right)_{+}^{k_{\ast}}\mathbf{P}_{2}\left(d\right)\right]\right)^{\frac{1}{k_{\ast}}}+\alpha\right\}\right| (A.21)
\displaystyle\leq ck(δ)supα𝐑|(d𝔻[(d+α)+k𝐏1(d)])1k(d𝔻[(d+α)+k𝐏2(d)])1k|\displaystyle c_{k}\left(\delta\right)\sup_{\alpha\in\mathbf{R}}\left|\left(\sum_{d\in\mathbb{D}}\left[\left(-d+\alpha\right)_{+}^{k_{\ast}}\mathbf{P}_{1}\left(d\right)\right]\right)^{\frac{1}{k_{\ast}}}-\left(\sum_{d\in\mathbb{D}}\left[\left(-d+\alpha\right)_{+}^{k_{\ast}}\mathbf{P}_{2}\left(d\right)\right]\right)^{\frac{1}{k_{\ast}}}\right|
=\displaystyle= ck(δ)max{supαM|(d𝔻[(d+α)+k𝐏1(d)])1k(d𝔻[(d+α)+k𝐏2(d)])1k|,\displaystyle c_{k}\left(\delta\right)\max\left\{\sup_{\alpha\leq M}\left|\left(\sum_{d\in\mathbb{D}}\left[\left(-d+\alpha\right)_{+}^{k_{\ast}}\mathbf{P}_{1}\left(d\right)\right]\right)^{\frac{1}{k_{\ast}}}-\left(\sum_{d\in\mathbb{D}}\left[\left(-d+\alpha\right)_{+}^{k_{\ast}}\mathbf{P}_{2}\left(d\right)\right]\right)^{\frac{1}{k_{\ast}}}\right|\right., (A.23)
supα>M|(d𝔻[(d+α)+k𝐏1(d)])1k(d𝔻[(d+α)+k𝐏2(d)])1k|}.\displaystyle\left.\sup_{\alpha>M}\left|\left(\sum_{d\in\mathbb{D}}\left[\left(-d+\alpha\right)_{+}^{k_{\ast}}\mathbf{P}_{1}\left(d\right)\right]\right)^{\frac{1}{k_{\ast}}}-\left(\sum_{d\in\mathbb{D}}\left[\left(-d+\alpha\right)_{+}^{k_{\ast}}\mathbf{P}_{2}\left(d\right)\right]\right)^{\frac{1}{k_{\ast}}}\right|\right\}.

We tackle the two cases \alpha\leq M and \alpha>M separately. For ease of notation, we abbreviate \mathbf{P}_{1}\left(Y=d\right) and \mathbf{P}_{2}\left(Y=d\right) as \mathbf{P}_{1}\left(d\right) and \mathbf{P}_{2}\left(d\right), respectively.

1) Case \alpha\leq M: Note that

ck(δ)supαM|((d𝔻[(d+α)+k𝐏1(d)])1k(d𝔻[(d+α)+k𝐏2(d)])1k)|\displaystyle c_{k}\left(\delta\right)\sup_{\alpha\leq M}\left|\left(\left(\sum_{d\in\mathbb{D}}\left[\left(-d+\alpha\right)_{+}^{k_{\ast}}\mathbf{P}_{1}\left(d\right)\right]\right)^{\frac{1}{k_{\ast}}}-\left(\sum_{d\in\mathbb{D}}\left[\left(-d+\alpha\right)_{+}^{k_{\ast}}\mathbf{P}_{2}\left(d\right)\right]\right)^{\frac{1}{k_{\ast}}}\right)\right|
=\displaystyle= ck(δ)supαM|((d𝔻((d+α)+(𝐏1(d))1k)k)1k(d𝔻((d+α)+(𝐏2(d))1k)k)1k)|\displaystyle c_{k}\left(\delta\right)\sup_{\alpha\leq M}\left|\left(\left(\sum_{d\in\mathbb{D}}\left(\left(-d+\alpha\right)_{+}\left(\mathbf{P}_{1}\left(d\right)\right)^{\frac{1}{k_{\ast}}}\right)^{k_{\ast}}\right)^{\frac{1}{k_{\ast}}}-\left(\sum_{d\in\mathbb{D}}\left(\left(-d+\alpha\right)_{+}\left(\mathbf{P}_{2}\left(d\right)\right)^{\frac{1}{k_{\ast}}}\right)^{k_{\ast}}\right)^{\frac{1}{k_{\ast}}}\right)\right|
\displaystyle\leq ck(δ)supαMmax{(d:𝐏1(d)𝐏2(d)((d+α)+𝐏1(d)1k)k)1k(d:𝐏1(d)𝐏2(d)((d+α)+𝐏2(d)1k)k)1k,\displaystyle c_{k}\left(\delta\right)\sup_{\alpha\leq M}\max\left\{\left(\sum_{\begin{subarray}{c}d:\mathbf{P}_{1}\left(d\right)\\ \geq\mathbf{P}_{2}\left(d\right)\end{subarray}}\left(\left(-d+\alpha\right)_{+}\mathbf{P}_{1}\left(d\right)^{\frac{1}{k_{\ast}}}\right)^{k_{\ast}}\right)^{\frac{1}{k_{\ast}}}-\left(\sum_{{}_{\begin{subarray}{c}d:\mathbf{P}_{1}\left(d\right)\\ \geq\mathbf{P}_{2}\left(d\right)\end{subarray}}}\left(\left(-d+\alpha\right)_{+}\mathbf{P}_{2}\left(d\right)^{\frac{1}{k_{\ast}}}\right)^{k_{\ast}}\right)^{\frac{1}{k_{\ast}}},\right.
(d:𝐏1(d)<𝐏2(d)((d+α)+𝐏1(d)1/k)k)1k(d:𝐏1(d)<𝐏2(d)((d+α)+𝐏2(d)1/k)k)1k}.\displaystyle\left.\left(\sum_{\begin{subarray}{c}d:\mathbf{P}_{1}\left(d\right)\\ <\mathbf{P}_{2}\left(d\right)\end{subarray}}\left(\left(-d+\alpha\right)_{+}\mathbf{P}_{1}\left(d\right)^{1/k_{\ast}}\right)^{k_{\ast}}\right)^{\frac{1}{k_{\ast}}}-\left(\sum_{{}_{\begin{subarray}{c}d:\mathbf{P}_{1}\left(d\right)\\ <\mathbf{P}_{2}\left(d\right)\end{subarray}}}\left(\left(-d+\alpha\right)_{+}\mathbf{P}_{2}\left(d\right)^{1/k_{\ast}}\right)^{k_{\ast}}\right)^{\frac{1}{k_{\ast}}}\right\}.

By the triangle inequality for the k_{\ast}-norm and the fact that \left(-d+\alpha\right)_{+}\leq M for \alpha\leq M, we have

(d:𝐏1(d)𝐏2(d)((d+α)+(𝐏1(d))1/k)k)1k(d:𝐏1(d)𝐏2(d)((d+α)+(𝐏2(d))1/k)k)1k\displaystyle\left(\sum_{\begin{subarray}{c}d:\mathbf{P}_{1}\left(d\right)\\ \geq\mathbf{P}_{2}\left(d\right)\end{subarray}}\left(\left(-d+\alpha\right)_{+}\left(\mathbf{P}_{1}\left(d\right)\right)^{1/k_{\ast}}\right)^{k_{\ast}}\right)^{\frac{1}{k_{\ast}}}-\left(\sum_{{}_{\begin{subarray}{c}d:\mathbf{P}_{1}\left(d\right)\\ \geq\mathbf{P}_{2}\left(d\right)\end{subarray}}}\left(\left(-d+\alpha\right)_{+}\left(\mathbf{P}_{2}\left(d\right)\right)^{1/k_{\ast}}\right)^{k_{\ast}}\right)^{\frac{1}{k_{\ast}}}
\displaystyle\leq ck(δ)M(d:𝐏1(d)𝐏2(d)|(𝐏1(d))1/k(𝐏2(d))1/k|k)1k.\displaystyle c_{k}\left(\delta\right)M\left(\sum_{{}_{\begin{subarray}{c}d:\mathbf{P}_{1}\left(d\right)\\ \geq\mathbf{P}_{2}\left(d\right)\end{subarray}}}\left|\left(\mathbf{P}_{1}\left(d\right)\right)^{1/k_{\ast}}-\left(\mathbf{P}_{2}\left(d\right)\right)^{1/k_{\ast}}\right|^{k_{\ast}}\right)^{\frac{1}{k_{\ast}}}.

Consider the function h(x)=x^{1/k_{\ast}}, whose derivative satisfies

h(x)=1kx1/k11k(b¯/2)1/k1, when xb¯/2.h^{\prime}(x)=\frac{1}{k^{\ast}}x^{1/k_{\ast}-1}\leq\frac{1}{k^{\ast}}\left(\underline{b}/2\right)^{1/k_{\ast}-1},\text{ when }x\geq\underline{b}/2.

Then, when \mathrm{TV}(\mathbf{P}_{1},\mathbf{P}_{2})\leq\underline{b}/2, we have

ck(δ)M(d:𝐏1(d)𝐏2(d)|(𝐏1(d))1/k(𝐏2(d))1/k|k)1k\displaystyle c_{k}\left(\delta\right)M\left(\sum_{{}_{\begin{subarray}{c}d:\mathbf{P}_{1}\left(d\right)\\ \geq\mathbf{P}_{2}\left(d\right)\end{subarray}}}\left|\left(\mathbf{P}_{1}\left(d\right)\right)^{1/k_{\ast}}-\left(\mathbf{P}_{2}\left(d\right)\right)^{1/k_{\ast}}\right|^{k_{\ast}}\right)^{\frac{1}{k_{\ast}}}
\displaystyle\leq ck(δ)M(d:𝐏1(d)𝐏2(d)(1k(b¯/2)1/k1|𝐏1(d)𝐏2(d)|)k)1k\displaystyle c_{k}\left(\delta\right)M\left(\sum_{{}_{\begin{subarray}{c}d:\mathbf{P}_{1}\left(d\right)\\ \geq\mathbf{P}_{2}\left(d\right)\end{subarray}}}\left(\frac{1}{k^{\ast}}\left(\underline{b}/2\right)^{1/k_{\ast}-1}\left|\mathbf{P}_{1}\left(d\right)-\mathbf{P}_{2}\left(d\right)\right|\right)^{k_{\ast}}\right)^{\frac{1}{k_{\ast}}}
=\displaystyle= ck(δ)Mk(b¯/2)1/k1(d:𝐏1(d)𝐏2(d)|𝐏1(d)𝐏2(d)|k)1k\displaystyle\frac{c_{k}\left(\delta\right)M}{k^{\ast}}\left(\underline{b}/2\right)^{1/k_{\ast}-1}\left(\sum_{{}_{\begin{subarray}{c}d:\mathbf{P}_{1}\left(d\right)\\ \geq\mathbf{P}_{2}\left(d\right)\end{subarray}}}\left|\mathbf{P}_{1}\left(d\right)-\mathbf{P}_{2}\left(d\right)\right|^{k_{\ast}}\right)^{\frac{1}{k_{\ast}}}
\displaystyle\leq ck(δ)Mk(b¯/2)1/k1d:𝐏1(d)𝐏2(d)|𝐏1(d)𝐏2(d)|\displaystyle\frac{c_{k}\left(\delta\right)M}{k^{\ast}}\left(\underline{b}/2\right)^{1/k_{\ast}-1}\sum_{{}_{\begin{subarray}{c}d:\mathbf{P}_{1}\left(d\right)\\ \geq\mathbf{P}_{2}\left(d\right)\end{subarray}}}\left|\mathbf{P}_{1}\left(d\right)-\mathbf{P}_{2}\left(d\right)\right|
\displaystyle\leq ck(δ)Mk(b¯/2)1/k1TV(𝐏1,𝐏2).\displaystyle\frac{c_{k}\left(\delta\right)M}{k^{\ast}}\left(\underline{b}/2\right)^{1/k_{\ast}-1}\mathrm{TV}(\mathbf{P}_{1},\mathbf{P}_{2}).

The same bound holds for \{d\in\mathbb{D}:\mathbf{P}_{1}\left(d\right)<\mathbf{P}_{2}\left(d\right)\}, which completes this case.

2) Case \alpha>M: In this case, we have

ck(δ)supα>M|(d𝔻[(d+α)+k𝐏1(d)])1k(d𝔻[(d+α)+k𝐏2(d)])1k|\displaystyle c_{k}\left(\delta\right)\sup_{\alpha>M}\left|\left(\sum_{d\in\mathbb{D}}\left[\left(-d+\alpha\right)_{+}^{k_{\ast}}\mathbf{P}_{1}\left(d\right)\right]\right)^{\frac{1}{k_{\ast}}}-\left(\sum_{d\in\mathbb{D}}\left[\left(-d+\alpha\right)_{+}^{k_{\ast}}\mathbf{P}_{2}\left(d\right)\right]\right)^{\frac{1}{k_{\ast}}}\right|
=\displaystyle= ck(δ)supα>M|(d𝔻[(αd)k𝐏1(d)])1k(d𝔻[(αd)k𝐏2(d)])1k|.\displaystyle c_{k}\left(\delta\right)\sup_{\alpha>M}\left|\left(\sum_{d\in\mathbb{D}}\left[\left(\alpha-d\right)^{k_{\ast}}\mathbf{P}_{1}\left(d\right)\right]\right)^{\frac{1}{k_{\ast}}}-\left(\sum_{d\in\mathbb{D}}\left[\left(\alpha-d\right)^{k_{\ast}}\mathbf{P}_{2}\left(d\right)\right]\right)^{\frac{1}{k_{\ast}}}\right|.

We will focus on \left|\left(\sum_{d\in\mathbb{D}}\left[\left(\alpha-d\right)^{k_{\ast}}\mathbf{P}_{1}\left(d\right)\right]\right)^{\frac{1}{k_{\ast}}}-\left(\sum_{d\in\mathbb{D}}\left[\left(\alpha-d\right)^{k_{\ast}}\mathbf{P}_{2}\left(d\right)\right]\right)^{\frac{1}{k_{\ast}}}\right|, and without loss of generality, we assume

d𝔻[(αd)k𝐏1(d)]d𝔻[(αd)k𝐏2(d)].\sum_{d\in\mathbb{D}}\left[\left(\alpha-d\right)^{k_{\ast}}\mathbf{P}_{1}\left(d\right)\right]\geq\sum_{d\in\mathbb{D}}\left[\left(\alpha-d\right)^{k_{\ast}}\mathbf{P}_{2}\left(d\right)\right].

Recall that for the function h(x)=x^{1/k_{\ast}}, the derivative satisfies

h(x)=1kx1/k11k(x¯)1/k1, when xx¯.h^{\prime}(x)=\frac{1}{k^{\ast}}x^{1/k_{\ast}-1}\leq\frac{1}{k^{\ast}}\left(\underline{x}\right)^{1/k_{\ast}-1},\text{ when }x\geq\underline{x}.

Therefore, we have

ck(δ)(d𝔻[(αd)k𝐏1(d)])1k(d𝔻[(αd)k𝐏2(d)])1k\displaystyle c_{k}\left(\delta\right)\left(\sum_{d\in\mathbb{D}}\left[\left(\alpha-d\right)^{k_{\ast}}\mathbf{P}_{1}\left(d\right)\right]\right)^{\frac{1}{k_{\ast}}}-\left(\sum_{d\in\mathbb{D}}\left[\left(\alpha-d\right)^{k_{\ast}}\mathbf{P}_{2}\left(d\right)\right]\right)^{\frac{1}{k_{\ast}}} (A.25)
\displaystyle\leq ck(δ)k(d𝔻[(αd)k𝐏2(d)])1/k1(d𝔻[(αd)k(𝐏1(d)𝐏2(d))]).\displaystyle\frac{c_{k}\left(\delta\right)}{k^{\ast}}\left(\sum_{d\in\mathbb{D}}\left[\left(\alpha-d\right)^{k_{\ast}}\mathbf{P}_{2}\left(d\right)\right]\right)^{1/k_{\ast}-1}\left(\sum_{d\in\mathbb{D}}\left[\left(\alpha-d\right)^{k_{\ast}}\left(\mathbf{P}_{1}\left(d\right)-\mathbf{P}_{2}\left(d\right)\right)\right]\right).

Then, when \mathrm{TV}(\mathbf{P}_{1},\mathbf{P}_{2})\leq\underline{b}/2, we have

d𝔻[(αd)k𝐏2(d)]b¯2(αmind𝔻d)k\displaystyle\sum_{d\in\mathbb{D}}\left[\left(\alpha-d\right)^{k_{\ast}}\mathbf{P}_{2}\left(d\right)\right]\geq\frac{\underline{b}}{2}\left(\alpha-\min_{d\in\mathbb{D}}d\right)^{k_{\ast}} (A.26)
\displaystyle\Rightarrow (d𝔻[(αd)k𝐏2(d)])1/k1(b¯/2)1/k1(αmind𝔻d)1k.\displaystyle\left(\sum_{d\in\mathbb{D}}\left[\left(\alpha-d\right)^{k_{\ast}}\mathbf{P}_{2}\left(d\right)\right]\right)^{1/k_{\ast}-1}\leq\left(\underline{b}/2\right)^{1/k_{\ast}-1}\left(\alpha-\min_{d\in\mathbb{D}}d\right)^{1-k_{\ast}}.

Furthermore, we have

d𝔻[(αd)k(𝐏1(d)𝐏2(d))]\displaystyle\sum_{d\in\mathbb{D}}\left[\left(\alpha-d\right)^{k_{\ast}}\left(\mathbf{P}_{1}\left(d\right)-\mathbf{P}_{2}\left(d\right)\right)\right] (A.27)
=\displaystyle= (d:𝐏1(d)>𝐏2(d)(αd)k(𝐏1(d)𝐏2(d)))(d:𝐏1(d)<𝐏2(d)(αd)k(𝐏2(d)𝐏1(d)))\displaystyle\left(\sum_{{}_{\begin{subarray}{c}d:\mathbf{P}_{1}\left(d\right)\\ >\mathbf{P}_{2}\left(d\right)\end{subarray}}}\left(\alpha-d\right)^{k_{\ast}}\left(\mathbf{P}_{1}\left(d\right)-\mathbf{P}_{2}\left(d\right)\right)\right)-\left(\sum_{{}_{\begin{subarray}{c}d:\mathbf{P}_{1}\left(d\right)\\ <\mathbf{P}_{2}\left(d\right)\end{subarray}}}\left(\alpha-d\right)^{k_{\ast}}\left(\mathbf{P}_{2}\left(d\right)-\mathbf{P}_{1}\left(d\right)\right)\right)
\displaystyle\leq (αmind𝔻d)kd:𝐏1(d)>𝐏2(d)(𝐏1(d)𝐏2(d))(αmaxd𝔻d)kd:𝐏1(d)<𝐏2(d)(𝐏2(d)𝐏1(d))\displaystyle\left(\alpha-\min_{d\in\mathbb{D}}d\right)^{k_{\ast}}\sum_{{}_{\begin{subarray}{c}d:\mathbf{P}_{1}\left(d\right)\\ >\mathbf{P}_{2}\left(d\right)\end{subarray}}}\left(\mathbf{P}_{1}\left(d\right)-\mathbf{P}_{2}\left(d\right)\right)-\left(\alpha-\max_{d\in\mathbb{D}}d\right)^{k_{\ast}}\sum_{{}_{\begin{subarray}{c}d:\mathbf{P}_{1}\left(d\right)\\ <\mathbf{P}_{2}\left(d\right)\end{subarray}}}\left(\mathbf{P}_{2}\left(d\right)-\mathbf{P}_{1}\left(d\right)\right)
=\displaystyle= ((αmind𝔻d)k(αmaxd𝔻d)k)d:𝐏1(d)>𝐏2(d)(𝐏1(d)𝐏2(d)).\displaystyle\left(\left(\alpha-\min_{d\in\mathbb{D}}d\right)^{k_{\ast}}-\left(\alpha-\max_{d\in\mathbb{D}}d\right)^{k_{\ast}}\right)\sum_{{}_{\begin{subarray}{c}d:\mathbf{P}_{1}\left(d\right)\\ >\mathbf{P}_{2}\left(d\right)\end{subarray}}}\left(\mathbf{P}_{1}\left(d\right)-\mathbf{P}_{2}\left(d\right)\right).

The last equality is due to

d:𝐏1(d)>𝐏2(d)(𝐏1(d)𝐏2(d))=d:𝐏1(d)<𝐏2(d)(𝐏2(d)𝐏1(d)).\sum_{{}_{d:\mathbf{P}_{1}\left(d\right)>\mathbf{P}_{2}\left(d\right)}}\left(\mathbf{P}_{1}\left(d\right)-\mathbf{P}_{2}\left(d\right)\right)=\sum_{{}_{d:\mathbf{P}_{1}\left(d\right)<\mathbf{P}_{2}\left(d\right)}}\left(\mathbf{P}_{2}\left(d\right)-\mathbf{P}_{1}\left(d\right)\right).

We further note that

d:𝐏1(d)>𝐏2(d)(𝐏1(d)𝐏2(d))=TV(𝐏1,𝐏2),\sum_{{}_{d:\mathbf{P}_{1}\left(d\right)>\mathbf{P}_{2}\left(d\right)}}\left(\mathbf{P}_{1}\left(d\right)-\mathbf{P}_{2}\left(d\right)\right)=\mathrm{TV}(\mathbf{P}_{1},\mathbf{P}_{2}), (A.28)

and

((αmind𝔻d)k(αmaxd𝔻d)k)\displaystyle\left(\left(\alpha-\min_{d\in\mathbb{D}}d\right)^{k_{\ast}}-\left(\alpha-\max_{d\in\mathbb{D}}d\right)^{k_{\ast}}\right) (A.29)
\displaystyle\leq (maxd𝔻dmind𝔻d)(αmind𝔻d)k1\displaystyle\left(\max_{d\in\mathbb{D}}d-\min_{d\in\mathbb{D}}d\right)\left(\alpha-\min_{d\in\mathbb{D}}d\right)^{k_{\ast}-1}
\displaystyle\leq M(αmind𝔻d)k1.\displaystyle M\left(\alpha-\min_{d\in\mathbb{D}}d\right)^{k_{\ast}-1}.

By combining bounds (A.25) - (A.29), we have

ck(δ)(d𝔻[(αd)k𝐏1(d)])1k(d𝔻[(αd)k𝐏2(d)])1k\displaystyle c_{k}\left(\delta\right)\left(\sum_{d\in\mathbb{D}}\left[\left(\alpha-d\right)^{k_{\ast}}\mathbf{P}_{1}\left(d\right)\right]\right)^{\frac{1}{k_{\ast}}}-\left(\sum_{d\in\mathbb{D}}\left[\left(\alpha-d\right)^{k_{\ast}}\mathbf{P}_{2}\left(d\right)\right]\right)^{\frac{1}{k_{\ast}}}
\displaystyle\leq ck(δ)Mk(b¯/2)1/k1TV(𝐏1,𝐏2), for any α>M,\displaystyle\frac{c_{k}\left(\delta\right)M}{k_{\ast}}\left(\underline{b}/2\right)^{1/k_{\ast}-1}\mathrm{TV}(\mathbf{P}_{1},\mathbf{P}_{2}),\text{ for any }\alpha>M,

which completes the proof. ∎

Appendix B Experiments

Appendix B.1 Optimization of multi-linear policy

This section provides implementation details on how to compute \mathop{\rm arg\,min}_{\Theta\in\mathbf{R}^{(p+1)\times d}}\hat{W}_{n}(\pi_{\Theta},\alpha), where \pi_{\Theta} is the multilinear policy associated with the parameter \Theta. Recall from Definition 4 that

\hat{W}_{n}(\pi,\alpha)=\frac{1}{nS_{n}^{\pi}}\sum_{i=1}^{n}W_{i}(\pi,\alpha)=\frac{\sum_{i=1}^{n}\frac{\mathbf{1}\{\pi(X_{i})=A_{i}\}}{\pi_{0}(A_{i}\mid X_{i})}\exp(-Y_{i}(A_{i})/\alpha)}{\sum_{i=1}^{n}\frac{\mathbf{1}\{\pi(X_{i})=A_{i}\}}{\pi_{0}\left(A_{i}|X_{i}\right)}}.

As we did in Section 5.2, we employ the smooth approximation of the indicator function

\mathbf{1}\{\pi_{\Theta}(X_{i})=A_{i}\}\approx\frac{\exp(\theta_{A_{i}}^{\top}X_{i})}{\sum_{a=1}^{d}\exp(\theta_{a}^{\top}X_{i})}.

Now, for i=1,\ldots,n, we define the smooth weight function p_{i}:\mathbf{R}^{(p+1)\times d}\rightarrow\mathbf{R}_{+} as

p_{i}(\Theta)\triangleq\frac{\exp(\theta_{A_{i}}^{\top}X_{i})}{\pi_{0}\left(A_{i}|X_{i}\right)\sum_{a=1}^{d}\exp(\theta_{a}^{\top}X_{i})},

and the estimator \hat{W}_{n}(\pi_{\Theta},\alpha) then admits the smooth approximation

\hat{W}_{n}(\pi_{\Theta},\alpha)\approx\tilde{W}_{n}(\pi_{\Theta},\alpha)\triangleq\frac{\sum_{i=1}^{n}p_{i}(\Theta)\exp(-Y_{i}(A_{i})/\alpha)}{\sum_{i=1}^{n}p_{i}(\Theta)}.

In addition, we have

ΘW~n(πΘ,α)\displaystyle\nabla_{\Theta}\tilde{W}_{n}(\pi_{\Theta},\alpha) =W~n(πΘ,α)Θlog(W~n(πΘ,α))\displaystyle=\tilde{W}_{n}(\pi_{\Theta},\alpha)\nabla_{\Theta}\log\left(\tilde{W}_{n}(\pi_{\Theta},\alpha)\right)
W~n(πΘ,α)(Θlog(i=1npi(Θ)exp(Yi(Ai)/α))Θlog(i=1npi(Θ)))\displaystyle\approx\tilde{W}_{n}(\pi_{\Theta},\alpha)\cdot\left(\nabla_{\Theta}\log\left(\sum_{i=1}^{n}p_{i}(\Theta)\exp(-Y_{i}(A_{i})/\alpha)\right)-\nabla_{\Theta}\log\left(\sum_{i=1}^{n}p_{i}(\Theta)\right)\right)
=W~n(πΘ,α)(i=1nΘpi(Θ)exp(Yi(Ai)/α)i=1npi(Θ)exp(Yi(Ai)/α)i=1nΘpi(Θ)i=1npi(Θ)).\displaystyle=\tilde{W}_{n}(\pi_{\Theta},\alpha)\cdot\left(\frac{\sum_{i=1}^{n}\nabla_{\Theta}p_{i}(\Theta)\exp(-Y_{i}(A_{i})/\alpha)}{\sum_{i=1}^{n}p_{i}(\Theta)\exp(-Y_{i}(A_{i})/\alpha)}-\frac{\sum_{i=1}^{n}\nabla_{\Theta}p_{i}(\Theta)}{\sum_{i=1}^{n}p_{i}(\Theta)}\right).

Therefore, we can employ gradient descent to solve for the \Theta that minimizes \tilde{W}_{n}(\pi_{\Theta},\alpha), which approximately minimizes \hat{W}_{n}(\pi_{\Theta},\alpha) as well. This is how we solve for \mathop{\rm arg\,min}_{\Theta\in\mathbf{R}^{(p+1)\times d}}\hat{W}_{n}(\pi_{\Theta},\alpha) in the implementation; a sketch of one such update is given below.
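For concreteness, the following NumPy sketch evaluates \tilde{W}_{n}(\pi_{\Theta},\alpha) and the gradient formula displayed above for a fixed \alpha. The function name, array layout, and the plain gradient-descent update are our own illustrative choices and are not necessarily the exact implementation used in the experiments.

```python
import numpy as np

def smoothed_objective_and_grad(Theta, X, A, Y, pi0, alpha):
    """Evaluate W~_n(pi_Theta, alpha) and its gradient in Theta (see the displays above).

    X: (n, p+1) contexts (including an intercept column); A: (n,) logged actions in {0,...,d-1};
    Y: (n,) observed outcomes Y_i(A_i); pi0: (n,) logging probabilities pi_0(A_i | X_i);
    Theta: (p+1, d) policy parameters; alpha > 0: fixed dual variable.
    """
    n, d = X.shape[0], Theta.shape[1]
    logits = X @ Theta
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    S = np.exp(logits)
    S /= S.sum(axis=1, keepdims=True)              # softmax probabilities s_{i,a}
    p = S[np.arange(n), A] / pi0                   # smooth weights p_i(Theta)
    e = np.exp(-Y / alpha)                         # exp(-Y_i(A_i)/alpha)
    W = np.sum(p * e) / np.sum(p)                  # smoothed estimator W~_n

    # Gradient of p_i w.r.t. theta_a is p_i * (1{a = A_i} - s_{i,a}) * X_i.
    onehot = np.zeros((n, d))
    onehot[np.arange(n), A] = 1.0
    G = p[:, None] * (onehot - S)
    grad = W * ((X.T @ (G * e[:, None])) / np.sum(p * e) - (X.T @ G) / np.sum(p))
    return W, grad

# A (hypothetical) plain gradient-descent loop:
# for _ in range(num_iters):
#     W, grad = smoothed_objective_and_grad(Theta, X, A, Y, pi0, alpha)
#     Theta -= step_size * grad
```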

Appendix B.2 Experimental details of δ\delta selection in Section 6.4

In this section we will provide further details on the δ\delta selection experiment.

Recall that we intend to estimate \delta based on the data of the 100 cities in the training set. To this end, we set aside the data from 20% of the cities as our validation set, with distribution denoted by \mathbf{P}^{20}, and we use \mathbf{P}^{80} to denote the distribution of the remaining 80% of the training set. We explain in detail how to estimate D(\mathbf{P}^{20}||\mathbf{P}^{80}) in the rest of this section.

We first explain how to estimate the divergence between the marginal distributions of X, denoted by D(\mathbf{P}_{X}^{20}||\mathbf{P}_{X}^{80}). A direct computation using the sampled distributions \mathbf{P}_{X}^{20} and \mathbf{P}_{X}^{80} may result in an infinite value, because (1) some features, such as year of birth, contain outliers whose values appear only in \mathbf{P}_{X}^{20} or \mathbf{P}_{X}^{80}; and (2) X is a nine-dimensional vector, which exacerbates the previous problem. Given that the demographic features (year of birth, sex, household size) are weakly correlated with the historical voting records, we compute the divergence on them separately. To avoid an infinite KL-divergence, we first regroup two demographic features, Year of Birth (YoB) and Household Size (HS), according to the following rules:

YoB group={1if YoB19432if 1943<YoB19523if 1952<YoB19594if 1959<YoB19665if 1966<YoBHS group={1if HS=12if HS=23if HS=34if HS4\mbox{YoB group}=\begin{cases}1&\mbox{if YoB}\leq 1943\\ 2&\mbox{if }1943<\rm{YoB}\leq 1952\\ 3&\mbox{if }1952<\rm{YoB}\leq 1959\\ 4&\mbox{if }1959<\rm{YoB}\leq 1966\\ 5&\mbox{if }1966<\rm{YoB}\\ \end{cases}\qquad\mbox{HS group}=\begin{cases}1&\mbox{if HS}=1\\ 2&\mbox{if HS}=2\\ 3&\mbox{if HS}=3\\ 4&\mbox{if HS}\geq 4\\ \end{cases}

After the regrouping, we define the demographic feature vector X_{\rm{demo}}=(\mbox{HS group, YoB group, sex}) and compute the KL-divergence D(\mathbf{P}^{20}_{X_{\rm{demo}}}||\mathbf{P}^{80}_{X_{\rm{demo}}}). The historical voting record vector is defined as X_{\rm{rec}}=(\mbox{g2004, g2002, g2000, d2004, d2002, d2000}), and we compute its divergence D(\mathbf{P}^{20}_{X_{\rm{rec}}}||\mathbf{P}^{80}_{X_{\rm{rec}}}) directly. We use the sum D(\mathbf{P}^{20}_{X_{\rm{demo}}}||\mathbf{P}^{80}_{X_{\rm{demo}}})+D(\mathbf{P}^{20}_{X_{\rm{rec}}}||\mathbf{P}^{80}_{X_{\rm{rec}}}) as an approximation of the divergence between \mathbf{P}_{X}^{20} and \mathbf{P}_{X}^{80}; a code sketch of this step is given below.
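The sketch below (pandas-based; the column names, helper names, and plug-in estimator are our own illustrative choices) implements the regrouping rules and the resulting marginal KL estimates.

```python
import numpy as np
import pandas as pd

def yob_group(yob):
    # Regrouping rule for Year of Birth, matching the thresholds displayed above.
    return np.searchsorted([1943, 1952, 1959, 1966], yob, side="left") + 1

def hs_group(hs):
    # Regrouping rule for Household Size.
    return int(min(hs, 4))

def discrete_kl(p_sample, q_sample):
    """Plug-in KL divergence between two samples of a discrete vector (rows stored as tuples)."""
    p = p_sample.value_counts(normalize=True)
    q = q_sample.value_counts(normalize=True)
    support = p.index  # assumes every value seen in the 20% split also appears in the 80% split
    return float(np.sum(p[support] * np.log(p[support] / q.reindex(support))))

# df20, df80: hypothetical dataframes for the 20% and 80% city splits with the columns below.
# demo = lambda df: pd.Series(list(zip(df["HS"].map(hs_group), yob_group(df["YoB"]), df["sex"])))
# rec  = lambda df: pd.Series(list(zip(df["g2004"], df["g2002"], df["g2000"],
#                                      df["d2004"], df["d2002"], df["d2000"])))
# kl_x = discrete_kl(demo(df20), demo(df80)) + discrete_kl(rec(df20), rec(df80))
```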

Next, we explain how to apply logistic regression to estimate \mathbf{E}_{\mathbf{P}_{X}^{20}}[D(\mathbf{P}_{Y}^{20}|X||\mathbf{P}_{Y}^{80}|X)]. We independently fit two logistic regressions of Y\sim X, using the data corresponding to \mathbf{P}^{20} and \mathbf{P}^{80}, respectively. Each logistic regression implies a fitted conditional distribution of Y given X, i.e., \mathbf{P}(Y=1|X)=(1+\exp(\hat{\beta}_{0}+\hat{\beta}^{\top}X))^{-1}, where \hat{\beta}_{0} and \hat{\beta} are the fitted parameters. We denote by \hat{P}_{Y}^{20}|X the conditional distribution fitted on the data from \mathbf{P}^{20}, and by \hat{P}_{Y}^{80}|X the conditional distribution fitted on the data from \mathbf{P}^{80}. Since both \hat{P}_{Y}^{20}|X and \hat{P}_{Y}^{80}|X are Bernoulli distributions parameterized as functions of X, the KL-divergence between them, D(\hat{\mathbf{P}}_{Y}^{20}|X||\hat{\mathbf{P}}_{Y}^{80}|X), can be computed in closed form as a function of X. Finally, we compute the average value of this estimated KL-divergence under the distribution \mathbf{P}_{X}^{20}.

To conclude this section, we provide the full formula used in the computation:

D(\mathbf{P}^{20}||\mathbf{P}^{80})\approx D(\mathbf{P}^{20}_{X_{\rm{demo}}}||\mathbf{P}^{80}_{X_{\rm{demo}}})+D(\mathbf{P}^{20}_{X_{\rm{rec}}}||\mathbf{P}^{80}_{X_{\rm{rec}}})+\mathbf{E}_{\mathbf{P}_{X}^{20}}[D(\hat{\mathbf{P}}_{Y}^{20}|X||\hat{\mathbf{P}}_{Y}^{80}|X)].
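The conditional term in this formula can be computed from the two fitted logistic regressions described above; a scikit-learn-based sketch (variable names are our own) is given below.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def conditional_kl_term(X20, y20, X80, y80):
    """Estimate E_{P_X^20}[ D( P_hat_Y^20|X || P_hat_Y^80|X ) ] with two logistic fits."""
    m20 = LogisticRegression().fit(X20, y20)
    m80 = LogisticRegression().fit(X80, y80)
    p = m20.predict_proba(X20)[:, 1]   # fitted P(Y=1|X) under the 20% split
    q = m80.predict_proba(X20)[:, 1]   # fitted P(Y=1|X) under the 80% split, at the same X
    p = np.clip(p, 1e-12, 1 - 1e-12)
    q = np.clip(q, 1e-12, 1 - 1e-12)
    kl = p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))  # Bernoulli KL, closed form
    return float(kl.mean())            # average over the empirical distribution P_X^20

# delta_hat = kl_x + conditional_kl_term(X20, y20, X80, y80)   # kl_x from the sketch above
```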