
Sequential Fair Resource Allocation under a Markov Decision Process Framework

Parisa Hassanzadeh$^{1}$, Eleonora Kreačić$^{1}$, Sihan Zeng$^{2}$, Yuchen Xiao$^{1}$ and Sumitra Ganesh$^{1}$
$^{1}$J.P. Morgan AI Research, $^{2}$Georgia Institute of Technology
{parisa.hassanzadeh, eleonora.kreacic, yuchen.xiao, sumitra.ganesh}@jpmorgan.com, [email protected]
Abstract

We study the sequential decision-making problem of allocating a divisible resource to agents that reveal their stochastic demands on arrival over a finite horizon. Our goal is to design fair allocation algorithms that exhaust the available resource budget. This is challenging in sequential settings where information on future demands is not available at the time of decision-making. We formulate the problem as a discrete time Markov decision process (MDP). We propose a new algorithm, SAFFE, that makes fair allocations with respect to all demands revealed over the horizon by accounting for expected future demands at each arrival time. Using the MDP formulation, we show that the allocation made by SAFFE optimizes an upper bound of the Nash Social Welfare fairness objective. We further introduce SAFFE-D, which improves SAFFE by more carefully balancing the trade-off between the current revealed demands and the future potential demands based on the uncertainty in agents’ future demands. We bound its gap to optimality using concentration bounds on total future demands. On synthetic and real data, we compare SAFFE-D against a set of baseline approaches, and show that it leads to more fair and efficient allocations and achieves close-to-optimal performance.

1 Introduction

The problem of multi-agent resource allocation arises in numerous disciplines including economics Cohen et al. (1965), power system optimization Yi et al. (2016); Nair et al. (2018), and cloud computing Balasubramanian et al. (2007); Abid et al. (2020). It comes in various flavors depending on the nature of resources, the agents’ behavior, the timing of allocation decisions, and the overall objective of the allocation process. A common element shared by various settings is the existence of a limited resource that needs to be distributed to various agents based on their reported requests.

There are various objectives that one might wish to optimize in the context of resource allocation. While the goal may be to maximize efficiency (e.g., minimize the leftover resource), other objectives combine efficiency with various notions of fairness. Among different fairness metrics, Nash Social Welfare (NSW) is a well-known objective defined as the geometric mean of the agents’ “satisfaction” with their allocation Nash (1950); Kaneko and Nakamura (1979). In the offline setting where we observe all requests prior to making an allocation decision, the NSW solution enjoys favorable properties including being Pareto-efficient (for any other allocation, at least one agent would be worse off compared to the current one), envy-free (no agent prefers any other agent’s allocation), and proportionally fair (every agent gets their fair share of the resource) Varian (1973), and thus exhausts the total budget under a limited resource.

In this work, we consider the much more challenging online (or sequential) setting where agents place requests in a sequential manner, and allocation decisions have to be made instantaneously and are irrevocable. In order to achieve fairness we need to reserve the resource for anticipated future demands, which can lead to wasted resources if the expected future agents do not arrive. Unlike previous work on sequential allocation Lien et al. (2014); Sinclair et al. (2022), we consider a more general setup where agents can arrive several times over the horizon, and their demand size is not restricted to a finite set of values. This setting has important applications such as allocating stock inventory on a trading platform or supplier-reseller settings (e.g., automaker-dealer) where the degree of fairness impacts their relationship Kumar et al. (1995); Yilmaz et al. (2004).

Contributions

We consider a general sequential setting where truthful agents can submit multiple demands for divisible resources over a fixed horizon, and the goal is to ensure fairness over this horizon. There is no standard notion of fairness in the sequential setting, as it may be impossible to satisfy all of the favorable fairness properties simultaneously Sinclair et al. (2022). Therefore, online allocation algorithms are commonly evaluated in comparison to their offline (or hindsight) counterpart Gallego et al. (2015); Sinclair et al. (2022). We formulate the sequential problem in terms of maximizing the expected NSW in hindsight, and design algorithms that explicitly optimize for this objective.

The multiple-arrival setting requires being mindful of all past and future requests when responding to an agent’s current demand. We propose a new algorithm SAFFE-D, which determines fair allocations by accounting for all expected future unobserved demands (e.g., based on historical information on agent demands) as well as each agent’s past allocations. In addition, we introduce a tunable discounting term in SAFFE-D, which improves both efficiency and fairness. It uses the uncertainty of future demands to balance the contention between allocating the resource now vs reserving it for the future across the entire horizon. We provide theoretical upper bounds on the sub-optimality gap of SAFFE-D using concentration bounds around the expected future demands, which also guide us in tuning the discounting term of SAFFE-D. Finally, we use numerical simulations and real data to illustrate the close-to-optimal performance of SAFFE-D in different settings, and we compare it against existing approaches and a reinforcement learning policy.

2 Problem Formulation

We consider a supplier that has a divisible resource with a limited budget of size $B$, and $N$ truthful agents that arrive sequentially over $T$ time steps requesting (possibly fractional) units of the resource. At time $t\in\{1,\dots,T\}$, agent $i\in\{1,\dots,N\}$ arrives and reveals its demand ${X}^{t}_{i}\in\mathbb{R}_{\geq 0}$ sampled from distribution $P_{X_{i}^{t}|{X}_{i}^{1},\dots,{X}_{i}^{t-1}}$. We assume that each agent has at least one demand over the horizon $T$. The supplier observes demands $\mathbf{X}^{t}=({X}^{t}_{1},\dots,{X}^{t}_{N})$, and makes an allocation $\mathbf{A}^{t}=({A}^{t}_{1},\dots,{A}^{t}_{N})$, where ${A}^{t}_{i}\in\mathbb{R}_{\geq 0}$ denotes the resource amount allocated to agent $i$. We assume that the allocations are immediately taken by the agents and removed from the inventory. We also assume that the setting is repeated, i.e., after $T$ time steps, a new budget is drawn and the next allocation round of $T$ time steps starts.

Agent $i$ has a utility function $u(\mathbf{A}_{i},\mathbf{X}_{i})$, representing its satisfaction with the allocation $\mathbf{A}_{i}=({A}_{i}^{1},\dots,{A}_{i}^{T})$ given its (latent) demands. The utility function is a non-decreasing, non-negative, concave function. In this work, we consider the following utility function, where an agent’s utility increases linearly with its total allocation up to its total request:

$$u(\mathbf{A}_{i},\mathbf{X}_{i})=\sum_{t=1}^{T}\min\left\{{A}_{i}^{t},\,{X}_{i}^{t}\right\}. \qquad (1)$$

An agent with the utility function in (1) only values an allocation in the time step it requests the resource, and not in earlier or later steps, which is suitable in settings where the allocation process is time sensitive and the supplier is not able to delay the decision-making outside of the current time step. If all agent demands $\mathbf{X}_{1},\dots,\mathbf{X}_{N}$ are known to the supplier at the time of decision-making, they can be used for determining the allocations. This setting is often referred to as offline. However, in the online or sequential setting, the agent demands are gradually revealed over time, such that the supplier only learns about $X_{1}^{t},\dots,X_{N}^{t}$ at time $t$. We present the paper in terms of a single resource; however, the setting extends to multiple resources as described in Appendix A.
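As a concrete illustration, the following is a minimal sketch of the utility in (1); the NumPy arrays are hypothetical per-step demands and allocations for a single agent.

```python
import numpy as np

def utility(A_i: np.ndarray, X_i: np.ndarray) -> float:
    """Utility in Eq. (1): an allocation is only valued up to the demand
    placed in the same time step."""
    return float(np.minimum(A_i, X_i).sum())

# Demands of 3 and 5 units at t=1 and t=3; the over-allocation at t=1 and the
# allocation at t=2 (no demand) add nothing to the utility.
X_i = np.array([3.0, 0.0, 5.0])
A_i = np.array([4.0, 1.0, 2.0])
print(utility(A_i, X_i))  # 3 + 0 + 2 = 5
```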

Notation

$(x)^{+}=\max\{x,0\}$. For vectors $\mathbf{X}$ and $\mathbf{Y}$, we use $\mathbf{X}\geq\mathbf{Y}$ to denote $X_{i}\geq Y_{i}$ for each $i$. $\mathbf{0}$ denotes the zero vector. $N(\mu,\sigma^{2})$ denotes the Normal distribution with mean $\mu$ and variance $\sigma^{2}$.

2.1 Fairness in Allocation Problems

The supplier aims to allocate the resource efficiently, i.e., with minimal leftover at the end of horizon $T$, and in a fair manner maximizing NSW. The NSW objective is defined as $\prod_{i=1}^{N}u(\mathbf{A}_{i},\mathbf{X}_{i})^{w_{i}}$, where $w_{i}\in\mathbb{R}_{+}$ reflects the weight assigned to agent $i$ by the supplier. NSW is a balanced compromise between the utilitarian welfare objective, which maximizes the total utility across agents, and the egalitarian welfare objective, which maximizes the utility of the worst-off agent. Since

$$\arg\max_{\mathbf{A}_{i}}\prod_{i=1}^{N}u(\mathbf{A}_{i},\mathbf{X}_{i})^{w_{i}}=\arg\max_{\mathbf{A}_{i}}\sum_{i=1}^{N}w_{i}\log u(\mathbf{A}_{i},\mathbf{X}_{i}),$$

the logarithm of NSW is often used as the objective function in allocation problems, such as in the Eisenberg-Gale program Eisenberg and Gale (1959). We refer to this as the log-NSW objective.
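Since the logarithm is strictly increasing, the two objectives rank allocations identically; a small numeric check with made-up utility profiles and weights:

```python
import numpy as np

w = np.array([1.0, 2.0, 1.0])      # hypothetical agent weights
u_a = np.array([4.0, 3.0, 2.0])    # utility profile of candidate allocation a
u_b = np.array([2.0, 5.0, 1.0])    # utility profile of candidate allocation b

nsw = lambda u: float(np.prod(u ** w))            # weighted NSW
log_nsw = lambda u: float(np.sum(w * np.log(u)))  # weighted log-NSW

# Monotonicity of the logarithm: the ordering (and hence the argmax) agrees.
assert (nsw(u_a) > nsw(u_b)) == (log_nsw(u_a) > log_nsw(u_b))
```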

In the sequential setting, it is not guaranteed that there always exists an allocation that is simultaneously Pareto-efficient, envy-free, and proportional (Sinclair et al., 2020, Lemma 2.3). Motivated by the properties of the NSW objective in the offline setting, and the fact that our repeated setting can be viewed as multiple draws of the offline setting, we use a modified NSW objective for the sequential setting. Our goal in the sequential setting is to find an allocation that maximizes the log-NSW in expectation. Specifically,

$$\arg\max_{\mathbf{A}_{i}}\sum_{i=1}^{N}w_{i}\,\mathbb{E}_{\mathbf{X}_{i}}\Big[\log u(\mathbf{A}_{i},\mathbf{X}_{i})\Big]. \qquad (2)$$

With the goal of measuring the performance of an allocation in the sequential setting, Sinclair et al. (2020) introduces the approximate fairness metrics $\Delta\mathbf{A}^{\text{max}}$ and $\Delta\mathbf{A}^{\text{mean}}$. Let $\mathbf{A}_{i}^{\text{online}}$ denote the allocation vector of agent $i$ given by an online algorithm subject to latent demands, and let $\mathbf{A}_{i}^{\text{hindsight}}$ denote the allocation vector in hindsight after all demands are revealed, defined in (13) in Sec. 3. The expected difference between the overall allocations for each agent can be used to measure the fairness of the online algorithm. Let $\Delta A_{i}\coloneqq\Big|\sum_{t=1}^{T}{A}_{i}^{t,\text{hindsight}}-\sum_{t=1}^{T}{A}_{i}^{t,\text{online}}\Big|$. Then,

$$\Delta\mathbf{A}^{\text{max}}=\mathbb{E}\Big[\max_{i}\Delta A_{i}\Big],\qquad\Delta\mathbf{A}^{\text{mean}}=\frac{1}{N}\sum_{i=1}^{N}\mathbb{E}[\Delta A_{i}], \qquad (3)$$

are concerned with the worst-off agent and the average agent, respectively, in terms of cumulative allocations relative to hindsight ($\Delta\mathbf{A}$ is defined for the additive utility function in (1), and may need to be redefined for alternative utilities). While our optimization objective is not to minimize $\Delta\mathbf{A}^{\text{max}}$ or $\Delta\mathbf{A}^{\text{mean}}$, we use these metrics to evaluate the fairness of online allocation algorithms in the experiments in Sec. 6.

2.2 Markov Decision Process Formulation

Determining the optimal allocations under the NSW objective is a sequential decision-making problem, as the allocation in one step affects the budget and allocations in future steps. We formulate the problem as a finite-horizon total-reward Markov Decision Process (MDP) modeled as a tuple $\langle\{\mathcal{S}_{t},\mathcal{A}_{t},P_{t},R_{t}\}_{t=1,\dots,T}\rangle$, where $\mathcal{S}_{t}$ denotes the underlying time-dependent state space, $\mathcal{A}_{t}$ is the action space, $P_{t}:\mathcal{S}_{t}\times\mathcal{A}_{t}\times\mathcal{S}_{t+1}\rightarrow\mathbb{R}_{\geq 0}$ describes the state transition dynamics conditioned on the previous state and action, $R_{t}:\mathcal{S}_{t}\times\mathcal{A}_{t}\rightarrow\mathbb{R}_{\geq 0}$ is a non-negative reward function, and $T$ is the horizon over which the resource is allocated. The goal of the supplier is to find an allocation policy $\pi=\{\pi_{1},\dots,\pi_{T}\mid\pi_{t}:\mathcal{S}_{t}\rightarrow\mathcal{A}_{t}\}$ mapping the state to an action, in order to maximize the expected sum of rewards $\mathbb{E}[\sum_{t=1}^{T}R_{t}(s_{t},\pi_{t}(s_{t}))]$. Next, we describe the components of the MDP in detail.

State Space

The state space $\mathcal{S}_{t}$ is time-dependent, and the state size increases with the time step $t$ since the state captures the past demand and allocation information. Specifically, the state at step $t$ is defined as $s_{t}=(\mathbf{X}^{1:t},\,\mathbf{A}^{1:t-1},\,B^{t})$, where $\mathbf{X}^{1:t}\coloneqq(\mathbf{X}^{1},\dots,\mathbf{X}^{t})$, $\mathbf{A}^{1:t}\coloneqq(\mathbf{A}^{1},\dots,\mathbf{A}^{t})$, and

$$B^{t}=\begin{cases}B^{t-1}-\sum_{i=1}^{N}A_{i}^{t-1}&\quad t\geq 2\\ B&\quad t=1\end{cases} \qquad (4)$$

Action Space

The action space is state- and time-dependent. For any $s_{t}=(\mathbf{X}^{1:t},\,\mathbf{A}^{1:t-1},\,B^{t})\in\mathcal{S}_{t}$, we have

$$\mathcal{A}_{t}=\Big\{\mathbf{A}^{t}\in\mathbb{R}_{\geq 0}^{N}:\;\sum_{i=1}^{N}A^{t}_{i}\leq B^{t}\Big\}. \qquad (5)$$

The state and action spaces are both continuous and therefore infinitely large. However, $\mathcal{A}_{t}$ is a compact polytope for any $s_{t}\in\mathcal{S}_{t}$, and $\mathcal{S}_{t}$ is compact if the requests are bounded.

State Transition Function

Given state $s\in\mathcal{S}_{t}$ and action $a\in\mathcal{A}_{t}$, the system transitions to the next state $s^{\prime}\in\mathcal{S}_{t+1}$ with probability

$$P(s,a,s^{\prime})=\text{Prob}(s_{t+1}=s^{\prime}\mid s_{t}=s,\,a_{t}=a), \qquad (6)$$

where $X_{i}^{t+1}\sim P_{X_{i}^{t+1}|X_{i}^{1},\dots,X_{i}^{t}}$.

Reward Function

The reward at time step $t\in\{1,\dots,T\}$ is defined as follows:

$$R_{t}(s_{t},\pi_{t}(s_{t}))=\sum_{i=1}^{N}\mathbbm{1}\{X_{i}^{t}>0\}\cdot w_{i}\,(U_{i}^{t}-U_{i}^{t-1}), \qquad (7)$$

where

$$U_{i}^{t}=\log\left(\sum_{\tau=1}^{t}\min\{A_{i}^{\tau},X_{i}^{\tau}\}+\epsilon\right),\quad t\in\{1,\dots,T\}, \qquad (8)$$

$U_{i}^{0}=0$, and $\epsilon$ is a small value added to ensure that the argument stays within the domain of the log function (for $\epsilon\ll 1$, this introduces an error of at most $\epsilon$ for $x\geq 1$). The reward $R_{t}$ is the weighted sum of the incremental increases in each agent's utility at time $t$: for agent $i$, $\sum_{t=1}^{T}\mathbbm{1}\{X_{i}^{t}>0\}(U_{i}^{t}-U_{i}^{t-1})=\sum_{t=1}^{T}(U_{i}^{t}-U_{i}^{t-1})=U_{i}^{T}=\log(\sum_{t=1}^{T}\min\{A_{i}^{t},X_{i}^{t}\}+\epsilon)$, which matches the agent's log-utility as $\epsilon\rightarrow 0$. The indicator function $\mathbbm{1}\{X_{i}^{t}>0\}$ ensures that we only account for the agents with a demand at time $t$. Then, the expected sum of rewards over the entire horizon $T$ is equivalent to the expected log-NSW objective defined in (2) as $\epsilon\rightarrow 0$.
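A minimal sketch of the per-step reward in (7)-(8), assuming the cumulative served amounts $\sum_{\tau<t}\min\{A_i^\tau,X_i^\tau\}$ are tracked externally (function and variable names are illustrative):

```python
import numpy as np

def step_reward(X_t, A_t, cum_served, t, w, eps=1e-6):
    """Reward in Eq. (7)-(8): weighted increments of the agents' log-utilities,
    counted only for agents with a demand at step t (1-indexed).

    cum_served[i] = sum_{tau < t} min(A_i^tau, X_i^tau); it is updated in place.
    """
    # U_i^{t-1}: zero at t = 1 by definition, log(cumulative + eps) afterwards.
    U_prev = np.zeros_like(cum_served) if t == 1 else np.log(cum_served + eps)
    cum_served += np.minimum(A_t, X_t)
    U_curr = np.log(cum_served + eps)
    return float(np.sum((X_t > 0) * w * (U_curr - U_prev)))
```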

At time $t$, with state $s_{t}\in\mathcal{S}_{t}$ and action (allocation) $\mathbf{A}^{t}\in\mathcal{A}_{t}$, the optimal Q-values satisfy the Bellman optimality equation Bellman (1966):

$$Q_{t}(s_{t},\mathbf{A}^{t})=R_{t}(s_{t},\mathbf{A}^{t})+\mathbb{E}\Big[\max_{\mathbf{A}^{t+1}\in\mathcal{A}_{t+1}}Q_{t+1}(s_{t+1},\mathbf{A}^{t+1})\Big]. \qquad (9)$$

We denote the optimal policy corresponding to Eq. (9) by $\pi_{t}^{\star}$. Then, the optimal allocation is the solution to

$$\arg\max_{\mathbf{A}^{t}}Q_{t}(s_{t},\mathbf{A}^{t}). \qquad (10)$$

We show in Appendix B that an optimal policy exists for the MDP, and it can be derived by recursively solving the Bellman optimality equation Bellman (1966) backward in time. This quickly becomes computationally intractable, which motivates us to find alternative solutions that trade-off sub-optimality for computational efficiency. In Sec. 4, we introduce an algorithm that uses estimates of future demands to determine the allocations, and we discuss a learning-based approach in Sec. 6.5.
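To make the MDP concrete, and as an interface for the learning-based approach of Sec. 6.5, the following is a minimal, illustrative environment sketch (not the paper's implementation); it tracks the state $(\mathbf{X}^{1:t},\mathbf{A}^{1:t-1},B^t)$ implicitly through the revealed demands, cumulative served amounts, and remaining budget, and returns the reward of Eq. (7). The demand sampler is a user-supplied callable.

```python
import numpy as np

class AllocationEnv:
    """Minimal illustrative environment for the allocation MDP of Sec. 2.2."""

    def __init__(self, N, T, budget, demand_sampler, w=None, eps=1e-6):
        self.N, self.T, self.B0 = N, T, budget
        self.sample_demands = demand_sampler        # callable: t -> (N,) demand vector
        self.w = np.ones(N) if w is None else np.asarray(w, dtype=float)
        self.eps = eps

    def reset(self):
        self.t, self.budget = 1, float(self.B0)
        self.cum = np.zeros(self.N)                 # sum_tau min(A_i^tau, X_i^tau)
        self.X_t = self.sample_demands(self.t)
        return self.X_t.copy(), self.budget

    def step(self, A_t):
        A_t = np.clip(np.asarray(A_t, dtype=float), 0.0, None)
        assert A_t.sum() <= self.budget + 1e-9      # feasibility, Eq. (5)
        U_prev = np.zeros(self.N) if self.t == 1 else np.log(self.cum + self.eps)
        self.cum += np.minimum(A_t, self.X_t)
        reward = float(np.sum((self.X_t > 0) * self.w
                              * (np.log(self.cum + self.eps) - U_prev)))
        self.budget -= float(A_t.sum())
        self.t += 1
        done = self.t > self.T
        self.X_t = None if done else self.sample_demands(self.t)
        obs = None if done else (self.X_t.copy(), self.budget)
        return obs, reward, done
```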

3 Offline (Hindsight) Setting

If the supplier could postpone the decision-making until the end of the horizon $T$, it would have perfect knowledge of all demands $\mathbf{X}^{1},\dots,\mathbf{X}^{T}$. Let $\widetilde{X}_{i}=\sum_{\tau=1}^{T}X_{i}^{\tau}$ denote the total demand of agent $i$. Then, the supplier solves the following convex program, referred to as the Eisenberg-Gale program Eisenberg (1961), to maximize the log-NSW objective in (2) for allocations $\widetilde{\mathbf{A}}=(\widetilde{A}_{1},\dots,\widetilde{A}_{N})$:

$$\max_{\widetilde{\mathbf{A}}\geq\mathbf{0}}\;\sum_{i=1}^{N}w_{i}\log\Big(u(\widetilde{A}_{i},\widetilde{X}_{i})\Big)\quad\text{s.t.}\quad\sum_{i=1}^{N}\widetilde{A}_{i}\leq B. \qquad (11)$$

While the optimal solution to (11) may not be achievable in the sequential setting, it provides an upper bound on the log-NSW achieved by any online algorithm, and serves as a baseline for comparison. With the utility function in (1), allocating resources beyond the agent’s request does not increase the utility. Therefore, solving (11) is equivalent to the following

$$\begin{aligned}\max_{\widetilde{\mathbf{A}}}\quad&\sum_{i=1}^{N}w_{i}\log(\widetilde{A}_{i})\qquad(12)\\ \text{s.t.}\quad&0\leq\widetilde{A}_{i}\leq\widetilde{X}_{i},\;\forall i,\qquad\sum_{i=1}^{N}\widetilde{A}_{i}\leq B.\end{aligned}$$

Then, any distribution of agent $i$'s allocation $\widetilde{A}_{i}$ across the $T$ time steps that satisfies the demand constraint at each step would be an optimal allocation in hindsight $\mathbf{A}_{i}^{\text{hindsight}}$, i.e.,

$$\mathbf{A}_{i}^{\text{hindsight}}\in\Big\{(A_{i}^{1},\dots,A_{i}^{T}):0\leq A_{i}^{t}\leq X_{i}^{t},\;\forall t,\;\sum_{t=1}^{T}A_{i}^{t}=\widetilde{A}_{i}\Big\}. \qquad (13)$$

The optimal solution to (12) takes the form $\widetilde{A}_{i}^{\star}=\min\{\widetilde{X}_{i},\mu\}$, where $\mu$ is a function of the budget $B$ and demands $\widetilde{\mathbf{X}}$ such that $\sum_{i=1}^{N}\widetilde{A}_{i}^{\star}=B$. The solution can be efficiently derived by the water-filling algorithm in Algorithm 1, and the threshold $\mu$ can be interpreted as the water level shown in Fig. 1(a).

Figure 1: Water-filling solution for equal unit weights. Each bar shows the demands or allocations of an agent. (a) In hindsight, the resource is allocated using water-level $\mu$, determined such that the total allocations equal the budget size. (b) In the sequential setting, at time $t$, the cumulative prior allocations and total expected future demands are accounted for when determining water-level $\mu_{t}$.
Algorithm 1 Water-filling algorithm with weights for solving (12)
1:  Input: number of agents $N$, resource budget $B$, demand vector $\widetilde{\mathbf{X}}\in\mathbb{R}^{N}$, weight vector $\mathbf{w}\in\mathbb{R}^{N}$
2:  Output: allocation vector $\widetilde{\mathbf{A}}\in\mathbb{R}^{N}$
3:  Find an ordered index set $\{i_{1},\dots,i_{N}\}$ by sorting the agents such that $\frac{\widetilde{X}_{i_{1}}}{w_{i_{1}}}\leq\dots\leq\frac{\widetilde{X}_{i_{N}}}{w_{i_{N}}}$
4:  $\alpha_{j}=\frac{w_{i_{j}}}{w_{i_{j}}+\dots+w_{i_{N}}}$ for $j=1,\dots,N$
5:  $j\leftarrow 1$
6:  while $j\leq N$ and $B>0$ do
7:     if $B\leq\widetilde{X}_{i_{j}}/\alpha_{j}$ then
8:        $\widetilde{A}_{i_{k}}=\frac{w_{i_{k}}}{w_{i_{j}}}\,\alpha_{j}B$ for $k=j,\dots,N$
9:        break
10:     else
11:        $\widetilde{A}_{i_{j}}=\widetilde{X}_{i_{j}}$
12:        $B\leftarrow B-\widetilde{A}_{i_{j}}$
13:        $j\leftarrow j+1$
14:     end if
15:  end while
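A runnable sketch of Algorithm 1 (names are illustrative): agents are fully served in increasing order of $\widetilde{X}_i/w_i$, and once the budget binds the remainder is split across the still-unserved agents proportionally to their weights.

```python
import numpy as np

def water_filling(B, X, w):
    """Weighted water-filling for the hindsight problem (12), following Algorithm 1."""
    X, w = np.asarray(X, float), np.asarray(w, float)
    A = np.zeros(len(X))
    order = np.argsort(X / w)                       # line 3: sort by X_i / w_i
    for j, i in enumerate(order):
        if B <= 0:
            break
        rest = order[j:]
        alpha_j = w[i] / w[rest].sum()              # line 4
        if B <= X[i] / alpha_j:                     # line 7: the budget binds here
            A[rest] = B * w[rest] / w[rest].sum()   # split what is left proportionally
            break
        A[i] = X[i]                                 # lines 11-12: fully satisfy agent i
        B -= X[i]
    return A

# Equal weights, budget 6, demands (1, 2, 10) -> allocations (1, 2, 3), water level 3.
print(water_filling(6.0, [1.0, 2.0, 10.0], [1.0, 1.0, 1.0]))
```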

4 Sequential Setting: Heuristic Algorithm

In this section, we propose an intuitive and computationally efficient algorithm, named Sequential Allocations with Fairness based on Future Estimates (SAFFE), and its variant SAFFE-Discounted. This online algorithm uses Algorithm 1 as a sub-routine to compute the allocations at each step $t$.

4.1 SAFFE Algorithm

SAFFE is based on the simple principle of substituting unobserved future requests with their expectations. By doing so, we convert the sequential decision-making problem into solving the offline problem at each time step. With the algorithm formally stated in Algorithm 2, at time step $t$ we use the expected future demands to determine the total resources we expect to allocate to each agent by time $T$. This allows us to reserve a portion of the available budget for future demands. Specifically, at $t=1,\dots,T$, we solve the following problem

$$\begin{aligned}\max_{\mathbf{C}^{t}}\quad&\sum_{i=1}^{N}\mathbbm{1}\{Y_{i}^{t}>0\}\cdot w_{i}\log(\widetilde{A}_{i}^{t}+C_{i}^{t})\qquad(14)\\ \text{s.t.}\quad&0\leq C_{i}^{t}\leq Y_{i}^{t},\;\forall i,\qquad\sum_{i=1}^{N}C_{i}^{t}\leq B^{t},\\ \text{where}\quad&Y_{i}^{t}=X_{i}^{t}+\sum_{\tau=t+1}^{T}\mathbb{E}[X_{i}^{\tau}],\quad i=1,\dots,N.\qquad(15)\end{aligned}$$

The indicator function $\mathbbm{1}\{Y_{i}^{t}>0\}$ ensures that we do not consider absent agents, i.e., those with no current or expected future demands (with a slight abuse of notation, in the objective function we take $0\times(-\infty)=0$ to ensure that an agent without demand is not included and gets no allocation). $C_{i}^{t}$ denotes the total allocation for agent $i$ over the period $t,\dots,T$ if the future demands were to arrive exactly as their expectations. In other words, $C_{i}^{t}$ consists of the allocation in the current time step and the reserved allocations for the future. Then, the current allocation $A_{i}^{t}$ is the fraction of $C_{i}^{t}$ computed as

$$A_{i}^{t}=C_{i}^{t}\,\frac{X_{i}^{t}}{Y_{i}^{t}}. \qquad (16)$$

Similar to the hindsight problem (12), we can efficiently solve (14) using a variant of the water-filling algorithm, given as Algorithm 3 in Appendix C. As illustrated in Fig. 1(b), at time step $t$, the water-level $\mu_{t}$ is calculated while accounting for all previous agent allocations and all expected future demands.

Algorithm 2 SAFFE Algorithm
1:  Input: number of agents $N$, resource budget $B$, demand vectors $\mathbf{X}^{1},\dots,\mathbf{X}^{T}\in\mathbb{R}^{N}$, weight vector $\mathbf{w}\in\mathbb{R}^{N}$, demand distributions $P_{\mathbf{X}_{1}},\dots,P_{\mathbf{X}_{N}}$
2:  Output: allocation vectors $\mathbf{A}^{1},\dots,\mathbf{A}^{T}\in\mathbb{R}^{N}$
3:  for $t=1,\dots,T$ do
4:     $Y_{i}^{t}=X_{i}^{t}+\mathbb{E}[\sum_{\tau=t+1}^{T}X_{i}^{\tau}]$ for $i=1,\dots,N$
5:     $\widetilde{A}^{t}_{i}=\sum_{\tau=1}^{t-1}A_{i}^{\tau}$ for $i=1,\dots,N$
6:     $\mathbf{C}^{t}\leftarrow$ Algorithm 3 in Appendix C with input ($N$, $B$, $\mathbf{Y}^{t}$, $\mathbf{w}$, $\widetilde{\mathbf{A}}^{t}$)
7:     $A_{i}^{t}=C_{i}^{t}X_{i}^{t}/Y_{i}^{t}$ for $i=1,\dots,N$
8:     $B\leftarrow B-\sum_{i=1}^{N}A_{i}^{t}$
9:  end for
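A sketch of a single SAFFE step in NumPy: rather than transcribing Algorithm 3 line by line, it solves the concave program (14) through its KKT conditions by bisection on a water level, and then applies the proportional split (16). The expected future demands are assumed to be supplied as an array; names are illustrative.

```python
import numpy as np

def saffe_step(X_t, mu_future, past_alloc, B_t, w, tol=1e-9):
    """One SAFFE step: solve (14) for C^t by bisection on a water level, then
    split the current allocation A^t proportionally as in (16).
    mu_future[i] = sum_{tau > t} E[X_i^tau]; past_alloc[i] = sum_{tau < t} A_i^tau."""
    Y = X_t + mu_future                                   # Eq. (15)
    active = Y > 0

    def allocated(level):                                 # C_i(level) = clip(w_i*level - past_i, 0, Y_i)
        return np.sum(np.clip(w * level - past_alloc, 0.0, Y)[active])

    target = min(B_t, Y[active].sum())                    # budget binds, or all demands are met
    lo, hi = 0.0, (past_alloc + Y).max() / w.min()
    while hi - lo > tol:                                  # find the water level by bisection
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if allocated(mid) < target else (lo, mid)
    C = np.where(active, np.clip(w * hi - past_alloc, 0.0, Y), 0.0)
    A_t = np.where(active, C * X_t / np.where(active, Y, 1.0), 0.0)  # Eq. (16)
    return A_t, C
```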

4.2 SAFFE-Discounted Algorithm

While SAFFE is simple and easy to interpret, it is sub-optimal. In fact, SAFFE maximizes an upper bound of the optimal Q-values of the MDP defined in Sec. 2.2. This implies that SAFFE overestimates the expected reward gain from the future and reserves the budget excessively for future demands. To correct the over-reservation, we propose SAFFE-Discounted (SAFFE-D), which penalizes uncertain future requests by their standard deviations. At every step $t$, the algorithm computes

$$Y_{i}^{t}={X}_{i}^{t}+\sum_{\tau=t+1}^{T}\Big(\mathbb{E}[{X}_{i}^{\tau}]-\lambda\operatorname{std}({X}_{i}^{\tau})\Big)^{+} \qquad (17)$$

for some regularization parameter $\lambda\geq 0$, and solves (14). As in SAFFE, the current allocation $A_{i}^{t}$ is split proportionally from $C_{i}^{t}$ according to (16). The regularizer $\lambda$ is a hyper-parameter that can be tuned to provide the optimal trade-off between consuming too much of the budget at the current step and reserving too much for the future. We show in Appendix D that the uncertainty in future demands reduces as we approach $T$, so we expect better performance with a decreasing time-dependent schedule such as $\lambda(t)=\sqrt{T-t}\,\lambda$ for some $\lambda>0$. Alternatively, $\lambda(t)$ can be learned as a function of time.
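A small sketch of the discounted future-demand estimate in (17) with the time-decaying schedule $\lambda(t)=\lambda\sqrt{T-t}$ (array names are illustrative):

```python
import numpy as np

def discounted_future_demand(X_t_i, mu_future, sigma_future, lam, t, T):
    """Y_i^t in Eq. (17): future demand means discounted by lambda(t) standard
    deviations and truncated at zero, with lambda(t) = lam * sqrt(T - t).
    mu_future / sigma_future cover the remaining steps tau = t+1, ..., T."""
    lam_t = lam * np.sqrt(T - t)
    return X_t_i + float(np.sum(np.clip(mu_future - lam_t * sigma_future, 0.0, None)))
```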

Remark 1.

In this paper, we assume access to historical data from which expected future demands can be estimated, and we mainly focus on the decision-making aspect for allocations. These estimates are directly used by SAFFE and SAFFE-D. We empirically study how sensitive the algorithms are to estimation errors in Sec. 6.3.

5 On the Sub-Optimality of SAFFE-D

In order to determine how sub-optimal SAFFE-D is, we upper bound the performance gap between SAFFE-D and the hindsight solution in terms of $\Delta\mathbf{A}^{\text{max}}$ defined in (3). Based on (Sinclair et al., 2020, Lemma 2.7), this bound translates, by Lipschitz continuity arguments, into bounds of the same order on the Pareto-efficiency, envy-freeness, and proportionality of SAFFE-D's solution. To this end, we consider two demand settings: a worst-case scenario where we allow highly unbalanced demands across agents, and a balanced-demand scenario where we assume equal expectation and variance across agents (see Appendix D for details). In the worst-case scenario, we assume that there is one agent whose requests are much larger than those of all other agents, so that an increase in the request of any other agent comes at her expense. This is a fairly general case: the only assumption we make is knowledge of the mean and variance of the future demand distributions. Our arguments rely on concentration inequalities and are similar in spirit to the idea of guardrails in Sinclair et al. (2022). The following theorem provides the gap between the hindsight and SAFFE-D solutions for $\lambda(t)=\sqrt{(T-t)/\xi}$. For brevity, we impose the assumption of equal variance of $X^{t}_{i}$ across all agents, which can be relaxed for more general settings as in Remark 2 of Appendix E.

Theorem 1 (Gap between SAFFE-D and Hindsight).

With probability at least $1-\xi$, the gap between SAFFE-D and hindsight measured by $\Delta\mathbf{A}^{\text{max}}$ introduced in (3) satisfies

$$\Delta\mathbf{A}^{\text{max}}\leq\frac{2T^{3/2}}{\sqrt{\xi}}\operatorname{std}(X^{t}_{i}), \qquad (18)$$

for balanced demands. In the worst-case scenario, this bound is $\frac{NT^{3/2}}{\sqrt{\xi}}\operatorname{std}(X^{t}_{i})$, which scales with the number of agents $N$.

Proof.

The detailed proof is in Appendix H. Let $\mathbf{A}^{\text{oracle}}$ denote the allocations derived using SAFFE-D in a setting where an oracle has perfect knowledge of the sequence of incoming demands, i.e., with no stochasticity. We refer to this setup as SAFFE-Oracle. We upper bound $\Delta\mathbf{A}^{\text{max}}$, the distance between $\mathbf{A}^{\text{SAFFE-D}}$ and $\mathbf{A}^{\text{hindsight}}$, in two steps: by bounding the distance between SAFFE-D and SAFFE-Oracle (Appendix D) and between SAFFE-Oracle and hindsight (Appendix F). The triangle inequality then completes the proof, upper bounding the distance $\Delta\mathbf{A}^{\text{max}}$ between SAFFE-D and hindsight by the sum of these two distances. ∎
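For a rough sense of scale, the bound in Theorem 1 can be evaluated numerically; the numbers below are illustrative and not taken from the experiments.

```python
import numpy as np

def delta_A_max_bound(T, std, xi, N=None):
    """Theorem 1: with probability at least 1 - xi, Delta A^max is at most
    2 T^{3/2} std / sqrt(xi) for balanced demands; pass N for the worst-case
    bound N T^{3/2} std / sqrt(xi)."""
    base = T ** 1.5 * std / np.sqrt(xi)
    return 2.0 * base if N is None else N * base

# Illustrative numbers: T = 40 steps, per-step std 10, confidence level 95%.
print(delta_A_max_bound(T=40, std=10.0, xi=0.05))
```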

6 Experimental Results

In this section, we use synthetic data to evaluate SAFFE-D under different settings and compare its performance against baseline sequential algorithms in terms of fairness metrics and budget allocation efficiency. In the experiments, we study: 1) how the fairness of SAFFE-D allocations is affected as the budget size, number of agents and time horizon vary (Sec. 6.1), 2) whether the algorithm favors agents depending on their arrival times or demand sizes (Sec. 6.2), 3) how sensitive the algorithm is to future demand estimation errors (Sec. 6.3), 4) how the discounting in SAFFE-D improves fairness (Sec. 6.4), and 5) how SAFFE-D compares to allocation policies learned using reinforcement learning (RL) (Sec. 6.5). Finally, in Sec. 6.6, we evaluate SAFFE-D on real data.

Evaluation Metrics

We consider the following metrics to compare the fairness and efficiency of sequential allocation algorithms:

  • Log-NSW: Expected log-NSW in hindsight as in Eq. (2). Since the value of log-NSW is not directly comparable across different settings, we use its normalized distance to the hindsight log-NSW when comparing algorithms, denoted by $\Delta$Log-NSW.

  • Utilization (%): The fraction of the available budget distributed to the agents over the horizon. Due to the stochasticity in the experiments described next, we may have an over-abundance of initial budget that exceeds the total demands over the horizon. Therefore, we define this metric by only considering the required fraction of the available budget $B$ as

    $$\frac{\sum_{i=1}^{N}\sum_{t=1}^{T}A_{i}^{t}}{\min\{B,\sum_{i=1}^{N}\sum_{t=1}^{T}X_{i}^{t}\}}\times 100.$$

  • $\Delta\mathbf{A}^{\text{mean}}$ and $\Delta\mathbf{A}^{\text{max}}$: The average and maximum normalized deviation of per-agent cumulative allocations compared to hindsight allocations, as defined in Eq. (3). For better scaling in visualizations, we make a slight change by normalizing these metrics with respect to hindsight as $\frac{\Delta A_{i}}{\sum_{t=1}^{T}A_{i}^{t,\text{hindsight}}}$. Since these metrics measure distance, an algorithm with lower $\Delta\mathbf{A}$ is considered more fair in terms of this metric. A short computation sketch of these metrics is given after this list.
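A minimal sketch of these metrics for a single realization (the paper reports averages over realizations; array names are illustrative, and every agent is assumed to receive a positive hindsight allocation):

```python
import numpy as np

def evaluation_metrics(A_online, A_hindsight, X, B):
    """Sec. 6 metrics for one realization: Utilization (%) and the normalized
    Delta A^mean / Delta A^max. A_online, A_hindsight, X are (N, T) arrays."""
    utilization = 100.0 * A_online.sum() / min(B, X.sum())
    cum_on, cum_hs = A_online.sum(axis=1), A_hindsight.sum(axis=1)
    delta = np.abs(cum_hs - cum_on) / cum_hs       # normalized per-agent deviation
    return {"Utilization %": utilization,
            "Delta A mean": float(delta.mean()),
            "Delta A max": float(delta.max())}
```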

Allocation Algorithm Baselines

We compare our algorithms SAFFE and SAFFE-D with the following baselines:

  • Hindsight: As discussed in Sec. 3, the solution to (12) represents a baseline for evaluating sequential algorithms since its Log-NSW provides an upper bound for other algorithms.

  • HOPE-Online: Motivated by the mobile food-bank allocation problem, this algorithm was proposed in Sinclair et al. (2020) for a setting where $N$ agent demands are sequentially revealed over $T=N$ time steps. As we explain in Appendix I, HOPE-Online coincides with SAFFE in the special case where each agent makes only one request.

  • Guarded-HOPE: It was proposed in Sinclair et al. (2022) to achieve the correct trade-off between fairness (in terms of envy-freeness) and efficiency, with an input parameter $L_{T}$ derived based on concentration arguments. The algorithm is designed for a setting with one request per individual over the horizon. We modify Guarded-HOPE (Algorithm 4 in Appendix J) to be applicable to our setting and to provide meaningful confidence bounds for the multiple-demand case. As in Sinclair et al. (2022), we use $L_{T}=T^{-1/2}$ and $L_{T}=T^{-1/3}$.

Demand Processes

We investigate various settings by considering the following demand processes, for which we are able to analytically derive the future demand means and standard deviations:

  • Symmetric Setting (Homogeneous Bernoulli arrivals with Normal demands). At time $t$, agent $i$ makes a request $X_{i}\sim N(\mu_{i},\sigma_{i}^{2})$ with probability $p$, independently of the other agents. The distribution parameters are such that $\mu_{i}\sim\text{Uniform}(10,100)$ and $\sigma_{i}=\mu_{i}/5$. We study regimes where $p=\frac{c}{T}$, and $1\leq c\leq T$ is the average number of arrivals per agent. A simulation sketch of this setting is given after this list.

  • Non-symmetric Arrivals Setting (Inhomogeneous Bernoulli arrivals with Normal demands). In this setup, we assume that the likelihood of arrivals over the horizon varies across agents, which we use to evaluate whether the algorithms favor agents based on their arrival times. We implement this using Bernoulli arrivals with a time-varying parameter $p$. We consider three groups. Ask-Early: one third of the agents are more likely to visit earlier in the horizon, with $p^{t}_{i}\propto(T-t)$; Ask-Late: one third of the agents are more likely to visit later in the horizon, with $p^{t}_{i}\propto t$; and Uniform: the remaining agents have no preference in terms of time, with constant $p^{t}_{i}$. The Bernoulli parameters of the three groups are normalized to have equal expected numbers of visits.

  • Non-symmetric Demands Setting (Homogeneous Bernoulli arrivals with time-varying Normal demands). We consider homogeneous arrivals but with time-varying demand sizes across agents. We have three groups of agents that make requests with the same probability $p$ over $T$. More-Early: one third of the agents have Normal demands with mean $\mu^{t}_{i}\propto(T-t)$; More-Late: one third of the agents have $\mu_{i}^{t}\propto t$; and Uniform: the remaining agents have constant $\mu_{i}^{t}$. The parameters are normalized so that the total expected demands are the same across the groups. The standard deviation $\sigma_{i}$ is assumed to be time-independent for all agents.
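A simulation sketch of the Symmetric setting (assumptions of this sketch: negative Normal draws are clipped at zero, and the seed and helper names are illustrative):

```python
import numpy as np

def sample_symmetric_demands(N, T, c=2.0, seed=0):
    """Symmetric setting: homogeneous Bernoulli(p) arrivals with p = c/T and
    Normal demand sizes; returns an (N, T) demand matrix plus the per-agent
    means and standard deviations used by SAFFE / SAFFE-D."""
    rng = np.random.default_rng(seed)
    p = c / T
    mu = rng.uniform(10, 100, size=N)            # per-agent demand means
    sigma = mu / 5.0
    arrivals = rng.random((N, T)) < p            # Bernoulli arrival indicators
    sizes = np.clip(rng.normal(mu[:, None], sigma[:, None], size=(N, T)), 0.0, None)
    return arrivals * sizes, mu, sigma
```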

Real Data

We evaluate SAFFE-D using a real dataset (https://www.kaggle.com/c/demand-forecasting-kernels-only) containing five years of store-item sales data for $10$ stores. We assume that each store (agent) requests its total daily sales from a warehouse (supplier) with limited inventory, and the warehouse aims to be fair when supplying the stores on a weekly basis, i.e., $T=7$. We use the first three years to estimate the expected demands and standard deviations for each weekday, and we use the remaining two years to evaluate SAFFE-D for fair daily allocations.

We report the results on synthetic data as an average over $200$ experiment realizations, with one standard deviation. For SAFFE-D, we use $\lambda(t)=\lambda\sqrt{T-t}$, where the hyper-parameter $\lambda$ is optimized with respect to Log-NSW. We express the budget size $B$ as a fraction of the total expected demands, such that a budget of $0.1$ means enough budget to meet only $10\%$ of the total expected demands. We assume equal weights across all agents.

6.1 Scaling System Parameters

In Fig. 2(a), we compare SAFFE-D with the baselines as the supplier's budget increases from $0.1$ to $1$. We consider the Symmetric setting with $N=50$ agents and $2$ expected arrivals per agent over horizon $T=40$. We observe that SAFFE-D outperforms all other approaches in terms of both $\Delta$Log-NSW and resource utilization across all budgets. The advantage of SAFFE-D in comparison to the other algorithms is that it prevents over-reserving for the future, and thus achieves higher utilization and lower $\Delta$Log-NSW. In contrast, SAFFE does not impose the regularization term, which can result in overestimating future demands. Guarded-HOPE is designed to find a balance between utilization and envy, which does not necessarily mean balance in terms of Log-NSW. Additional experiments showing how the performance of the algorithms scales with the number of agents $N$, the time horizon $T$, and the per-agent expected arrivals are provided in Appendix K.1. The results suggest that SAFFE-D, although suboptimal, performs close to Hindsight.

Figure 2: Symmetric Setting with $N=50$ and $T=40$. (a) Varying budget size $B$: SAFFE-D performs close to Hindsight, achieving high utilization and low $\Delta$Log-NSW, $\Delta\mathbf{A}^{\text{mean}}$ and $\Delta\mathbf{A}^{\text{max}}$. (b) Sensitivity to estimation errors: SAFFE-D is robust to errors in the expected demand estimates.

6.2 Non-Symmetric Arrivals or Demands

We investigate whether SAFFE-D allocates to agents differently based on their arrival or demand patterns. We consider $N=50$ agents with $2$ expected arrivals over the horizon $T=40$, Normal demands with $\mu_{i}=50$, and budget size $0.5$. Under the Non-symmetric Arrivals setting, we compare the agents' allocations for each algorithm with respect to the Hindsight allocations in Fig. 3. In this setting, agents have different likelihoods of arriving over the horizon. We observe that despite being optimized for the log-NSW objective, SAFFE-D outperforms all other algorithms on average, and is more uniform across the three groups compared to SAFFE and Guarded-HOPE, which favor the Ask-Late agents. In terms of the worst-case agent, SAFFE outperforms all other methods and has similar $\Delta\mathbf{A}^{\text{max}}$ across all three groups. The algorithms are compared under the Non-symmetric Demands setting in Appendix K.2.

Figure 3: Non-symmetric Arrivals Setting: SAFFE-D outperforms in the average case across groups with different arrival patterns, and SAFFE is more uniform in the worst case.
Table 1: Improvement of SAFFE-D with choice of $\lambda$.

| Method | Log-NSW ($\uparrow$) | Utilization % ($\uparrow$) | $\Delta\mathbf{A}^{\text{mean}}$ ($\downarrow$) | $\Delta\mathbf{A}^{\text{max}}$ ($\downarrow$) |
| Hindsight | 236.57 $\pm$ 10.21 | 100.0 $\pm$ 0.0 | - | - |
| SAFFE-D ($\lambda(t)$) | 235.91 $\pm$ 10.26 | 99.45 $\pm$ 1.5 | 0.05 $\pm$ 0.02 | 0.45 $\pm$ 0.23 |
| SAFFE-D ($\lambda$) | 234.12 $\pm$ 10.67 | 97.86 $\pm$ 3.5 | 0.09 $\pm$ 0.03 | 0.65 $\pm$ 0.22 |
| SAFFE ($\lambda=0$) | 232.89 $\pm$ 10.91 | 95.77 $\pm$ 4.5 | 0.11 $\pm$ 0.03 | 0.69 $\pm$ 0.21 |

6.3 Sensitivity to Estimation Errors

We investigate how sensitive the algorithms are to estimation errors by scaling each agent's mean estimate at each step as $\widehat{\mu}_{i}^{t}=(1+\delta)\mu_{i}^{t}$, where $\delta$ denotes the noise level. Fig. 2(b) shows the change in performance as we vary the noise level $\delta$ from $-0.5$ to $0.5$, for the Symmetric setting with $N=50$ agents, $2$ expected per-agent arrivals, $T=40$, and budget $0.5$. We observe that SAFFE-D is robust to mean estimation noise, while all other algorithms are impacted. This results from the discounting term in SAFFE-D, whose hyper-parameter $\lambda$ can be tuned to the estimation errors. When over-estimating the future demands, SAFFE and the other methods are initially more conservative, which can lead to lower utilization and less fair allocations due to the leftover budget. When under-estimating the expected demands, they use the budget more greedily early on and deplete it sooner, resulting in smaller allocations to agents arriving later (especially for Guarded-HOPE).

6.4 Choice of $\lambda$ for SAFFE-D

The discounting term in SAFFE-D uses the confidence in the estimates to improve the performance of SAFFE by balancing the budget between allocating now vs. reserving for future arrivals. In Table 1, we compare the performance of SAFFE-D when using different discounting functions, i.e., a constant or a decreasing $\lambda$ over the horizon. In both cases, the parameter is tuned to maximize Log-NSW. As expected from Sec. 4.2, the performance of SAFFE-D improves for $\lambda(t)$ decreasing over time. This is because the uncertainty in expected future demands reduces over time, and the supplier can be less conservative when making allocation decisions.

Table 2: SAFFE-D vs. the RL allocation policy in the Symmetric setting for $N=10$ agents with different per-agent arrivals.

2 Per-agent Arrivals (Sparse):
| Method | Log-NSW ($\uparrow$) | Utilization % ($\uparrow$) | $\Delta\mathbf{A}^{\text{mean}}$ ($\downarrow$) | $\Delta\mathbf{A}^{\text{max}}$ ($\downarrow$) |
| Hindsight | 35.74 $\pm$ 1.23 | 100.0 $\pm$ 0.0 | - | - |
| RL Policy (SAC) | 35.11 $\pm$ 1.20 | 95.63 $\pm$ 0.49 | 0.13 $\pm$ 0.01 | 0.43 $\pm$ 0.01 |
| SAFFE-D | 35.01 $\pm$ 1.28 | 99.54 $\pm$ 0.42 | 0.15 $\pm$ 0.05 | 0.51 $\pm$ 0.16 |
| SAFFE | 34.77 $\pm$ 1.19 | 92.74 $\pm$ 0.41 | 0.16 $\pm$ 0.02 | 0.53 $\pm$ 0.03 |

4 Per-agent Arrivals (Dense):
| Method | Log-NSW ($\uparrow$) | Utilization % ($\uparrow$) | $\Delta\mathbf{A}^{\text{mean}}$ ($\downarrow$) | $\Delta\mathbf{A}^{\text{max}}$ ($\downarrow$) |
| Hindsight | 47.37 $\pm$ 2.20 | 100.0 $\pm$ 0.0 | - | - |
| RL Policy (SAC) | 47.00 $\pm$ 2.20 | 98.30 $\pm$ 0.45 | 0.12 $\pm$ 0.01 | 0.34 $\pm$ 0.02 |
| SAFFE-D | 47.12 $\pm$ 2.22 | 99.82 $\pm$ 0.36 | 0.11 $\pm$ 0.04 | 0.32 $\pm$ 0.09 |
| SAFFE | 46.87 $\pm$ 2.15 | 97.14 $\pm$ 0.43 | 0.12 $\pm$ 0.02 | 0.35 $\pm$ 0.03 |
Table 3: Experiments on real data for daily per-agent demands ($p=1$) and sparser demand arrivals ($p=0.5$).

$p=1$:
| Method | Log-NSW ($\uparrow$) | Utilization % ($\uparrow$) | $\Delta\mathbf{A}^{\text{mean}}$ ($\downarrow$) | $\Delta\mathbf{A}^{\text{max}}$ ($\downarrow$) |
| Hindsight | 42.72 $\pm$ 0.00 | 100.0 $\pm$ 0.0 | - | - |
| SAFFE-D | 42.72 $\pm$ 0.00 | 100.0 $\pm$ 0.0 | 0 | 0 |
| SAFFE | 42.72 $\pm$ 0.02 | 99.98 $\pm$ 0.15 | 0.01 $\pm$ 0.01 | 0.01 $\pm$ 0.03 |
| HOPE-Online | 42.71 $\pm$ 0.02 | 99.98 $\pm$ 0.15 | 0.10 $\pm$ 0.04 | 0.04 $\pm$ 0.01 |
| Guarded-HOPE ($L_{T}=T^{-1/2}$) | 42.48 $\pm$ 0.30 | 97.93 $\pm$ 2.80 | 0.07 $\pm$ 0.02 | 0.16 $\pm$ 0.05 |
| Guarded-HOPE ($L_{T}=T^{-1/3}$) | 42.55 $\pm$ 0.21 | 98.64 $\pm$ 1.94 | 0.07 $\pm$ 0.02 | 0.15 $\pm$ 0.05 |

$p=0.5$:
| Method | Log-NSW ($\uparrow$) | Utilization % ($\uparrow$) | $\Delta\mathbf{A}^{\text{mean}}$ ($\downarrow$) | $\Delta\mathbf{A}^{\text{max}}$ ($\downarrow$) |
| Hindsight | 35.24 $\pm$ 0.94 | 100.0 $\pm$ 0.0 | - | - |
| SAFFE-D | 35.16 $\pm$ 0.94 | 100.0 $\pm$ 0.0 | 0.06 $\pm$ 0.04 | 0.17 $\pm$ 0.14 |
| SAFFE | 34.82 $\pm$ 1.10 | 99.16 $\pm$ 2.94 | 0.17 $\pm$ 0.08 | 0.37 $\pm$ 0.16 |
| HOPE-Online | 34.44 $\pm$ 1.10 | 99.10 $\pm$ 3.03 | 0.27 $\pm$ 0.06 | 0.60 $\pm$ 0.15 |
| Guarded-HOPE ($L_{T}=T^{-1/2}$) | 33.55 $\pm$ 3.67 | 95.93 $\pm$ 6.24 | 0.27 $\pm$ 0.06 | 0.62 $\pm$ 0.15 |
| Guarded-HOPE ($L_{T}=T^{-1/3}$) | 33.59 $\pm$ 3.67 | 96.54 $\pm$ 5.61 | 0.28 $\pm$ 0.07 | 0.63 $\pm$ 0.15 |

6.5 Reinforcement Learning Results

We investigate learning an allocation strategy using RL under the Symmetric setting. We train the RL allocation policy without access to any demand distribution information (e.g., expectations). We consider $N=10$ agents, $T=10$, and a varying budget size sampled from $\text{Uniform}(0.4,0.8)$. We use the Soft Actor-Critic (SAC) Haarnoja et al. (2018a); Achiam (2018) method with automatic entropy tuning Haarnoja et al. (2018b) to learn a policy (see Appendix L). The results, averaged over $200$ experiment rollouts for five random seeds, are reported in Table 2 for two cases where we expect $2$ and $4$ per-agent arrivals. We observe that while the RL policy is not able to match the hindsight performance, it outperforms SAFFE-D and the other approaches in terms of Log-NSW under sparse ($2$ per-agent) arrivals. With denser ($4$ per-agent) arrivals, the RL policy performs slightly worse than SAFFE-D while still outperforming the others. These observations are in line with the other experiments, illustrating the close-to-optimal performance of SAFFE-D, especially in settings with more arrivals.

6.6 Real Data

In contrast to the demand processes considered in the previous experiments, in this case we need to estimate the expected demands and their standard deviations. The efficiency and fairness metrics are reported in Table 3 for budget size $0.5$. We observe that SAFFE-D is optimal, and SAFFE and HOPE-Online perform very close to Hindsight in terms of Utilization and Log-NSW. As shown in Fig. 4(c) in Appendix K.1, we expect SAFFE-D to achieve (close to) hindsight performance for dense agent arrivals, as with the daily arrivals in this setting. In order to investigate less dense scenarios, we impose sparsity by retaining each arrival with probability $p$. As observed in Table 3, SAFFE-D is no longer optimal but outperforms all other methods. Additional results are provided in Appendix K.3.

7 Related Work

Fair division is a classic problem extensively studied for settings involving single and multiple resources, divisible and indivisible resources, and under different fairness objectives Bretthauer and Shetty (1995); Katoh and Ibaraki (1998); Lee and Lee (2005); Patriksson (2008). In the online setting, which requires real-time decision-making, Walsh (2011); Kash et al. (2014) consider the cake cutting problem where agents interested in fixed divisible resources arrive and depart over time, while the goal in Aleksandrov et al. (2015); Zeng and Psomas (2020); He et al. (2019) is to allocate sequentially arriving indivisible resources to a fixed set of agents. A wide range of fairness metrics are used in the literature, including global fairness (e.g., utilitarian or egalitarian welfare) Kalinowski et al. (2013); Manshadi et al. (2021), individual measures Benade et al. (2018); Gupta and Kamble (2021); Sinclair et al. (2022) or probabilistic notions of fairness across groups of agents Donahue and Kleinberg (2020).

The works most relevant to ours are Lien et al. (2014); Sinclair et al. (2020, 2022), which study the online allocation of divisible resources to agents with stochastic demands. Lien et al. (2014) proposes equitable policies that maximize the minimum agent utility, while Sinclair et al. (2020, 2022) use approximate individual fairness notions. Motivated by the food-bank allocation problem, Sinclair et al. (2020, 2022) assume one request per agent, where the requests are restricted to a finite set of values. The algorithm proposed in Sinclair et al. (2022) improves upon Sinclair et al. (2020) and achieves a trade-off between a measure of fairness (envy-freeness) and resource leftover using concentration arguments on the optimal NSW solution. In our work, we study a more general setup where agents can arrive simultaneously and several times over the horizon, and we do not restrict agent demand sizes. Our goal is to be fair to the agents considering all their allocations over the horizon under the NSW objective in hindsight.

8 Conclusions

This paper studies the problem of fair resource allocation in the sequential setting, and mathematically formulates the NSW fairness objective under the MDP framework. With the goal of maximizing the return in the MDP, we propose SAFFE-D, a new algorithm that enjoys a worst-case theoretical sub-optimality bound. Through extensive experiments, we show that SAFFE-D significantly outperforms existing approaches, and effectively balances the fraction of the budget that should be used at each step versus preserved for the future. Empirical results under various demand processes demonstrate the superiority of SAFFE-D for different budget sizes, numbers of agents, and time horizons, especially in settings with dense arrivals. The uncertainty-based discount in SAFFE-D also improves the robustness of allocations to errors in future demand estimates. Future work includes exploring learning-based allocation policies, learning the optimal $\lambda(t)$ in SAFFE-D, and extensions to settings with strategic agents.

Disclaimer

This paper was prepared for informational purposes in part by the Artificial Intelligence Research group of JPMorgan Chase & Co. and its affiliates (“JP Morgan”), and is not a product of the Research Department of JP Morgan. JP Morgan makes no representation and warranty whatsoever and disclaims all liability, for the completeness, accuracy or reliability of the information contained herein. This document is not intended as investment research or investment advice, or a recommendation, offer or solicitation for the purchase or sale of any security, financial instrument, financial product or service, or to be used in any way for evaluating the merits of participating in any transaction, and shall not constitute a solicitation under any jurisdiction or to any person, if such solicitation under such jurisdiction or to such person would be unlawful.

References

  • Abid et al. [2020] Adnan Abid, Muhammad Faraz Manzoor, Muhammad Shoaib Farooq, Uzma Farooq, and Muzammil Hussain. Challenges and issues of resource allocation techniques in cloud computing. KSII Transactions on Internet and Information Systems (TIIS), 14(7):2815–2839, 2020.
  • Achiam [2018] Joshua Achiam. Spinning Up in Deep Reinforcement Learning. 2018.
  • Aleksandrov et al. [2015] Martin Damyanov Aleksandrov, Haris Aziz, Serge Gaspers, and Toby Walsh. Online fair division: Analysing a food bank problem. In Twenty-Fourth International Joint Conference on Artificial Intelligence, 2015.
  • Balasubramanian et al. [2007] Aruna Balasubramanian, Brian Levine, and Arun Venkataramani. DTN routing as a resource allocation problem. pages 373–384, 2007.
  • Bellman [1966] Richard Bellman. Dynamic programming. Science, 153(3731):34–37, 1966.
  • Benade et al. [2018] Gerdus Benade, Aleksandr M Kazachkov, Ariel D Procaccia, and Christos-Alexandros Psomas. How to make envy vanish over time. In Proceedings of the 2018 ACM Conference on Economics and Computation, pages 593–610, 2018.
  • Bretthauer and Shetty [1995] Kurt M Bretthauer and Bala Shetty. The nonlinear resource allocation problem. Operations research, 43(4):670–683, 1995.
  • Cohen et al. [1965] Kalman J Cohen, Richard Michael Cyert, et al. Theory of the firm; resource allocation in a market economy. Prentice-Hall International Series in Management (EUA), 1965.
  • Donahue and Kleinberg [2020] Kate Donahue and Jon Kleinberg. Fairness and utilization in allocating resources with uncertain demand. In Proceedings of the 2020 conference on fairness, accountability, and transparency, pages 658–668, 2020.
  • Eisenberg and Gale [1959] Edmund Eisenberg and David Gale. Consensus of subjective probabilities: The pari-mutuel method. The Annals of Mathematical Statistics, 30(1):165–168, 1959.
  • Eisenberg [1961] Edmund Eisenberg. Aggregation of utility functions. Management Science, 7(4):337–350, 1961.
  • Furukawa [1972] Nagata Furukawa. Markovian decision processes with compact action spaces. The Annals of Mathematical Statistics, 43(5):1612–1622, 1972.
  • Gallego et al. [2015] Guillermo Gallego, Anran Li, Van-Anh Truong, and Xinshang Wang. Online resource allocation with customer choice. arXiv preprint arXiv:1511.01837, 2015.
  • Gupta and Kamble [2021] Swati Gupta and Vijay Kamble. Individual fairness in hindsight. J. Mach. Learn. Res., 22(144):1–35, 2021.
  • Haarnoja et al. [2018a] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the 35th International Conference on Machine Learning, volume 80, pages 1856–1865, 2018.
  • Haarnoja et al. [2018b] Tuomas Haarnoja, Aurick Zhou, Kristian Hartikainen, George Tucker, Sehoon Ha, Jie Tan, Vikash Kumar, Henry Zhu, Abhishek Gupta, Pieter Abbeel, and Sergey Levine. Soft actor-critic algorithms and applications. arXiv preprint, arXiv:1812.05905, 2018.
  • He et al. [2019] Jiafan He, Ariel D Procaccia, CA Psomas, and David Zeng. Achieving a fairer future by changing the past. IJCAI’19, 2019.
  • Kalinowski et al. [2013] Thomas Kalinowski, Nina Nardoytska, and Toby Walsh. A social welfare optimal sequential allocation procedure. arXiv preprint arXiv:1304.5892, 2013.
  • Kaneko and Nakamura [1979] Mamoru Kaneko and Kenjiro Nakamura. The Nash social welfare function. Econometrica: Journal of the Econometric Society, pages 423–435, 1979.
  • Kash et al. [2014] Ian Kash, Ariel D Procaccia, and Nisarg Shah. No agent left behind: Dynamic fair division of multiple resources. Journal of Artificial Intelligence Research, 51:579–603, 2014.
  • Katoh and Ibaraki [1998] Naoki Katoh and Toshihide Ibaraki. Resource allocation problems. Handbook of combinatorial optimization, pages 905–1006, 1998.
  • Kumar et al. [1995] Nirmalya Kumar, Lisa K Scheer, and Jan-Benedict EM Steenkamp. The effects of supplier fairness on vulnerable resellers. Journal of marketing research, 32(1):54–65, 1995.
  • Lee and Lee [2005] Zne-Jung Lee and Chou-Yuan Lee. A hybrid search algorithm with heuristics for resource allocation problem. Information sciences, 173(1-3):155–167, 2005.
  • Lien et al. [2014] Robert W Lien, Seyed MR Iravani, and Karen R Smilowitz. Sequential resource allocation for nonprofit operations. Operations Research, 62(2):301–317, 2014.
  • Manshadi et al. [2021] Vahideh Manshadi, Rad Niazadeh, and Scott Rodilitz. Fair dynamic rationing. In Proceedings of the 22nd ACM Conference on Economics and Computation, pages 694–695, 2021.
  • Nair et al. [2018] Arun Sukumaran Nair, Tareq Hossen, Mitch Campion, Daisy Flora Selvaraj, Neena Goveas, Naima Kaabouch, and Prakash Ranganathan. Multi-agent systems for resource allocation and scheduling in a smart grid. Technology and Economics of Smart Grids and Sustainable Energy, 3(1):1–15, 2018.
  • Nash [1950] John F Nash. The bargaining problem. Econometrica: Journal of the econometric society, pages 155–162, 1950.
  • Patriksson [2008] Michael Patriksson. A survey on the continuous nonlinear resource allocation problem. European Journal of Operational Research, 185(1):1–46, 2008.
  • Sinclair et al. [2020] Sean R Sinclair, Gauri Jain, Siddhartha Banerjee, and Christina Lee Yu. Sequential fair allocation of limited resources under stochastic demands. arXiv preprint arXiv:2011.14382, 2020.
  • Sinclair et al. [2022] Sean R Sinclair, Siddhartha Banerjee, and Christina Lee Yu. Sequential fair allocation: Achieving the optimal envy-efficiency tradeoff curve. In Abstract Proceedings of the 2022 ACM SIGMETRICS/IFIP PERFORMANCE Joint International Conference on Measurement and Modeling of Computer Systems, pages 95–96, 2022.
  • Varian [1973] Hal R Varian. Equity, envy, and efficiency. 1973.
  • Walsh [2011] Toby Walsh. Online cake cutting. In International Conference on Algorithmic Decision Theory, pages 292–305. Springer, 2011.
  • Yi et al. [2016] Peng Yi, Yiguang Hong, and Feng Liu. Initialization-free distributed algorithms for optimal resource allocation with feasibility constraints and application to economic dispatch of power systems. Automatica, 74:259–269, 2016.
  • Yilmaz et al. [2004] Cengiz Yilmaz, Bulent Sezen, and Ebru Tumer Kabadayı. Supplier fairness as a mediating factor in the supplier performance–reseller satisfaction relationship. Journal of Business Research, 57(8):854–863, 2004.
  • Zeng and Psomas [2020] David Zeng and Alexandros Psomas. Fairness-efficiency tradeoffs in dynamic fair division. In Proceedings of the 21st ACM Conference on Economics and Computation, pages 911–912, 2020.

 
Sequential Fair Resource Allocation under a Markov Decision Process Framework
Supplementary Materials


 

Appendix A Multiple-Resource Setting

The setting in this paper can be easily extended to a supplier with $M$ divisible resources, where resource $j\in\{1,\dots,M\}$ has a limited budget of size $B_{j}$, and $\mathbf{B}=(B_{1},\dots,B_{M})$. At time $t\in\{1,\dots,T\}$, agent $i$'s demand is denoted by ${X}^{t}_{i}=({X}^{t}_{i,1},\dots,{X}^{t}_{i,M})\in\mathbb{R}^{M}$, and the supplier's allocation is ${A}^{t}_{i}=({A}^{t}_{i,1},\dots,{A}^{t}_{i,M})\in\mathbb{R}^{M}$. For multiple resources, $\Theta_{i}=({X}_{i,j}^{1},\dots,{X}_{i,j}^{T},V_{i})$, where $V_{i}=(v_{i,1},\dots,v_{i,M})$ denotes the per-unit value that agent $i$ associates with each resource, and the utility is

$$u(\mathbf{A}_{i},\mathbf{\Theta}_{i})=\sum_{t=1}^{T}\sum_{j=1}^{M}v_{i,j}\min\left\{{A}_{i,j}^{t},\,{X}_{i,j}^{t}\right\}. \qquad (19)$$

While we present the paper in terms of a single resource, all algorithms and proofs can be extended to the multiple-resource setting. In this case, the optimal allocation in hindsight or in the sequential setting no longer has a closed-form solution as in the water-filling algorithm, but it can be computed using convex optimization techniques.
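As an illustration, the multi-resource hindsight program with utility (19) can be written down with the cvxpy modeling library; the library choice and function name are assumptions of this sketch, not part of the paper.

```python
import cvxpy as cp
import numpy as np

def hindsight_multi_resource(X, V, w, B, eps=1e-6):
    """Hindsight Eisenberg-Gale allocation with utility (19): maximize
    sum_i w_i log(sum_j v_ij min(A_ij, X_ij)) under per-resource budgets.
    X, V: (N, M) total demands and per-unit values; B: (M,) budgets."""
    N, M = X.shape
    A = cp.Variable((N, M), nonneg=True)
    served = cp.sum(cp.multiply(V, cp.minimum(A, X)), axis=1)   # concave in A
    objective = cp.Maximize(cp.sum(cp.multiply(w, cp.log(served + eps))))
    constraints = [cp.sum(A, axis=0) <= B]                      # per-resource budgets
    cp.Problem(objective, constraints).solve()
    return A.value
```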

Appendix B Existence of Optimal Policy

The state space of the MDP is continuous, the action space is continuous and state-dependent, and the reward is bounded and continuous due to the constant $\epsilon$ in Eq. (8). As a result of Furukawa [1972, Theorem 4.2], a stationary optimal policy $\pi^{\star}=(\pi_{1}^{\star},\dots,\pi_{T}^{\star})$ is known to exist.

If we restrict the action space to allocations that satisfy AitXitA_{i}^{t}\leq X_{i}^{t} for all ii and tt, the reward at step tt (7) becomes

Rt(st,𝐀t)=\displaystyle R_{t}(s_{t},\mathbf{A}^{t})=
i=1N𝟙{Xit>0}wi(log(A~it1+Ait+ϵ)log(A~it1+ϵ)),\displaystyle\quad\sum_{i=1}^{N}\mathbbm{1}\{X_{i}^{t}>0\}\cdot w_{i}\Big{(}\log(\widetilde{{A}}_{i}^{t-1}+{A}_{i}^{t}+\epsilon)-\log(\widetilde{{A}}_{i}^{t-1}+\epsilon)\Big{)}, (20)

where A~itτ=1tAiτ\widetilde{A}_{i}^{t}\coloneqq\sum_{\tau=1}^{t}A_{i}^{\tau} denotes the resources cumulatively allocated to agent ii up to time tt. At time step TT, πT\pi_{T}^{\star} maximizes QT(sT,𝐀T)=RT(sT,𝐀T)Q_{T}(s_{T},\mathbf{A}^{T})=R_{T}(s_{T},\mathbf{A}^{T}), which does not depend on future allocations. The optimal allocation 𝐀T{\mathbf{A}^{T}}^{\star} can be directly computed from QTQ_{T} since there is no uncertainty about future demands. Then, the optimal policy for all tt can be derived by recursively solving (10) backward in time. This quickly becomes computationally intractable, which motivates us to find alternative solutions that trade off sub-optimality for computational efficiency.
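For reference, a minimal NumPy sketch of the per-step reward in Eq. (20); the array names are illustrative.

```python
import numpy as np

def step_reward(X_t, A_t, A_cum_prev, w, eps=1e-6):
    """Per-step reward of Eq. (20): weighted log-utility gain of the agents
    that reveal a demand at step t; eps stands in for the constant in Eq. (8)."""
    active = X_t > 0
    gain = np.log(A_cum_prev + A_t + eps) - np.log(A_cum_prev + eps)
    return float(np.sum(np.where(active, w * gain, 0.0)))
```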

Appendix C Water-Filling Algorithm with Past Allocations

Algorithm 3 extends the water-filling algorithm presented in Algorithm 1 for agents with different weights to a setting where each agent may have past allocations denoted by 𝐀0\mathbf{A}^{0}. This algorithm is used as a base algorithm in SAFFE (Algorithm 2) to compute the allocations at each time step based on past allocations as well as current and expected future demands. In line 3, the agents are ordered according to their demands, past allocations and weights. This ordering determines which agent is used to compute the water-level μ\mu. For each selected agent iji_{j}, the condition in line 6 determines whether there is enough budget to fully satisfy agent iji_{j}’s demand XijX_{i_{j}}. If there is enough budget, agent iji_{j} receives its full request (line 11) and the supplier moves on to the next agent in order. Otherwise, the available budget is fully divided among the remaining agents (line 8) according to a water-level μ\mu computed in line 7. The water-level accounts for the agents’ past allocations as well as their weights. A minimal code sketch of this procedure is given after the pseudocode below.

Algorithm 3 Water-Filling Algorithm with Past Allocations
1:  Input: number of agents NN, resource budget BB, demand vector 𝐗N{\mathbf{X}}\in\mathbb{R}^{N}, weight vector 𝐰N\mathbf{w}\in\mathbb{R}^{N}, past allocations 𝐀0N\mathbf{A}^{0}\in\mathbb{R}^{N}
2:  Output: allocation vector 𝐀N{\mathbf{A}}\in\mathbb{R}^{N}
3:  Find an ordered index set {i1,,iN}\{i_{1},\dots,i_{N}\} by sorting the agents such that Xi1+Ai10wi1XiN+AiN0wiN\frac{{X}_{i_{1}}+{A}^{0}_{i_{1}}}{w_{i_{1}}}\leq\dots\leq\frac{{X}_{i_{N}}+{A}^{0}_{i_{N}}}{w_{i_{N}}}
4:  j1j\leftarrow 1
5:  while jNj\leq N and B>0B>0 do
6:     if Bk=jN(wikwij(Xij+Aij0)Aik0)+B\leq\sum_{k=j}^{N}\Big{(}\frac{w_{i_{k}}}{w_{i_{j}}}({X}_{i_{j}}+A_{i_{j}}^{0})-{A}^{0}_{i_{k}}\Big{)}^{+} then
7:        Solve k=jN(wikwijμAik0)+=B\sum_{k=j}^{N}(\frac{w_{i_{k}}}{w_{i_{j}}}\mu-A_{i_{k}}^{0})^{+}=B for μ\mu
8:        Aik=(wikwijμAik0)+A_{i_{k}}=(\frac{w_{i_{k}}}{w_{i_{j}}}\mu-A_{i_{k}}^{0})^{+} for k=j,,Nk=j,\dots,N
9:        break
10:     else
11:        Aij=Xij{A}_{i_{j}}={X}_{i_{j}}
12:        BBAijB\leftarrow B-{A}_{i_{j}}
13:        jj+1j\leftarrow j+1
14:     end if
15:  end while
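For concreteness, below is a minimal NumPy sketch of Algorithm 3. It follows the same ordering and full-satisfaction checks; the only liberty taken is that the water-level equation of line 7 is solved by bisection, in the equivalent parametrization of a per-weight level ν = μ/w_{i_j}, rather than in closed form. Names are illustrative.

```python
import numpy as np

def water_filling_with_past(B, X, w, A0, n_bisect=100):
    """Sketch of Algorithm 3: weighted water-filling with past allocations A0."""
    N = len(X)
    A = np.zeros(N)
    order = np.argsort((X + A0) / w)              # line 3: sort by (X_i + A_i^0)/w_i
    for pos, j in enumerate(order):
        if B <= 0:
            break
        rest = order[pos:]
        level_j = (X[j] + A0[j]) / w[j]           # level of agent i_j if fully satisfied
        need = np.maximum(w[rest] * level_j - A0[rest], 0.0).sum()
        if B <= need:                             # line 6: budget runs out here
            lo, hi = 0.0, (B + A0[rest].sum()) / w[rest].sum()
            for _ in range(n_bisect):             # line 7: solve for the water level
                mid = 0.5 * (lo + hi)
                spent = np.maximum(w[rest] * mid - A0[rest], 0.0).sum()
                lo, hi = (mid, hi) if spent < B else (lo, mid)
            A[rest] = np.maximum(w[rest] * lo - A0[rest], 0.0)   # line 8
            return A
        A[j] = X[j]                               # line 11: satisfy agent i_j in full
        B -= X[j]                                 # line 12
    return A
```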

Appendix D SAFFE-Discounted vs SAFFE-Oracle

Under mild assumptions on the distribution of demands, we can quantify the distance between SAFFE-D and SAFFE-Oracle allocations using concentration inequalities on the deviation of future demands from their expected value. Let us define

Y¯it=Xit+𝔼[τ=t+1TXiτ]+Ttξstd(Xiτ)\displaystyle\overline{Y}_{i}^{t}=X_{i}^{t}+\mathbb{E}\Big{[}\sum_{\tau=t+1}^{T}X_{i}^{\tau}\Big{]}+\frac{\sqrt{T-t}}{\sqrt{\xi}}\operatorname{std}(X^{\tau}_{i}) (21)
Y¯it=Xit+𝔼[τ=t+1TXiτ]Ttξstd(Xiτ)\displaystyle\underline{Y}_{i}^{t}=X_{i}^{t}+\mathbb{E}\Big{[}\sum_{\tau=t+1}^{T}X_{i}^{\tau}\Big{]}-\frac{\sqrt{T-t}}{\sqrt{\xi}}\operatorname{std}(X^{\tau}_{i}) (22)

For simplicity, assume that agent ii’s demands are i.i.d. (this assumption can be relaxed; see Remark 2). Then, for ξ>0\xi>0, Chebyshev’s inequality gives that, with probability at least 1ξ1-\xi, we have

Y¯itXit+τ=t+1TXiτY¯it\displaystyle\underline{Y}_{i}^{t}\leq X^{t}_{i}+\sum_{\tau=t+1}^{T}X^{\tau}_{i}\leq\overline{Y}_{i}^{t} (23)

We further assume that all agents have equal std(Xiτ)\operatorname{std}(X^{\tau}_{i}), but their expectations might differ (the assumption simplifies the presentation, and is not crucial). We first present the worst-case scenario bound, where we allow highly unbalanced future demands for different agents, i.e. there is an agent kk such that Y¯ktY¯jt\underline{Y}^{t}_{k}\geq\overline{Y}^{t}_{j} for all other agents jkj\neq k.
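As a concrete illustration of Eqs. (21)–(23), the following sketch computes the per-agent confidence interval for the current-plus-future demand under i.i.d. demands (scalar inputs; names are illustrative).

```python
import numpy as np

def chebyshev_demand_bounds(x_t, mean, std, T, t, xi):
    """Eqs. (21)-(22): with probability at least 1 - xi (Chebyshev), the realized
    X_i^t + sum_{tau > t} X_i^tau lies in [Y_lower, Y_upper] for i.i.d. demands."""
    slack = np.sqrt((T - t) / xi) * std
    future_mean = (T - t) * mean          # E[ sum_{tau=t+1}^T X_i^tau ]
    return x_t + future_mean - slack, x_t + future_mean + slack

# e.g., current demand 2.0, i.i.d. demands with mean 1.0 and std 0.5, at t = 10 of T = 40, xi = 0.1
Y_lo, Y_hi = chebyshev_demand_bounds(2.0, 1.0, 0.5, T=40, t=10, xi=0.1)
```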

Theorem 2 (Unbalanced demands bound).

Let Ait,SAFFE-DA_{i}^{t,\text{SAFFE-D}} and Ait,oracleA_{i}^{t,\text{oracle}} denote allocations by SAFFE-D for λ(t)=Ttξ\lambda(t)=\sqrt{\frac{T-t}{{\xi}}} and SAFFE-Oracle, respectively. Then, for all agents ii we have

|Ait,SAFFE-DAit,oracle|{2N(Tt)ξstd(Xit)if Bti=1NY¯it,4(Tt)ξstd(Xit)if Bti=1NY¯it\displaystyle\left|A_{i}^{t,\text{SAFFE-D}}-A_{i}^{t,\text{oracle}}\right|\leq\begin{cases}2N\sqrt{\frac{(T-t)}{{\xi}}}\operatorname{std}(X^{t}_{i})&\text{if }B^{t}\leq\sum\limits_{i=1}^{N}\overline{Y}_{i}^{t},\\ 4\sqrt{\frac{(T-t)}{{\xi}}}\operatorname{std}(X^{t}_{i})&\text{if }B^{t}\geq\sum\limits_{i=1}^{N}\overline{Y}_{i}^{t}\end{cases}

with probability at least 1ξ1-\xi.

Proof.

The detailed proof is in Appendix E. Intuitively, the discrepancy scales with the number of agents: if all agents jkj\neq k submit demands according to their upper bound Y¯jt\overline{Y}^{t}_{j}, then the water-level moves so that their SAFFE-Oracle allocations increase compared to SAFFE-Oracle with Y¯jt\underline{Y}^{t}_{j}, which happens at the expense of agent kk, who now gets less. Finally, using (23) completes the proof, as it lets us translate the discrepancy between SAFFE-Oracle with Y¯it\underline{Y}^{t}_{i} and with Y¯it\overline{Y}^{t}_{i} into a bound on the discrepancy between SAFFE-D and SAFFE-Oracle with the true YitY^{t}_{i}. ∎

Remark 2.

The equal standard deviation assumption that for any two agents ii and jj, we have std(Xiτ)=std(Xjτ)\operatorname{std}(X^{\tau}_{i})=\operatorname{std}(X^{\tau}_{j}), is used in the proofs only to simplify jistd(Xjτ)\sum_{j\neq i}\operatorname{std}(X^{\tau}_{j}) to (N1)std(Xiτ)(N-1)\operatorname{std}(X^{\tau}_{i}). Thus, the assumption can be removed, at the cost of carrying the sum jistd(Xjτ)\sum_{j\neq i}\operatorname{std}(X^{\tau}_{j}) through the bounds.

In order to move beyond the worst-case scenario of highly unbalanced request distributions, we now assume that all agents have the same Y¯it\overline{Y}_{i}^{t} and Y¯it\underline{Y}_{i}^{t} (e.g., their demands are drawn from the same distribution). In this case, the discrepancy between allocations scales better.

Theorem 3 (Balanced demands bound).

If for any two agents ii and jj we have the same bounds in (23), i.e. Y¯it=Y¯jt\overline{Y}_{i}^{t}=\overline{Y}_{j}^{t} and Y¯it=Y¯jt\underline{Y}_{i}^{t}=\underline{Y}_{j}^{t}, then for all agents ii, with probability at least 1ξ1-\xi, we have

|Ait,SAFFE-DAit,oracle|{0if Bti=1NY¯it,4(Tt)ξstd(Xit)if Bti=1NY¯it\displaystyle\left|A_{i}^{t,\text{SAFFE-D}}-A_{i}^{t,\text{oracle}}\right|\leq\begin{cases}0&\text{if }B^{t}\leq\sum\limits_{i=1}^{N}\underline{Y}_{i}^{t},\\ 4\sqrt{\frac{(T-t)}{{\xi}}}\operatorname{std}(X^{t}_{i})&\text{if }B^{t}\geq\sum\limits_{i=1}^{N}\underline{Y}_{i}^{t}\end{cases}
Proof.

See Appendix E for the detailed proof, which follows a similar argument as the proof of Theorem 2. ∎

Appendix E Detailed Proof of Theorem 2

Lemma 1 (Chebyshev’s inequality).

For a random variable YY with finite expectation and a finite non-zero variance, we have

(|Y𝔼(Y)|1ξVar(Y))ξ\displaystyle\mathbb{P}\left(|Y-\mathbb{E}(Y)|\geq\frac{1}{\sqrt{\xi}}\sqrt{\text{Var}(Y)}\right)\leq\xi

In particular, under the assumption from Appendix D that agent ii’s requests are i.i.d., Chebyshev’s inequality gives us that Eq. (23) holds with probability at least 1ξ1-\xi. Note, however, that the assumption is not crucial; in the more general case we would have

Y¯it=Xit+𝔼[τ=t+1TXiτ]1ξVar(τ=t+1TXiτ)\underline{Y}^{t}_{i}=X^{t}_{i}+\mathbb{E}[\sum_{\tau=t+1}^{T}X^{\tau}_{i}]-\frac{1}{\sqrt{\xi}}\sqrt{\text{Var}\Big{(}\sum_{\tau=t+1}^{T}X^{\tau}_{i}\Big{)}}

and

Y¯it=Xit+𝔼[τ=t+1TXiτ]+1ξVar(τ=t+1TXiτ).\overline{Y}^{t}_{i}=X^{t}_{i}+\mathbb{E}[\sum_{\tau=t+1}^{T}X^{\tau}_{i}]+\frac{1}{\sqrt{\xi}}\sqrt{\text{Var}\Big{(}\sum_{\tau=t+1}^{T}X^{\tau}_{i}\Big{)}}.

See also Remark 2 on how the assumption on equal standard deviations std(Xiτ)=std(Xjτ)\operatorname{std}(X^{\tau}_{i})=\operatorname{std}(X^{\tau}_{j}) for any two agents ii and jj is not crucial for the analysis.

We proceed as follows. First, we state the result for SAFFE-Oracle in the unbalanced demands regime, i.e. when there is an agent kk such that Y¯ktY¯jt\underline{Y}^{t}_{k}\geq\overline{Y}^{t}_{j} for all jkj\neq k. Then, we state the result for the case of balanced demands, i.e. when Y¯it=Y¯jt\underline{Y}^{t}_{i}=\underline{Y}^{t}_{j} and Y¯it=Y¯jt\overline{Y}^{t}_{i}=\overline{Y}^{t}_{j} for any two ii and jj. Finally, we present the proof for both results.

Lemma 2 (Unbalanced demands SAFFE-Oracle).

Let A¯it\overline{A}_{i}^{t} and A¯it\underline{A}_{i}^{t} denote the allocations of SAFFE-Oracle with Yit=Y¯itY_{i}^{t}=\overline{Y}_{i}^{t} and Yit=Y¯itY_{i}^{t}=\underline{Y}_{i}^{t}, respectively. Then for all agents ii we have

|A¯itA¯it|{2N(Tt)ξstd(Xiτ) if Bti=1NY¯it,2N(Tt)ξstd(Xiτ) if i=1NY¯itBti=1NY¯it,4(Tt)ξstd(Xiτ) if Bti=1NY¯it\displaystyle|\overline{A}_{i}^{t}-\underline{A}_{i}^{t}|\leq\begin{cases}2N\sqrt{\frac{(T-t)}{{\xi}}}\operatorname{std}(X^{\tau}_{i})&\text{ if }B^{t}\leq\sum\limits_{i=1}^{N}\underline{Y}_{i}^{t},\\ 2N\sqrt{\frac{(T-t)}{{\xi}}}\operatorname{std}(X^{\tau}_{i})&\text{ if }\sum\limits_{i=1}^{N}\underline{Y}_{i}^{t}\leq B^{t}\leq\sum\limits_{i=1}^{N}\overline{Y}_{i}^{t},\\ 4\sqrt{\frac{(T-t)}{{\xi}}}\operatorname{std}(X^{\tau}_{i})&\text{ if }B^{t}\geq\sum\limits_{i=1}^{N}\overline{Y}_{i}^{t}\end{cases} (24)
Lemma 3 (Balanced demands SAFFE-Oracle).

Let A¯it\overline{A}_{i}^{t} and A¯it\underline{A}_{i}^{t} denote the allocations of SAFFE-Oracle with Yit=Y¯itY_{i}^{t}=\overline{Y}_{i}^{t} and Yit=Y¯itY_{i}^{t}=\underline{Y}_{i}^{t}, respectively. Then we have

|A¯itA¯it|{0 if Bti=1NY¯it,4(Tt)ξstd(Xiτ) if i=1NY¯itBti=1NY¯it,4(Tt)ξstd(Xiτ) if Bti=1NY¯it\displaystyle|\overline{A}_{i}^{t}-\underline{A}_{i}^{t}|\leq\begin{cases}0&\text{ if }B^{t}\leq\sum\limits_{i=1}^{N}\underline{Y}_{i}^{t},\\ 4\sqrt{\frac{(T-t)}{{\xi}}}\operatorname{std}(X^{\tau}_{i})&\text{ if }\sum\limits_{i=1}^{N}\underline{Y}_{i}^{t}\leq B^{t}\leq\sum\limits_{i=1}^{N}\overline{Y}_{i}^{t},\\ 4\sqrt{\frac{(T-t)}{{\xi}}}\operatorname{std}(X^{\tau}_{i})&\text{ if }B^{t}\geq\sum\limits_{i=1}^{N}\overline{Y}_{i}^{t}\end{cases} (25)
Proof of Lemma 2 and Lemma 3.

Case B^{t}\leq\sum_{i=1}^{N}\underline{Y}_{i}^{t}.

Agents whose lower bound demands are fully filled by the lower bound solution with Yit=Y¯itY_{i}^{t}=\underline{Y}_{i}^{t} might increase the water-level when submitting larger demands, which will cause other agents to have smaller C¯jt\overline{C}^{t}_{j} in comparison to C¯jt\underline{C}^{t}_{j}. In the worst-case scenario, where agent ii’s lower bound demand is greater than all other agents’ upper bound demands, the difference C¯itC¯it\underline{C}^{t}_{i}-\overline{C}^{t}_{i} can be at most

ji(Y¯jtY¯jt),\sum_{j\neq i}\left(\overline{Y}_{j}^{t}-\underline{Y}_{j}^{t}\right),

i.e., agent ii has to compensate for the excess demands of all other agents. If std(Xiτ)\operatorname{std}(X^{\tau}_{i}) is equal for all agents, the difference in the worst-case scenario is at most 2(N1)(Tt)ξstd(Xiτ)2(N-1)\sqrt{\frac{(T-t)}{{\xi}}}\operatorname{std}(X^{\tau}_{i}), which leads to a difference between A¯itA¯it\underline{A}^{t}_{i}-\overline{A}^{t}_{i} of at most 2N(Tt)ξstd(Xiτ)2N\sqrt{\frac{(T-t)}{{\xi}}}\operatorname{std}(X^{\tau}_{i}) as a consequence of Lemma 4.

On the other hand if instead of the worst-case scenario we assume that Y¯it\overline{Y}_{i}^{t} and Y¯it\underline{Y}_{i}^{t} are the same for all agents (e.g. same demand distributions), then the water-level would not move in SAFFE using Yit=Y¯itY_{i}^{t}=\overline{Y}_{i}^{t} in comparison to SAFFE using Yit=Y¯itY_{i}^{t}=\underline{Y}_{i}^{t}, and thus there would be no change in the allocations resulting from the two problems.

Case \sum_{i=1}^{N}\underline{Y}_{i}^{t}\leq B^{t}\leq\sum_{i=1}^{N}\overline{Y}_{i}^{t}.

There is enough budget to satisfy all agents’ demands under the lower bounds but not enough budget under the upper bounds. Compared to C¯it\underline{C}^{t}_{i}, in the worst-case scenario, C¯it\overline{C}^{t}_{i} can decrease by at most

jiY¯jtjiY¯jt\sum_{j\neq i}\overline{Y}_{j}^{t}-\sum_{j\neq i}\underline{Y}_{j}^{t}

where agent ii’s lower bound demand is larger than any other agents’ upper bound demands, and therefore, the water-filling algorithm will satisfy the upper bound demands of all other agents at the expense of agent ii. If all agents have equal std(Xiτ)\operatorname{std}(X^{\tau}_{i}), in the worst-case scenario, the difference is again at most 2(N1)(Tt)ξstd(Xiτ)2(N-1)\sqrt{\frac{(T-t)}{{\xi}}}\operatorname{std}(X^{\tau}_{i}). Following Lemma 4, this leads to a difference of at most 2N(Tt)ξstd(Xiτ)2N\sqrt{\frac{(T-t)}{{\xi}}}\operatorname{std}(X^{\tau}_{i}) between A¯it\underline{A}^{t}_{i} and A¯it\overline{A}^{t}_{i}.

On the other hand, if we assume that jiY¯jt\sum_{j\neq i}\overline{Y}_{j}^{t} and jiY¯jt\sum_{j\neq i}\underline{Y}_{j}^{t} are the same for all agents ii, then all agents receive an equal share of the surplus, i.e., the difference between C¯it\underline{C}_{i}^{t} and C¯it\overline{C}_{i}^{t} becomes

1N(Btj=1NY¯jt).\frac{1}{N}\left({B^{t}}-\sum_{j=1}^{N}\underline{Y}_{j}^{t}\right).

Given that Bti=1NY¯itB^{t}\leq\sum_{i=1}^{N}\overline{Y}_{i}^{t}, and due to (23), this is at most 2(Tt)ξ2\sqrt{\frac{(T-t)}{{\xi}}} std(Xiτ)\operatorname{std}(X^{\tau}_{i}). As a consequence of Lemma 4, this leads to a difference of at most 4(Tt)ξstd(Xiτ)4\sqrt{\frac{(T-t)}{{\xi}}}\operatorname{std}(X^{\tau}_{i}) between A¯it\underline{A}_{i}^{t} and A¯it\overline{A}_{i}^{t}.

Case \sum_{i=1}^{N}\overline{Y}_{i}^{t}\leq B^{t}.

There is enough budget to satisfy all agents’ demands even under the upper bounds, so the difference between C¯it\overline{C}^{t}_{i} and C¯it\underline{C}^{t}_{i} is purely driven by the difference between the lower and upper bound demands, i.e., it equals Y¯itY¯it\overline{Y}_{i}^{t}-\underline{Y}_{i}^{t}, which is 2(Tt)ξstd(Xiτ)2\sqrt{\frac{(T-t)}{{\xi}}}\operatorname{std}(X^{\tau}_{i}). Following Lemma 4, this leads to a difference of at most 4(Tt)ξstd(Xiτ)4\sqrt{\frac{(T-t)}{{\xi}}}\operatorname{std}(X^{\tau}_{i}) between A¯it\underline{A}^{t}_{i} and A¯it\overline{A}^{t}_{i}. ∎

Lemma 4.

If |C¯itC¯it|D|\overline{C}^{t}_{i}-\underline{C}^{t}_{i}|\leq D, then we have

|A¯itA¯it|D+2(Tt)ξstd(Xiτ).\displaystyle|\overline{A}^{t}_{i}-\underline{A}^{t}_{i}|\leq D+2\sqrt{\frac{(T-t)}{{\xi}}}\operatorname{std}(X^{\tau}_{i}).
Proof.

Firstly, let us note that

XitC¯itY¯itXitC¯itY¯it\displaystyle X_{i}^{t}\overline{C}^{t}_{i}\underline{Y}^{t}_{i}-X_{i}^{t}\underline{C}^{t}_{i}{\overline{Y}^{t}_{i}} =XitY¯it(C¯itC¯it)+XitC¯itY¯it\displaystyle=X_{i}^{t}\underline{Y}^{t}_{i}(\overline{C}^{t}_{i}-\underline{C}^{t}_{i})+X_{i}^{t}\underline{C}^{t}_{i}\underline{Y}^{t}_{i}
XitC¯itY¯itXitC¯it(Y¯itY¯it)\displaystyle-X_{i}^{t}\underline{C}^{t}_{i}\underline{Y}^{t}_{i}-X_{i}^{t}\underline{C}^{t}_{i}\left(\overline{Y}^{t}_{i}-\underline{Y}^{t}_{i}\right)

and hence we have

|XitC¯itY¯itXitC¯itY¯it|\displaystyle\left|X_{i}^{t}\overline{C}^{t}_{i}\underline{Y}^{t}_{i}-X_{i}^{t}\underline{C}^{t}_{i}{\overline{Y}^{t}_{i}}\right| XitY¯it|C¯itC¯it|+XitC¯it|Y¯itY¯it|\displaystyle\leq X_{i}^{t}\underline{Y}^{t}_{i}\left|\overline{C}^{t}_{i}-\underline{C}^{t}_{i}\right|+X_{i}^{t}\underline{C}^{t}_{i}\left|\overline{Y}^{t}_{i}-\underline{Y}^{t}_{i}\right|
XitY¯itD+2XitC¯itTtξstd(Xiτ)\displaystyle\leq X_{i}^{t}\underline{Y}^{t}_{i}\cdot D+2X_{i}^{t}\underline{C}^{t}_{i}\sqrt{\frac{T-t}{{\xi}}}\operatorname{std}(X^{\tau}_{i})
XitY¯it(D+2Ttξstd(Xiτ))\displaystyle\leq X_{i}^{t}\underline{Y}^{t}_{i}\left(D+2\sqrt{\frac{T-t}{{\xi}}}\operatorname{std}(X^{\tau}_{i})\right)

where in the last inequality we use C¯itY¯it.\underline{C}^{t}_{i}\leq\underline{Y}^{t}_{i}.

Thus

|A¯itA¯it|\displaystyle|\overline{A}^{t}_{i}-\underline{A}^{t}_{i}| =|XitC¯itY¯itXitC¯itY¯it|\displaystyle=\left|\frac{X_{i}^{t}\overline{C}^{t}_{i}}{\overline{Y}^{t}_{i}}-\frac{X_{i}^{t}\underline{C}^{t}_{i}}{\underline{Y}^{t}_{i}}\right|
=|XitC¯itY¯itXitC¯itY¯itY¯itY¯it|\displaystyle=\left|\frac{X_{i}^{t}\overline{C}^{t}_{i}\underline{Y}^{t}_{i}-X_{i}^{t}\underline{C}^{t}_{i}\overline{Y}^{t}_{i}}{\overline{Y}^{t}_{i}\underline{Y}^{t}_{i}}\right|
(D+2(Tt)ξstd(Xiτ))XitY¯it\displaystyle\leq\frac{(D+2\sqrt{\frac{(T-t)}{{\xi}}}\operatorname{std}(X^{\tau}_{i}))X_{i}^{t}}{\overline{Y}^{t}_{i}}
D+2(Tt)ξstd(Xiτ)\displaystyle\leq D+2\sqrt{\frac{(T-t)}{{\xi}}}\operatorname{std}(X^{\tau}_{i})

which completes the proof of Lemma 4. ∎

Appendix F SAFFE-Oracle Matches Hindsight

In Appendix D, we have upper bounded the discrepancy between the allocations of SAFFE-D and SAFFE-Oracle. Our next step is to show that SAFFE-Oracle achieves the optimal allocations in hindsight.

Theorem 4.

SAFFE-Oracle achieves optimal allocations in hindsight, i.e., for all ii we have t=1TAit,oracle=Ai~\sum_{t=1}^{T}A^{t,\text{oracle}}_{i}=\widetilde{A_{i}}, where Ai~\widetilde{A_{i}} is the solution to the Eisenberg-Gale program (11).

Proof.

See Appendix G for a detailed proof, which relies on the following observations: 1) at t=1t=1, the total of the current-step allocation and the reserved future allocation, Ci1C^{1}_{i}, equals the solution Ai~\widetilde{A_{i}} of (11), and 2) SAFFE-Oracle fully distributes this reserved total by the end of the horizon TT. ∎

Appendix G Detailed Proof of Theorem 4

Lemma 5.

At t=1, the solution Ci1,oracleC_{i}^{1,\text{oracle}} to the SAFFE-Oracle problem coincides with the solution Ai~\tilde{A_{i}} to the Eisenberg-Gale program (11), i.e., Ci1,oracle=Ai~C_{i}^{1,\text{oracle}}=\widetilde{A_{i}}.

Proof.

At t=1t=1, SAFFE-Oracle solves the following problem

max{Ci1}\displaystyle\max_{\{C_{i}^{1}\}} i:t=1TXit>0log(Ci1)\displaystyle\sum_{i:\sum_{t=1}^{T}X_{i}^{t}>0}\log(C_{i}^{1})
s.t. 0Ci1t=1TXit,i=1NCi1B.\displaystyle 0\leq C_{i}^{1}\leq\sum_{t=1}^{T}X_{i}^{t},\quad\sum_{i=1}^{N}C_{i}^{1}\leq B.

This is equivalent to problem (11) with Ai=Ci1A_{i}=C_{i}^{1}, and hence, we have Ci1,oracle=Ai~C_{i}^{1,\text{oracle}}=\tilde{A_{i}}. ∎

Lemma 6.

With SAFFE-Oracle, for any agent ii we have

Ci1,oracle=t=1TAit,oracleC^{1,\text{oracle}}_{i}=\sum_{t=1}^{T}{A^{t,\text{oracle}}_{i}}

i.e. its future allocation from the first round is allocated in full by the end of the TthT^{\text{th}} round.

Proof.

First, we prove that for t=2,,Tt=2,\dots,T we have

τ=1t1Aiτ,oracle+Cit,oracle=τ=1t2Aiτ,oracle+Cit1,oracle\sum_{\tau=1}^{t-1}A^{\tau,\text{oracle}}_{i}+C^{t,\text{oracle}}_{i}=\sum_{\tau=1}^{t-2}A^{\tau,\text{oracle}}_{i}+C^{t-1,\text{oracle}}_{i}

i.e. the past plus future allocation from the previous step is preserved in the following step. Recall that in each step we are solving

max{Cit}\displaystyle\max_{\{C_{i}^{t}\}} i:τ=tTXiτ>0log(τ=1t1Aiτ,oracle+Cit)\displaystyle\sum_{i:\sum_{\tau=t}^{T}X_{i}^{\tau}>0}\log(\sum_{\tau=1}^{t-1}A^{\tau,\text{oracle}}_{i}+C_{i}^{t}) (26)
s.t. 0Citτ=tTXiτ,i=1NCitBt.\displaystyle 0\leq C_{i}^{t}\leq\sum_{\tau=t}^{T}X_{i}^{\tau},\quad\sum_{i=1}^{N}C_{i}^{t}\leq B^{t}.

where the sequence of future demands is known to the oracle. It is easy to see that

τ=1t1Aiτ,oracle+Cit,oracleτ=1t2Aiτ,oracle+Cit1,oracle\sum_{\tau=1}^{t-1}A^{\tau,\text{oracle}}_{i}+C^{t,\text{oracle}}_{i}\leq\sum_{\tau=1}^{t-2}A^{\tau,\text{oracle}}_{i}+C^{t-1,\text{oracle}}_{i}

because the constraints in step tt are tighter. If this didn’t hold, the previous round solution Cit1,oracleC^{t-1,\text{oracle}}_{i} would not be the maximizing solution in step t1t-1. It remains to note that τ=1t1Aiτ,oracle+Cit,oracle\sum_{\tau=1}^{t-1}A^{\tau,\text{oracle}}_{i}+C^{t,\text{oracle}}_{i} can achieve its upper bound, because compared to the previous step t1t-1, the remaining budget was reduced by exactly

i=1N(τ=1t1Aiτ,oracleτ=1t2Aiτ,oracle).\sum_{i=1}^{N}\left(\sum_{\tau=1}^{t-1}A^{\tau,\text{oracle}}_{i}-\sum_{\tau=1}^{t-2}A^{\tau,\text{oracle}}_{i}\right).

Therefore,

Cit,oracle=Cit1,oracle(τ=1t1Aiτ,oracleτ=1t2Aiτ,oracle)C_{i}^{t,\text{oracle}}=C_{i}^{t-1,\text{oracle}}-\left(\sum_{\tau=1}^{t-1}A^{\tau,\text{oracle}}_{i}-\sum_{\tau=1}^{t-2}A^{\tau,\text{oracle}}_{i}\right)

satisfies the constraints. This means that Cit,oracle=Cit1,oracleAit1,oracleC_{i}^{t,\text{oracle}}=C_{i}^{t-1,\text{oracle}}-A^{t-1,\text{oracle}}_{i}, i.e., the current plus future allocation in step tt equals its counterpart in step t1t-1 minus the allocation in step t1t-1. In particular, we get

Ci2,oracle\displaystyle C^{2,\text{oracle}}_{i} =Ci1,oracleAi1,oracle\displaystyle=C^{1,\text{oracle}}_{i}-A^{1,\text{oracle}}_{i} (27)
=Ci1,oracleXi1τ=1TXiτCi1,oracle\displaystyle=C^{1,\text{oracle}}_{i}-\frac{X^{1}_{i}}{\sum_{\tau=1}^{T}X^{\tau}_{i}}C^{1,\text{oracle}}_{i} (28)
=(1Xi1τ=1TXiτ)Ci1,oracle,\displaystyle=\left(1-\frac{X^{1}_{i}}{\sum_{\tau=1}^{T}X^{\tau}_{i}}\right)C^{1,\text{oracle}}_{i}, (29)
Ci3,oracle\displaystyle C^{3,\text{oracle}}_{i} =Ci2,oracleAi2,oracle\displaystyle=C^{2,\text{oracle}}_{i}-A^{2,\text{oracle}}_{i} (30)
=(1Xi2τ=2TXiτ)Ci2,oracle\displaystyle=\left(1-\frac{X^{2}_{i}}{\sum_{\tau=2}^{T}X^{\tau}_{i}}\right)C^{2,\text{oracle}}_{i} (31)
=(1Xi2τ=2TXiτ)(1Xi1τ=1TXiτ)Ci1,oracle\displaystyle=\left(1-\frac{X^{2}_{i}}{\sum_{\tau=2}^{T}X^{\tau}_{i}}\right)\left(1-\frac{X^{1}_{i}}{\sum_{\tau=1}^{T}X^{\tau}_{i}}\right)C^{1,\text{oracle}}_{i} (32)

and in general we have

Cit,oracle=(1Xit1τ=t1TXiτ)(1Xi1τ=1TXiτ)Ci1,oracle\displaystyle C^{t,\text{oracle}}_{i}=\left(1-\frac{X^{t-1}_{i}}{\sum_{\tau=t-1}^{T}X^{\tau}_{i}}\right)\cdots\left(1-\frac{X^{1}_{i}}{\sum_{\tau=1}^{T}X^{\tau}_{i}}\right)C^{1,\text{oracle}}_{i} (33)

and

Ait,oracle=\displaystyle A^{t,\text{oracle}}_{i}=
(1Xit1τ=t1TXiτ)(1Xi1τ=1TXiτ)Xitτ=tTXiτCi1,oracle\displaystyle\left(1-\frac{X^{t-1}_{i}}{\sum_{\tau=t-1}^{T}X^{\tau}_{i}}\right)\cdots\left(1-\frac{X^{1}_{i}}{\sum_{\tau=1}^{T}X^{\tau}_{i}}\right)\frac{X^{t}_{i}}{\sum_{\tau=t}^{T}X^{\tau}_{i}}C^{1,\text{oracle}}_{i} (34)

Summing (34) over t=1,,Tt=1,\dots,T telescopes to Ci1,oracleC^{1,\text{oracle}}_{i}, which proves the claim. ∎

Proof of Theorem 4.

As a consequence of Lemmas 5 and 6, we have Ai~=t=1TAit,oracle\widetilde{A_{i}}=\sum_{t=1}^{T}{A^{t,\text{oracle}}_{i}} for every ii. This completes the proof. ∎
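To make the recursion in Eqs. (27)–(34) concrete, the following sketch (with arbitrary example numbers) traces one agent's SAFFE-Oracle allocations and checks that they sum back to the reserved amount C_i^1, as stated in Lemma 6.

```python
import numpy as np

# Sketch illustrating Lemma 6 and Eqs. (33)-(34): SAFFE-Oracle splits the reserved
# amount C_i^1 across the horizon in proportion to the remaining demand, and the
# per-step allocations telescope back to C_i^1. Numbers below are arbitrary examples.
rng = np.random.default_rng(1)
T = 6
X = rng.random(T) + 0.1      # one agent's positive demands X_i^1, ..., X_i^T
C1 = 0.7 * X.sum()           # some reserved total C_i^{1,oracle} <= total demand

C, A = C1, []
for t in range(T):
    share = X[t] / X[t:].sum()     # X_i^t / sum_{tau >= t} X_i^tau
    A.append(share * C)            # A_i^{t,oracle}, as in Eq. (34)
    C -= A[-1]                     # C_i^{t+1,oracle} = C_i^t - A_i^t, as in Eq. (27)

assert np.isclose(sum(A), C1)      # Lemma 6: sum_t A_i^{t,oracle} = C_i^{1,oracle}
```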

Appendix H Detailed Proof of Theorem 1

Proof of Theorem 1.

Firstly, let us note that Theorem 4 guarantees that

Δ𝐀max=𝔼[maxi|t=1TAit,SAFFE-Dt=1TAit,oracle|],\Delta\mathbf{A}^{\text{max}}=\mathbb{E}\left[\max_{i}\left|\sum_{t=1}^{T}A_{i}^{t,\text{SAFFE-D}}-\sum_{t=1}^{T}A_{i}^{t,\text{oracle}}\right|\right],

as SAFFE-Oracle achieves the hindsight solution. It remains to employ Theorems 2 and 3 together with the triangle inequality, summing the per-step bounds over t=1,,Tt=1,\dots,T, in order to obtain an upper bound on the right-hand side. In the case of unbalanced demands, this results in Δ𝐀maxNT3/2ξstd(Xit)\Delta\mathbf{A}^{\text{max}}\leq N\frac{T^{3/2}}{\sqrt{\xi}}\operatorname{std}(X^{t}_{i}). In the case of balanced demands, we have Δ𝐀max2T3/2ξstd(Xit)\Delta\mathbf{A}^{\text{max}}\leq\frac{2T^{3/2}}{\sqrt{\xi}}\operatorname{std}(X^{t}_{i}). ∎

Remark 3.

As Δ𝐀meanΔ𝐀max\Delta\mathbf{A}^{\text{mean}}\leq\Delta\mathbf{A}^{\text{max}}, Theorem 1 also provides an upper bound on Δ𝐀mean\Delta\mathbf{A}^{\text{mean}}.

Appendix I SAFFE as a generalization of HOPE-Online

The heuristic algorithm HOPE-Online in Sinclair et al. [2020] coincides with SAFFE under a simplified demand process. HOPE-Online is designed for a setting where the supplier visits a set of agents sequentially in order. SAFFE generalizes the allocation algorithm to a setup where agents can arrive simultaneously and several times over the horizon TT. Similar to SAFFE-D, which improves SAFFE using the uncertainty of future demand estimates, Guarded-HOPE proposed in Sinclair et al. [2022] improves HOPE-Online and achieves the correct trade-off between a measure of fairness (envy-freeness) and resource leftover by learning a “lower guardrail” on the optimal solution in hindsight. We highlight that Guarded-HOPE is designed for the setting where agents do not make repeated demands over the horizon, i.e., each individual agent has a request at most once during the horizon. In that sense, our SAFFE and SAFFE-D represent generalizations as we allow for agents with multiple demands. In our analysis of the optimality of SAFFE-D, we rely on concentration inequalities for the deviation of future demands from their expected values, which is similar in spirit to the optimality analysis of the guardrail approach in Sinclair et al. [2022].

Appendix J Extension of Guarded-HOPE to Our Setting

Since Guarded-HOPE was designed for a setting where in each time step a number of individuals of the same type arrive, we slightly modify it to be applicable to our setting in Algorithm 4. The key components of the algorithm are the upper and lower guardrails defined in lines 7 and 10. Specifically, 𝐗¯,𝐗¯N\underline{\mathbf{X}},\overline{\mathbf{X}}\in\mathbb{R}^{N} are high-confidence lower and upper bounds on the future demands, and 𝐀¯,𝐀¯N\overline{\mathbf{A}},\underline{\mathbf{A}}\in\mathbb{R}^{N}, which Sinclair et al. [2020] refers to as upper and lower guardrails, respectively, are the optimal hindsight allocations under 𝐗¯,𝐗¯\underline{\mathbf{X}},\overline{\mathbf{X}}. When the demands 𝐗t\mathbf{X}^{t} are revealed in time step tt, the condition in line 13 first checks whether the budget is insufficient to even allow an allocation according to the lower guardrail. If so, the remaining budget is allocated right away and none of it is reserved. The condition in line 15 checks whether the current demands can be allocated according to the upper guardrail while the remaining budget still allows allocations according to the lower guardrail for anticipated future demands. If so, the upper guardrail is allocated to the agents with demands at the current step. Otherwise, the lower guardrail is allocated. A code sketch of this per-step rule is given after the pseudocode below.

Algorithm 4 Guarded-HOPE (Modified Compared to Sinclair et al. [2022])
1:  Input: number of agents NN, resource budget BB, demand vectors 𝐗1,𝐗TN\mathbf{X}^{1},\dots\mathbf{X}^{T}\in\mathbb{R}^{N}, demand distributions P𝐗1,,P𝐗NP_{\mathbf{X}_{1}},\dots,P_{\mathbf{X}_{N}}, bound on envy LTL_{T}
2:  Output: allocation vectors 𝐀1,,𝐀TN\mathbf{A}^{1},\dots,\mathbf{A}^{T}\in\mathbb{R}^{N}
3:  Define confidence bound CONFi=std(Xi1)𝔼[Xi1](T1)\text{CONF}_{i}=\sqrt{\text{std}(X_{i}^{1})\mathbb{E}[X_{i}^{1}](T-1)}
4:  𝔼Xiτ=1T𝔼[Xiτ]\mathbb{E}X_{i}\coloneqq\sum_{\tau=1}^{T}\mathbb{E}[X_{i}^{\tau}]
5:  ci=LT(1+CONFi/𝔼Xi)CONFi/𝔼Xic_{i}=L_{T}(1+\text{CONF}_{i}/\mathbb{E}X_{i})-\text{CONF}_{i}/\mathbb{E}X_{i} for all i=1,,Ni=1,\dots,N
6:  Solve 𝐀¯\overline{\mathbf{A}} for X¯𝔼X(1c)\underline{X}\coloneqq\mathbb{E}X(1-c) using Algorithm 1
7:  A¯iA¯i/X¯i\overline{A}_{i}\leftarrow\overline{A}_{i}/\underline{X}_{i}
8:  γi=CONFit/𝔼Xi\gamma_{i}=\text{CONF}^{t}_{i}/\mathbb{E}X_{i}
9:  Solve 𝐀¯\underline{\mathbf{A}} for X¯𝔼X(1+γ)\overline{X}\coloneqq\mathbb{E}X(1+\gamma) using Algorithm 1
10:  A¯iA¯i/X¯i\underline{A}_{i}\leftarrow\underline{A}_{i}/\overline{X}_{i}
11:  for t=1,,Tt=1,\dots,T do
12:     Define confidence bound CONFit=std(Xit)𝔼[Xit](Tt)\text{CONF}^{t}_{i}=\sqrt{\text{std}(X_{i}^{t})\mathbb{E}[X_{i}^{t}](T-t)}
13:     if Bi=1NXitA¯iB\leq\sum_{i=1}^{N}X_{i}^{t}\underline{A}_{i} then
14:        Ait=𝟙{Xit>0}×B/(i=1N𝟙{Xit>0})A_{i}^{t}=\mathbbm{1}\{X_{i}^{t}>0\}\times B/\Big{(}\sum_{i=1}^{N}\mathbbm{1}\{X_{i}^{t}>0\}\Big{)} for i=1,,Ni=1,\dots,N
15:     else if Bi=1NXitA¯i+i=1NA¯i(𝔼[τ=tTXiτ]+CONFit)B\geq\sum_{i=1}^{N}X_{i}^{t}\overline{A}_{i}+\sum_{i=1}^{N}\underline{A}_{i}\Big{(}\mathbb{E}[\sum_{\tau=t}^{T}X_{i}^{\tau}]+\text{CONF}_{i}^{t}\Big{)} then
16:        Ait=𝟙{Xit>0}×XitA¯iA_{i}^{t}=\mathbbm{1}\{X_{i}^{t}>0\}\times X_{i}^{t}\overline{A}_{i} for i=1,,Ni=1,\dots,N
17:     else
18:        Ait=𝟙{Xit>0}×XitA¯iA_{i}^{t}=\mathbbm{1}\{X_{i}^{t}>0\}\times X_{i}^{t}\underline{A}_{i} for i=1,,Ni=1,\dots,N
19:     end if
20:     BBi=1NAitB\leftarrow B-\sum_{i=1}^{N}A_{i}^{t}
21:  end for
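As a reference for lines 13–20, here is a minimal NumPy sketch of the per-step guardrail decision, assuming the guardrail rates and confidence widths have already been computed as in lines 3–12; argument names are illustrative.

```python
import numpy as np

def guarded_hope_step(B, X_t, A_up, A_low, exp_future, conf):
    """Per-step rule of Algorithm 4 (lines 13-20).

    B: remaining budget; X_t: current demands; A_up / A_low: per-unit upper / lower
    guardrail rates; exp_future: expected remaining demands E[sum_{tau>=t} X_i^tau];
    conf: confidence widths CONF_i^t. All arrays have length N.
    """
    active = X_t > 0
    if B <= np.sum(X_t * A_low):
        # line 13: even the lower guardrail does not fit -> split the leftover budget
        A_t = np.where(active, B / max(active.sum(), 1), 0.0)
    elif B >= np.sum(X_t * A_up) + np.sum(A_low * (exp_future + conf)):
        # line 15: upper guardrail now, lower guardrail still affordable later
        A_t = np.where(active, X_t * A_up, 0.0)
    else:
        A_t = np.where(active, X_t * A_low, 0.0)   # line 18: default to lower guardrail
    return A_t, B - A_t.sum()                      # line 20: update the budget
```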

Appendix K Additional Experiments

K.1 Scaling System Parameters

SAFFE-D is compared to the baselines in terms of utilization and fairness metrics in Fig. 4, as different system parameters vary. Unless otherwise stated, the settings have N=50N=50 agents, budget size 0.50.5, time horizon T=40T=40, and 22 expected arrivals per agent. In all experiments, we observe that SAFFE-D is more efficient and more fair than the other methods, i.e., it achieves higher utilization, and lower Δ\DeltaLog-NSW, Δ𝐀mean\Delta\mathbf{A}^{\text{mean}} and Δ𝐀max\Delta\mathbf{A}^{\text{max}}. In Fig. 4(a), we observe that as the number of agents increases over the same horizon, the algorithms initially achieve higher utilization and lower Δ\DeltaLog-NSW, which eventually levels out. However, in terms of Δ𝐀max\Delta\mathbf{A}^{\text{max}} fairness, which measures the allocation difference with respect to hindsight for the worst-off agent, SAFFE-D is the least affected across different numbers of agents, while all other algorithms become worse. Fig. 4(b) shows the metrics for varying horizon TT while keeping the same number of expected arrivals, i.e., when the arrivals are more spread out across time. In this setting, utilization and fairness metrics seem relatively unaffected across all methods.

In Fig. 4(c), we compare the algorithms as the number of arrivals over the horizon increases. We observe that SAFFE-D is able to use all the budget and match the fair hindsight allocations when more than 55 arrivals per agent are expected over T=40T=40. SAFFE is able to match the performance of SAFFE-D in terms of utilization and Δ\DeltaLog-NSW as the arrivals become denser. As discussed in Appendix I, HOPE-Online and Guarded-HOPE are designed for settings with a single per-agent arrival. As expected, when there are several arrivals per agent, they are not able to match SAFFE-D or SAFFE since they do not account for an agent’s past allocations when distributing the budget.

Figure 4: Symmetric Setting: SAFFE-D performs close to Hindsight and outperforms other methods by achieving higher utilization and lower Δ\DeltaLog-NSW, Δ𝐀mean\Delta\mathbf{A}^{\text{mean}} and Δ𝐀max\Delta\mathbf{A}^{\text{max}}. (a) Different number of agents NN. (b) Different horizons TT. (c) Different per-agent expected arrivals.

K.2 Non-Symmetric Demands

Fig. 5 shows how SAFFE-D compares to the other baselines in terms of allocations in the Non-symmetric Demands Setting, where some agents request larger demands earlier, while others have larger requests later or have no preference. We observe that SAFFE-D and SAFFE achieve lower Δ𝐀mean\Delta\mathbf{A}^{\text{mean}}, and have low variability across the groups on average. When considering the worst-case agent, SAFFE-D is less fair to agents that have larger demands earlier in the horizon, as it reserves budget early on to account for expected future demands. However, under SAFFE-D, Uniform and More-Late agents receive allocations closer to hindsight than under the other algorithms.

Figure 5: Non-symmetric Demands Setting: SAFFE-D allocates uniformly across agents with different arrival patterns on average, while SAFFE outperforms in the worst case.

K.3 Real Data

In order to study how SAFFE-D performs on real data with sparser arrivals over the horizon, we enforce fewer per-agent requests by erasing each arrival with probability pp. Since T=7T=7, setting p=27p=\frac{2}{7} corresponds to two weekly demands per store. As observed from Fig. 6, SAFFE-D outperforms the other methods in terms of efficiency and fairness. We remark that, while the uniform random erasures impose Bernoulli arrivals similar to the demand processes described in Sec. 6, the results presented here are based on the real demand sizes and therefore reflect the use of imperfect estimates computed from real data. However, further experiments on real datasets are needed to compare the performance of SAFFE-D under more general arrival processes that are correlated across time.
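For clarity, the thinning step amounts to independent Bernoulli erasures of the recorded arrivals; the array names below are illustrative and the actual dataset layout is not shown here.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 2 / 7
# stand-in for the real (stores x 7 days) matrix of demand sizes
demands = rng.integers(0, 10, size=(100, 7)).astype(float)
erased = rng.random(demands.shape) < p          # erase each arrival with probability p
sparse_demands = np.where(erased, 0.0, demands)
```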

Figure 6: Experiments on real data for different budgets when p=27p=\frac{2}{7}. SAFFE-D performs close to Hindsight achieving high utilization and low Δ\DeltaLog-NSW, Δ𝐀mean\Delta\mathbf{A}^{\text{mean}} and Δ𝐀max\Delta\mathbf{A}^{\text{max}}.

Appendix L Training of Reinforcement Learning Policy

We implement a variation of the Soft Actor-Critic (SAC) method Haarnoja et al. [2018a] with clipped double-Q learning Achiam [2018] as well as automatic entropy tuning Haarnoja et al. [2018b]. We train the SAC policy for 100100K episodes with 55 random seeds, and then evaluate the 55 checkpoints on another 200200 rollouts each in order to compare with the other baselines. We report the average performance with one standard deviation over the four metrics Log-NSW, Utilization, Δ𝐀mean\Delta\mathbf{A}^{\text{mean}} and Δ𝐀max\Delta\mathbf{A}^{\text{max}} in Table 2.

The policy network architecture consists of three Fully Connected (FC) layers followed by one output layer, where each FC layer has 256 neurons with ReLU activation functions. Since the MDP state is time dependent, to prevent the input state vector size from growing over the time horizon TT during training, we represent the state vector as 𝐗~t,𝐀~t,Bt,t\langle\widetilde{\mathbf{X}}^{t},\widetilde{\mathbf{A}}^{t},B^{t},t\rangle, where

X~it=τ=1tXiτ,A~it=τ=1tAiτ.\widetilde{X}_{i}^{t}=\sum_{\tau=1}^{t}{X}_{i}^{\tau},\quad\widetilde{A}_{i}^{t}=\sum_{\tau=1}^{t}{A}_{i}^{\tau}.

Note that the state does not include any demand distribution information (e.g., expectation). We guarantee the step-wise budget constraint i=1NAitBt\sum_{i=1}^{N}A^{t}_{i}\leq B^{t}, by designing the output layer to have two heads: one head with a Sigmoid activation function, which outputs a scalar determining the step-wise budget utilization ratio ut[0,1]u^{t}\in[0,1]; and another head with a Softmax activation function, which outputs an allocation ratio vector, 𝐳tN\mathbf{z}^{t}\in\mathbb{R}^{N} : i=1Nzit=1\sum_{i=1}^{N}z_{i}^{t}=1 over the agents. The final allocation for each agent is determined by Ait=zitutBtA_{i}^{t}=z_{i}^{t}u^{t}B^{t}.
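A minimal PyTorch sketch of this output parameterization is given below. It is illustrative only: the actual SAC actor additionally outputs the distribution parameters needed for stochastic sampling and entropy tuning, which are omitted here.

```python
import torch
import torch.nn as nn

class AllocationPolicy(nn.Module):
    """Three 256-unit FC layers with ReLU, then two heads: a Sigmoid head for the
    budget-utilization ratio u_t and a Softmax head for the allocation ratios z_t."""

    def __init__(self, state_dim, n_agents, hidden=256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.util_head = nn.Linear(hidden, 1)          # -> u_t in [0, 1]
        self.ratio_head = nn.Linear(hidden, n_agents)  # -> z_t on the probability simplex

    def forward(self, state, budget):
        h = self.backbone(state)
        u = torch.sigmoid(self.util_head(h))               # step-wise utilization ratio
        z = torch.softmax(self.ratio_head(h), dim=-1)      # per-agent allocation ratios
        return z * u * budget.unsqueeze(-1)                # A_i^t = z_i^t * u^t * B^t
```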

We perform hyper-parameter tuning using a grid search over a set of candidates. The results in Table 2 are achieved by using the hyper-parameters summarized in Table 4.

Table 4: SAC Hyper-parameters used in experiments.
Parameter Sparse (22 arrivals) Dense (44 arrivals)
Learning rate 31043\cdot 10^{-4} 31043\cdot 10^{-4}
Replay-buffer size 81058\cdot 10^{5} 10610^{6}
Batch size 512 512
Target smoothing coefficient (τ)(\tau) 0.005 0.005
Update interval (step) 5 1
Update after (step) 10510^{5} 51055\cdot 10^{5}
Uniform-random action selection (step) 10510^{5} 51055\cdot 10^{5}