
Dynamic Programming and Linear Programming for Odds Problem*

*This work was supported by JSPS KAKENHI Grant Numbers JP26285045, JP26242027, JP20K04973. We thank Katsunori Ano, Naoto Miyoshi, and Akifumi Kira for extensive discussions.

Sachika Kurokawa (Graduate School of Engineering, Tokyo Institute of Technology) and Tomomi Matsui (Graduate School of Engineering, Tokyo Institute of Technology)
Abstract

This paper discusses the odds problem, proposed by Bruss in 2000, and its variants. A recurrence relation called a dynamic programming (DP) equation is used to find an optimal stopping policy of the odds problem and its variants. In 2013, Buchbinder, Jain, and Singh proposed a linear programming (LP) formulation for finding an optimal stopping policy of the classical secretary problem, which is a special case of the odds problem. Their LP problem, which maximizes the probability of a win, differs from the long-known DP equation. This paper shows that the ordinary DP equation can be obtained as a modification of the dual of a linear programming problem that includes the LP formulation proposed by Buchbinder, Jain, and Singh.

Keywords: Odds problem, dynamic programming, linear programming, Markov decision process

2010 Mathematics Subject Classification: Primary 60G40, Secondary 60L15

1 Introduction

The odds problem is an optimal stopping problem proposed by Bruss [1], which includes the classical secretary problem as a special case. Some variants of the odds problem are discussed in [2, 3, 4, 5, 6, 7, 8]. Although the odds theorem shown in [1] and [9] gives an optimal policy directly, it is well-known that a backward induction method finds an optimal policy for most finite optimal stopping problems [10]. The backward induction method was used for the first time by Cayley [11, 12, 13]. A recurrence relation appearing in the backward induction method is called a dynamic programming (DP) equation. Recently, Buchbinder, Jain, and Singh [14] proposed a linear programming (LP) formulation for the classical secretary problem. Their LP formulation does not depend on the backward induction method and is essentially different from the DP equation. The purpose of this paper is to clarify the relation between these two types of programming.

The classical secretary problem and its variants are discussed as finite-horizon Markov decision processes and solved through backward induction methods (e.g., see Gilbert and Mosteller [10], Ross [15] Section I.5, Puterman [16] Section 4.6.4, and Bertsekas [17] Section 3.4 of Vol. I). Variations of infinite-horizon optimal stopping problems are discussed as stationary, infinite-horizon Markov decision processes (e.g., see [15] Section III.2, [16] Section 7.2.8, and [18] Section 4.4 of Vol. II). An LP reformulation is a popular method for stationary, infinite-horizon Markov decision process models [19, 20]. By contrast, an LP formulation is seldom used for finite-horizon Markov decision process models (and/or transient Markov policies) owing to the ease of solving the DP equation. Theoretically, Derman [21, 22] showed that we can deal with a finite-horizon Markov decision process model as a special case of an infinite-horizon Markov decision process model. Some recent results on LP formulations for finite-horizon Markov decision process models appear in [23, 24], for example. In Section 3, we propose a linear programming problem whose unique optimal solution is the solution of the DP equation. We give a direct proof of a transformation (from the DP equation to an LP formulation) in a self-contained manner by restricting to the odds problem.

In Section 4, we describe the odds problem as a finite-horizon Markov decision process and discuss the problem of finding an optimal policy (stopping rule). A straightforward formulation, which is independent of the DP equation, produces a non-linear programming problem. We propose a transformation that converts the non-linear programming problem into a linear programming problem, called a flow formulation. When restricted to the classical secretary problem, our flow formulation yields the LP formulation proposed in [14]. As a main result, we show that our flow formulation is the dual linear programming problem of the LP formulation obtained in Section 3.

The remainder of this paper is organized as follows. In the next section, we provide detailed descriptions of the odds problem and its variants. Section 3 shows that the DP equation gives a unique optimal solution of a certain linear programming problem. In Section 4, we describe a linear programming formulation for determining an optimal policy of the odds problem. We also show the duality of a DP equation and our linear programming formulation.

2 Odds Problem and its Variations

Let X_{1},X_{2},\ldots,X_{n} denote a sequence of independent Bernoulli random variables. If X_{i}=1, we state that the outcome of random variable X_{i} is a success. Otherwise (X_{i}=0), we state that the outcome of X_{i} is a failure. For each i\in\{1,2,\ldots,n\}, the success probability p_{i}=\mbox{\rm Pr}[X_{i}=1] (0<p_{i}<1) is given. We denote the failure probability by q_{i}=\mbox{\rm Pr}[X_{i}=0]=1-p_{i}>0 and the odds by r_{i}=p_{i}/q_{i}. A player observes these random variables sequentially one by one and is allowed to select the variable when observing a success. The odds problem, proposed by Bruss [1], maximizes the probability of selecting the last success.

In this paper, we discuss the odds problem and some variations (see [8] for detailed discussion). We assume that a sequence of rewards (R_{1},R_{2},\ldots,R_{n}) is given and a player receives reward R_{i} when selecting a success X_{i}=1. The aim of the player is to maximize the expected reward. The odds problem proposed by Bruss [1] is obtained by setting the reward R_{i} to the probability that a success X_{i}=1 is the last success, i.e., R_{i}=q_{i+1}\cdot q_{i+2}\cdot\cdots\cdot q_{n} (\forall i\in\{1,2,\ldots,n\}). The classical secretary problem is obtained by setting p_{i}=1/i and R_{i}=i/n (\forall i\in\{1,2,\ldots,n\}).
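As a small illustration (this Python sketch is ours and is not part of the original formulation; the helper names odds_problem_rewards and secretary_instance are hypothetical), the rewards above can be computed directly from the failure probabilities:

def odds_problem_rewards(p):
    # R_i = q_{i+1} * ... * q_n: the probability that a success at stage i is the last success.
    n = len(p)
    q = [1.0 - pi for pi in p]
    R = [0.0] * n
    tail = 1.0                      # running product q_{i+1} ... q_n, built backwards
    for i in range(n - 1, -1, -1):  # 0-based index i corresponds to X_{i+1}
        R[i] = tail
        tail *= q[i]
    return R

def secretary_instance(n):
    # p_i = 1/i and R_i = i/n for the classical secretary problem (1-based i).
    p = [1.0 / i for i in range(1, n + 1)]
    R = [i / n for i in range(1, n + 1)]
    return p, R

For example, secretary_instance(5) returns p = (1, 1/2, 1/3, 1/4, 1/5) and R = (1/5, 2/5, 3/5, 4/5, 1).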

Bruss and Paindaveine [3] discussed a model in which a player wants to select the m-th last success. This problem is obtained by setting

R_{i}=\left(\prod_{j={i+1}}^{n}q_{j}\right)\cdot\left(\sum_{i+1\leq i_{1}<\cdots<i_{m-1}\leq n}r_{i_{1}}r_{i_{2}}\cdots r_{i_{m-1}}\right)\;\;\;(\forall i\in\{1,2,\ldots,n\}).

Tamaki [5] discussed a problem of selecting any of the last m successes. We can express this problem by setting

R_{i}=\left(\prod_{j={i+1}}^{n}q_{j}\right)\cdot\left\{1+\sum_{h=1}^{m-1}\left(\sum_{i+1\leq i_{1}<\cdots<i_{h}\leq n}r_{i_{1}}r_{i_{2}}\cdots r_{i_{h}}\right)\right\}\;\;\;(\forall i\in\{1,2,\ldots,n\}).

Matsui and Ano [8] discussed a problem of selecting k out of the last \ell successes, where 1\leq k\leq\ell<n. Their model includes the above models as special cases. The model is obtained by setting

R_{i}=\left(\prod_{j={i+1}}^{n}q_{j}\right)\cdot\left\{\sum_{h=k-1}^{\ell-1}\left(\sum_{i+1\leq i_{1}<\cdots<i_{h}\leq n}r_{i_{1}}r_{i_{2}}\cdots r_{i_{h}}\right)\right\}\;\;\;(\forall i\in\{1,2,\ldots,n\}).
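The rewards of this general model can be evaluated by brute force. The following sketch (ours, not code from [8]; the helper name general_rewards is hypothetical) sums the elementary symmetric polynomials of the odds r_{i+1},\ldots,r_{n} exactly as in the displayed formula; the enumeration is polynomial only for fixed \ell, but it suffices for small instances:

import math
from itertools import combinations

def general_rewards(p, k, ell):
    # R_i for selecting k out of the last ell successes: the tail product of the q_j
    # times the sum of elementary symmetric polynomials e_h(r_{i+1},...,r_n), h = k-1,...,ell-1.
    n = len(p)
    q = [1.0 - pi for pi in p]
    r = [pi / qi for pi, qi in zip(p, q)]
    R = []
    for i in range(1, n + 1):            # 1-based stage i
        tail_q = math.prod(q[i:])        # q_{i+1} * ... * q_n (empty product = 1)
        tail_r = r[i:]                   # odds r_{i+1}, ..., r_n
        e = sum(math.prod(c)
                for h in range(k - 1, ell)
                for c in combinations(tail_r, h))
        R.append(tail_q * e)
    return R

Setting k = ell = 1 recovers Bruss' rewards, k = 1 and ell = m recovers Tamaki's model, and k = ell = m recovers the model of Bruss and Paindaveine.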

3 Linear Programming Problem for Solving DP Equation

This section describes a DP equation for finding an optimal stopping rule of our problem. We propose a linear programming problem whose unique optimal solution gives the solution to the DP equation.

Let w_{i} be the expected reward under the condition that a player has observed variables X_{1},X_{2},\ldots,X_{i}, did not select X_{i}, and afterward adopts an optimal stopping rule. Then, w_{0} denotes the maximum expected reward when a player adopts an optimal stopping rule. The following recurrence relation,

\begin{aligned} &w_{i-1}=\max\{q_{i}w_{i}+p_{i}R_{i},w_{i}\}&(i\in\{1,2,\ldots,n\}),\\ &w_{n}=0,\\ \end{aligned} \qquad (1)

is called a DP equation, which calculates an optimal stopping rule through backward induction (see [10] for example).
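For concreteness, the backward induction can be written as a few lines of Python (a sketch of ours; 1-based quantities are stored in 0-based lists, and the name solve_dp is hypothetical):

def solve_dp(p, R):
    # Backward induction for DP equation (1): w_{i-1} = max(q_i w_i + p_i R_i, w_i), w_n = 0.
    # Returns the value vector (w_0, ..., w_n) and the stages at which stopping is optimal.
    n = len(p)
    w = [0.0] * (n + 1)
    stop = []
    for i in range(n, 0, -1):                                  # i = n, ..., 1
        take = (1.0 - p[i - 1]) * w[i] + p[i - 1] * R[i - 1]   # select a success at stage i
        w[i - 1] = max(take, w[i])
        if take >= w[i]:
            stop.append(i)
    return w, sorted(stop)

On the classical secretary instance, one can check numerically that w[0] approaches 1/e as n grows and that stopping is optimal exactly on a tail set of stages, in accordance with the familiar threshold rule of roughly n/e.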

Now, we describe a linear programming problem that finds a solution to the above DP equation.

Theorem 3.1.

A linear programming problem

\begin{aligned} \mbox{\rm P: \;\;}\mbox{\rm min.}~{}&w_{0}\\ \mbox{\rm s.t.}~{}&w_{i-1}\geq q_{i}w_{i}+p_{i}R_{i}&(\forall i\in\{1,2,\ldots,n\}),\\ &w_{i-1}\geq w_{i}&(\forall i\in\{1,2,\ldots,n\}),\\ &w_{n}=0,\\ \end{aligned}

has a unique optimal solution \mbox{\boldmath$w$}^{\ast} that satisfies DP Equation (1).

Proof: Because problem P is feasible and its objective function is bounded from below on the feasible region (every feasible solution satisfies w_{0}\geq w_{1}\geq\cdots\geq w_{n}=0), an optimal solution exists. Let \mbox{\boldmath$w$}^{\ast} be an optimal solution of P. Obviously, \mbox{\boldmath$w$}^{\ast} satisfies

w_{i-1}^{\ast}\geq\max\{q_{i}w_{i}^{\ast}+p_{i}R_{i},w_{i}^{\ast}\}\;\;\;(\forall i\in\{1,2,\ldots,n\}).

We prove that \mbox{\boldmath$w$}^{\ast} satisfies DP Equation (1) by contradiction. Assume that there exists an index \hat{i}\in\{1,2,\ldots,n\} that satisfies

w_{\hat{i}-1}^{\ast}>\max\{q_{\hat{i}}w_{\hat{i}}^{\ast}+p_{\hat{i}}R_{\hat{i}},w_{\hat{i}}^{\ast}\}.

We introduce a sufficiently small positive number \varepsilon>0 and a solution (w_{0}^{\prime},w_{1}^{\prime},\ldots,w_{n}^{\prime}) defined by

w_{k}^{\prime}=\left\{\begin{array}{ll}w_{k}^{\ast}&(\forall k\in\{{\hat{i}},\ldots,n\}),\\ w_{k}^{\ast}-\varepsilon^{n-k}&(\forall k\in\{0,1,\ldots,{\hat{i}}-1\}).\\ \end{array}\right.

In the following, we show that (w_{0}^{\prime},w_{1}^{\prime},\ldots,w_{n}^{\prime}) is feasible to P.

For each k\in\{\hat{i}+1,\hat{i}+2,\ldots,n\}, the definition of \mbox{\boldmath$w$}^{\prime} directly implies that

\begin{aligned} w^{\prime}_{k-1}=~{}&w^{\ast}_{k-1}\geq q_{k}w^{\ast}_{k}+p_{k}R_{k}=q_{k}w^{\prime}_{k}+p_{k}R_{k},\;\;\mbox{ and }\\ w^{\prime}_{k-1}=~{}&w^{\ast}_{k-1}\geq w^{\ast}_{k}=w^{\prime}_{k}.\end{aligned}

When k=\hat{i}, w^{\prime}_{\hat{i}-1} satisfies

\begin{aligned} w^{\prime}_{\hat{i}-1}=~{}&w^{\ast}_{\hat{i}-1}-\varepsilon^{n-\hat{i}+1}>q_{\hat{i}}w^{\ast}_{\hat{i}}+p_{\hat{i}}R_{\hat{i}}=q_{\hat{i}}w^{\prime}_{\hat{i}}+p_{\hat{i}}R_{\hat{i}},\;\;\mbox{ and }\\ w^{\prime}_{\hat{i}-1}=~{}&w^{\ast}_{\hat{i}-1}-\varepsilon^{n-\hat{i}+1}>w^{\ast}_{\hat{i}}=w^{\prime}_{\hat{i}},\end{aligned}

as \varepsilon is a sufficiently small positive number and n-\hat{i}+1\geq 1. For each k\in\{1,2,\ldots,\hat{i}-1\}, the inequality q_{k}>0 (so that \varepsilon\leq q_{k} holds for a sufficiently small \varepsilon) implies the following:

\begin{aligned} w^{\prime}_{k-1}=~{}&w^{\ast}_{k-1}-\varepsilon^{n-(k-1)}\geq q_{k}w^{\ast}_{k}+p_{k}R_{k}-\varepsilon^{n-k+1}\geq q_{k}w^{\ast}_{k}+p_{k}R_{k}-q_{k}\varepsilon^{n-k}\\ =~{}&q_{k}w^{\prime}_{k}+p_{k}R_{k},\;\;\mbox{ and }\\ w^{\prime}_{k-1}=~{}&w^{\ast}_{k-1}-\varepsilon^{n-(k-1)}\geq w^{\ast}_{k}-\varepsilon^{n-k+1}\geq w^{\ast}_{k}-\varepsilon^{n-k}=w^{\prime}_{k}.\end{aligned}

From the above, \mbox{\boldmath$w$}^{\prime} is feasible to problem P. Obviously, we have w_{0}^{\prime}<w_{0}^{\ast}, which contradicts the optimality of \mbox{\boldmath$w$}^{\ast}. \Box
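Theorem 3.1 can also be checked numerically. The following sketch (ours, assuming SciPy is available; the encoding of P and the name solve_lp_P are not from the paper) passes problem P to a generic LP solver, and the returned vector can then be compared entry by entry with the values produced by backward induction on DP Equation (1):

import numpy as np
from scipy.optimize import linprog

def solve_lp_P(p, R):
    # Problem P: minimize w_0 subject to w_{i-1} - q_i w_i >= p_i R_i,
    # w_{i-1} - w_i >= 0 (i = 1,...,n), and w_n = 0.  Variables: w_0, ..., w_n.
    n = len(p)
    c = np.zeros(n + 1)
    c[0] = 1.0                                     # objective: w_0
    A_ub, b_ub = [], []                            # linprog expects A_ub @ w <= b_ub
    for i in range(1, n + 1):
        q_i = 1.0 - p[i - 1]
        row = np.zeros(n + 1)
        row[i - 1], row[i] = -1.0, q_i             # -(w_{i-1} - q_i w_i) <= -p_i R_i
        A_ub.append(row); b_ub.append(-p[i - 1] * R[i - 1])
        row = np.zeros(n + 1)
        row[i - 1], row[i] = -1.0, 1.0             # -(w_{i-1} - w_i) <= 0
        A_ub.append(row); b_ub.append(0.0)
    A_eq = np.zeros((1, n + 1)); A_eq[0, n] = 1.0  # w_n = 0
    res = linprog(c, A_ub=np.array(A_ub), b_ub=b_ub,
                  A_eq=A_eq, b_eq=[0.0], bounds=[(None, None)] * (n + 1))
    return res.x                                   # should coincide with the DP values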

4 Flow Formulation and its Duality

4.1 Flow Formulation

In this subsection, we describe the odds problem as a finite-horizon Markov decision process and give a straightforward non-linear programming formulation of a problem for finding an optimal policy. We propose a transformation that converts the problem into a linear programming problem.

In this paper, we represent the record of a game using a sequence of realization values of observed random variables. For example, the record (X_{1},X_{2},X_{3},X_{4})=(1,0,0,1) indicates that the player observed four random variables X_{1},X_{2},X_{3},X_{4} and selected the success X_{4}=1. When a player observes all random variables and fails to select one, we define a record of the game through a sequence of realization values with an additional last component 1, i.e., (X_{1},X_{2},\ldots,X_{n},1). Then, the set of all possible records, denoted by \Xi, becomes

\Xi=\{(\xi_{1},\xi_{2},\ldots,\xi_{i})\in\{0,1\}^{i}\mid i\in\{1,2,\ldots,n+1\},\ \xi_{i}=1\}.

Next, we introduce a finite-horizon Markov decision process that formulates our problem. Let \varphi be a state space defined by \varphi={\cal C}\cup{\cal S}, where {\cal C}=\{c_{0},c_{1},\ldots,c_{n+1}\} and {\cal S}=\{s_{1},s_{2},\ldots,s_{n}\}. The state c_{0} is the initial state of our process. The state c_{n+1} is an absorbing state called a terminal state. For each state c_{i}\;(i\in\{0,1,2,\ldots,n\}), we define the transition probability from c_{i} to a state s\in\varphi by

p_{c_{i},s}=\left\{\begin{array}{ll}p_{i+1}&(\mbox{if }s=s_{i+1}),\\ q_{i+1}&(\mbox{if }s=c_{i+1}),\\ 0&(\mbox{otherwise}),\end{array}\right.\;(\forall i\in\{0,\ldots,n-1\})\quad\mbox{ and }\quad p_{c_{n},s}=\left\{\begin{array}{ll}1&(\mbox{if }s=c_{n+1}),\\ 0&(\mbox{otherwise}).\end{array}\right.

With each state in {\cal S}, we associate the action space {\cal A}=\{\mbox{\tt cont},\mbox{\tt stop}\}. If an action stop is selected at state s_{i}\in{\cal S}, the process moves to the terminal state c_{n+1} and generates a reward R_{i}. Otherwise, an action cont is selected and the process moves from s_{i} to c_{i} without a reward. A (probabilistic) policy is an n-dimensional vector \pi=(\pi_{1},\pi_{2},\ldots,\pi_{n})\in[0,1]^{n}, where \pi_{i} denotes the probability that an action cont is selected at state s_{i}. Our goal is to find an optimal policy \pi^{*} such that the expected reward is maximized.
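The dynamics can be made concrete by simulation. The following Monte-Carlo sketch (ours; the name simulate_expected_reward is hypothetical) runs the process under a given policy, where pi[i] is the probability of action cont at state s_{i+1} in 0-based indexing, and estimates its expected reward:

import random

def simulate_expected_reward(p, R, pi, trials=100_000, seed=0):
    # Estimate the expected reward of policy pi by simulating the Markov decision process.
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        for i in range(len(p)):
            if rng.random() < p[i]:           # X_{i+1} is a success: the process enters s_{i+1}
                if rng.random() >= pi[i]:     # action stop is chosen with probability 1 - pi_{i+1}
                    total += R[i]             # collect reward R_{i+1}, move to the terminal state
                    break
    return total / trials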

Given a policy \pi, the occurrence probability of a record \xi=(\xi_{1},\xi_{2},\ldots,\xi_{i})\in\Xi, denoted by \mbox{\rm Pr}(\xi\mid\pi), satisfies

\mbox{\rm Pr}(\xi\mid\pi)=\alpha_{1}\alpha_{2}\cdots\alpha_{i}

where

\alpha_{j}=\left\{\begin{array}{ll}q_{j}&(\mbox{if }\xi_{j}=0),\\ p_{j}\pi_{j}&(\mbox{if }\xi_{j}=1),\end{array}\right.\;(\forall j\in\{1,2,\ldots,i-1\})

and

\alpha_{i}=\left\{\begin{array}{ll}p_{i}(1-\pi_{i})&(\mbox{if }i\leq n),\\ 1&(\mbox{if }i=n+1).\end{array}\right.

For each i\in\{1,2,\ldots,n+1\}, we define

\Xi_{i}=\{\xi\in\Xi\mid\mbox{the length of $\xi$ is equal to $i$}\}.

Obviously, the occurrence probabilities \mbox{\rm Pr}(\xi\mid\pi)\;(\xi\in\Xi_{i}) satisfy

\sum_{\xi\in\Xi_{i}}\mbox{\rm Pr}(\xi\mid\pi)=\left(\prod_{j=1}^{i-1}(q_{j}+p_{j}\pi_{j})\right)p_{i}(1-\pi_{i})

for each i\in\{1,2,\ldots,n\}. The expected reward with respect to \pi is equal to

\sum_{i=1}^{n}\left(R_{i}\sum_{\xi\in\Xi_{i}}\mbox{\rm Pr}(\xi\mid\pi)\right).
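For small n, the record probabilities can be enumerated directly; the following brute-force sketch (ours, exponential in n, with a hypothetical name) sums R_{i}\,\mbox{\rm Pr}(\xi\mid\pi) over all records and therefore verifies both the displayed sum over \Xi_{i} and the expected-reward expression:

from itertools import product

def expected_reward_by_enumeration(p, R, pi):
    # Enumerate the records in Xi_1, ..., Xi_n and accumulate R_i * Pr(xi | pi).
    n = len(p)
    q = [1.0 - x for x in p]
    total = 0.0
    for i in range(1, n + 1):                     # records ending with a selected success at stage i
        prob_i = 0.0
        for prefix in product([0, 1], repeat=i - 1):
            pr = 1.0
            for j, xi_j in enumerate(prefix):     # alpha_j for j = 1, ..., i-1
                pr *= q[j] if xi_j == 0 else p[j] * pi[j]
            prob_i += pr * p[i - 1] * (1.0 - pi[i - 1])   # alpha_i
        total += R[i - 1] * prob_i
    return total

Its output agrees, up to sampling error, with the Monte-Carlo estimate sketched above.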

We can then formulate the problem of maximizing the expected reward as

\begin{aligned} \mbox{Q: }\;\;\mbox{max.}~{}&\sum_{i=1}^{n}R_{i}y_{i}\\ \mbox{s.t.}~{}&y_{i}=\left(\prod_{j=1}^{i-1}(q_{j}+p_{j}\pi_{j})\right)p_{i}(1-\pi_{i})&(\forall i\in\{1,2,\ldots,n\}),\\ &0\leq\pi_{i}\leq 1&(\forall i\in\{1,2,\ldots,n\}).\\ \end{aligned}

In the following, we transform the above problem into a linear programming problem. The following theorem provides the key idea of our transformation.

Theorem 4.1.

Let \mbox{\boldmath$y$}=(y_{1},y_{2},\ldots,y_{n}) be an n-dimensional real vector. The following two conditions are equivalent.

(c1)

There exists \pi=(\pi_{1},\pi_{2},\ldots,\pi_{n}) such that (\mbox{\boldmath$y$},\pi) is feasible for problem Q.

(c2)

There exists \mbox{\boldmath$z$}=(z_{0},z_{1},\ldots,z_{n}) such that (\mbox{\boldmath$y$},\mbox{\boldmath$z$}) satisfies the following linear inequality system:

\begin{aligned} &y_{i}\leq p_{i}z_{i-1}&(\forall i\in\{1,2,\ldots,n\}),\\ &y_{i}+z_{i}=z_{i-1}&(\forall i\in\{1,2,\ldots,n\}),\\ &z_{0}=1,\\ &y_{i}\geq 0&(\forall i\in\{1,2,\ldots,n\}).\\ \end{aligned} \qquad (2)

Proof of (c1) \rightarrow (c2). Let (\mbox{\boldmath$y$},\pi) be a feasible solution to problem Q. Obviously, we have \mbox{\boldmath$y$}\geq\mbox{\boldmath$0$}. Set z_{0}=1 and z_{i}=\prod_{j=1}^{i}(q_{j}+p_{j}\pi_{j}) (\forall i\in\{1,2,\ldots,n\}). We then obtain, for each i\in\{1,2,\ldots,n\},

\begin{aligned} z_{i-1}-z_{i}&=\prod_{j=1}^{i-1}(q_{j}+p_{j}\pi_{j})-\prod_{j=1}^{i}(q_{j}+p_{j}\pi_{j})\\ &=\left(\prod_{j=1}^{i-1}(q_{j}+p_{j}\pi_{j})\right)\left(1-(q_{i}+p_{i}\pi_{i})\right)=\left(\prod_{j=1}^{i-1}(q_{j}+p_{j}\pi_{j})\right)p_{i}(1-\pi_{i})=y_{i}\end{aligned}

and

\begin{aligned} p_{i}z_{i-1}-y_{i}&=p_{i}\prod_{j=1}^{i-1}(q_{j}+p_{j}\pi_{j})-\left(\prod_{j=1}^{i-1}(q_{j}+p_{j}\pi_{j})\right)p_{i}(1-\pi_{i})\\ &=\left(\prod_{j=1}^{i-1}(q_{j}+p_{j}\pi_{j})\right)(p_{i}-p_{i}(1-\pi_{i}))=\left(\prod_{j=1}^{i-1}(q_{j}+p_{j}\pi_{j})\right)p_{i}\pi_{i}\geq 0.\end{aligned}

Thus, (\mbox{\boldmath$y$},\mbox{\boldmath$z$}) satisfies the inequality system (2).

Proof of (c2) \rightarrow (c1). Assume that (\mbox{\boldmath$y$},\mbox{\boldmath$z$}) satisfies the inequality system (2). First, we show z_{i}>0 by induction on i. Clearly, z_{0}=1>0, and z_{i-1}>0 implies that z_{i}=z_{i-1}-y_{i}\geq z_{i-1}-p_{i}z_{i-1}=z_{i-1}(1-p_{i})>0. Since p_{i}>0 and z_{i-1}>0, we can define

\pi_{i}=1-y_{i}/(p_{i}z_{i-1})\;\;\;(\forall i\in\{1,2,\ldots,n\}). \qquad (3)

We then have the following inequalities:

1\geq\pi_{i}=1-\frac{y_{i}}{p_{i}z_{i-1}}=\frac{p_{i}z_{i-1}-y_{i}}{p_{i}z_{i-1}}\geq 0\;\;\;(\forall i\in\{1,2,\ldots,n\}).

It is easy to see that for each i\in\{1,2,\ldots,n\},

\begin{aligned} z_{i}&=z_{i-1}-y_{i}=(q_{i}+p_{i})z_{i-1}-y_{i}=q_{i}z_{i-1}+(p_{i}z_{i-1}-y_{i})\\ &=q_{i}z_{i-1}+p_{i}z_{i-1}\pi_{i}=(q_{i}+p_{i}\pi_{i})z_{i-1}\\ &=(q_{i}+p_{i}\pi_{i})(q_{i-1}+p_{i-1}\pi_{i-1})\cdots(q_{1}+p_{1}\pi_{1})z_{0}=\prod_{j=1}^{i}(q_{j}+p_{j}\pi_{j}),\end{aligned}

and thus,

y_{i}=p_{i}z_{i-1}-p_{i}\pi_{i}z_{i-1}=z_{i-1}p_{i}(1-\pi_{i})=\left(\prod_{j=1}^{i-1}(q_{j}+p_{j}\pi_{j})\right)p_{i}(1-\pi_{i}).

From the above, (\mbox{\boldmath$y$},\pi) is feasible to problem Q. \Box

By employing the above theorem, we can transform problem Q into the following linear programming problem:

\begin{aligned} \mbox{FF: }\;\;\mbox{max.}~{}&\sum_{i=1}^{n}R_{i}y_{i}\\ \mbox{s.t.}~{}&\frac{y_{i}}{p_{i}}-z_{i-1}\leq 0&(\forall i\in\{1,2,\ldots,n\}),\\ &y_{i}+z_{i}-z_{i-1}=0&(\forall i\in\{1,2,\ldots,n\}),\qquad(4)\\ &z_{0}=1,\qquad(5)\\ &y_{i}\geq 0&(\forall i\in\{1,2,\ldots,n\}).\end{aligned}

We call the above a flow formulation. Given an optimal solution to FF, Equation (3) recovers a policy \pi that is optimal for problem Q.
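As with problem P, the flow formulation can be handed to a generic LP solver. The sketch below (ours, assuming SciPy; the variable ordering (y_{1},\ldots,y_{n},z_{0},\ldots,z_{n}) and the name solve_flow_formulation are assumptions) solves FF and recovers a policy through Equation (3):

import numpy as np
from scipy.optimize import linprog

def solve_flow_formulation(p, R):
    # Variables are ordered (y_1, ..., y_n, z_0, ..., z_n).
    n = len(p)
    nvar = 2 * n + 1
    c = np.zeros(nvar)
    c[:n] = -np.array(R)                           # maximize sum R_i y_i  ->  minimize the negative
    A_ub = np.zeros((n, nvar)); b_ub = np.zeros(n)
    A_eq = np.zeros((n + 1, nvar)); b_eq = np.zeros(n + 1)
    for i in range(1, n + 1):
        A_ub[i - 1, i - 1] = 1.0 / p[i - 1]        # y_i / p_i - z_{i-1} <= 0
        A_ub[i - 1, n + i - 1] = -1.0
        A_eq[i - 1, i - 1] = 1.0                   # y_i + z_i - z_{i-1} = 0   (constraint (4))
        A_eq[i - 1, n + i] = 1.0
        A_eq[i - 1, n + i - 1] = -1.0
    A_eq[n, n] = 1.0; b_eq[n] = 1.0                # z_0 = 1                   (constraint (5))
    bounds = [(0, None)] * n + [(None, None)] * (n + 1)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    y, z = res.x[:n], res.x[n:]
    policy = [1.0 - y[i] / (p[i] * z[i]) for i in range(n)]   # Equation (3)
    return -res.fun, y, z, policy                  # optimal expected reward, flow values, policy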

In the following, we interpret the FF problem as a flow problem on a digraph. Let G=(V,E) be a digraph with a vertex set V=\{s_{0},s_{1},\ldots,s_{n}\}\cup\{t_{1},t_{2},\ldots,t_{n}\} and a directed edge set E=\{(s_{i-1},s_{i})\mid i\in\{1,2,\ldots,n\}\}\cup\{(s_{i-1},t_{i})\mid i\in\{1,2,\ldots,n\}\}. For each i\in\{1,2,\ldots,n\}, we associate variables z_{i} and y_{i} with edges (s_{i-1},s_{i}) and (s_{i-1},t_{i}), respectively. If we regard these variables as a flow on a corresponding directed edge, constraint (4) represents a flow conservation law at vertex s_{i-1}, and constraint (5) states that a flow of volume 1 emanates from vertex s_{0}. Because y_{i}=\sum_{\xi\in\Xi_{i}}\mbox{\rm Pr}(\xi\mid\pi), vertex t_{i} corresponds to an event in which a player selects a success X_{i}=1.

Figure 1: Digraph G, where n=5. The source s_{0} (with z_{0}=1) is followed by s_{1},\ldots,s_{5}; each edge (s_{i-1},s_{i}) carries z_{i}, and each edge (s_{i-1},t_{i}) carries y_{i}, where t_{i} corresponds to the event "select X_{i}".

4.2 Flow Formulation for Classical Secretary Problem

Buchbinder, Jain, and Singh [14] proposed the following LP formulation for the classical secretary problem:

\begin{aligned} \mbox{max.}~{}&\sum_{i=1}^{n}(i/n)y_{i}\\ \mbox{s.t.}~{}&i\cdot y_{i}\leq 1-\sum_{k=1}^{i-1}y_{k}&(\forall i\in\{1,2,\ldots,n\}),\\ &y_{i}\geq 0&(\forall i\in\{1,2,\ldots,n\}).\\ \end{aligned} \qquad (6)

Their formulation (6) is obtained from FF by eliminating the variables \{z_{1},z_{2},\ldots,z_{n}\} through the substitutions z_{i}=1-(y_{1}+y_{2}+\cdots+y_{i}) and setting p_{i}=1/i and R_{i}=i/n (\forall i\in\{1,2,\ldots,n\}).
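As a sanity check (our own sketch, assuming SciPy; the name secretary_lp is hypothetical), LP (6) can be built and solved directly:

import numpy as np
from scipy.optimize import linprog

def secretary_lp(n):
    # LP (6): maximize sum (i/n) y_i subject to i * y_i + sum_{k < i} y_k <= 1 and y >= 0.
    c = -np.array([i / n for i in range(1, n + 1)])   # maximization written as a minimization
    A_ub = np.zeros((n, n)); b_ub = np.ones(n)
    for i in range(1, n + 1):
        A_ub[i - 1, :i - 1] = 1.0                      # sum_{k=1}^{i-1} y_k
        A_ub[i - 1, i - 1] = float(i)                  # + i * y_i <= 1
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * n)
    return -res.fun, res.x

One can check numerically that the optimal value tends to 1/e as n grows (it is roughly 0.37 for n = 100) and that the positive entries of an optimal y appear only beyond roughly n/e, consistent with the classical threshold rule.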

4.3 Dual of Flow Formulation

The linear programming dual of FF is

\begin{aligned} \mbox{P1: }\;\;\mbox{min.}~{}&w_{0}\\ \mbox{s.t.}~{}&w_{i}+\frac{\alpha_{i}}{p_{i}}\geq R_{i}&(\forall i\in\{1,2,\ldots,n\}),\qquad(7)\\ &w_{i-1}-w_{i}-\alpha_{i}=0&(\forall i\in\{1,2,\ldots,n\}),\\ &w_{n}=0,\\ &\alpha_{i}\geq 0&(\forall i\in\{1,2,\ldots,n\}).\qquad(8)\end{aligned}

We eliminate the variables \{\alpha_{1},\alpha_{2},\ldots,\alpha_{n}\} by substituting \alpha_{i}=w_{i-1}-w_{i}. Constraint (7) is transformed as follows:

\begin{aligned} w_{i}+\frac{\alpha_{i}}{p_{i}}&\geq R_{i},\\ p_{i}w_{i}+(w_{i-1}-w_{i})&\geq p_{i}R_{i},\\ w_{i-1}&\geq q_{i}w_{i}+p_{i}R_{i}.\\ \end{aligned}

The non-negativity constraints \alpha_{i}\geq 0 in (8) imply the following:

w_{i-1}\geq w_{i}\;\;\;(\forall i\in\{1,2,\ldots,n\}).

Thus, the above procedure converts problem P1 into problem P.
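The duality can also be verified numerically on small instances by reusing the earlier sketches (odds_problem_rewards, solve_dp, and solve_flow_formulation are the hypothetical helpers introduced above): by LP duality, the optimal value of FF coincides with the value w_{0} of problem P, that is, with the solution of DP Equation (1).

import random

def check_duality(n=8, seed=1, tol=1e-8):
    # Compare the optimum of the flow formulation FF with w_0 obtained from DP equation (1).
    rng = random.Random(seed)
    p = [rng.uniform(0.05, 0.95) for _ in range(n)]
    R = odds_problem_rewards(p)                        # Bruss' rewards (earlier sketch)
    w, _ = solve_dp(p, R)                              # solution of DP equation (1), optimum of P
    value_ff, _, _, _ = solve_flow_formulation(p, R)   # optimum of FF
    return abs(w[0] - value_ff) < tol                  # strong duality: the two optima coincide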

5 Conclusion

This paper described a linear programming problem whose unique optimal solution is the solution to the DP equation for the odds problem. We proposed a flow formulation that maximizes the expected reward of the odds problem based on a finite-horizon Markov decision process. Furthermore, we showed the duality of these two formulations.

References

  • [1] F. T. Bruss, Sum the odds to one and stop, Annals of Probability 28 (3) (2000) 1384–1391.
  • [2] K. Ano, H. Kakinuma, N. Miyoshi, Odds theorem with multiple selection chances, Journal of Applied Probability 47 (4) (2010) 1093–1104.
  • [3] F. T. Bruss, D. Paindaveine, Selecting a sequence of last successes in independent trials, Journal of Applied Probability 37 (2) (2000) 389–399.
  • [4] A. Kurushima, K. Ano, Multiple stopping odds problem in Bernoulli trials with random number of observations, Mathematica Applicanda 44 (1) (2016) 209–220.
  • [5] M. Tamaki, Sum the multiplicative odds to one and stop, Journal of Applied Probability 47 (3) (2010) 761–777.
  • [6] T. Matsui, K. Ano, A note on a lower bound for the multiplicative odds theorem of optimal stopping, Journal of Applied Probability 51 (3) (2014) 885–889.
  • [7] T. Matsui, K. Ano, Lower bounds for Bruss’ odds problem with multiple stoppings, Mathematics of Operations Research 41 (2) (2016) 700–714.
  • [8] T. Matsui, K. Ano, Compare the ratio of symmetric polynomials of odds to one and stop, Journal of Applied Probability 54 (1) (2017) 12.
  • [9] F. T. Bruss, A note on bounds for the odds theorem of optimal stopping, Annals of Probability 31 (4) (2003) 1859–1961.
  • [10] J. P. Gilbert, F. Mosteller, Recognizing the maximum of a sequence, Journal of the American Statistical Association 61 (313) (1966) 35–73.
  • [11] A. Cayley, Mathematical questions with their solutions, The Educational Times 23 (1875) 18–19.
  • [12] L. Moser, On a problem of Cayley, Scripta Math 22 (1956) 289–292.
  • [13] T. S. Ferguson, Who solved the secretary problem?, Statistical Science 4 (3) (1989) 282–296.
  • [14] N. Buchbinder, K. Jain, M. Singh, Secretary problems via linear programming, Mathematics of Operations Research 39 (1) (2014) 190–206.
  • [15] S. M. Ross, Introduction to stochastic dynamic programming, Academic Press, 2014.
  • [16] M. L. Puterman, Markov decision processes: discrete stochastic dynamic programming, John Wiley & Sons, 2014.
  • [17] D. P. Bertsekas, Dynamic programming and optimal control 4th edition, volume I, Athena Scientific, 2017.
  • [18] D. P. Bertsekas, Dynamic programming and optimal control 4th edition, volume II, Athena Scientific, 2012.
  • [19] A. S. Manne, Linear programming and sequential decisions, Management Science 6 (3) (1960) 259–267.
  • [20] L. C. M. Kallenberg, Survey of linear programming for standard and nonstandard Markovian control problems. Part I: Theory, Zeitschrift für Operations Research 40 (1) (1994) 1–42.
  • [21] C. Derman, On sequential decisions and Markov chains, Management Science 9 (1) (1962) 16–24.
  • [22] C. Derman, M. Klein, Some remarks on finite horizon Markovian decision models, Operations Research 13 (2) (1965) 272–278.
  • [23] A. Bhattacharya, J. P. Kharoufeh, Linear programming formulation for non-stationary, finite-horizon Markov decision process models, Operations Research Letters 45 (6) (2017) 570–574.
  • [24] M. Mundhenk, J. Goldsmith, C. Lusena, E. Allender, Complexity of finite-horizon Markov decision process problems, Journal of the ACM (JACM) 47 (4) (2000) 681–720.