Dynamic Programming and Linear Programming
for Odds Problem
This work was supported by JSPS KAKENHI Grant Numbers JP26285045, JP26242027, JP20K04973. We thank Katsunori Ano, Naoto Miyoshi, and Akifumi Kira for extensive discussions.
Abstract
This paper discusses the odds problem, proposed by Bruss in 2000, and its variants. A recurrence relation called a dynamic programming (DP) equation is used to find an optimal stopping policy of the odds problem and its variants. In 2013, Buchbinder, Jain, and Singh proposed a linear programming (LP) formulation for finding an optimal stopping policy of the classical secretary problem, which is a special case of the odds problem. The proposed linear programming problem, which maximizes the probability of a win, differs from the long-known DP equations. This paper shows that an ordinary DP equation can be obtained as a modification of the dual of a linear programming problem that includes, as a special case, the LP formulation proposed by Buchbinder, Jain, and Singh.
Keywords: Odds problem, dynamic programming, linear programming, Markov decision process
2010 Mathematics Subject Classification: Primary 60G40, Secondary 62L15
1 Introduction
The odds problem is an optimal stopping problem proposed by Bruss [1], which includes the classical secretary problem as a special case. Some variants of the odds problem are discussed in [2, 3, 4, 5, 6, 7, 8]. Although the odds theorem shown in [1] and [9] gives an optimal policy directly, it is well known that a backward induction method finds an optimal policy for most finite optimal stopping problems [10]. The backward induction method was used for the first time by Cayley [11, 12, 13]. A recurrence relation appearing in the backward induction method is called a dynamic programming (DP) equation. Recently, Buchbinder, Jain, and Singh [14] proposed a linear programming (LP) formulation for the classical secretary problem. Their LP formulation does not depend on the backward induction method and is essentially different from the DP equation. The purpose of this paper is to clarify the relation between these two types of programming.
The classical secretary problem and its variants are discussed as finite-horizon Markov decision processes and solved through backward induction methods (e.g., see Gilbert and Mosteller [10], Ross [15] Section I.5, Puterman [16] Section 4.6.4, and Bertsekas [17] Section 3.4 of Vol. I). Variations of infinite-horizon optimal stopping problems are discussed as stationary, infinite-horizon Markov decision processes (e.g., see [15] Section III.2, [16] Section 7.2.8, and [18] Section 4.4 of Vol. II). An LP reformulation is a popular method for stationary, infinite-horizon Markov decision process models [19, 20]. By contrast, an LP formulation is seldom used for finite-horizon Markov decision process models (and/or transient Markov policies) owing to the ease of solving the DP equation. Theoretically, Derman [21, 22] showed that we can deal with a finite-horizon Markov decision process model as a special case of an infinite-horizon Markov decision process model. Some recent results on LP formulations for finite-horizon Markov decision process models appear in [23, 24], for example. In Section 3, we propose a linear programming problem whose unique optimal solution is the solution of the DP equation. We give a direct proof of a transformation (from the DP equation to an LP formulation) in a self-contained manner by restricting to the odds problem.
In Section 4, we describe the odds problem as a finite-horizon Markov decision process and discuss a problem of finding an optimal policy (stopping rule). A straightforward formulation, which is independent of the DP equation, produces a non-linear programming problem. We propose a transformation that converts the non-linear programming problem into a linear programming problem, called a flow formulation. By restricting to the classical secretary problem, our flow formulation gives the LP formulation proposed in [14]. As a main result, we show that our flow formulation is the dual linear programming problem of the LP formulation obtained in Section 3.
The remainder of this paper is organized as follows. In the next section, we provide detailed descriptions of the odds problem and its variants. Section 3 shows that the DP equation gives a unique optimal solution of a certain linear programming problem. In Section 4, we describe a linear programming formulation for determining an optimal policy of the odds problem. We also show the duality of a DP equation and our linear programming formulation.
2 Odds Problem and its Variations
Let $X_1, X_2, \ldots, X_n$ denote a sequence of independent Bernoulli random variables. If $X_i = 1$, we say that the outcome of random variable $X_i$ is a success. Otherwise ($X_i = 0$), we say that the outcome of $X_i$ is a failure. For each $i \in \{1, 2, \ldots, n\}$, the success probability $p_i = \Pr[X_i = 1]$ is given. We denote the failure probability by $q_i = 1 - p_i$ and the odds by $r_i = p_i / q_i$. A player observes these random variables sequentially one by one and is allowed to select the variable when observing a success. The odds problem, proposed by Bruss [1], maximizes the probability of selecting the last success.
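For illustration, the following sketch implements the sum-the-odds rule of [1] mentioned in the introduction (the function name and the example instance are ours, and the code assumes $p_i < 1$ so that every odds value is finite):

def odds_theorem(p):
    """Bruss's sum-the-odds rule: stop at the first success X_i with i >= s,
    where s is the largest index such that r_s + ... + r_n >= 1
    (s = 1 if the total sum of odds is below 1).  Returns (s, win_probability).

    p : success probabilities p_1, ..., p_n with 0 <= p_i < 1.
    """
    n = len(p)
    q = [1.0 - pi for pi in p]                   # failure probabilities
    r = [pi / qi for pi, qi in zip(p, q)]        # odds r_i = p_i / q_i
    odds_sum, s = 0.0, 1
    for i in range(n, 0, -1):                    # i = n, n-1, ..., 1
        odds_sum += r[i - 1]
        if odds_sum >= 1.0:
            s = i
            break
    prod_q, sum_r = 1.0, 0.0
    for j in range(s, n + 1):
        prod_q *= q[j - 1]
        sum_r += r[j - 1]
    return s, prod_q * sum_r                     # win prob = (prod q_j) * (sum r_j)

print(odds_theorem([0.2] * 12))                  # threshold s = 9, win probability 0.4096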
In this paper, we discuss the odds problem and some variations (see [8] for a detailed discussion). We assume that a sequence of rewards $(a_1, a_2, \ldots, a_n)$ is given and a player receives reward $a_i$ when selecting a success $X_i$. The aim of the player is to maximize the expected reward. The odds problem proposed by Bruss [1] is obtained by setting the reward to the probability that a success $X_i$ is the last success, i.e., $a_i = q_{i+1} q_{i+2} \cdots q_n$. The classical secretary problem is obtained by setting $p_i = 1/i$ and $a_i = i/n$ for each $i \in \{1, 2, \ldots, n\}$.
Bruss and Paindaveine [3] discussed a model in which a player wants to select the $k$-th last success. This problem is obtained by setting the reward $a_i$ to the probability that exactly $k-1$ successes appear among $X_{i+1}, X_{i+2}, \ldots, X_n$.
Tamaki [5] discussed a problem of selecting any of the last $m$ successes. We can express this problem by setting the reward $a_i$ to the probability that at most $m-1$ successes appear among $X_{i+1}, X_{i+2}, \ldots, X_n$.
Matsui and Ano [8] discussed a problem of selecting one of the $k$-th through $\ell$-th last successes, where $1 \le k \le \ell$. Their model includes the above models as special cases. The model is obtained by setting the reward $a_i$ to the probability that at least $k-1$ and at most $\ell-1$ successes appear among $X_{i+1}, X_{i+2}, \ldots, X_n$.
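Under the reward convention above, these variants differ only in the choice of $a_i$. The following sketch (a rough illustration under our reading of the models; the function names are ours) computes $a_i$ by conditioning on the number of successes among $X_{i+1}, \ldots, X_n$:

def tail_success_distribution(p, i):
    """d[m] = probability that exactly m successes occur among X_{i+1}, ..., X_n."""
    d = [1.0]
    for pj in p[i:]:                         # success probabilities p_{i+1}, ..., p_n
        new = [0.0] * (len(d) + 1)
        for m, prob in enumerate(d):
            new[m] += prob * (1.0 - pj)      # X_j is a failure
            new[m + 1] += prob * pj          # X_j is a success
        d = new
    return d

def rewards(p, k, ell):
    """a_i = probability that between k-1 and ell-1 successes occur after time i.

    k = ell = 1 recovers the rewards of Bruss's odds problem, k = 1 those of
    Tamaki's model, and k = ell those of Bruss and Paindaveine's model.
    """
    a = []
    for i in range(1, len(p) + 1):
        d = tail_success_distribution(p, i)
        a.append(sum(d[m] for m in range(k - 1, min(ell, len(d)))))
    return a

# Secretary-type instance (p_i = 1/i) with n = 5 and k = ell = 1:
# a_i = q_{i+1} * ... * q_n = i/n.
print(rewards([1.0 / i for i in range(1, 6)], k=1, ell=1))   # approx. [0.2, 0.4, 0.6, 0.8, 1.0]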
3 Linear Programming Problem for Solving DP Equation
This section describes a DP equation for finding an optimal stopping rule of our problem. We propose a linear programming problem whose unique optimal solution gives the solution to the DP equation.
Let $v_i$ be the expected reward under the condition in which a player observes variables $X_1, X_2, \ldots, X_i$; does not select any of them; and afterward adopts an optimal stopping rule. Then, $v_0$ denotes the maximum expected reward when a player adopts an optimal stopping rule. The following recurrence relation,
$$ v_n = 0, \qquad v_{i-1} = p_i \max\{a_i,\, v_i\} + q_i v_i \quad (i = n, n-1, \ldots, 1), \qquad (1) $$
is called a DP equation; it yields an optimal stopping rule through backward induction (see [10], for example).
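For concreteness, a minimal backward-induction sketch of Equation (1) follows (the function name is ours); it also records the indices at which selecting a success is optimal.

def solve_dp(p, a):
    """Backward induction for DP Equation (1).

    Returns (v, stop): v[i] is the optimal expected reward after observing
    X_1, ..., X_i without selecting; stop[i] is True if selecting the
    success X_i is optimal (a_i >= v_i).
    """
    n = len(p)
    v = [0.0] * (n + 1)                      # v[n] = 0
    stop = [False] * (n + 1)
    for i in range(n, 0, -1):                # i = n, n-1, ..., 1
        stop[i] = a[i - 1] >= v[i]
        v[i - 1] = p[i - 1] * max(a[i - 1], v[i]) + (1.0 - p[i - 1]) * v[i]
    return v, stop

# Classical secretary problem with n = 10 (p_i = 1/i, a_i = i/n):
# v[0] is the optimal probability of selecting the best candidate.
n = 10
p = [1.0 / i for i in range(1, n + 1)]
a = [i / n for i in range(1, n + 1)]
v, stop = solve_dp(p, a)
print(v[0])                                  # approx. 0.3987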
Now, we describe a linear programming problem that finds a solution to the above DP equation.
Theorem 3.1.
A linear programming problem
has a unique optimal solution that satisfies DP Equation (1).
Proof: Because problem P has a feasible solution and its objective function is non-negative over the feasible region, an optimal solution exists. Let be an optimal solution of P. Obviously, satisfies
We prove that satisfies DP Equation (1) by contradiction. Assume that there exists an index that satisfies
We introduce a sufficiently small positive number and a solution defined by
In the following, we show that is feasible to P.
For each the definition of directly implies that
When , satisfies
as is a sufficiently small positive number and . For each , the inequality implies the following:
From the above, is feasible to problem P. Obviously, we have , which contradicts the optimality of .
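To illustrate how a DP equation such as (1) can be recovered from a linear program, the following sketch minimizes $\sum_{i=0}^{n} v_i$ subject to the linearized constraints $v_{i-1} \ge p_i a_i + q_i v_i$ and $v_{i-1} \ge v_i$ with $v_n \ge 0$; this is one standard way of writing such an LP and is offered only as an assumed form for illustration, not as the exact statement of problem P. Its optimal solution can be compared with backward induction.

import numpy as np
from scipy.optimize import linprog

def solve_dp_lp(p, a):
    """Solve an LP whose optimum reproduces the values of DP Equation (1)
    (an assumed form; see the remark above):
        minimize v_0 + ... + v_n
        subject to v_{i-1} >= p_i a_i + q_i v_i and v_{i-1} >= v_i (i = 1, ..., n),
                   v_n >= 0.
    """
    n = len(p)
    c = np.ones(n + 1)
    A_ub, b_ub = [], []
    for i in range(1, n + 1):
        q_i = 1.0 - p[i - 1]
        row = np.zeros(n + 1)
        row[i - 1], row[i] = -1.0, q_i       # -v_{i-1} + q_i v_i <= -p_i a_i
        A_ub.append(row)
        b_ub.append(-p[i - 1] * a[i - 1])
        row = np.zeros(n + 1)
        row[i - 1], row[i] = -1.0, 1.0       # -v_{i-1} + v_i <= 0
        A_ub.append(row)
        b_ub.append(0.0)
    bounds = [(None, None)] * n + [(0.0, None)]   # only v_n is sign-constrained
    res = linprog(c, A_ub=np.array(A_ub), b_ub=b_ub, bounds=bounds)
    return res.x

n = 10
p = [1.0 / i for i in range(1, n + 1)]
a = [i / n for i in range(1, n + 1)]
print(solve_dp_lp(p, a)[0])                  # approx. 0.3987, matching backward induction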
4 Flow Formulation and its Duality
4.1 Flow Formulation
In this subsection, we describe the odds problem as a finite-horizon Markov decision process and give a straightforward non-linear programming formulation of a problem for finding an optimal policy. We propose a transformation that converts the problem into a linear programming problem.
In this paper, we represent the record of a game using a sequence of realization values of observed random variables. For example, the record indicates that the player observed four random variables and selected the success . When a player observes all random variables and fails to select one, we define a record of the game through a sequence of realization values with an additional last component 1, i.e., . Then, the set of all possible records, denoted by , becomes
Next, we introduce a finite-horizon Markov decision process that formulates our problem. Let be a state space defined by , where and . The state is the initial state of our process. The state is an absorbing state called a terminal state. For each state , we define a transition probability from to a state by
For each state in , we associate action space . If an action stop is selected at state , the process moves to the terminal state and generates a reward . Otherwise, an action cont is selected and the process moves from to without a reward. A (probabilistic) policy is an -dimensional vector , where denotes a probability that an action cont is selected at state . Our goal is to find an optimal policy such that the expected reward is maximized.
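A small Monte Carlo sketch of one play of this process under a given policy follows (the helper name and the threshold-policy example are ours); at a success at time $i$ the player continues with probability $\pi_i$ and otherwise stops and receives $a_i$.

import random

def simulate(p, a, pi):
    """One play: X_i = 1 with probability p_i; at a success the player
    continues with probability pi[i-1] and otherwise stops, receiving a_i."""
    for i in range(1, len(p) + 1):
        if random.random() < p[i - 1]:           # success at time i
            if random.random() >= pi[i - 1]:     # action stop is chosen
                return a[i - 1]
    return 0.0                                   # no selection was made

# Threshold policy for the secretary instance (skip the first 3 positions,
# then stop at the next success); the estimate is close to 0.3987.
n = 10
p = [1.0 / i for i in range(1, n + 1)]
a = [i / n for i in range(1, n + 1)]
pi = [1.0 if i <= 3 else 0.0 for i in range(1, n + 1)]
samples = 200_000
print(sum(simulate(p, a, pi) for _ in range(samples)) / samples)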
Given a policy , the occurrence probability of a record , denoted by , satisfies
where
and
For each , we define
Obviously, the occurrence probabilities satisfy
for each . The expected reward with respect to is equal to
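The same quantities can be computed exactly: writing $u_i$ for the probability of reaching time $i$ without a selection, the probability of selecting the success $X_i$ is $u_i\, p_i\, (1 - \pi_i)$ and $u_{i+1} = u_i\,(1 - p_i(1 - \pi_i))$. A short sketch (function name ours):

def expected_reward(p, a, pi):
    """Exact expected reward of policy pi, together with the selection
    probabilities x[i-1] = Pr[the player selects the success X_i]."""
    u, total, x = 1.0, 0.0, []
    for p_i, a_i, pi_i in zip(p, a, pi):
        x_i = u * p_i * (1.0 - pi_i)         # select the success at this time
        x.append(x_i)
        total += a_i * x_i
        u *= 1.0 - p_i * (1.0 - pi_i)        # still unselected afterwards
    return total, x

# The threshold policy above attains about 0.3987 on the secretary instance,
# in agreement with the Monte Carlo estimate and the DP value.
n = 10
p = [1.0 / i for i in range(1, n + 1)]
a = [i / n for i in range(1, n + 1)]
pi = [1.0 if i <= 3 else 0.0 for i in range(1, n + 1)]
print(expected_reward(p, a, pi)[0])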
We can then formulate the problem of maximizing the expected reward as
In the following, we transform the above problem into a linear programming problem. The following theorem provides the key idea of our transformation.
Theorem 4.1.
Let be an -dimensional real vector. The following two conditions are equivalent.
- (c1)
-
There exists such that is feasible for problem Q.
- (c2)
-
There exists such that satisfies the following linear inequality system:
(2)
Proof of (c1) $\Rightarrow$ (c2). Let be a feasible solution to problem Q. Obviously, we have . Apply and . We then obtain for each
and
Thus, satisfies the inequality system (2).
Proof of (c2) $\Rightarrow$ (c1). Assume that satisfies the inequality system (2). First, we show by induction on . Clearly, , and implies that . With , we define
(3)
We then have the following inequalities:
It is easy to see that for each ,
and thus,
From the above, is feasible to problem Q.
By employing the above theorem, we can transform problem Q into the following linear programming problem:
FF: max.
s.t.
(4)
(5)
We call the above a flow formulation. When we have an optimal solution to FF, Equation (3) provides a policy that is optimal for problem Q.
In the following, we interpret the FF problem as a flow problem on a digraph. Let be a digraph with a vertex set and a directed edge set . For each , we associate variables and with edges and , respectively. If we regard these variables as a flow on a corresponding directed edge, constraint (4) represents a flow conservation law at vertex , and constraint (5) states that a flow of volume 1 emanates from vertex . Because , vertex corresponds to an event in which a player selects a success .
4.2 Flow Formulation for Classical Secretary Problem
Buchbinder, Jain, and Singh [14] proposed the following LP formulation for the classical secretary problem:
$$ \mbox{max.} \ \ \sum_{i=1}^{n} \frac{i}{n}\, x_i \quad \mbox{s.t.} \quad i\, x_i \le 1 - \sum_{j=1}^{i-1} x_j \ \mbox{ and } \ x_i \ge 0 \quad (\forall i \in \{1, 2, \ldots, n\}). \qquad (6) $$
Their formulation is obtained from the FF by eliminating the variables through substitution and setting $p_i = 1/i$ and $a_i = i/n$ for each $i \in \{1, 2, \ldots, n\}$.
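A numerical sketch of LP (6) using scipy follows (the helper name is ours); by [14], its optimal value coincides with the best achievable probability of selecting the best candidate, which approaches $1/e$ as $n$ grows.

import numpy as np
from scipy.optimize import linprog

def secretary_lp(n):
    """Solve LP (6): max sum_i (i/n) x_i  s.t.  i x_i <= 1 - sum_{j<i} x_j,  x >= 0."""
    c = -np.arange(1, n + 1) / n             # linprog minimizes, so negate the objective
    A_ub = np.zeros((n, n))
    for i in range(1, n + 1):
        A_ub[i - 1, :i - 1] = 1.0            # sum_{j < i} x_j
        A_ub[i - 1, i - 1] = float(i)        # + i x_i        <= 1
    res = linprog(c, A_ub=A_ub, b_ub=np.ones(n), bounds=[(0, None)] * n)
    return -res.fun, res.x

value, x = secretary_lp(100)
print(value)                                 # approx. 0.371, close to 1/e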
4.3 Dual of Flow Formulation
5 Conclusion
This paper described a linear programming problem whose unique optimal solution is the solution to the DP equation for the odds problem. We proposed a flow formulation that maximizes the expected reward of the odds problem based on a finite-horizon Markov decision process. Furthermore, we showed the duality of these two formulations.
References
- [1] F. T. Bruss, Sum the odds to one and stop, Annals of Probability 28 (3) (2000) 1384–1391.
- [2] K. Ano, H. Kakinuma, N. Miyoshi, Odds theorem with multiple selection chances, Journal of Applied Probability 47 (4) (2010) 1093–1104.
- [3] F. T. Bruss, D. Paindaveine, Selecting a sequence of last successes in independent trials, Journal of Applied Probability 37 (2) (2000) 389–399.
- [4] A. Kurushima, K. Ano, Multiple stopping odds problem in Bernoulli trials with random number of observations, Mathematica Applicanda 44 (1) (2016) 209–220.
- [5] M. Tamaki, Sum the multiplicative odds to one and stop, Journal of Applied Probability 47 (3) (2010) 761–777.
- [6] T. Matsui, K. Ano, A note on a lower bound for the multiplicative odds theorem of optimal stopping, Journal of Applied Probability 51 (3) (2014) 885–889.
- [7] T. Matsui, K. Ano, Lower bounds for Bruss’ odds problem with multiple stoppings, Mathematics of Operations Research 41 (2) (2016) 700–714.
- [8] T. Matsui, K. Ano, Compare the ratio of symmetric polynomials of odds to one and stop, Journal of Applied Probability 54 (1) (2017) 12.
- [9] F. T. Bruss, A note on bounds for the odds theorem of optimal stopping, Annals of Probability 31 (4) (2003) 1859–1861.
- [10] J. P. Gilbert, F. Mosteller, Recognizing the maximum of a sequence, Journal of the American Statistical Association 61 (313) (1966) 35–73.
- [11] A. Cayley, Mathematical questions with their solutions, The Educational Times 23 (1875) 18–19.
- [12] L. Moser, On a problem of Cayley, Scripta Math 22 (1956) 289–292.
- [13] T. S. Ferguson, Who solved the secretary problem?, Statistical Science 4 (3) (1989) 282–296.
- [14] N. Buchbinder, K. Jain, M. Singh, Secretary problems via linear programming, Mathematics of Operations Research 39 (1) (2014) 190–206.
- [15] S. M. Ross, Introduction to stochastic dynamic programming, Academic Press, 2014.
- [16] M. L. Puterman, Markov decision processes: discrete stochastic dynamic programming, John Wiley & Sons, 2014.
- [17] D. P. Bertsekas, Dynamic programming and optimal control 4th edition, volume I, Athena Scientific, 2017.
- [18] D. P. Bertsekas, Dynamic programming and optimal control 4th edition, volume II, Athena Scientific, 2012.
- [19] A. S. Manne, Linear programming and sequential decisions, Management Science 6 (3) (1960) 259–267.
- [20] L. C. M. Kallenberg, Survey of linear programming for standard and nonstandard Markovian control problems. Part I: Theory, Zeitschrift für Operations Research 40 (1) (1994) 1–42.
- [21] C. Derman, On sequential decisions and Markov chains, Management Science 9 (1) (1962) 16–24.
- [22] C. Derman, M. Klein, Some remarks on finite horizon Markovian decision models, Operations Research 13 (2) (1965) 272–278.
- [23] A. Bhattacharya, J. P. Kharoufeh, Linear programming formulation for non-stationary, finite-horizon Markov decision process models, Operations Research Letters 45 (6) (2017) 570–574.
- [24] M. Mundhenk, J. Goldsmith, C. Lusena, E. Allender, Complexity of finite-horizon Markov decision process problems, Journal of the ACM (JACM) 47 (4) (2000) 681–720.