STAT0025: Optimisation Algorithms in
Operational Research
University College London
2022/23
Contents
1. Linear programming
 1.1. Mathematical programming
 1.2. Linear programming problems (LPPs)
 1.3. Graphical solution method for LPPs
 1.4. Extreme point theorem
 1.5. Graphical naïve solution method
 1.6. Slack Variables and Surplus Variables
 1.7. Graphical sensitivity analysis
2. The simplex algorithm
 2.1. Theory leading to the simplex algorithm
 2.2. Numerical naïve method
 2.3. The simplex algorithm
 2.4. Why Does the Simplex Algorithm Work?
 2.5. Another Example of the Simplex Algorithm
 2.6. Practical considerations of the simplex algorithm
 2.7. Simplex Algorithm for More General Problems
 2.8. Two-phase simplex method
3. Sensitivity analysis and duality
 3.1. Sensitivity analysis
 3.2. Duality
4. Game theory
 4.1. Introduction
 4.2. Basic formulation
 4.3. Nash equilibrium
 4.4. Dominance
 4.5. Saddle Points
 4.6. Maximin and minimax mixed strategies
 4.7. Graphical solution for an n × 2 game
 4.8. Linear programming formulation of a game
 4.9. Games against nature
  4.9.1. Maximax strategy
  4.9.2. Laplacian strategy
  4.9.3. Hurwicz strategy (not examinable in 2022–23)
  4.9.4. Minimax regret strategy
  4.9.5. Expected Payoff and Variance
  4.9.6. Expected value–standard deviation criterion
5. Dynamic programming
 5.1. Introduction
 5.2. Backward recursion
 5.3. Resource allocation problems
 5.4. The warehousing problem
6. Markov dynamic programming
 6.1. Introduction
 6.2. Markov Chains
 6.3. Rewards in a Markov setting (with no actions)
 6.4. Markov dynamic programming with actions
 6.5. Discounting
 6.6. Infinite horizon problems
 6.7. Optimal stopping problems
A. Reading list
B. Corrections
1. Linear programming
What is OR? OR stands for Operational Research or Operations Research, and is sometimes
called Management Science. To a mathematician, OR means maximising some function subject
to a set of constraints. To a manager, OR means building a model for his/her business which:
• allows the efficient or optimal allocation of limited resources, and
• identifies those resources that are limiting the business’ output.
Who uses OR? Usually large companies or organisations e.g. oil companies, car manufacturers,
the military.
When did it start? Surprisingly recently; during the Second World War when scarce resources
had to be assigned for different military purposes. Scientists were asked to do research into
military operations. The simplex algorithm was formulated in 1947.
To do practical OR one needs at least simple computers/calculators.
Main topics in this course
• Linear programming (Chapters 1–3),
• Game theory (Chapter 4),
• Dynamic programming – deterministic and stochastic (Chapters 5 and 6)
1.1. Mathematical programming
Optimisation is the problem of finding the maximum or minimum of a function f(x) and the
value of x at which f(x) attains that maximum/minimum.
If x∗ is the value that minimises f(x), then x∗ is also the value that maximises −f(x), so a
minimisation problem can be turned into a maximisation problem, and vice versa.
Example 1.1 (Unconstrained minimisation). Minimise
z = f(x) = x1² + x2².
It is easy to see that x∗1 = 0 and x∗2 = 0 gives the unique minimum, with z∗ = f(x∗1, x∗2) = 0.
For a less obvious problem partial derivatives would yield the solution.
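(Here, for instance, setting ∂f/∂x1 = 2x1 = 0 and ∂f/∂x2 = 2x2 = 0 recovers x∗1 = x∗2 = 0 directly.)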
Example 1.2 (Constrained minimisation). Minimise
z = f(x) = x1² + x2².
Subject to the conditions
g1(x) = x1 − x2 = 3
g2(x) = x2 ≥ 2.
[Figure: the constraint line x2 = x1 − 3 and the region x2 ≥ 2, with circular contours z = 1 and z = 4; the constrained minimum lies at (5, 2).]
The value of z is the squared distance of the point (x1, x2) from the origin (Pythagoras's Theorem). From the figure, the point satisfying the constraints that is closest to the origin is x∗1 = 5, x∗2 = 2, which is the optimal solution.
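As a quick numerical cross-check (my own illustration, not part of the original solution method), one can hand this problem to a general-purpose constrained optimiser; the sketch below uses scipy.optimize.minimize with the SLSQP method.

```python
# Numerical check of Example 1.2 (an illustrative sketch; the notes solve
# this graphically). SLSQP accepts equality and inequality constraints.
from scipy.optimize import minimize

res = minimize(
    fun=lambda x: x[0]**2 + x[1]**2,          # z = x1^2 + x2^2
    x0=[1.0, 1.0],                            # arbitrary starting point
    constraints=[
        {"type": "eq",   "fun": lambda x: x[0] - x[1] - 3},  # x1 - x2 = 3
        {"type": "ineq", "fun": lambda x: x[1] - 2},         # x2 >= 2
    ],
    method="SLSQP",
)
print(res.x, res.fun)                         # approximately [5. 2.] and 29.0
```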
A mathematical program is an optimisation problem in which the objective and constraints are given as mathematical functions and functional relationships:
Minimise (or maximise) z = f(x1, x2, . . . , xn)
subject to:
g1(x1, x2, . . . , xn) ≤ / = / ≥ b1
g2(x1, x2, . . . , xn) ≤ / = / ≥ b2
⋮
gm(x1, x2, . . . , xn) ≤ / = / ≥ bm,
where each constraint uses one of the relations ≤, = or ≥.
The xj are called the control variables, f is the objective function, the equalities and
inequalities involving gi are the constraints and bi (i = 1, . . . ,m) are the resource values.
1.2. Linear programming problems (LPPs)
Definition 1.3. A mathematical program is a linear program if f(x1, x2, . . . , xn) and all
gi(x1, x2, . . . , xn) are linear in the control variables, that is, if f and all gi can be written as:
f(x1, x2, . . . , xn) = c1x1 + c2x2 + . . .+ cnxn
and
gi(x1, x2, . . . , xn) = ai,1x1 + ai,2x2 + . . .+ ai,nxn,
where cj and ai,j (i = 1, 2, . . . ,m and j = 1, . . . , n) are known constants.
In most problems we deal with in this course we also insist that the xi’s are non-negative.
Definition 1.4. • An LPP is said to be in canonical form if it is a maximisation problem,
all of the variables xi are non-negative, and all of the constraints are equalities.
• An LPP is said to be in standard form if it is a maximisation problem, all of the variables
xi are non-negative, and all of the constraints are ‘≤’ inequalities.
• A point x = (x1, . . . , xn)T is called a feasible solution if it satisfies all of the constraints (including non-negativity constraints, if present). The set of all feasible solutions is called the feasible region.
• A feasible solution is called an optimal solution if there is no other feasible solution at
which the objective function has a strictly better value (bigger for a maximisation problem,
smaller for a minimisation problem.)
• The value of the objective function at an optimal solution is called the optimal value of
the problem.
Example 1.5 (Clock manufacturer). A clock manufacturer produces a number of standard clocks
and a number of alarm clocks each day.
Resource constraints: labour hours per day: 1600
processing hours per day: 1800
alarm assemblies available per day: 350
Consumption of resources: one standard clock uses: 2 hours labour
6 hours processing
one alarm clock uses: 4 hours labour
2 hours processing
1 alarm assembly.
Profit: one standard clock: gives £3 profit
one alarm clock: gives £8 profit
The objective is to maximise profit.
In linear programming notation, let x1 be the number of standard clocks produced per
day and let x2 be the number of alarm clocks per day.
The objective function is z = 3x1 + 8x2, which we want to maximise.
The constraints are: 2x1 + 4x2 ≤ 1600 (labour)
6x1 + 2x2 ≤ 1800 (processing)
x2 ≤ 350 (alarm assemblies).
Also, the number of clocks produced per day cannot be negative, so x1 ≥ 0 and x2 ≥ 0.
This problem is in standard form.
We will see two main approaches to solving linear programming problems: the graphical solution method and the simplex method.
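Before walking through the two methods by hand, note that small LPPs like this can be checked numerically. The sketch below is my own illustration (not part of the notes) using scipy.optimize.linprog; since linprog minimises, the profit coefficients are negated.

```python
# Solving the clock LPP with scipy.optimize.linprog as a cross-check.
from scipy.optimize import linprog

res = linprog(
    c=[-3, -8],                     # maximise 3x1 + 8x2 <=> minimise -3x1 - 8x2
    A_ub=[[2, 4],                   # labour:     2x1 + 4x2 <= 1600
          [6, 2],                   # processing: 6x1 + 2x2 <= 1800
          [0, 1]],                  # assemblies:        x2 <= 350
    b_ub=[1600, 1800, 350],
    bounds=[(0, None), (0, None)],  # x1, x2 >= 0
)
print(res.x, -res.fun)              # [100. 350.] 3100.0
```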
1.3. Graphical solution method for LPPs
Consider the alarm clock problem above.
First draw the feasible region, defined by the 3 constraints.
[Figures: building up the feasible region for the clock problem, with Standard Clock Production x1 on the horizontal axis and Alarm Clock Production x2 on the vertical axis. The labour, processing and alarm-assemblies constraints are added in turn; the final panel shows objective contours z = 1200, z = 3100 and z = 5200, with the optimum at (100, 350).]
Taking z = 3x1 + 8x2, we can rearrange this to get x2 = −(3/8)x1 + (1/8)z. So, for any given constant c, all points on the line x2 = −(3/8)x1 + c yield the same value of z, and this value of z is given by (1/8)z = c, i.e. z = 8c. Lines x2 = −(3/8)x1 + c that do not intersect the feasible region are of no interest, as no point on such a line satisfies the constraints. We are looking for the line x2 = −(3/8)x1 + c which both has the greatest intercept, c, and intersects the feasible region. All points on that line yield the same maximum value of z (z = 8c), and at least one point on it is a feasible solution.
By taking the line x2 = −(3/8)x1 + c for some arbitrary value of c, and moving it up and down in a parallel fashion (i.e. increasing/decreasing c whilst keeping the gradient the same), we find the line with the greatest intercept that intersects the feasible region. This gives us both the maximum value of z and the values of x1 and x2 that maximise it.
In this example (x∗1, x∗2) = (100, 350) gives the optimal value z∗ = 3100.
Exercise. I claim that, if alarm clocks were instead sold for £2 profit, meaning that z = 3x1 + 2x2, then the optimal solution would become (x∗1, x∗2) = (200, 300) and the optimal value would be z∗ = 1200. Check this.
Note. In both cases, the optimising values of the control variables are unique, and lie at a
vertex of the feasible region.
1.4. Extreme point theorem
Definition 1.6. Let R ⊂ Rn.
• R is said to be bounded if there exists some C > 0 such that ‖x‖ ≤ C for all x ∈ R.
• An extreme point of R is any vertex on the boundary of R. Formally, x ∈ R is an extreme
point of R if it does not lie strictly between any other pair of points of R; that is, if for all
y, z ∈ R distinct from x, x /∈ {λy + (1− λ)z : λ ∈ (0, 1)}.
• A convex combination of a set of points x(1), . . . ,x(k) is a point of the form λ1x(1) +
· · ·+ λkx(k) for some λ1, . . . , λk ≥ 0 such that λ1 + · · ·+ λk = 1. Informally, this is a point
that lies ‘in between’ all of the points x(i).
• If R is the feasible region in some optimisation problem, then the extreme points of R are
called feasible basis points.
Theorem 1.7 (Extreme point theorem). Consider a linear programming problem, and assume
that its feasible region is non-empty and bounded. Then:
• An optimal solution exists, and can be found at an extreme point of the feasible region.
• If there is more than one optimal solution, then every optimal solution is a convex combi-
nation of optimal solutions found at the extreme points of the feasible region.
Examples The clock example has a non-empty, bounded feasible region with a unique optimal
solution.
An empty feasible region:
maximise z = x1 + x2
s.t. x1 − x2 ≥ 5
x1 − x2 ≤ 4
An unbounded feasible region without an optimal solution:
maximise z = x1 + x2
s.t. x1 ≥ 2
x2 ≥ 2
An unbounded feasible region with a (unique) optimal solution:
minimise z = x1 + x2
s.t. x1 ≥ 2
x2 ≥ 2
1.5. Graphical naïve solution method
If the feasible region is bounded, the extreme point theorem tells us that the optimal solution is
at an extreme point of the feasible region (a feasible basis point). So, to find the optimal solution,
one can evaluate the objective function at each feasible basis point. The point that gives the
highest (lowest) value for z is the optimal solution.
Example 1.8 (The alarm clock problem).
Feasible basis point Value
coordinates (x1, x2) z = 3x1 + 8x2
(0, 0) 0
(300, 0) 900
(200, 300) 3000
(100, 350) 3100
(0, 350) 2800
Comments If the feasible region is not bounded, the method may not work. We will see later
how to extend this method to the case n > 2, but as you may be able to guess, it is not very
efficient.
The graphical naïve method is not meant as a serious way to solve linear program-
ming problems. The idea of introducing this method is to demonstrate something
about the role of extreme points.
1.6. Slack Variables and Surplus Variables
In the optimal solution above, x∗1 = 100 and x∗2 = 350, which implies the following use of resources.
• Labour: 2x∗1 + 4x∗2 = 1600 hr/day
• Processing: 6x∗1 + 2x∗2 = 1300 hr/day
• Alarm assemblies: x∗2 = 350 assemblies/day.
The processing constraint was 6x∗1 + 2x∗2 ≤ 1800. Since there was spare capacity here, we say
that the processing constraint had 500 hr/day slack. The labour and alarm assembly constraints
were 2x∗1 + 4x∗2 ≤ 1600 and x∗2 ≤ 350, respectively, so the slack for these was 0. We say these
constraints are tight for this solution.
Considering further the processing constraint:
6x1 + 2x2 ≤ 1800,
by introducing a slack variable x4 we can transform the constraint into
6x1 + 2x2 + x4 = 1800,
with x4 ≥ 0. (If we allowed negative values for x4 then 6x1 + 2x2 could be > 1800.)
So the inequality has been transformed into an equality. This will be utilised in the next
chapter.
Suppose there had been another constraint that was a minimum (≥) not a maximum (≤). E.g.
the total number of clocks produced must be at least 100.
x1 + x2 ≥ 100.
We now introduce a surplus variable x6 and transform the constraint into
x1 + x2 − x6 = 100.
with x6 ≥ 0.
Note: We always arrange matters so that slack and surplus variables are non-negative.
The full set of constraints, for the clock problem with a minimum of 100 clocks, is:
Labour: 2x1 + 4x2 + x3 = 1600
Processing: 6x1 + 2x2 + x4 = 1800
Alarm assemblies: x2 + x5 = 350
Minimum number of clocks: x1 + x2 − x6 = 100
as well as positivity: xi ≥ 0 for i = 1, . . . , 6.
1.7. Graphical sensitivity analysis
Sensitivity analysismeans determining how the optimal solution changes when the specification
of the LPP changes. How sensitive is the optimal solution to the correct model specification?
The specification of the LPP may change in three ways:
1. The coefficients of the objective function (cj) change,
2. The resource constraints (bi) change, or
3. The coefficients of the constraints (ai,j) change.
Below, we look at 1 and 2.
Definition 1.9. The shadow price of a constraint is the increase in the optimal value per unit
increase of the resource value associated with that constraint.
Returning to the original clock problem.
[Figure: the feasible region for the clock problem, with the labour, processing and alarm-assemblies constraints.]
Suppose we increase the available labour resource by 1 unit, to 1601. Then the line representing the labour constraint will change from x2 = (1/4)(1600 − 2x1) to x2 = (1/4)(1601 − 2x1); in other words, the intercept changes but the gradient does not. It is simple to see that the optimal solution will still be at the “same” vertex as before, in the sense of the intersection of the labour and alarm assemblies lines. Hence, the optimal value of x2 is still x∗2 = 350, and thus
x∗1 = (1601 − 4x∗2)/2 = 100.5,
z∗ = 3101.5.
Because the problem is linear, the objective function increases by 1.5 for every unit increase in the labour resource. The shadow price of the labour constraint is £1.50 (per labour hour).
Suppose, instead, the processing resource had increased by one unit. Then the optimal solution
would not change at all. This is because the optimal solution does not lie on the line corresponding
to the processing constraint. Thus, the shadow price for processing is zero. In general, if the
slack/surplus variable for a constraint is non-zero then the shadow price of the constraint will be
zero.
Finally, suppose the number of alarm assemblies available increases by one. Then
x∗1 = 98
x∗2 = 351
z∗ = 3102
and so the shadow price for assemblies is £2 (per assembly unit).
Interpretation of the shadow price Suppose the clock company considers increasing the number
of alarm assemblies available at a cost. The shadow price of £2 tells them that they should do
so, provided the cost of buying in each extra alarm assembly is less than £2.
Connection with slack The labour and alarm assembly constraints both have non-zero shadow
price, and they are also tight (they have zero slack). This is not a coincidence – we will return
to it later.
At what point does the shadow price drop to zero? Look at the previous graph. As the labour
resource increases, the optimal solution will increase until some point P . This corresponds to
the processing slack being reduced to zero. Increasing the labour resource beyond this point will
have no effect on the optimal solution and the labour slack will increase.
Point P is the intersection of the processing and alarm constraints, located at P = (550/3, 350), where z = 3350, corresponding to a labour resource of 1766 2/3. It is not worthwhile to increase the available labour to more than 1766 2/3 hours/day.
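This behaviour is easy to see numerically. The sketch below (my own illustration using scipy.optimize.linprog, not part of the notes) re-solves the clock LPP for several labour resource values: z∗ rises at the shadow price of 1.5 per hour until the resource reaches 1766 2/3, and is flat thereafter.

```python
# Watching the labour shadow price flatten out (illustrative sketch).
from scipy.optimize import linprog

for labour in [1600, 1601, 1700, 1766.67, 1800]:
    res = linprog(c=[-3, -8],
                  A_ub=[[2, 4], [6, 2], [0, 1]],
                  b_ub=[labour, 1800, 350],
                  bounds=[(0, None), (0, None)])
    print(labour, round(-res.fun, 2))
# 1600 -> 3100.0, 1601 -> 3101.5, 1700 -> 3250.0, then z* stays at 3350.0
```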
Changes in the objective function From before,
z = 3x1 + 8x2  ⇒  x2 = −(3/8)x1 + z/8.
So, the gradient of the contours of the objective function is −3/8.
In the graph, the gradient of the assembly constraint is zero and the gradient of the labour constraint is −1/2. For the optimal solution to be at the assembly-labour vertex, (1) we must be maximising the contour intercept, and (2) the gradient of the objective function must be between −1/2 and 0.
What happens if either of these fails? If (2) fails, and the gradient of the objective function is between −3 and −1/2, then a greater value of the objective function can be obtained at the labour-processing vertex. If the gradient of the objective function is exactly −1/2 then it is parallel to the labour constraint and any point on the labour constraint between the two vertices is optimal. On the other hand, if z = −3x1 − 8x2, then the gradient of the contours is still −3/8, but (1) fails: we are minimising the intercept, so x∗ = 0.
Question: If z = c1x1 + c2x2, what values of c1 and c2 give the same values of x∗1 = 100,
x∗2 = 350?
Solution: Rearrange the equation to obtain x2 = −(c1/c2)x1 + (1/c2)z. The gradient of this line is −c1/c2. In order for the optimal solution to be (x∗1, x∗2) = (100, 350), we need c2 > 0, so that we are maximising the intercept, and we also require −1/2 ≤ −c1/c2 ≤ 0, or equivalently 0 ≤ c1/c2 ≤ 1/2.
If we change the value of c1, the optimal solution will be the same if c1 remains positive and does not exceed (1/2)c2.
If we change only c2, the optimal solution will remain the same if c2 exceeds 2c1.
Note: The optimal solution remains the same (i.e., x∗1 and x∗2 remain unchanged), but the
optimal value of the objective function will change.
Example 1.10. Suppose a company makes two models of car. These models are called Alpha
and Omega. The number of Alpha models to be produced is x1 and the number of Omega models
is x2.
We have the following linear programming problem:
maximise z = 6x1 + 5x2 (profit)
s.t. 4x1 + x2 ≤ 800 (materials)
2x1 + 3x2 ≤ 900 (labour)
x1 ≤ 180 (max Alpha sales)
x2 ≤ 320 (max Omega sales)
x1, x2 ≥ 0.
Which gives the graph:
[Figure: the feasible region, bounded by the materials, labour, max Alpha sales (x1 = 180) and max Omega sales (x2 = 320) constraints, with vertices labelled A, B, C, D and E = (0, 0).]
The line representing the contours of the objective function is x2 = −(6/5)x1 + c, and this can be used with the graphical method to determine the solution. Equally, we can list the feasible basis points and the value of z at each one:
Feasible basis point Value
coordinate (x1, x2) z = 6x1 + 5x2
A = (0, 300) 1500
B = (150, 200) 1900
C = (180, 80) 1480
D = (180, 0) 1080
E = (0, 0) 0
The optimal solution is at point B, giving
x∗1 = 150, x∗2 = 200, z∗ = 1900.
Again, if the objective function is altered, the location of the optimal solution could change to
another vertex. Let z = c1x1 + c2x2, with the original values c1 = 6, c2 = 5. We shall find the
values of c1 and c2 for which the optimal solution is still located at the same vertex.
[Figure: the same feasible region, showing the contour z = 0 of the original objective function through E (solid) and contours through B for altered values of c1 and c2 (dashed).]
In the plot above, a contour z = 0 of the original objective function is shown (the solid line
through E), along with contours of the objective function through B when the values of c1 and
c2 are changed (dashed lines). The slope of the contours of z is −c1/c2.
In order for B to remain the optimal solution, firstly we need c2 > 0 in order to be maximising the intercept; secondly, the slope of the contours has to lie between the slopes of the materials constraint (−4) and the labour constraint (−2/3).
Therefore, for B to be optimal,
−4 ≤ −c1/c2 ≤ −2/3,
or equivalently
(1/4)c1 ≤ c2 ≤ (3/2)c1.
In the original problem, c1 = 6, so the tolerance interval for c2 is [1.5, 9]. If the profit from the Omega model (c2) changes next month, then so long as it lies within the tolerance interval, the company does not need to change its production plan (though the overall profit may change).
Exercise
1. What is the tolerance interval for c1, given the original value c2 = 5?
2. What are the shadow prices for the four constraints in this problem?
2. The simplex algorithm
In this chapter we shall assume that all linear programming problems (LPPs) are maximisation problems and that all control variables, xi, must be non-negative, unless otherwise stated.
General Notation
An LPP may be written using matrix notation.
For example,
Maximise z = cTx objective function
subject to Ax ≤ b constraints (2.1)
x ≥ 0,
where A = [ai,j ] is an m×n matrix, b is an m-vector of constraints (resources), x is an n-vector
of variables, and c is an n-vector of objective function coefficients.
Here, ‘Ax ≤ b’ indicates that the inequality holds elementwise, that is, (Ax)i ≤ bi for i =
1, . . . ,m, where (Ax)i is the i-th element of the m-vector Ax. Likewise, ‘x ≥ 0’ means that
xi ≥ 0 for all i = 1, . . . , n.
The above LPP is in standard form. This is because equation (2.1) uses ≤ inequalities, and the variables x are all required to be non-negative.
If instead the problem is
Maximise z = cTc xc
subject to Acxc = b (2.2)
xc ≥ 0,
then the LPP is in canonical form. (The subscript ‘c’ is just to indicate that this is canonical
form.)
By using slack/surplus variables, it is always possible to convert any LPP into canonical form.
To see this, consider the following example.
Example 2.1 (Clock LPP revisited). The clock LPP is:
Maximise z = 3x1 + 8x2
subject to 2x1 + 4x2 ≤ 1600
6x1 + 2x2 ≤ 1800
x2 ≤ 350
x1, x2 ≥ 0.
It can be written in matrix notation in standard form as (2.1), with

c = (3, 8)T,   x = (x1, x2)T,

    [ 2 4 ]        [ 1600 ]
A = [ 6 2 ],   b = [ 1800 ].
    [ 0 1 ]        [  350 ]
Add slack variables to transform it into canonical form:
Maximise z = 3x1 + 8x2
subject to 2x1 + 4x2 + x3 = 1600
6x1 + 2x2 + x4 = 1800
x2 + x5 = 350
x1, x2, x3, x4, x5 ≥ 0.
In matrix form (2.2), this corresponds to:

cc = (3, 8, 0, 0, 0)T,   xc = (x1, x2, x3, x4, x5)T,

     [ 2 4 1 0 0 ]        [ 1600 ]
Ac = [ 6 2 0 1 0 ],   b = [ 1800 ].
     [ 0 1 0 0 1 ]        [  350 ]
In general, any problem in standard form (2.1) can be converted to one in canonical form (2.2) by defining the m × (n+m) matrix

Ac = ( A | Im ),

where Im is the m × m identity matrix. We also have xc = (x1, . . . , xn+m)T and cc = (c1, . . . , cn, 0, . . . , 0)T, that is, the vector c followed by m zeros.
2.1. Theory leading to the simplex algorithm
Assume the rows of Ac are linearly independent. Otherwise, at least one constraint is redundant
(i.e., follows from the other constraints) and can be removed without affecting the problem.
Solving Acxc = b gives a region R = {xc ∈ Rn+m : Acxc = b} satisfying the m resource
constraints of the canonical problem.
The region S = {xc ∈ R : xc ≥ 0} is the feasible region of the LPP.
Definition 2.2. If a vector xc solves the equation Acxc = b and xc has at most m non-zero
elements, xc is called a basic solution of the LPP.
If xc is a basic solution, and every element of xc is also non-negative, then xc is called a
feasible basic solution of the LPP.
Theorem 2.3. Every extreme point of S is a feasible basic solution of the LPP, and conversely
every feasible basic solution of the LPP is an extreme point of S.
In the last chapter we saw that the optimal solution of an LPP lies at an extreme point of the feasible region (or is a convex combination of extreme points). So, from the theorem above, any optimal solution must be one of the feasible basic solutions (or a convex combination of them).
So, if we check all the basic solutions, determine which are feasible, and pick the one which
maximises z, we are done.
2.2. Numerical naïve method
How do we find the basic solutions?
The idea is simple: pick m of the control variables, and allow those to be non-zero (which
means that the remaining n variables are forced to be zero). Try to solve the resulting equation.
This will give one basic solution. Then change which control variables are allowed to be non-zero,
and repeat. The idea is probably best understood by example.
The m elements of xc which are allowed to be non-zero are called basic variables. The re-
maining n variables set to zero are called non-basic variables.
Using this method it is straightforward to find all the basic solutions, check which ones are
feasible and then inspect which of these feasible basic solutions gives the largest value of z. This
is the optimal solution.
There are C(m+n, m) combinations of m basic variables, and hence (at most) C(m+n, m) basic solutions to be found. (Here C(a, b) denotes the binomial coefficient ‘a choose b’.)
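The whole procedure is short enough to write out; below is a minimal sketch in Python (names such as naive_method are my own, not from the notes). It enumerates every choice of m basic columns, solves the resulting m × m system, discards singular or infeasible cases, and keeps the best feasible basic solution.

```python
# The numerical naive method: enumerate basic solutions of A_c x_c = b.
import itertools
import numpy as np

def naive_method(Ac, b, cc):
    m, l = Ac.shape
    best_x, best_z = None, -np.inf
    for cols in itertools.combinations(range(l), m):
        A_sub = Ac[:, cols]
        if abs(np.linalg.det(A_sub)) < 1e-12:
            continue                        # singular: skip (no unique solution)
        x_basic = np.linalg.solve(A_sub, b)
        if np.any(x_basic < -1e-9):
            continue                        # basic but not feasible
        x = np.zeros(l)
        x[list(cols)] = x_basic
        if cc @ x > best_z:
            best_x, best_z = x, cc @ x
    return best_x, best_z

# Clock problem in canonical form: expect x* = (100, 350, 0, 500, 0), z* = 3100.
Ac = np.array([[2., 4., 1., 0., 0.],
               [6., 2., 0., 1., 0.],
               [0., 1., 0., 0., 1.]])
b = np.array([1600., 1800., 350.])
cc = np.array([3., 8., 0., 0., 0.])
print(naive_method(Ac, b, cc))
```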
Example 2.4 (Clock LPP revisited). Earlier, we wrote the clock LPP in canonical form using
matrix notation:
        [ 2 4 1 0 0 ]   [ x1 ]   [ 1600 ]
Acxc =  [ 6 2 0 1 0 ] · [ x2 ] = [ 1800 ] = b
        [ 0 1 0 0 1 ]   [ x3 ]   [  350 ]
                        [ x4 ]
                        [ x5 ]

There are C(5, 3) = 10 ways of choosing m = 3 basic variables, and so 10 basic solutions to obtain.
If we choose x1, x2, x3 as the basic (non-zero) variables, we obtain:
 2 4 1 0 06 2 0 1 0
0 1 0 0 1


x1
x2
x3
0
0
 =
 2 4 16 2 0
0 1 0
 x1x2
x3
 =
 16001800
350

This point is very important. Notice that because the non-basic variables are zero, the equation
simplifies. When x1, x2, x3 are basic, the equation Acxc = b is equivalent to A′x′ = b, where A′
consists of columns 1, 2, 3 of Ac, and x′ = (x1, x2, x3)T .
Solving the equation gives x1 = 183 1/3, x2 = 350, x3 = −166 2/3. This corresponds to xc = (183 1/3, 350, −166 2/3, 0, 0)T, which is a basic solution but not a feasible one, since one of the variables is negative.
Now consider letting x1, x2, x4 be basic. By the reasoning above, this is the same as selecting out columns 1, 2, 4, to give the equation:

[ 2 4 0 ]   [ x1 ]   [ 1600 ]
[ 6 2 1 ] · [ x2 ] = [ 1800 ]
[ 0 1 0 ]   [ x4 ]   [  350 ]

corresponding to xc = (100, 350, 0, 500, 0)T. This is a feasible basic solution with value z = 3100.
Using columns 1, 2 and 5:

[ 2 4 0 ]   [ x1 ]   [ 1600 ]
[ 6 2 0 ] · [ x2 ] = [ 1800 ]
[ 0 1 1 ]   [ x5 ]   [  350 ]

corresponding to xc = (200, 300, 0, 0, 50)T. This is a feasible basic solution with value z = 3000.
Using columns 1, 3 and 4:

[ 2 1 0 ]   [ x1 ]   [ 1600 ]
[ 6 0 1 ] · [ x3 ] = [ 1800 ]
[ 0 0 0 ]   [ x4 ]   [  350 ]

This equation has no solution.
The following table summarises the results for the other six combinations of three basic vari-
ables (three columns):
Columns x1 x2 x3 x4 x5 Feasible? z
1 3 5 300 0 1000 0 350 Y 900
1 4 5 800 0 0 -3000 350 N
2 3 4 0 350 200 1100 0 Y 2800
2 3 5 0 900 -2000 0 -550 N
2 4 5 0 400 0 1000 -50 N
3 4 5 0 0 1600 1800 350 Y 0
The feasible basic solution maximising the objective function is the second combination we tried
(selecting columns 1, 2 and 4) with x∗1 = 100, x∗2 = 350, x∗3 = 0, x∗4 = 500, x∗5 = 0, giving
z∗ = 3100. This confirms the result found graphically.
Example 2.5 (Another example). The following LPP is in canonical form.
Maximise z = x1 + x2
s.t. x1 + 2x2 + x3 = 10
2x1 + x2 + x4 = 12.
x1, x2, x3, x4 ≥ 0.
This gives

Ac = [ 1 2 1 0 ]
     [ 2 1 0 1 ]

with xc = (x1, x2, x3, x4)T and b = (10, 12)T.
Consider the basic solutions. One combination of m = 2 columns of Ac is column 1 and column
3. This gives us the equation
A′x′ = [ 1 1 ] [ x1 ]   [ 10 ]
       [ 2 0 ] [ x3 ] = [ 12 ] = b,
yielding x1 = 6, x2 = 0, x3 = 4 and x4 = 0. As x1 ≥ 0 and x3 ≥ 0, the solution is feasible. Its
value is z = 6.
Exercise: repeat this for each combination of two columns of Ac, and find the optimal solution.
Comments
1. This is essentially a numerical version of the graphical naïve method from section 1.5.
2. Just like the graphical naïve method, the numerical naïve method is not intended as a serious way to solve linear programming problems. It is meant to demonstrate how the simplex algorithm works, as we will see in the next section. The number of possible solutions, C(m+n, m), can become large very rapidly. For example, m = 5, n = 5 gives C(10, 5) = 252 solutions to be found.
3. The two examples we have looked at are both problems in standard form. When this is not
so, the LPP in canonical form will still be Acxc = b, but Ac may not have m+n columns.
The principle remains the same. Suppose that Ac is an m × l matrix, where l ≥ m. To
find basic solutions, we still take m ×m submatrices A′ of Ac and solve A′x′ = b. The
vector x′ is still of length m and there are still m basic variables, but now there are l −m
non-basic variables (instead of exactly n of them.)
2.3. The simplex algorithm
Assume for now that we are looking at a problem in standard form
Maximise z = cTx
subject to Ax ≤ b
x ≥ 0,
with A ∈ Rm×n, b ∈ Rm and x, c ∈ Rn. Assume moreover that bi ≥ 0 for all i = 1, . . . ,m. We
shall consider later what to do when this is not true.
In the numerical naïve method, we had to solve A′x′ = b for each of the C(m+n, m) possible matrices A′. One of these is trivial to solve: the one where A′ consists of the last m columns of Ac. A′ is then the identity matrix and the solution is just x′ = b (see Example 2.4). Thus, one basic solution is just x1 = · · · = xn = 0 (the non-basic variables) and xn+i = bi for i = 1, . . . , m (the basic variables). That is, set each slack variable equal to its corresponding resource value and all the original variables to be zero. This basic solution is feasible (because bi ≥ 0 for all i) and z = 0.
This section will outline an algorithm, the simplex algorithm, in which the problem is laid
out in a tabular format, starting from this initial solution, and we successively visit other feasible
basic solutions in such a way that the objective function always increases. Once z can no longer
be increased, we have found an optimal solution.
The advantage of the simplex algorithm is that we do not have to investigate explicitly all of the C(m+n, m) basic solutions, only the interesting ones.
Rearranging the objective function z = c1x1 + · · ·+ cnxn, we obtain
−c1x1 − · · · − cnxn + z = 0.
Our m constraints, written in canonical form, are Acxc = b. So, we can write the entire problem as

[               | 0 ]   [  x1  ]     [ b1 ]
[      Ac       | ⋮ ] · [  ⋮   ]  =  [ ⋮  ]     (2.3)
[               | 0 ]   [ xn+m ]     [ bm ]
[ −c1 ⋯ −cn 0 ⋯ 0 | 1 ]   [  z   ]     [ 0  ]
From this we have the general format of the initial tableau:

Basic vars. | x1 x2 · · · xn+m | z | Sol.
xn+1        |                  | 0 | b1
xn+2        |        Ac        | 0 | b2
⋮           |                  | ⋮ | ⋮
xn+m        |                  | 0 | bm
z           | −c1 −c2 · · · −cn 0 · · · 0 | 1 | 0
Clock example: initial tableau
x1 x2 x3 x4 x5 z Sol.
x3 2 4 1 0 0 0 1600
x4 6 2 0 1 0 0 1800
x5 0 1 0 0 1 0 350
z −3 −8 0 0 0 1 0
Our initial basic solution is easy to read off from this tableau: x1 = x2 = 0 (the non-basic
variables) and x3 = 1600, x4 = 1800 and x5 = 350 (the basic variables), with z = 0. We say that
this tableau corresponds to this basic solution. Note that this basic solution is feasible since all
the variables are non-negative.
The simplex algorithm now looks for another feasible basic solution that has a greater value of
z. This new basic solution will be ‘adjacent’ to the old one. That is, m− 1 of the basic variables
will be same and one will be different.
First, we choose which of the non-basic variables to ‘enter’ (into the set of basic variables).
Second, we choose which basic variable to ‘exit’ (from the set of basic variables).
1. Enter. Find the non-basic variable with the most negative entry in the last row (the objective function row or z row). This variable will enter.
In the example, x2 has z-row entry −8, so x2 enters.
2. Exit. Find the basic variable corresponding to the row with the smallest non-negative ‘θ-ratio’. This variable will exit. The θ-ratio for row i when xj is entering is
θi = bi/ai,j if ai,j ≥ 0 (with bi/0 taken as ∞), and θi = −∞ if ai,j < 0.
The row corresponding to the exiting variable is called the pivotal row.
In this example, the θ-ratios are:
θ1 = 1600/4 = 400,
θ2 = 1800/2 = 900,
θ3 = 350/1 = 350.
The smallest ratio is for row 3, so x5 is the exiting variable. Row 3 is the pivotal
row.
3. Update. Use ‘elementary row operations’ to ‘isolate’ the entering variable (and keep the
other basic variables and z isolated.) Then relabel the pivotal row with the entering variable.
A variable is isolated when the column corresponding to this variable consists of zero
entries except for the entry in its corresponding row, which equals one.
The elementary row operations are:
• Multiplying a row by a constant.
• Adding one row to another.
(These will be familiar from linear algebra: they are used when solving a system of linear
equations.)
In this example, we want the x2 column, (4, 2, 1, −8)T, to become (0, 0, 1, 0)T.
Row 3 already has a 1 in the x2 column. So there is no need to alter row 3, except
that we relabel it with x2, because the old basic variable x5 is being replaced by
x2. We can write this as [3]→ [3].
If we add −4 times row 3 to row 1, row 1 becomes
x3 | 2 0 1 0 −4 | 200.
The element in the x2 column is now zero, as required. We can write this as
[1] − 4 × [3] → [1].
To isolate x2, take
[3] → [3]
[1]− 4× [3] → [1]
[2]− 2× [3] → [2]
[z] + 8× [3] → [z]
After performing these steps we have the second tableau:
x1 x2 x3 x4 x5 z Sol.
x3 2 0 1 0 −4 0 200
x4 6 0 0 1 −2 0 1100
x2 0 1 0 0 1 0 350
z −3 0 0 0 8 1 2800
Variables x2, x3 and x4 are now the basic variables — and so x1 and x5 are non-basic. The
values of x2, x3, x4 and z can easily be read off the tableau: x2 = 350, x3 = 200, x4 = 1100,
z = 2800. (Note that z = 3x1 + 8x2 does indeed equal 2800.)
We have now completed one full iteration in the simplex algorithm.
Notes
• z has increased.
• The values of the basic variables and of z come from:
0x2 + 1x3 + 0x4 + 0z = 200
⇒ x3 = 200
0x2 + 0x3 + 1x4 + 0z = 1100
⇒ x4 = 1100
1x2 + 0x3 + 0x4 + 0z = 350
⇒ x2 = 350
0x2 + 0x3 + 0x4 + z = 2800
⇒ z = 2800
since x1 = x5 = 0.
• It is easy to read off the values of the basic variables because the submatrix consisting
of the columns corresponding to the basic variables is the identity matrix (with columns
permuted).
Next iteration We now repeat the same process to see if we can get a solution that is even
better.
1. The entering variable is x1.
2. The θ-ratios are 100, 183 1/3 and ∞. The exiting variable is x3, in row 1, which is the pivotal row.
3. The third tableau is:

x1 x2 x3 x4 x5 z Sol.
x1 1 0 1/2 0 −2 0 100
x4 0 0 −3 1 10 0 500
x2 0 1 0 0 1 0 350
z 0 0 3/2 0 2 1 3100
Stopping criterion When the last row has non-negative entries for every non-basic variable, then
the stopping criterion has been met. This means that the optimal solution has been reached,
which can be read off directly from the tableau.
Except for certain so-called degenerate problems, the stopping criterion will always be reached
provided an optimal solution exists. We will return to this point.
In our example, the stopping criterion is satisfied in the third tableau. Therefore, the optimal
solution is x∗1 = 100, x∗2 = 350, x∗3 = 0, x∗4 = 500, x∗5 = 0 and z∗ = 3100. (You can check that
z∗ = 3x∗1 + 8x∗2 does indeed equal 3100.)
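The whole procedure above can be condensed into a short program. The following is a minimal sketch of the tableau method for problems in standard form with b ≥ 0 (my own illustration, with names such as simplex of my choosing; it ignores degeneracy and simply raises an error on unbounded problems):

```python
# A minimal tableau implementation of the simplex algorithm for
# maximise c^T x subject to A x <= b, x >= 0 (assuming b >= 0).
import numpy as np

def simplex(A, b, c):
    m, n = A.shape
    # Initial tableau [A | I | b] with the z row [-c | 0 | 0] underneath.
    T = np.zeros((m + 1, n + m + 1))
    T[:m, :n] = A
    T[:m, n:n + m] = np.eye(m)
    T[:m, -1] = b
    T[-1, :n] = -c
    basis = list(range(n, n + m))       # the slack variables start as basic
    while True:
        j = int(np.argmin(T[-1, :-1]))  # entering: most negative z-row entry
        if T[-1, j] >= 0:
            break                       # stopping criterion: no negative entries
        col = T[:m, j]
        # theta ratios; rows with non-positive pivot entries are not candidates
        theta = np.where(col > 1e-12,
                         T[:m, -1] / np.where(col > 1e-12, col, 1.0), np.inf)
        i = int(np.argmin(theta))       # exiting: smallest non-negative ratio
        if theta[i] == np.inf:
            raise ValueError("unbounded problem: no valid theta ratio")
        T[i] /= T[i, j]                 # isolate the entering variable ...
        for r in range(m + 1):
            if r != i:
                T[r] -= T[r, j] * T[i]  # ... via elementary row operations
        basis[i] = j
    x = np.zeros(n + m)
    x[basis] = T[:m, -1]
    return x, T[-1, -1]

# Clock problem: expect x* = (100, 350), slacks (0, 500, 0), z* = 3100.
A = np.array([[2., 4.], [6., 2.], [0., 1.]])
b = np.array([1600., 1800., 350.])
c = np.array([3., 8.])
print(simplex(A, b, c))   # (array([100., 350., 0., 500., 0.]), 3100.0)
```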
The z column You might have noticed that, in all the tableaux above, the z column stayed
completely unaffected by all our linear row operations. This is always true. For this reason, it’s
usual to drop the z column when writing out a simplex tableau, and we will follow this convention
from now on.
2.4. Why Does the Simplex Algorithm Work?
Full details are available in our references, such as Bertsimas and Tsitsiklis, Hillier and Lieberman,
or Gordon and Pressman. Here I provide a brief account.
We have seen that the initial tableau represents the pair of equations
Acxc = b, z = c1x1 + · · ·+ cnxn (2.4)
as

[               | 0 ]
[      Ac       | ⋮ ]   [ xc ]     [ b ]
[               | 0 ] · [ z  ]  =  [ 0 ]     (2.5)
[ −c1 ⋯ −cn 0 ⋯ 0 | 1 ]
plus the choice of initial basic variables xn+1, . . . , xn+m. The initial feasible basic solution comes
from this tableau:
x1 = · · · = xn = 0
xn+i = bi (i = 1, . . . ,m)
z = 0
To obtain the second tableau, and indeed any subsequent tableau, we perform elementary row
operations. This amounts to multiplying (2.5) on the left by some invertible matrix. As a result,
we obtain the equation

[        | 0 ]
[  BAc   | ⋮ ]   [ xc ]     [ Bb ]
[        | 0 ] · [ z  ]  =  [ e  ]     (2.6)
[   −d   | 1 ]
together with a new choice of basic variables. The set of solutions of (2.5) is the same as the set
of solutions of (2.6).
The matrix B is chosen according to our updating rules. This implies that the columns
corresponding to the basic variables are a permutation of the identity matrix, and the entries of
−d are zero when they correspond to a basic variable. Because of this, the values of the basic
variables and the value of z can be read off from the equation, as before.
In the initial solution, x1 = · · · = xn = 0. So, equation (2.4) gives z = 0. In the second feasible
basic solution, the variable xi from {x1, . . . , xn} with the largest positive ci value (and hence
most negative entry in the z row) has been allowed to be positive, while one of the variables
xn+1, . . . , xn+m has been set to zero. From equation (2.4), we see that setting xi > 0 will increase
z, while setting any of xn+1, . . . , xn+m equal to zero will not change z. Thus, the second feasible
basic solution will have a greater z value than the initial solution.
The same reasoning shows that our third feasible basic solution will be better than our second,
and so on.
The last row of (2.6) says that
z − dk1xk1 − · · · − dknxkn = e (2.7)
where xk1 , . . . , xkn are the n non-basic variables, and dk1 , . . . , dkn are the corresponding entries
of d. When all entries of the z row are non-negative (so, dk1 , . . . , dkn ≤ 0), it is clear that making
any of the non-basic variables into basic variables (i.e. allowing them to be positive) can only
decrease z. Therefore, we have found an optimal solution.
Example 2.6 (A simple example).
Maximise z = x1 + 2x2
s.t. x1 + x2 ≤ 2
x1, x2 ≥ 0
The corresponding canonical form is
Maximise z = x1 + 2x2
s.t. x1 + x2 + x3 = 2
x1, x2, x3 ≥ 0
As this is in three dimensions, the problem can be plotted.
[Figure: the triangle ABC cut out of the plane x1 + x2 + x3 = 2 in the positive octant, with vertex A on the x3-axis, B on the x1-axis and C on the x2-axis.]
The area enclosed by triangle ABC is the feasible region. (Notice that if you project this
region onto the x1-x2 plane, you obtain the feasible region of the standard form problem.)
The vertices A, B and C are the three basic solutions (two zero elements and one non-zero)
and they are all feasible.
The initial solution is vertex A = (0, 0, 2).
The optimal solution is vertex C = (0, 2, 0).
The simplex algorithm moves, at each iteration, from one feasible basic solution to an adjacent
and better solution, stopping when it can no longer find a better solution.
By the way, the triangle ABC is called a two-dimensional simplex (and this isn’t a coincidence.)
So, either:
• We move from A to B (i.e., enter x1 and exit x3), giving z = 2, followed by moving from B
to C (i.e., entering x2 and exiting x1), giving z = 4.
• Or we move directly from A to C (i.e., enter x2 and exit x3).
Exercise: Run the simplex algorithm on this problem, and investigate which route to the
optimal solution is taken.
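Using the simplex sketch from Section 2.3 on this problem (an illustration; simplex is the helper defined there, not part of the notes), the z row starts as (−1, −2, 0), so x2 enters first and the algorithm goes directly from A to C:

```python
import numpy as np

# Example 2.6: one constraint x1 + x2 <= 2, objective x1 + 2x2.
x, z = simplex(np.array([[1., 1.]]), np.array([2.]), np.array([1., 2.]))
print(x, z)   # [0. 2. 0.] 4.0 -- x2 entered, x3 exited: the direct route A -> C
```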
2.5. Another Example of the Simplex Algorithm
Example 2.7.
Maximise z = x1 + 2x2 + x3 + x4
s.t. 2x1 + x2 + 3x3 + x4 ≤ 8
2x1 + 3x2 + 4x4 ≤ 12
3x1 + x2 + 2x3 ≤ 18
x1, . . . , x4 ≥ 0
Introduce the slack variables x5, x6 and x7. These slack variables are the initial basic variables.
Initial Tableau
x1 x2 x3 x4 x5 x6 x7 Sol.
x5 2 1 3 1 1 0 0 8
x6 2 3 0 4 0 1 0 12
x7 3 1 2 0 0 0 1 18
z −1 −2 −1 −1 0 0 0 0
The first iteration of the algorithm:
x1 x2 x3 x4 x5 x6 x7 Sol. θ
x5 2 1 3 1 1 0 0 8 8
x6 2 [3] 0 4 0 1 0 12 4
x7 3 1 2 0 0 0 1 18 18
z −1 −2 −1 −1 0 0 0 0
[1] − (1/3)[2] → [1]
[3] − (1/3)[2] → [3]
(1/3)[2] → [2]
[z] + (2/3)[2] → [z]
The second tableau:

x1 x2 x3 x4 x5 x6 x7 Sol. θ
x5 4/3 0 [3] −1/3 1 −1/3 0 4 4/3
x2 2/3 1 0 4/3 0 1/3 0 4 ∞
x7 7/3 0 2 −4/3 0 −1/3 1 14 7
z 1/3 0 −1 5/3 0 2/3 0 8

(1/3)[1] → [1]
[2] → [2]
[3] − (2/3)[1] → [3]
[z] + (1/3)[1] → [z]
The third tableau:

x1 x2 x3 x4 x5 x6 x7 Sol.
x3 4/9 0 1 −1/9 1/3 −1/9 0 4/3
x2 2/3 1 0 4/3 0 1/3 0 4
x7 13/9 0 0 −10/9 −2/3 −1/9 1 34/3
z 7/9 0 0 14/9 1/3 5/9 0 28/3
After the second iteration (giving the third tableau) the stopping criterion has been reached. Thus
x∗2 = 4, x∗3 = 4/3, x∗1 = x∗4 = 0,
and the slack variables are
x∗5 = x∗6 = 0, x∗7 = 34/3,
and the optimal value is z∗ = 28/3 = 9 1/3. (Check this using z = x1 + 2x2 + x3 + x4.)
2.6. Practical considerations of the simplex algorithm
The choice of entering variable
Different implementations of the simplex algorithm use different methods for choosing the entering
variable. The method given in this course, that is “choose the most negative value in the z row”,
is a good, simple method.
In the situation where two or more entries in the z row have the same largest negative value,
just choose one arbitrarily.
Alternative optima
If one (or more) of the non-basic variables has a zero in the z row of the final tableau, there exists
an alternative optimal solution. In other words there are two (or more) feasible basic solutions
that give the same maximum value of the objective function.
Why is this?
Look at equation (2.7) in Section 2.4. The fact that there is a zero in one of the columns
corresponding to a non-basic variable means that one of dk1 , . . . , dkn (dkj , say) equals zero. So,
we can enter the non-basic variable xkj without changing the value of z.
The geometric interpretation of this is that the z-contours are parallel to one (or more) constraints and that any point on this constraint line also maximises z. So, alternative optimal basic solutions will always be adjacent.
Example 2.8. This is a trivial example to demonstrate the idea. Consider the following LPP.
Maximise z = x1 + 2x2
s.t. x1 ≤ 5
x2 ≤ 5
x1 + 2x2 ≤ 12
x1, x2 ≥ 0
Clearly the objective function is parallel to the third constraint.
[Figure: the feasible region bounded by x1 = 5, x2 = 5 and x1 + 2x2 = 12, with the contour z = 0 through the origin and vertices labelled A, B, C and D.]
The simplex method moves from the origin to point A to point B, where the algorithm stops.
Any point on the line BC is optimal, but only B and C are basic solutions.
Exercise: Run the simplex algorithm on this problem and verify that the z row of the final
tableau contains a zero in one of the columns corresponding to a non-basic variable.
Unbounded solutions
In the simplex algorithm, if you encounter the situation in which none of the θ-ratios are non-
negative, then the feasible region is unbounded. There is no optimal solution, and the simplex
algorithm breaks down.
In typical practical examples this means that the problem is ill-formulated. If the objective
function is ‘profit’, for instance, it’s clearly nonsense to say that this can become arbitrarily large
– there must be some missing constraint.
Example 2.9 (Another trivial example).
Maximise z = x1 + x2
s.t. −2x1 + x2 ≤ 10
x1 − x2 ≤ 5
x1, x2 ≥ 0
Exercise: Check that, in the second tableau for this problem, all of the θ-ratios are negative.
Degenerate problems
If two (or more) rows have the same smallest positive θ-ratio — so that there is a choice of
exiting variable — we have a degenerate problem. If we are unfortunate in our choice of the
exiting variable, then it is possible for the method to cycle, so that after some steps we return to
a tableau that was seen before. If that is the case, then the method will never stop, so we will
never reach the optimal solution.
What can one do about this?
• Recognise when a choice of exiting variable arises, i.e., when two or more of the θ-ratios
are equal.
• If this occurs, continue the simplex algorithm; if the stopping condition is reached, the
solution is optimal.
• Be aware that cycling may occur: compare at each step the set of basic variables to the
preceding tableaux, and if at any stage you encounter the same set twice, you have a
repeated tableau and the simplex method is not going to work.
There are special rules that one can apply to degenerate problems which ensure that cycling
will not occur, and good linear programming software can handle these cases. For more details,
see Kolman and Beck.
For this course, you only need to know about degeneracy and to be able to identify when it
occurs. You will not be expected to find the optimal solution to a degenerate problem where
cycling occurs.
2.7. Simplex Algorithm for More General Problems
So far, we have assumed that
• the objective function is to be maximised,
• all constraints are of the form ≤,
• all the resource values, bi, are ≥ 0,
• and all the control variables must be ≥ 0.
We now look at what to do when these are not true.
Minimisation Problems
Minimising
z = c1x1 + · · ·+ cnxn
is the same as maximising
−z = −c1x1 − · · · − cnxn.
Practically, this simply means that the objective row in the initial tableau reads
−z | c1 c2 · · · cn 0 · · · 0 | 0
That is, the z and −ci are replaced with −z and ci.
Negative Control Variables
So far, we have assumed that all control variables must be non-negative. When a basic variable (x1, say) exits the tableau, we set x1 to zero, because this maximises the increase in z. If x1 were allowed to be negative, we could increase z even more by putting x1 equal to some negative number. Therefore, the simplex algorithm as described above will not work.
The solution is straightforward. We replace x1 by two variables x1⁺ and x1⁻, and let x1 = x1⁺ − x1⁻, where x1⁺ ≥ 0 and x1⁻ ≥ 0. This allows the simplex algorithm to be specified in terms of non-negative control variables while still allowing x1 to be negative. Although the decomposition x1 = x1⁺ − x1⁻ is not unique, the simplex algorithm will always reach a final solution in which either x1⁺ = 0 or x1⁻ = 0.
You will not be required to solve problems of this kind in this course.
Artificial Variables
So far, we have assumed that all the constraints have a ≤ and all the resource values bi are
non-negative. In this case, we introduce slack variables xn+1, . . . , xn+m and the initial solution
x1, . . . , xn = 0 and xn+i = bi (i = 1, . . . ,m) is feasible.
There are three situations to consider where this is not the case.
1. When the constraint involves ≤ and the resource value bi is negative, simply multiply the
constraint by −1. E.g.
a11x1 + . . .+ a1nxn ≤ b1 < 0
becomes
−a11x1 − . . .− a1nxn ≥ −b1 > 0
But now the constraint is ≥.
2. When a constraint involves ≥, we introduce a surplus variable. E.g.
a21x1 + . . .+ a2nxn ≥ b2
becomes
a21x1 + . . .+ a2nxn − xn+2 = b2
If b2 ≤ 0, we can just set −xn+2 = b2. If b2 > 0, we cannot set −xn+2 = b2 because then
xn+2 < 0, which is not a feasible solution.
3. When a constraint is an equality, e.g.
a31x1 + . . .+ a3nxn = b3
we do not introduce a slack or surplus variable and there is no simple way to find an initial
basic solution in the original variables.
To solve such LPPs using the simplex algorithm, an artificial variable is added to each
constraint that has an = sign or is ≥ bi with bi > 0. So, we transform
a21x1 + · · · + a2nxn ≥ b2 → a21x1 + · · · + a2nxn − xn+2 = b2
→ a21x1 + · · · + a2nxn − xn+2 + y1 = b2
for a ‘≥ bi’ constraint with bi > 0, and
a31x1 + · · · + a3nxn = b3 → a31x1 + · · · + a3nxn + y2 = b3
for an ‘= bi’ constraint with bi > 0.
Now, in order to find an initial feasible basic solution, we can set x1 = · · · = xn = xn+2 = 0,
y1 = b2 and y2 = b3.
The purpose of an artificial variable is simply to allow a feasible initial basic solution; it has
no real life meaning, and we require that each artificial variable becomes zero in the final
solution.
The simplex method can be modified to ensure that the artificial variables become equal to
zero. This modification is called the two-phase simplex method.
2.8. Two-phase simplex method
This method involves running the simplex algorithm twice: once to eliminate all the artificial
variables and so find a feasible basic solution to the original problem; and then again after
dropping the artificial variables from the tableau in order to optimise the original problem.
Phase one
Define one artificial variable for every constraint that needs one; let us say there are p ≤ m such constraints, so we have artificial variables y1, . . . , yp. Define the auxiliary objective function as
z′ = −(y1 + · · ·+ yp),
which we want to maximise.
Suppose that the constraints are ordered such that constraints 1, . . . , p have artificial variables
and constraints p+ 1, . . . ,m do not. Then putting
xi = 0 for i = 1, . . . , n+ p
yi = bi for i = 1, . . . , p
and
xn+i = bi for i = p+ 1, . . . ,m
gives a feasible basic solution to the auxiliary problem.
We solve the auxiliary problem using the simplex algorithm with the auxiliary objective function, continuing until all the artificial variables have been eliminated; as soon as this happens, we stop the first phase.
Example 2.10. Consider the following example, which cannot be solved with the usual simplex
method.
Maximise z = −x1 + 5x2
s.t. 3x1 + 2x2 ≥ 18
2x2 ≤ 12
x1 ≤ 4
x1, x2 ≥ 0
Phase one Add slack and artificial variables to give the auxiliary problem.
Maximise z′ = −y1
s.t. 3x1 + 2x2 − x3 + y1 = 18
2x2 + x4 = 12
x1 + x5 = 4
x1, . . . , x5, y1 ≥ 0
The initial solution is x4 = 12, x5 = 4 and y1 = 18 (and x1 = x2 = x3 = 0.) If we follow the
usual rule for setting up the initial tableau, we would write the z′ row as...
x1 x2 . . . xm+n y1 . . . yp Sol
z′ 0 0 . . . 0 1 . . . 1 0?
(But does that ‘0’ make sense?) There is a problem here: all the non-basic variables have a zero in the z′ row, and it looks as if the stopping criterion has already been met. We have to re-write z′ as
z′ = −y1 = −18 + 3x1 + 2x2 − x3
i.e.,
z′ − 3x1 − 2x2 + x3 = −18.
In general, we have
z′ = −(y1 + · · · + yp)
and
ai,1x1 + · · · + ai,n+mxn+m + yi = bi,
giving
yi = bi − (ai,1x1 + · · · + ai,n+mxn+m),
and hence
z′ − Σ_{j=1}^{n+m} ( Σ_{i=1}^{p} ai,j ) xj = − Σ_{i=1}^{p} bi.
Note: rows 1, . . . , p are those rows which have an artificial variable.
The other requirement is that we “carry through” the original z row, to ensure it is in the correct form at the start of the second phase.
Returning to our example, this means the initial tableau should look like this.
Phase one: initial tableau (example)
x1 x2 x3 x4 x5 y1 Sol θ
y1 3 2 −1 0 0 1 18 6
x4 0 2 0 1 0 0 12 ∞
x5 1 0 0 0 1 0 4 4 →
z 1 −5 0 0 0 0 0
z′ −3 −2 1 0 0 0 −18
[1] − 3[3] → [1], [2] → [2], [3] → [3], [z] − [3] → [z], [z′] + 3[3] → [z′].
Second tableau
x1 x2 x3 x4 x5 y1 Sol θ
y1 0 2 −1 0 −3 1 6 3 →
x4 0 2 0 1 0 0 12 6
x1 1 0 0 0 1 0 4 ∞
z 0 −5 0 0 −1 0 −4
z′ 0 −2 1 0 3 0 −6
[3] → [3], [2] − [1] → [2], (1/2)[1] → [1], [z] + (5/2)[1] → [z], [z′] + [1] → [z′].
Final tableau

x1 x2 x3 x4 x5 y1 Sol
x2 0 1 −1/2 0 −3/2 1/2 3
x4 0 0 1 1 3 −1 6
x1 1 0 0 0 1 0 4
z 0 0 −5/2 0 −17/2 5/2 11
z′ 0 0 0 0 0 1 0
Now the artificial variable has exited, and so is zero. The solution for z′ is zero. (This must
be the case. Why?)
Our feasible solution for the original problem is x1 = 4, x2 = 3, x4 = 6 and x3 = x5 = 0, with
value z = 11.
Phase two Use the final tableau of phase one as the starting point for phase two. Simply drop
the z′ row and the y columns.
Phase two (example)
x1 x2 x3 x4 x5 Sol θ
x2 0 1 −1/2 0 −3/2 3 −∞
x4 0 0 1 1 3 6 2 →
x1 1 0 0 0 1 4 4
z 0 0 −5/2 0 −17/2 11

[1] + (1/2)[2] → [1], (1/3)[2] → [2], [3] − (1/3)[2] → [3], [z] + (17/6)[2] → [z].

x1 x2 x3 x4 x5 Sol
x2 0 1 0 1/2 0 6
x5 0 0 1/3 1/3 1 2
x1 1 0 −1/3 −1/3 0 2
z 0 0 1/3 17/6 0 28
So, our final solution is x∗1 = 2, x∗2 = 6, x∗3 = x∗4 = 0, x∗5 = 2, with value z∗ = 28.
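As a cross-check on the two-phase computation (an illustrative sketch, not part of the notes), scipy.optimize.linprog gives the same answer; the ≥ constraint is handled by negating it into ≤ form, so no artificial variables are needed on our side.

```python
# Checking Example 2.10: maximise -x1 + 5x2.
from scipy.optimize import linprog

res = linprog(c=[1, -5],              # minimise x1 - 5x2 <=> maximise -x1 + 5x2
              A_ub=[[-3, -2],         # 3x1 + 2x2 >= 18  <=>  -3x1 - 2x2 <= -18
                    [0, 2],           # 2x2 <= 12
                    [1, 0]],          # x1 <= 4
              b_ub=[-18, 12, 4],
              bounds=[(0, None), (0, None)])
print(res.x, -res.fun)                # [2. 6.] 28.0
```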
Summary of two-phase method
1. If all constraints are ≤ bi with bi ≥ 0 or ≥ bi with bi ≤ 0 then use the standard simplex
method.
2. Otherwise, introduce artificial variables.
3. Obtain the auxiliary objective function, in terms of xi.
4. Phase 1: Solve the auxiliary problem using the simplex algorithm on z′ and carrying through
z.
5. Phase 2: Drop the artificial variables and solve the original problem.
3. Sensitivity analysis and duality
3.1. Sensitivity analysis
Recall that sensitivity analysis means assessing how sensitive the optimal solution is to the
specification of the linear program. In Chapter 1, we looked at:
• how much the optimal value changes for a unit change in a resource value: the shadow price
of the resource.
• how much a resource value can change before the optimal solution moves to a different
extreme point (equivalently, the set of basic variables changes).
• how much the objective function can change before the optimal solution moves to a different
extreme point.
We looked at sensitivity analysis using the graphical solution method. Now, we consider an
analysis using the simplex algorithm.
Change to a Resource Value
Example 3.1. Consider the following linear programming problem.
Maximise z = x1 − 4x2 + 5x3
subject to x1 − 4x2 + 2x3 ≤ 20
2x1 + x2 + 3x3 ≤ 8
x1, x2, x3 ≥ 0.
In canonical form this is
Maximise z = x1 − 4x2 + 5x3
subject to x1 − 4x2 + 2x3 + x4 = 20
2x1 + x2 + 3x3 + x5 = 8
x1, x2, x3, x4, x5 ≥ 0.
This gives the following simplex algorithm tableaux.
Initial tableau
x1 x2 x3 x4 x5 Sol. θ
x4 1 −4 2 1 0 20 10
x5 2 1 3 0 1 8 8/3 →
z −1 4 −5 0 0 0
Enter x3, exit x5. [1] − (2/3)[2] → [1], (1/3)[2] → [2], [z] + (5/3)[2] → [z].
Second tableau
x1 x2 x3 x4 x5 Sol.
x4 −1/3 −14/3 0 1 −2/3 44/3
x3 2/3 1/3 1 0 1/3 8/3
z 7/3 17/3 0 0 5/3 40/3
So the optimal solution is x∗1 = 0, x∗2 = 0, x∗3 = 8/3, x∗4 = 44/3, x∗5 = 0, with optimal value
z∗ = 40/3.
We shall see later that the entries in the z row corresponding to the slack variables are the
shadow prices! The shadow prices for the first and second constraints are 0 and 5/3 respectively.
Definition 3.2. If the slack variable corresponding to a particular constraint is zero in the
optimal solution, then it is a binding constraint. Otherwise, it is a non-binding constraint.
Equivalently, a constraint is binding if the optimal solution exactly satisfies that constraint
(with equality).
Theorem 3.3 (Complementary slackness).
• Non-binding constraints have zero shadow prices.
• Equivalently: constraints with non-zero shadow prices are binding.
One might ask whether the converse of this theorem is true: do binding constraints always
have non-zero shadow prices? No. This might seem odd: if a constraint is binding, we would
think that we could obtain a higher return if we had a larger resource value for that constraint,
since we are currently bound by it. However, this isn’t guaranteed, and exercise sheet 2, question
1 gives an example where it fails.
In Example 3.1, the first constraint is non-binding, since x∗4 > 0. Hence, an improved objective
cannot be obtained by increasing the first resource value b1. The second constraint has shadow
price 5/3, so an improved objective can be obtained by increasing the second resource value
b2; however, we have to compute the shadow price to know this; it cannot be deduced from the
complementary slackness theorem.
How much can we change a resource value?
Suppose the value of b2, with shadow price 5/3, changes from 8 to 8 + ∆ for some small ∆ ∈ R.
Then, z increases by 5/3 · ∆ units. Typically, this will not be true for unlimited values of ∆:
there will be a point at which this constraint becomes non-binding because the other constraints
prevent further increases in z. Increasing b2 beyond this point will make no further difference to
z. At this point, the set of basic variables will change (the slack variable for the first constraint
will become basic). An important question is: for what range of values of b2 does the constraint
remain binding?
We could re-run the whole simplex algorithm again with the value of b2 replaced by its new
value. However, there is a quicker way.
Were we to repeat all the elementary row operations performed in the simplex algorithm,
we would find that only the solution column would change. So, instead of repeating all the
elementary row operations, just repeat them for the solution column. If all the entries in the
solution column remain non-negative, the solution we obtain is still feasible (since xi ≥ 0 ∀i) and
still optimal (since all the entries in the z row are still ≥ 0 – they have not changed).
In the above example, we obtain:
Solution column
Initial tableau Final tableau
Row 1 20 44/3− 2/3 ·∆
Row 2 8 + ∆ 8/3 + 1/3 ·∆
Row z 0 40/3 + 5/3 ·∆
So, for the final basic solution, we have x1 = x2 = x5 = 0, x3 = 8/3 + 1/3·∆ and x4 = 44/3 − 2/3·∆.
This is feasible if
8/3 + 1/3 ·∆ ≥ 0 and 44/3− 2/3 ·∆ ≥ 0,
or equivalently if
∆ ≥ −8 and ∆ ≤ 22.
So, the solution is feasible, and hence the shadow price is 5/3, if −8 ≤ ∆ ≤ 22; that is, if
0 ≤ b2 ≤ 30.
From our new solution column, we also see that the new optimal value is z∗ = 40/3 + 5/3 ·∆,
which confirms the shadow price.
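As a numerical aside (not part of the notes), the feasibility condition "old solution column plus ∆ times the slack column stays non-negative" is easy to automate. A small sketch, assuming NumPy:

    import numpy as np

    sol = np.array([44/3, 8/3])      # final solution column (rows 1 and 2)
    slack = np.array([-2/3, 1/3])    # final x5 column (slack of the changed constraint)

    # Feasibility requires sol + delta * slack >= 0 in every row.
    lo = max(-s / d for s, d in zip(sol, slack) if d > 0)   # binding from below
    hi = min(-s / d for s, d in zip(sol, slack) if d < 0)   # binding from above
    print(lo, hi)    # -8.0 22.0, matching -8 <= delta <= 22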
Notes
1. It is not necessary to recalculate the solution column like this. The new solution column
is always equal to the old solution column plus ∆ times the corresponding slack variable
column (in this case, the x5 column.) Importantly, this confirms our earlier claim:
the entries in the z row corresponding to the slack variables are equal to the shadow prices.
2. In the example, if the value of ∆ is outside the range −8 ≤ ∆ ≤ 22, then the optimal
solution will have a different set of basic variables and the whole simplex algorithm will
have to be re-run using the new value of b2.
3. We have assumed that only one resource value is changed. There are methods which allow
simultaneous changes to more than one resource value, but these are not covered in this
course.
4. Good linear programming software will usually report, for each constraint i, the range of
values of bi for which the optimal solution still has the same set of basic variables.
Changes to the objective function
We consider changes to basic and non-basic variables of the optimal solution separately.
Basic variables Suppose we change the coefficient c3 of x3 from c3 = 5 to c3 = 5 + δ, for some
small δ ∈ R. That is,
znew = x1 − 4x2 + (5 + δ)x3
The old optimal solution we obtained will always still be feasible, since the constraints are
unaltered. However, the value of z will change and the solution may no longer be optimal. In
the final tableau the z row will change in the following way:
Take the row corresponding to the changed variable (x3), multiply by δ and, for the columns
corresponding to the non-basic variables and the solution column only, add this to the z row.
x1 x2 x3 x4 x5 Sol.
x4 · · ·
x3 2/3 1/3 1 0 1/3 8/3
δx3 2/3 · δ 1/3 · δ δ 0 1/3 · δ 8/3 · δ
z 7/3 17/3 0 0 5/3 40/3
znew 7/3 + 2/3 · δ 17/3 + 1/3 · δ 0 0 5/3 + 1/3 · δ 40/3 + 8/3 · δ
This solution will still be optimal provided that 7/3 + 2/3 · δ ≥ 0, 17/3 + 1/3 · δ ≥ 0 and
5/3 + 1/3 · δ ≥ 0.
Simplifying this, we see that it is equivalent to δ ≥ −7/2, and so the coefficient for x3 (i.e.,
c3 = 5 + δ) must satisfy c3 ≥ 3/2, and the optimal value will be
z∗new = 40/3 + 8/3 · δ.
Non-basic variables Suppose that the objective function has a change in the coefficient of a
non-basic variable, such as x2, by an amount δ. That is, z becomes
znew = x1 + (δ − 4)x2 + 5x3.
The only change to the final tableau is that the z-row entry of the variable whose coefficient has
been changed (x2) changes by an amount −δ. Thus the z row becomes
znew | 7/3 17/3− δ 0 0 5/3 | 40/3
The solution will still be optimal, and the value unchanged, provided 17/3 − δ ≥ 0, i.e., provided
δ ≤ 17/3. In other words, we require c2 = δ − 4 ≤ 17/3 − 4 = 5/3.
Again, the above methods only work for a change to a single coefficient ci.
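The optimality check for the basic-variable case above can also be scripted. A sketch assuming NumPy, using the non-basic columns x1, x2 and x5 of the final tableau:

    import numpy as np

    z_row = np.array([7/3, 17/3, 5/3])   # z-row entries of the non-basic columns x1, x2, x5
    x3_row = np.array([2/3, 1/3, 1/3])   # x3-row entries of the same columns

    # Optimality requires z_row + delta * x3_row >= 0 in every non-basic column.
    lo = max(-z / r for z, r in zip(z_row, x3_row) if r > 0)
    print(lo)    # -3.5, i.e. delta >= -7/2, so c3 >= 3/2 as found above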
3.2. Duality
Each LPP can be formulated in two different ways: the primal problem and the dual problem.
We have been considering LPPs of the form
Maximise z = c1x1 + · · ·+ cnxn
subject to a11x1 + · · ·+ a1nxn ≤ b1
...
am1x1 + · · ·+ amnxn ≤ bm
x1, . . . , xn ≥ 0.
This is the primal problem. The dual problem is the LPP:
Minimise g = b1y1 + · · ·+ bmym
subject to a11y1 + · · ·+ am1ym ≥ c1
...
a1ny1 + · · ·+ amnym ≥ cn
y1, . . . , ym ≥ 0.
Notice that the dual of the dual problem is the primal problem.
Example 3.4. The dual problem corresponding to Example 3.1 is over control variables y1, y2:
Minimise g = 20y1 + 8y2
subject to y1 + 2y2 ≥ 1
−4y1 + y2 ≥ −4
2y1 + 3y2 ≥ 5
y1, y2 ≥ 0.
Check that the optimal solution to the dual problem is y∗1 = 0, y∗2 = 5/3, with optimal
value g∗ = 40/3.
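One way to do this check is with an LP solver. A sketch assuming SciPy, with each ≥ constraint negated into a ≤ constraint:

    from scipy.optimize import linprog

    c = [20, 8]                          # minimise g = 20 y1 + 8 y2
    A_ub = [[-1, -2], [4, -1], [-2, -3]] # the three >= constraints, multiplied by -1
    b_ub = [-1, 4, -5]

    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * 2, method="highs")
    print(res.x)      # approx. [0, 1.667], i.e. y* = (0, 5/3)
    print(res.fun)    # approx. 13.333, i.e. g* = 40/3 = z*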
Notice that g∗ = z∗ and that y∗1 and y∗2 are equal to the shadow prices for the primal problem.
In matrix form, the primal problem is

Maximise z = c^T x subject to Ax ≤ b, x ≥ 0,

and the corresponding dual problem is

Minimise g = b^T y subject to A^T y ≥ c, y ≥ 0.
Theorem 3.5. The optimal value of the objective function in the primal problem is the same
as the optimal value of the objective function in the dual problem. That is,
z∗ = g∗.
Consider the dual problem in Example 3.4. Suppose b2 is changed to b′2 = b2+1 = 9. Assuming
that the solution y∗ remains optimal, the new value of g∗ is
g′∗ = 20y∗1 + 9y∗2 = g∗ + y∗2.
By increasing b2 by one unit, the new optimal value of g has increased by an amount y∗2. The-
orem 3.5 tells us that the optimal value of z∗ will have also increased by y∗2. Hence y∗2 is the
shadow price for the second constraint of the primal problem. This gives us:
Theorem 3.6. The optimal values of the control variables in the dual problem are the shadow
prices of the primal problem, and vice versa.
Interpretation of the dual problem
Let us return to the alarm clock problem.
Mr Primal is running a factory. He has 1600 units of labour, 1800 units of processing and 350
alarm assemblies at his disposal. He plans to produce x1 standard clocks and x2 alarm clocks.
For each standard clock he will get £3 and for each alarm clock £8.
An accident occurs and one unit of one of the resources (labour, processing or alarm assemblies)
is damaged and can no longer be used. Fortunately, Mr Primal is insured for such losses by Ms
Dual.
Mr Primal and Ms Dual have agreed beforehand how much one unit of each resource is worth,
and so how much Ms Dual will compensate Mr Primal to ensure that he does not suffer a financial
loss. Let yi denote the value of one unit of the ith resource (i = 1, 2, 3) that they agreed.
Mr Primal explains to Ms Dual that each standard clock uses 2 labour units and 6 processing
units and makes £3 profit. Each alarm clock uses 4 labour units, 2 processing units and 1 alarm
assembly and makes £8. Thus, for the compensation to be fair, he insists that y1, y2 and y3 must
satisfy the constraints
2y1 + 6y2 ≥ 3 (3.1)
4y1 + 2y2 + y3 ≥ 8
since, for example, with 2 units of labour and 6 of processing, he can make £3 of profit.
Ms Dual agrees this is fair. However, she wants to minimise her liabilities. If the factory burns
down and all Mr Primal's resources are lost, she will have to pay out
g = 1600y1 + 1800y2 + 350y3
pounds.
So, Ms Dual wants to solve the following LPP:
Minimise g subject to the constraints (3.1).
This is the dual of Mr Primal’s original optimisation problem.
4. Game theory
4.1. Introduction
In Chapters 1–3, we looked at situations in which one individual chooses what to do in order
to optimise some quantity (the objective function). In this chapter, we look at strategic games
(or games). In a game, two or more individuals (the players) each try to maximise a different
quantity (the payoff for that player) which depends not only on what he/she does, but also on
what the other player(s) do.
Game Theory is applied in many fields:
• games of fun (Monopoly, card games, noughts and crosses)
• business and economics
• sociology
• politics
It is often necessary to idealise (simplify) the real problem, in order to make it susceptible to
analysis.
Games may be classified as:
• simultaneous-move or sequential-move
• zero-sum or non-zero sum (variable-sum)
• two-player or n-player (n > 2)
In this chapter, we shall look at two-player, zero-sum, simultaneous-move games. In Section 4.9
we shall briefly look at ‘games against nature’.
An important assumption of game theory is that each player seeks to maximise their payoff
and choose the best strategy to achieve this. (Economists call this ‘rationality’.) Under this
assumption, we can try to understand each player’s best (optimal) strategy.
4.2. Basic formulation
We have two players (A and B). Each has a number of possible moves and must choose one of
these. We are dealing with simultaneous-move games, so players must choose their move before
knowing which move the other has chosen. The outcome of the game is a payoff for Player A
and a payoff for Player B. These depend on Player A and B’s choices of move. As we are dealing
with zero-sum games, the payoff to Player B is minus the payoff to Player A. So, we require only
one payoff matrix:
Player B
B1 B2 · · · Bm
Player A
A1 a11 a12 · · · a1m
. . .
An an1 an2 · · · anm
If Player A chooses move Ai and Player B chooses move Bj, the payoff to Player A is aij and
to Player B is −aij. The convention is that the numbers in the payoff matrix are the payoffs for
the player to the left of the matrix (Player A in the matrix above).
Example 4.1.
Player B
B1 B2
Player A A1 0 −4
A2 3 2
Definition 4.2. A strategy is a rule for determining which move to play.
A pure strategy specifies that the same move always be chosen.
A mixed strategy specifies that the move be chosen at random, with each move having some
specific probability of being chosen. E.g., Player B chooses B1 with probability y1, B2 with
probability y2, etc. (and y1 + y2 + · · ·+ ym = 1).
(Notice that a pure strategy is a mixed strategy in which one of the probabilities equals one
and the rest are zero.)
Suppose the payoff matrix is
Player B
B1 . . . Bm
Player A
A1 a11 . . . a1m
. . .
An an1 . . . anm
Let xi (i = 1, . . . , n) be the probability that Player A chooses move Ai and yj be the probability
that Player B chooses move Bj , and let x = (x1, . . . , xn) and y = (y1, . . . , ym). Then the
expected payoff for Player A is
E(x, y) = Σ_{i=1}^n Σ_{j=1}^m xi aij yj.
Note the non-standard notation for expectation.
4.3. Nash equilibrium
We seek the optimal strategy for each player. These are defined in terms of the Nash equilib-
rium.
The Nash equilibrium of a game is a pair of strategies, one for each player, such that each
player’s strategy is best for them, given that the other player is playing their own equilibrium
strategy.
Another way of looking at this is that neither player should want to change his strategy if he
knew what strategy the other player was using.
In fact, we shall often pretend that the other player knows our strategy in order to work out
what our optimal strategy is.
Example 4.3. Consider this payoff matrix:
Player B
B1 B2
Player A A1 0 −4
A2 3 2
The Nash equilibrium is: Player A always chooses A2 and Player B always chooses B2. You
can check this: if B plays B2, is it better for A to play A1 or A2? What if A mixes strategies?
Likewise, if A plays A2, what is best for B?
The equilibrium consists of pure strategies: x = (0, 1), y = (0, 1). The payoff will be 2 (to
Player A).
Example 4.4. Consider this payoff matrix:
Player B
B1 B2
Player A A1 4 3
A2 2 8
Do x = (1/2, 1/2) and y = (1/4, 3/4) constitute a Nash equilibrium?
The expected payoff with the proposed actions is E(x, y) = 4·(1/2)·(1/4) + 3·(1/2)·(3/4) + 2·(1/2)·(1/4) + 8·(1/2)·(3/4) = 39/8.
Suppose that B plays y, and A knows that. If A plays A2, i.e., uses strategy x′ = (0, 1), then the
expected payoff to A is E(x′, y) = 2·(1/4) + 8·(3/4) = 13/2 > 39/8. The fact that this is greater than the
payoff calculated above tells us that (x, y) is not an equilibrium.
In fact, the equilibrium is x∗ = (6/7, 1/7), y∗ = (5/7, 2/7), with payoff 26/7.
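Since E(x, y) is just the bilinear form x^T A y, these checks take a few lines. A sketch assuming NumPy:

    import numpy as np

    A = np.array([[4, 3], [2, 8]])
    x, y = np.array([1/2, 1/2]), np.array([1/4, 3/4])
    print(x @ A @ y)                   # 4.875 = 39/8

    # A can do better against y by playing A2, so (x, y) is not an equilibrium:
    print(np.array([0, 1]) @ A @ y)    # 6.5 = 13/2

    # At the claimed equilibrium, A is indifferent between its moves against y*:
    xs, ys = np.array([6/7, 1/7]), np.array([5/7, 2/7])
    print(A @ ys)                      # [26/7, 26/7]: every row gives the value
    print(xs @ A @ ys)                 # 3.714... = 26/7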
Definition 4.5. If a Nash equilibrium x∗,y∗ exists, then x∗ and y∗ are the optimal strategies,
and
v = E(x∗,y∗)
is called the value of the game.
Corollary 4.6.
For all y, E(x∗,y) ≥ v.
For all x, E(x,y∗) ≤ v.
Theorem 4.7. Every two-player, simultaneous-move, zero-sum game has a Nash equilibrium
and hence an optimal strategy for each player and a value.
The optimal strategies may be pure or mixed.
If a pure strategy is optimal for one player then a pure strategy is also optimal for the other.
How can we solve a game? ‘Solving’ means finding optimal strategies x∗, y∗, and hence the
value v.
We will cover the following solution methods (in order of increasing complexity):
• Is there a single dominant strategy?
• Look for a saddle point (a pure Nash equilibrium)
• Formulate in terms of maximin/minimax strategies
• Graphical methods
• Formulate as a linear programming problem
The first two methods will find optimal strategies if they are pure; the others are needed when
mixed strategies are involved.
4.4. Dominance
Definition 4.8. Let xp and x′p be pure strategies. Then xp dominates x′p if
E(xp,yq) ≥ E(x′p,yq) for all pure strategies yq
Also, yq dominates y′q if
E(xp,yq) ≤ E(xp,y′q) for all pure strategies xp
Dominated strategies are sub-optimal and can be eliminated (for the purpose of solving the
game).
Example 4.9.
Player B
B1 B2 B3 B4 B5
Player A A1 −1 −2 1 2 0
A2 −2 1 0 −3 −1
A3 −3 −2 0 0 0
We can drop the dominated strategies from the game. The first row dominates the third row.
The first column dominates B3 and B5. The sub-payoff matrix is now
Player B
B1 B2 B4
Player A A1 −1 −2 2
A2 −2 1 −3
If, for Player A, pure strategy xp dominates all other pure strategies, xp is Player A’s optimal
strategy. Player B’s optimal strategy is now the pure strategy that minimises A’s payoff when A
uses xp.
Example 4.10.
Player B
B1 B2 B3
Player A
A1 −2 5 −2
A2 0 7 1
A3 −3 6 1
Pure strategy A2 (i.e. x = (0, 1, 0)) dominates the other two pure strategies, A1 and A3 (i.e.
x = (1, 0, 0) and x = (0, 0, 1)). Hence, x∗ = (0, 1, 0).
What is B’s optimal strategy? B should play B1; that is, y∗ = (1, 0, 0).
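Eliminating dominated strategies is mechanical enough to automate. The sketch below (assuming NumPy; reduce_by_dominance is our own illustrative helper, not a library function) repeatedly deletes any pure strategy that is dominated in the sense of Definition 4.8, and reproduces the reduction of Example 4.9:

    import numpy as np

    def reduce_by_dominance(A):
        A = np.asarray(A, dtype=float)
        changed = True
        while changed:
            changed = False
            for i in range(A.shape[0]):      # rows: payoffs to A, bigger is better
                if any(k != i and np.all(A[k] >= A[i]) for k in range(A.shape[0])):
                    A = np.delete(A, i, axis=0)
                    changed = True
                    break
            if changed:
                continue
            for j in range(A.shape[1]):      # columns: B pays out, smaller is better
                if any(k != j and np.all(A[:, k] <= A[:, j]) for k in range(A.shape[1])):
                    A = np.delete(A, j, axis=1)
                    changed = True
                    break
        return A

    print(reduce_by_dominance([[-1, -2, 1, 2, 0],
                               [-2, 1, 0, -3, -1],
                               [-3, -2, 0, 0, 0]]))
    # [[-1. -2.  2.]
    #  [-2.  1. -3.]]  -- the reduced matrix of Example 4.9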
4.5. Saddle Points
Suppose that neither Player A nor Player B has a pure strategy that dominates all his/her other
pure strategies. Then we look for saddle points.
Definition 4.11. The pure maximin strategy for Player A is the strategy Ai∗ with maximum
row-minimum. That is,

i∗ = argmax_{i=1,...,n} min_{j=1,...,m} aij = argmax_{i=1,...,n} { min(ai1, . . . , aim) }.

The pure minimax strategy for Player B is the strategy Bj∗ with minimum column-maximum.
That is,

j∗ = argmin_{j=1,...,m} max_{i=1,...,n} aij = argmin_{j=1,...,m} { max(a1j , . . . , anj) }.

If these two payoffs are equal, then the game has a saddle point.
Theorem 4.12 (Saddle point theorem). If a two-player, zero-sum, simultaneous-move game has
a saddle point, the pure maximin and minimax strategies constitute a Nash equilibrium and are
optimal strategies. The value of the game, v, equals the maximin and minimax payoffs.
Example 4.13. Two companies have rival products of beer, wine and spirits. Each month
they choose which product to advertise. The following table shows the gain in A’s profits (in
£thousands per month) given the combination of advertising actions. We shall assume that B’s
profits drop by the same amount.
Company B
Bb Bw Bs
Company A Ab 7 8 5
Aw −2 12 0
As 9 4 −4
There is a saddle point at (Ab, Bs): A should advertise beer and B should advertise spirits, for
a payoff of 5 to A.
Exercise: check that these strategies do form a Nash equilibrium.
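A computational companion for the pure-strategy part of this check (a sketch, assuming NumPy):

    import numpy as np

    A = np.array([[7, 8, 5],
                  [-2, 12, 0],
                  [9, 4, -4]])

    row_min = A.min(axis=1)      # [5, -2, -4]
    col_max = A.max(axis=0)      # [9, 12, 5]
    i_star = row_min.argmax()    # 0 -> Ab, maximin payoff 5
    j_star = col_max.argmin()    # 2 -> Bs, minimax payoff 5

    if row_min[i_star] == col_max[j_star]:
        print("saddle point at", (i_star, j_star), "value", row_min[i_star])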
Notes
• A game may have multiple saddle points.
• If there is a pure strategy that dominates all other pure strategies, it will also be found by
looking for a saddle point.
• Before looking for saddle points, you may want to eliminate dominated strategies (but this
is not necessary.)
4.6. Maximin and minimax mixed strategies
If there is no saddle point, the optimal solution is mixed.
When searching for a saddle point, we considered only pure maximin/minimax strategies. We
now consider mixed maximin/minimax strategies.
In order for Player A to determine their maximin strategy, they must consider what strategy
Player B would adopt if they knew x = (x1, . . . , xn). Player B would choose the move Bj for
which the expected payoff (to A)
Σ_{i=1}^n xi ai,j

is minimised. So, the expected payoff would be

min( Σ_{i=1}^n xi ai,1 , . . . , Σ_{i=1}^n xi ai,m ).

Player A's maximin strategy is the strategy x that maximises this:

x∗ = argmax_{(x1,...,xn)} min( Σ_{i=1}^n xi ai,1 , . . . , Σ_{i=1}^n xi ai,m ).

Similarly, the minimax strategy for B is

y∗ = argmin_{(y1,...,ym)} max( Σ_{j=1}^m yj a1,j , . . . , Σ_{j=1}^m yj an,j ).
Theorem 4.14. The maximin and minimax strategies constitute a Nash equilibrium, and so are
optimal strategies.
4.7. Graphical solution for an n× 2 game
Assume we have eliminated all dominated strategies and looked for a saddle point. No saddle
point has been found, so we consider mixed strategies.
Example 4.15.
Player B
B1 B2
Player A A1 2 −1
A2 −3 4
A3 0 2
(Check that there are no dominated strategies and no saddle point.)
Start by working out B’s minimax strategy. Given strategy y = (y1, 1−y1) for B, the expected
payoff to A for each of their pure strategies is
A1 : 2y1 − 1(1− y1) = 3y1 − 1
A2 : −3y1 + 4(1− y1) = 4− 7y1
A3 : 0y1 + 2(1− y1) = 2− 2y1.
Player B's minimax strategy is

argmin_{0≤y1≤1} max( 3y1 − 1, 4 − 7y1, 2 − 2y1 ).
We can plot the three components in the above formula as lines with y1 (0 ≤ y1 ≤ 1) on the
horizontal axis.
[Figure: the three lines A1: 3y1 − 1, A2: 4 − 7y1 and A3: 2 − 2y1 plotted against y1 ∈ [0, 1]; the vertical axis is the expected payoff to A.]
y∗1, the optimal choice of y1, is the place where the maximum of the lines in the plot has its
smallest value. We can find it as the intersection of the lines A1 and A3.
3y∗1 − 1 = 2 − 2y∗1, giving y∗1 = 3/5 and y∗2 = 2/5.

So, B's optimal strategy is y∗ = (3/5, 2/5). Now, we need to find A's optimal solution.
Theorem 4.16. In an n × 2 game, the optimal (maximin) strategy for A uses at most two pure
strategies: those whose lines intersect at B's minimax solution.
If more than two strategies intersect at this point, choose any two that have opposite gradients.
In this example, A1 and A3 intersect at the minimax solution. This implies that A2 is not used
in the optimal solution and so x2 = 0.
We need to find the values of x1 and x3 that maximise
min( 2x1, −x1 + 2x3 ),

with x3 = 1 − x1. In other words, we want the x1 that maximises

min( 2x1, 2 − 3x1 ).
This minimum occurs where the two lines intersect (draw the graph of 2x1 and 2 − 3x1 against
x1 to check this). Hence we have
2x∗1 = 2 − 3x∗1 ⇒ x∗1 = 2/5.
The optimal solution for this game is

x∗ = (2/5, 0, 3/5), y∗ = (3/5, 2/5),
v = 2 · (2/5) · (3/5) − 1 · (2/5) · (2/5) + 2 · (3/5) · (2/5) = 4/5.
To solve a 2 × m game, an almost identical approach is used. The maximin strategy for A is
found graphically by taking the maximum point of the lowest of the m lines. Then the minimax
strategy for B can be found. Always start with the player who has only two possible moves.
4.8. Linear programming formulation of a game
The following method can always be used, even if n > 2 and m > 2.
Suppose Player B plays the mixed strategy y and Player A knows what y is. Then Player A
will choose their move Ai to maximise the payoff, i.e., to obtain
u = max_{i=1,...,n} { Σ_{j=1}^m ai,j yj }.
Player B’s minimax strategy is that which minimises u. From the definition of u we have
Σ_{j=1}^m a1,j yj ≤ u,  Σ_{j=1}^m a2,j yj ≤ u,  . . . ,  Σ_{j=1}^m an,j yj ≤ u   (4.1)
and
y1 + · · ·+ ym = 1 (4.2)
Taking (4.1) and (4.2) and dividing through by u (which for the moment we will assume is
positive), we get
a1,1 (y1/u) + · · · + a1,m (ym/u) ≤ 1
. . .
an,1 (y1/u) + · · · + an,m (ym/u) ≤ 1
(y1/u) + · · · + (ym/u) = 1/u
Player B wants to choose y1, . . . , ym to minimise u.
The above equations can be expressed as a linear programming problem:
maximise Y1 + · · ·+ Ym = Z
subject to a1,1Y1 + · · ·+ a1,mYm ≤ 1
...
an,1Y1 + · · ·+ an,mYm ≤ 1
Y1, . . . , Ym ≥ 0,
where Yj = yj/u and Z = 1/u.
Now, suppose that Player A plays mixed strategy x. Player B will choose their move Bj to
minimise the payoff to A, i.e., to obtain
w = min_{j=1,...,m} { Σ_{i=1}^n ai,j xi }.
Player A wants to maximise w. We have
Σ_{i=1}^n ai,1 xi ≥ w,  . . . ,  Σ_{i=1}^n ai,m xi ≥ w
and
x1 + · · ·+ xn = 1,
which yields (assuming w is positive)
Σ_{i=1}^n ai,1 (xi/w) ≥ 1,  . . . ,  Σ_{i=1}^n ai,m (xi/w) ≥ 1
(x1/w) + · · · + (xn/w) = 1/w
Let Xi = xi/w and V = 1/w. We have
minimise X1 + · · ·+Xn = V
subject to a1,1X1 + · · ·+ an,1Xn ≥ 1
...
a1,mX1 + · · ·+ an,mXn ≥ 1
X1, . . . , Xn ≥ 0.
Recognise this as the dual of the first LPP. Hence Z∗ = V ∗, i.e. min u = max w.
Theorem 4.17. Provided the value of the game is greater than zero (v > 0), the value of the
game is v = min u = 1/Z∗ = 1/V∗, and the optimal strategies are x∗i = X∗i/V∗ and y∗j = Y∗j/Z∗.
Solving the first LPP using the simplex algorithm gives Z∗ = 1/u∗ = 1/v and Y∗j = y∗j/u∗
(j = 1, . . . , m).
How should we calculate x∗i (i = 1, . . . , n)?
To complete the method we must consider what to do when v ≤ 0. Since the value of the game
cannot be less than the smallest value in the payoff matrix, we can simply add a constant to all
entries in the payoff matrix, solve the LPP, and then subtract the same constant at the end.
If a′ is the minimum entry in the payoff matrix, and a′ ≤ 0, add −a′ + 1 to all payoff matrix
entries, solve the LPP and then find y∗j = Y∗j/Z∗ and v = 1/Z∗ + a′ − 1.
Example 4.18. Suppose we have the following payoff matrix.
B1 B2 B3
A1 3 -2 -3
A2 -1 0 -2
A3 -2 -1 3
The smallest entry is −3, so we add 4 to every entry and get,
B1 B2 B3
A1 7 2 1
A2 3 4 2
A3 2 3 7
which must have a value greater than zero. The LPP is
Maximise Z = Y1 + Y2 + Y3
subject to 7Y1 + 2Y2 + Y3 ≤ 1
3Y1 + 4Y2 + 2Y3 ≤ 1
2Y1 + 3Y2 + 7Y3 ≤ 1
Y1, Y2, Y3 ≥ 0
This gives the initial tableau:
Y1 Y2 Y3 Y4 Y5 Y6 Sol.
Y4 7 2 1 1 0 0 1
Y5 3 4 2 0 1 0 1
Y6 2 3 7 0 0 1 1
Z −1 −1 −1 0 0 0 0
The final tableau is
Y1 Y2 Y3 Y4 Y5 Y6 Sol
Y1 1 0 0 2/11 −1/11 0 1/11
Y2 0 1 0 −17/121 47/121 −1/11 19/121
Y3 0 0 1 1/121 −17/121 2/11 6/121
Z 0 0 0 6/121 19/121 1/11 36/121
The optimal solution is Z∗ = 36/121, Y∗1 = 1/11, Y∗2 = 19/121, Y∗3 = 6/121. This gives the optimal
strategy for Player B: y∗1 = Y∗1/Z∗ = 11/36, y∗2 = 19/36, y∗3 = 1/6.
The optimal solution for Player A can be obtained using duality. The shadow prices for the
primal problem are the solution to the dual problem. Thus, X∗1 = 6/121, X∗2 = 19/121, X∗3 = 1/11, and
Player A's optimal strategy is x∗1 = X∗1/Z∗ = 1/6, x∗2 = 19/36, x∗3 = 11/36.
Finally, v = 1/Z∗ − 4 = 121/36 − 4 = −23/36. The value, v, has turned out to be negative, so we did need to
add a constant to the payoff matrix entries.
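The whole recipe of Example 4.18 (shift the matrix, solve the LPP, read off both strategies, unshift the value) can be scripted. A sketch assuming a recent SciPy: the duals are read from res.ineqlin.marginals, which the HiGHS backend reports, with a sign flip because linprog minimises.

    import numpy as np
    from scipy.optimize import linprog

    A = np.array([[3, -2, -3], [-1, 0, -2], [-2, -1, 3]], dtype=float)
    shift = 1 - A.min()              # add -a' + 1 = 4 so the value is positive
    As = A + shift

    # Maximise Z = sum(Y) s.t. As @ Y <= 1, Y >= 0 (negated for the minimiser).
    res = linprog(-np.ones(3), A_ub=As, b_ub=np.ones(3), method="highs")
    Z = -res.fun                           # 36/121
    y = res.x / Z                          # B's strategy (11/36, 19/36, 1/6)
    x = -res.ineqlin.marginals / Z         # A's strategy via the duals (1/6, 19/36, 11/36)
    v = 1 / Z - shift                      # value of the original game, -23/36
    print(Z, y, x, v)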
4.9. Games against nature
In this setting, we only have one ‘rational’ player, but the payoff also depends on some as yet
unknown state of nature. Here ‘nature’ might mean nature (e.g., weather, biology) or some
other phenomenon over which the player has no influence (e.g., stock-markets, traffic conditions,
political policy).
We adopt a similar framework to the two-player game, assuming that nature is player B.
However we no longer assume that player B adopts a strategy to minimise player A’s reward.
Nature is not out to get us. This fact means we must change our optimisation approach.
Example 4.19 (Electronics company). An electronics company is considering launching a new
product. They have 3 options:
A1: Launch now with a big advertising campaign
A2: Launch now with a minimal advertising campaign
A3: Make a small pilot study to see if the market is receptive to the product.
Nature is the market reaction, which can be good, moderate or poor.
Strategy       Good                                     Moderate                    Poor
A1 (risky)     High sales, high costs                   Moderate sales, high costs  Poor sales, high costs
A2 (neutral)   Would do better with more advertising    Moderate sales, low costs   Poor sales, low costs
A3 (cautious)  Competitors steal the market             Lower sales, lower costs    Very low costs, can switch to another product
The payoff matrix is given by
Good Moderate Poor
A1 (risky) 110 45 -30
A2 (neutral) 90 55 -10
A3 (cautious) 80 40 10
In the previous sections A’s strategy was determined by the maximin criterion. The maximin
strategy is pure strategy A3, as there is a saddle point at the cautious/poor entry (10). However,
unless we assume that the market is trying to minimise this company’s costs, this is a worst-case
analysis.
To simplify things, we will only consider pure strategies for games against nature.
There is no single optimal strategy. It depends on the player’s attitude to risk and/or how
likely they think the various possible states of nature are.
4.9.1. Maximax strategy
This is the strategy that can give the maximum payoff (but only if the state of nature happens
to be ‘right’). Thus the maximax strategy is the one that corresponds to the row with the largest
entry in the payoff matrix. In Example 4.19 this is A1.
The maximax strategy is risk taking.
4.9.2. Laplacian strategy
The Laplacian strategy is the one that maximises the expected payoff, assuming that each state
of nature is equally likely.
The expected payoffs for strategies A1, A2 and A3 are
For A1: 110/3 + 45/3 − 30/3 = 125/3 ≈ 41.7
For A2: 90/3 + 55/3 − 10/3 = 45
For A3: 80/3 + 40/3 + 10/3 = 130/3 ≈ 43.3
Thus, the Laplacian strategy is A2.
4.9.3. Hurwicz strategy (not examinable in 2022–23)
Fix α ∈ [0, 1]. For a move Ai the Hurwicz number Hi is
Hi = αaiu + (1− α)ail,
where aiu is the maximum entry in row i and ail is the minimum entry in row i of the payoff
matrix.
The Hurwicz strategy is the strategy that gives the greatest Hurwicz number. It depends on
α.
When α = 0 we get the maximin solution and when α = 1 we get the maximax solution. So,
α is a weighting of the best and worse states of nature.
In our example we have:
H1 = 110α− 30(1− α) = 140α− 30
H2 = 90α− 10(1− α) = 100α− 10
H3 = 80α+ 10(1− α) = 70α+ 10
[Figure: the Hurwicz numbers H1(α), H2(α) and H3(α) plotted against α ∈ [0, 1]; the vertical axis H(α) runs from −30 to 110.]
So, when α = 0.25, we have H1 = 5, H2 = 15 and H3 = 27.5, giving A3 as the optimal strategy.
For what values of α is A3 the Hurwicz strategy?
A3 is Hurwicz optimal if α < α0, where α0 is the value at which the lines of H1 and H3
intersect. In other words, 140α0 − 30 = 70α0 + 10, giving α0 = 4/7.
In general, the Hurwicz solution is a compromise of the two extreme states of nature. However,
the method requires us to choose α and it does not take into account the other states of nature
(which may be more likely).
4.9.4. Minimax regret strategy
Definition 4.20. The regret for a strategy and state of nature combination is the cost of taking
that strategy compared to the best strategy for that state of nature.
For example, if the market turns out to be good, with A1 we have a payoff of 110 which is the
best payoff for that outcome, and we have zero regret. For A2 we only get 90, when we could
have obtained a payoff of 110, which is a regret of 20.
The regret matrix is
Good Moderate Poor
A1 (risky) 0 10 40
A2 (neutral) 20 0 20
A3 (cautious) 30 15 0
The column minimum is always zero.
The minimax regret strategy is that which minimises the maximum regret, which in this case
is A2.
We can also use the minimin, Hurwicz or Laplace criterion on the regret matrix.
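All of these criteria are one-liners on the payoff matrix. A sketch for Example 4.19, assuming NumPy (the Hurwicz line uses α = 1/4 as in the previous subsection):

    import numpy as np

    A = np.array([[110, 45, -30],    # A1 risky
                  [90, 55, -10],     # A2 neutral
                  [80, 40, 10]])     # A3 cautious

    print(A.min(axis=1).argmax())        # maximin: 2 (A3)
    print(A.max(axis=1).argmax())        # maximax: 0 (A1)
    print(A.mean(axis=1).argmax())       # Laplacian: 1 (A2)

    regret = A.max(axis=0) - A           # column best minus each entry
    print(regret.max(axis=1).argmin())   # minimax regret: 1 (A2)

    alpha = 0.25
    H = alpha * A.max(axis=1) + (1 - alpha) * A.min(axis=1)
    print(H.argmax())                    # Hurwicz: 2 (A3), H = [5, 15, 27.5]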
4.9.5. Expected Payoff and Variance
Suppose the probabilities of the states of nature are known or estimated: p = (p1, . . . , pm), with
Σ pj = 1 and 0 ≤ pj ≤ 1.
The expected payoff for pure strategy Ai is
Ep(Ai) = Σ_{j=1}^m pj aij,  i = 1, . . . , n.
We can choose our preferred strategy to be the pure strategy Ai that maximises Ep. This is the
expected value criterion. Note that if each pj = 1/m, we have the Laplacian strategy.
Similarly, the variance of the payoff is
Vp(Ai) = Σ_{j=1}^m pj aij² − (Ep(Ai))².
4.9.6. Expected value–standard deviation criterion
In many cases, the strategy that maximises the expected payoff is very risky (e.g., investment in
a high-return high-risk stock-market portfolio) and a less profitable but less risky option might
be more appropriate (e.g., for a pension fund).
The expected value–standard deviation (EVSD) criterion is defined as
EVSDp(Ai, K) = Ep(Ai) − K √(Vp(Ai)),
for some K. If K is large, strategies with a highly variable payoff are penalised. The optimal
strategy under the EVSD criterion is the strategy Ai that maximises EV SDp (Ai,K).
In Wilkes (1989) and many other OR books the expected value–variance criterion is considered
instead. The EVSD is better from a statistical viewpoint because the standard deviation is
measured on the same scale as the expected value, and so the EVSD has the same units as the
expected value. In the context of portfolio management, this criterion is connected to so-called
‘modern portfolio theory’ developed by Markowitz.
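A small sketch of the EVSD computation for Example 4.19, with illustrative (made-up) probabilities p = (0.3, 0.5, 0.2) and risk weight K = 1; with these numbers the cautious strategy A3 comes out on top:

    import numpy as np

    A = np.array([[110, 45, -30], [90, 55, -10], [80, 40, 10]])
    p = np.array([0.3, 0.5, 0.2])    # hypothetical probabilities of the states
    K = 1.0                          # hypothetical risk weight

    E = A @ p                        # expected payoff of each pure strategy
    V = (A**2) @ p - E**2            # variance of each payoff
    print(E - K * np.sqrt(V))        # pick the strategy with the largest entry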
5. Dynamic programming
5.1. Introduction
Example 5.1. Look at the following network. We have to get from vertex A to vertex Z in the
shortest possible time, moving only in the direction of the arrows (edges). The time needed to
move from one vertex to another is indicated above the edge. For example, one possible route is
ACGJLZ, which takes 4 + 6 + 3 + 3 + 5 = 21 minutes.
[Figure: the network for Example 5.1, with vertices arranged in stages A; B, C, D, E; F, G, H; I, J, K; L, M; Z, and the travel time of each edge written above it.]
We could inspect all possible routes. However, there are many possible routes, and this number
would grow rapidly as the network grew. Dynamic programming is a more efficient method to
solve this problem.
Dynamic programming is used when the problem can be separated into stages. An optimisation
is performed at each stage in turn, but the optimal decision found at each stage depends on the
optimal decision found at the next stage, and so on. It is only when the optimisation has been
performed at the ‘final’ stage that it becomes clear what the optimal decision is at each of the
earlier stages.
One application of dynamic programming is to find the route through a network that gives the
minimum total ‘cost’ or maximum total ‘reward’. Example 5.1 is an example of such a problem,
in which the ‘costs’ are the times.
Notation
• Stage n ∈ {0, 1, . . . , N}.
• Sn is a state at stage n.
• Qn is the state space at stage n, i.e. is the set of all possible states at stage n. (So, Sn ∈ Qn.)
• ci,j is the transition cost for moving from state i to state j.
Example 5.2 (Example 5.1 continued). N = 5, Q0 = {A}, Q1 = {B,C,D,E}, Q2 = {F,G,H},
. . ., Q5 = {Z}, cA,B = 2, etc.
• For each state in stage 4, find the quickest route to the sink Z. How long does each route
take?
• For each state in stage 3, find the quickest route to Z, using the information from the
previous step. How long does each route take?
• ...
• Finally, for the source node A in stage 0, find the quickest route to the sink Z, using the
information about stage 1. How long does this route take?
We have separated the problem into stages and ‘solved’ each stage in turn. This is dynamic pro-
gramming. The method we have just used is a dynamic programming algorithm called ‘backward
recursion’.
5.2. Backward recursion
Let gn(Sn) represent the minimum cost of moving from state Sn in stage n to (any state in) stage
N .
Clearly, gN (SN ) = 0 for all SN ∈ QN .
The backward recursion equation is

gn(Sn) = min_{Sn+1 ∈ Qn+1} { gn+1(Sn+1) + cSn,Sn+1 },  n = 0, 1, . . . , N − 1.
If there is no transition possible from a given state Sn to another state Sn+1, we can set
cSn,Sn+1 =∞ so that this transition will never be taken.
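The recursion translates directly into code. The sketch below is generic and uses a small made-up network, not the one from Example 5.1; missing edges default to cost ∞, exactly the trick just described.

    import math

    def backward_recursion(stages, cost):
        g = {s: 0 for s in stages[-1]}            # g_N(S_N) = 0
        best = {}
        for n in range(len(stages) - 2, -1, -1):  # n = N-1, ..., 0
            for s in stages[n]:
                options = [(g[t] + cost.get((s, t), math.inf), t) for t in stages[n + 1]]
                g[s], best[s] = min(options)      # g_n = min{g_{n+1} + c}
        return g, best

    stages = [["A"], ["B", "C"], ["Z"]]           # hypothetical three-stage network
    cost = {("A", "B"): 2, ("A", "C"): 4, ("B", "Z"): 7, ("C", "Z"): 1}
    g, best = backward_recursion(stages, cost)
    print(g["A"], best["A"])    # 5 'C' -- cheapest route is A -> C -> Z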
Example 5.3 (Example 5.1 continued). Backward recursion is precisely the algorithm we sug-
gested using in Example 5.1 above.
In stage N = 5, as already mentioned, g5(Z) = 0.
In stage 4, we have
g4(L) = min_{S5 ∈ {Z}} { g5(S5) + cL,S5 } = min{0 + 5} = 5.
The optimal route from L to the end is L→ Z, with cost 5. Similarly, g4(M) = 1 via M → Z.
In stage 3, things get interesting. We have
g3(I) = min_{S4 ∈ {L,M}} { g4(S4) + cI,S4 } = min{5 + 6, 1 + 7} = 8,
so the optimal route from I to the next stage is I →M , with cost 8. Note how we have used the
information gained in the previous stage, i.e. g4(S4) for each possible S4. Similarly, g3(J) = 2
via J →M and g3(K) = 8 via K →M .
In stage 2, we have
g2(F) = min_{S3 ∈ {I,J,K}} { g3(S3) + cF,S3 } = min{11 + 2, 2 + 2, 8 + 1} = 4,
so the optimal route from F to the next stage is F → J , with cost 4. Similarly, g2(G) = 5 via
G→ J and g2(H) = 8 via H → J .
In stage 1, we have
g1(B) = min_{S2 ∈ {F,G,H}} { g2(S2) + cB,S2 } = min{4 + 8, 5 + 9, 8 + ∞} = 12,
so the optimal route from B to the next stage is B → F , with cost 12. Notice that we used here
our trick of setting cBH = ∞ to exclude the impossible transition B → H. Similarly, g1(C) = 6
via C → F ; g1(D) = 9 via D → G; and g1(E) = 6 via E → G.
Finally, in stage 0, we have
g0(A) = min_{S1 ∈ {B,C,D,E}} { g1(S1) + cA,S1 } = min{12 + 2, 6 + 4, 9 + 1, 6 + 3} = 9,
so the optimal route from A to the next stage is A→ E, with cost 9.
This solves the problem. We know that the minimum cost of going from A to Z is 9. And we can
say more, because we noted the optimal route from each state to the next stage. Going through
our solution, we see that the route which minimises the cost is A→ E → G→ J →M → Z.
Maximising reward Sometimes, we may need to maximise a total reward instead of minimising
a cost. Mathematically, there is no difference between these two problems, since a reward r is
the same as a cost −r. However, it can be convenient to keep the mathematics consistent with
the problem statement. In that case, we would define transition rewards ri,j , and let gn(Sn)
be the maximum reward that can be gained by moving from state Sn to stage N . The backward
recursion equation becomes
gn(Sn) = max_{Sn+1 ∈ Qn+1} { gn+1(Sn+1) + rSn,Sn+1 },  n = 0, 1, . . . , N − 1.
The moral Using the recursion equation, and exploiting the structure of the problem in stages,
we can find the optimal path through the network without computing the cost/reward of every single
path.
Example 5.4 (Pipeline routeing problem). A company wishes to lay a pipeline from their pro-
cessing plant to a new local distribution point. The company has identified a network of possible
routes with the associated costs of installing each section.
[Figure: the network of possible pipeline routes from A (the processing plant) to I (the distribution point), with the installation cost of each section written on the edges.]
We can write in the values gn(Sn) next to each state Sn. For example,
g3(I) = 0,
g1(B) = min_{S2 ∈ {F,G}} { g2(S2) + cB,S2 } = min{11 + 8, 9 + 9} = 18.
Now, we “roll forward” the solution to find the optimal route. This is A→ E → H → I, with
a total cost of 22.
5.3. Resource allocation problems
This is a common type of problem which once formulated as a dynamic programming problem
can be solved in a straightforward manner.
Example 5.5. A company wishes to expand its business by enlarging three of its manufacturing
plants. For each type of expansion there is an expense and a future revenue (reward) according
to the following table (all entries in £M).
Proposed Plant 1 Plant 2 Plant 3
expansion expense revenue expense revenue expense revenue
A 0 0 0 0 0 0
B 1 5 2 8 1 3
C 2 6 3 9 — —
D — — 4 12 — —
The company has a budget of £5M and it wants to maximise its revenue from the available
expansions.
We can consider the problem sequentially.
• Stage 0: no expansion plans have been decided.
• Stage 1: the expansion plan for plant 1 has been decided.
• Stage 2: the expansion plans for plants 1 and 2 have been decided.
• Stage 3: the expansion plans for plants 1, 2 and 3 have been decided.
Let Sn, the state at stage n (n = 0, 1, 2, 3), be the cumulative expense by stage n.
Let rn(i, j) be the transition reward of moving from state i at stage n to state j at stage n+1.
This will be the revenue of the expansion project for plant n+1 that has associated expense j− i.
We can now draw a network showing the possible states at each stage, the possible transitions
and the transition rewards.
This problem can be solved using backward recursion.
Q0 = {0}
Q1 = {0, 1, 2}
Q2 = {0, 1, 2, 3, 4, 5}
Q3 = {0, 1, 2, 3, 4, 5}
[Figure: the network of states (cumulative expense, 0 to 5) at stages 0 to 3, with transition rewards 0, 5, 6 out of stage 0 (plant 1), 0, 8, 9, 12 out of stage 1 (plant 2), and 0 or 3 out of stage 2 (plant 3).]
Backward recursion Let gn(Sn) be the maximum revenue available over future stages starting
in state Sn at stage n (i.e., having spent Sn million expanding plants 1, . . . , n).
Clearly, g3(S3) = 0 for all S3 ∈ Q3. Now,
gn(Sn) = max_{Sn+1 ∈ Qn+1} { gn+1(Sn+1) + rn(Sn, Sn+1) }  (n = 0, 1, 2)
Stage 2  g2(S2) = max_{S3∈{0,1,2,3,4,5}} { g3(S3) + r2(S2, S3) }.

S2 S3 = 0 S3 = 1 S3 = 2 S3 = 3 S3 = 4 S3 = 5 g2(S2) S3∗
0 0+0 0+3 — — — — 3 1
1 — 0+0 0+3 — — — 3 2
2 — — 0+0 0+3 — — 3 3
3 — — — 0+0 0+3 — 3 4
4 — — — — 0+0 0+3 3 5
5 — — — — — 0+0 0 5
Stage 1  g1(S1) = max_{S2∈{0,1,2,3,4,5}} { g2(S2) + r1(S1, S2) }.

S1 S2 = 0 S2 = 1 S2 = 2 S2 = 3 S2 = 4 S2 = 5 g1(S1) S2∗
0 3+0 — 3+8 3+9 3+12 — 15 4
1 — 3+0 — 3+8 3+9 0+12 12 4 or 5
2 — — 3+0 — 3+8 0+9 11 4
Stage 0  g0(S0) = max_{S1∈{0,1,2,3,4,5}} { g1(S1) + r0(S0, S1) }.

S0 S1 = 0 S1 = 1 S1 = 2 g0(S0) S1∗
0 15+0 12+5 11+6 17 1 or 2
Again, the maximum revenue is £17M. We can now roll forward the solution, which shows us
that all of the following expansion plans are optimal.
Plan S0 S1 S2 S3
B → C → B 0 1 4 5
B → D → A 0 1 5 5
C → B → B 0 2 4 5
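Because this instance is tiny, the answer is also easy to double-check by brute force (a sketch; for larger problems, the backward recursion above is the scalable approach):

    from itertools import product

    plant1 = [(0, 0), (1, 5), (2, 6)]           # (expense, revenue) for A, B, C
    plant2 = [(0, 0), (2, 8), (3, 9), (4, 12)]  # A, B, C, D
    plant3 = [(0, 0), (1, 3)]                   # A, B
    budget = 5

    best = max((r1 + r2 + r3, (e1, e2, e3))
               for (e1, r1), (e2, r2), (e3, r3) in product(plant1, plant2, plant3)
               if e1 + e2 + e3 <= budget)
    print(best)    # (17, ...) -- maximum revenue 17, attained by several plans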
5.4. The warehousing problem
A company manufactures and supplies a product. Each month they manufacture a certain amount
at a production plant, which is then stored at their warehouse. They also distribute some or all
of the warehoused stock to the retailers.
Assume that the distributed stock leaves the warehouse early afternoon on the first of each
month and the manufactured stock is brought to the warehouse in the evening on the day it is
manufactured. On the morning of the first of each month, a manager has to choose how much
stock is to be distributed that day, and how much is to be manufactured that month.
Notation
t month (t = 1, . . . , T )
Qt number of items manufactured in month t
Vt number of items distributed to the retailers during the first day of month t
ct manufacturing cost (per item) in month t
pt sell price (per item) in month t
It number of items stored in the warehouse at beginning of first day of month t
K the maximum capacity (no. of items that can be stored) and
πt  profit available in months t, t + 1, . . . , T.
Each of Qt, Vt and K are ≥ 0, and for any month t the number of items that can be stored
cannot exceed the maximum capacity, i.e. 0 ≤ It ≤ K. The costs, ct, and prices, pt, are assumed
to be known in advance but may be different from month to month.
The number of items in the warehouse at the start of month t+1 is the number of items there
at the start of month t (It), plus the number of items manufactured in month t (Qt) minus the
number of items distributed in month t (Vt), giving
It+1 = It − Vt +Qt
We want to choose Vt and Qt (t = 1, . . . , T) so that the profit π1 is maximised.
The above is a general problem. In a specific example we have:
The maximum capacity is K = 1000.
The initial amount of stock in the warehouse is I1 = 300.
We are considering the problem over a four month period, so T = 4,
and at the end of the 4 month period we want an empty warehouse, so I5 = 0.
The values ct and pt are
t ct pt
1 70 90
2 64 82
3 73 70
4 70 85
We can use backward recursion for this problem, and our intention is to maximise the profit
πt = πt(It) at t = 1.
Month 4 Let t = 4 and assume that today is the first day of month 4. Imagine that the number
of items in the warehouse at the start of today, I4, is fixed. (We will write our maximal profit
and decisions in terms of I4.)
The profit for month 4 is π4 = p4V4 − c4Q4. Since we are required to have I5 = 0, and we know
that I5 = I4 − V4 + Q4, the only way we can satisfy these equations (and still have Q4 ≥ 0 and
V4 ≤ I4) is if Q4 = 0 and V4 = I4. So,

π4∗ = p4I4 + 0 = 85I4,

with optimal actions V4∗ = I4 and Q4∗ = 0.
Month 3 Now set t = 3 and assume that I3 is fixed. We will write everything this month in
terms of I3. The profit (over months 3 and 4) is
π3 = p3V3 − c3Q3 + π4∗
   = 70V3 − 73Q3 + 85I4
   = 70V3 − 73Q3 + 85(I3 − V3 + Q3)
   = −15V3 + 12Q3 + 85I3.
We know that I4 = I3− V3 +Q3 ≤ K, since we cannot exceed the warehouse capacity; and we
also know that V3 ≤ I3, since we cannot sell more stock than we have available.
In other words, we are seeking to solve a linear programming problem in the two variables
V3, Q3 (plus a constant added to the objective function):
Maximise π3 = −15V3 + 12Q3 + 85I3
subject to V3 ≤ I3
Q3 − V3 ≤ K − I3
Q3, V3 ≥ 0.
The feasible region is:
[Figure: the feasible region in the (V3, Q3) plane, a quadrilateral with corners (0, 0), (I3, 0), B = (I3, K) and A = (0, K − I3).]
We can solve this graphically; a contour of the objective function is
Q3 = (15/12)V3 + (1/12)π3 − (85/12)I3,
so we should maximise the intercept, and this leads to the optimal solution at point A, i.e.,
V3∗ = 0,
Q3∗ = K − I3,
π3∗ = 0 + 12(K − I3) + 85I3 = 73I3 + 12K.
Month 2 We now set t = 2 and observe that
π2 = p2V2 − c2Q2 + π3∗
   = 82V2 − 64Q2 + 73I3 + 12K
   = 82V2 − 64Q2 + 73(I2 − V2 + Q2) + 12K
   = 9V2 + 9Q2 + 73I2 + 12K.
This leads us to solve the linear programming problem
Maximise π2 = 9V2 + 9Q2 + 73I2 + 12K
subject to V2 ≤ I2
Q2 − V2 ≤ K − I2
Q2, V2 ≥ 0.
The feasible region is the same shape as in the previous month, but now we are maximising
the intercept of a contour with gradient −1, and this leads us to the optimal solution at the point
corresponding to B. That is,
V2∗ = I2,
Q2∗ = K,
π2∗ = 9I2 + 9K + 73I2 + 12K = 82I2 + 21K.
Month 1 We now set t = 1 in order to find π1∗, which was our goal. We observe

π1 = p1V1 − c1Q1 + π2∗
   = 90V1 − 70Q1 + 82(I1 − V1 + Q1) + 21K
   = 8V1 + 12Q1 + 82I1 + 21K.

Solving this again using linear programming, we arrive at

V1∗ = I1,
Q1∗ = K,
π1∗ = 8I1 + 12K + 82I1 + 21K = 90I1 + 33K.
We summarise our results, and find numerical values, in the following table. We have used here
the information about the initial number of items in the warehouse, I1 = 300, and the numerical
value of the capacity, K = 1000.

t  It    Vt∗         Qt∗          πt∗
1  300   I1 = 300    K = 1000     90I1 + 33K = 60 000
2  1000  I2 = 1000   K = 1000     82I2 + 21K = 103 000
3  1000  0           K − I3 = 0   73I3 + 12K = 85 000
4  1000  I4 = 1000   0            85I4 = 85 000
5  0
The maximal profit attainable over the entire period is π1∗ = 60 000, and this is obtained via
the choices of Vt, Qt given in the table.
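Since every month's subproblem is linear, the whole four-month problem is itself one LP in the eight variables V1, . . . , V4, Q1, . . . , Q4, so an LP solver should reproduce π1∗ = 60 000. A sketch assuming SciPy:

    import numpy as np
    from scipy.optimize import linprog

    p = np.array([90, 82, 70, 85])       # sell prices p_t
    c = np.array([70, 64, 73, 70])       # manufacturing costs c_t
    K, I1 = 1000, 300

    L = np.tril(np.ones((4, 4)))         # cumulative sums up to month t
    SL = np.tril(np.ones((4, 4)), -1)    # cumulative sums up to month t-1

    A_ub = np.block([[L, -SL],           # V_t <= I_t (stock actually available)
                     [-L, L]])           # I_{t+1} <= K (warehouse capacity)
    b_ub = np.concatenate([I1 * np.ones(4), (K - I1) * np.ones(4)])
    A_eq = np.concatenate([np.ones(4), -np.ones(4)])[None, :]   # I5 = 0
    obj = np.concatenate([-p, c])        # minimise -(sum p_t V_t - c_t Q_t)

    res = linprog(obj, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[I1], method="highs")
    print(-res.fun)    # 60000.0
    print(res.x)       # e.g. V = (300, 1000, 0, 1000), Q = (1000, 1000, 0, 0)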
6. Markov dynamic programming
6.1. Introduction
In Chapter 5, the rewards or costs of transitions were assumed to be known and fixed, and we
were able to choose which transitions to make. In this chapter we deal with problems where
these assumptions do not hold. Such problems are solved by stochastic programming or
stochastic optimisation, and we will concentrate on a particular aspect: Markov dynamic
programming (MDP).
6.2. Markov Chains
A (discrete-time) stochastic process is a sequence of random variables indexed by time,
X = {Xt : t = 0, 1, 2, . . . }.
Definition 6.1. A stochastic process X is called a Markov chain if
P (Xn+1 = xn+1 | X0 = x0, . . . , Xn = xn) = P (Xn+1 = xn+1 | Xn = xn), (6.1)
for n = 0, 1, 2, . . . . Property (6.1) is called the Markov property. If we regard time n as the
present, then the Markov property says that ‘given the present, the future is independent of the
past’.
Example 6.2. Mr Bond is playing roulette. At each turn he places a £1 chip on number 17.
He begins with X0 = x0 chips. Let Xn denote the number of £1 chips he has after n turns.
The number of chips he has after n + 1 turns, Xn+1, depends on the number of chips he has
after n turns, Xn (plus the random factor from the roulette wheel). But given the value of Xn,
Xn+1 is independent of X0, . . . , Xn−1. Therefore, X = {X0, X1, . . .} is a Markov chain.
In most of this chapter we consider finite horizon problems, meaning that we only consider
times n = 0, 1, . . . , N , with N fixed. In Section 6.6 we allow an infinite number of stages.
In a finite horizon Markov dynamic programming problem we have the following:
1. a Markov chain X = {X0, X1, . . . , XN} with known transition probabilities,
2. costs or rewards associated with each transition in the Markov chain,
3. terminal costs or rewards associated with the states at the final stage, and
4. (usually) actions that alter the transition probabilities and costs/rewards.
Notation We can use backward recursion for MDP problems. It is common in the MDP world
to use n to denote the number of stages to go, rather than the current stage (as we usually
did in Chapter 5.) So, n = 0 at the final (terminal) stage and n = N at the initial stage.
• Let X(n) denote the state with n stages to go (i.e., X(n) = XN−n.)
• Let Qn denote the state space with n stages to go. (Note the difference from Chapter 5.)
• Let p(n)i,j denote the transition probability for moving from state i with n stages to go to
state j (with n− 1 stages to go).
p(n)i,j = P(X(n−1) = j | X(n) = i) = P(XN−n+1 = j | XN−n = i).

The transition matrix P(n) is the matrix whose (i, j)th entry is p(n)i,j.
Clearly,
p(n)i,j ≥ 0 for all i, j,  and  Σ_{j∈Qn−1} p(n)i,j = 1 for all i.

So, each element of P(n) must be between zero and one, and the rows must sum to one.
• Let r(n)i,j or c(n)i,j denote the transition reward/cost for the transition from state i with n
stages to go to state j (with n − 1 stages to go).
• Let r(0)i or c(0)i denote the terminal reward/cost for state i, i.e., the reward/cost incurred
when we end in state i.
Sometimes the transition probabilities will be the same at every stage, in which case the
superscript ·(n) in p(n)i,j and P (n) can be dropped.
6.3. Rewards in a Markov setting (with no actions)
Example 6.3. Consider the following network, in which we start at state 1, and move to the
right with the probabilities given on the edges, collecting an associated reward for each edge and
a terminal reward at the end.
[Figure: a four-stage network with state 1 at stage 0 (n = 3); states 2, 3, 4 at stage 1 (n = 2); states 5, 6, 7 at stage 2 (n = 1); and states 8, 9, 10 at stage 3 (n = 0). Transition probabilities and rewards are written on the edges, and terminal rewards appear at the right.]
The state space for X(0) is Q0 = {8, 9, 10}, for X(1) is Q1 = {5, 6, 7}, for X(2) is Q2 = {2, 3, 4}
and for X(3) is Q3 = {1}.
The transition probabilities are P(X(0) = 8 | X(1) = 6) = p(1)6,8 = 0.4, etc. The transition
rewards are r(3)1,2 = 3, r(2)2,5 = 8, etc. Finally, the terminal rewards are r(0)8 = 1, r(0)9 = 1 and
r(0)10 = 2.
Definition 6.4. Let Rn(i) be the one-step expected reward when in state i with n stages to
go (X(n) = i), for the next step to X(n−1). The formula for Rn(i) is
Rn(i) = Σ_{j∈Qn−1} p(n)i,j r(n)i,j  for n ≥ 1,  and R0(i) = r(0)i.

Let Vn(i) be the total future expected reward when in state i with n stages to go (X(n) = i).
The function Vn(i) is called the value function:

Vn(i) = Σ_{j∈Qn−1} p(n)i,j ( r(n)i,j + Vn−1(j) )
      = Rn(i) + Σ_{j∈Qn−1} p(n)i,j Vn−1(j)  for n ≥ 1,

and V0(i) = R0(i) = r(0)i.
This is a recursive formula for Vn(i) in terms of Vn−1, so we can use methods similar to those
used in the last chapter in order to find the expected reward over the whole problem.
For the present example, we want to find V3(1) by recursion.
So, we start at stage 3, when n = 0. We have
V0(8) = r(0)8 = 1,  V0(9) = r(0)9 = 1,  V0(10) = r(0)10 = 2.
At stage 2 (n = 1) we can compute the one-step expected rewards for each state:
R1(5) = Σ_{j=8}^{10} p(1)5,j r(1)5,j = 0.6 · 8 + 0.4 · 8 = 8
R1(6) = Σ_{j=8}^{10} p(1)6,j r(1)6,j = 0.4 · 4 + 0.4 · 2 + 0.2 · 1 = 2.6
R1(7) = Σ_{j=8}^{10} p(1)7,j r(1)7,j = 0.7 · 1 + 0.3 · 6 = 2.5
This in turn allows us to compute the value function for each state in stage 2 (n = 1):
V1(5) = R1(5) + Σ_{j=8}^{10} p(1)5,j V0(j) = 8 + 0.6 · 1 + 0.4 · 1 = 9
V1(6) = R1(6) + Σ_{j=8}^{10} p(1)6,j V0(j) = 2.6 + 0.4 · 1 + 0.4 · 1 + 0.2 · 2 = 3.8
V1(7) = R1(7) + Σ_{j=8}^{10} p(1)7,j V0(j) = 2.5 + 1.3 = 3.8
Repeat this for stage 1 (n = 2):
R2(2) = 0.8 · 8 + 0.2 · 2 = 6.8
R2(3) = 4.8
R2(4) = 9.5
hence
V2(2) = R2(2) + Σ_j p(2)2,j V1(j) = 6.8 + 0.8 · 9 + 0.2 · 3.8 = 14.76
V2(3) = 11.72
V2(4) = 13.3.
And finally, for stage 0 (n = 3):
R3(1) = 0.2 · 3 + 0.2 · 9 + 0.6 · 6 = 6
V3(1) = 6 + 0.2 · 14.76 + 0.2 · 11.72 + 0.6 · 13.3 = 19.276
The expected reward over the three stages starting in state 1 is 19.276.
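The recursion is again easy to code once the transitions are tabulated. The sketch below is generic and runs on a tiny made-up example rather than the ten-state network above:

    def value(stage_transitions, terminal):
        """stage_transitions[k] maps state i -> list of (prob, reward, next_state);
        entry k = 0 is the last transition (1 stage to go), k = 1 the one before, etc."""
        V = dict(terminal)                      # V_0(i) = r(0)i
        for trans in stage_transitions:         # n = 1, 2, ..., N stages to go
            V = {i: sum(p * (r + V[j]) for p, r, j in out)
                 for i, out in trans.items()}
        return V

    terminal = {"u": 1, "d": 2}                           # hypothetical terminal rewards
    last_step = {"s": [(0.5, 3, "u"), (0.5, 1, "d")]}     # one stage to go
    print(value([last_step], terminal))                   # {'s': 3.5}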
6.4. Markov dynamic programming with actions
Suppose the process is in state i with n stages to go (X(n) = i), and that we now have K actions
available, {a(n)1, a(n)2, . . . , a(n)K} = A(n). We call A(n) the action space.
We can extend the dynamic programming model to depend on the actions taken:
1. The transition probabilities depend on the action taken at each stage, so

p(n)i,j(a(n)) = P(X(n−1) = j | X(n) = i, a(n)),

where a(n) ∈ A(n) is the particular action chosen with n stages to go. We also insist that
the probability of a transition to X(n−1) = j depends on X(n) = i and a(n), but does not
depend on the previous states, nor the previous actions.
2. The transition rewards/costs depend on the action taken at each stage. So, we have
r(n)i,j(a(n)) or c(n)i,j(a(n)).
3. The terminal rewards/costs do not depend on the actions.
Now that we have actions, there is a choice available. We want to find the sequence of actions
that will maximise our expected total reward, or minimise our expected total cost. First, we
extend the definitions of the one-step expected reward and the value function to take account of
the actions chosen.
Definition 6.5. Rn(i, a(n)) is the one-step expected reward when the state is i with n stages to
go (X(n) = i) and action a(n) is taken.
V∗n(i) is the future expected reward when in state i with n stages to go and an optimal action
plan is used for these remaining n stages. V∗n(i) is called the optimal value function.
As before, we have an expression for the one-step expected reward:

Rn(i, a(n)) = Σ_{j∈Qn−1} p(n)i,j(a(n)) r(n)i,j(a(n))  for n ≥ 1,  and R0(i) = r(0)i.

The optimal value function is given by the following recursive equation:

V∗n(i) = max_{a(n)∈A(n)} { Σ_{j∈Qn−1} p(n)i,j(a(n)) ( r(n)i,j(a(n)) + V∗n−1(j) ) }
       = max_{a(n)∈A(n)} { Rn(i, a(n)) + Σ_{j∈Qn−1} p(n)i,j(a(n)) V∗n−1(j) }  for n ≥ 1,

with V∗0(i) = r(0)i.
Example 6.6 (Advertising for a concert). A ticket agency is responsible for advertising and
selling tickets for a concert which will take place in two weeks. Each week the agency has the
choice either to put a half-page advert in the major newspapers and magazines, or just to submit
the concert details to the ‘what’s-on’ listings. To simplify the problem, the agency categorises
the ticket sales into three states, fast ticket sales, average ticket sales and slow ticket sales. At
present, the ticket sales are average.
Let n denote the number of weeks remaining before the concert takes place. Label the states as:
3 for fast sales, 2 for average sales and 1 for slow sales. Label the actions as A for Advertise, and
D for Don’t advertise. As the actions are the same at each stage we do not have to superscript
them with n.
We can represent this as a network: [Figure: state 2 at stage 0 (n = 2), and states 1, 2, 3 at stages 1 (n = 1) and 2 (n = 0), with the action-dependent transitions between consecutive stages.]
All numbers are denominated in £1000s.
There are terminal rewards r(0)1 = 2, r(0)2 = 5 and r(0)3 = 10.
The transition rewards and transition matrices (written row by row, rows separated by
semicolons, for i = 1, 2, 3) are:

[r(1)i,j(A)] = ( −1 0 1 ; 0 1 2 ; 1 2 3 ),   [r(1)i,j(D)] = ( 1 2 3 ; 2 3 4 ; 3 4 5 ),

P(1)(A) = ( .5 .3 .2 ; .4 .2 .4 ; .3 .3 .4 ),   P(1)(D) = ( 1.0 0.0 0.0 ; .4 .3 .3 ; .4 .4 .2 ).

With two weeks to go the state is 2 (average sales), so we only need

[r(2)2,j(A)] = ( −1 0 1 ),   [r(2)2,j(D)] = ( 1 2 3 ),
P(2)(A) = ( .4 .3 .3 ),   P(2)(D) = ( .6 .3 .1 ).
The way to solve this problem is to set up a table for every state and possible action in each
stage. For each action a and state i compute Rn(i, a) and Σ_{j∈Qn−1} p(n)i,j(a) V∗n−1(j). Their sum
is Rn(i, a) + Σ_{j∈Qn−1} p(n)i,j(a) V∗n−1(j), and we choose the action that maximises it.
n  state i  action a  Rn(i, a)  Σ_{j∈Qn−1} p(n)i,j(a) V∗n−1(j)  sum  V∗n(i)  a∗
0 3 — 10 — 10 10 —
0 2 — 5 — 5 5 —
0 1 — 2 — 2 2 —
1 3 A 2.1 6.1 8.2
1 3 D 3.8 4.8 8.6 8.6 D
1 2 A 1.0 5.8 6.8
1 2 D 2.9 5.3 8.2 8.2 D
1 1 A -0.3 4.5 4.2 4.2 A
1 1 D 1 2 3
2 2 A -0.1 6.72 6.62
2 2 D 1.5 5.84 7.34 7.34 D
The last two columns tell us the optimal expected future reward and the action that should be
taken in order to attain the optimum. The optimal actions are: don’t advertise with two weeks
to go; and with one week to go advertise only if the ticket sales are slow. The optimal expected
future reward is £7 340.
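The table can be reproduced in a few lines. A sketch assuming NumPy (states are indexed 1, 2, 3 along the rows and columns of each matrix):

    import numpy as np

    V0 = np.array([2.0, 5.0, 10.0])    # terminal rewards for states 1, 2, 3

    r1 = {"A": np.array([[-1., 0, 1], [0, 1, 2], [1, 2, 3]]),
          "D": np.array([[1., 2, 3], [2, 3, 4], [3, 4, 5]])}
    P1 = {"A": np.array([[.5, .3, .2], [.4, .2, .4], [.3, .3, .4]]),
          "D": np.array([[1., 0, 0], [.4, .3, .3], [.4, .4, .2]])}

    # One week to go: R_1(i, a) plus the expected terminal value, best action per state.
    totals = {a: (P1[a] * r1[a]).sum(axis=1) + P1[a] @ V0 for a in "AD"}
    V1 = np.maximum(totals["A"], totals["D"])
    print(V1)    # [4.2 8.2 8.6] -- advertise only when sales are slow

    # Two weeks to go, starting from state 2 (average sales).
    r2 = {"A": np.array([-1., 0, 1]), "D": np.array([1., 2, 3])}
    P2 = {"A": np.array([.4, .3, .3]), "D": np.array([.6, .3, .1])}
    print(max(P2[a] @ r2[a] + P2[a] @ V1 for a in "AD"))   # 7.34, from action D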
Notes. To solve the problem, we need to find the optimal action(s) for every time point and
every possible state.
In deterministic dynamic programming problems the choice is the path through the network.
In Markov dynamic programming, we cannot choose the path that is taken. Instead, we choose
actions which influence the probabilities of which path is taken.
6.5. Discounting
Suppose a profit of £1 gained now is worth more than a profit of £1 gained at the next stage.
Reasons for this could be:
• inflation – by the next stage £1 will have less spending power
• investment – £1 can be invested and interest earned by the next stage.
• risk of bankruptcy – if this year’s reward will help us stay in business then we should weight
it higher than the reward in ten years’ time, because without cash this year we might not
be in business in ten years.
In order to account for this, we use a discount factor α, which compounds year on year. We
shall assume that α is constant and 0 < α < 1. So an amount of £1 at the next stage is worth
£α in “today’s money”; we say that £α is the present value of £1 at stage 1.
The recursive equation for the optimal value function is now given by
V∗n(i) = max_{a(n)∈A(n)} { Rn(i, a(n)) + α Σ_{j∈Qn−1} p(n)i,j(a(n)) V∗n−1(j) }.
6.6. Infinite horizon problems
When the problem has no (obvious) final stage, we still want to find the optimal value function
and the optimal actions.
We shall assume that there is a discount factor α (0 < α < 1) and that the state space, action
space and transition probabilities and rewards are the same for each stage. So, we can drop the
subscript/superscript n.
If we imagine that there are a finite number of stages (and assign terminal values arbitrarily),
then the recursion equation in this case becomes
V∗n(i) = max_{a∈A} { R(i, a) + α Σ_{j∈Q} pi,j(a) V∗n−1(j) }.
If V ∗n converges, as n→∞, to an optimal value function V ∗, then it must satisfy
V∗(i) = max_{a∈A} { R(i, a) + α Σ_{j∈Q} pi,j(a) V∗(j) }
      = R(i, a∗(i)) + α Σ_{j∈Q} pi,j(a∗(i)) V∗(j),

where a∗(i) denotes the optimal action when in state i.
In general, solving the above equation is difficult.
One approach is to guess a general form of the solution and solve it analytically. Another is
to use iterative methods to obtain the optimal value function numerically.
Example 6.7 (Machine replacement). A machine in a manufacturing plant can be in states
{0, 1, 2, . . . } according to the condition it is in. At the end of each week the machine is inspected,
its condition is noted and we make a decision as to whether or not to replace it.
Let Yt denote the state of the machine at the end of week t. If Yt = i and the action is “don’t
replace”, we incur a running cost of c(i) = i and
Yt+1 = i with probability 1/2, or i + 1 with probability 1/2.
If Yt = i and the action is “replace” we incur a fixed cost of K and
Yt+1 = 0 with probability 1/2, or 1 with probability 1/2.
The costs are discounted by a factor α over an infinite horizon.
Thus
• Q = {0, 1, 2, . . . }
• A = {1 = "replace", 2 = "don't replace"}
• pi,0(1) = pi,1(1) = 1/2
• pi,i(2) = pi,i+1(2) = 1/2
• ri,j(1) = −K, for j = 0, 1
• ri,j(2) = −i, for j = i, i + 1
A policy π(i) is a decision rule (or action plan) that specifies an action for each state i in the
state space Q.
We might guess that the optimal policy is of the form: choose a = 1 (replace) if i ≥ M and
a = 2 (don't replace) if i < M, for some unknown M.
The above policy is π(i) = 2 for 0 ≤ i ≤ M − 1 and π(i) = 1 for i ≥ M, and the value function
corresponding to this policy, Vπ(i), is

Vπ(i) = R(i, π(i)) + α Σ_j pi,j(π(i)) Vπ(j)
      = −i + α [ (1/2)Vπ(i) + (1/2)Vπ(i + 1) ]   if 0 ≤ i ≤ M − 1,
      = −K + α [ (1/2)Vπ(0) + (1/2)Vπ(1) ]       if i ≥ M.
Analytical methods can be used to show that this policy is optimal and to find the values of
V ∗(i) and M , but this is not covered in this course. Instead we shall use an iterative method.
For this iterative method, pretend that the number of stages is finite. Let V ∗n (i) be the expected
future (discounted) reward when there are n stages to go and we use the optimal policy. Then
\[
V_0^*(i) = 0 \quad \forall\, i \in Q,
\qquad
V_n^*(i) = \max\{U_n(i,1),\, U_n(i,2)\} \quad \forall\, i \in Q,
\]
where
\[
U_n(i,1) = -K + \alpha\left[\tfrac12 V^*_{n-1}(0) + \tfrac12 V^*_{n-1}(1)\right],
\qquad
U_n(i,2) = -i + \alpha\left[\tfrac12 V^*_{n-1}(i) + \tfrac12 V^*_{n-1}(i+1)\right].
\]
As we have said, if V_n^*(i) converges to V^*(i) as n → ∞, then V^*(i) satisfies
\[
V^*(i) = \max_{a \in A}\left\{ R(i,a) + \alpha \sum_{j \in Q} p_{i,j}(a)\, V^*(j) \right\}.
\]
We check this convergence; the iterative procedure will also give us the optimal policy.
If we now assume specific values for K and α of 10 and 0.75 respectively, and run the above
iterative method we get:
                          Iteration number (n)
 State      n = 1         n = 2         n = 3         n = 38        n = 39
   i      a*  V*_1(i)   a*  V*_2(i)   a*  V*_3(i)   a*  V*_38(i)  a*  V*_39(i)
   0      2    0        2   -0.375    2   -0.938    2   -5.106    2   -5.106
   1      2   -1        2   -2.125    2   -3.250    2   -8.511    2   -8.511
   2      2   -2        2   -3.875    2   -5.562    2  -11.518    2  -11.518
   3      2   -3        2   -5.625    2   -7.875    2  -13.864    2  -13.864
   4      2   -4        2   -7.375    2  -10.188    1  -15.106    1  -15.106
   5      2   -5        2   -9.125    1  -10.938    1  -15.106    1  -15.106
   6      2   -6        1  -10.375    1  -10.938    1  -15.106    1  -15.106
   7      2   -7        1  -10.375    1  -10.938    1  -15.106    1  -15.106
   8      2   -8        1  -10.375    1  -10.938    1  -15.106    1  -15.106
   9      2   -9        1  -10.375    1  -10.938    1  -15.106    1  -15.106
  10                                                1  -15.106    1  -15.106
  11                                                1  -15.106    1  -15.106
The function V_n^*(i) converges, although it needs 38 iterations to converge to 3 decimal places.
In general, convergence for such problems is often very slow.
Notice that the solution has the same form as our guess: replace if i ≥ 4 and continue if i ≤ 3.
The optimal value function is given in the rightmost column (with V ∗(i) = −15.106 for all i ≥ 4).
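This value iteration is easy to program. Below is a minimal Python sketch (not part of the notes) that reproduces the table above. The truncation level I_MAX is an assumption introduced here to cope with the infinite state space; it is essentially harmless because replacing is optimal for all large i.

K, ALPHA, I_MAX, N_ITER = 10.0, 0.75, 50, 39

V = [0.0] * (I_MAX + 2)                      # V*_0(i) = 0 for all i
for n in range(1, N_ITER + 1):
    u1 = -K + ALPHA * 0.5 * (V[0] + V[1])    # U_n(i, 1): the same for every i
    new_V, policy = V[:], {}
    for i in range(I_MAX + 1):
        u2 = -i + ALPHA * 0.5 * (V[i] + V[i + 1])   # U_n(i, 2)
        new_V[i] = max(u1, u2)
        policy[i] = 1 if u1 >= u2 else 2
    new_V[I_MAX + 1] = u1                    # beyond I_MAX replacing is optimal anyway
    V = new_V

for i in range(6):                           # prints a* and V*_39(i); a* = 1 once i >= 4
    print(i, policy[i], round(V[i], 3))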
6.7. Optimal stopping problems
An optimal stopping problem, or optimal stopping routine, is a special type of stochastic opti-
misation problem in which, at every stage, there are two possible actions, stop and continue.
If the action is ‘stop’ then a reward is obtained, depending on the current state, and that is the
end of the routine. If the action is ‘continue’ then a cost is incurred and the routine proceeds to
the next stage, according to a Markov chain.
The routines we shall consider have the following form:
• There are a finite number of stages and no discounting.
• The reward for stopping in state i before the final stage is r(i). This is independent of
stage.
• The reward for continuing to the final stage is r^{(0)}(i).
• The cost of continuing is c(i); it depends on the state but not the stage.
• If continuing, the probability that the next state is j, given that the current state is i with
n stages to go, is p^{(n)}_{i,j}.
The optimal value function is
\[
V_n^*(i) = \max\left\{ r(i),\; -c(i) + \sum_{j \in Q_{n-1}} p^{(n)}_{i,j}\, V^*_{n-1}(j) \right\} \qquad (n \ge 1)
\]
and
\[
V_0^*(i) = r^{(0)}(i).
\]
Example: TV Game Show
A contestant has N boxes, each containing a random amount of money from £0 to £1000. All
amounts are equally likely, and the amount of money in each box is independent of the amounts
in the others.
The game-show host opens the first box, and the contestant has to choose either to keep the
money, in which case the game is over (“stop”), or to move on to the next box (“continue”).
Once the contestant has shut a box they can no longer claim the money in that box.
Given that the number of boxes is N = 4, what is the optimal strategy that the contestant
should follow?
• Stages are the boxes.
• n is the number of stages left, i.e., the number of boxes still to be accepted or rejected.
• For n ≥ 1, the state space Q_n consists of all possible amounts in a box. The amounts of
money are in whole pounds, so Q_n = {0, 1, . . . , 999, 1000}.
• The rewards are the amounts in the box (i.e. the states) r(i) = i.
• The terminal reward is nothing: r(0)(i) = 0.
• The transition costs are zero: c(i) = 0.
• For n ≥ 2, the transition probabilities are p^{(n)}_{i,j} = 1/1001.
• Q_0 = {0} and p^{(1)}_{i,0} = 1.
n = 0 Once all of the boxes have been rejected:
\[
V_0^*(0) = r^{(0)}(0) = 0.
\]
n = 1 This corresponds to deciding whether to accept or reject the last of the four boxes.
\[
V_1^*(i) = \max\{i,\, V_0^*(0)\} = \max\{i, 0\} = i,
\]
since i ≥ 0.
So the optimal policy is to accept the box no matter how much money is inside.
n = 2 When the contestant decides on the third box:
\[
V_2^*(i) = \max\left\{ i,\; \sum_{j=0}^{1000} p^{(2)}_{i,j} V_1^*(j) \right\}
\]
where
\[
\sum_{j=0}^{1000} p^{(2)}_{i,j} V_1^*(j) = \frac{1}{1001} \sum_{j=0}^{1000} j = \frac{1}{1001}\left(\frac{1000 \times 1001}{2}\right) = 500,
\]
giving V_2^*(i) = max{i, 500}.
So the optimal policy is to take the money if and only if the amount in the box is at least £500.
n = 3 When the contestant opens the second box:
\[
V_3^*(i) = \max\left\{ i,\; \sum_{j=0}^{1000} p^{(3)}_{i,j} V_2^*(j) \right\}
\]
and
\[
\sum_{j=0}^{1000} p^{(3)}_{i,j} V_2^*(j) = \frac{1}{1001}\left( \sum_{j=0}^{500} 500 + \sum_{j=501}^{1000} j \right) = 625.125.
\]
Hence,
\[
V_3^*(i) = \max\{i,\, 625.125\}.
\]
So the optimal policy is to take the money if and only if the amount in the box is at least
£626.
n = 4 When the contestant sees the contents of the first box:
\[
V_4^*(i) = \max\left\{ i,\; \sum_{j=0}^{1000} p^{(4)}_{i,j} V_3^*(j) \right\}
\]
where
\[
\sum_{j=0}^{1000} p^{(4)}_{i,j} V_3^*(j) = \frac{1}{1001}\left( \sum_{j=0}^{625} 625.125 + \sum_{j=626}^{1000} j \right) = 695.51,
\]
so
\[
V_4^*(i) = \max\{i,\, 695.51\}.
\]
So the optimal policy is to take the money if and only if the amount in the box is at least
£696.
The complete optimal strategy is for the contestant to accept a box if and only if it contains
at least the amount shown below.

Box number             1     2     3     4
Stages to go (n)       4     3     2     1
Threshold amount (£)   696   626   500   0
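Because V*_n(i) = max{i, c_n}, where c_n is the expected reward from rejecting the current box with n boxes still to go, the whole strategy reduces to a scalar recursion: c_1 = 0 and c_n = (1/1001) Σ_{j=0}^{1000} max(j, c_{n−1}) for n ≥ 2. A minimal Python sketch (not part of the notes) reproducing the thresholds above:

# c[n] = value of rejecting when n boxes remain; the contestant
# accepts a box containing amount i iff i >= c[n].
c = {1: 0.0}                           # with one box left, never reject
for n in range(2, 5):                  # n = 2, 3, 4 boxes still to go
    c[n] = sum(max(j, c[n - 1]) for j in range(1001)) / 1001

for n in sorted(c, reverse=True):      # the first box is seen with n = 4 to go
    print(f"{n} boxes to go: accept if amount >= {c[n]:.3f}")
# prints 695.508, 625.125, 500.000, 0.000 -- i.e. thresholds 696, 626, 500, 0 in whole pounds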
A. Reading list
This course covers a range of topics that are not usually all covered in one book. None of these
books is essential, but you may find additional examples or alternative ways of thinking about
the topics, especially in the ‘recommended’ books.
Recommended
[HL21] Frederick Hillier and Gerald J. Lieberman. Introduction to operations research.
Eleventh edition. New York: McGraw-Hill Education, 2021. ISBN: 978-1-260-57937-6.
[KB95] Bernard Kolman and Robert E. Beck. Elementary linear programming with
applications. 2nd ed. Computer Science and Scientific Computing. Academic Press, 1995,
pp. xxii+449. ISBN: 0-12-417910-X.
Optional
[Bat00] John Bather. Decision theory. Wiley-Interscience Series in Systems and
Optimization. John Wiley & Sons, 2000, pp. xii+191. ISBN: 0-471-97649-0.
[Win04] Wayne L. Winston. Operations research: applications and algorithms. 4th ed.
Belmont, CA: Thomson Brooks/Cole, 2004. ISBN: 0-534-42362-0.
B. Corrections
A list of all major corrections made to this file since it first appeared on Moodle:
• No corrections made yet!