MATH2070/MATH2970
Optimisation and Financial
Mathematics
Semester 2, 2023
Lindon Roberts
Office: Carslaw Building F07, Room 638
Email: lindon.roberts@sydney.edu.au
Based on lecture notes by G. Gottwald, J.-Y. Fan.
Contents
1 Course Overview
2 Introduction to Optimisation
3 Linear Programming
3.1 Revision: Linear Systems
3.2 LP Basics
3.3 Simplex Method
3.4 Non-Standard LPs
3.5 Duality
3.6 Complexity [not examinable]
4 Nonlinear Optimisation
4.1 Unconstrained Optimisation
4.2 Constrained Optimisation
5 Probability Review
5.1 Basic Concepts
5.2 Discrete Random Variables
5.3 Continuous Random Variables
5.4 Covariance and Correlation
6 Decision Making Under Uncertainty
6.1 Pricing Risky Assets
6.2 Risk Aversion
6.3 Limitations [not examinable]
7 Portfolio Theory
7.1 Investment Basics
7.2 Mean-Variance Portfolio Theory
7.3 Optimising Portfolios
7.4 Two-Asset Optimal Portfolios
7.5 General Optimal Portfolios
7.6 Restricted Portfolios
7.7 Risk-Free Asset
Chapter 1
Course Overview
Coordinator
The coordinator for this course is:
• Lindon Roberts (Carslaw Building 638, lindon.roberts@sydney.edu.au)
Optimisation and Financial Mathematics
As the name suggests, there are two key parts to this course.
Optimisation is about how to use mathematics to make optimal decisions.
• Maximise profit
• Maximise efficiency of a product design
• Minimise transportation cost of deliveries
• Maximise predictive power of AI
We will mostly focus on optimising linear quantities with linear constraints, “linear pro-
gramming”. This is one of the most widely used (and profitable!) concepts in mathematics.
We will then look at nonlinear problems, which are more complicated but still very
important (e.g. science, engineering, machine learning). The techniques here will be useful
in the financial mathematics section too.
Optimisation and Financial Mathematics
Financial mathematics is exactly as the name suggests: the mathematics of financial
markets.
Modern finance requires a quantitative understanding of economics, markets and
decision-making.
Since it relies on predicting the future, managing uncertainty is a key aspect of financial
mathematics (less common in models for physics, engineering, etc.).
We will try to answer three basic questions:
• How do you make financial decisions based on an uncertain future?
• How can you construct a mathematical model of financial markets (e.g. stock
markets)?
• How do you find an optimal investment strategy? — this is optimisation!
About me
My background:
• Bachelor’s degree in maths, computer science & mathematical economics (ANU)
• Risk management quant, Macquarie Group: expert in modelling ‘market risk’
• DPhil in maths (Oxford): academic research in optimisation algorithms
• Joined USyd in 2022
Course Content
• Optimisation (≈ 6 weeks)
– Linear Programming (theory and methods)
– Nonlinear optimisation with and without constraints (theory)
• Probability review (≈ 1 week)
• Financial Mathematics (≈ 5 weeks)
– Utility Theory (decision-making with uncertainty)
– Portfolio Theory (investment decision-making, modelling financial markets)
• Revision (≈ 1 week)
There are no assigned textbooks (lectures are self-contained) but some good references
are given in the reading list on Canvas.
Prerequisites
• Single-variable calculus (e.g. MATH1X21) and linear algebra (e.g. MATH1X02)
• MATH2970: advanced versions or ≥ 65 in mainstream versions
• Formal prerequisites on unit outlines
Software
Being able to use and write software is an important part of this course.
We will be using Python, one of the most popular programming languages in use today.
• I recommend using Anaconda (https://www.anaconda.com/products/individual)
– Contains Python plus many standard packages for doing mathematics (linear
algebra, optimisation, etc).
– Free, so you can install it on your own computer
• Information about installing Python and writing/running code is on Canvas
Format
The contact hours for the course are
• 3hrs of lectures per week (Tue 1–2pm & 3–4pm, Wed 10–11am)
• Weekly 1hr labs, covering coding (starting in week 1, not running week 13)
• Weekly 1hr tutorials, covering theory (starting in week 2, running until end of
semester)
Lectures are in-person (but recorded), labs and tutorials are in-person only.
Communication
• Check Canvas & Ed for all announcements and resources
• Please post general questions about admin or course content in the Ed discussion
forum
• Send me an email for confidential matters (lindon.roberts@sydney.edu.au)
Assessment
Assessment
• Assignment 15%: pen-and-paper assignment, due week 6
• Online quiz 10%: short quiz in week 9
• Computer project 15%, due week 13
• Final exam 60%
• MATH2970: same but some extra/harder questions
Late submissions of assignments are allowed for up to 10 days (5% penalty per day, or apply for an extension).
See unit outline (link on Canvas) for details.
Generative AI/ChatGPT
You have probably heard lots about ChatGPT and other generative AI tools.
You are not allowed to use such tools for assessment in this course (but I wouldn’t
recommend it anyway).
Image: Wall Street Journal, https://www.wsj.com/articles/ai-bot-chatgpt-needs-some-help-with-math-assignments-11675390552
Chapter 2
Introduction to Optimisation
Optimisation
In general, optimisation is about choosing the best inputs, according to some measure
of ‘best’.
Mathematically speaking, an optimisation problem comprises:
• Input values we can control
• Optional constraints on the allowed inputs (e.g. all values must be ≥ 0)
• A quantity depending on the input values that we want to maximise or minimise
The input values are called decision variables and the quantity of interest is called the
objective function.
Simple Example
Find the dimensions of a cylinder with surface area S which maximises the volume V .
How would we write this formally (mathematically)?
Optimisation
Simple Example
Find the dimensions of a cylinder with surface area S which maximises the volume V .
The decision variables are the cylinder dimensions: radius r ∈ R and height h ∈ R.
The objective is V = πr^2h (maximise). The constraints are 2πr^2 + 2πrh = S for some
parameter S ≥ 0 (given input value, not being optimised), plus the hidden constraints
r ≥ 0 and h ≥ 0 (not given explicitly, implied by context!).
We write the problem mathematically as

    max_{r,h ∈ R}  V(r, h) = πr^2h,
    s.t.  2πr^2 + 2πrh = S,    ("s.t." = "subject to")
          r, h ≥ 0.
We write the decision variables under the max (or min); it helps to distinguish them from
parameters.
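For those who want to experiment, here is a minimal numerical sketch (not from the notes themselves) solving this problem with SciPy; the value S = 10 and all names are our own choices.

# A minimal numerical sketch: solving the cylinder problem with SciPy.
import numpy as np
from scipy.optimize import minimize

S = 10.0  # surface area parameter (given, not optimised)

def neg_volume(x):
    r, h = x
    return -np.pi * r**2 * h  # minimise -V to maximise V

# equality constraint 2*pi*r^2 + 2*pi*r*h = S
cons = {"type": "eq", "fun": lambda x: 2*np.pi*x[0]**2 + 2*np.pi*x[0]*x[1] - S}
bounds = [(0, None), (0, None)]  # hidden constraints r, h >= 0

res = minimize(neg_volume, x0=[0.5, 0.5], bounds=bounds, constraints=cons)
print(res.x, -res.fun)  # optimal (r, h) and maximal volume V

You can check by hand that the optimal cylinder satisfies h = 2r.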
Optimisation: General Case
In general, a mathematical optimisation problem looks like

    min_{x∈Rn}  f(x),
    s.t.  x ∈ Ω,

where x ∈ Rn are the decision variables, the objective function is f : Rn → R and the
constraints are represented by the set Ω ⊆ Rn.
This represents the constraints geometrically (the feasible region of space).
Usually we will write constraints as equations/inequalities, which implies a feasible
region (all x satisfying the (in)equations). For example, for the cylinder problem

    Ω = {(r, h) ∈ R2 : 2πr^2 + 2πrh = S, r ≥ 0, h ≥ 0}.
We usually write min rather than max (change max to min by taking negative of
objective function).
Motivation
Question
Why should we care about optimisation?
Optimisation naturally arises in the world:
• Molecules interact in chemical reactions to form new components until the potential
energy of their electrons is minimised
• Light follows the path which minimises travel time
• Soap films
• Evolution
• Classical mechanics (Lagrangian/Hamiltonian mechanics)
Motivation
Question
Why should we care about optimisation?
People try to make optimal decisions:
• Designing a portfolio of investments to maximise expected return (or minimise risk).
We will discuss this a lot in the second half of this course!
• Designing a product to minimise manufacturing costs and/or maximise product
performance
• Minimising boarding times at airports, or finding the order of plane takeoff to
minimise time
• Minimising fuel costs in transportation schedules (while making all deliveries)
• Minimising electricity generation costs while meeting all demand
• Training an AI model by maximising its predictive ability
Chapter 3
Linear Programming
Introduction
Our first category of optimisation problem is linear programming (LP). For LPs, we
have
• The objective function is linear in x
• The constraints are linear or affine (linear + constant) equalities/inequalities in x
Note: “programming” refers to planning, not computer coding (program = term for a
specific plan, often military logistics). Early history:
• Leonid Kantorovich (Russian mathematician, 1939): “The mathematical method of
production planning and organization”
• Tjalling Koopmans (Dutch mathematician/economist, 1942): solving shipping prob-
lems using LP
• George Dantzig (American mathematician, 1947): published the first algorithm to
solve any LP problem, called the simplex method.
Kantorovich and Koopmans received the 1975 Nobel Prize in Economics “for their
contributions to the theory of optimum allocation of resources”.
Importance of LPs
LPs are probably the most important category of optimisation problem for real-world
problems:
• Scheduling of workers, timetabling, ...
• Production management, managing factory lines...
• Transportation: selection of routes, allocation of resources, ...
USyd’s course QBUS2310 is entirely about solving business management problems
with optimisation, mostly linear programming!
Importance of LPs
New York Times
(19/11/1984)
Breakthrough in Problem Solving
A 28-year-old mathematician at A.T.&T. Bell Laborato-
ries has made a startling theoretical breakthrough... [of
interest to] brokerage houses, oil companies and airlines,
industries with millions of dollars at stake in problems
known as linear programming.
These problems... arise in a variety of commercial and
government applications, ranging from allocating time on
a communications satellite to routing millions of telephone
calls over long distances. Linear programming is particu-
larly useful whenever a limited, expensive resource must
be spread most efficiently among competing users.
https://www.nytimes.com/1984/11/19/us/breakthrough-in-problem-solving.html
3.1 Revision: Linear Systems
Solving Linear Systems
Before we study LPs, we have to revise solutions of linear systems.
Suppose we have Ax = b, for A ∈ Rn×n, x ∈ Rn, b ∈ Rn.
Note: # equations (# rows of A, length of b) = # unknowns (# columns of A, length
of x).
If A is invertible (full rank), then we can solve with
x = A−1b
How can we do this calculation (by hand or on a computer)?
We need an algorithm: a known procedure that will guarantee the correct answer for
all valid inputs. In this case, the algorithm is Gauss-Jordan elimination.
Solving Linear Systems: Example
x1 − x2 + x3 = −2
2x1 + x2 − x3 = 5
−x1 + 2x2 + 3x3 = 0
Write in matrix-vector form Ax = b as

    [ 1  −1   1 ] [x1]   [−2]
    [ 2   1  −1 ] [x2] = [ 5]
    [−1   2   3 ] [x3]   [ 0]

In Gauss-Jordan elimination, we want to convert this system to row reduced echelon form
(RREF)

    [ 1  0  0 ] [x1]   [∗]
    [ 0  1  0 ] [x2] = [∗]
    [ 0  0  1 ] [x3]   [∗]
The RHS is the solution, provided our process doesn’t change the solution(s) of the
system.
Elementary Operations
The three elementary row operations are steps which do not change the solution of a
linear system.
Basic idea: pick a simple (invertible) matrix P ∈ Rn×n and change
Ax = b −→ PAx = Pb (such that PA is simpler than A)
1. Row scaling: multiply/divide any row by a nonzero constant.
For example, “multiply row 2 by 5” corresponds to
    P = [ 1  0  0 ]           [y1]   [ y1]
        [ 0  5  0 ],   e.g. P [y2] = [5y2]
        [ 0  0  1 ]           [y3]   [ y3]
General trick for finding P : apply your desired row operation to the identity matrix.
Elementary Operations
2. Linear combination of rows: add/subtract a multiple of one row to any
other row
For example, “add (5×row 3) to row 1” corresponds to
    P = [ 1  0  5 ]           [y1]   [y1 + 5y3]
        [ 0  1  0 ],   e.g. P [y2] = [   y2   ]
        [ 0  0  1 ]           [y3]   [   y3   ]

3. Swap any two rows
For example, “swap rows 1 and 2” corresponds to
    P = [ 0  1  0 ]           [y1]   [y2]
        [ 1  0  0 ],   e.g. P [y2] = [y1]
        [ 0  0  1 ]           [y3]   [y3]
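As a quick illustration (a minimal NumPy sketch of our own), each P above can be built by applying the desired row operation to the identity matrix, exactly as the trick suggests:

# Build elementary matrices by applying the row operation to the identity.
import numpy as np

A = np.array([[1., -1., 1.], [2., 1., -1.], [-1., 2., 3.]])

P_scale = np.eye(3); P_scale[1, 1] = 5    # multiply row 2 by 5
P_comb = np.eye(3);  P_comb[0, 2] = 5     # add 5*(row 3) to row 1
P_swap = np.eye(3)[[1, 0, 2]]             # swap rows 1 and 2

print(P_swap @ A)  # A with rows 1 and 2 exchanged; solutions of Ax=b unchanged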
Gauss-Jordan Elimination
The Gauss-Jordan elimination algorithm solves a linear system by performing pivot
operations on each column.
A pivot operation is a specific combination of elementary operations: for a given
column j,
1. Pick a nonzero pivot entry aij
– Easiest to pick diagonal ajj, if it’s nonzero. Computers pick entry with largest
magnitude (better with rounding errors; see MATH3X76).
2. Divide row i by aij (to make aij = 1)
3. Make all other entries in column j zero (i.e. akj for all k ̸= i) by adding multiples of
row i
Let’s apply this to our system from before (written in tableau form)
x1 x2 x3 b
1 −1 1 −2
2 1 −1 5
−1 2 3 0
Gauss-Jordan Elimination
For column 1, pick a nonzero entry as the pivot (e.g. diagonal a11 = 1). Note: no row
scaling needed, entry is already 1. Eliminate all other entries in that column:
x1 x2 x3 b
1 −1 1 −2
2 1 −1 5
−1 2 3 0
R2−2R1→R2
R3+R1→R3
x1 x2 x3 b
1 −1 1 −2
0 3 −3 9
0 1 4 −2
For column 2, pick diagonal entry a22 = 3 as pivot. Rescale then eliminate other entries
(1/3)R2→R2
x1 x2 x3 b
1 −1 1 −2
0 1 −1 3
0 1 4 −2
R1+R2→R1
R3−R2→R3
x1 x2 x3 b
1 0 0 1
0 1 −1 3
0 0 5 −5
For column 3, pick diagonal entry a33 = 5 as pivot. Rescale then eliminate other entries...
Gauss-Jordan Elimination
For column 3, pick diagonal entry a33 = 5 as pivot. Rescale then eliminate other
entries
(1/5)R3→R3
x1 x2 x3 b
1 0 0 1
0 1 −1 3
0 0 1 −1
R2+R3→R2
x1 x2 x3 b
1 0 0 1
0 1 0 2
0 0 1 −1
We are now in RREF, and it is trivial to read off the solution:

    [ 1  0  0 ] [x1]   [ 1]            [x1]   [ 1]
    [ 0  1  0 ] [x2] = [ 2]    ⇐⇒     [x2] = [ 2]
    [ 0  0  1 ] [x3]   [−1]            [x3]   [−1]
(check this solves the original system!)
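The whole procedure is easy to automate. Below is a minimal teaching sketch in Python/NumPy (our own code, assuming a square invertible system whose diagonal pivots stay nonzero; real solvers pick the largest pivot, as noted above):

# A minimal Gauss-Jordan sketch: pivot on each column in turn.
import numpy as np

def gauss_jordan(A, b):
    T = np.hstack([A.astype(float), b.astype(float).reshape(-1, 1)])
    n = A.shape[0]
    for j in range(n):            # pivot on column j, row j
        T[j] = T[j] / T[j, j]     # scale pivot row so pivot entry is 1
        for k in range(n):
            if k != j:
                T[k] = T[k] - T[k, j] * T[j]  # zero out rest of column j
    return T[:, -1]               # in RREF, the RHS column is the solution

A = np.array([[1, -1, 1], [2, 1, -1], [-1, 2, 3]])
b = np.array([-2, 5, 0])
print(gauss_jordan(A, b))  # expect [1, 2, -1]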
Underdetermined Systems
What if the matrix A is not square?
• We will be interested in the case where A ∈ Rm×n for m < n
    [ a11  a12  · · ·  a1n ] [x1]   [b1]
    [  ⋮    ⋮    ⋱      ⋮  ] [ ⋮ ] = [ ⋮]
    [ am1  am2  · · ·  amn ] [xn]   [bm]
• Intuitively, there should be multiple solutions (more unknowns than equations)
• Not always! Solutions may not exist:

    [ 1  · · ·  1 ]     [1]
    [ 1  · · ·  1 ] x = [2]
• If the rows of A are linearly independent (full row rank, rank(A) = m), then
solution(s) exist
• If m < n and solution(s) exist, then there are infinitely many solutions
Underdetermined Systems: Example
Consider the following (full row rank) system with m = 3 and n = 5:
x1 x2 x3 x4 x5 b
1 −1 1 −1 0 −2
2 1 −1 0 1 5
−1 2 3 1 2 0
Let’s do Gauss-Jordan elimination on this system, starting with column 1 (pivot a11 = 1,
no scaling needed)
R2−2R1→R2
R3+R1→R3
x1 x2 x3 x4 x5 b
1 −1 1 −1 0 −2
0 3 −3 2 1 9
0 1 4 0 2 −2
For column 2, pick pivot a22 = 3, rescale and eliminate...
Underdetermined Systems: Example
For column 2, pick pivot a22 = 3, rescale and eliminate
(1/3)R2→R2

x1   x2   x3   x4    x5    b
1    −1   1    −1    0     −2
0    1    −1   2/3   1/3   3
0    1    4    0     2     −2

R1+R2→R1, R3−R2→R3

x1   x2   x3   x4    x5    b
1    0    0    −1/3  1/3   1
0    1    −1   2/3   1/3   3
0    0    5    −2/3  5/3   −5

Column 3, pivot a33 = 5:

(1/5)R3→R3

x1   x2   x3   x4     x5    b
1    0    0    −1/3   1/3   1
0    1    −1   2/3    1/3   3
0    0    1    −2/15  1/3   −1

R2+R3→R2

x1   x2   x3   x4     x5    b
1    0    0    −1/3   1/3   1
0    1    0    8/15   2/3   2
0    0    1    −2/15  1/3   −1
We can’t progress any further: eliminating any entries in column 4 would break the
zeros we put in the first 3 columns (this system is in RREF). What is the solution?
Underdetermined Systems: Example
Writing out the equations from our RREF tableau, we have

    x1 − (1/3)x4 + (1/3)x5 = 1            x1 = 1 + (1/3)x4 − (1/3)x5
    x2 + (8/15)x4 + (2/3)x5 = 2     or    x2 = 2 − (8/15)x4 − (2/3)x5
    x3 − (2/15)x4 + (1/3)x5 = −1          x3 = −1 + (2/15)x4 − (1/3)x5
So, we can write some variables (x1, x2, x3) as an affine linear function of the others (x4
and x5). Any choice of x4 and x5 gives us a valid solution: e.g. x4 = x5 = 0 gives x1 = 1,
x2 = 2, x3 = −1. This means we must have infinitely many solutions (one for each choice
of x4 and x5).
• The ‘constrained’ variables x1, x2, x3 were the ones corresponding to our pivot
columns.
• Note that if we deleted the other variables (x4 and x5) from the original system, we
would have got a square (m = n = 3) invertible system, which must have a unique
solution.
• The split (x1, x2, x3) vs. (x4, x5) is not unique. We could have skipped pivoting on
column 2 and pivoted on column 4 instead to get (x1, x3, x4) vs. (x2, x5), for example
— check!
Underdetermined Systems: Interpretation
Interpretation 1: think of A as columns c1, . . . , cn ∈ Rm. Then Ax = b means
x1c1 + · · ·+ xncn = b
or we write b as a linear combination of the cj. If A has full row rank, we only need a
subset of m vectors cj to span Rm, so we can throw away some columns cj and still be
guaranteed a solution.
If we call the corresponding m × m matrix B (which is invertible), then we get
a reduced system BxB = b with a unique solution. We can then set the other n − m
variables xj = 0.
Definition 3.1. The m variables xj for which the cj span Rm are called basic variables.
(e.g. x1, x2, x3 above)
The other n−m variables are called non-basic variables. (e.g. x4, x5 above)
The solution to Ax = b which solves BxB = b (basic variables) and non-basic variables
xj = 0 is called a basic solution of the linear system.
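A basic solution is easy to compute numerically: pick the basic columns, solve the square system, and set the non-basic variables to zero. A minimal NumPy sketch of our own, using the example system above:

# Compute a basic solution by choosing x1, x2, x3 as basic variables.
import numpy as np

A = np.array([[1., -1., 1., -1., 0.],
              [2., 1., -1., 0., 1.],
              [-1., 2., 3., 1., 2.]])
b = np.array([-2., 5., 0.])

basic = [0, 1, 2]                  # columns for the basic variables x1, x2, x3
B = A[:, basic]                    # invertible m x m submatrix
x = np.zeros(5)                    # non-basic variables x4, x5 set to zero
x[basic] = np.linalg.solve(B, b)   # solve B x_B = b
print(x, np.allclose(A @ x, b))    # basic solution (1, 2, -1, 0, 0)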
Underdetermined Systems: Interpretation
Interpretation 2: think of A as rows r1, . . . , rm ∈ Rn. Then Ax = b means
rT1 x = b1, . . . rTmx = bm
(i.e. the original system of equations).
We know that there are n−m non-basic variables (which we can choose to be anything),
and m basic variables (which are fixed once we know the non-basic variables).
So, this means that we could add n − m extra equations to specify the non-basic
variables, giving a square invertible n× n system.
We get the basic solution if we add the n−m extra equations
eTj x = 0 (i.e. xj = 0) for all non-basic variables j
If the original A ∈ Rm×n has full row rank, this larger n× n system also has full rank
(with a unique solution = basic solution). Difficulty: initially, we don’t know what the
(non-)basic variables are! Gauss-Jordan elimination helps us find them.
Simultaneous Transformation
When we get to LPs, we will have constraints (like equations Ax = b), but also a
linear objective function
Z = cTx = c1x1 + · · ·+ cnxn.
Suppose we have Ax = b for the 3× 5 example above and Z = x1 + x2 − x5.
For basic variables x1, x2, x3 and non-basic variables x4, x5 we know from above that

    x1 = 1 + (1/3)x4 − (1/3)x5   and   x2 = 2 − (8/15)x4 − (2/3)x5

and so we have

    Z = (1 + (1/3)x4 − (1/3)x5) + (2 − (8/15)x4 − (2/3)x5) − x5 = 3 − (1/5)x4 − 2x5
So Z looks like another basic variable (affine combination of non-basic variables)! For the
basic solution x4 = x5 = 0 we get Z = 3.
Simultaneous Transformation
Learning that Z = 3 − (1/5)x4 − 2x5 was useful. Conveniently, we can use Gauss-Jordan
elimination to give us this extra information.
To do this, we pretend Z is an extra variable, and add the extra equation Z = x1+x2−x5
(i.e. Z − x1 − x2 + x5 = 0). The new starting tableau is
Z x1 x2 x3 x4 x5 b
1 −1 −1 0 0 1 0
0 1 −1 1 −1 0 −2
0 2 1 −1 0 1 5
0 −1 2 3 1 2 0
We now do Gauss-Jordan elimination on this larger system.
For column 1 (Z), we pick pivot a11 = 1. No scaling needed, but no elimination needed
either!
Simultaneous Transformation
For column 1 (Z), no extra pivoting work needed! Now do pivoting on columns 2–4
(x1, x2, x3) exactly as before, we get:
Z   x1   x2   x3   x4     x5    b
1   0    0    0    1/5    2     3
0   1    0    0    −1/3   1/3   1
0   0    1    0    8/15   2/3   2
0   0    0    1    −2/15  1/3   −1
Last 3 rows are the same as before (give us x1 = 1 + (1/3)x4 − (1/3)x5 etc.), since the
coefficient of Z stays zero.
Top row gives us

    Z + (1/5)x4 + 2x5 = 3   or   Z = 3 − (1/5)x4 − 2x5,

which is exactly what we wanted!
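The same elimination can be reproduced in a few lines of NumPy (a minimal sketch of the calculation above; the pivot choices are hard-coded to match):

# Treat Z as an extra variable and pivot on the Z, x1, x2, x3 columns.
import numpy as np

T = np.array([[1., -1., -1., 0., 0., 1., 0.],   # Z - x1 - x2 + x5 = 0
              [0., 1., -1., 1., -1., 0., -2.],
              [0., 2., 1., -1., 0., 1., 5.],
              [0., -1., 2., 3., 1., 2., 0.]])

for row, col in [(0, 0), (1, 1), (2, 2), (3, 3)]:  # pivot positions
    T[row] = T[row] / T[row, col]
    for k in range(4):
        if k != row:
            T[k] = T[k] - T[k, col] * T[row]

print(np.round(T, 4))  # top row reads Z + (1/5)x4 + 2x5 = 3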
3.2 LP Basics
LP Simple Example
Now we can go back to thinking about LPs.
Example: a company produces two drugs (a white pill and a blue pill). Most
components are easily available, but one ingredient P is in short supply.
• Each white pill sells for $3000, and comprises 1 unit of A1 and 1 unit of B1.
• Each blue pill sells for $5000, and comprises 1 unit of A2 and 1 unit of B2.
• There are three factories:
– Factory 1 only produces A1. 1 unit of A1 requires 1 gram of P . Factory 1 has
4 grams of P .
– Factory 2 produces both B1 and B2. 1 unit of B1 requires 3 grams of P , and 1
unit of B2 requires 2 grams of P . Factory 2 has 18 grams of P .
– Factory 3 only produces A2. 1 unit of A2 requires 2 grams of P . Factory 3 has
12 grams of P .
How many white/blue pills should the company make to maximise revenue (sales)?
LP Simple Example
Let’s try to formalise this by introducing some notation:
x1 = # white pills produced, x2 = # blue pills produced
Then our revenue is $3000 per white pill and $5000 per blue pill: Z = 3000x1 + 5000x2.
Ingredients: we need x1 units of A1 and B1, and x2 units of A2 and B2. Our limiting
resource is P , so how much will we use up?
• Factory 1: x1 units of A1 requires x1 grams of P (4g available)
• Factory 2: x1 units of B1 and x2 units of B2 requires 3x1 + 2x2 grams of P (18g
available)
• Factory 3: x2 units of A2 requires 2x2 grams of P (12g available)
This gives constraints
x1 ≤ 4, 3x1 + 2x2 ≤ 18, and 2x2 ≤ 12
Any hidden constraints?
LP Simple Example
All together, and writing x = (x1, x2)T ∈ R2, our problem is

    max_{x∈R2}  Z = 3000x1 + 5000x2,
    s.t.  x1 ≤ 4,
          3x1 + 2x2 ≤ 18,
          2x2 ≤ 12,
          x1, x2 ≥ 0   (hidden constraints!)
Our objective is linear and all constraints are affine (in x1 and x2), so this is an LP.
Note: if we let Z be the revenue in $1000s, then Z = 3x1 + 5x2 which is easier to
manage.
Often, it is not obvious when a practical problem is an LP! (which is why
QBUS2310 exists)
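Once a problem is written as an LP, off-the-shelf solvers handle it directly. Here is a minimal check of the drug problem with SciPy's linprog (an illustration of our own, with Z in $1000s so c = (3, 5); linprog minimises, so we negate c to maximise):

# Solve the drug problem with scipy.optimize.linprog.
import numpy as np
from scipy.optimize import linprog

c = np.array([3., 5.])
A_ub = np.array([[1., 0.],   # x1 <= 4
                 [3., 2.],   # 3x1 + 2x2 <= 18
                 [0., 2.]])  # 2x2 <= 12
b_ub = np.array([4., 18., 12.])

res = linprog(-c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)])
print(res.x, -res.fun)  # expect x = (2, 6) with Z = 36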
Standard Form
There are many different ways to write an LP. We will look at LPs written in standard
form:
    max_{x∈Rn}  Z = cTx = c1x1 + · · · + cnxn,
    s.t.  a11x1 + · · · + a1nxn ≤ b1,
          ...
          am1x1 + · · · + amnxn ≤ bm,
          x1, x2, . . . , xn ≥ 0.
In standard form, we assume that all RHS values bi ≥ 0. Note max not min, direction of
inequalities. Later, we will look at how to convert any problem into standard form.
We sometimes write the constraints as Ax ≤ b and x ≥ 0 (vector inequalities meant
component-wise). Terminology: c is the cost vector (ci are cost coefficients), A is the
constraint matrix, b is the resource vector.
Standard Form
    max_{x∈Rn}  Z = cTx   s.t.   Ax ≤ b, x ≥ 0
The feasible region Ω ⊂ Rn is
Ω = {x ∈ Rn : Ax ≤ b and x ≥ 0}
We will call any x ∈ Ω a feasible point.
If Ω ̸= ∅ (i.e. at least one feasible point exists), we say the LP is feasible or the
constraints are consistent.
An optimum (i.e. a solution) is any feasible point x∗ ∈ Ω such that cTx∗ ≥ cTx for
all x ∈ Ω. The corresponding Z∗ = cTx∗ is the optimal value.
Warning: some textbooks define standard form differently, usually Ax = b and x ≥ 0.
We can convert any problem from our standard form to this form (and vice versa) using
methods we will discuss later. Some textbooks also call any x ∈ Ω a “feasible solution”
(which I find confusing!).
Standard Form
Simple observation: since Z is linear, any optimum must be on the boundary of
the feasible set Ω provided c ̸= 0. If c = 0 then we haven’t really got an optimisation
problem.
Why? If we are at a point x ∈ Ω not on the boundary, we can take a small step
x + ϵs in any direction s ∈ Rn and stay in Ω, provided ϵ > 0 is sufficiently small. Trying
s = c gives

    Znew = cT(x + ϵc) = cTx + ϵcTc = (cTx) + ϵ(c1^2 + · · · + cn^2) > cTx = Zold,

where the strict inequality requires c ≠ 0. So, the point x + ϵc is in Ω and has a larger
objective value than x, i.e. x can't be a solution.
As a result, we will always look for solutions by moving around the boundary
of the feasible region.
Graphical Solution
In the simple case of two decision variables (n = 2), we can plot our feasible region
graphically. For example, in the drug manufacturing problem:
[Figure: feasible region of the drug problem in the (x1, x2) plane]
Constraints are:
x1 ≤ 4
3x1 + 2x2 ≤ 18
2x2 ≤ 12
x1, x2 ≥ 0
The solution(s) must be on the boundary of
this region.
Graphical Solution
To find the maximum of the objective, here Z = 3x1 + 5x2, we can look at the contour
lines of Z (i.e. lines along which Z is constant).
[Figure: feasible region with contour lines Z = 10, 20, 30]
Z=10
Z=20
Z=30
If Z is constant, we get a line

    x2 = −(3/5)x1 + Z/5

For example, Z = 10 gives contour line

    x2 = −(3/5)x1 + 2

Imagine sliding the contour line up (increasing Z) until you reach the edge of the feasible
region. Where is the intersection?
Graphical Solution
To be more careful, we need to check the value of Z on the boundary of the feasible
region. Actually, we just need to check the corners (there is always at least one optimum
at a corner).
[Figure: feasible region with its 5 corner points marked]
There are 5 corner points, with objective values

    Z(0, 0) = 0
    Z(0, 6) = 30
    Z(2, 6) = 36
    Z(4, 3) = 27
    Z(4, 0) = 12

Maximum is Z∗ = 36 at x∗ = (2, 6)T.
Graphical Solution: Comments
This graphical method gives us a few insights.
1. The feasible region is convex (roughly, no ‘inward bumps’).
Formally, a set is convex if the line segment connecting any two points in the set stays
inside the set: for any x,y ∈ Ω
tx+ (1− t)y ∈ Ω, for all t ∈ [0, 1]
(e.g. t = 0 is y, t = 1 is x, t = 0.5 is halfway point)
2. The feasible region can be empty
For example, x1 ≤ 5 and x1 ≥ 7 (not in standard form, but can be converted).
In this case, we say the LP is infeasible.
Graphical Solution: Comments
3. The feasible region could be unbounded
[Figure: an unbounded feasible region]
Constraints are

    −x1 + x2 ≤ 1
    (1/2)x1 − x2 ≤ 1
    x1, x2 ≥ 0
Note: this region is still convex (check!)
For unbounded problems, there may still be a (finite) optimum at a corner point. But
we may be able to find Z → +∞ inside the feasible region. In this case we say the LP is
unbounded.
Graphical Solution: Comments
4. There may be more than one solution
[Figure: contour lines of Z = 6x1 + 4x2 parallel to the boundary segment 3x1 + 2x2 = 18]
Change objective to Z = 6x1 + 4x2, so contour lines are

    x2 = −(3/2)x1 + Z/4

For example, Z = 10 gives contour line

    x2 = −(3/2)x1 + 5/2

Contour lines are parallel to a boundary segment.
Graphical Solution: Comments
4. There may be more than one solution
In this case, the solutions lie on the boundary 3x1 + 2x2 = 18 (constraint holds with
equality), between (2, 6) and (4, 3).
Algebraically, pick a parameter t for the boundary line, for example (x1, x2) = (t, 9 − (3/2)t).
Find the values of t that satisfy all other constraints:

    x1 ≤ 4 =⇒ t ≤ 4
    2x2 ≤ 12 =⇒ t ≥ 2
    x1 ≥ 0 =⇒ t ≥ 0
    x2 ≥ 0 =⇒ t ≤ 6

The solution line is all values of t satisfying all constraints, i.e. (x∗1, x∗2) = (t, 9 − (3/2)t) for all
2 ≤ t ≤ 4. All these points give Z∗ = 36.
Simplex Method: Motivation
To find a general procedure for solving LPs, we know:
• Any optimum is always on the boundary of the feasible region
• The feasible region is a polytope (high-dimensional polygon/polyhedron)
• There is always at least one optimum at a corner (if the problem has finite solution)
– If there are multiple optimum corner points, the line segment/face/etc between
them represents infinitely many solutions
So, our algorithm for solving LPs should look at feasible corner points.
If all corner points adjacent to the current one have worse (i.e. smaller) Z value, then
you are at an optimum. (in general, this means you don’t have to check all corners: only
look at adjacent corners which increase Z)
The resulting algorithm is the simplex method (George Dantzig, 1947). It is one of
the most important algorithms in applied mathematics.
Simplex Method
At a high level, the simplex method is quite simple:
1. Pick a starting feasible corner point
2. If there is an adjacent feasible corner point which increases Z, go there
3. If there is no adjacent feasible corner point which increases Z, stop (solution found)
[Figure: simplex path along the corners of the feasible region]
1. Start at x = (0, 0) with Z = 0.
2. Move to x = (0, 6) to increase Z to Z = 30.
3. Move to x = (2, 6) to increase Z to Z = 36.
4. Adjacent points are x = (0, 6) and x = (4, 3)
with Z = 30, 27 (both smaller): stop with x∗ =
(2, 6).
The main difficulty is writing this procedure for a
general problem.
Corner Points
The boundary of the feasible region corresponds to areas where the inequality con-
straints are active (hold with equality).
This is easy to check for constraints xi ≥ 0: just see if xi = 0 or xi > 0.
For the other constraints, Ax ≤ b, it will be useful to have some new variables which
represent the distance to equality. Rewrite the first constraint as
    a11x1 + · · · + a1nxn ≤ b1   −→   a11x1 + · · · + a1nxn + xn+1 = b1, and xn+1 ≥ 0

That is, we define xn+1 := b1 − (a11x1 + · · · + a1nxn) ≥ 0.
Do this for each of the m constraints Ax ≤ b to get m new slack variables xn+1, . . . , xn+m ≥ 0.
We go from Ax ≤ b with x ≥ 0 to (after adding slacks) Ax = b and x ≥ 0 (for a
larger vector x and wider matrix A).
Now, every boundary edge of the feasible region is given by xi = 0 for some i (possibly
a slack variable).
Corner Points
Recall, a hyperplane in Rn is the set of all solutions to an equation a1x1+ · · ·+anxn = b
(at least one ai nonzero). Hyperplanes divide Rn into two half-spaces:
• A hyperplane in R is a point
• A hyperplane in R2 is a line
• A hyperplane in R3 is a plane, etc
If we look at intersections of hyperplanes, in the typical case (e.g. no repeated equations,
inconsistent equations,...)
• In R2, two lines intersect at a point
• In R3, two planes intersect at a line, but three hyperplanes intersect at a point
Here, we have n + m hyperplanes xi = 0 in Rn (including slack variables). This
corresponds to the feasible set being the intersection of n + m half-spaces (a convex
polytope).
A corner of the feasible set is the intersection of n hyperplanes.
Corner Points
How many corner points do we have?
Intersection of n out of n + m hyperplanes:

    (n + m choose n) = (n + m)! / (n! m!)  total corners

But not all corner points are feasible: e.g. drug example, n = 2 and m = 3 gives
(5 choose 2) = 10 total corners but the feasible region only has 5 corners.
For this example, we have original decision variables x1, x2 (both ≥ 0) and new slack
variables
x1 ≤ 4 =⇒ x3 := 4− x1 ≥ 0
3x1 + 2x2 ≤ 18 =⇒ x4 := 18− 3x1 − 2x2 ≥ 0
2x2 ≤ 12 =⇒ x5 := 12− 2x2 ≥ 0
The boundaries of the feasible region are where one (or more) constraints hold with
equality; i.e. xi = 0 for some i = 1, . . . , 5.
For example, x3 = 0 is the boundary line x1 = 4.
Corner Points
How many corner points do we have?
[Figure: the 10 candidate corner points, with the lines x1 = 0, x2 = 0, x3 = 0, x4 = 0, x5 = 0 labelled]
Of the 10 possible pairs of active constraints xi = 0:
• 5 corner points are feasible: e.g. x1 = x5 = 0 gives (0, 6). Check (x2, x3, x4) = (6, 4, 6) ≥ 0.
• 3 corner points are infeasible: e.g. x3 = x5 = 0 gives (4, 6). Here, x4 = 18 − 3x1 − 2x2 = −6 < 0.
• 2 'corner points' do not exist: e.g. x1 = x3 = 0 requires x1 = 0 and x1 = 4.
Corner Points
Link to Gauss-Jordan elimination?
After adding slacks, the resource constraints Ax ≤ b are now equalities Ax = b,
essentially by definition of the slack variables:
e.g. 3x1 + 2x2 ≤ 18 is now 3x1 + 2x2 + x4 = 18 (by definition of slack x4)
From Gauss-Jordan elimination, any point satisfying Ax = b (m constraints, n + m
unknowns) means we can write m of the variables (basic variables) as an affine combination
of the remaining n (non-basic variables).
Any corner point has n values xi = 0: pick these to be the non-basic variables and
this corresponds to a basic solution to Ax = b! That is:
• Basic solutions to Ax = b ⇐⇒ corner points
• Feasible corner point ⇐⇒ basic solution with x ≥ 0
e.g. (4, 3) has x3 = x4 = 0 (non-basic variables) with (x1, x2, x5) = (4, 3, 6) (basic
variables).
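This correspondence can be checked by brute force: choose every possible set of 3 basic variables, solve for the basic solution, and test feasibility. A minimal NumPy sketch for the drug problem (our own code; the tolerances are arbitrary):

# Enumerate all (5 choose 3) = 10 candidate corners of the drug problem.
import itertools
import numpy as np

A = np.array([[1., 0., 1., 0., 0.],    # x1 + x3 = 4
              [3., 2., 0., 1., 0.],    # 3x1 + 2x2 + x4 = 18
              [0., 2., 0., 0., 1.]])   # 2x2 + x5 = 12
b = np.array([4., 18., 12.])

for combo in itertools.combinations(range(5), 3):
    basic = list(combo)
    B = A[:, basic]
    if abs(np.linalg.det(B)) < 1e-12:
        continue                        # this 'corner point' does not exist
    x = np.zeros(5)
    x[basic] = np.linalg.solve(B, b)
    status = "feasible" if np.all(x >= -1e-12) else "infeasible"
    print(np.round(x[:2], 3), status)   # 8 corners printed, 5 feasible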
Corner Points
We now have an understanding of how to algebraically represent the (feasible) corner
points.
Unfortunately, there are a lot of corner points (e.g. a small LP with n = 40, m = 20 gives
(60 choose 20) ≈ 4 × 10^15 corners), even if we only look at feasible ones.
Fortunately, we know that we only want to check the ‘adjacent’ feasible corner points
(from where we are currently). This dramatically decreases the amount of work required!
What does it mean to move to an ‘adjacent’ feasible corner point?
Definition 3.2. Two corner points are adjacent if they have all the same basic/non-basic
variables except one.
For example, (4, 3) has x3 = x4 = 0 (non-basic variables) and (x1, x2, x5) = (4, 3, 6)
basic variables.
• It is adjacent to (4, 6) with (x1, x2, x4) = (4, 6,−6) (basic variables) [not feasible].
• It is not adjacent to (0, 0) with (x3, x4, x5) = (4, 18, 12) (basic variables).
Corner Points
Note: this definition of ‘adjacent’ is not quite what we might expect. It includes all
corners on the same constraints, not just the immediate neighbours.
[Figure: corners sharing the constraints x3 = 0 or x4 = 0 with the point (4, 3)]
Starting at (4, 3) with non-basic variables x3 = x4 = 0:
• (0, 9), (2, 6) and (6, 0) all have x4 = 0
• (4, 0), (4, 6) and (one non-existent corner) all have x3 = 0
To move from one corner to another, one variable goes non-basic → basic ("enters the
basis") and one variable goes basic → non-basic ("leaves the basis").
Corner Points
Example: finding the optimal point for drug problem
We started at (0, 0):
(x1, x2, x3, x4, x5) = (0, 0, 4, 18, 12)
Move to (0, 6): x2 enters the basis, x5 leaves the basis
(x1, x2, x3, x4, x5) = (0, 6, 4, 6, 0)
Move to (2, 6): x1 enters the basis, x4 leaves the basis
(x1, x2, x3, x4, x5) = (2, 6, 2, 0, 0)
Check point (4, 3): x5 enters the basis, x3 leaves the basis
(x1, x2, x3, x4, x5) = (4, 3, 0, 0, 6)
At the final point (2, 6), x1, x2, x3 are basic variables and x4, x5 are non-basic variables.
3.3 Simplex Method
Simplex Method
We are now in a position to precisely specify the simplex method. Assume the problem
is in standard form:
    max_{x∈Rn}  Z = cTx   s.t.   Ax ≤ b, x ≥ 0   (with b ≥ 0)
1. Pick any feasible corner point as a starting point (our standard form allows x = 0)
2. Move to an adjacent feasible corner point: pick one variable to enter the basis and
one variable to leave
– Choice must ensure Z increases and our new point is a feasible corner point
3. Continue until no adjacent feasible corner point increases Z (solution found!)
Changing the basis will require updating our Gauss-Jordan elimination. The values in
the matrix will help us decide which variables should enter/leave the basis.
We now look at these steps in detail.
Simplex Method: Initialisation
1. Initialisation
We have decision variables x1, . . . , xn. Write objective Z = cTx as
Z − c1x1 − · · · − cnxn = 0
Our constraints with slack variables xn+1, . . . , xn+m are
a11x1 + · · ·+ a1nxn + xn+1 = b1
...
am1x1 + · · ·+ amnxn + xn+m = bm
To find a feasible corner point, set all decision variables x1 = · · · = xn = 0 (and so slacks
are xn+1 = b1, ... xn+m = bm). This is feasible (all xi ≥ 0) since we assume b ≥ 0.
Simplex Method: Initialisation
1. Initialisation
So, we start with the tableau
Z   x1    x2    · · ·  xn    xn+1  xn+2  · · ·  xn+m  b
1   −c1   −c2   · · ·  −cn   0     0     · · ·  0     0
0   a11   a12   · · ·  a1n   1     0     · · ·  0     b1
0   a21   a22   · · ·  a2n   0     1     · · ·  0     b2
⋮    ⋮     ⋮            ⋮     ⋮     ⋮            ⋮     ⋮
0   am1   am2   · · ·  amn   0     0     · · ·  1     bm
At the start, we have basic variables xn+1, . . . , xn+m (slacks) and non-basic variables
x1 = · · · = xn = 0 (decision variables).
Note: the columns corresponding to the basic variables are all zeros (except a single
1). We have already done elimination on these columns!
Simplex Method: Initialisation
For the drug problem,
    max_{x∈R2}  Z = 3x1 + 5x2   s.t.   x1 ≤ 4,  3x1 + 2x2 ≤ 18,  2x2 ≤ 12,  x1, x2 ≥ 0
the slacks are
x3 = 4− x1, x4 = 18− 3x1 − 2x2, and x5 = 12− 2x2
So our initial tableau is
Z x1 x2 x3 x4 x5 b
1 −3 −5 0 0 0 0
0 1 0 1 0 0 4
0 3 2 0 1 0 18
0 0 2 0 0 1 12
The basic variables are (x3, x4, x5) = (4, 18, 12) and non-basic variables are x1 = x2 = 0
Simplex Method: Iteration
2. Iteration
We now need a way to pick which variables enter/leave the basis. There are many
ways to choose this¹. We will only look at one simple choice.
Which variable should enter the basis? (i.e. non-basic → basic, or xi = 0 → xi > 0)
Objective function is Z = c1x1 + · · ·+ cnxn. Currently all x1 = · · · = xn = 0.
Largest coefficient rule: the non-basic variable entering the basis is the one with the largest
coefficient ci, i.e. the variable with the best chance of increasing Z if xi increases (∂Z/∂xi is
largest).
Warning! The tableau shows negative ci values — pick the column with the most negative
entry in the Z row of the tableau.
For the drug problem, initially Z = 3x1+5x2 (tableau entries −3 and −5) — x2 enters
the basis.
¹T. Terlaky & S. Zhang. Pivot rules for linear programming: A survey on recent theoretical developments. Annals of Operations Research 46 (1993), 203–233. ←− lists 16 popular choices!
Simplex Method: Iteration
2. Iteration
Which variable should leave the basis? (i.e. basic → non-basic, or xi > 0 → xi = 0)
Idea: keep increasing new basic variable until we reach the next corner. That is,
remove the basic variable that becomes zero first (so x ≥ 0 always satisfied).
For the drug problem, we are at (x1, x2, x3, x4, x5) = (0, 0, 4, 18, 12) and start increasing
x2 (chosen above).
• x3 = 4− x1 — increasing x2 will never make x3 = 0
• x4 = 18− 3x1 − 2x2 — increasing x2 will make x4 = 0 when x2 = 9 (keep x1 = 0)
• x5 = 12− 2x2 — increasing x2 will make x5 = 0 when x2 = 6 (keep x1 = 0)
The boundary is first hit at x2 = 6 (here x5 = 0), so x5 should leave the basis.
Simplex Method: Iteration
2. Iteration
Which variable should leave the basis? (i.e. basic → non-basic, or xj > 0 → xj = 0)
This idea can be formally stated as follows.
Smallest ratio rule: if xi is entering the basis, choose the basic variable xj such that aji > 0
and bj/aji is smallest.
That is, look down column i (excluding the Z row); for each row j check aji > 0. For these
rows only, calculate bj/aji and pick the row with the smallest value.
Once you have picked the row, the basic variable leaving is the one with a 1 in that
row.
Why only look at xj with aji > 0?
• aji = 0 means constraint j doesn’t depend on xi (so increasing xi will never cause
xj = 0)
• aji < 0 means increasing xi causes xj > 0 to increase (again, xj = 0 will never
happen)
Simplex Method: Iteration
2. Iteration
For the drug problem, our tableau currently is
Z x1 x2 x3 x4 x5 b
1 −3 −5 0 0 0 0
0 1 0 1 0 0 4
0 3 2 0 1 0 18
0 0 2 0 0 1 12
Which basic variable is associated with which row?
The basic variable for row 2 is x3, the basic variable for row 3 is x4 and the basic
variable for row 4 is x5 (look for columns of the identity matrix)
Simplex Method: Iteration
2. Iteration
Which non-basic variable enters the basis?
Z x1 x2 x3 x4 x5 b
1 −3 −5 0 0 0 0
0 1 0 1 0 0 4
0 3 2 0 1 0 18
0 0 2 0 0 1 12
Pick column with largest negative entry in Z row: x2 enters the basis.
Simplex Method: Iteration
2. Iteration
Which basic variable leaves the basis?
Z   x1   x2   x3   x4   x5   b    Ratio
1   −3   −5   0    0    0    0
0   1    0    1    0    0    4    —
0   3    2    0    1    0    18   18/2 = 9
0   0    2    0    0    1    12   12/2 = 6
For rows with aji > 0, pick one with smallest ratio (last row).
Corresponding basis variable (column of identity) is x5, so x5 leaves the basis.
Simplex Method: Updating
We now know x2 enters the basis and x5 leaves the basis. How do we update the
tableau?
Eliminate x2 column (make basis variable), using x5 row for the pivot.
Z   x1   x2   x3   x4   x5   b
1   −3   −5   0    0    0    0
0   1    0    1    0    0    4
0   3    2    0    1    0    18
0   0    2    0    0    1    12

(1/2)R4→R4

Z   x1   x2   x3   x4   x5    b
1   −3   −5   0    0    0     0
0   1    0    1    0    0     4
0   3    2    0    1    0     18
0   0    1    0    0    1/2   6

R1+5R4→R1, R3−2R4→R3

Z   x1   x2   x3   x4   x5    b
1   −3   0    0    0    5/2   30
0   1    0    1    0    0     4
0   3    0    0    1    −1    6
0   0    1    0    0    1/2   6
We are now ready for the next iteration!
Simplex Method: Updating
Z   x1   x2   x3   x4   x5    b
1   −3   0    0    0    5/2   30
0   1    0    1    0    0     4
0   3    0    0    1    −1    6
0   0    1    0    0    1/2   6
The most negative entry in the Z row is −3: x1 enters the basis. Calculate ratios:

Z   x1   x2   x3   x4   x5    b    Ratio
1   −3   0    0    0    5/2   30
0   1    0    1    0    0     4    4/1 = 4
0   3    0    0    1    −1    6    6/3 = 2
0   0    1    0    0    1/2   6    —
Third row has smallest ratio, x4 leaves the basis.
Simplex Method: Updating
x1 enters the basis, x4 leaves the basis (eliminate x1 column using x4 row as pivot — check!):

Z   x1   x2   x3   x4    x5    b
1   0    0    0    1     3/2   36
0   0    0    1    −1/3  1/3   2
0   1    0    0    1/3   −1/3  2
0   0    1    0    0     1/2   6
What variable enters the basis? All Z-row coefficients are positive??
In our new basis, Z = 36 − x4 − (3/2)x5, so increasing x4 or x5 (from 0) decreases Z.
Simplex Method: Stopping
3. Stopping
At each iteration, the Z row changes (writing Z in terms of current non-basic variables).
Call resulting ci the modified cost coefficients.
Above, we have Z = 36 − x4 − (3/2)x5, so c4 = −1 and c5 = −3/2.
Stop when all ci ≤ 0 (i.e. all Z row entries are ≥ 0.)
If we cannot change any basic variable xi = 0 to xi > 0 without decreasing Z, we are
at a solution!
How do we know what our solution is (using the tableau)?
Simplex Method: Stopping
Z   x1   x2   x3   x4    x5    b
1   0    0    0    1     3/2   36
0   0    0    1    −1/3  1/3   2
0   1    0    0    1/3   −1/3  2
0   0    1    0    0     1/2   6
• Basic variables are x1, x2 and x3 (columns of identity), non-basic variables are x4, x5
• Z row gives Z∗ = 36 − x∗4 − (3/2)x∗5, so the optimal objective is Z∗ = 36 when x∗4 = x∗5 = 0
• Constraint rows tell us the basic variables:

    x∗3 − (1/3)x∗4 + (1/3)x∗5 = 2,    x∗1 + (1/3)x∗4 − (1/3)x∗5 = 2,    x∗2 + (1/2)x∗5 = 6

So x∗1 = 2, x∗2 = 6, x∗3 = 2 (quicker: look at entries in b column!)
Overall, our solution is (x∗1, x∗2) = (2, 6) with objective Z∗ = 36.
Issues
We have shown how the simplex method works. There are some special situations
that might cause problems:
• Ties (same modified cost coefficients or ratios)
• No variable leaves the basis
• Multiple optimal solutions
We also need to discuss how to handle LPs in non-standard form.
Issues: Ties
1. Ties
Which variable should enter the basis?
Imagine our tableau has first row
Z x1 x2 x3 x4 x5 b
1 0 −3 −3 0 0 20
We could pick either x2 or x3 to enter the basis (same negative modified cost coefficient).
There is no easy way to know which choice will get us to the solution quicker.
Issues: Ties
1. Ties
Which variable should leave the basis?
Ties can happen here when constraints are degenerate (i.e. corner point is intersection
of > n constraints).
[Figure: degenerate drug problem; the corner (0, 6) lies on three constraint lines]

Drug problem, but modify the x4 = 0 constraint to

    3x1 + 2x2 = 12

At the feasible corner point (0, 6), we have x1 = x4 = x5 = 0 (intersection of three constraints!)
If we start at (0, 0) and increase x2, we hit x4 = 0 and x5 = 0 at the same time.
Issues: Ties
1. Ties
Which variable should leave the basis?
The tableau in this situation looks like
Z   x1   x2   x3   x4   x5   b    Ratio
1   −3   −5   0    0    0    0
0   1    0    1    0    0    4    —
0   3    2    0    1    0    12   12/2 = 6
0   0    2    0    0    1    12   12/2 = 6
Smallest ratio is the same for last two rows (corresponding to basic variables x4 or x5
leaving).
If we are not careful about tiebreaking, it is possible to be stuck in infinite loops and
never solve the problem. To avoid this, we define a simple tiebreaking rule.
Smallest index rule: for any tie (enter or leave), always pick the variable with the
smallest index.
Issues: No variable leaves
2. No variable leaves the basis
Consider the problem (after adding slack variables)
max_{x∈R2}  Z = 3x1 + 5x2          =⇒    Z − 3x1 − 5x2 = 0
s.t.  −x1 + x2 ≤ 4                        −x1 + x2 + x3 = 4
      x1 − x2 ≤ 2                          x1 − x2 + x4 = 2
      x1, x2 ≥ 0                           x1, x2, x3, x4 ≥ 0
Following the usual method, we start at x1 = x2 = 0 (non-basic variables) with slacks
x3 = 4 and x4 = 2.
Issues: No variable leaves
The tableau is
Z   x1   x2   x3   x4   b   Ratio
1   −3   −5   0    0    0
0   −1   1    1    0    4   4/1 = 4
0   1    −1   0    1    2   —
So x2 enters the basis and x3 leaves the basis. Eliminate x2 column to get new tableau
Z x1 x2 x3 x4 b Ratio
1 −8 0 5 0 20
0 −1 1 1 0 4 —
0 0 0 1 1 6 —
x1 enters the basis, but no variable can leave!
Issues: No variable leaves
2. No variable leaves the basis
[Figure: unbounded feasible region; Z can increase forever along the edge x3 = 0]
Started at (0, 0), moved to (0, 4)
At (0, 4) wanted to increase x1 (increases Z)
We can keep increasing x1 as much as we
want (keeping x3 = 0)! So, we can increase
Z as much as we want.
If no variable can leave the basis, the problem is unbounded (Z∗ = +∞).
Issues: Multiple Solutions
3. Multiple Solutions
Recall that if the contour lines of Z are parallel with the boundary, we can have
infinitely many solutions.
We saw this on the modified drug problem (change cost coefficients):
max_{x∈R2}  Z = 6x1 + 4x2          =⇒    Z − 6x1 − 4x2 = 0
s.t.  x1 ≤ 4                               x1 + x3 = 4
      3x1 + 2x2 ≤ 18                       3x1 + 2x2 + x4 = 18
      2x2 ≤ 12                             2x2 + x5 = 12
      x1, x2 ≥ 0                           x1, x2, x3, x4, x5 ≥ 0
As before, our starting point is x1 = x2 = 0 with (x3, x4, x5) = (4, 18, 12)
Issues: Multiple Solutions
Z x1 x2 x3 x4 x5 b Ratio
1 −6 −4 0 0 0 0
0 1 0 1 0 0 4 4/1 = 4
0 3 2 0 1 0 18 18/3 = 6
0 0 2 0 0 1 12 —
−→ x1 enters, x3 leaves
Z x1 x2 x3 x4 x5 b Ratio
1 0 −4 6 0 0 24
0 1 0 1 0 0 4 —
0 0 2 −3 1 0 6 6/2 = 3
0 0 2 0 0 1 12 12/2 = 6
−→ x2 enters, x4 leaves
Issues: Multiple Solutions
We finally reach the optimal tableau (all modified cost coefficients ≤ 0)
Z   x1   x2   x3    x4    x5   b
1   0    0    0     2     0    36
0   1    0    1     0     0    4
0   0    1    −3/2  1/2   0    3
0   0    0    3     −1    1    6
The non-basic variables are x3, x4 with Z∗ = 36− 2x∗4.
We get optimal Z∗ = 36 when x∗4 = 0, but any choice of x∗3 gives this value (provided
≥ 0 constraints satisfied)!
If a non-basic variable has zero modified cost coefficient (at the solution), there are
multiple optimal solutions. Parametrise them by setting x∗i = ti for each of these variables.
Issues: Multiple Solutions
What are all the optimal solutions? We already have x∗4 = 0.
Let all “zero cost non-basic variables” be equal to a parameter: here, x∗3 = t for some
t ∈ R.
Then constraints give

    x∗1 + x∗3 = 4,    x∗2 − (3/2)x∗3 + (1/2)x∗4 = 3,    3x∗3 − x∗4 + x∗5 = 6

With x∗3 = t (parameter) and x∗4 = 0, we get

    x∗1 = 4 − t,    x∗2 = 3 + (3/2)t,    x∗5 = 6 − 3t

Valid range of t is whatever ensures all x∗i ≥ 0 (e.g. x∗1 ≥ 0 requires t ≤ 4).
Here, the valid range is 0 ≤ t ≤ 2 (check!), so the set of optimal solutions is:

    (x∗1, x∗2) = (4 − t, 3 + (3/2)t),   ∀t ∈ [0, 2]   (all with Z∗ = 36)
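As a quick numerical sanity check (a minimal sketch of our own, sampling a few values of the parameter t):

# Every point on the optimal segment attains Z* = 36.
import numpy as np

for t in np.linspace(0, 2, 5):
    x1, x2 = 4 - t, 3 + 1.5 * t
    assert np.isclose(6 * x1 + 4 * x2, 36)
    print(round(t, 2), (x1, x2))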
Summary
Summary: Simplex Method for LPs in Standard Form
1. Add slack variables xn+1, . . . , xn+m to the m inequalities Ax ≤ b
2. Initialise x1 = · · · = xn = 0 (decision variables = non-basic variables) and
(xn+1, . . . , xn+m) = (b1, . . . , bm) (slack variables = basic variables)
3. At each iteration
– Non-basic variable xi with most negative Z-row entry enters the basis
– Basic variable xj with aji > 0 and minimal ratio bj/aji leaves the basis
– Eliminate xi column using xj row as a pivot
4. Continue until all Z-row coefficients are ≥ 0
Issues:
• Ties for entering/leaving? Always choose variable with smallest index
• No basic variable has aji > 0? Problem unbounded
• Final Z-row coefficient is zero for non-basic variable? Multiple solutions (pick those
columns as parameters)
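The whole algorithm fits in a short program. Below is a minimal Python/NumPy sketch of the tableau method summarised above (our own teaching code, assuming standard form with b ≥ 0; argmin breaks ties by first occurrence, which matches the smallest index rule for the entering variable):

# A compact tableau implementation of the simplex method.
import numpy as np

def simplex(c, A, b):
    m, n = A.shape
    T = np.zeros((m + 1, n + m + 1))   # tableau: Z row on top, then [A | I | b]
    T[0, :n] = -c
    T[1:, :n] = A
    T[1:, n:n + m] = np.eye(m)
    T[1:, -1] = b
    basis = list(range(n, n + m))      # slack variables start as the basis
    while True:
        j = int(np.argmin(T[0, :-1]))  # entering column: most negative Z entry
        if T[0, j] >= 0:
            break                      # all Z-row entries >= 0: optimal
        ratios = np.array([T[i, -1] / T[i, j] if T[i, j] > 1e-12 else np.inf
                           for i in range(1, m + 1)])
        if np.all(np.isinf(ratios)):
            raise ValueError("problem is unbounded")
        i = int(np.argmin(ratios)) + 1  # leaving row: smallest ratio
        basis[i - 1] = j
        T[i] = T[i] / T[i, j]           # pivot: scale row, then eliminate
        for k in range(m + 1):
            if k != i:
                T[k] = T[k] - T[k, j] * T[i]
    x = np.zeros(n + m)
    x[basis] = T[1:, -1]
    return x[:n], T[0, -1]

x, Z = simplex(np.array([3., 5.]),
               np.array([[1., 0.], [3., 2.], [0., 2.]]),
               np.array([4., 18., 12.]))
print(x, Z)  # expect (2, 6) with Z = 36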
3.4 Non-Standard LPs
Non-Standard LPs
The simplex method that we have seen so far will solve any problem in standard form
    max_{x∈Rn}  Z = cTx   s.t.   Ax ≤ b, x ≥ 0   (b ≥ 0)
What happens if our problem is not in this form? Try to change it to standard form!
What types of non-standard things are there (while still being an LP)?
• min rather than max
• Resource elements bi < 0
• Resource constraints Ax ≤ b are = or ≥ instead
• Different sign constraints: xi ≤ 0 or xi ∈ R (unconstrained)
All of these can be handled, but some require an extended version of the simplex algorithm
(“two-phase”).
Non-Standard LPs
Some of these are easy to deal with.
Min vs. max:
min_x Z = cTx is equivalent to max_x Z̃ = (−c)Tx, so just change the sign of c. Then
Z∗ = −Z̃∗ with the same optimal point x∗.
Negative decision variables:
If xi ≤ 0, define a new variable x̃i = −xi, so x̃i ≥ 0. Replace cixi with (−ci)x̃i in the
objective, and do the same in the Ax ≤ b constraints.
No sign constraints:
If xi ∈ R is allowed, define two new variables x̂i ≥ 0 and x̃i ≥ 0 such that xi = x̂i − x̃i.
Replace cixi with cix̂i + (−ci)x̃i in the objective (similar for constraints).
Idea: x̂i = max(xi, 0) is the positive part of xi, x̃i = max(−xi, 0) is the negative part.
Non-Standard LPs
Negative resource elements:
Suppose bi < 0 for some constraint i. Then
ai1x1 + · · ·+ ainxn ≤ bi ⇐⇒ −ai1x1 − · · · − ainxn ≥ (−bi)
So, bi < 0 with ≤ constraint is equivalent to bi > 0 with ≥ constraint.
Idea: make this sign change, and handle as bi > 0 with ≥ constraint.
≥ constraints:
Suppose we have
ai1x1 + · · ·+ ainxn ≥ bi
with bi ≥ 0. (If bi < 0, multiply both sides by −1 first, as above, to get standard form.)
If bi ≥ 0, add surplus variable to convert to equality (like opposite of slack variable)
ai1x1 + · · ·+ ainxn − xn+1 = bi, xn+1 ≥ 0
Non-Standard LPs
Equality constraints:
Do nothing! Just add as another row to the tableau.
(this includes ≥ constraints with surplus variables and bi < 0 cases)
Question
Is it really that simple?
Non-Standard LPs
Question
Is it really that simple?
No! The initialisation step of simplex can fail.
We initialise at the feasible corner point x1 = · · · = xn = 0, and (xn+1, . . . , xn+m) =
(b1, . . . , bm). But this point now isn’t feasible!
• Equality constraints: ai1x1 + · · ·+ ainxn = bi aren’t satisfied
• ≥ constraints: ai1x1 + · · ·+ ainxn ≥ bi with bi > 0 aren’t satisfied
– Equivalently, surplus variables are not ≥ 0
– Can be caused by ≤ constraints with bi < 0 (converted to ≥ with bi > 0 as
above)
We need a different way to find our first feasible corner point. After that, everything
works as before.
Non-Standard LPs
Example:
max_{x∈R2}  Z = 3x1 + 5x2          =⇒    Z − 3x1 − 5x2 = 0
s.t.  x1 ≤ 4                               x1 + x3 = 4
      3x1 + 2x2 ≥ 18                       3x1 + 2x2 − x4 = 18
      2x2 ≤ 12                             2x2 + x5 = 12
      x1, x2 ≥ 0                           x1, x2, x3, x4, x5 ≥ 0
(note x3 and x5 are slacks, x4 is surplus)
We would normally start with x1 = x2 = 0, giving x3 = 4, x4 = −18 and x5 = 12, but
this point is not feasible.
Artificial Variables
How to handle ≥ and = constraints?
Add even more extra variables! For every ≥ and = constraint, add an extra artificial
variable (as well as slack & surplus variables).
The above problem becomes (x3, x5 slack, x4 surplus, x6 artificial)
Z − 3x1 − 5x2 = 0
x1 + x3 = 4
3x1 + 2x2 − x4 + x6 = 18
2x2 + x5 = 12
x1, x2, x3, x4, x5, x6 ≥ 0
Now it’s easy to find a feasible corner point: decision and surplus variables non-basic
(= 0), slack and artificial variables basic (≥ 0).
Here, take x1 = x2 = x4 = 0 (non-basic), then (x3, x6, x5) = (4, 18, 12). All xi ≥ 0 so this is
feasible!
Artificial Variables
Unlike slack & surplus variables, Adding artificial variables changes the LP we are
solving.
With slack & surplus variables inside x, we have
max
x
Z = cTx s.t. Ax = b, x ≥ 0
We can’t easily find a feasible corner point of this problem.
Adding artificial variables (call them x), we have
max
x,x
Z = cTx s.t. Ax+ x = b, x,x ≥ 0
They are only the same problem if x = 0.
If we have a feasible corner point (x,x) of the first problem with x = 0, then x is
also a feasible corner point of the first problem. But, we can find a feasible corner point
of the second problem very easily.
Idea: can we use the second set of constraints in an LP to find x with x = 0?
Artificial Variables
Lemma 3.3. Consider the two LPs

    (1)  max_{x∈Rn}  Z = cTx          (2)  max_{x∈Rn, x̄∈Rm}  W = −(x̄1 + · · · + x̄m)
         s.t.  Ax = b                      s.t.  Ax + x̄ = b
               x ≥ 0                             x, x̄ ≥ 0

Then (1) is feasible (i.e. constraint set nonempty) if and only if W∗ = 0 for (2).
Proof. Since x̄i ≥ 0, we know W∗ ≤ 0, and W∗ = 0 if and only if x̄∗ = 0 (with some x∗).
The point (x∗, x̄∗) with x̄∗ = 0 is feasible for (2) if and only if Ax∗ = b and x∗ ≥ 0 (i.e. x∗
from (2) is feasible for (1)).
Two-Phase Simplex Method
This motivates the following idea (two-phase simplex method):
• (Phase 1) Solve problem (2) using the simplex method to get (x∗, x̄∗) and W∗ (initial
point easy to find)
• If W∗ ≠ 0 then the original problem (1) is infeasible — stop
• If W∗ = 0 (i.e. x̄∗ = 0), then x∗ is a feasible corner point for (1)
• (Phase 2) Solve (1) using the simplex method with x∗ as the initial point
This approach works for all LPs, including those not in standard form (after adding
slack, surplus & artificial variables).
If we are lucky (e.g. no artificial variables), then we do not need phase 1.
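In practice, library solvers deal with non-standard constraints for us. A minimal SciPy check (our own illustration) of the example solved below: linprog accepts a ≥ constraint after negating it into ≤ form, and handles initialisation internally, so no hand-run phase 1 is needed.

# Check the non-standard example with scipy.optimize.linprog.
import numpy as np
from scipy.optimize import linprog

c = np.array([3., 5.])
A_ub = np.array([[1., 0.],     # x1 <= 4
                 [-3., -2.],   # 3x1 + 2x2 >= 18, multiplied by -1
                 [0., 2.]])    # 2x2 <= 12
b_ub = np.array([4., -18., 12.])

res = linprog(-c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * 2)
print(res.x, -res.fun)  # expect (4, 6) with Z = 42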
Two-Phase Simplex Method
Let’s use the two-phase simplex method to solve
    max_{x∈R2}  Z = 3x1 + 5x2
    s.t.  x1 ≤ 4
          3x1 + 2x2 ≥ 18
          2x2 ≤ 12
          x1, x2 ≥ 0
Start by adding slack variables to ≤ constraints, surplus & artificial variables to ≥
constraint.
    max_{x∈R5, x6∈R}  Z = 3x1 + 5x2
    s.t.  x1 + x3 = 4
          3x1 + 2x2 − x4 + x6 = 18
          2x2 + x5 = 12,   with x1, x2, x3, x4, x5, x6 ≥ 0
Two-Phase Simplex Method
Since we have artificial variables, our usual starting point is infeasible. We have to
solve the phase 1 problem instead.
    max_{x∈R5, x6∈R}  W = −x6
    s.t.  x1 + x3 = 4
          3x1 + 2x2 − x4 + x6 = 18
          2x2 + x5 = 12,   with x1, x2, x3, x4, x5, x6 ≥ 0
Here, we have a good initial point:
• Non-basic variables (decision & surplus): x1 = x2 = x4 = 0
• Basic variables (slack & artificial): (x3, x5, x6) = (4, 12, 18)
We always need to write W in terms of the non-basic variables:

    W = −x6 = 3x1 + 2x2 − x4 − 18
Two-Phase Simplex Method
To speed up phase 2, it helps to include the original objective Z in the tableau too
(not used except for elimination).
W Z x1 x2 x3 x4 x5 x6 b Ratio
1 0 −3 −2 0 1 0 0 −18
0 1 −3 −5 0 0 0 0 0
0 0 1 0 1 0 0 0 4 4
0 0 3 2 0 −1 0 1 18 6
0 0 0 2 0 0 1 0 12 —
−→ x1 enters, x3 leaves
Eliminate x1 column using pivot from x3 row (including putting zero in Z row, useful for
later).
Two-Phase Simplex Method
W Z x1 x2 x3 x4 x5 x6 b Ratio
1 0 0 −2 3 1 0 0 −6
0 1 0 −5 3 0 0 0 12
0 0 1 0 1 0 0 0 4 —
0 0 0 2 −3 −1 0 1 6 3
0 0 0 2 0 0 1 0 12 6
−→ x2 enters, x6 leaves
W   Z   x1   x2   x3    x4    x5   x6    b
1   0   0    0    0     0     0    1     0
0   1   0    0    −9/2  −5/2  0    5/2   27
0   0   1    0    1     0     0    0     4
0   0   0    1    −3/2  −1/2  0    1/2   3
0   0   0    0    3     1     1    −1    6

−→ W row ≥ 0, phase 1 done
Two-Phase Simplex Method
W   Z   x1   x2   x3    x4    x5   x6    b
1   0   0    0    0     0     0    1     0
0   1   0    0    −9/2  −5/2  0    5/2   27
0   0   1    0    1     0     0    0     4
0   0   0    1    −3/2  −1/2  0    1/2   3
0   0   0    0    3     1     1    −1    6
So the solution to phase 1 is
• Non-basic variables x3, x4, x6 with x6 = 0 (x3, x4 are parameters since zero cost
coefficient)
• Basic variables x1, x2, x5
• Optimal objective value W ∗ = 0 — phase 1 succeeded! (original LP feasible)
Any solution to phase 1 is ok (just need one feasible corner point), e.g. pick x3 = x4 = 0
giving (x1, x2, x5) = (4, 3, 6). At this point, the original objective is Z = 27.
Two-Phase Simplex Method
Phase 1 found a feasible corner point for the original problem: (x1, x2, x3, x4, x5) =
(4, 3, 0, 0, 6) with Z = 27.
We can just continue with the same tableau (dropping W row and x6 now as not
needed).
Z   x1   x2   x3    x4    x5   b    Ratio
1   0    0    −9/2  −5/2  0    27
0   1    0    1     0     0    4    4
0   0    1    −3/2  −1/2  0    3    —
0   0    0    3     1     1    6    2

−→ x3 enters, x5 leaves
Z   x1   x2   x3   x4    x5    b    Ratio
1   0    0    0    −1    3/2   36
0   1    0    0    −1/3  −1/3  2    —
0   0    1    0    0     1/2   6    —
0   0    0    1    1/3   1/3   2    6

−→ x4 enters, x3 leaves
Two-Phase Simplex Method
Finally, phase 2 arrives at
Z   x1   x2   x3   x4   x5    b
1   0    0    3    0    5/2   42
0   1    0    1    0    0     4
0   0    1    0    0    1/2   6
0   0    0    3    1    1     6

−→ solution found!
The solution to the original problem is (x∗1, x∗2) = (4, 6) with (x∗3, x∗4, x∗5) = (0, 6, 0) and
Z∗ = 42.
Big-M Method
There is an alternative to the two-phase simplex method (when we have artificial
variables), called the big-M method.
Instead of maxx Z = cTx, change the objective to
max
x,x
Z = Z −MW = cTx−M
m∑
i=1
xi,
where M ≫ 1 is some large positive number.
This combines our two objectives from phases 1 & 2, trying to maximise them
simultaneously. Picking M large means “we really care about making xi = 0” (which we
do because we want to solve the original problem!).
This type of extra term (that encourages something you want, like xi = 0) is called a
penalty. It is used in many areas of optimisation, including Karmarkar’s 1984 interior
point method (but he didn’t know that at the time!).
Big-M Method
To use the big-M method, add slack, surplus & artificial variables as before. For
example, the above problem starts as
Z x1 x2 x3 x4 x5 x6 b
1 −3 −5 0 0 0 M 0
0 1 0 1 0 0 0 4
0 3 2 0 −1 0 1 18
0 0 2 0 0 1 0 12
But, we always start by writing the objective (Z) in terms of non-basic variables x1, x2, x4.
In this case, Ẑ = 3x1 + 5x2 − Mx6 is written as
Ẑ = 3x1 + 5x2 − M(18 − 3x1 − 2x2 + x4) = (3 + 3M)x1 + (5 + 2M)x2 − Mx4 − 18M
The first row of the tableau is then actually
Z x1 x2 x3 x4 x5 x6 b
1 −(3 + 3M) −(5 + 2M) 0 M 0 0 −18M
Big-M Method
Advantages of the big-M method:
• Only one run of simplex method needed
– But can still take as many steps as doing two phases!
Disadvantages of the big-M method:
• Don’t know a suitable M in advance (M must be picked large enough that all x̄i = 0 at the solution)
• Each step of simplex is more complicated (including unknown variable)
Two-phase simplex is much more widely used than the big-M method.
Non-Standard LPs: Summary
Summary of solving non-standard LPs:
• min Z problem: change to max Ẑ for Ẑ = −Z
• Negative variables, xi ≤ 0: introduce x̃i = −xi with x̃i ≥ 0
• Unsigned variables, xi ∈ R: introduce x̂i, x̃i ≥ 0 with xi = x̂i − x̃i
• Negative resource coefficient bi < 0: multiply inequality by −1 (flips inequality sign)
• ≥ constraint: add surplus variable, ai1x1 + · · · + ainxn − xn+1 = bi for xn+1 ≥ 0
Then, add artificial variables x̄i to all ≥ and = constraints.
• Phase 1: maximise W = −∑i x̄i. Start with decision & surplus variables non-basic
(zero), slack & artificial variables basic (positive)
• If W∗ = 0, then all x̄∗i = 0. Start phase 2 from these optimal decision/surplus/slack
variables (drop W and the x̄i from the tableau)
• If W∗ < 0, the original problem is infeasible (no solution)
3.5 Duality
Duality
A surprising theoretical result is that LPs come in pairs: for every LP (called the
primal problem) there is a different LP (called the dual). The primal and dual problems
are closely related.
Motivating example (from tutorials)
A gardener wants to prepare a fertiliser by mixing 2 ingredients: GROW and THRIVE.
Each kg of GROW contains 4g of nitrogen and 4g of phosphate, while each kg of THRIVE
contains 10g of nitrogen and 2g of phosphate. The final product must contain at least
100g of nitrogen and at least 60g of phosphate. If GROW costs $1.00/kg and THRIVE
costs $1.50/kg, how many kg of each ingredient should be mixed if the gardener
wants to minimise the cost?
If x1, x2 are the kg of GROW and THRIVE, the corresponding LP is
min_{x∈R²} Z = x1 + 1.5x2 s.t. 4x1 + 10x2 ≥ 100, 4x1 + 2x2 ≥ 60, x1, x2 ≥ 0

We previously found the optimal solution (x∗1, x∗2) = (12.5, 5) with minimum cost
Z∗ = $20.
Duality: Motivation
Now let’s imagine we are a seller of nitrogen and phosphate (to the manufacturer of
GROW and THRIVE). What is our business strategy?
We set the price (per g) of both chemicals to maximise revenue (from guaranteed sale
of 100g N and 60g P to the manufacturer). But the manufacturer will only buy from us if
our prices aren’t too high, so they can sell GROW and THRIVE to the gardener without
making a loss.
Let y1, y2 be our price per g of nitrogen and phosphate. We solve

max_{y∈R²} V = 100y1 + 60y2
s.t. 4y1 + 4y2 ≤ 1 (cost to make GROW)
     10y1 + 2y2 ≤ 1.5 (cost to make THRIVE)
     y1, y2 ≥ 0

This is the dual of the gardener’s (cost minimisation) problem.
Dual Problem
The two problems are:
min_{x∈R²} Z = x1 + 1.5x2            max_{y∈R²} V = 100y1 + 60y2
s.t. 4x1 + 10x2 ≥ 100                s.t. 4y1 + 4y2 ≤ 1
     4x1 + 2x2 ≥ 60                       10y1 + 2y2 ≤ 1.5
     x1, x2 ≥ 0                           y1, y2 ≥ 0
If one is the primal, the other is the dual (i.e. the dual of the dual is the primal).
Think about the units of each quantity (dimensional analysis):
• Z = cTx is $ and x is units of product, so c is $ per unit of product
• V = bTy is $ and b is units of resource, so y is $ per unit of resource
Here, the dual variables y measure the monetary value of the resources in the primal.
Derivation of Dual Problem
We can derive the general form of the dual problem in a different way: trying to
estimate the solution of the primal problem!
Consider the drug example from earlier
max_{x∈R²} Z = 3x1 + 5x2
s.t. x1 ≤ 4
     3x1 + 2x2 ≤ 18
     2x2 ≤ 12
     x1, x2 ≥ 0
Suppose we don’t know the solution and can’t be bothered to find it. Can we estimate
how large Z∗ will be? The first and third constraints give
Z∗ = 3x∗1 + 5x∗2 ≤ 3(4) + 5(6) = 42
Derivation of Dual Problem
This isn’t the only option: the second and third constraints give
Z∗ = 3x∗1 + 5x∗2 = (3x∗1 + 2x∗2) + 3x∗2 ≤ (18) + 3(6) = 36
We get a better estimate this way. What are all the possible estimates we could get using
this idea, and which one gives the best estimate?
Try to upper bound with any linear combination of constraints (weights y1, y2, y3):
Z∗ = 3x∗1 + 5x∗2 ≤ y1(x∗1) + y2(3x∗1 + 2x∗2) + y3(2x∗2) ≤ 4y1 + 18y2 + 12y3
Since all x∗i ≥ 0, the first inequality only holds if
y1 + 3y2 ≥ 3 and 2y2 + 2y3 ≥ 5
The second inequality holds if all yi ≥ 0.
Then, pick yi to get the best (i.e. smallest) estimate for Z∗ ≤ 4y1 + 18y2 + 12y3.
min_{y∈R³} 4y1 + 18y2 + 12y3 s.t. y1 + 3y2 ≥ 3, 2y2 + 2y3 ≥ 5, y1, y2, y3 ≥ 0 (dual)
Derivation of Dual Problem
In general, suppose we have an LP in standard form

max_{x∈Rⁿ} Z = cᵀx s.t. Ax ≤ b, x ≥ 0

There are m constraints Ax ≤ b. Take a linear combination of these constraints with
weights y ∈ Rᵐ. An upper bound for Z is Z ≤ V where

Z = cᵀx ≤ ∑_{i=1}^m yi(ai1x1 + · · · + ainxn) ≤ bᵀy =: V

Given x ≥ 0, the first inequality holds if (compare coefficients of xj)

∑_{i=1}^m yi aij ≥ cj ∀j = 1, . . . , n ⟺ Aᵀy ≥ c

Given Ax ≤ b, the second inequality holds if y ≥ 0. Choosing the weights y to give the
smallest upper bound V gives the dual:

min_{y∈Rᵐ} V = bᵀy s.t. Aᵀy ≥ c, y ≥ 0
Weak Duality
So the standard form for our primal and dual LPs are
max_{x∈Rⁿ} Z = cᵀx            min_{y∈Rᵐ} V = bᵀy
s.t. Ax ≤ b                   s.t. Aᵀy ≥ c
     x ≥ 0                         y ≥ 0
Note: the dual of the dual is the primal (so LPs really do come in pairs).
The argument on the previous page gives us a useful result, called weak duality. It
says that any feasible y makes V = bᵀy an upper bound for any feasible Z = cᵀx (above
we proved this for Z∗, the largest possible Z).
Theorem 3.4 (Weak Duality). If x and y are any feasible points for the primal and dual
problems respectively, then
Z = cTx ≤ bTy = V
Weak Duality
For any feasible x and y, we know Z ≤ V . So, this must be true for the solutions x∗
and y∗; i.e. the optimal values satisfy Z∗ ≤ V ∗.
How large is this duality gap? Does any upper bound V ∗ give exactly the true
maximum Z∗?
Theorem 3.5. If x∗ and y∗ are feasible for the primal and dual problems respectively such
that cTx∗ = bTy∗, then they are both solutions to their respective problems and Z∗ = V ∗.
Proof. Using weak duality and cTx∗ = bTy∗, for any primal feasible x we have
cTx ≤ bTy∗ = cTx∗,
so Z = cTx is maximised at x∗ (i.e. x∗ is optimal and Z∗ = cTx∗).
Similarly, for any dual feasible y we have bTy ≥ cTx∗ = bTy∗ so y∗ is optimal.
Strong Duality
Does any upper bound V ∗ give exactly the true maximum Z∗?
The above theorem answers the opposite question (if Z = V then we are at a solution).
If we are at a solution, then do we have Z = V ?
Yes! This is called strong duality. We actually have a little bit more.
Theorem 3.6 (Strong Duality). For any primal/dual pair of LPs:
• If either problem has a solution, then so does the other, and Z∗ = V ∗
• If one problem is unbounded, then the other is infeasible
We will start by proving: “if the primal has a solution, then the dual has a solution
and Z∗ = V ∗”.
Strong Duality: Proof
Proof: the primal problem is in standard form, so we first add slack variables

xn+i = bi − ∑_{j=1}^n aij xj

Suppose we have run the (one-phase) simplex method and found a solution x∗. The top
row of the tableau has non-negative coefficients and the last column has Z∗:

Z − ∑_{k=1}^{n+m} c̄k xk = Z∗ with c̄k ≤ 0

for any feasible x (remember the Z row contains the negated cost coefficients −c̄k ≥ 0,
and c̄k = 0 for basic variables).
Define y∗i = −c̄n+i for i = 1, . . . , m.
Claim: y∗ is feasible for the dual problem and cᵀx∗ = bᵀy∗. If this is true, then we
are done: y∗ is the dual solution and V∗ = Z∗ (from the previous theorem).
Strong Duality: Proof
Proof (of claim): From above, for any feasible x

cᵀx = Z = Z∗ + ∑_{k=1}^{n+m} c̄k xk = Z∗ + ∑_{k=1}^{n} c̄k xk + ∑_{i=1}^{m} c̄n+i xn+i

But y∗i = −c̄n+i and we have slacks xn+i = bi − ∑_{j=1}^n aij xj, so

cᵀx = Z∗ + ∑_{k=1}^{n} c̄k xk − ∑_{i=1}^{m} y∗i (bi − ∑_{j=1}^n aij xj)
Rearrange to collect the coefficients of x1, . . . , xn and (all other terms):

∑_{j=1}^n cj xj = (Z∗ − ∑_{i=1}^m y∗i bi) + ∑_{j=1}^n (c̄j + ∑_{i=1}^m aij y∗i) xj

This holds for all primal feasible x, so the constant term must be zero and the xj
coefficients must be equal.
Strong Duality: Proof
Proof (of claim, cont’d): That is,

Z∗ − ∑_{i=1}^m y∗i bi = 0 and c̄j + ∑_{i=1}^m aij y∗i = cj ∀j = 1, . . . , n

The first equation is bᵀy∗ = Z∗.
The second equation (in vector form) is c̄ + Aᵀy∗ = c. Since c̄ ≤ 0 (negated cost
coefficients are non-negative at the solution), we have Aᵀy∗ ≥ c.
Lastly, y∗i = −c̄n+i and c̄ ≤ 0 gives y∗ ≥ 0, so y∗ is feasible for the dual problem.
So, y∗ is dual feasible and Z∗ = bᵀy∗, so the claim is true (and the above theorem says
y∗ is the solution of the dual, so Z∗ = bᵀy∗ = V∗).
Strong Duality: Proof
We can prove another part of strong duality easily: “if the primal is unbounded then
the dual is infeasible”.
Proof: If the primal is unbounded, there exist feasible points x1, x2, . . . with cᵀxk → +∞.
To find a contradiction, suppose the dual is feasible (let y be any feasible point).
Since cTxk → +∞, there is some k with cTxk > bTy.
But xk and y are feasible, so this violates weak duality (contradiction!). Hence the
dual must be infeasible.
The other parts of the strong duality theorem follow using similar arguments to these.
Strong Duality: Practicalities
• If either the primal or dual has an optimal solution, then so does the other, and Z∗ = V∗
• The primal is infeasible ⇐⇒ the dual is unbounded
• The primal is unbounded ⇐⇒ the dual is infeasible
Remember, the simplex method can identify if a problem is unbounded (no variable leaves
the basis). Even better, since y∗i = −cn+i, the value of y∗ is given in the final simplex
tableau: y∗i is the Z-row tableau entry for slack xn+i. Simplex solves the primal and dual
problems simultaneously!
Consider if solving the dual problem is easier than the primal problem (you get the
answer to both anyway)
• If the primal has m = 2 constraints (but n > 2 decision variables), then the dual is
in R2 and can be solved graphically
• Primal simplex usually needs 3m steps, so dual usually takes 3n steps (quicker if
m≫ n)
• The dual may not need the two-phase simplex method
Strong Duality: Example
min_{x∈R²} Z = 3x1 + 2x2
s.t. 8x1 + 3x2 ≥ 24
     5x1 + 6x2 ≥ 30
     2x1 + 9x2 ≥ 18
     x1, x2 ≥ 0
To solve this, we need to add surplus & artificial variables to handle the ≥ constraints.
Solving requires the two-phase simplex method. But the dual problem is in standard form
(don’t need phase 1):
max_{y∈R³} V = 24y1 + 30y2 + 18y3
s.t. 8y1 + 5y2 + 2y3 ≤ 3
     3y1 + 6y2 + 9y3 ≤ 2
     y1, y2, y3 ≥ 0
Strong Duality: Example
Adding slack variables we get
max_{y∈R³} V = 24y1 + 30y2 + 18y3        V − 24y1 − 30y2 − 18y3 = 0
s.t. 8y1 + 5y2 + 2y3 ≤ 3                 8y1 + 5y2 + 2y3 + y4 = 3
     3y1 + 6y2 + 9y3 ≤ 2                 3y1 + 6y2 + 9y3 + y5 = 2
     y1, y2, y3 ≥ 0                      y1, y2, y3, y4, y5 ≥ 0
As usual, our initial feasible corner point starts with the non-basic variables being the
decision variables (y1 = y2 = y3 = 0) and basic variables are the slacks (y4, y5) = (3, 2).
V y1 y2 y3 y4 y5 b
1 −24 −30 −18 0 0 0
0 8 5 2 1 0 3
0 3 6 9 0 1 2
Strong Duality: Example
Skipping the working, simplex takes two steps to find a solution:

−→

V y1   y2 y3    y4 y5   b
1 −9   0  27    0  5    10
0 11/2 0  −11/2 1  −5/6 4/3
0 1/2  1  3/2   0  1/6  1/3

−→

V y1 y2 y3 y4    y5    b
1 0  0  18 18/11 40/11 134/11
0 1  0  −1 2/11  −5/33 8/33
0 0  1  2  −1/11 8/33  7/33

Reading off as usual, the dual solution is

(y∗1, y∗2, y∗3) = (8/33, 7/33, 0) with V∗ = 134/11

What is the primal solution? Look at the objective (V) row entries for the slack variables!

(x∗1, x∗2) = (18/11, 40/11)

We could calculate Z∗ using Z = 3x1 + 2x2, but there’s no need. We automatically know
Z∗ = V∗ = 134/11 from strong duality.
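In fact, most LP software reports the dual solution alongside the primal one. Here is a
sketch with SciPy's linprog (the ineqlin.marginals attribute belongs to the HiGHS-based
interface; older SciPy versions may not expose it, and the sign convention can vary):

    # Solve the primal (min) problem and read off the dual values (sketch).
    from scipy.optimize import linprog

    c = [3, 2]
    A_ub = [[-8, -3],   # 8x1 + 3x2 >= 24, flipped to <=
            [-5, -6],   # 5x1 + 6x2 >= 30
            [-2, -9]]   # 2x1 + 9x2 >= 18
    b_ub = [-24, -30, -18]

    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)])
    print(res.x)                   # primal solution, expect (18/11, 40/11)
    print(res.fun)                 # expect 134/11
    print(res.ineqlin.marginals)   # duals: expect (8/33, 7/33, 0) up to sign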
Duality for Non-Standard LPs
For non-standard LPs, the following table illustrates the types of constraints you get
in the primal vs. dual.
Primal                        Dual
max_x cᵀx                     min_y bᵀy
i-th constraint (Ax)i ≤ bi    yi ≥ 0
i-th constraint (Ax)i ≥ bi    yi ≤ 0
i-th constraint (Ax)i = bi    yi any sign
xj ≥ 0                        j-th constraint (Aᵀy)j ≥ cj
xj ≤ 0                        j-th constraint (Aᵀy)j ≤ cj
xj any sign                   j-th constraint (Aᵀy)j = cj
(a mix of constraint/variable types on the left ←→ the matching mix on the right)
Interpretation of Duality
The idea that y∗i represents the “price” of resource bi (see the GROW/THRIVE example)
can be made precise in the general case.
Imagine we get a small amount extra of the ith resource, so bi becomes bi + ϵ. This
perturbs our feasible region slightly.
Assuming our optimal solution is at the same vertex (just perturbed), we have a new
optimal value Z∗(ϵ). How much better/worse is Z∗(ϵ)? (or, what is ∂Z∗/∂bi?)
From strong duality, Z∗ = bᵀy∗ = ∑_{i=1}^m bi y∗i, so ∂Z∗/∂bi = y∗i (this can be made
more rigorous).
So, y∗i represents how much we could increase Z∗ if we had more of resource bi. If Z
represents profit, y∗i tells us how much we would be willing to pay for more bi. This is
useful information in practice (how should we allocate extra budget to buy resources?).
For example, if ai1x∗1 + · · · + ainx∗n < bi at the solution, then the slack variable must be
strictly positive (a basic variable), so y∗i = 0 (its Z-row entry at the solution).
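We can see this numerically on the drug example (where Z∗ = 36, and solving the dual
derived earlier gives y∗ = (0, 1, 3/2) — check!). The sketch below perturbs b2 = 18 and
estimates ∂Z∗/∂b2 by a finite difference; this assumes SciPy is available and is purely
illustrative.

    # Estimate the shadow price of resource 2 by perturbation (sketch).
    from scipy.optimize import linprog

    def z_star(b2):
        # max 3x1 + 5x2 s.t. x1 <= 4, 3x1 + 2x2 <= b2, 2x2 <= 12
        res = linprog([-3, -5], A_ub=[[1, 0], [3, 2], [0, 2]],
                      b_ub=[4, b2, 12], bounds=[(0, None), (0, None)])
        return -res.fun

    eps = 0.01
    print((z_star(18 + eps) - z_star(18)) / eps)   # expect y2* = 1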
3.6 Complexity [not examinable]
Complexity [not examinable]
Many practical LPs are extremely large. How many steps does the simplex method take
(as a function of problem size)?
If we have n decision variables and m constraints, the simplex method takes between
2m and 3m steps on most practical problems.
There are some problems that are much slower: up to 2ⁿ steps! Every known rule
for picking variables to enter/leave has this issue.
• 1947 (Dantzig): simplex method developed
• 1972 (Klee & Minty): first example of simplex showing exponentially slow behaviour
• 1979 (Khachiyan): first non-simplex method which always works in polynomial time
• 1984 (Karmarkar): first practical polynomial time (non-simplex) method — NYT
article!
• 2004 (Spielman & Teng): simplex method takes polynomial time for almost all
problems
• Open question: (non-random) version of simplex which always takes polynomial
time?
Complexity [not examinable]
Klee & Minty example (1972): the feasible region is a deformed (hyper)cube, and simplex
passes through every corner! Such a problem can be built for any choice of n.
Examples (n = 2 and n = 3):

max_{x∈R²} Z = 10x1 + x2          max_{x∈R³} Z = 9x1 + 3x2 + x3
s.t. x1 ≤ 1                       s.t. x1 ≤ 1
     20x1 + x2 ≤ 100                   6x1 + x2 ≤ 9
     x1, x2 ≥ 0                        18x1 + 6x2 + x3 ≤ 81
                                       x1, x2, x3 ≥ 0

Both take 2ⁿ − 1 steps to solve (try it!).
Interior Point Methods [not examinable]
The “startling theoretical breakthrough” by Karmarkar in 1984 was the development of
interior point methods (IPMs) for solving LPs. IPMs can always solve an LP in a polynomial
number of steps (depending on n and m), unlike the simplex method.
In essence, IPMs do something similar to the big-M method by adding a penalty to
the objective. Using standard form (after adding slack variables), we convert

max_{x∈Rⁿ} Z = cᵀx s.t. Ax = b, x ≥ 0   =⇒   max_{x∈Rⁿ} f(x) = cᵀx + ϵ ∑_{i=1}^n log(xi) s.t. Ax = b

The log(xi) terms guarantee xi > 0. We solve a sequence of problems taking ϵ → 0
(allowing xi → 0 to occur if needed).
Interior Point Methods [not examinable]
max_{x∈Rⁿ} f(x) = cᵀx + ϵ ∑_{i=1}^n log(xi) s.t. Ax = b

This isn’t an LP! We have to solve it using other optimisation methods.²
Intuition: simplex method goes around the boundary of the feasible region, visiting
corners. IPMs go through the interior of the feasible region, heading towards the boundary
(where the solution is).
Both simplex and IPMs work extremely well in practice. Simplex usually preferred for
small LPs, IPMs preferred for large LPs.
2Roughly: write the KKT conditions (next topic!) as a nonlinear system of equations, do 1 iteration
of Newton’s method, decrease ϵ & repeat...
Chapter 4
Nonlinear Optimisation
Nonlinear Optimisation
We now turn our attention to nonlinear optimisation (sometimes called nonlinear
programming, NLP). This covers optimisation problems where the objective and/or some
constraints are nonlinear. We will first look at the unconstrained case,

min_{x∈Rⁿ} f(x)

for some nonlinear function f : Rⁿ → R, and then at the constrained case

min_{x∈Rⁿ} f(x) s.t. x ∈ Ω

for some feasible region Ω. Just like for LPs, we will formulate our constraints as equations
or inequalities:

min_{x∈Rⁿ} f(x)
s.t. gi(x) = 0, i = 1, . . . , k
     hj(x) ≤ 0, j = k + 1, . . . , m
Examples
There are many areas where nonlinear optimisation occurs. In this course, we will see
it in:
Fitting probability distributions (week 8 lab)
Suppose we have data x1, . . . , xn coming from some probability distribution with
parameters θ. We can estimate a good θ using

min_θ −∑_{i=1}^n log(f(xi, θ))

where f(x, θ) is the probability density function. (don’t worry if you don’t understand
this, we will cover it later!)
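As a preview, here is a minimal sketch of this idea in Python (the normal distribution and
all names here are our own illustrative choices, not a prescribed lab solution): we minimise
the negative log-likelihood with scipy.optimize.minimize.

    # Fit a normal distribution by maximum likelihood (illustrative sketch).
    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(0)
    data = rng.normal(loc=3.0, scale=2.0, size=1000)   # pretend observed data

    def neg_log_likelihood(theta):
        mu, sigma = theta
        if sigma <= 0:
            return np.inf                              # invalid parameters
        return np.sum(0.5 * np.log(2 * np.pi * sigma**2)
                      + (data - mu)**2 / (2 * sigma**2))

    res = minimize(neg_log_likelihood, x0=[0.0, 1.0], method='Nelder-Mead')
    print(res.x)   # should be close to (3, 2)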
Determining optimal investment strategies (mean-variance portfolio theory)
We will model the problem of choosing how much money xi to invest in stock i =
1, . . . , n as a quadratic program (QP),
min_{x∈Rⁿ} xᵀQx + cᵀx s.t. Ax = b
Examples
Natural gas supply networks: when natural gas flows through a network of pipes
(connecting different cities, for example), each pipe i has a pressure pi and gas flow rate
qi. We also have extra decision variables zi ∈ {0, 1} indicating if a compressor (pump) is
active in a given pipe.
The physics of the network gives constraint equations
Aq − d = 0, Aᵀp² + Kq^{2.8359} = 0, Bᵀq + zᵀc(p, q) = 0,
for some matrices A, K and B, vector d and nonlinear functions c(p, q).
There are usually bound constraints on pi ∈ [pmin, pmax] and qi ∈ [qmin, qmax] too.
Our objective here might be to minimise operating cost, minimise total gas pressure
(safety), or many other possibilities.
4.1 Unconstrained Optimisation
Optimisation in 1D
Let’s start by looking at the simple case of optimising a function in 1D (without
constraints):

min_{x∈R} f(x)
From high school calculus, you know that the way to solve this is to find x such that
f ′(x) = 0. But when does that give you a minimum or maximum? You probably know
this too: look at the sign of f ′′(x) (if it’s nonzero). We can make this more general using
Taylor’s Theorem
Theorem 4.1. Suppose f : R → R is C^{p+1}(R) (i.e. p + 1 times continuously differentiable).
For every a, x in the domain of f, there exists ξ between a and x such that

f(x) = f(a) + f′(a)(x − a) + (1/2)f″(a)(x − a)² + · · · + (1/p!)f^(p)(a)(x − a)^p
       + (1/(p + 1)!)f^(p+1)(ξ)(x − a)^{p+1}
Optimisation in 1D
Intuition for Taylor’s Theorem: if f is smooth, and x is close enough to a, then f(x)
looks like a (low degree) polynomial.
This gives us a way to fully understand extrema (maxima/minima) of f. For example,
if f′(a) = 0 and f″(a) > 0 then

f(x) = f(a) + (1/2)f″(a)(x − a)² + (1/6)f‴(ξ)(x − a)³

If x is close to a, then (x − a)³ ≪ (x − a)², so

f(x) ≈ f(a) + (1/2)f″(a)(x − a)² > f(a)

so a is a minimiser of f (at least compared to nearby points; a “local minimiser”).
Terminology: the optimal decision variable x∗ is called a maximiser/minimiser, the
value f(x∗) is called the maximum/minimum.
Optimisation in 1D
In general, Taylor’s Theorem allows us to say: for some integer m,
• If f′(a) = f″(a) = · · · = f^(2m−1)(a) = 0 and f^(2m)(a) < 0, then a is a local maximiser
• If f′(a) = f″(a) = · · · = f^(2m−1)(a) = 0 and f^(2m)(a) > 0, then a is a local minimiser
• If f′(a) = f″(a) = · · · = f^(2m)(a) = 0 and f^(2m+1)(a) ≠ 0, then a is an inflection
point
These only tell us about the local behaviour of f. They can’t tell us if a is a
maximiser/minimiser compared to all possible values f(x) (a global maximiser/minimiser),
only values nearby.
Similar ideas work in higher dimensions, but it gets very complicated very quickly (as
m increases).
Multivariate Taylor’s Theorem
To look at optimisation in Rn, we need a version of Taylor’s Theorem in higher
dimensions.
Suppose we care about a ∈ Rⁿ and another point x ∈ Rⁿ. Let’s join the two points
via a line (and assume the whole line is in the domain of f, i.e. the domain is convex). The
line is

a + t(x − a), t ∈ [0, 1]

Look at the value of f along this line:

g(t) := f(a + t(x − a)) = f(a + th), where h = x − a

and apply Taylor’s Theorem to g. But what are the derivatives of g?
g′(t) = (d/dt) f(a + th) = (d/dt) f(a1 + th1, . . . , an + thn)

and by applying the chain rule we get

g′(t) = h1 ∂f(a + th)/∂x1 + · · · + hn ∂f(a + th)/∂xn
Multivariate Taylor’s Theorem
g′(t) = h1 ∂f(a + th)/∂x1 + · · · + hn ∂f(a + th)/∂xn

This looks like the dot product of h with something. Give it a name: the gradient of f is

∇f(x) = (∂f(x)/∂x1, . . . , ∂f(x)/∂xn)ᵀ

and so g′(t) = hᵀ∇f(a + th).
For second derivatives, we have

g″(t) = h1 (d/dt) ∂f(a + th)/∂x1 + · · · + hn (d/dt) ∂f(a + th)/∂xn
      = ∑_{i=1}^n hi ( ∑_{j=1}^n hj ∂²f(a + th)/∂xi∂xj )

Again, we can write this succinctly.
Multivariate Taylor’s Theorem
g″(t) = ∑_{i=1}^n hi ( ∑_{j=1}^n hj ∂²f(a + th)/∂xi∂xj )

Define the Hessian matrix

∇²f(x) = [ ∂²f(x)/∂x1²    · · ·  ∂²f(x)/∂x1∂xn ]
         [      ⋮           ⋱          ⋮       ]  ∈ Rⁿˣⁿ
         [ ∂²f(x)/∂xn∂x1  · · ·  ∂²f(x)/∂xn²   ]

then we get g″(t) = hᵀ∇²f(a + th)h.
Important fact: Hessian matrices are always symmetric. That is, ∂²f/∂xi∂xj = ∂²f/∂xj∂xi
for all i and j.
Multivariate Taylor’s Theorem
We have g(t) = f(a + th) with g′(t) = hᵀ∇f(a + th) and g″(t) = hᵀ∇²f(a + th)h.
What about higher derivatives of g?
Same basic pattern, for example

g‴(t) = ∑_{i,j,k=1}^n (∂³f(a + th)/∂xi∂xj∂xk) hi hj hk =: ∇³f(a + th)[h, h, h]

and so on.
Applying Taylor’s Theorem in 1D (with t = 1 so a + th = x) we get multivariate
Taylor’s Theorem:

f(x) = f(a) + ∇f(a)ᵀ(x − a) + (1/2)(x − a)ᵀ∇²f(a)(x − a) + · · ·
       + (1/p!)∇^p f(a)[x − a, . . . , x − a] + remainder
Multivariate Taylor’s Theorem
For our purposes in optimisation, it is most convenient to only look up to the quadratic
terms: if x ≈ a then

f(x) ≈ f(a) + ∇f(a)ᵀ(x − a) + (1/2)(x − a)ᵀ∇²f(a)(x − a)

Example: what is the quadratic Taylor approximation to f(x, y) = √(1 + x − y) near
(x, y) = (0, 0)? Compute the partial derivatives

∂f/∂x = 1/(2√(1 + x − y)),    ∂f/∂y = −1/(2√(1 + x − y))

and

∂²f/∂x² = ∂²f/∂y² = −1/(4(1 + x − y)^{3/2}),    ∂²f/∂x∂y = 1/(4(1 + x − y)^{3/2})

Evaluating at (x, y) = (0, 0) we get f(0, 0) = 1, ∂f(0, 0)/∂x = 1/2, etc.
Multivariate Taylor’s Theorem
The Taylor series is

f(x, y) ≈ f(0, 0) + [x y] [∂f(0, 0)/∂x; ∂f(0, 0)/∂y]
          + (1/2) [x y] [∂²f(0, 0)/∂x²  ∂²f(0, 0)/∂x∂y; ∂²f(0, 0)/∂x∂y  ∂²f(0, 0)/∂y²] [x; y]
        = f(0, 0) + x ∂f(0, 0)/∂x + y ∂f(0, 0)/∂y
          + (1/2)[x² ∂²f(0, 0)/∂x² + xy ∂²f(0, 0)/∂x∂y + xy ∂²f(0, 0)/∂y∂x + y² ∂²f(0, 0)/∂y²]
        = 1 + (1/2)x − (1/2)y + (1/2)[−(1/4)x² + (1/4)xy + (1/4)xy − (1/4)y²]
        = 1 + (1/2)x − (1/2)y − (1/8)x² + (1/4)xy − (1/8)y²

Quick check: use the 1D Taylor series √(1 + h) ≈ 1 + (1/2)h − (1/8)h² and substitute
h = x − y.
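Another quick check is symbolic, using SymPy (a sketch, assuming sympy is installed):
expand f along a scaling parameter t and truncate after the quadratic terms.

    # Verify the quadratic Taylor approximation symbolically (sketch).
    import sympy as sp

    x, y, t = sp.symbols('x y t')
    f = sp.sqrt(1 + x - y)

    # Scale (x, y) by t, expand in t to quadratic order, then set t = 1.
    expansion = sp.series(f.subs({x: t*x, y: t*y}), t, 0, 3).removeO().subs(t, 1)
    print(sp.expand(expansion))   # 1 + x/2 - y/2 - x**2/8 + x*y/4 - y**2/8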
Optimality
Now let’s go back to our optimisation problem:

min_{x∈Rⁿ} f(x)
What does it mean to ‘solve’ this problem?
Definition 4.2. A point x∗ ∈ Rn is a global minimiser of f if f(x∗) ≤ f(x) for all
x ∈ Rn.
But thinking back to the 1D case, we could only use derivatives to check if a point is
bigger/smaller than nearby points.
Definition 4.3. A point x∗ ∈ Rn is a local minimiser of f if f(x∗) ≤ f(x) for all x
sufficiently close to x∗.
Maximiser? Same definitions but with ≥.
First-Order Optimality
To relate local minimisers/maximisers to derivatives, we need to think about how f
behaves when we move in a given direction away from a point.
Given a point x, a direction p is a descent direction if f(x+ αp) < f(x) for all α > 0
sufficiently small.
Similarly, p is an ascent direction if f(x+ αp) > f(x) for all α > 0 sufficiently small.
By definition, if x∗ is a local minimiser (maximiser), no vector is a descent (ascent)
direction.
From Taylor’s Theorem, if α is sufficiently small then x + αp is close to x, so

f(x + αp) ≈ f(x) + α∇f(x)ᵀp

So, if ∇f(x)ᵀp < 0 then p is a descent direction (ascent if > 0). (If ∇f(x)ᵀp = 0, we
cannot tell.)
Geometrically, p is ascent if it forms an acute angle with ∇f(x), and descent if it
forms an acute angle with −∇f(x).
First-Order Optimality
This gives us the first key way to characterise local extrema.
Theorem 4.4. If f is continuously differentiable and x∗ is a local maximiser or minimiser,
then ∇f(x∗) = 0.
Proof. Suppose ∇f(x∗) ≠ 0. Then the direction p = ∇f(x∗) gives

∇f(x∗)ᵀp = ∇f(x∗)ᵀ∇f(x∗) = ∑_{i=1}^n (∂f(x∗)/∂xi)² > 0

so p is an ascent direction. Similarly, p = −∇f(x∗) is a descent direction. So, x∗ cannot
be a local maximiser or minimiser (contradiction).
This is the Rn equivalent of “x is a max/min of f means f ′(x) = 0” from high school.
Second-Order Optimality
So local extrema are stationary points (∇f(x∗) = 0). How can we classify stationary
points?
A stationary point x∗ is:
• A local minimiser if all directions from x∗ are ascent directions
• A local maximiser if all directions from x∗ are descent directions
• A saddle point if there are both ascent and descent directions from x∗
Saddle points are the equivalent of inflection points in 1D. For example, the origin is a
saddle point of f(x, y) = x2 − y2.
At a stationary point, our Taylor series is

f(x) ≈ f(x∗) + ∇f(x∗)ᵀ(x − x∗) + (1/2)(x − x∗)ᵀ∇²f(x∗)(x − x∗)

Since ∇f(x∗) = 0, we need to look at the Hessian ∇²f(x∗).
Reminder: Eigenvalues
From first-year linear algebra, you should know that an eigenvector/eigenvalue of a
square matrix A ∈ Rn×n is a value λ ∈ C and nonzero vector v ∈ Cn such that Av = λv.
You find all such λ by solving the equation det(A − λI) = 0 (i.e. find roots of a
degree-n polynomial). For each λ, you then use Gauss-Jordan elimination to find any
v ̸= 0 such that (A− λI)v = 0, which always exists for these λ since A− λI is singular.
Fact 1: if A ∈ Rn×n is symmetric, then all eigenvalues and eigenvectors are real.
Fact 2: if A ∈ Rn×n is symmetric, then there exists a set of eigenvectors (one per
eigenvalue, counting multiple roots) which form an orthonormal basis for Rn. This means
we can write
A = QΛQT
where Q ∈ Rn×n has columns vi and is orthogonal (Q−1 = QT ), and Λ ∈ Rn×n is diagonal
with entries λi.
Why care? ∇2f(x∗) is symmetric!
Reminder: Quadratic Forms/Definiteness
At a stationary point (∇f(x∗) = 0), our Taylor series is

f(x) ≈ f(x∗) + (1/2)(x − x∗)ᵀ∇²f(x∗)(x − x∗)

Since we want to look at “all x near x∗”, we are interested in expressions like yᵀ∇²f(x∗)y
for different inputs y.
For a matrix A ∈ Rⁿˣⁿ, the quadratic form associated with A is the quadratic function
Q(y) = yᵀAy.
The matrix A is:
• Positive definite if Q(y) > 0 for all y ̸= 0 (note Q(0) = 0 always)
• Positive semidefinite if Q(y) ≥ 0 for all y
• Indefinite if Q(y) > 0 for some y and Q(y) < 0 for other y
• Negative semidefinite if Q(y) ≤ 0 for all y
• Negative definite if Q(y) < 0 for all y ̸= 0
Reminder: Quadratic Forms/Definiteness
If A is symmetric, write A = QΛQᵀ, so

Q(y) = yᵀQΛQᵀy = (Qᵀy)ᵀΛ(Qᵀy) = zᵀΛz

where z = Qᵀy. Since Qᵀ is invertible (QᵀQ = I), z takes all values in Rⁿ if y does, and
y = 0 if and only if z = 0 (check!). Since Λ is diagonal with entries λ1, . . . , λn (eigenvalues
of A), we get

Q(y) = ∑_{i=1}^n λi zi²
Varying z over all vectors in Rn (including coordinate vectors), we get:
• A is positive definite if all λi > 0
• A is positive semidefinite if all λi ≥ 0
• A is indefinite if some λi > 0 and some λi < 0
• A is negative semidefinite if all λi ≤ 0
• A is negative definite if all λi < 0
Second-Order Optimality
At a stationary point (∇f(x∗) = 0), our Taylor series for x ≈ x∗ is

f(x) ≈ f(x∗) + (1/2)(x − x∗)ᵀ∇²f(x∗)(x − x∗)

For a local minimiser/maximiser, f(x∗) is smaller/bigger than f(x) for all nearby x.

Theorem 4.5. Suppose f is C²(Rⁿ) and x∗ is a stationary point (i.e. ∇f(x∗) = 0).
• If ∇²f(x∗) is positive definite, then x∗ is a local minimiser
• If ∇²f(x∗) is negative definite, then x∗ is a local maximiser
• If ∇²f(x∗) is indefinite, then x∗ is a saddle point

If ∇²f(x∗) has a zero eigenvalue (so Q(y) = 0 for some y ≠ 0 and the Hessian is
positive/negative semidefinite), no conclusion can be drawn.
Examples
Example: find and classify all stationary points of f(x, y) = x2 + y2.
First, compute gradient and Hessian:
∇f(x, y) = [ 2x ],  and  ∇²f(x, y) = [ 2 0 ]
           [ 2y ]                    [ 0 2 ]
Second, find stationary points:
∇f(x∗, y∗) = 0 =⇒ (x∗, y∗) = (0, 0).
Last, look at eigenvalues of Hessian ∇2f(x∗, y∗): eigenvalues are λ = 2, 2 (check using
det(A− λI) = 0, or note eigenvalues of diagonal matrix are diagonal entries). Hessian is
positive definite, so point is local minimiser.
All together: f has one stationary point (x∗, y∗) = (0, 0) which is a local minimiser.
In this case, f(x, y) ≥ 0 by definition and f(x∗, y∗) = 0, so the point is also a global
minimiser.
Examples
Example: find and classify all stationary points of f(x, y, z) = 3xy − x3 − y3 − 3z2.
First, compute the gradient and Hessian:

∇f(x, y, z) = [ 3y − 3x² ]                       [ −6x  3   0  ]
              [ 3x − 3y² ] ,  and  ∇²f(x, y, z) = [ 3    −6y 0  ]
              [ −6z      ]                       [ 0    0   −6 ]

Second, find stationary points:

∇f(x∗, y∗, z∗) = 0 =⇒ y = x², x = y², z = 0

Substitute x = y² into y = x² to get y = y⁴, and so y = 0 or y = 1 (then x = y² so x = 0
or x = 1 too).
So there are two stationary points, (x, y, z) = (0, 0, 0) and (1, 1, 0).
Examples
Example: find and classify all stationary points of f(x, y, z) = 3xy − x3 − y3 − 3z2.
At the stationary point (x, y, z) = (0, 0, 0) the Hessian is

H = ∇²f(0, 0, 0) = [ 0 3 0  ]                 [ −λ 3  0      ]
                   [ 3 0 0  ]  =⇒  H − λI =   [ 3  −λ 0      ]
                   [ 0 0 −6 ]                 [ 0  0  −6 − λ ]

Find the eigenvalues:

det(H − λI) = −λ(−λ)(−6 − λ) − 3(3)(−6 − λ) = (−6 − λ)[λ² − 9]

So the eigenvalues are λ = −6 and λ = ±3, and the Hessian is indefinite.
The stationary point (0, 0, 0) is a saddle point.
Examples
Example: find and classify all stationary points of f(x, y, z) = 3xy − x3 − y3 − 3z2.
At the stationary point (x, y, z) = (1, 1, 0) the Hessian is

H = ∇²f(1, 1, 0) = [ −6 3  0  ]                 [ −6 − λ 3      0      ]
                   [ 3  −6 0  ]  =⇒  H − λI =   [ 3      −6 − λ 0      ]
                   [ 0  0  −6 ]                 [ 0      0      −6 − λ ]

Find the eigenvalues:

det(H − λI) = (−6 − λ)(−6 − λ)(−6 − λ) − 3(3)(−6 − λ) = (−6 − λ)[(−6 − λ)² − 9]
            = (−6 − λ)[λ² + 12λ + 27]

So the eigenvalues are λ = −6 and λ = −3, −9, and the Hessian is negative definite.
The stationary point (1, 1, 0) is a local maximiser.
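These hand calculations are easy to check numerically with NumPy (an illustrative
sketch; eigvalsh is NumPy's eigenvalue routine for symmetric matrices):

    # Classify the stationary points of f(x,y,z) = 3xy - x^3 - y^3 - 3z^2.
    import numpy as np

    def hessian(x, y, z):
        return np.array([[-6*x, 3, 0],
                         [3, -6*y, 0],
                         [0, 0, -6]])

    for point in [(0, 0, 0), (1, 1, 0)]:
        print(point, np.linalg.eigvalsh(hessian(*point)))
    # (0, 0, 0): eigenvalues (-6, -3, 3), indefinite -> saddle point
    # (1, 1, 0): eigenvalues (-9, -6, -3), negative definite -> local maximiser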
No conclusion?
We said: if ∇²f(x∗) has a zero eigenvalue then no conclusion can be drawn. Some
simple 1D examples show why this is the case: if f′(x∗) = 0 and f″(x∗) = 0 (the only way
to get a zero eigenvalue of a 1 × 1 matrix)...
• f(x) = x³ and x∗ = 0 — point of inflection (saddle point)
• f(x) = x⁴ and x∗ = 0 — local minimiser
• f(x) = −x⁴ and x∗ = 0 — local maximiser
Just looking at gradients & Hessians, we cannot tell these cases apart (“no conclusion
can be drawn”).
Global Extrema?
So far, we have
Global min =⇒ Local min =⇒ Stationary point
In general, the reverse implications are not true. For example, f(x) = (x − 1)² − 3x cos(x):

[Figure: plot of f(x) = (x − 1)² − 3x cos(x) on [−4, 4], showing several local minima that
are not global minima]
For the second implication, we know that

Stationary point + positive definite Hessian =⇒ Local min

What about the first implication? When are local minima also global minima?
Convexity
One special class of functions are convex functions.
Definition 4.6. A function f : Rn → R is convex if, for all x,y ∈ Rn and all t ∈ [0, 1],
f(tx+ (1− t)y) ≤ tf(x) + (1− t)f(y).
For example, all linear functions, f(x) = x⁴, f(x) = eˣ and f(x) = |x| are convex
functions.
Less obvious examples: f(x) = xᵀAx if A is symmetric & positive semidefinite,
f(x) = log(e^{x1} + · · · + e^{xn}) and f(x) = max(x1, . . . , xn) are all convex.
An easy(ish) way to check for convexity is:
Theorem 4.7. If f is C2(Rn), then f is convex if and only if ∇2f(x) is positive semidef-
inite for all x ∈ Rn.
Convexity
Theorem 4.8. If f is C1(Rn) and convex, then
Global min ⇐⇒ Local min ⇐⇒ Stationary point
This is the most common situation (minimising a convex function) where we can easily
find global minima, just by looking for a stationary point!
Warning 1: not all convex functions have stationary points (e.g. f(x) = ex)
Warning 2: some non-convex functions have “all stationary points are local & global
minima” (can you draw a picture of such a function in 1D?)
What about “local max implies global max”? When f is concave (i.e. −f is convex).
Convexity
Proof: from our existing ⇒ implications, we just need to show “global minimiser ⇐
stationary point” (equivalently: if x∗ is not a global minimiser, then x∗ is not a stationary
point).
Suppose x∗ is not a global minimiser, so there exists z with f(z) < f(x∗). Then we
have

∇f(x∗)ᵀ(z − x∗) = (d/dt) f(x∗ + t(z − x∗)) |_{t=0} = lim_{t→0} [f(x∗ + t(z − x∗)) − f(x∗)] / t

Using convexity of f,

∇f(x∗)ᵀ(z − x∗) = lim_{t→0} [f((1 − t)x∗ + tz) − f(x∗)] / t ≤ lim_{t→0} [(1 − t)f(x∗) + tf(z) − f(x∗)] / t

Hence ∇f(x∗)ᵀ(z − x∗) ≤ f(z) − f(x∗) < 0, so p = z − x∗ is a descent direction at x∗.
Hence x∗ cannot be a stationary point.
Optimisation Algorithms
We have studied what it means to solve minx∈Rn f(x), but what algorithms can find a
solution?
Many available, one popular choice is linesearch methods: given a current point xk,
find a better point xk+1 via:
• Find a descent direction pk (i.e. ∇f(xk)Tpk < 0)
• Find a stepsize αk > 0 such that f(xk + αkpk) < f(xk). Pick xk+1 = xk + αkpk.
Most common types:
• pk = −∇f(xk), the direction of fastest local decrease (“steepest descent” or “gradient
descent”)
• pk = −[∇²f(xk)]⁻¹∇f(xk), minimising the quadratic Taylor series (“Newton’s method”).
Needs the Hessian to be positive definite to guarantee descent!
• pk = −Hk∇f(xk) for some Hk ≈ ∇2f(xk) and positive definite (most famous are
“quasi-Newton” methods, especially “BFGS”)
I recommend BFGS as a default choice (doesn’t need Hessian but fast convergence).
For high-dimensional problems (n large), use the “limited memory” version, L-BFGS.
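For example, the following sketch minimises the (non-convex) Rosenbrock function with
BFGS via SciPy; the test function and starting point here are just illustrative choices.

    # Linesearch minimisation with BFGS (illustrative sketch).
    import numpy as np
    from scipy.optimize import minimize

    def f(x):
        return (x[0] - 1)**2 + 100 * (x[1] - x[0]**2)**2   # Rosenbrock function

    res = minimize(f, x0=np.array([-1.0, 2.0]), method='BFGS')
    print(res.x)   # close to the minimiser (1, 1)

    # For large n, swap method='BFGS' for method='L-BFGS-B'.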
4.2 Constrained Optimisation
Constrained Optimisation
Now let’s consider a problem with constraints:

min_{x∈Ω} f(x),
for a feasible region Ω ⊂ Rn. As mentioned above, we will express our constraints
algebraically, as equations/inequalities:
min_{x∈Rⁿ} f(x)
s.t. gi(x) = 0, i = 1, . . . , k
     hj(x) ≤ 0, j = k + 1, . . . , m

Note: just like LPs, constraints like c(x) ≥ 0 can be written as −c(x) ≤ 0, and c(x) ∈ [a, b]
can be written as c(x) − b ≤ 0 and a − c(x) ≤ 0.
The feasible region is then

Ω = {x ∈ Rⁿ : g1(x) = · · · = gk(x) = 0 and hk+1(x), . . . , hm(x) ≤ 0}
Constrained Optimisation: Optimality
What does it mean to be a solution to a constrained optimisation problem?
Definition 4.9. A point x∗ ∈ Ω is a global minimiser if f(x∗) ≤ f(x) for all x ∈ Ω.
Definition 4.10. A point x∗ ∈ Ω is a local minimiser if f(x∗) ≤ f(x) for all x ∈ Ω
sufficiently close to x∗.
These definitions are very similar to the unconstrained case (just take Ω = Rn).
Unfortunately, it is much harder to write necessary conditions (like ∇f(x∗) = 0) for
constrained problems.
There are second-order conditions (involving ∇2f) for constrained problems just like
the unconstrained case, but we will not look at these in this course.
Geometric Ideas
Plot the unconstrained problem max_{x,y∈R} f(x, y) = 4 − (x − 1)² − 1.5(y − 2)²:

[Figure: left, a surface plot of f(x, y); right, a contour plot of f(x, y) with gradient
arrows ∇f(x, y)]

The gradient points in the direction of increasing f, perpendicular to the level sets.
Geometric Ideas
A first-order Taylor series at x gives

f(y) ≈ f(x) + ∇f(x)ᵀ(y − x)

So if y is a point near x on the same level set (so f(y) = f(x)), we have

∇f(x)ᵀ(y − x) ≈ 0

Taking limits as y → x, we get

∇f(x)ᵀv(x) = 0

where v(x) is the tangent to the level set at x.
That is, ∇f always points perpendicular to (tangents of) level sets.
Geometric Ideas
Now let’s add a single equality constraint:

[Figure: contour plot of f(x, y) with the constraint curve g(x, y) = 0 overlaid]

max_{x,y∈R} f(x, y) = 4 − (x − 1)² − 1.5(y − 2)²
s.t. g(x, y) = y − (x − 4)² + 1 = 0

Observations: ∇g is perpendicular to the constraint set, and the solution is where a level
set of f is parallel to the constraint set.
So at the solution, ∇g is parallel to ∇f (possibly with a sign difference), or ∇f(x∗) =
−λ∇g(x∗) for some λ ∈ R.
Lagrangian
A unified way to think about the solutions of constrained optimisation problems is to
introduce a new function called a Lagrangian.
If we only have equality constraints:

min_{x∈Rⁿ} f(x) s.t. g(x) = 0 (where g(x) = (g1(x), . . . , gm(x))ᵀ)

the Lagrangian is defined to be

L(x, λ) := f(x) + ∑_{i=1}^m λi gi(x) = f(x) + λᵀg(x),

where λ ∈ Rᵐ is some vector of weights.
Lagrangian
L(x, λ) := f(x) + ∑_{i=1}^m λi gi(x) = f(x) + λᵀg(x),

Why is this useful?
The partial derivatives of L are

∂L/∂xj = ∂f/∂xj + ∑_{i=1}^m λi ∂gi/∂xj   and   ∂L/∂λi = gi(x)

So a stationary point (x∗, λ∗) of L satisfies

∇ₓL = ∇f(x∗) + ∑_{i=1}^m λ∗i ∇gi(x∗) = 0   and   ∇_λL = g(x∗) = 0

The first condition generalises what we saw earlier (∇f = −λ∇g). The second condition
is feasibility, x∗ ∈ Ω.
Both of these are necessary requirements for x∗ to be a solution of the constrained
problem!
Equality Constraints
For the equality-constrained problem
min_{x∈Rⁿ} f(x) s.t. g(x) = 0 (where g(x) = (g1(x), . . . , gm(x))ᵀ)

we have the following (formalised later when we add inequality constraints).

Theorem 4.11. Under suitable conditions (e.g. differentiability of f and all gi), if x∗ is
a local minimiser then there exist Lagrange multipliers λ∗1, . . . , λ∗m ∈ R such that

∇f(x∗) = −∑_{i=1}^m λ∗i ∇gi(x∗), and g(x∗) = 0.

Equivalently, x∗ and λ∗ = (λ∗1, . . . , λ∗m)ᵀ form a stationary point for L(x, λ).
These conditions give n+m equations in n+m unknowns.
Example
Example: solve the problem
min_{x,y∈R} f(x, y) = (1/2)(x − 1)² + (1/2)(y − 2)² + 1
s.t. x + y = 1

First, write the constraint as g(x, y) = x + y − 1 = 0.
Method 1: the Lagrangian is

L(x, y, λ) = (1/2)(x − 1)² + (1/2)(y − 2)² + 1 + λ(x + y − 1)

Stationary points satisfy

∂L/∂x = (x − 1) + λ = 0,   ∂L/∂y = (y − 2) + λ = 0,   ∂L/∂λ = x + y − 1 = 0

The last equation gives y = 1 − x; substitute into the second equation to get (1 − x − 2) + λ = 0.
Add this to the first equation to get 2λ − 2 = 0 or λ = 1, and then x = 0 and y = 1.
Example
Method 2: compute gradients

∇f(x, y) = [ x − 1 ]   and   ∇g(x, y) = [ 1 ]
           [ y − 2 ]                    [ 1 ]

Our theorem tells us: find x, y, λ ∈ R such that

[ x − 1 ] = −λ [ 1 ]   and   g(x, y) = x + y − 1 = 0
[ y − 2 ]      [ 1 ]

Same equations as method 1 (fortunately)! The solution was x = 0, y = 1 and λ = 1.
So both methods give the solution (x∗, y∗) = (0, 1) with Lagrange multiplier λ∗ = 1.
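As a numerical cross-check, SciPy's SLSQP method (an SQP-type algorithm — see the end
of this chapter) handles equality constraints directly. A sketch, with our own variable names:

    # Check the equality-constrained example numerically (sketch).
    from scipy.optimize import minimize

    f = lambda v: 0.5*(v[0] - 1)**2 + 0.5*(v[1] - 2)**2 + 1
    cons = [{'type': 'eq', 'fun': lambda v: v[0] + v[1] - 1}]   # x + y = 1

    res = minimize(f, x0=[0.0, 0.0], method='SLSQP', constraints=cons)
    print(res.x)   # expect (0, 1)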
Inequality Constraints
Now let’s look at inequality constraints:

min_{x∈Rⁿ} f(x) s.t. h(x) ≤ 0 (where h(x) = (h1(x), . . . , hm(x))ᵀ)

Just like for LPs, we can add slack variables si to convert ≤ inequalities to equalities. The
main difference here is we don’t want to add new constraints like si ≥ 0, so instead we
write hi(x) ≤ 0 as hi(x) + si² = 0, not hi(x) + si = 0.
The new problem is

min_{x∈Rⁿ, s∈Rᵐ} f(x) s.t. hi(x) + si² = 0 ∀i = 1, . . . , m

This is now in equality form: what are the optimality conditions here?
Inequality Constraints
The Lagrangian is

L(x, s, λ) = f(x) + ∑_{i=1}^m λi(hi(x) + si²)

so the stationarity conditions give

∂L/∂xj = ∂f/∂xj + ∑_{i=1}^m λi ∂hi/∂xj = 0
∂L/∂si = 2λisi = 0
∂L/∂λi = hi(x) + si² = 0

Since s and λ are in Rᵐ, we now have n + m + m equations in n + m + m unknowns.
Inequality Constraints
Written in vector form (as much as possible), our stationarity conditions are

∇f(x∗) = −∑_{i=1}^m λ∗i ∇hi(x∗), λ∗i s∗i = 0, and h(x∗) + (s∗)² = 0

(with (s∗)² meant element-wise), and the complementary slackness conditions λ∗i s∗i = 0
holding for all i = 1, . . . , m.
This is clearly more complicated. What types of behaviours can arise here?
Inequality Constraints
Case 1: unconstrained minimiser outside feasible region

[Figure: contour plot of f with the feasible region h ≤ 0; the constrained solution lies on
the boundary]

min_{x,y∈R} f(x, y) = −4 + (x − 1)² + 1.5(y − 2)²
s.t. h(x, y) = −y + (x − 4)² − 1 ≤ 0

We aren’t at the unconstrained solution, so ∇f(x∗) ≠ 0, and at least one constraint
must be stopping us, so hi(x∗) = 0 for some i (such constraints are called active).
The plot shows ∇h(x∗, y∗) pointing in the opposite direction to ∇f(x∗, y∗): to decrease f
we would need to increase h (which isn’t allowed when h = 0).
Inequality Constraints
Case 1: unconstrained minimiser outside feasible region
There is some constraint i preventing us from decreasing f further. Consider

∇f(x∗) = −∑_{i=1}^m λ∗i ∇hi(x∗)

To decrease f, we would need to increase hi above zero (not allowed), i.e. ∇f points
in the opposite direction to ∇hi. This gives λ∗i > 0.
We also have λ∗i s∗i = 0, and so s∗i = 0. The last condition is hi(x∗) + (s∗i)² = 0 and so
hi(x∗) = 0 (constraint i is active, exactly what we expect).
Since hi(x∗) = 0, note that λ∗i hi(x∗) = 0.
Inequality Constraints
Case 2: unconstrained minimiser inside feasible region

[Figure: contour plot of f with the feasible region h ≤ 0; the unconstrained solution lies
strictly inside the region]

min_{x,y∈R} f(x, y) = −4 + (x − 1)² + 1.5(y − 2)²
s.t. h(x, y) = −y + (x − 2)² − 1 ≤ 0

Constrained solution = unconstrained solution, so ∇f(x∗) = 0. The solution is away from
the constraint boundary, so hi(x∗) < 0 for all i.
Inequality Constraints
Case 2: unconstrained minimiser inside feasible region
If hi(x∗) < 0, then hi(x∗) + (s∗i)² = 0 implies s∗i ≠ 0.
Complementary slackness λ∗i s∗i = 0 implies λ∗i = 0.
Then from λ∗i = 0 and

∇f(x∗) = −∑_{i=1}^m λ∗i ∇hi(x∗)

we conclude ∇f(x∗) = 0, as expected.
Again note that λ∗i hi(x∗) = 0.
Inequality Constraints
Case 3: unconstrained minimiser on boundary of feasible region

[Figure: contour plot of f with the feasible region h ≤ 0; the unconstrained solution lies
exactly on the boundary]

min_{x,y∈R} f(x, y) = −4 + (x − 1)² + 1.5(y − 2)²
s.t. h(x, y) = −y + (x − 1 − √3)² − 1 ≤ 0

The unconstrained solution is still feasible, so ∇f(x∗) = 0. But it is on the boundary, so
hi(x∗) = 0 for some i.
Inequality Constraints
Case 3: unconstrained minimiser on boundary of feasible region
If hi(x∗) = 0, we have s∗i = 0, but if hi(x∗) < 0 then s∗i ≠ 0.
Complementary slackness λ∗i s∗i = 0 implies λ∗i = 0 if hi(x∗) < 0, and λ∗i ∈ R otherwise.
From

∇f(x∗) = −∑_{i=1}^m λ∗i ∇hi(x∗)

all the terms with hi(x∗) < 0 vanish (since λ∗i = 0).
What about the constraints with hi(x∗) = 0? From x∗, moving inside the feasible
region (so hi(x) < 0) must increase f, so we can only conclude λ∗i ≥ 0 whenever
hi(x∗) = 0 (in this case λ∗i = 0 is consistent, since ∇f(x∗) = 0). Again, note
λ∗i hi(x∗) = 0 in both situations.
Inequality Constraints
For

min_{x∈Rⁿ} f(x) s.t. h(x) ≤ 0

adding slacks and finding stationary points of the Lagrangian gives

∇f(x∗) = −∑_{i=1}^m λ∗i ∇hi(x∗), λ∗i s∗i = 0, and h(x∗) + (s∗)² = 0

Our analysis of the 3 cases shows this is the same as

∇f(x∗) = −∑_{i=1}^m λ∗i ∇hi(x∗), λ∗i hi(x∗) = 0, h(x∗) ≤ 0 and λ∗i ≥ 0

So there are two ways of finding solutions (Lagrangian & above conditions), just like for
equality constraints.
Warning: solving these equations analytically is harder!
Example
Example: solve the problem

min_{x∈R³} f(x) = x1(x1 − 10) + x2(x2 − 50) − 2x3
s.t. x1 + x2 ≤ 10, x3 ≤ 10

Start by defining h1(x) = x1 + x2 − 10 ≤ 0 and h2(x) = x3 − 10 ≤ 0.
Method 1: Add slacks, so the Lagrangian is

L(x, s, λ) = f(x) + ∑_{i=1}^m λi(hi(x) + si²)
           = x1(x1 − 10) + x2(x2 − 50) − 2x3 + λ1(x1 + x2 − 10 + s1²) + λ2(x3 − 10 + s2²)

Set all partial derivatives to zero...
Example
Method 1: the partial derivatives are

(1) ∂L/∂x1 = 2x1 − 10 + λ1 = 0, (2) ∂L/∂x2 = 2x2 − 50 + λ1 = 0, (3) ∂L/∂x3 = −2 + λ2 = 0

and (4) ∂L/∂si = 2λisi = 0 always, plus

(5) ∂L/∂λ1 = x1 + x2 − 10 + s1² = 0, (6) ∂L/∂λ2 = x3 − 10 + s2² = 0

Firstly, (3) gives λ2 = 2, so (4) gives s2 = 0 (i.e. x3 = 10 from (6)).
Now, if λ1 = 0, then (1) & (2) give x1 = 5 and x2 = 25. But then (5) has no solutions.
So, we must have λ1 ≠ 0.
Since λ1 ≠ 0, (4) means s1 = 0, and then (5) implies x1 + x2 = 10. This, (1) & (2)
give 3 linear equations in x1, x2, λ1 with solution x1 = −5, x2 = 15 and λ1 = 20.
Example
Method 2: we want to find x∗ and λ∗ satisfying

∇f(x∗) = −∑_{i=1}^m λ∗i ∇hi(x∗), λ∗i hi(x∗) = 0, h(x∗) ≤ 0 and λ∗i ≥ 0

Compute gradients:

∇f(x) = [ 2x1 − 10 ]          ∇h1(x) = [ 1 ]          ∇h2(x) = [ 0 ]
        [ 2x2 − 50 ] ,                 [ 1 ] ,  and             [ 0 ]
        [ −2       ]                   [ 0 ]                   [ 1 ]

The gradient condition is therefore

[ 2x1 − 10 ]   [ −λ1 ]
[ 2x2 − 50 ] = [ −λ1 ]
[ −2       ]   [ −λ2 ]

Together with λ1h1(x) = 0 and λ2h2(x) = 0 we get 5 equations in 5 unknowns (but we
need to keep the inequalities in mind when solving).
Example
Method 2: the 5 equations in 5 unknowns are:

(1–3) [ 2x1 − 10 ]   [ −λ1 ]
      [ 2x2 − 50 ] = [ −λ1 ] ,  (4) λ1(x1 + x2 − 10) = 0, and (5) λ2(x3 − 10) = 0
      [ −2       ]   [ −λ2 ]

with the extra inequalities x1 + x2 − 10 ≤ 0, x3 − 10 ≤ 0 and λi ≥ 0.
(3) gives λ2 = 2, and so (5) means x3 = 10. Confirm that λ2 ≥ 0 and x3 − 10 ≤ 0
both hold.
If λ1 = 0, then (1) & (2) give x1 = 5 and x2 = 25, violating the inequality
x1 + x2 − 10 ≤ 0. So, λ1 ≠ 0.
Since λ1 ≠ 0, (4) means x1 + x2 − 10 = 0. Together with (1) & (2) we have 3 linear
equations in x1, x2, λ1. Just as in method 1, the solution is x1 = −5, x2 = 15 and λ1 = 20.
We confirm λ1 ≥ 0 and x1 + x2 − 10 ≤ 0 both hold.
From both methods, the solution is x∗1 = −5, x∗2 = 15 and x∗3 = 10 (with λ∗1 = 20 and
λ∗2 = 2).
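A numerical cross-check with SciPy (a sketch; note that SciPy's 'ineq' convention is
fun(x) ≥ 0, so we pass −hi):

    # Check the inequality-constrained example numerically (sketch).
    from scipy.optimize import minimize

    f = lambda x: x[0]*(x[0] - 10) + x[1]*(x[1] - 50) - 2*x[2]
    cons = [{'type': 'ineq', 'fun': lambda x: 10 - x[0] - x[1]},   # h1(x) <= 0
            {'type': 'ineq', 'fun': lambda x: 10 - x[2]}]          # h2(x) <= 0

    res = minimize(f, x0=[0.0, 0.0, 0.0], method='SLSQP', constraints=cons)
    print(res.x)   # expect (-5, 15, 10)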
Summary so far
Equality constraints (g(x) = 0): Find a stationary point of the Lagrangian

L(x, λ) = f(x) + ∑_{i=1}^m λi gi(x)

or solve (for x∗ ∈ Rⁿ and λ∗ ∈ Rᵐ)

∇f(x∗) = −∑_{i=1}^m λ∗i ∇gi(x∗), and g(x∗) = 0.

Inequality constraints (h(x) ≤ 0): Add slacks and find a stationary point of the
Lagrangian

L(x, s, λ) = f(x) + ∑_{i=1}^m λi(hi(x) + si²)

or solve (for x∗ ∈ Rⁿ and λ∗ ∈ Rᵐ)

∇f(x∗) = −∑_{i=1}^m λ∗i ∇hi(x∗), λ∗i hi(x∗) = 0, h(x∗) ≤ 0 and λ∗i ≥ 0
General Case
Now let’s look at the fully general case with both equality and inequality constraints:

min_{x∈Rⁿ} f(x)
s.t. gi(x) = 0, i = 1, . . . , k
     hj(x) ≤ 0, j = k + 1, . . . , m

As before, we can find stationary points of the Lagrangian (after adding slacks to the
inequality constraints)

L(x, s, λ) = f(x) + ∑_{i=1}^k λi gi(x) + ∑_{j=k+1}^m λj(hj(x) + sj²)
However it is more common for people to look at the other set of conditions (method 2)...
KKT Conditions
Theorem 4.12. Suppose f, gi, hj are C¹(Rⁿ), x∗ ∈ Ω is a local minimiser, and a constraint
qualification holds at x∗. Then x∗ is a KKT point (William Karush 1939, Harold Kuhn
& Albert Tucker, 1951): that is, there exist Lagrange multipliers λ∗ ∈ Rᵐ such that

∇f(x∗) = −∑_{i=1}^k λ∗i ∇gi(x∗) − ∑_{j=k+1}^m λ∗j ∇hj(x∗),   (stationarity)
gi(x∗) = 0, ∀i = 1, . . . , k,   (feasibility)
hj(x∗) ≤ 0, ∀j = k + 1, . . . , m,   (feasibility)
λ∗j ≥ 0, ∀j = k + 1, . . . , m,   (dual feasibility)
λ∗j hj(x∗) = 0, ∀j = k + 1, . . . , m.   (complementary slackness)
Intuition: each Lagrange multiplier represents how much (and in what direction)
each constraint is preventing us from decreasing f further. Compare: dual solution for
LPs
Constraint Qualification
This result only works if we have a constraint qualification (CQ): roughly, this says
that first-order Taylor series for gi and hj accurately represent the geometry of Ω.
There are many CQs, such as:
• All constraints gi and hj are linear/affine (i.e. linear ± constant)
• Slater’s CQ: if gi are linear/affine and hj are convex, and there exists x such that
hj(x) < 0 for all j.
• LICQ: {∇gi(x∗) : i = 1, . . . , k}∪{∇hj(x∗) : j ∈ A(x∗)} is a linearly independent set,
where A(x) = {j = k + 1, . . . ,m : hj(x) = 0} are the active inequality constraints.
In this course, we will assume every problem has a CQ, so you don’t need to worry
about this.
Constraint Qualification
min_{x∈R²} f(x) = x1 + x2 s.t. g1(x) = (x1 − 1)² + x2² − 1 = 0, g2(x) = (x1 − 2)² + x2² − 4 = 0.

[Figure: the two circles g1(x) = 0 and g2(x) = 0, which touch only at the origin]

There is only one feasible point, Ω = {(0, 0)}, so it must be the local (and global)
minimiser.
However, ∇f(x∗) = (1, 1)ᵀ, ∇g1(x∗) = (−2, 0)ᵀ and ∇g2(x∗) = (−4, 0)ᵀ, so we can
never write ∇f(x∗) as a linear combination of ∇g1(x∗) and ∇g2(x∗). Hence x∗ cannot
be a KKT point.
Example
Example: solve the problem
min_{x,y∈R} f(x, y) = (1/2)(x − 1)² + (1/2)(y − 2)² + 1
s.t. x + y = 1, x ≥ 1/2, y ≤ 3/4

Start by defining g1(x, y) = x + y − 1 = 0, h2(x, y) = 1/2 − x ≤ 0 and h3(x, y) = y − 3/4 ≤ 0.
The KKT conditions are: find x, y, λ1, λ2, λ3 such that

∇f(x, y) = −λ1∇g1(x, y) − λ2∇h2(x, y) − λ3∇h3(x, y)
g1(x, y) = 0
h2(x, y), h3(x, y) ≤ 0 and λ2, λ3 ≥ 0
λ2h2(x, y) = λ3h3(x, y) = 0

As before, we solve the 5 equations in 5 unknowns (and just keep in mind the inequalities
hj ≤ 0 and λj ≥ 0).
Example
Computing gradients, our equations are

(1–2) [ x − 1 ] = −λ1 [ 1 ] − λ2 [ −1 ] − λ3 [ 0 ] = [ −λ1 + λ2 ]
      [ y − 2 ]       [ 1 ]      [ 0  ]      [ 1 ]   [ −λ1 − λ3 ]

(3) x + y − 1 = 0, (4) λ2(1/2 − x) = 0, and (5) λ3(y − 3/4) = 0

plus 1/2 − x ≤ 0, y − 3/4 ≤ 0 and λ2, λ3 ≥ 0.
Everything is linear except complementary slackness, so try different cases for these and
see what happens.
Case 1: Suppose λ2 = λ3 = 0. Then (1), (2) & (3) are 3 linear equations in x, y, λ1
with solution x = 0, y = 1 and λ1 = 1. This violates the constraint 1/2 − x ≤ 0. No
solutions in this case.
Case 2: Suppose λ2 = 0, λ3 > 0. Then (5) gives y = 3/4, and hence (3) gives x = 1/4.
This violates the constraint 1/2 − x ≤ 0. No solutions in this case.
Example
(1–2) [ x − 1 ] = −λ1 [ 1 ] − λ2 [ −1 ] − λ3 [ 0 ] = [ −λ1 + λ2 ]
      [ y − 2 ]       [ 1 ]      [ 0  ]      [ 1 ]   [ −λ1 − λ3 ]

(3) x + y − 1 = 0, (4) λ2(1/2 − x) = 0, and (5) λ3(y − 3/4) = 0

plus 1/2 − x ≤ 0, y − 3/4 ≤ 0 and λ2, λ3 ≥ 0.
Case 3: Suppose λ2 > 0, λ3 = 0. Then (4) gives x = 1/2, and hence (3) gives y = 1/2.
This and λ3 = 0 in (2) gives λ1 = 3/2, and finally (1) gives λ2 = 1. Inequalities? All hold —
solution found!
Case 4: Suppose λ2, λ3 > 0. Then (4) & (5) give x = 1/2 and y = 3/4, which violates (3).
No solutions in this case.
All together: there is only one KKT point, (x∗, y∗) = (1/2, 1/2) with λ∗1 = 3/2, λ∗2 = 1 and
λ∗3 = 0.
Exercise: plot the feasible region, and figure out where f is small/large. Use this to
show (x∗, y∗) is a local minimiser (not a maximiser).
KKT for LPs
The KKT conditions work for any constrained problem (provided the functions are
smooth and a CQ holds). What about for LPs?
max_{x∈Rⁿ} Z = cᵀx s.t. Ax ≤ b, x ≥ 0

The constraints are all linear/affine, so a CQ holds (and all functions are linear, so
differentiable) — the assumptions of the KKT theorem hold!
Using our notation, we have m + n constraints (where aⱼᵀ is the jth row of A):

hj(x) = aⱼᵀx − bj ≤ 0, j = 1, . . . , m
hm+j(x) = −xj ≤ 0, j = 1, . . . , n

Computing gradients (remembering to convert max to min):

∇f(x) = −c, ∇hj(x) = aj (j = 1, . . . , m) and ∇hm+j(x) = −ej (j = 1, . . . , n)
KKT for LPs
The KKT conditions are (splitting the Lagrange multipliers for Ax ≤ b and x ≥ 0 into
separate variables): find x ∈ Rⁿ, y ∈ Rᵐ and λ ∈ Rⁿ such that

−c = −∑_{j=1}^m yj aj + ∑_{j=1}^n λj ej,
Ax ≤ b, x ≥ 0,
y, λ ≥ 0,
yj(aⱼᵀx − bj) = λj(−xj) = 0

The first equation is −c = −Aᵀy + λ, or Aᵀy = c + λ. Since λ ≥ 0, we could write
Aᵀy ≥ c.
So we have Aᵀy ≥ c and y ≥ 0. These are the constraints for the dual problem!
Actually, the connection is even closer than this...
KKT for LPs
Suppose x∗, y∗, λ∗ is a KKT point for the original LP, and let y ∈ Rᵐ be any other
dual feasible point (Aᵀy ≥ c and y ≥ 0). Then,

bᵀy ≥ (x∗)ᵀAᵀy            (Ax∗ ≤ b and y ≥ 0)
    ≥ (x∗)ᵀc              (Aᵀy ≥ c and x∗ ≥ 0)
    = (x∗)ᵀ(Aᵀy∗ − λ∗)    (Aᵀy∗ − λ∗ = c)
    = (x∗)ᵀAᵀy∗           (λ∗j x∗j = 0)
    = (Ax∗ − b)ᵀy∗ + bᵀy∗ (rearrange)
    = bᵀy∗                (y∗j(aⱼᵀx∗ − bj) = 0)

So, the Lagrange multipliers y∗ for the original LP minimise bᵀy over all dual feasible y:
the Lagrange multipliers y∗ are the solution to the dual problem!
This also gives us an extra fact about duality for LPs: complementary slackness,
y∗j(aⱼᵀx∗ − bj) = 0 for all j = 1, . . . , m.
Convexity
Just like for unconstrained problems, we have the implications (under reasonable
assumptions)
Global min =⇒ Local min =⇒ KKT point
and the implications do not usually go the other way.
One key result about existence of global extrema for constrained problems is:
Theorem 4.13 (Weierstrass Extreme Value Theorem). If f is continuous and the feasible
region Ω is closed and bounded, then there exist a global minimiser and a global maximiser
of f in Ω.
So if Ω is closed and bounded, then at least one KKT point is a global minimiser and
at least one is a global maximiser (check f(x∗) to see which ones). We will not study how
to classify KKT points as local minima/maxima using Hessians (but it is possible).
What about convexity? If f and Ω are convex (tx + (1 − t)y ∈ Ω ∀x,y ∈ Ω,
t ∈ [0, 1]), then
Global min ⇐⇒ Local min ⇐⇒ KKT point (if CQ, smoothness)
Quadratic Programming
One important class of convex problems are quadratic programs (QPs):

min_{x∈Rⁿ} xᵀQx + gᵀx
s.t. (linear constraints, e.g. Ax ≤ b, Ax = b, x ≥ 0, ...)

The objective is convex if Q is symmetric & positive semidefinite, and the feasible region
is convex (exercise!).
The functions involved are smooth and a CQ holds (all constraints linear/affine). So,
KKT points are local & global minimisers for QPs.
If there are no inequality constraints, the KKT conditions are just a big linear system!
We will see QPs later in the course when looking at portfolio theory.
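To see the “big linear system” point concretely, here is a sketch with made-up data Q, g,
A, b: for an equality-constrained QP, stationarity of the Lagrangian gives 2Qx + g + Aᵀλ = 0,
which together with Ax = b is a single linear solve.

    # Solve an equality-constrained QP via its KKT system (illustrative sketch).
    import numpy as np

    Q = np.array([[2.0, 0.0], [0.0, 1.0]])   # symmetric positive definite
    g = np.array([-1.0, -1.0])
    A = np.array([[1.0, 1.0]])               # single constraint x1 + x2 = 1
    b = np.array([1.0])

    # Stationarity: 2Qx + g + A^T lam = 0; feasibility: Ax = b
    n, m = Q.shape[0], A.shape[0]
    KKT = np.block([[2*Q, A.T], [A, np.zeros((m, m))]])
    rhs = np.concatenate([-g, b])
    sol = np.linalg.solve(KKT, rhs)
    x, lam = sol[:n], sol[n:]
    print(x, lam)   # x = (1/3, 2/3), lam = -1/3 for this data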
Algorithms for Constrained Optimisation
What algorithms are there for constrained optimisation?
Just like for unconstrained optimisation, there are a few common approaches:
• Penalty/barrier methods: replace constraints with an extra term in the objective which
encourages feasible solutions (→ solve an unconstrained problem). For example, replace
g(x) = 0 with

min_{x∈Rⁿ} f(x) + M ∑_{i=1}^m gi(x)², for some large M > 0

(Similar to the big-M method for the simplex method)
• Sequential Quadratic Programming (SQP): replace functions in original problem
with quadratic Taylor series (objective) or linear Taylor series (constraints) based at
current solution estimate. Solve the resulting QP to get a new solution estimate.
In general, it is very hard to implement constrained optimisation algorithms well: use
existing software! (e.g. SciPy in labs)
Chapter 5
Probability Review
5.1 Basic Concepts
Probability
Probability is the mathematical study of uncertainty and randomness. Many real-world
phenomena are modelled using randomness:
• Financial markets
• Weather (links to finance: electricity/gas prices, insurance)
• Physics & chemistry (e.g. small-scale motions of atoms)
From a mathematical point of view, be aware that “probability is a good model” ≠
“actually random” (e.g. weather).
This short section builds up the mathematical structures underlying probability theory,
and introduces key probabilistic concepts (many of which you likely already know).
After this, we will look at how to make decisions when there is uncertainty/randomness
present (especially in the context of finance).
Experiments
A random experiment is any process that generates observable information about a
random phenomenon (e.g. roll a dice, observe today’s stock prices). Performing a single
experiment is called a trial, and the result of a trial is called an outcome.
In theory, we can usually write down all possible outcomes (where the experiment
involves picking one of them). This is called the sample space Ω.
The sample space Ω could be finite (e.g. dice roll) or infinite (e.g. a value in R, like a
stock price).
Example: what is the sample space for tossing a coin twice?
Ω = {(H,H), (H,T ), (T,H), (T, T )}
Once we have defined an experiment (with an associated sample space), we can start
asking questions:
• What is the likelihood of getting two heads?
• What is the likelihood that the second coin shows a different side to the first coin?
Events
These types of questions relate to events. Mathematically, events are subsets of Ω.
• Two heads: A = {(H,H)}
• Different results: A = {(H,T ), (T,H)}
Every experiment has at least two events: A = Ω and A = ∅. Intuitively, what should be
the probabilities of those events?
Probability
The probability of an event is a number between 0 and 1. Formally, we can think of
this as a function:
P : {subsets of Ω} → R
Of course, not every such function represents a ‘probability’ (e.g. P(A) = 0 for every event
A): we have to impose some restrictions on P.
Intuition: suppose we do an experiment N times and measure the number of times
event A occurs, N(A) (e.g. A = coin is a head). The ‘probability’ of A should be such
that

N(A)/N ≈ P(A)

This approximation should get better as N → ∞. That is, P(A) is the fraction of Ω that
is in A.
Probability
Intuition 2: suppose all outcomes ω ∈ Ω are equally likely. For example, “toss 3
coins” has 8 outcomes:
Ω = {HHH,HHT,HTH,HTT,THH,THT,TTH,TTT}
If all outcomes are equally likely, we can assign a weight of 1/8 to each one.
Then, if A = {at least two heads}, this corresponds to 4 possible outcomes
A = {HHH,HHT,HTH,THH}
so the likelihood of A is 4/8, or (roughly) “A includes half of Ω”.
These two ideas are the same: P(A) should be the fraction of Ω taken up by A. Our
formal definition should ensure P represents this idea: what are the minimal assumptions
needed for this?
Probability
Formally, a function
P : {subsets of Ω} → R
is a probability measure if it satisfies the following properties:
• P(A) ≥ 0 for all events A
• P(Ω) = 1
• For disjoint events A1, . . . , An (i.e. Ai ∩ Aj = ∅ whenever i ≠ j), we have
P(∪_{i=1}^n Ai) = ∑_{i=1}^n P(Ai)
This definition is enough to formalise our intuition. For example, it gives us some
unsurprising results that we would expect of ‘probabilities’...
Basic Results
Implication 1: P(A^C) = 1 − P(A)
Proof: pick disjoint sets A and A^C, so
1 = P(Ω) = P(A) + P(A^C)
Implication 2: P(∅) = 0
Proof: pick A = Ω in implication 1.
Implication 3: if A ⊆ B then P(B \ A) = P(B)− P(A)
Note: B \ A is “set minus”, i.e. B \ A := {ω : ω ∈ B, ω ∉ A} = B ∩ A^C.
Proof: pick disjoint sets A and B \ A, so
P(B) = P(A) + P(B \ A)
Basic Results
Implication 4: if A ⊆ B then P(A) ≤ P(B)
Proof: from implication 3,
P(B)− P(A) = P(B \ A) ≥ 0
Implication 5: P(A) ≤ 1 for all events A
Proof: take B = Ω in implication 4.
Implication 6: P(A ∪B) = P(A) + P(B)− P(A ∩B)
Proof: take disjoint subsets of A ∪B,
A ∪B = [A \ (A ∩B)] ∪ (A ∩B) ∪ [B \ (A ∩B)]
and so from implication 3,
P(A ∪B) = [P(A)− P(A ∩B)] + P(A ∩B) + [P(B)− P(A ∩B)]
Conditional Probability
When we look at real-world probabilities, we want to take into account information we
already know when assessing uncertainty: what is the likelihood the stock price increases
this year, given it went up last year?
Incorporating extra information is done through conditional probability. We write
P(A|B) to be “the probability of event A, given that we know event B occurred”. Thinking
intuitively, we arrive at the definition
P(A|B) = P(A and B both occurred) / P(B occurred) = P(A ∩ B)/P(B)
Alternatively, we can write P(A ∩B) = P(A|B)P(B).
Some information is not relevant for some questions: what is the likelihood the stock
price increases this year, given I am thirsty? We call two events independent if:
P(A|B) = P(A) and P(B|A) = P(B)
Or, P(A ∩B) = P(A)P(B).
Example
Example: suppose that the probability that a lecturer makes no mistakes during a
lecture is 1/4.
Then the probability that a lecturer makes at least one mistake during a lecture is
P(mistake) = 1 − 1/4 = 3/4
If making a mistake in different lectures is independent (hopefully!), what is the
probability that the lecturer makes no mistakes over the course of 3 consecutive lectures?
P(3× no mistakes) = P(no mistakes in L1) × P(no mistakes in L2) × P(no mistakes in L3) = (1/4)³
This course has 39 lectures, so read everything carefully and check things for yourself!
Sally Clark
Probability can be easily misunderstood/misconstrued and should be treated carefully!
(e.g. Monty Hall problem)
Sally Clark (UK) was convicted of murdering her two children as infants in 1996
and 1998. The only other explanation (correct!) was Sudden Infant Death Syndrome
(SIDS). An expert witness gave statistical evidence why SIDS was an extremely unlikely
explanation:
Medical data suggests the likelihood of death from SIDS is 1/8543. So, both children dying of
SIDS has probability (1/8543)² ≈ 1 in 73 million (i.e. extremely unlikely). It’s very likely they were
murdered.
But death by SIDS is not independent within a family (maybe genetics?), so P(A∩B) ̸=
P(A)P(B). Also, prosecutor’s fallacy: death by murder is also extremely unlikely, so need
to look at relative likelihood to determine guilt/innocence.
The Royal Statistical Society (among others) publicly questioned the analysis and she
was eventually released after 3 years in jail. Similar issues appeared in Kathleen Folbigg’s
case (Hunter Valley, released June 2023).
Bayes’ Rule
An important result from conditional probability is Bayes’ rule
P(A|B) = P(A ∩ B)/P(B) = P(B|A)P(A) / [P(B|A)P(A) + P(B|A^C)P(A^C)]
where the denominator comes from the law of total probability: for disjoint events
A1, . . . , An with A1 ∪ · · · ∪ An = Ω,
P(B) = P(B ∩ A1) + · · · + P(B ∩ An) = P(B|A1)P(A1) + · · · + P(B|An)P(An)
Simple example: medical testing
P(sick | test positive) = P(test positive | sick)P(sick) / [P(test positive | sick)P(sick) + P(test positive | well)P(well)]
Accurate tests have P(test positive|sick) ≈ 1 − ϵ and P(test positive|well) ≈ ϵ, but the
conclusion also depends on how common the illness is (P(sick) vs. P(well)).
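As a quick numerical sketch (the test error rate ϵ = 0.01 and prevalence P(sick) = 0.001 below are illustrative, not from the notes):

    # Bayes' rule for the medical testing example, with illustrative numbers.
    eps = 0.01          # P(test positive | well) = 1 - P(test positive | sick)
    p_sick = 0.001      # P(sick): the illness is rare

    p_pos_sick, p_pos_well = 1 - eps, eps
    p_sick_given_pos = (p_pos_sick * p_sick) / (
        p_pos_sick * p_sick + p_pos_well * (1 - p_sick))
    print(p_sick_given_pos)   # ~0.09: most positive tests are false positives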
5.2 Discrete Random Variables
Random Variables
Sometimes we are not interested in the outcome of the experiment (i.e. how the random
outcome was generated), but in something derived from the outcome.
• If stock prices are the experiments, what is my profit/loss?
• If the evolution of the weather is the experiment, how much will you claim on your
insurance?
Random variables are functions from Ω to a state space (often R or a subset like Z). If
the state space (not Ω) is discrete, we have a discrete random variable (e.g. X : Ω→ Z),
otherwise it’s continuous (e.g. X : Ω→ R).
The probability associated with the value of a random variable comes from looking at
the values in Ω that give the result. For example,
P(X = a) = P({ω : X(ω) = a}), P(X ≤ a) = P({ω : X(ω) ≤ a})
However, it often is easier to quantify P(X = a) directly and ignore the specific underlying
Ω (which we will do).
Random Variables
We also have independence for random variables.
Two random variables X and Y are independent if (for all a and b)
P(X = a and Y = b) = P(X = a)P(Y = b)
i.e. knowing the value of X doesn’t tell you anything about the likelihood of different
values of Y .
Functions of random variables are defined in the natural way: if X is a random variable
and f is any function, then f(X) is a random variable with
P(f(X) = a) = P({ω : f(X(ω)) = a}), P(f(X) ≤ a) = P({ω : f(X(ω)) ≤ a}), . . .
Example
Example: A coin is tossed n times, where the probability of a head is p each
time (independence!). The sample space is
Ω = {H,T}n
Given independence, the probability of a single outcome is
P({ω}) = p^(# heads in ω) × (1 − p)^(# tails in ω)
Let X be the number of heads (random variable!), X : Ω→ {0, . . . , n}.
The probability of a given value of X is
P(X = k) = P({ω : X(ω) = k}) = (n choose k) p^k (1 − p)^(n−k)
since there are (n choose k) outcomes with exactly k heads.
Binomial Distribution
The above example follows a binomial distribution.
A random variable X has a binomial distribution with parameters n and p (written
X ∼ B(n, p)) if
P[X = k] = (n choose k) p^k (1 − p)^(n−k),   k = 0, . . . , n
X represents “how many successes occurred in n independent trials, if the probability of
success in each trial is p”.
Any discrete random variable is fully determined by its probability mass function,
pX(xi) = P(X = xi).
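As a sanity check, the pmf formula can be compared against scipy.stats (the parameters n = 10, p = 0.3 are illustrative):

    # Checking the binomial pmf formula against scipy.stats.
    from math import comb
    from scipy.stats import binom

    n, p = 10, 0.3
    for k in range(n + 1):
        by_hand = comb(n, k) * p**k * (1 - p)**(n - k)
        assert abs(by_hand - binom.pmf(k, n, p)) < 1e-10
    print(binom.pmf(3, n, p))   # P(X = 3) ≈ 0.2668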
Expectation
If we have a random variable, we often want to compute particular quantities to
understand the distribution better.
The simplest one is expected value (i.e. mean/average): it tells us the ‘typical size’ of
a random variable.
If X is a discrete random variable taking values x1, . . . , xn with probabilities p1, . . . , pn,
then
E[X] := ∑_{i=1}^n pi xi = p1x1 + · · · + pnxn
The mean is sometimes denoted µX (or µ if there is only one random variable being
discussed).
Note that ∑_{i=1}^n pi = 1, so this is just a weighted average.
E[X] is just a number, it is not a random quantity.
Expectation
Theorem 5.1. Let X and Y be two discrete random variables and a, b ∈ R be constants.
Then E[aX + bY ] = aE[X] + bE[Y ] (i.e. E[·] is a linear operator).
Proof: define the joint probability distribution over X and Y by
P(X = xi, Y = yj) = pij, i = 1, . . . , n, j = 1, . . . ,m
Then we get the marginal distributions (i.e. probability of a subset)
P(X = xi) = ∑_{j=1}^m pij and P(Y = yj) = ∑_{i=1}^n pij
(why? sum over disjoint events Y = yj for all j, etc) Then we compute the expectation
E[aX + bY] = ∑_{i=1}^n ∑_{j=1}^m pij (a xi + b yj) = a ∑_{i=1}^n (∑_{j=1}^m pij) xi + b ∑_{j=1}^m (∑_{i=1}^n pij) yj = aE[X] + bE[Y]
Example
Example: if X ∼ B(n, p), what is E[X]?
Directly from the definition, we would compute
E[X] = ∑_{k=0}^n (n choose k) p^k (1 − p)^(n−k) · k = ??
This isn’t so easy to evaluate!
Easier: if Xi ∈ {0, 1} represents “success in the ith attempt?”, then X = ∑_{i=1}^n Xi.
But we can easily compute
E[Xi] = p · 1 + (1− p) · 0 = p
and so E[X] = np by linearity.
Expectation
If we have a function of a random variable, then expectation works in the obvious way:
E[f(X)] = ∑_{i=1}^n pi f(xi)
if X takes values x1, . . . , xn with probabilities p1, . . . , pn.
How does expectation work with independence?
Theorem 5.2. If X and Y are independent discrete random variables, then E[XY ] =
E[X]E[Y ].
Proof: independence gives P(X = xi, Y = yj) = P(X = xi)P(Y = yj), so
E[XY] = ∑_{i=1}^n ∑_{j=1}^m P(X = xi, Y = yj) xi yj = (∑_{i=1}^n P(X = xi) xi)(∑_{j=1}^m P(Y = yj) yj) = E[X]E[Y]
Note: linearity of expectation result doesn’t require independence.
Expectation
The properties of expectation extend to functions of random variables.
Theorem 5.3. Let X and Y be discrete random variables and f , g be any functions.
Then
• If a, b ∈ R are constants then E[af(X) + bg(Y )] = aE[f(X)] + bE[g(Y)]
• If X and Y are independent, then E[f(X)g(Y )] = E[f(X)]E[g(Y )]
Proof: exercise! (mimic above proofs)
Variance
The next important summary statistic is variance:
Var(X) := E[(X − µ)²]
where µ := E[X]. The variance is sometimes denoted σX² or σ². The value σ = √Var(X)
is called the standard deviation (note Var(X) ≥ 0 always).
Variance/standard deviation is a measure of uncertainty in X (how different is it to
µ). For example, if Var(X) = 0 then X = µ always.
Which one to use? Variance is mathematically nicer to deal with, standard deviation
is more informative/intuitive (same units as X).
Lemma 5.4. For any random variable X, we have Var(X) = E[X²] − E[X]².
Proof: if µ = E[X] then use linearity to get
Var(X) = E[X² − 2µX + µ²] = E[X²] − 2µE[X] + µ² = E[X²] − 2µ² + µ² = E[X²] − µ²
Variance
Unlike expectation, variance is not a linear operator. It does have some nice properties,
though!
Theorem 5.5. Let X and Y be random variables and a ∈ R be a constant. Then
(a) Var(aX) = a²Var(X)
(b) Var(X + a) = Var(X)
(c) If X and Y are independent, then Var(X + Y ) = Var(X) + Var(Y )
Proof: (a) compute
Var(aX) = E[(aX − E[aX])²] = E[(aX − aE[X])²] = E[a²(X − E[X])²] = a²Var(X)
(b) compute
Var(X + a) = E[(X + a − E[X + a])²] = E[(X + a − E[X] − a)²] = E[(X − E[X])²] = Var(X)
Variance
Proof: (c) expand and use linearity of expectations to get
Var(X + Y) = E[(X + Y − E[X + Y])²]
= E[(X + Y − E[X] − E[Y])²]
= E[(X − E[X])² + (Y − E[Y])² + 2(X − E[X])(Y − E[Y])]
= Var(X) + Var(Y) + 2E[(X − E[X])(Y − E[Y])]
= Var(X) + Var(Y) + 2E[XY − XE[Y] − YE[X] + E[X]E[Y]]
= Var(X) + Var(Y) + 2(E[XY] − 2E[X]E[Y] + E[X]E[Y])
= Var(X) + Var(Y) + 2(E[XY] − E[X]E[Y])
The last term is zero because E[XY ] = E[X]E[Y ] (since X and Y are independent).
Example
Example: if X ∼ B(n, p), what is Var(X)?
As before, we have X = ∑_{i=1}^n Xi where Xi is “success on ith trial”, i.e.
Xi = 1 with probability p, and Xi = 0 with probability 1 − p.
We have E[Xi] = p, so compute
Var(Xi) = E[(Xi − p)²] = p · (1 − p)² + (1 − p) · (0 − p)²
= (p − 2p² + p³) + (p² − p³)
= p − p² = p(1 − p)
Since the Xi are independent, Var(X) = ∑_{i=1}^n Var(Xi) = np(1 − p).
5.3 Continuous Random Variables
Continuous Random Variables
Now let’s consider the case where X : Ω→ R is a continuous random variable. That
is, it can take uncountably infinitely many possible values.
What is P(X = a) for some value a? Zero (usually)! Otherwise, we would get something
like ∑_a P(X = a) = ∞.
Instead, we look at ‘summing’ the probability in some interval, P(X ∈ [a, b]). This
gives us an integral
P(X ∈ [a, b]) = ∫_a^b f(x) dx
for some function f (depending on X).
The function f (sometimes fX) is called the probability density function for X
(c.f. probability mass function for discrete random variables).
Note that we must have ∫_R f(x) dx = 1.
Continuous Random Variables
A useful related concept is the cumulative distribution function for X:
F(a) = ∫_{−∞}^a f(x) dx = P(X ≤ a)
Note that standard properties of integrals gives
P(X ∈ [a, b]) = F (b)− F (a)
and by the fundamental theorem of calculus, f(x) = F ′(x).
Functions of continuous random variables are defined in the same way as the discrete
case, e.g.
P(g(X) ∈ [a, b]) = ∫_{x : g(x)∈[a,b]} f(x) dx
(just change sums over outcomes to integrals, and mass function → density function)
Mean and Variance
The definition of expected value for a continuous random variable is
E[X] = ∫_R x f(x) dx
The definition of variance is the same as the discrete case:
Var(X) = E[(X − E[X])²] = ∫_R (x − E[X])² f(x) dx
All the same properties of mean and variance hold in the continuous case (e.g. expectation
is linear, Var(X + Y ) = Var(X) + Var(Y ) if X and Y are independent, ...)
Uniform Distribution
The simplest continuous random variable is the uniform distribution on an interval
[a, b], written X ∼ Unif([a, b]).
The probability density function is
f(x) = 1/(b − a) for a < x < b, and f(x) = 0 otherwise.
It has mean E[X] = (a + b)/2 and variance Var(X) = (b − a)²/12.
Normal Distribution
The most important continuous random variable is the normal/Gaussian distribution.
For parameters µ ∈ R and σ > 0, we write X ∼ N(µ, σ²) with probability density
function
f(x) = (1/(√(2π) σ)) exp(−(x − µ)²/(2σ²)),   ∀x ∈ R
(it’s not obvious that ∫_R f(x) dx = 1 is true!)
It has mean E[X] = µ and variance Var(X) = σ² (hence why we use this notation).
The value of a normal random variable is clustered close to the mean. For example,
P(µ− σ < X < µ+ σ) ≈ 0.683, P(µ− 3σ < X < µ+ 3σ) ≈ 0.997
The standard normal distribution has µ = 0 and σ = 1, i.e. X ∼ N(0, 1).
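These clustering probabilities are easy to verify numerically; a short sketch using scipy.stats.norm (any µ and σ > 0 give the same answers):

    # Verifying P(mu - k*sigma < X < mu + k*sigma) for k = 1, 3.
    from scipy.stats import norm

    mu, sigma = 0.0, 1.0        # illustrative; results hold for any mu, sigma
    X = norm(mu, sigma)
    print(X.cdf(mu + sigma) - X.cdf(mu - sigma))        # ≈ 0.683
    print(X.cdf(mu + 3*sigma) - X.cdf(mu - 3*sigma))    # ≈ 0.997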
5.4 Covariance and Correlation
Covariance
Now let’s think about the situation where we have multiple random variables. It will
be important to talk about associations/links between them:
• The stock prices of BHP and Rio Tinto are somewhat related (both mining compa-
nies)
• Electricity prices are related to weather (hot day→ high demand for air conditioning)
The covariance of two random variables X and Y is
Cov(X, Y ) := E[(X − E[X])(Y − E[Y ])]
This measures the joint uncertainty in X and Y together.
Note: we saw this quantity in the proof of Var(X + Y ) = Var(X) + Var(Y ) for
independent X and Y .
Covariance
Theorem 5.6. Suppose X, Y and Z are random variables, and a, b ∈ R are constants.
Then
(a) Cov(X,X) = Var(X)
(b) Cov(X, Y ) = E[XY ]− E[X]E[Y ]
(c) Cov(X, Y ) = Cov(Y,X)
(d) Cov(aX + bY, Z) = aCov(X,Z) + bCov(Y, Z)
Proof: When we proved that Var(X + Y ) = Var(X) +Var(Y ) for independent X and
Y , we inadvertently proved (b). For (a) compute
Cov(X,X) = E[(X − E[X])(X − E[X])] = E[(X − E[X])2] = Var(X)
For (c) compute
Cov(X, Y ) = E[(X − E[X])(Y − E[Y ])] = E[(Y − E[Y ])(X − E[X])] = Cov(Y,X)
Covariance
Proof: For (d), we use (b) to get
Cov(aX + bY, Z) = E[(aX + bY )Z]− E[aX + bY ]E[Z]
= aE[XZ] + bE[Y Z]− (aE[X] + bE[Y ])E[Z]
= a(E[XZ]− E[X]E[Z]) + b(E[Y Z]− E[Y ]E[Z])
= aCov(X,Z) + bCov(Y, Z)
Note: properties (c) and (d) are also satisfied by the usual vector dot product on Rn
(also (a) if you replace ‘variance’ with ‘length’). In general, covariance is an inner product
on the vector space of random variables.
Correlation
The covariance between two random variables can be positive, negative or zero.
It is often useful to look at a scaled covariance, called the correlation coefficient:
Corr(X, Y) := Cov(X, Y) / √(Var(X) Var(Y))
The correlation coefficient is sometimes denoted ρXY or just ρ.
Theorem 5.7. If X and Y are random variables and a, b ∈ R are constants, then
(a) −1 ≤ Corr(X, Y ) ≤ 1 ← this is helpful for intuition
(b) If X and Y are independent, then Corr(X, Y ) = Cov(X, Y ) = 0
(c) If Y = aX + b is a linear function of X, then
Corr(X, Y) = 1 if a > 0, and Corr(X, Y) = −1 if a < 0
Independent vs. Uncorrelated
Warning! Independent and uncorrelated (ρ = 0) are different!
• X and Y independent = knowing the value of X tells you nothing about the value
of Y
• X and Y uncorrelated = knowing that X is above/below E[X] doesn’t tell you if Y
is above/below E[Y ]
Independent implies uncorrelated, but not the other way around.
Example: Suppose X ∼ N(0, 1) and Y = X². Then since E[X] = 0,
Cov(X, Y) = E[XY] − E[X]E[Y] = E[XY] = ∫_R x³ f(x) dx = 0
since f(x) is an even function and x³ is odd. So X and Y are uncorrelated.
But clearly X and Y are still related (i.e. not independent)! For example,
P(X ∈ [0, 1], Y ∈ [0, 1]) = P(X ∈ [0, 1]) ≈ 0.341
P(X ∈ [0, 1])P(Y ∈ [0, 1]) = P(X ∈ [0, 1])P(X ∈ [−1, 1]) = 2P(X ∈ [0, 1])² ≈ 0.233
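A short simulation sketch of this example (sample size and seed are arbitrary):

    # X ~ N(0,1), Y = X^2: uncorrelated but clearly not independent.
    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.standard_normal(1_000_000)
    y = x**2
    print(np.corrcoef(x, y)[0, 1])                         # ≈ 0 (uncorrelated)
    print(np.mean((0 <= x) & (x <= 1) & (y <= 1)))         # ≈ 0.341
    print(np.mean((0 <= x) & (x <= 1)) * np.mean(y <= 1))  # ≈ 0.233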
Covariance Matrices
Now let’s suppose we have several random variables, X1, . . . , Xn. We have n² covariances,
Cov(Xi, Xj). Put them in an n × n covariance matrix C with Ci,j = Cov(Xi, Xj):

C = [ Cov(X1, X1)  Cov(X1, X2)  · · ·  Cov(X1, Xn) ]
    [ Cov(X2, X1)  Cov(X2, X2)  · · ·  Cov(X2, Xn) ]
    [     ...           ...      ...       ...     ]
    [ Cov(Xn, X1)  Cov(Xn, X2)  · · ·  Cov(Xn, Xn) ]

  = [ Var(X1)      Cov(X1, X2)  · · ·  Cov(X1, Xn) ]
    [ Cov(X2, X1)  Var(X2)      · · ·  Cov(X2, Xn) ]
    [     ...           ...      ...       ...     ]
    [ Cov(Xn, X1)  Cov(Xn, X2)  · · ·  Var(Xn)     ]
Since Cov(X, Y) = Cov(Y, X), the matrix C is symmetric, C = C^T. So, C has real
eigenvalues and is orthogonally diagonalisable (c.f. properties of Hessians).
Covariance Matrices
The associated correlation matrix P ∈ Rn×n is formed the same way, Pij = Corr(Xi, Xj).
P = [ 1             Corr(X1, X2)  · · ·  Corr(X1, Xn) ]
    [ Corr(X2, X1)  1             · · ·  Corr(X2, Xn) ]
    [     ...            ...       ...        ...     ]
    [ Corr(Xn, X1)  Corr(Xn, X2)  · · ·  1            ]
remembering that Corr(X, X) = 1 always. It is also symmetric, P = P^T.
If the variables Xi are pairwise independent (i.e. Xi and Xj are independent whenever
i ≠ j) then
C = diag(Var(X1), Var(X2), . . . , Var(Xn))   (a diagonal matrix)   and   P = I
Covariance Matrices
Note that P can be formed from C by dividing the ith row of C by σi = √Var(Xi) and
the jth column of C by σj. In matrix form, this is
P = diag(C)^{−1/2} · C · diag(C)^{−1/2}
where
diag(C) = diag(Var(X1), . . . , Var(Xn)) = diag(σ1², . . . , σn²)
so diag(C)^{−1/2} is diagonal with entries σ1^{−1}, . . . , σn^{−1}
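In NumPy this scaling is a one-liner; a sketch with an illustrative 3 × 3 covariance matrix:

    # P = diag(C)^(-1/2) C diag(C)^(-1/2) for an illustrative covariance matrix.
    import numpy as np

    C = np.array([[4.0, 1.0, 0.5],
                  [1.0, 2.0, 0.3],
                  [0.5, 0.3, 1.0]])
    D = np.diag(1.0 / np.sqrt(np.diag(C)))   # diag(C)^(-1/2)
    P = D @ C @ D
    print(np.diag(P))   # all ones: Corr(X, X) = 1
    print(P)            # symmetric, off-diagonal entries in [-1, 1]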
Covariance Matrices
Theorem 5.8. For any X1, . . . , Xn, the covariance and correlation matrices C and P are
positive semidefinite.
Proof: pick any y ∈ Rn and show y^T C y ≥ 0:
y^T C y = ∑_{i=1}^n ∑_{j=1}^n yi yj Ci,j = ∑_{i=1}^n ∑_{j=1}^n yi yj Cov(Xi, Xj) = Cov(∑_{i=1}^n yi Xi, ∑_{j=1}^n yj Xj) = Var(∑_{i=1}^n yi Xi) ≥ 0
Then write P as P = S^{−1/2} C S^{−1/2} for S = diag(C), giving
y^T P y = y^T S^{−1/2} C S^{−1/2} y = (S^{−1/2} y)^T C (S^{−1/2} y) ≥ 0
using the fact that C is positive semidefinite.
Covariance Matrices
The last useful fact we will need later is the following, about linear combinations of
random variables.
Theorem 5.9. Let X1, . . . , Xn be random variables and define Y1 = ∑_{i=1}^n a1,i Xi = a1^T X
and Y2 = ∑_{i=1}^n a2,i Xi = a2^T X (where X ∈ Rn has entries Xi). Then
Cov(Y1, Y2) = a1^T C a2
Proof: just like in the positive semidefinite proof, directly compute
a1^T C a2 = ∑_{i=1}^n ∑_{j=1}^n a1,i a2,j Ci,j = ∑_{i=1}^n ∑_{j=1}^n a1,i a2,j Cov(Xi, Xj)
= Cov(∑_{i=1}^n a1,i Xi, ∑_{j=1}^n a2,j Xj) = Cov(Y1, Y2)
Chapter 6
Decision Making Under Uncertainty
Decision Making Under Uncertainty
We are now ready to start discussing concepts from mathematical finance.
We will first look at how to make decisions in the presence of uncertainty/risk:
• What stocks should I buy to maximise my future returns?
• Is it better to invest in equity (i.e. stock) or debt (i.e. bonds ≈ term deposits)?
The outcome of financial decisions depends on the unknown future.
There are many possible future scenarios, usually some will be good and some will be
bad (for us).
What is a principled/mathematical way to make good decisions in this situation?
6.1 Pricing Risky Assets
Expected Returns
Example: you will roll one die. If you roll a 6, I pay you $12, otherwise you get
nothing. What would be a reasonable price to pay in order to play this game?
The price should be between $0 and $12 (min & max possible payout).
Your chance of getting $0 is much higher than $12, so the price is probably closer to
$0 than $12.
The principle of expected return says: a reasonable price for an uncertain payoff is the
expected return.
• A sensible seller wouldn’t offer a too-low price that would lose them money on
average
• A sensible buyer wouldn’t accept a too-high price that would lose them money on
average
In this case, the expected return is E[R] = (1/6) · $12 + (5/6) · $0 = $2.
Intuitively, if you play the game many times, on average you will win $2.
Expected Returns
Example: I will toss a coin repeatedly until it comes up heads for the first time.
• If it comes up heads on the first try, you get $2
• If it comes up heads on the second try, you get $4
• If it comes up heads on the third try, you get $8, ...
• If it comes up heads on the nth try, you get $2n
What do you think is a fair price to play this game?
The expected return from this game is
E[R] = (1/2) · $2 + (1/4) · $4 + · · · + (1/2^n) · $2^n + · · · = 1 + 1 + · · · + 1 + · · · = ∞
So you should be willing to pay any price to play this game? This is called the St
Petersburg paradox (Nicolaus Bernoulli, 1713).
This seems wrong and suggests the principle of expected return isn’t the right idea.
Risk Attitude
Example: Australian media billionaire Kerry Packer (1937–2005) liked to gamble.
One night at a high-rollers room in Las Vegas (allegedly!):
A Texan... approached Packer at the tables in Vegas and started making a nuisance of
himself. When asked to give it a rest, the Texan supposedly replied with the old, “Do you
know who I am”, to which the Packer reply was, “No”.
“I happen to be worth $100 million,” said the Texan.
“Toss you for it,” said Packer.
https://www.smh.com.au/business/packers-gambling-feats-fact-or-fiction-20051228-gdmozo.html
If you were the Texan, would you take the bet?
People tend to be risk-averse: losing money matters more than winning the same
amount. Similarly, it matters more to the Texan to win/lose $100m than to Packer.
Fun fact: bees are risk averse when deciding which flowers to visit!
https://theconversation.com/bees-are-astonishingly-good-at-making-decisions-and-our-computer-model-explains-how-thats-possible-208189
Risk Attitude
A consequence of risk-aversion is a preference to diversify investments.
Example (Daniel Bernoulli 1738 — Nicolaus’s cousin!): Sempronius owns
goods at home worth a total of 4000 ducats and in addition possesses 8000 ducats worth of
commodities in foreign countries, from where they can only be transported by sea. However,
past experience tells us that of two ships one perishes. So, sending the goods home in one
ship gives payoff:
R1 = 4 000 if the ship perishes (p = 1/2), or 12 000 if the ship survives (p = 1/2)
What if they use two ships?
R2 = 4 000 if both ships perish (p = 1/4), 8 000 if one ship survives (p = 1/2), or
12 000 if both ships survive (p = 1/4)
In both cases the expected return is E[R1] = E[R2] = 8000 ducats, but most people prefer
R2.
Utility
Making decisions based on expected return is not a reasonable model of decision-
making:
• Doesn’t handle infinite expected return (St Petersburg paradox)
• Doesn’t capture risk aversion or preference for diversification
An alternative model was proposed by Daniel Bernoulli.
Idea: the ‘benefit’ we derive from wealth w is a nonlinear function of w.
This captures the idea that $100m matters more to the Texan than to Kerry Packer,
or that $1000 matters more to you than the Texan.
In economics, the idea of a ‘benefit’ from wealth is called utility. We assume each
person has their own utility function:
U : R→ R
where U(x) < U(y) means “I prefer wealth $y to $x”.
Utility
Definition 6.1. A utility function is any function U : D ⊂ R → R that is strictly
increasing.
That is, more money is always preferred to less money (“law of non-satiety”).
Usually we will assume U is differentiable, so this means U ′(x) > 0 for all x ∈ D.
How do we represent “a fixed amount of money matters more to you the less you
have”? The “law of diminishing returns” holds if U ′′(x) < 0 (i.e. U is concave).
Recall (from optimisation): U is concave if
U(tx+ (1− t)y) ≥ tU(x) + (1− t)U(y), ∀x, y ∈ D, t ∈ [0, 1]
(i.e. −U is convex). This holds if U ′′(x) ≤ 0 for all x (or use strict inequalities throughout).
There are many strictly increasing, concave functions: U(x) =

x, U(x) = 1− e−x,
U(x) = log(x), etc.
Utility
Given a utility function, the decision-making idea is the principle of expected utility:
make the decision which has the highest expected utility (not expected wealth).
Sempronius ship example: with one ship, our expected utility is
E[U(R1)] = (1/2)U(4000) + (1/2)U(12000)
With two ships, our expected utility is
E[U(R2)] = (1/4)U(4000) + (1/2)U(8000) + (1/4)U(12000)
The diversified strategy R2 is preferred if E[U(R2)] > E[U(R1)], i.e.
U(8000) ≥ (1/2)U(4000) + (1/2)U(12000)
which is true for any concave function!
Utility
One common model for utility functions is based on the idea that the increase in utility
from a constant payoff is inversely proportional to our current wealth.
That is, U′(x) ∝ 1/x.
So, if we have U′(x) = b/x for some b > 0 (need U′(x) > 0), then
U(x) = a + b log(x)
for constants a ∈ R and b > 0.
Existence of Utility Functions
The idea of expected utility maximisation is quite natural.
For risky games X and Y , let X ⪯ (≺)Y mean “I would (strictly) prefer to play Y
instead of X” and X ∼ Y mean “I prefer X and Y equally”.
Theorem 6.2 (John von Neumann & Oskar Morgenstern, 1947). If my preferences are
such that:
• For all X and Y , either X ⪯ Y or Y ⪯ X (completeness)
• If X ⪯ Y and Y ⪯ Z then X ⪯ Z (transitivity)
• If X ⪯ Y ⪯ Z then there exists p ∈ [0, 1] such that pX + (1− p)Z ∼ Y (continuity)
• For any Z and p ∈ [0, 1], X ⪯ Y if and only if pX + (1 − p)Z ⪯ pY + (1 − p)Z
(independence)
then there exists a function U such that X ≺ Y if and only if E[U(X)] < E[U(Y )].
Pricing Assets
How does utility theory work for pricing financial assets?
Suppose we have a utility function U and currently have wealth W . A financial asset
(e.g. stock) currently has price S, and in the future it will be worth X (random variable!).
Should we buy the asset?
• If we don’t buy the asset, we keep our wealthW and get utility U(W ) (no uncertainty,
so E[U(W )] = U(W )).
• If we buy the asset, our wealth becomes W − S +X and we get (random) utility
U(W − S +X).
So we should buy the asset if
E[U(W − S +X)] ≥ E[U(W )] = U(W )
Equivalently, the maximum price we would be willing to pay for X is the solution to
find S such that E[U(W − S +X)] = U(W )
Pricing Assets
Example: consider an investor with utility function U(x) = a + b log(x) for x > 0
(with a ∈ R and b > 0), and current wealth W > 0. A financial asset X pays M > 0 or
zero with equal probability. What is the maximum price S the investor would be willing
to pay for X?
We have X = M with probability 1/2 and X = 0 with probability 1/2. We are looking
for S such that
E[U(W − S + X)] = (1/2)U(W − S + M) + (1/2)U(W − S + 0) = U(W)
Solve for S via
a + b log(W − S + M) + a + b log(W − S) = 2(a + b log(W)),
log((W − S + M)(W − S)) = log(W²),
W² − 2WS + S² + MW − MS = W²
S² − (2W + M)S + MW = 0
Pricing Assets
Example: the maximum price S satisfies S² − (2W + M)S + MW = 0, so
S = [2W + M ± √((2W + M)² − 4MW)] / 2
  = [2W + M ± √(4W² + 4WM + M² − 4MW)] / 2
  = W + M/2 ± √(W² + M²/4)
Which root to pick? Discard the larger ‘+’ root, which gives S > W (the investor can’t
afford it!), so the maximum price is
S = W + M/2 − √(W² + M²/4)
Check that 0 < S < W, as we would hope (given W, M > 0).
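As a sanity check, the closed form agrees with solving the pricing equation numerically (W = 10 and M = 6 below are illustrative; by invariance we may take a = 0, b = 1):

    # Numerical check of the maximum price S for log utility.
    import numpy as np
    from scipy.optimize import brentq

    W, M = 10.0, 6.0
    h = lambda S: 0.5*np.log(W - S + M) + 0.5*np.log(W - S) - np.log(W)

    S_numeric = brentq(h, 0.0, W - 1e-8)            # root of h in (0, W)
    S_formula = W + M/2 - np.sqrt(W**2 + M**2/4)
    print(S_numeric, S_formula)                     # both ≈ 2.56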
Invariance
One important idea with utility functions is that they don’t have any standard ‘units’.
That is, the actual value of U(x) doesn’t matter, just the relative value (compared to the
other options).
Theorem 6.3. Two investors with utility functions U(x) and Ũ(x) = aU(x) + b for some
a > 0 and b ∈ R will always make the same decisions.
Proof: suppose we have to choose between X and Y. We want to show
E[U(X)] ≥ E[U(Y)] ⇐⇒ E[Ũ(X)] ≥ E[Ũ(Y)]
Note that
E[Ũ(X)] − E[Ũ(Y)] = E[aU(X) + b] − E[aU(Y) + b] = a(E[U(X)] − E[U(Y)]).
Since a > 0, we have E[Ũ(X)] − E[Ũ(Y)] ≥ 0 if and only if E[U(X)] − E[U(Y)] ≥ 0, as
required.
Invariance
Two utility functions are called equivalent if they always lead to the same decisions
(like U and Ũ = aU + b in the above theorem).
We also have
Corollary 6.4. Prices calculated using expected utility maximisation are unaffected by
positive linear transformations of U(x).
For example, in the S = W + M/2 − √(W² + M²/4) example above, the price didn’t depend
on the choices of a and b in the utility function, U(x) = a + b log(x).
6.2 Risk Aversion
Risk Aversion
We have broadly described risk-aversion: people prefer certainty to risk, even when
the returns are the same.
Definition 6.5. An agent is risk-averse if, at any level of wealth, they never prefer a
risky payoff X with zero expected value over doing nothing.
That is, for all W > 0 and all X with E[X] = 0 and Var(X) > 0, we have
E[U(W +X)] ≤ U(W + E[X]) = U(W )
Risk Aversion
Risk-averse agents have the same preference for payoffs with nonzero expected return.
Lemma 6.6. For any W > 0 and any risky payoff X with Var(X) > 0, a risk-averse
agent satisfies E[U(W +X)] ≤ U(W + E[X]).
Proof: apply the definition with new wealth W̃ = W + E[X] and new risky payoff
X̃ = X − E[X]:
E[U(W + X)] = E[U(W̃ + X̃)] ≤ U(W̃) = U(W + E[X])
where the inequality comes from the definition of risk aversion.
Check that E[X̃] = 0 and Var(X̃) > 0.
Concavity
We previously discussed risk aversion as U being concave (U ′′ < 0). What is the
connection with our new definition?
Theorem 6.7. An agent is risk-averse if and only if their utility function is concave.
Proof idea: first, we will sketch a proof. Suppose a random variable X takes two
values x1, x2 with probability 1/2 each. The key quantities are u := E[U(W + X)] =
(1/2)U(W + x1) + (1/2)U(W + x2) and v := U(W + E[X]) = U(W + (1/2)x1 + (1/2)x2).
Risk aversion means u ≤ v:
[Figure: a concave utility curve with the chord joining (W + x1, U(W + x1)) and
(W + x2, U(W + x2)); the chord midpoint u = E[U(W + X)] lies below v = U(W + E[X]).]
Concavity
We will prove the theorem assuming U is C2 for simplicity.
Proof (⇐): First, suppose U is concave. Let W be any wealth and X be a random
variable with E[X] = 0 and Var(X) > 0.
By Taylor’s Theorem, for any value of X there exists ξ(X) between W and W +X
such that
U(W + X) = U(W) + XU′(W) + (1/2)X²U′′(ξ(X))
Take the expected value of both sides and use E[X] = 0 to get
E[U(W + X)] = U(W) + (1/2)E[X²U′′(ξ(X))]
Since X² ≥ 0 and U′′(ξ(X)) ≤ 0 always, the last term is non-positive, so E[U(W + X)] ≤
U(W).
Hence the agent is risk-averse.
Concavity
Proof (⇒): suppose U is not concave (we will show the agent is not risk-averse).
That is, there exists a point with U ′′ > 0. Since U ′′ is continuous, there is actually an
interval x ∈ [W − δ,W + δ] with U ′′(x) > 0.
Consider any random variable X taking values in [−δ, δ] with E[X] = 0 and Var(X) > 0.
Taking expectations of the Taylor series (as above) we get
E[U(W + X)] = U(W) + (1/2)E[X²U′′(ξ(X))]
where ξ(X) is between W and W + X. That is, ξ(X) ∈ [W − δ, W + δ], so U′′(ξ(X)) > 0.
Since Var(X) > 0, there is at least one value with X² > 0, so E[X²U′′(ξ(X))] > 0.
Hence E[U(W + X)] > U(W) and the agent is not risk-averse.
Note: this result is a special form of Jensen’s inequality: E[ϕ(X)] ≤ ϕ(E[X]) if
and only if ϕ is concave (inequalities reversed for ϕ convex, which shows Var(X) =
E[X²] − E[X]² ≥ 0)
Continuity
As an aside, note that a risk-averse utility function must be continuous (on any open
domain).
Measuring Risk-Aversion
We now have a nice characterisation of risk-aversion. Is there a way to quantify
statements like “I am more risk averse than you”? Two simple measures are:
• Absolute risk aversion:
A(W) := −U′′(W)/U′(W)
• Relative risk aversion (if W > 0):
R(W) := W · A(W) = −W · U′′(W)/U′(W)
If A(W ), R(W ) > 0, then the agent is risk-averse, with larger values meaning “more
risk-averse” (negative values are “risk loving”).
Absolute risk aversion measures risk appetite for fixed payoffs (e.g. X = ±$10), relative
risk aversion measures risk appetite for payoffs as % of wealth (e.g. X = ±0.01W ).
Certainty Equivalence
When we have risk aversion, we have looked at pricing a risky asset X: find S such
that
E[U(W − S +X)] = U(W )
Now we look at a different notion of value.
The certainty equivalent value of a risky asset X is the value C such that
E[U(W +X)] = U(W + C)
That is, what is the risk-free value C that gives me as much happiness as the risky asset
X?
The risk premium is the amount π that you would pay in order to receive the fixed
value E[X] instead of the risky value X:
E[U(W +X)] = U(W + E[X]− π)
From the above definitions, we see that π = E[X]− C. If we are risk averse, then π > 0.
Certainty Equivalence
We have certainty equivalent value C and risk premium π defined by
E[U(W +X)] = U(W + C) = U(W + E[X]− π)
Since U is strictly increasing, U−1 is defined (at least on the range of U), so we get the
explicit formulae:
C = U^{−1}(E[U(W + X)]) − W and π = W + E[X] − U^{−1}(E[U(W + X)])
(again, remember π = E[X]− C)
If π > 0, this means that we would strictly prefer to receive risk-free E[X] rather than
risky X. This is risk aversion!
Larger values of π indicate greater levels of risk aversion (π < 0 is risk-loving).
Certainty Equivalence
For a simple asset X which takes x1 and x2 each with probability 0.5, we get:
[Figure: as before, with W + C marked where U(W + C) = u = E[U(W + X)]; the gap
between W + E[X] and W + C is the risk premium π.]
Risk Premium
So we now have 3 ways to quantify risk aversion:
Risk Attitude    Description                               Measure
Risk averse      Prefer certainty to risk*                 A(W), R(W), π > 0, or U′′(W) < 0
Risk neutral     Indifferent between certainty and risk*   A(W), R(W), π = 0, or U′′(W) = 0
Risk loving      Prefer risk to certainty*                 A(W), R(W), π < 0, or U′′(W) > 0
* For the same expected return
For risk-neutral agents, U′′(W) = 0 means U(W) = aW + b for some a > 0. Since utility
functions are unaffected by positive linear transformations, we may assume U(W) = W.
We know R(W ) = W · A(W ), but is there a connection between these and π?
Risk Premium
What is the connection between π and A(W ),R(W )?
Suppose the risky asset has small returns, X ≈ 0 (so E[X] ≈ 0). Define µ := E[X] for
convenience, and Taylor expansions around W + E[X] = W + µ give:
U(W + µ − π) ≈ U(W + µ) − πU′(W + µ)
E[U(W + X)] ≈ E[U(W + µ) + (X − µ)U′(W + µ) + (1/2)(X − µ)²U′′(W + µ)]
= U(W + µ) + E[X − µ]U′(W + µ) + (1/2)E[(X − µ)²]U′′(W + µ)
= U(W + µ) + (1/2)Var(X)U′′(W + µ)
The risk premium π is defined by E[U(W + X)] = U(W + µ − π), so equate these
expressions to get
π ≈ (1/2)Var(X)A(W + µ) ≈ (1/2)Var(X)A(W)
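A small numerical check of this approximation, assuming log utility (for which A(W) = 1/W, as computed in the worked example below) and an illustrative small payoff X = ±$1 with W = $20:

    # pi ≈ (1/2) Var(X) A(W) for U = log, illustrative X = +-1, W = 20.
    import numpy as np

    W = 20.0
    xs, ps = np.array([1.0, -1.0]), np.array([0.5, 0.5])
    EX = ps @ xs                                  # = 0
    VarX = ps @ (xs - EX)**2                      # = 1

    EU = ps @ np.log(W + xs)                      # E[U(W + X)]
    pi_exact = W + EX - np.exp(EU)                # U^{-1}(u) = e^u for U = log
    pi_approx = 0.5 * VarX * (1.0 / W)            # (1/2) Var(X) A(W)
    print(pi_exact, pi_approx)                    # 0.02502... vs 0.025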
Insurance
Similar ideas as risk premium can be used to price insurance.
Suppose an investor has wealth W but will end up with W + X, where loss X = −L < 0
occurs with some probability p (and no loss, X = 0, otherwise). They can pay an insurance
company a fixed premium P so that the company will pay them L if that loss happens
(so their final wealth is W − P either way).
So, the investor would buy insurance if
U(W − P ) ≥ E[U(W +X)]
The highest price they would pay for insurance is Pmax such that U(W − Pmax) =
E[U(W +X)].
That is, Pmax = −C = π − E[X] = pL + π (since E[X] = −pL). So, a risk averse
investor (π > 0) would pay an insurance premium that is higher than the expected loss.
Example
Example: an investor has utility function U(W) = log(W) with initial wealth $10.
They are considering whether to invest in a risky asset X with payoff
X = $5 with probability 0.5, or $0 with probability 0.5
The ‘fair price’ (by the principle of expected return) is E[X] = (1/2)(5) + (1/2)(0) = $2.50.
The maximum price the investor would pay for X is (by the principle of expected utility)
the S such that
E[U(W − S + X)] = U(W)
(1/2) log(10 − S + 5) + (1/2) log(10 − S + 0) = log(10)
log((15 − S)(10 − S)) = 2 log(10) = log(100)
(15 − S)(10 − S) = 100
which gives S ≈ $22.81 or S ≈ $2.19. The answer is S = $2.19 (cannot pay more than
W = $10).
Example
The certainty equivalent value is C such that
E[U(W +X)] = U(W + C)
(1/2) log(10 + 5) + (1/2) log(10 + 0) = log(10 + C)
log(√(15 · 10)) = log(10 + C)
so C = √150 − 10 ≈ $2.25. The risk premium is π = E[X] − C = 2.5 − (√150 − 10) ≈ $0.25.
Since π > 0, the investor is risk averse.
Check risk aversion: U(W) = log(W) so U′(W) = 1/W and U′′(W) = −1/W². Since
U′′(W) < 0 we have risk aversion for all W. Absolute/relative risk aversion:
A(W) = −U′′(W)/U′(W) = −(−1/W²)/(1/W) = 1/W
R(W) = W · A(W) = 1
As expected, A(W ), R(W ) > 0 so the investor is indeed risk averse. Note A(W )→ 0 as
W →∞, so the investor is more willing to take a fixed $ risk as they get wealthier.
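The same numbers can be reproduced in a few lines, using the explicit U^{−1} formulae from earlier (a sketch; U^{−1}(u) = e^u for log utility):

    # Certainty equivalent and risk premium for the example above.
    import numpy as np

    W = 10.0
    xs, ps = np.array([5.0, 0.0]), np.array([0.5, 0.5])

    EU = ps @ np.log(W + xs)      # E[U(W + X)]
    C = np.exp(EU) - W            # = sqrt(150) - 10 ≈ 2.247
    pi = ps @ xs - C              # ≈ 0.253
    print(C, pi)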
6.3 Limitations [not examinable]
Limitations [not examinable]
Expected utility maximisation is a good theory, but it has its limitations:
• The von Neumann/Morgenstern assumptions don’t necessarily hold (e.g. Allais
paradox, 1953)
• People don’t always behave rationally (behavioural economics)
This is like any part of applied mathematics: any theoretical representation of the real
world must be somewhat wrong.
“...all models are approximations. Essentially, all models are wrong, but some are useful.
However, the approximate nature of the model must always be borne in mind....”
— George Box, 1987
Framing [not examinable]
One way people don’t behave rationally is framing: context typically affects our
happiness.
• Who is happier?
– Yesterday, Jack had $1 million, but today he made a massive profit and now
has $5 million
– Yesterday, Jill had $9 million, but today she made a massive loss and now has
$5 million
• Would you rather:
– Earn $100k when all your friends earn $200k
– Earn $100k when all your friends earn $50k
One extension of utility theory is prospect theory (Kahneman & Tversky, 1979):
• Utility function is S-shaped (care more about losses than gains)
• People tend to overestimate likelihood of rare events, so don’t use ‘expected’ max-
imisation (distort real probabilities → perceived probabilities)
Kahneman¹ won the 2002 Nobel Prize for Economics for prospect theory (“for having
integrated insights from psychological research into... decision-making under uncertainty”).
¹ Tversky died in 1996 and Nobel Prizes are not awarded after death.
Chapter 7
Portfolio Theory
7.1 Investment Basics
Portfolio Theory
In finance, we make decisions to invest in a variety of assets. These are typically called
securities (tradeable financial assets). Common examples include:
• Shares in companies (on the stock exchange), aka equities
– Each share represents a fraction of company ownership
• Bonds (loans to governments or companies), aka debt
– Think of these like a term deposit at a bank: you lend them money now, they
pay you back with interest in the future.
• Currencies (Australian Dollars, Euros, etc)
• Commodities (oil, iron, wheat, pork, orange juice, ...)
– Gold often considered a currency, crypto currently considered a commodity
• Derivatives: contracts whose value indirectly depends on another security (e.g. op-
tions, futures)
These assets are traded in financial markets, and their values fluctuate (supply &
demand). A portfolio is an investment comprising a collection of securities.
Risky vs. Risk-Free
Most assets are risky: prices go up and down over time (e.g. share market).
• We model (unknown) future price changes as random variables
• If we invest in risky assets, we can potentially lose money
A small number of assets are considered risk-free.
• Typically government bonds: you lend them $100 and in T years they pay you
$100e^{rT}
• Only chance of losing money is essentially government collapse
– Risk-free usually refers to the investor’s country
– Sovereign default (don’t pay debts) rare but not impossible: e.g. Greece 2015,
Russia 2022
This means there are two key issues when deciding on an investment portfolio:
• How much the value will increase (returns)?
• What is the chance of losing money (risk)?
Portfolio theory tries to find a quantitative understanding of these trade-offs to make
good investment decisions.
Portfolio Features
In our modelling, we will have a collection of financial securities Si and we will invest
some fraction of our money wi ∈ R in each security.
• For example, wi = 0.2 might mean we have 20% of our money invested in shares of
some company (or in oil, ...)
• No requirement that wi is a ‘round number’: we assume we can invest arbitrary fractions
of wealth (by buying fractions of shares, etc)
– If the total amount being invested is much larger than the price of 1 unit of the
security (e.g. 1 share, 1 barrel of oil), this is a fairly reasonable assumption.
– Assumes high liquidity: if you want to buy/sell, there is always someone happy
to do so (shares vs. art)
• We are allowed to ‘invest’ negative amounts:
– wi > 0 is called a long position (buy shares), wi < 0 is called a short position
(sell shares?)
– In practice, you borrow the asset from someone and sell it (but you have to
pay them back later with interest), called ‘short selling’
Returns
We will work on a one-period investment: we invest our wealth in different securities
at time t = 0, and close out the position at time t = 1 (i.e. sell everything to get back
cash).
If the price of an asset at time t is St, the (rate of) return on that asset over our time
period is
R := (S1 − S0)/S0
Since we invest at time t = 0, S0 is a known quantity, but S1 (and hence R) is a random
variable.
Example: suppose we have one risk-free bond with prices A0 = $100 and A1 = $110,
and one risky asset with prices S0 = $80 and
S1 = $100 with probability 0.8, or $60 with probability 0.2
We invest $10,000 by buying 50 shares and putting the rest in bonds.
Returns
Our portfolio is:
• Shares: buy 50 shares at $80 each = $4,000 invested
• Bonds: invest the remaining $6,000 in bonds (i.e. buy 60 bonds at $100 each)
The returns on each asset are
RA = (A1 − A0)/A0 = (110 − 100)/100 = 0.1
RS = (S1 − S0)/S0 = (100 − 80)/80 = 0.25 with probability 0.8, or
(60 − 80)/80 = −0.25 with probability 0.2
Our portfolio value is
V0 = 50 · $80 + 60 · $100 = $10000
V1 = 50 · $100 + 60 · $110 = $11600 with probability 0.8, or
50 · $60 + 60 · $110 = $9600 with probability 0.2
Returns
So, our portfolio return is
R = (V1 − V0)/V0 = (11600 − 10000)/10000 = 0.16 with probability 0.8, or
(9600 − 10000)/10000 = −0.04 with probability 0.2
Note: we invested 40% of our initial wealth in stocks and 60% in bonds, and our portfolio
return satisfies
R = 0.4RS + 0.6RA   (where 0.4RS = ±0.1 and 0.6RA = 0.06)
Our expected return is
E[R] = 0.8 · (0.16) + 0.2 · (−0.04) = 0.12
The variance of our return is
Var(R) = E[(R − E[R])²] = 0.8 · (0.16 − 0.12)² + 0.2 · (−0.04 − 0.12)² = 0.0064
so the standard deviation is σR = √0.0064 = 0.08.
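For reference, a tiny sketch reproducing these statistics from the two scenarios:

    # Portfolio return statistics for the bond/stock example.
    import numpy as np

    ps = np.array([0.8, 0.2])        # scenario probabilities
    R = np.array([0.16, -0.04])      # portfolio return in each scenario

    ER = ps @ R
    VarR = ps @ (R - ER)**2
    print(ER, VarR, np.sqrt(VarR))   # 0.12, 0.0064, 0.08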
7.2 Mean-Variance Portfolio Theory
Mean-Variance Portfolio Theory
Understanding and choosing portfolios is a complicated task. In this course we focus
on an important but simple model, Mean-Variance Portfolio Theory (MVPT). It was
developed by Harry Markowitz in 1952 and he received the Nobel Prize for Economics in
1990 for this work.
It is based on the following assumptions:
1. The market has perfect information and perfect competition
– Every investor knows the same things, lots of competition means ‘no free lunch’
2. The investment occurs over a single time period → FMAT3888
3. Risky asset returns are normally distributed with known means and (co)variances
4. Investors are rational and risk-averse
5. Investors make decisions only on the basis of the means and (co)variances of asset
returns
We have already discussed the validity of #4, and the limitations of #2 and #5 should be
obvious. For #1, this is an ideal situation but usually not true (e.g. expert vs. inexperienced
investors). #3 is a common assumption that is often close enough to reality for practical
purposes (but not always, e.g. market downturns, electricity prices).
Diversification
As the name suggests, mean-variance portfolio theory uses two quantities to assess a
portfolio:
• The amount of returns is quantified as E[R] (expected portfolio return)
• The risk is quantified as Var(R) (variance of portfolio return)
Note that there is usually a trade-off between these: to get higher (expected) returns
typically requires taking on more risk (and vice versa).
An important concept (see utility theory) is diversification: by investing in multiple
risky assets, we can reduce the overall risk of our portfolio.
Example: consider two assets with the same initial price S_0^1 = S_0^2 = S0 and unknown
final prices S_1^1, S_1^2. For i = 1, 2, let Ri be the random return with mean ri and standard
deviation σi. We have a portfolio of 1 share of asset 1 and h shares of asset 2.
What is the portfolio’s expected return and variance? If the correlation between the
two assets is ρ12 = −1, what value of h minimises the portfolio risk?
Example
Our portfolio has initial/final values
V0 = 1 · S_0^1 + h · S_0^2 = (1 + h)S0
V1 = 1 · S_1^1 + h · S_1^2
so the return is
R = (V1 − V0)/V0 = [S_1^1 + hS_1^2 − (1 + h)S0] / [(1 + h)S0]
  = (S_1^1 − S0)/[(1 + h)S0] + h(S_1^2 − S0)/[(1 + h)S0]
  = (1/(1 + h))R1 + (h/(1 + h))R2
The expected return is
µ := E[R] = (1/(1 + h))E[R1] + (h/(1 + h))E[R2] = (1/(1 + h))r1 + (h/(1 + h))r2
The portfolio variance is
σ² := Var(R) = E[(R − µ)²] = E[((1/(1 + h))R1 + (h/(1 + h))R2 − (1/(1 + h))r1 − (h/(1 + h))r2)²]
Example
So, the portfolio variance is
σ² = E[((1/(1 + h))R1 + (h/(1 + h))R2 − (1/(1 + h))r1 − (h/(1 + h))r2)²]
= E[(R1 − r1 + hR2 − hr2)²] / (1 + h)²
= [E[(R1 − r1)²] + h²E[(R2 − r2)²] + 2hE[(R1 − r1)(R2 − r2)]] / (1 + h)²
= [σ1² + h²σ2² + 2h Cov(R1, R2)] / (1 + h)²
Now, if ρ12 = Cov(R1, R2)/(σ1σ2) = −1, then Cov(R1, R2) = −σ1σ2, so
σ² = [σ1² + h²σ2² − 2hσ1σ2] / (1 + h)² = (σ1 − hσ2)² / (1 + h)²
If h = σ1/σ2, we get the minimal portfolio variance σ² = 0 (the portfolio has less risk than
either asset!).
Model Setup
Let’s construct the MVPT model: for a single investment time period (t = 0 to t = 1),
• We have n risky assets available to invest in, S1, . . . , Sn with associated (random)
returns R1, . . . , Rn (define the random vector R = (R1, . . . , Rn)T ∈ Rn).
• We have mean returns ri = E[Ri], with r = (r1, . . . , rn)T ∈ Rn
• The return variances are σi² = Var(Ri)
• The covariance matrix is C ∈ Rn×n with Cij = Cov(Ri, Rj) (hence σi² = Cii)
• The asset correlations are ρij = Cij/(σiσj)
We start with wealth W0 which we will invest in assets S1, . . . , Sn.
We will invest fraction xi of our wealth (i.e. xiW0 dollars) in asset Si, so our portfolio
is defined by x = (x1, . . . , xn)T ∈ Rn.
Constraints
To invest our entire wealth, we impose the budget constraint
∑_{i=1}^n xi = 1
This defines a set of allowable portfolios,
D = { x ∈ Rn : ∑_{i=1}^n xi = 1 }
We will usually allow short-selling, xi < 0, which means we may also have xi > 1 for some
assets.
Our portfolio may have extra constraints. Common examples include:
• No short-selling allowed: xi ≥ 0 for all i = 1, . . . , n
• Maximum/minimum weights in each asset, Li ≤ xi ≤ Ui
• Sector constraints (e.g. at most 20% of wealth invested in tech stocks)
– These can usually be written as linear constraints, Ax ≤ b (e.g. bj is the
maximum weight of sector j, and Aij = 1 if asset i is in sector j, Aij = 0
otherwise)
Portfolio Returns
Theorem 7.1. If our portfolio allocation is x ∈ Rn, then the (random) return on this
portfolio is Rx = x^T R. The expected portfolio return is µx = x^T r and the portfolio
variance is σx² = x^T Cx.
Proof: we invest amounts xiW0 in asset Si, so our initial portfolio value is
V0 = ∑_{i=1}^n xiW0 = W0   (since ∑_i xi = 1)
Asset Si increases in price by a factor of 1 + Ri (definition of return), so our final portfolio
value is
V1 = ∑_{i=1}^n xiW0(1 + Ri)
So, the portfolio return is
Rx = (V1 − V0)/V0 = [∑_{i=1}^n xiW0(1 + Ri) − ∑_{i=1}^n xiW0] / W0 = ∑_{i=1}^n xiRi = x^T R
Portfolio Returns
Proof: the expected portfolio return is
µx = E[Rx] = ∑_{i=1}^n xiE[Ri] = ∑_{i=1}^n xiri = x^T r
The portfolio variance is
σx² = Cov(Rx, Rx) = Cov(x^T R, x^T R) = x^T Cx
using the final theorem in the probability lectures (which said Cov(x1^T R, x2^T R) = x1^T Cx2).
Note: since C is positive semidefinite, this confirms that σx² ≥ 0 as expected.
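A minimal NumPy sketch of these formulae (the returns vector r, covariance matrix C and weights x below are illustrative; x sums to 1 and includes a short position):

    # Portfolio mean and variance from Theorem 7.1.
    import numpy as np

    r = np.array([0.08, 0.05, 0.03])              # mean returns (illustrative)
    C = np.array([[0.04, 0.01, 0.00],
                  [0.01, 0.02, 0.00],
                  [0.00, 0.00, 0.01]])            # covariance matrix
    x = np.array([0.7, 0.5, -0.2])                # weights, sum to 1

    mu_x = x @ r          # expected portfolio return x^T r
    var_x = x @ C @ x     # portfolio variance x^T C x
    print(mu_x, var_x, np.sqrt(var_x))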
Visualisation
We visualise different portfolios x by plotting the set of potential (σx, µx) pairs:
[Figure: two panels of feasible (σ, µ) pairs — left: unconstrained “bullet”; right:
constrained (no short selling).]
• If two portfolios have the same risk σx, we prefer the one with higher expected
return µx
• If two portfolios have the same expected return µx, we prefer the one with lower
risk σx
That is, given two feasible portfolios, we prefer the one that is North-West of the other
(if any).
7.3 Optimising Portfolios
Efficient Frontier
Given the North-West rule, there are some important features of the feasible region:
[Figure: the feasible region in the (σ, µ) plane with portfolios A (maximum return),
B (minimum risk) and C (minimum return) marked.]
• A/C are the portfolios with maximum/minimum return
• B is the minimum risk portfolio
• The curve ABC is the minimum variance frontier (MVF)
– For any feasible return, what is the minimum possible risk?
• The curve AB (a subset of the MVF) is the efficient frontier (EF):
– The set of optimal portfolios (no other feasible portfolios have higher return
and lower risk)
Efficient Frontier
Theorem 7.2. All rational, risk-averse investors will choose a portfolio on the efficient
frontier.
Proof (sketch): rational risk-averse investors have a utility function U(x) and they
pick portfolios to maximise E[U(V1)] (V1 is final portfolio value).
Since each Ri is normally distributed, the portfolio return Rx = xTR is normally
distributed, and so is final wealth V1 = V0(1 +Rx).
We can show that E[U(V1)] only depends on µx and σx (for fixed W0).
Using U′ > 0 and U′′ < 0 (risk aversion), we can then show that
∂E[U(V1)]/∂µx > 0 and ∂E[U(V1)]/∂σx < 0
If we are not on the efficient frontier, there is always another portfolio with smaller σx and
higher µx, which must give a higher value of E[U(V1)] (so the investor will never choose
this).
Efficient Frontier
All rational, risk-averse investors choose a portfolio on the EF (curve AB)
[Figure: the same feasible region, highlighting the efficient frontier from B up to A.]
• Which point on the EF that maximises utility will depend on each investor
• Portfolio B is preferred by more risk-averse investors, A is preferred by less risk-averse
investors
• It is never rational to pick a portfolio on the inefficient frontier BC (we can always
increase returns without changing risk)
Utility Maximisation
For investors with a particular risk aversion, we can determine exactly which portfolio
on the EF they prefer.
We assume investors have a constant absolute risk aversion
A(W) = −U′′(W)/U′(W) = c > 0 for all W
Integrating both sides w.r.t. W, we get
−log(U′(W)) = cW + a, and so U′(W) = e^{−a−cW}, U(W) = −(1/c)e^{−a−cW} + b.
That is, we have U(W) = −ae^{−cW} + b for some a > 0 and b ∈ R. Since utility maximisation
is unaffected by positive linear transformations, we may assume
U(W) = −e^{−cW}
This will allow us to maximise E[U(V1)] explicitly.
Utility Maximisation
For portfolio x, our final wealth is V1 = W0(1 + Rx), giving
E[U(V1)] = E[−e^{−cW0(1+Rx)}] = E[−e^{−cW0−cW0Rx}] = −e^{−ĉ}E[e^{−ĉRx}]
where ĉ := cW0. Now we use the following facts:
• If the Xi are normally distributed (possibly correlated), then any linear combination
∑_i ciXi is also normally distributed. Hence, Rx ∼ N(µx, σx²).
• If R ∼ N(µ, σ²) then E[e^{−tR}] = e^{−t(µ−σ²t/2)} (Tutorial week 8, Q4)
All together, we get
E[U(V1)] = −e^{−ĉ−ĉ(µx−σx²ĉ/2)} = −e^{−ĉ−ĉ²(µx/ĉ−σx²/2)}
Since f(x) = −e^{−ĉ−ĉ²x} is increasing in x, we can maximise E[U(V1)] by instead maximising
V(µx, σx) = (1/ĉ)µx − (1/2)σx² = tµx − (1/2)σx²   where t = 1/ĉ = 1/(cW0)
Note: V is increasing in µx and decreasing in σx (c.f. efficient frontier theorem above).
Optimal Portfolios
If A(W) = c, then utility maximisation is equivalent to maximising V(µx, σx) =
tµx − (1/2)σx² for t = 1/(cW0)
[Figure: the feasible region again, with B (most risk-averse limit) and A (least
risk-averse limit) marked.]
• As t→ 0 (i.e. c→∞, more risk averse), the investor chooses portfolios near B
• As t→∞ (i.e. c→ 0, less risk averse), the investor chooses portfolios near A
Optimal Portfolios
What if 0 < t < ∞? For a given t, maximising E[U(V1)] (i.e. maximising V ) is
equivalent to
min_{µ,σ} Z(µ, σ) = −V(µ, σ) = −tµ + (1/2)σ²,
or equivalently
min_x Z(x) = −tx^T r + (1/2)x^T Cx
Since C is positive semidefinite (covariance matrix), this is a convex function in x.
If two portfolios x give the same value of Z, they have the same expected utility.
Hence, the contour lines Z = const are called indifference curves.
Optimal Portfolios
In our example, if t = 1, the indifference curves are:
[Figure: indifference curves Z = −0.0009, −0.0007, −0.0005 overlaid on the feasible
region; the optimal portfolio is where an indifference curve touches the EF.]
The indifference curves are always convex and the EF must be concave (why?), so
there will always be a unique optimal portfolio.
If t = 0 (c = ∞), the indifference curves are vertical and Z decreases moving left
(optimum is B). If t =∞ (c = 0) the indifference curves are horizontal and Z decreases
moving up (optimum is A).
Example
Example: An investor with risk aversion parameter t = 1.2 is offered a choice between
two portfolios:
• P1 has µ1 = 0.1 and σ1 = 0.25
• P2 has µ2 = 0.12 and σ2 = 0.3
Which do they prefer? Which investor (i.e. what value of t) is indifferent between P1 and
P2?
The Z values for the two portfolios are:
Z1 = −tµ1 + (1/2)σ1² = −1.2(0.1) + (1/2)(0.25)² = −0.08875
Z2 = −tµ2 + (1/2)σ2² = −1.2(0.12) + (1/2)(0.3)² = −0.099
So, the investor prefers P2 (smaller Z value). An investor is indifferent if Z1 = Z2, so
solve for t:
−t(0.1) + (1/2)(0.25)² = −t(0.12) + (1/2)(0.3)² =⇒ t = 0.6875
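A quick check of these numbers:

    # Z values for P1, P2 and the indifference value of t.
    t = 1.2
    Z1 = -t*0.10 + 0.5*0.25**2    # -0.08875
    Z2 = -t*0.12 + 0.5*0.30**2    # -0.099, so P2 is preferred
    t_star = 0.5*(0.30**2 - 0.25**2) / (0.12 - 0.10)
    print(Z1, Z2, t_star)         # t_star = 0.6875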
Another Example
Example: Suppose the EF is given by a straight line µ = a+ bσ. What is the optimal
µ∗, σ∗ for an investor with risk aversion parameter t?
At the optimal solution, the indifference curve Z = −tµ + (1/2)σ² is tangent to the EF
µ = a + bσ.
The slope of the indifference curve is
µ = −Z/t + σ²/(2t) =⇒ dµ/dσ = σ/t
Hence, we want dµ/dσ = b (the slope of the EF), so σ* = bt.
Substituting into the EF, we get µ* = a + bσ* = a + b²t.
7.4 Two-Asset Optimal Portfolios
Two-Asset Case
We now look at adding in the budget constraint ∑_{i=1}^n xi = 1. First, let’s look at
2-asset portfolios (n = 2).
In this case, we can eliminate the budget constraint by choosing
x1 = x, and x2 = 1− x
and optimising over x ∈ R. The portfolio risk/return now only depends on x,
µx = x^T r = x1r1 + x2r2 = xr1 + (1 − x)r2 = r2 + (r1 − r2)x
σx² = x^T Cx with C = [ σ1²    ρσ1σ2 ]
                      [ ρσ1σ2  σ2²   ]
    = σ1²x1² + σ2²x2² + 2ρσ1σ2x1x2
    = σ1²x² + σ2²(1 − x)² + 2ρσ1σ2x(1 − x)
where ρ is the correlation between R1 and R2.
Note: for each µ there is a unique x, and hence a unique σ. So, the feasible region is
just a curve in the (σ, µ) plane.
Degenerate Cases
There are 3 degenerate cases, where det(C) = 0: ρ = 1, ρ = −1, or one asset is
risk-free (σ1 = 0 or σ2 = 0; if both zero, just pick higher returns). We will look at these
first.
WLOG, assume r1 > r2 and σ1 > σ2 (higher returns = higher risk, else one asset
clearly better).
Case 1, ρ = 1: in this case,
µ = r2 + (r1 − r2)x
σ² = σ1²x² + σ2²(1 − x)² + 2σ1σ2x(1 − x) = (σ1x + σ2(1 − x))²
Hence, since x = (µ − r2)/(r1 − r2), we can write
σ = |(σ1 − σ2)x + σ2| = |[(σ1 − σ2)µ + r1σ2 − r2σ1] / (r1 − r2)|
This forms a piecewise linear function in the (σ, µ) plane.
If we choose x = x0 = −σ2/(σ1 − σ2) < 0 (short sell asset S^1), then we get
µ = (σ1r2 − σ2r1)/(σ1 − σ2) and σ = 0. This is a risk-free portfolio.
Degenerate Cases
Case 1, ρ = 1: the feasible region in this case is
[Figure: piecewise linear feasible region for ρ = 1 in the (σ, µ) plane, with the assets at x = 0 and x = 1]
The solid line is x ∈ [0, 1] (no short selling). Note x = 1 is S1 at (σ1, µ1) and x = 0 is
S2 at (σ2, µ2).
The dashed lines are x < 0 and x > 1 (short sell asset S1/S2 respectively). This
includes the zero risk portfolio.
Degenerate Cases
Case 2, ρ = −1: similar to case 1, we get

    µ = r2 + (r1 − r2)x
    σ = |(σ1 + σ2)x − σ2| = |((σ1 + σ2)/(r1 − r2)) µ − (r2σ1 + r1σ2)/(r1 − r2)|

There is still a zero-risk portfolio, at x0 = σ2/(σ1 + σ2), with µ = (σ1r2 + σ2r1)/(σ1 + σ2) and σ = 0, but now 0 < x0 < 1, so no short selling is required.
[Figure: piecewise linear feasible region for ρ = −1 in the (σ, µ) plane, with the zero-risk portfolio between the two assets]
Degenerate Cases
Case 3, σ2 = 0: then we get

    µ = r2 + (r1 − r2)x and σ = |σ1 x| = |(σ1/(r1 − r2)) µ − σ1r2/(r1 − r2)|

The zero-risk portfolio is just x = 0 (asset S2).

[Figure: piecewise linear feasible region for σ2 = 0 in the (σ, µ) plane]

The upwards line (µ ≥ r2) has slope dµ/dσ = (r1 − r2)/σ1, which we call the price of risk (i.e. how much higher return you get for an extra unit of risk).
2-Asset General Case
Let's now consider the typical 2-asset case (where det(C) ≠ 0).
Example: suppose r1 = 1, r2 = 1/2, σ1² = 1, σ2² = 1/2 and ρ = 0. That is, we have

    r = [1, 1/2]^T and C = [[1, 0], [0, 1/2]]

Then, for a given allocation (x1, x2) = (x, 1 − x), the portfolio mean and variance are

    µ = r1x1 + r2x2 = x + ½(1 − x) = ½(x + 1)
    σ² = σ1²x² + σ2²(1 − x)² + 2ρσ1σ2 x(1 − x) = x² + ½(1 − x)²

The first equation gives x = 2µ − 1, which we substitute into the second equation to get

    σ² = (2µ − 1)² + ½(1 − (2µ − 1))² = 4µ² − 4µ + 1 + ½(4 − 8µ + 4µ²) = 6µ² − 8µ + 3
2-Asset General Case
Example: completing the square, we have

    σ² = 6(µ − 2/3)² + 1/3 or µ = ±√((1/6)(σ² − 1/3)) + 2/3

This is a hyperbola: in standard form it looks like

    σ²/(1/3) − (µ − 2/3)²/(1/18) = 1 (but only take the σ > 0 part)

The vertex of the hyperbola (left-most point) is the minimum risk portfolio! Minimising σ² w.r.t. µ, we see this is at (σ, µ) = (√(1/3), 2/3). The asymptotes of the hyperbola are (take σ → ∞)

    µ = ±√(1/6) · σ + 2/3
2-Asset General Case
Example: visually, we have

[Figure: the hyperbola σ² = 6µ² − 8µ + 3 in the (σ, µ) plane, with the minimum risk portfolio (MRP) at its vertex and the assets at x = 0 and x = 1]
What x value gives the MRP? Minimise σ² w.r.t. x:

    dσ²/dx = 2x − (1 − x) = 3x − 1

so the MRP occurs at x = 1/3 (since σ² is convex in x).
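A quick numerical cross-check of the MRP (a sketch, assuming SciPy is available):

    from scipy.optimize import minimize_scalar

    var = lambda x: x**2 + 0.5 * (1 - x)**2   # sigma^2 for this example
    res = minimize_scalar(var)
    print(res.x, 0.5 * (res.x + 1), res.fun)   # x = 1/3, mu = 2/3, sigma^2 = 1/3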
2-Asset General Case
Example: For the above assets, find the optimal portfolio for an investor with risk
aversion parameter t.
At the optimum, we know that the indifference curve Z = −tµ + ½σ² is tangential to the EF.
We saw previously that the slope of the indifference curve is

    µ = −(1/t)Z + σ²/(2t) =⇒ dµ/dσ = σ/t

The EF (top half of the hyperbola) is given by (for σ ≥ 1/√3)

    µ = √((1/6)(σ² − 1/3)) + 2/3 =⇒ dµ/dσ = ½ · 2σ/√(6(σ² − 1/3)) = σ/√(6(σ² − 1/3))

Equating the two slopes, we have t = √(6(σ² − 1/3)), so the optimal portfolio has σ²(t) = t²/6 + 1/3 and µ(t) = t/6 + 2/3 (substitute σ²(t) into the EF). The allocation is x(t) = 2µ(t) − 1 = t/3 + 1/3.
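A one-off numerical check of these formulae (plain Python; t = 2 is an arbitrary test value):

    t = 2.0
    var = t**2 / 6 + 1 / 3        # sigma^2(t)
    mu = t / 6 + 2 / 3            # mu(t)
    x = 2 * mu - 1                # allocation x(t) = t/3 + 1/3
    print(x, x**2 + 0.5 * (1 - x)**2 - var)   # residual should be ~0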
7.5 General Optimal Portfolios
General Case
Now we consider the general n-asset case with the budget constraint ∑_{i=1}^n x_i = 1. We will start by assuming all assets are risky and C is invertible; the case with a risk-free asset is also important and we will study this later.
For a given investor with risk aversion parameter t, our optimal portfolio (maximum expected utility) comes from solving

    min_{x∈R^n} Z(x) = −t x^T r + ½ x^T C x s.t. x^T e = 1

This is a quadratic program (QP) with one equality constraint (convex since C is positive definite¹). This type of problem is easy to solve!
General Case
Theorem 7.3. Define the constants

    a = e^T C^{-1} e, b = r^T C^{-1} e, c = r^T C^{-1} r, d = ac − b²

and vectors

    α = (1/a) C^{-1} e and β = C^{-1} r − (b/a) C^{-1} e

If C is invertible, the solution to

    min_{x∈R^n} Z(x) = −t x^T r + ½ x^T C x s.t. x^T e = 1

is x(t) = α + tβ. The portfolio mean is µ(t) = (b + dt)/a and the variance is σ²(t) = (1 + dt²)/a.

¹If a matrix is positive semidefinite and invertible, then it is positive definite. Why? (we will prove this later)
Optimal Portfolio Proof
Proof: the problem is convex, so KKT points are global minimisers. Writing the constraint as g(x) = x^T e − 1 = 0, we first compute gradients:

    ∂g/∂x_i = 1 =⇒ ∇g(x) = e

and for the quadratic objective,

    Z(x) = −t x^T r + ½ x^T C x = −t ∑_{i=1}^n x_i r_i + ½ ∑_{i=1}^n ∑_{j=1}^n C_ij x_i x_j

For the double sum, any given x_i appears in all C_ij and C_ji terms (but don't double count C_ii). Hence

    ∂Z/∂x_i = −t r_i + ½ (C_ii (2x_i) + 2 ∑_{j≠i} C_ij x_j)

which gives ∇Z(x) = −t r + C x.
Optimal Portfolio Proof
Proof: so the KKT conditions for this problem are

    ∇Z(x) = −t r + C x = −λe = −λ∇g(x)
    g(x) = x^T e − 1 = 0

The first equation gives us

    x = t C^{-1} r − λ C^{-1} e

Substituting into the second equation we get

    t r^T C^{-1} e − λ e^T C^{-1} e = tb − λa = 1

so λ = (tb − 1)/a. We then substitute back into the expression for x to get

    x(t) = t C^{-1} r − ((tb − 1)/a) C^{-1} e = (1/a) C^{-1} e + t (C^{-1} r − (b/a) C^{-1} e) = α + tβ

and we get the expression for x(t) as expected.
Optimal Portfolio Proof
Proof: given x(t) = α + tβ, we compute the portfolio mean

    µ(t) = r^T x(t) = (1/a) r^T C^{-1} e + t r^T (C^{-1} r − (b/a) C^{-1} e) = b/a + t(c − b²/a) = (b + dt)/a

as required. The variance of the portfolio is (recall Cx(t) = tr − λe and λ = (tb − 1)/a from the KKT conditions)

    σ²(t) = x(t)^T C x(t) = x(t)^T (t r − ((tb − 1)/a) e)

Using x(t)^T r = µ = (b + dt)/a and x(t)^T e = 1 (constraint), we conclude

    σ²(t) = tµ − (tb − 1)/a = (bt + dt²)/a − (tb − 1)/a = (1 + dt²)/a

and we are done.
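Theorem 7.3 translates directly into a few lines of NumPy; here is a minimal sketch (the function name critical_line is ours):

    import numpy as np

    def critical_line(r, C, t):
        # Optimal portfolio x(t) = alpha + t*beta from Theorem 7.3 (C invertible)
        e = np.ones(len(r))
        Cinv_e = np.linalg.solve(C, e)
        Cinv_r = np.linalg.solve(C, r)
        a, b, c = e @ Cinv_e, r @ Cinv_e, r @ Cinv_r
        d = a * c - b**2
        alpha = Cinv_e / a
        beta = Cinv_r - (b / a) * Cinv_e
        return alpha + t * beta, (b + d * t) / a, (1 + d * t**2) / a   # x, mu, sigma^2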
Comments
Some things to note:
• For a given risk aversion parameter t, the optimal portfolio is x(t) = α + tβ. So all optimal portfolios lie on a line in R^n (the critical line)
• Since x(t)^T e = 1 for all t, we have α^T e = 1 (try t = 0) and β^T e = 0 (any other t)
Lemma 7.4. If C is invertible, then a, c, d > 0.
Proof: C is positive semidefinite (all eigenvalues ≥ 0) and invertible (eigenvalues ≠ 0), so C is positive definite. This means C^{-1} is also positive definite.
• Proof 1: the eigendecomposition C = QΛQ^T gives us C^{-1} = QΛ^{-1}Q^T, so all eigenvalues of C^{-1} are inverses of eigenvalues of C (all > 0)
• Proof 2: for any y ≠ 0, let x = C^{-1}y ≠ 0. Then y^T C^{-1} y = x^T C C^{-1} C x = x^T C x > 0.
Comments
Proof: since C^{-1} is positive definite, we have

    a = e^T C^{-1} e > 0 and c = r^T C^{-1} r > 0

We now only need to show d > 0. For this, we know (provided br − ce ≠ 0, i.e. r is not a multiple of e)

    0 < (br − ce)^T C^{-1} (br − ce)
      = b² r^T C^{-1} r − 2bc r^T C^{-1} e + c² e^T C^{-1} e
      = b²(c) − 2bc(b) + c²(a)
      = c(ac − b²) = cd

Since c > 0, we have d > 0.
Minimum Variance Frontier
For a risk aversion parameter t, the optimal portfolio x(t) = α + tβ has

    µ(t) = (b + dt)/a and σ²(t) = (1 + dt²)/a

So, t = (a/d)(µ − b/a), and hence

    σ² = 1/a + (a/d)(µ − b/a)²

This is (again) a hyperbola in the (σ, µ) plane, since a, d > 0.
As before, this is the minimum variance frontier. The efficient frontier is the upper half of this hyperbola (µ ≥ b/a).
The minimum variance portfolio has σ² = 1/a when µ = b/a (i.e. t = 0, so x = α).
Minimum Variance Frontier
Theorem 7.5. In the (σ, µ) plane, we have:
• The minimum variance frontier is the image of x(t), i.e. the hyperbola σ² − (a/d)(µ − b/a)² = 1/a
• The efficient frontier is the image of x(t) for t ≥ 0, i.e. the upper part of the MVF (µ ≥ b/a)
• The minimum risk portfolio is the vertex of the hyperbola at t = 0: σ² = 1/a, µ = b/a for x = α
• The critical line with t < 0 is the inefficient frontier (lower half of the hyperbola)
• For n = 2, the feasible region is the hyperbola. For n ≥ 3, the feasible region is the interior of the hyperbola.
Minimum Variance Frontier
Visually, we have

[Figure: the MVF hyperbola in the (σ, µ) plane, with the minimum variance portfolio (MVP) at the vertex (σ, µ) = (1/√a, b/a)]
Example
Example: consider a 3-asset portfolio with

    r = [1, 2, 0]^T and C = [[1, 0, 0], [0, 2, 0], [0, 0, 1]]

Find (a) the MVF and EF, (b) the minimum risk portfolio, (c) the critical line (in the (x1, x2) plane), and (d) the optimal portfolio for an investor with t = 1/2.
We start by computing

    C^{-1} = [[1, 0, 0], [0, 1/2, 0], [0, 0, 1]], so a = e^T C^{-1} e = 1 + 1/2 + 1 = 5/2

Similarly, b = r^T C^{-1} e = 2 and c = r^T C^{-1} r = 3, giving d = ac − b² = 7/2.
Example
Example: Continuing, we also compute

    α = (1/a) C^{-1} e = (2/5)[1, 1/2, 1]^T = [2/5, 1/5, 2/5]^T

and

    β = C^{-1}(r − (b/a)e) = C^{-1}([1, 2, 0]^T − (4/5)[1, 1, 1]^T) = [1/5, 3/5, −4/5]^T

Thus the optimal portfolios are

    x(t) = α + tβ = [2/5 + t/5, 1/5 + 3t/5, 2/5 − 4t/5]^T

with µ(t) = (b + dt)/a = (4 + 7t)/5 and σ²(t) = (1 + dt²)/a = (2 + 7t²)/5.
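These numbers can be checked with the critical_line sketch from earlier (t = 1 as a test value):

    import numpy as np

    r = np.array([1.0, 2.0, 0.0])
    C = np.diag([1.0, 2.0, 1.0])
    x, mu, var = critical_line(r, C, t=1.0)
    print(x)         # [0.6, 0.8, -0.4], i.e. (2/5 + 1/5, 1/5 + 3/5, 2/5 - 4/5)
    print(mu, var)   # 2.2 = (4 + 7)/5 and 1.8 = (2 + 7)/5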
Example
Example: For (a), the MVF is the hyperbola

    σ² − (a/d)(µ − b/a)² = 1/a =⇒ σ² − (5/7)(µ − 4/5)² = 2/5

The EF is the upper half of the hyperbola (µ ≥ 4/5).
(b) The minimum risk portfolio is at t = 0, with

    x(0) = α = [2/5, 1/5, 2/5]^T and µ(0) = b/a = 4/5, σ²(0) = 1/a = 2/5
Example
Example: (c) In the (x1, x2) plane, the critical line is the parametric curve

    x1(t) = (2 + t)/5, x2(t) = (1 + 3t)/5

The first equation gives t = 5x1 − 2, and substituting this into the second equation gives the critical line 3x1 − x2 = 1.
(d) If t = 1/2, the optimal portfolio is

    x(1/2) = α + ½β = [1/2, 1/2, 0]^T

with µ(1/2) = (b + dt)/a = 3/2 and σ²(1/2) = (1 + dt²)/a = 3/4.
Two-Fund Theorem
In practice, building a portfolio can be complicated (interacting with multiple brokers,
etc.). Fortunately, any efficient portfolio can be built from a combination of any two other
efficient portfolios.
So, if there are two efficient managed funds already available, any investor can just
split their investment between these rather than building their own portfolio from scratch.
Theorem 7.6 (Two-Fund Theorem). Suppose P and Q are any two distinct efficient
portfolios. Then any efficient portfolio is a linear combination of P and Q.
Proof: since P and Q are efficient and distinct, we have t_P ≠ t_Q and

    x_P = α + t_P β and x_Q = α + t_Q β

For any other risk aversion parameter t, define θ = (t − t_P)/(t_Q − t_P), so

    (1 − θ)x_P + θx_Q = α + [(1 − θ)t_P + θt_Q]β = α + tβ = x(t)
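Continuing the earlier critical_line sketch (with r, C from the 3-asset example), the proof can be checked numerically:

    tP, tQ, t = 0.0, 1.0, 0.4
    xP, _, _ = critical_line(r, C, tP)
    xQ, _, _ = critical_line(r, C, tQ)
    theta = (t - tP) / (tQ - tP)
    print((1 - theta) * xP + theta * xQ - critical_line(r, C, t)[0])   # ~0 vector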
7.6 Restricted Portfolios
Restricted Portfolios
Previously, we looked at the expected utility maximisation problem

    min_{x∈R^n} Z(x) = −t x^T r + ½ x^T C x s.t. x^T e = 1

What if we have constraints on our investment? The simplest case is bounds on each variable:

    min_{x∈R^n} Z(x) = −t x^T r + ½ x^T C x s.t. x^T e = 1, L ≤ x ≤ U

Note: set L_i = −∞ and/or U_i = +∞ if there are no constraints.
For example, 'no short selling of asset i' would be given by L_i = 0 and U_i = +∞.
Example
Example: consider the previous 3-asset portfolio with

    r = [1, 2, 0]^T and C = [[1, 0, 0], [0, 2, 0], [0, 0, 1]]

What is the optimal portfolio if we add the extra constraint x3 ≥ 0?
Method 1 (use unconstrained problem): we previously found the unconstrained optimal allocation

    x(t) = [2/5 + t/5, 1/5 + 3t/5, 2/5 − 4t/5]^T

so, the constraint x3 ≥ 0 is satisfied for all t ≤ 1/2 (so this is also the constrained solution). What do we do when t > 1/2?
Example
Method 1: when t > 1/2, the unconstrained solution violates the constraint x3 ≥ 0. So, this constraint must be active at the new solution, x3 = 0.
This means we now have a 2-asset problem with

    r̂ = [1, 2]^T and Ĉ = [[1, 0], [0, 2]]

Since x3 = 0, we still have the constraint x̂^T e = x1 + x2 = 1, so our new problem is

    min_{x̂∈R²} Z(x̂) = −t x̂^T r̂ + ½ x̂^T Ĉ x̂ s.t. x̂^T e = 1

This is just the usual portfolio optimisation (without bound constraints), so the previous solution method works here.
Example
Method 1: as usual, we compute

    â = e^T Ĉ^{-1} e = [1 1] [[1, 0], [0, 1/2]] [1, 1]^T = 3/2

and similarly b̂ = r̂^T Ĉ^{-1} e = 2, ĉ = r̂^T Ĉ^{-1} r̂ = 3 and d̂ = âĉ − b̂² = 1/2. We also have

    α̂ = (1/â) Ĉ^{-1} e = [2/3, 1/3]^T and β̂ = Ĉ^{-1}(r̂ − (b̂/â)e) = [−1/3, 1/3]^T

So we know that [x1, x2]^T = α̂ + tβ̂ and x3 = 0, so our constrained solution for t > 1/2 is

    x(t) = [2/3 − t/3, 1/3 + t/3, 0]^T

When t = 1/2, the constrained and unconstrained solutions match: x(1/2) = (1/2, 1/2, 0)^T.
Graphical Illustration
What does the bullet look like now? Recall the MVF hyperbola

    σ² − (a/d)(µ − b/a)² = 1/a

Here, we have two hyperbolae,

    σ² − (5/7)(µ − 4/5)² = 2/5 and σ² − 3(µ − 4/3)² = 2/3
[Figure: the two MVF hyperbolae in the (σ, µ) plane, with the constrained critical line switching between them at the t = 1/2 changeover point]
Graphical Illustration
At the t = 1/2 changeover point, the slopes of the hyperbolae are the same:

    d/dµ [σ² − (a/d)(µ − b/a)²] = d/dµ (1/a)
    2σ dσ/dµ − (2a/d)(µ − b/a) = 0
    dσ/dµ = (a/(σd))(µ − b/a)

Using µ = (b + dt)/a we get

    dσ/dµ = t/σ

At the changeover point, both curves have the same σ (since same x-coordinate) and both have t = 1/2.
Note: the second derivatives d²σ/dµ² = a/(σd) − t²/σ³ are not the same on both curves (different a and d).
KKT Conditions
We saw in the above example that the constrained critical line x(t) is actually two
connected line segments (t ≤ 1/2 and t > 1/2).
This is implied by the KKT conditions for the general constrained problem:

    min_{x∈R^n} Z(x) = −t x^T r + ½ x^T C x s.t. x^T e = 1, L ≤ x ≤ U

In our usual form (nonlinear optimisation lectures), we write these as

    min_{x∈R^n} Z(x) = −t x^T r + ½ x^T C x
    s.t. x^T e − 1 = 0
         L_i − x_i ≤ 0, ∀i = 1, ..., n
         x_i − U_i ≤ 0, ∀i = 1, ..., n
KKT Conditions
The KKT conditions for this problem are:

    −t r + C x = −λ0 e + ∑_{i=1}^n λ_{Li} e_i − ∑_{i=1}^n λ_{Ui} e_i = −λ0 e + λ_L − λ_U
    x^T e − 1 = 0
    L ≤ x ≤ U
    λ_L, λ_U ≥ 0
    λ_{Li}(L_i − x_i) = 0 ∀i = 1, ..., n
    λ_{Ui}(x_i − U_i) = 0 ∀i = 1, ..., n

Note: for missing constraints L_i = −∞ or U_i = +∞, just set λ_{Li} = 0 or λ_{Ui} = 0. For each upper/lower bound pair, complementary slackness gives 3 possible cases:
• If x_i = L_i, then λ_{Ui} = 0 (only λ_{Li} may be positive)
• If x_i = U_i, then λ_{Li} = 0 (only λ_{Ui} may be positive)
• If x_i ∈ (L_i, U_i), then λ_{Li} = λ_{Ui} = 0
KKT Conditions
In this situation, there are up to 2n possible constraints Li ≤ xi ≤ Ui.
The critical line can now be split into many straight-line pieces x(t) = α+tβ, including
the unconstrained line and possibly parts of the constrained lines for xi = Li or xi = Ui.
Each straight line maps to (part of) a hyperbola in (σ, µ) space.
As we have seen, where the critical line changes segments, the associated hyperbolae intersect tangentially (their slopes match at the changeover point).
Example
Example (same as before): consider the previous 3-asset portfolio with

    r = [1, 2, 0]^T and C = [[1, 0, 0], [0, 2, 0], [0, 0, 1]]

What is the optimal portfolio if we add the extra constraint x3 ≥ 0?
Method 2 (KKT conditions): the constraints are x^T e − 1 = 0 and −x3 ≤ 0 (in our format), so the KKT conditions are

    −t r + C x = −λ0 e + λ1 e3
    x^T e − 1 = 0
    x3, λ1 ≥ 0
    x3 λ1 = 0
Example
Method 2: case (i) first suppose λ1 = 0 (i.e. the unconstrained case). The KKT conditions are now

    −t r + C x = −λ0 e
    x^T e − 1 = 0
    x3 ≥ 0

The first two lines give the usual unconstrained solution,

    x(t) = [2/5 + t/5, 1/5 + 3t/5, 2/5 − 4t/5]^T

This solution also satisfies x3 ≥ 0 whenever t ≤ 1/2; otherwise there is no solution in this case.
Example
Method 2: case (ii) now suppose λ1 > 0 (i.e. the constrained case). Then x3 λ1 = 0 implies x3 = 0. The remaining KKT conditions (in component form) are

    −t [1, 2, 0]^T + [x1, 2x2, x3]^T = −λ0 [1, 1, 1]^T + λ1 [0, 0, 1]^T
    x1 + x2 + x3 − 1 = 0
    λ1 > 0

Subtracting the first component of the vector equation from the second, we get

    −t(2 − 1) + (2x2 − x1) = 0 =⇒ 2x2 − x1 − t = 0

Together with x1 + x2 − 1 = 0 (since x3 = 0), we get

    2x2 − (1 − x2) − t = 0 =⇒ x2 = (1 + t)/3

and so x1 = 1 − x2 = (2 − t)/3.
Example
Method 2: case (ii) so the constrained solution is

    x(t) = [2/3 − t/3, 1/3 + t/3, 0]^T

and holds whenever λ1 > 0. The first component of the vector equation now gives

    −t + x1 = −λ0 =⇒ λ0 = t − x1 = −2/3 + 4t/3

Then the third component of the vector equation gives (since x3 = 0)

    0 = −λ0 + λ1 =⇒ λ1 = λ0 = −2/3 + 4t/3

and so λ1 > 0 whenever t > 1/2.
Example
Method 2: all together, if t ≤ 1/2 then we are in case (i) and the solution is

    x(t) = [2/5 + t/5, 1/5 + 3t/5, 2/5 − 4t/5]^T

and if t > 1/2 then we are in case (ii) and the solution is

    x(t) = [2/3 − t/3, 1/3 + t/3, 0]^T
This matches our answer from method 1 (fortunately!).
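Both methods can also be cross-checked against a general-purpose solver; a sketch assuming SciPy (t = 1 puts us on the constrained branch):

    import numpy as np
    from scipy.optimize import minimize

    r = np.array([1.0, 2.0, 0.0])
    C = np.diag([1.0, 2.0, 1.0])
    t = 1.0
    Z = lambda x: -t * (x @ r) + 0.5 * (x @ C @ x)
    cons = [{"type": "eq", "fun": lambda x: x.sum() - 1}]
    bounds = [(None, None), (None, None), (0, None)]   # only x3 >= 0
    res = minimize(Z, x0=np.full(3, 1 / 3), bounds=bounds, constraints=cons, method="SLSQP")
    print(res.x)   # ~ (1/3, 2/3, 0) = (2/3 - t/3, 1/3 + t/3, 0) at t = 1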
7.7 Risk-Free Asset
Risk-Free Asset
We now consider the case where we add a risk-free asset (e.g. government bonds). The
risk-free asset is S0 and the risky assets are still S1, . . . , Sn.
The parameters for the risky assets are the same: returns Ri, mean returns ri and
standard deviations σi with covariance matrix C (invertible).
The risk-free asset has return R0 which is deterministic: R0 = E[R0] = r0 and
Var(R0) = 0. This implies that (check from the definition!)
Cov(R0, Ri) = 0 ∀i = 1, . . . , n.
Difficulty: the full set of available investments has a singular covariance matrix

    Ĉ = [[0, 0^T], [0, C]]

so all our previous work does not apply (we assumed an invertible covariance matrix).
Risk-Free Asset
For this section, everything including the risk-free asset will have a hat (like Ĉ) and everything with just the risky assets won't (like C). Suppose our portfolio over all assets is x̂ = (x0, x) ∈ R^{n+1} (column vector). The budget constraint is now

    x0 + ∑_{i=1}^n x_i = 1 =⇒ x0 = 1 − ∑_{i=1}^n x_i = 1 − x^T e

The portfolio mean return is then

    µ̂ = x̂^T r̂ = x0 r0 + x^T r = (1 − x^T e) r0 + x^T r = r0 + x^T r̄

where r̄ := r − r0 e ∈ R^n is the excess return (i.e. how much extra return the risky assets give compared to the risk-free asset; we write r̄ to keep it distinct from r).
Since the risk-free asset does not change the risk of the portfolio, the portfolio variance is

    σ̂² = x̂^T Ĉ x̂ = ∑_{i=0}^n ∑_{j=0}^n x̂_i x̂_j Ĉ_ij = ∑_{i=1}^n ∑_{j=1}^n x_i x_j C_ij = x^T C x
Problem Formulation
So, the budget constraint has allowed us to eliminate the variable x0 (i.e. only n variables again, but now with no constraints: even nicer than the risky-only case!). The utility maximisation problem (for parameter t ≥ 0) is now

    min_{x∈R^n} Z = −tµ̂ + ½σ̂² = −t(r0 + x^T r̄) + ½ x^T C x

where the remaining decision variable is given by x0 = 1 − x^T e (budget constraint).

Theorem 7.7. If C is invertible, then the solution to the above problem is

    x̂(t) = [1 − t e^T C^{-1} r̄ ; t C^{-1} r̄] (first entry x0, then the risky weights x)

for all t ≥ 0, with portfolio mean and variance

    µ̂(t) = r0 + c̄t and σ̂²(t) = c̄t², where c̄ = r̄^T C^{-1} r̄
Problem Formulation
Proof: the unconstrained problem is convex, since ∇²Z(x) = C is positive semidefinite everywhere. So, we can find a global minimiser by looking for a stationary point,

    ∇Z(x) = −t r̄ + C x = 0

and so x = t C^{-1} r̄. Since x0 = 1 − x^T e, we get the solution x̂(t) as required.
We now compute

    µ̂(t) = r0 + x^T r̄ = r0 + t r̄^T C^{-1} r̄ = r0 + c̄t

and

    σ̂²(t) = x^T C x = t² (C^{-1} r̄)^T C (C^{-1} r̄) = t² r̄^T C^{-1} r̄ = c̄t²

and we are done.
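Theorem 7.7 is equally short in code; a minimal sketch (assuming NumPy; the function name is ours):

    import numpy as np

    def cml_portfolio(r, C, r0, t):
        # Optimal portfolio with a risk-free asset (Theorem 7.7)
        rbar = r - r0                      # excess returns r - r0*e
        x = t * np.linalg.solve(C, rbar)   # risky-asset weights
        x0 = 1.0 - x.sum()                 # risk-free weight (budget constraint)
        cbar = rbar @ np.linalg.solve(C, rbar)
        return x0, x, r0 + cbar * t, cbar * t**2   # x0, x, mean, variance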
Capital Market Line
Since

    µ̂(t) = r0 + c̄t and σ̂²(t) = c̄t², where c̄ = r̄^T C^{-1} r̄

we have σ̂ = √c̄ t, or

    µ̂ = r0 + √c̄ σ̂

So the efficient frontier is a line in (σ̂, µ̂) space.
When we have a risk-free asset, the efficient frontier is called the capital market line (CML).
Alternative form: recalling the MVF with only risky assets, σ² = (a/d)(µ − b/a)² + 1/a, we can also write the CML as

    µ̂ = r0 + (σ0 √d) σ̂

where σ0² = (a/d)(r0 − b/a)² + 1/a.
Capital Market Line
Derivation of alternative form: since C is positive definite, so is C^{-1} (see earlier discussion), and hence c̄ > 0. We then write

    c̄ = r̄^T C^{-1} r̄ = (r − r0 e)^T C^{-1} (r − r0 e)
      = r^T C^{-1} r − 2r0 e^T C^{-1} r + r0² e^T C^{-1} e
      = c − 2r0 b + r0² a
      = a(r0² − 2(b/a)r0 + b²/a²) + c − b²/a
      = a(r0 − b/a)² + d/a
      = dσ0²

after completing the square (remember, d = ac − b², and compare with the definition of σ0²). Hence √c̄ = σ0√d.
Capital Market Line
Suppose r0 < b/a (the return of the minimum risk portfolio of risky assets; why is this reasonable?). Then the CML looks like

[Figure: the CML in the (σ, µ) plane, a straight line from (0, r0) tangent to the risky-only EF at the point with x0 = 0; the segment with x0 ≥ 0 lies between (0, r0) and the tangency point, and x0 < 0 beyond it]

The CML intersects the 'risky-only' EF at a single point M, called the market portfolio. Here, all our wealth is invested in risky assets (so x0 = 0). Portfolios with x0 < 0 (or µ > µ_M) are called borrowing portfolios (short the risk-free asset, i.e. borrow at rate r0), and those with x0 ≥ 0 (or r0 ≤ µ ≤ µ_M) are called lending portfolios (buy risk-free bonds, i.e. lend at rate r0).
Market Portfolio
Theorem 7.8. If r0 < b/a, then the CML is tangential to the risky-asset EF at a single point M, called the market portfolio, which contains only risky assets (x0 = 0 at M). This point satisfies

    µ_M = (c − br0)/(b − ar0), and σ_M = σ0√d/(b − ar0), with t_M = 1/(b − ar0)

Proof: the CML is given by

    µ̂ = r0 + σ0√d σ̂ =⇒ σ̂ = (µ̂ − r0)/(σ0√d)

and the risky-asset EF is given by

    σ² = (a/d)(µ − b/a)² + 1/a =⇒ σ² = (aµ² − 2bµ + c)/d

(using c = (b² + d)/a by definition of d)
Market Portfolio
Proof: At the intersection point, µ̂ = µ = µ_M and σ̂ = σ = σ_M, so

    (µ_M − r0)² = σ0²(aµ_M² − 2bµ_M + c) =⇒ (1 − aσ0²)µ_M² − 2(r0 − bσ0²)µ_M + (r0² − cσ0²) = 0

For this quadratic equation Aµ_M² + Bµ_M + C = 0, the discriminant is

    B² − 4AC = 4(r0 − bσ0²)² − 4(1 − aσ0²)(r0² − cσ0²)
             = 4(r0² − 2br0σ0² + b²σ0⁴) − 4(r0² − cσ0² − ar0²σ0² + acσ0⁴)
             = 4σ0²(ar0² − 2br0 + c) + 4σ0⁴(b² − ac)

Since d = ac − b² and dσ0² = a(r0 − b/a)² + d/a = ar0² − 2br0 + (b² + d)/a = ar0² − 2br0 + c, we get B² − 4AC = 4σ0²(dσ0²) − 4σ0⁴ d = 0. So the quadratic has a unique root (i.e. the CML and EF intersect at exactly one point).
Verifying the values of µ_M, σ_M and t_M is left for the tutorials (week 13, Q3).
Market Portfolio
Proof: At M, for the CML we have

    µ̂ = r0 + σ0√d σ̂ =⇒ dµ̂/dσ̂ = σ0√d

and for the EF we have

    σ² = (a/d)(µ − b/a)² + 1/a =⇒ 2σ_M = (2a/d)(µ_M − b/a) dµ/dσ

or

    dµ/dσ = dσ_M/(aµ_M − b) = [d σ0√d/(b − ar0)] / [a(c − br0)/(b − ar0) − b(b − ar0)/(b − ar0)]
          = dσ0√d / [a(c − br0) − b(b − ar0)]

The denominator simplifies to ac − b² = d, so dµ/dσ = σ0√d = dµ̂/dσ̂, and hence the CML and EF are tangent at M.
Lastly, at M we have x^T e = 1 (budget constraint for the EF) and so x0 = 1 − x^T e = 0.
One-Fund Theorem
With a risk-free asset, the efficient frontier is a line (the CML). We know two points on that line: the market portfolio M (no risk-free asset) and the risk-free asset (σ, µ) = (0, r0).

Theorem 7.9 (One-Fund Theorem). Every efficient portfolio (on the CML) can be constructed as a linear combination of the (risky) market portfolio M and the risk-free asset.

For a portfolio x̂ with given x0, the efficient portfolio has return R_x̂ = x0 R0 + (1 − x0) R_M and so

    µ̂ = x0 r0 + (1 − x0)µ_M and σ̂² = (1 − x0)² σ_M²

Using 1 − x0 = σ̂/σ_M we get

    µ̂ = (1 − σ̂/σ_M) r0 + (σ̂/σ_M) µ_M or µ̂ = r0 + ((µ_M − r0)/σ_M) σ̂

which is an alternative expression for the CML. The slope (µ_M − r0)/σ_M is called the Sharpe ratio and represents the extra (expected) return per unit of extra risk taken.
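To make this concrete: for the earlier 3-asset example with a hypothetical risk-free rate r0 = 0.5 (so that r0 < b/a = 4/5), the market portfolio and its Sharpe ratio can be computed directly in NumPy:

    import numpy as np

    r = np.array([1.0, 2.0, 0.0])
    C = np.diag([1.0, 2.0, 1.0])
    r0, e = 0.5, np.ones(3)
    a, b = e @ np.linalg.solve(C, e), r @ np.linalg.solve(C, e)
    tM = 1.0 / (b - a * r0)                    # Theorem 7.8
    xM = tM * np.linalg.solve(C, r - r0 * e)   # market portfolio (weights sum to 1)
    muM, sigM = r @ xM, np.sqrt(xM @ C @ xM)
    print(xM, xM.sum(), (muM - r0) / sigM)     # [2/3, 1, -2/3], 1.0, the Sharpe ratio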