Module Code: MAT00031H
BA, BSc and MMath Examinations 2019/20
Department: Mathematics
Title of Exam: Statistical Pattern Recognition
Time Allowed: 2 hours
Allocation of Marks: The marking scheme shown on each question is indicative only.
Instructions for Candidates: Answer ALL questions. Please write your answers in ink; pencil is acceptable for graphs and diagrams. Do not use red ink.
Materials Supplied: Answer booklet; calculator.
Do not write on this booklet before the exam begins.
Do not turn over this page until instructed to do so by an invigilator.
1 (of 3). (a) Explain what is meant by the term pattern recognition and give 3 real-world examples of pattern recognition applications. You should mention both supervised and unsupervised analyses in your answer. [6]
(b) (i) Explain the terms decision rule and decision region in the context of classification.
(ii) Give the decision rule for a K-class problem in terms of the discriminant functions, g_k(x), for k = 1, ..., K.
(iii) Give a decision rule based solely on prior probabilities for a two-class problem. What is the probability of error under this rule? [9]
(c) The Likelihood Ratio Test (LRT) that minimizes the Bayes' risk for two classes, C1 and C2, is given by

\Lambda_{\mathrm{Bayes}}(x) = \frac{p(x \mid C_1)}{p(x \mid C_2)} \;\underset{C_2}{\overset{C_1}{\gtrless}}\; \frac{(L_{12} - L_{22})\, P(C_2)}{(L_{21} - L_{11})\, P(C_1)}.
(i) Show that, with the 0/1 loss function, the LRT reduces to the Maximum A Posteriori (MAP) criterion.
(ii) If the priors are also equal, show that the MAP criterion reduces to the Maximum Likelihood (ML) criterion. [5]
(d) Explain the reject option in Bayes' classification and how it can be implemented. Give a practical example for the use of this option. [5]
(e) Suppose the likelihoods, p(x|C1) and p(x|C2), for two classes, C1 and C2, are defined as follows:

p(x \mid C_1) = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}, \quad \forall x,
p(x \mid C_2) = \frac{1}{4}, \quad -2 < x < 2.

Find the minimum error classification rule for this two-class problem, assuming P(C1) = P(C2) = 0.5. [10]
(f) Given the following two-dimensional dataset:
Class 1: (0, 2), (0, 4), (1, 2), (2, 3)
Class 2: (2, 1), (3, 1), (3, 3), (4, 4)
classify a new point (2, 2) using k-nearest neighbour classification with
the Euclidean distance when k = 3 and when k = 5. [5]
2 (of 3). (a) Learning vector quantization (LVQ) and self-organising maps (SOM) are
both competitive learning algorithms.
(i) Explain the term competitive learning.
(ii) Describe the architecture of an LVQ in terms of a neural network and
explain how such a network is used in classification.
[4]
(b) The following equations are used in the context of SOMs:

d(n) = \exp\!\left( -\frac{(x - w)^2}{2\sigma^2(n)} \right)

and

\Delta w_{jk} = r(n) \cdot d(n) \cdot (x_k - w_{jk}).
State the meaning of each symbol and explain how the equations are used
in the self-organising process. [8]
(c) A SOM has three output neurons with current weights:
{N1 = (1,−1,−1)T ; N2 = (−1, 0, 1)T ; N3 = (1, 1, 0)T}.
Find the winning neuron for the input feature vector x = (−1, 1, 1)T and
give its updated weights using a learning rate of 0.5. [8]
(d) Given the 2-dimensional training data set
{(4, 3), (5, 4), (6, 3), (6, 4), (8, 5), (8, 6), (9, 4), (9, 5), (9, 6), (11, 5)}
and using the Parzen window function

\phi(u) = \begin{cases} 1 & \text{if } |u_j| \le 1/2 \text{ for } j = 1, 2 \\ 0 & \text{otherwise,} \end{cases}
(i) Estimate the density p(x, y) at the points (x = 7, y = 5) and
(x = 9, y = 5) with a bandwidth of 2.
(ii) Estimate the density p(x, y) at the points (x = 7, y = 5) and
(x = 9, y = 5) with a bandwidth of 4.
(iii) Giving your reasons, state which bandwidth you think is best.
[10]
3 (of 3). (a) Artificial neural networks are widely used machine learning algorithms.
(i) Give the motivation behind the development of artificial neural networks and explain briefly how this is achieved. [3]
(ii) Draw a diagram to show a simple perceptron, explaining any notation or symbols that you use. [5]
(b) The table below shows the summary statistics for petal length and petal width measurements (in centimetres) of two different species of iris: 50 of the species versicolor and 50 of the species virginica.

                    versicolor           virginica
variable            mean     variance    mean     variance    grand mean
petal length        4.26     0.221       5.552    0.305       4.906
petal width         1.326    0.039       2.026    0.075       1.676
For G groups, with n_g examples in group g and N = n_1 + ... + n_G examples in total, the between-groups variance is given by

V_B = \frac{1}{N - G} \sum_{g=1}^{G} n_g (\bar{x}_g - \bar{x})^2

and the within-groups variance by

V_W = \frac{1}{N - G} \sum_{g=1}^{G} (n_g - 1)\, s_g^2,

where \bar{x}_g and s_g^2 are the sample mean and variance respectively for group g, and \bar{x} is the grand mean.
You may assume the following identity without proof:

\mathrm{Var}\!\left( \sum_{j=1}^{m} \alpha_j X_j \right) = \sum_{j=1}^{m} \alpha_j^2 \,\mathrm{Var}(X_j) + 2 \sum_{j=1}^{m} \sum_{k=1}^{j-1} \alpha_j \alpha_k \,\mathrm{Cov}(X_j, X_k)

for random variables X_1, ..., X_m and constants \alpha_1, ..., \alpha_m.
(i) Calculate the separation for each variable. [10]
(ii) Let X1 denote petal length and X2 denote petal width. Given that the within-groups and between-groups covariances of petal length and petal width are Cov_W(X1, X2) = 0.061 and Cov_B(X1, X2) = 0.231 respectively, calculate the separation for the linear discriminant

D = 0.31\,(X_1 - \bar{X}_1) + 0.96\,(X_2 - \bar{X}_2). [8]
(iii) Using the discriminant function in part (ii), classify the new feature vector with measurements of 5.9 cm and 2.3 cm for petal length and petal width respectively. [4]
End of examination.
SOLUTIONS: MAT00031H
1. (a) Pattern recognition involves the automated recognition of patterns or regularities in the features of the objects being analysed, which can be used to identify groups or clusters of objects (unsupervised analysis) or to classify new objects (supervised analysis). Examples of applications include medical diagnosis, speech recognition, fingerprint identification, number-plate recognition and financial forecasting, amongst many others.
6 Marks
(b) (i) In classification, a decision rule is a rule or function that determines which one of the possible classes a feature vector, x, is to be assigned to. A decision region R_j is the region of feature space over which the classifier assigns class C_j.
(ii) For discriminant functions, g_k(x), for k = 1, ..., K, the decision rule is: choose class C_j if

g_j(x) > g_k(x) \quad \forall\, k \neq j.

(iii) For a two-class problem, a decision rule based solely on prior probabilities is: assign class C1 if P(C1) > P(C2), and class C2 otherwise.
The probability of error under this rule is P(error) = min(P(C1), P(C2)).
9 Marks
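As an illustration, the two rules in (ii) and (iii) can be written as a short Python sketch; the discriminant functions are assumed to be supplied as callables, and all names are illustrative rather than part of the model answer.

    import numpy as np

    def decide(x, discriminants):
        # K-class rule: choose the class j whose discriminant g_j(x) is largest
        return int(np.argmax([g(x) for g in discriminants]))

    def prior_only_rule(p_c1, p_c2):
        # two-class rule based solely on the priors; its error is min(P(C1), P(C2))
        return ("C1" if p_c1 > p_c2 else "C2"), min(p_c1, p_c2)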
(c) (i) In the case of the 0/1 loss function, we have

L_{jk} = \begin{cases} 0 & \text{if } j = k \\ 1 & \text{if } j \neq k, \end{cases}

and the LRT becomes

\Lambda_{\mathrm{MAP}}(x) = \frac{p(x \mid C_1)}{p(x \mid C_2)} \;\underset{C_2}{\overset{C_1}{\gtrless}}\; \frac{P(C_2)}{P(C_1)}.
(ii) If the priors are also equal, then we have

\Lambda_{\mathrm{MAP}}(x) = \frac{p(x \mid C_1)}{p(x \mid C_2)} \;\underset{C_2}{\overset{C_1}{\gtrless}}\; 1

or, equivalently,

p(x \mid C_1) \;\underset{C_2}{\overset{C_1}{\gtrless}}\; p(x \mid C_2),

which is the Maximum Likelihood (ML) criterion.
5 Marks
(d) The reject option in Bayes' classification allows for no classification decision to be made in regions of greatest uncertainty, where the greatest posterior probability is significantly less than 1. This can be achieved by specifying a threshold, θ, and rejecting inputs, x, for which the largest posterior probability P(C_k|x) < θ. In medical diagnosis, for example, it may be better to automate classification of the easy cases but leave the more difficult cases to a human expert.
5 Marks
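A minimal Python sketch of such a reject rule, assuming the posterior probabilities have already been computed; the threshold value used here is purely illustrative.

    def classify_with_reject(posteriors, theta=0.9):
        # return the class with the largest posterior P(C_k|x),
        # unless that posterior falls below theta, in which case reject
        k_best = max(posteriors, key=posteriors.get)
        return k_best if posteriors[k_best] >= theta else "reject"

    print(classify_with_reject({"C1": 0.55, "C2": 0.45}))  # reject (refer to an expert)
    print(classify_with_reject({"C1": 0.97, "C2": 0.03}))  # C1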
(e) The minimum error classification rule is given by the Bayes’ rule. If
−2 < x < 2 then the Bayes’ rule is:
Choose C1 if p(x|C1) > p(x|C2), i.e.

\frac{1}{\sqrt{2\pi}}\, e^{-x^2/2} > \frac{1}{4},

or g(x) > 0, where

g(x) = \ln\!\left( \frac{4}{\sqrt{2\pi}} \right) - \frac{x^2}{2},

giving the rule: choose C1 if g(x) > 0; otherwise choose C2.
Solving g(x) = 0 gives x = \pm\sqrt{2 \ln(4/\sqrt{2\pi})} \approx \pm 0.9668. If x < −2 or x > 2 then p(x|C2) = 0, so we should always choose C1. Therefore we obtain:
x < −2: choose C1
−2 < x < −0.9668: choose C2
−0.9668 < x < 0.9668: choose C1
0.9668 < x < 2: choose C2
x > 2: choose C1
10 Marks
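The threshold and the resulting decision regions can be checked numerically with a short Python sketch (variable names are illustrative):

    import numpy as np

    # g(x) = ln(4/sqrt(2*pi)) - x**2/2 changes sign at +/- sqrt(2*ln(4/sqrt(2*pi)))
    threshold = np.sqrt(2 * np.log(4 / np.sqrt(2 * np.pi)))
    print(round(threshold, 4))  # 0.9668

    def classify(x):
        if x <= -2 or x >= 2:
            return "C1"  # p(x|C2) = 0 outside (-2, 2)
        g = np.log(4 / np.sqrt(2 * np.pi)) - x**2 / 2
        return "C1" if g > 0 else "C2"

    print([classify(x) for x in (-3, -1.5, 0, 1.5, 3)])  # ['C1', 'C2', 'C1', 'C2', 'C1']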
(f) The Euclidean distances from the point (2, 2) to each point in the dataset are:
Class 1: 2, 2√2, 1, 1
Class 2: 1, √2, √2, 2√2.
When k = 3, the nearest neighbours are two from Class 1 and one from
Class 2 so that the point should be classified as Class 1. However, when
k = 5, the nearest neighbours are two from Class 1 and three from Class
2 so that the point should be classified as Class 2.
5 Marks
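The same classification can be reproduced with a short Python sketch of k-nearest neighbours (names are illustrative):

    import numpy as np
    from collections import Counter

    train = [((0, 2), 1), ((0, 4), 1), ((1, 2), 1), ((2, 3), 1),
             ((2, 1), 2), ((3, 1), 2), ((3, 3), 2), ((4, 4), 2)]

    def knn(point, k):
        # sort the training points by Euclidean distance and let the k nearest vote
        dists = sorted((np.hypot(point[0] - x, point[1] - y), label)
                       for (x, y), label in train)
        return Counter(label for _, label in dists[:k]).most_common(1)[0][0]

    print(knn((2, 2), k=3), knn((2, 2), k=5))  # 1 2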
Remarks. Everything in this question should be straightforward and similar to
examples that have been seen. Total: 40 Marks
2. (a) (i) In competitive learning algorithms nodes (neurons) compete for the
right to respond to the input data and only one node is allowed to
respond to each input vector. The specialisation of each node in the
network is increased in each iteration.
(ii) In terms of neural networks, an LVQ is a feed-forward network with one hidden layer of nodes (neurons), fully connected to the input layer, and one output layer. In classification, the output class of the winning node is assigned to the input vector.
4 Marks
(b) The equation for \Delta w_{jk} is used to update the weights of the SOM as

w_{jk}^{\mathrm{new}} = w_{jk}^{\mathrm{old}} + \Delta w_{jk},

where n is the number of iterations, and

d(n) = \exp\!\left( -\frac{(x - w)^2}{2\sigma^2(n)} \right)
determines how the nodes are updated. Only the winning neuron and the nodes in the neighbourhood of the winning node are updated, where the neighbourhood is defined by its radius σ(n); in d(n), x denotes the winning node and w a node in its neighbourhood, so nodes within the neighbourhood are updated according to how close they are to the winning node, with the closest being updated by the greatest amount. σ depends on n because the radius shrinks as the number of iterations increases. r(n) is the learning rate, which is also reduced as the number of iterations increases using an exponential decay function. Finally, x_k and w_{jk} denote the kth weight of the winning node and the kth weight of the jth node being updated, respectively.
8 Marks
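One common reading of these equations is sketched below in Python; the exponential decay schedules and their constants (r0, sigma0, tau) are illustrative assumptions, not values given in the question.

    import numpy as np

    def som_step(weights, x, n, r0=0.5, sigma0=1.0, tau=100.0):
        # one SOM update: find the winning node, then move every node towards the
        # input x by an amount weighted by the neighbourhood function d(n);
        # weights is a float array of shape (number of nodes, input dimension)
        winner = int(np.argmin(np.linalg.norm(weights - x, axis=1)))
        w_win = weights[winner].copy()
        r_n = r0 * np.exp(-n / tau)          # learning rate decays with n
        sigma_n = sigma0 * np.exp(-n / tau)  # neighbourhood radius shrinks with n
        for j in range(len(weights)):
            # d(n) for node j, taken here as a function of its distance
            # (in weight space) from the winning node
            d_n = np.exp(-np.sum((w_win - weights[j]) ** 2) / (2 * sigma_n ** 2))
            weights[j] += r_n * d_n * (x - weights[j])
        return weights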
(c) The distances from x to each output node are:

d_1 = \sqrt{(-1 - 1)^2 + (1 - (-1))^2 + (1 - (-1))^2} = \sqrt{12},
d_2 = \sqrt{(-1 - (-1))^2 + (1 - 0)^2 + (1 - 1)^2} = 1,
d_3 = \sqrt{(-1 - 1)^2 + (1 - 1)^2 + (1 - 0)^2} = \sqrt{5},
so the winning neuron is N2. Its weights are updated according to

w_{2k}^{\mathrm{new}} = w_{2k}^{\mathrm{old}} + 0.5\,(x_k - w_{2k}^{\mathrm{old}}),

giving

w_{21}^{\mathrm{new}} = -1 + 0.5(-1 - (-1)) = -1,
w_{22}^{\mathrm{new}} = 0 + 0.5(1 - 0) = 0.5,
w_{23}^{\mathrm{new}} = 1 + 0.5(1 - 1) = 1,

so that we now have N2 = (−1, 0.5, 1)^T.
8 Marks
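The same calculation can be checked in a few lines of Python:

    import numpy as np

    x = np.array([-1.0, 1.0, 1.0])
    nodes = np.array([[1, -1, -1], [-1, 0, 1], [1, 1, 0]], dtype=float)

    dists = np.linalg.norm(nodes - x, axis=1)   # [sqrt(12), 1, sqrt(5)]
    winner = int(np.argmin(dists))              # index 1, i.e. N2
    nodes[winner] += 0.5 * (x - nodes[winner])  # learning rate 0.5
    print(dists.round(3), nodes[winner])        # [3.464 1. 2.236] [-1. 0.5 1.]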
(d) (i) When h = 2, we have

p(x = 7, y = 5) = \frac{1}{10 \cdot 2^2} \sum_{i=1}^{10} \phi\!\left( \frac{7 - x_i}{2} \right) \phi\!\left( \frac{5 - y_i}{2} \right) = \frac{1}{40}\,(0 + 0 + 0 + 1 + 1 + 1 + 0 + 0 + 0 + 0) = 0.075,

as only data points within 1 unit (in both coordinates) contribute to the sum. Similarly,

p(x = 9, y = 5) = \frac{1}{10 \cdot 2^2} \sum_{i=1}^{10} \phi\!\left( \frac{9 - x_i}{2} \right) \phi\!\left( \frac{5 - y_i}{2} \right) = \frac{1}{40}\,(0 + 0 + 0 + 0 + 1 + 1 + 1 + 1 + 1 + 0) = 0.125.
(ii) When h = 4, we have

p(x = 7, y = 5) = \frac{1}{10 \cdot 4^2} \sum_{i=1}^{10} \phi\!\left( \frac{7 - x_i}{4} \right) \phi\!\left( \frac{5 - y_i}{4} \right) = \frac{1}{160}\,(0 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 0) = 0.05,

as data points within 2 units (in both coordinates) contribute to the sum. Similarly,

p(x = 9, y = 5) = \frac{1}{10 \cdot 4^2} \sum_{i=1}^{10} \phi\!\left( \frac{9 - x_i}{4} \right) \phi\!\left( \frac{5 - y_i}{4} \right) = \frac{1}{160}\,(0 + 0 + 0 + 0 + 1 + 1 + 1 + 1 + 1 + 1) = 0.0375.
(iii) Plotting the data shows that it roughly forms two clusters, with the point (x = 9, y = 5) inside one cluster and the point (x = 7, y = 5) between the two clusters, so the density should be greater at (x = 9, y = 5) than at (x = 7, y = 5). This is the case when h = 2; however, when h = 4 the kernel function smooths the density too much, so that both clusters contribute to the density at (x = 7, y = 5), making it greater than that at (x = 9, y = 5). Therefore a bandwidth of h = 2 is more appropriate than a bandwidth of h = 4.
10 Marks
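The estimates for both bandwidths can be verified with a short Python sketch of the hypercube Parzen estimator (names are illustrative):

    import numpy as np

    data = np.array([(4, 3), (5, 4), (6, 3), (6, 4), (8, 5),
                     (8, 6), (9, 4), (9, 5), (9, 6), (11, 5)], dtype=float)

    def parzen(point, h):
        # count the points falling in the h-by-h window centred at `point`
        # and divide by N * h**2
        inside = np.all(np.abs(data - np.array(point)) / h <= 0.5, axis=1)
        return inside.sum() / (len(data) * h ** 2)

    for h in (2, 4):
        print(h, parzen((7, 5), h), parzen((9, 5), h))
    # 2 0.075 0.125
    # 4 0.05 0.0375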
Remarks. Theory has been covered in lectures. Total: 30 Marks
3. (a) (i) Artificial neural networks aim to mimic the processes in the brain. The inputs act like synapses: each input is multiplied by a weight giving the strength of its signal, and some function of these weighted signals determines the activation of the neuron and hence its output. Just as real neurons can pass signals on to other neurons, artificial networks can have multiple layers to process more complex problems.
3 Marks
(ii) The figure below shows a simple perceptron with inputs x_1, ..., x_n. Each input x_i is multiplied by a weight w_i, and a bias term b is added to the weighted sum of the inputs (denoted by Σ) to give the function f. The step function, defined by

\mathrm{step}(f) = \begin{cases} 1 & \text{if } f > 0 \\ 0 & \text{otherwise,} \end{cases}

determines the output, a.
[Figure: a simple perceptron with inputs x_1, ..., x_n, weights w_1, ..., w_n, bias b, a summation unit Σ and a step activation producing the output a.]
5 Marks
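A minimal Python sketch of such a perceptron; the weights and bias in the example are illustrative, chosen so that the unit computes a logical AND.

    import numpy as np

    def perceptron(x, w, b):
        # weighted sum of the inputs plus bias, passed through the step activation
        f = np.dot(w, x) + b
        return 1 if f > 0 else 0

    print(perceptron([1, 1], w=[1.0, 1.0], b=-1.5))  # 1
    print(perceptron([1, 0], w=[1.0, 1.0], b=-1.5))  # 0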
(b) (i) Using the formulae given, the between- and within-groups variances for petal length are

V_B(X_1) = \frac{50\,(4.26 - 4.906)^2 + 50\,(5.552 - 4.906)^2}{98} = 0.426

and

V_W(X_1) = \frac{49\,(0.221 + 0.305)}{98} = 0.263.

Similarly, for petal width we have

V_B(X_2) = \frac{50\,(1.326 - 1.676)^2 + 50\,(2.026 - 1.676)^2}{98} = 0.125

and

V_W(X_2) = \frac{49\,(0.039 + 0.075)}{98} = 0.057.

Thus the separation for petal length is

\frac{V_B(X_1)}{V_W(X_1)} = 1.62

and for petal width the separation is

\frac{V_B(X_2)}{V_W(X_2)} = 2.19.
10 Marks
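These separations can be reproduced with a few lines of Python (names are illustrative):

    def separation(means, variances, grand_mean, n=50, N=100, G=2):
        # between- and within-groups variances as defined in the question
        vb = sum(n * (m - grand_mean) ** 2 for m in means) / (N - G)
        vw = sum((n - 1) * s2 for s2 in variances) / (N - G)
        return vb, vw, vb / vw

    print(separation([4.26, 5.552], [0.221, 0.305], 4.906))   # petal length: ratio ~1.62
    print(separation([1.326, 2.026], [0.039, 0.075], 1.676))  # petal width:  ratio ~2.19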
(ii) Using the identity given, we have

\mathrm{Var}_B(D) = (0.31)^2 \cdot 0.426 + (0.96)^2 \cdot 0.125 + 2 \cdot 0.31 \cdot 0.96 \cdot 0.231 = 0.294

and

\mathrm{Var}_W(D) = (0.31)^2 \cdot 0.263 + (0.96)^2 \cdot 0.057 + 2 \cdot 0.31 \cdot 0.96 \cdot 0.061 = 0.114,

so that the separation is given by

\frac{V_B(D)}{V_W(D)} = \frac{0.294}{0.114} = 2.58.
8 Marks
(iii) The decision boundary is given by D = 0, and since the class means for both variables are greater for the species virginica, we classify examples with D > 0 as virginica and those with D < 0 as versicolor. When petal length = 5.9 cm and petal width = 2.3 cm, we have

D = 0.31\,(5.9 - 4.906) + 0.96\,(2.3 - 1.676) = 0.91 > 0,

so this example would be classified as virginica.
4 Marks
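A short Python check of parts (ii) and (iii); names are illustrative, and small differences from the figures above are due to rounding of the intermediate values.

    a1, a2 = 0.31, 0.96

    var_b = a1**2 * 0.426 + a2**2 * 0.125 + 2 * a1 * a2 * 0.231  # ~0.294
    var_w = a1**2 * 0.263 + a2**2 * 0.057 + 2 * a1 * a2 * 0.061  # ~0.114
    print(round(var_b / var_w, 2))  # 2.57 (2.58 using the rounded values above)

    # classify the new example: petal length 5.9 cm, petal width 2.3 cm
    D = a1 * (5.9 - 4.906) + a2 * (2.3 - 1.676)
    print(round(D, 2), "virginica" if D > 0 else "versicolor")  # 0.91 virginica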
Remarks. The theory has been covered in lectures and similar problems given on
problem sheets and discussed in seminars. Total: 30 Marks
