10/2021
THE UNIVERSITY OF HONG KONG
DEPARTMENT OF STATISTICS AND ACTUARIAL SCIENCE
STAT3602 Statistical Inference
Mini-project
(assessment weighting: 15%)
Due date: 8 December, 2021
Complete both questions 1 and 2 with the help of computer software.
[Recommended: R package, downloadable from https://cran.r-project.org/bin/windows/base/.]
Disclaimer: The background story is entirely made-up and the data are computer-generated, although
the names are genuine. No offence.
Background
The STAT3602 class of students are the only single adults remaining in this world. They consist of
13 men and 9 women. Cupid , a matchmaking agency, offers a free trial of its new service to the class.
The service includes two parts:
• Cupid will pair three men off with three women in order to maximise the satisfaction level of
the whole class.
• Cupid will test for each individual whether he/she is the unanimous No. 1 choice of the opposite
sex.
Parameters
For i = 1, . . . , 13, man i has in mind three favourite women, namely woman αi,1 (most favourite),
woman αi,2 (second most favourite) and woman αi,3 (third most favourite), where αi,1, αi,2, αi,3 are
distinct integers picked from {1, 2, . . . , 9}.
Similarly, for j = 1, . . . , 9, woman j has in mind three favourite men, namely men βj,1, βj,2 and βj,3
(in descending order of priority), where βj,1, βj,2, βj,3 are distinct integers picked from {1, 2, . . . , 13}.
1
Define
θ =
α1,1 α1,2 α1,3
...
...
...
α13,1 α13,2 α13,3
β1,1 β1,2 β1,3
...
...
...
β9,1 β9,2 β9,3
,
which can be taken as a 22× 3 parameter matrix, unknown to Cupid.
The parameter space Θ consists of (9× 8× 7)13 × (13× 12× 11)9 ≈ 1.7468× 1064 elements.
Actions
Cupid is looking for the best way to pair three men off with three women. Each possible way (action)
can be identified with a 2× 3 matrix
a =
a11 a12 a13
a21 a22 a23
,
where a11 < a12 < a13 are three integers picked from {1, . . . , 9} and a21, a22, a23 are three distinct
integers picked from {1, . . . , 13}, meaning that women a11, a12, a13 are paired off with men a21, a22, a23,
respectively.
The action space A consists of 9× 8× 7× 13× 12× 11/3! = 144144 distinct actions.
Satisfaction score
The satisfaction score of each individual is set to be
5, if he/she is paired off with his/her most favourite partner,
3, if he/she is paired off with his/her second most favourite partner,
2, if he/she is paired off with his/her third most favourite partner,
1, if he/she is paired off with somebody not among his/her top three choices,
0, if he/she is not paired off with anybody.
Cupid aims to maximise the total satisfaction score of the whole class, i.e. the sum of satisfaction
scores of the 22 persons.
2
Data and related statistical model
To make an informed decision, Cupid monitors whether each person “follows” potential partners of
the opposite sex on a social media platform, and observes a dataset in the form of two matrices, of
dimensions 13× 9 and 9× 13, respectively:
(X,Y ) =
X1,1 · · · X1,9
...
...
...
X13,1 · · · X13,9
,
Y1,1 · · · Y1,13
...
...
...
Y9,1 · · · Y9,13
,
where Xi,j = 1{man i follows woman j} and Yj,i = 1{woman j follows man i}, for i = 1, . . . , 13 and
j = 1, . . . , 9, with 1{·} denoting the indicator function.
It is known that a person follows his/her most favourite partner with probability 0.9, second
most favourite partner with probability 0.75, third most favourite partner with probability 0.6, and
a partner not among the top three choices with probability 0.3. All followings are assumed to be
independent of each other. In other words, we have
P (αi,1,αi,2,αi,3)(Xi,j = 1) =
0.9, j = αi,1,
0.75, j = αi,2,
0.6, j = αi,3,
0.3, otherwise,
and P (βj,1,βj,2,βj,3)(Yj,i = 1) =
0.9, i = βj,1,
0.75, i = βj,2,
0.6, i = βj,3,
0.3, otherwise,
for i = 1, . . . , 13 and j = 1, . . . , 9, and that the 234 entries of (X,Y ) are independent of each other.
The actually observed data (x,y), together with names of the 13 men and 9 women, are saved in
two separate worksheets (“data X” and “data Y”) of the Excel file data.xlsx.
3
Question 1 (8%)
Cupid has a non-informative prior belief about θ, represented by the prior mass function
pi(θ) ∝ 1 for all θ ∈ Θ.
Based on the observed data (x,y), what is Cupid’s Bayesian decision on the pairings?
Some tips:
• Show that the posterior mass function satisfies
pi(θ|X,Y ) ∝
{
13∏
i=1
21Xi,αi,17Xi,αi,2 (3.5)Xi,αi,3
}{
9∏
j=1
21Yj,βj,17Yj,βj,2 (3.5)Yj,βj,3
}
1{θ ∈ Θ}.
• Show that under action a, man a2j has satisfaction score
41{αa2j ,1 = a1j}+ 21{αa2j ,2 = a1j}+ 1{αa2j ,3 = a1j}+ 1, j = 1, 2, 3,
and the other 10 men have satisfaction scores 0. Similar expressions can be established for
women.
• Since the posterior probabilities pi(θ|X,Y ) constitute an array of about 1.7468 × 1064
components, it is almost impossible to store all these entries in the computer. However, if
we can show that the 22 rows of the matrix θ are independent of each other (conditional
on (X,Y )), then we may simply store the posterior mass function of each row, so that
pi(θ|X,Y ) is easily recoverable as the product of these 22 mass functions. This saves a
lot of space, since the posterior mass functions of (αi,1, αi,2, αi,3) and (βj,1, βj,2, βj,3) are
arrays of only 9× 8× 7 = 504 and 13× 12× 11 = 1716 components, respectively.
• The R functions permutations(n,r) and combinations(n,r) are useful for generating
a full list of n(n− 1) · · · (n− r + 1) permutations and a full list of (n
r
)
combinations of r
integers picked without replacement from {1, . . . , n}, respectively.
• There are 144144 possible actions to be considered. Although the number seems big, we
may still go through them one by one using our computer and come up with the best action
within a few seconds (or at most a few minutes).
4
Sample R code for setting up posterior mass function of (αi,1, αi,2, αi,3):
space<-permutations(9,3)
size<-9*8*7
post pmf<-matrix(0,13,size)
for (i in 1:13){
for (t in 1:size){
post pmf[i,t]<- “formula for posterior probability that (αi,1, αi,2, αi,3) = space[t,]”
}
}
# row i of matrix post pmf stores the posterior mass function of (αi,1, αi,2, αi,3)
5
Question 2 (7%)
Cupid wishes to conduct a likelihood ratio test of
H0 : [Your name] is the No. 1 choice of everyone of the opposite sex vs H1 : no restriction.
Calculate an approximate p-value for Cupid.
Some tips:
• If you are male, only data y are relevant; if you are female, only data x are relevant.
• The likelihood ratio test statistic Λdata(H0, H1) involves maximisation of the likelihood
function over a large number of cases under either H0 or H1. You may ask your computer
to search for the maximum but that will be too computationally costly. On the other
hand, with a little bit of thinking, you should be able to derive a closed-form expression
for Λdata(H0, H1), which depends only on (i) whether you are followed by each potential
partner, and (ii) the number of people followed by each potential partner.
• Since you are dealing with a problem involving a discrete parameter space, approximating the
null distribution of 2 ln Λdata(H0, H1) by the chi-squared distribution is no longer justified.
Please do not consider it.
• Suppose that you are male (the female case can be treated similarly). The p-value is
officially defined as
max
H0
P
(
ΛY (H0, H1) ≥ Λy(H0, H1)
∣∣β1,1, . . . , β9,3).
Under H0 (i.e. βj,1 = [you] ∀ j), there are (12× 11)9 ≈ 1.2166× 1019 possible choices of
the parameter vector (β1,1, . . . , β9,3). This makes an exhaustive search for the maximum
impracticable. You are therefore suggested to propose one or several choices of the
parameter vector under H0, which you think might yield a big value for the probability
P
(
ΛY (H0, H1) ≥ Λy(H0, H1)
∣∣β1,1, . . . , β9,3) and take the biggest value as an approximation
to the p-value.
For the calculation of P
(
ΛY (H0, H1) ≥ Λy(H0, H1)
∣∣β1,1, . . . , β9,3) under a particular
choice of the parameter vector (β1,1, . . . , β9,3), you may ask your computer to generate
a large number (e.g. 100000) of random copies of the dataset Y under that choice of
parameter vector and approximate the probability by the proportion of cases satisfying
“ΛY (H0, H1) ≥ Λy(H0, H1)”.
6
Sample R code for generating 100000 copies of Λdata(H0, H1) under a specified null distribution:
m<-“number of people (including yourself) of same sex as you”
n<-“number of people of opposite sex”
lr stat<-function(data){ “formula for Λdata(H0, H1)” } # “data” in form of n×m matrix
nsim<-100000
null parameter<-“n× 3 matrix storing the chosen set of parameter values under H0”
null lr<-vector() # a vector to store 100000 random copies of Λdata(H0, H1)
for(k in 1:nsim){
null data<-matrix(0,n,m)
for(i in 1:n){
fav1<-rbinom(1,1,0.9)
fav2<-rbinom(1,1,0.75)
fav3<-rbinom(1,1,0.6)
fav0<-rbinom(m-3,1,0.3) # m− 3 independent Bernoulli (0.3) digits
null data[i,null parameter[i,]]<-c(fav1,fav2,fav3) # assign data to i’s top 3 favourites
others<-setdiff(c(1:m),null parameter[i,]) # extract labels not among i’s top three
null data[i,others]<-fav0 # assign data to people not among i’s top three
}
null lr[k]<-lr stat(null data) # calculate Λdata(H0, H1) based on k
th dataset
}
∗ Points to note ∗
• In the main text of your report, show and explain your steps, and display formulae in their
conventional mathematical form. Do not explain anything using computer code.
• Attach your computer code to your report as an appendix. Include brief comments on lines
which involve complicated operations.
********** END OF MINI-PROJECT **********
7
