TODO 2021,
Available from TODO:TODO BST,
Expected Duration: 2 hours,
Time Allowed: 4 hours,
Timed exam within 24 hours
DEGREE of MSc
—!!!— MOCK EXAM 2020-2021 —!!!—
INTRODUCTION TO DATA SCIENCE AND
SYSTEMS (M)
Answer all 3 questions
This examination paper is worth a total of 60 marks.
1. Linear algebra and probability
In the fabrication of semiconductors, an industrial manufacturer has a process that produces
silicon wafers. Sometimes, the wafers come out defective, and have to be discarded.
(a) Every wafer has a production cost c. Every correct wafer can be sold for v; defective wafers
cannot be sold and bring in no money. Assuming the probability of a defective wafer is
written P(D = 1), where D is a random variable that may take on the values 0 and 1, write
down the expression for the expected profit of manufacturing a single wafer. [4]
Solution:
(Bookwork for knowledge of the concepts, Problem solving otherwise)
Expectation is: sum over every state (product of (probability of states, value of states))
E[ f (X)] = ∑
x
P(X = x) f (x) [1 mark for correct concept of expectation]
The profit of a correct wafer is v− c. The profit of an incorrect wafer is −c. [1 mark for
correct idea of profit and loss]
E[ f (D)]= (correct probability)*(correct profit) + (defect probability)*(defect product)
E[ f (D)] = (1−P(D = 1))(v− c)+P(D = 1)(−c)
E[ f (D)] = (1−P(D = 1))(v− c)−P(D = 1)c [2 mark]
(b) The defect rate in one production line is 1:100000. The company learns of a new device that
can predict whether the wafer will be defective after only a few seconds with a reliability of
99%. This will allow the fabricator to pull the wafer and abort the production process before
a more expensive process begins. Explain how Bayes’ rule would help the fabricator decide
if this device is a worthwhile investment and give your recommendation based on these
figures. You do not have to provide an exact calculation but you should give approximate
figures. State any assumptions you make.
[7]
Solution: (Problem solving)
Writing T=0 to mean test is negative, and T=1 to mean test is positive (T=1 means that
the test indicates an error). Assuming that 99% reliability means that: [note: this is an
assumption and must be stated; other assumptions are valid answers]
P(T = 1|D = 1) = 0.99
P(T = 0|D = 0) = 0.99
1 CONTINUED OVERLEAF
then
P(T = 1|D = 0) = 0.01
P(T = 0|D = 1) = 0.01 [1 marks]
The probability that a wafer is defective given that the test is positive is:
P(D = 1|T = 1) = P(T = 1|D = 1)P(D = 1)/P(T = 1) [2 mark for Bayes’ Rule]
P(D= 1|T = 1) = 0.99∗0.00001/[P(T = 1|D= 1)P(D= 1)+P(T = 1|D= 0)P(D=
0)]
≈ 0.00001/[0.00001+0.01] [note: 0.99 has been rounded off to 1]
≈ 0.00001/0.01 [rounding 0.00001 off to 0.0]
≈ 0.001 [1 mark for any answer within an order of magnitude]
For every thousand wafers pulled when the test was positive, just one would be expected
to actually be faulty [1 mark]. Unless the cost of pulling the wafer is extremely small,
this test is nearly useless given the rarity of defects [2 mark].
(c) The production system is to be optimized to minimize the number of defects. The fabrication
process has two adjustable parameters: temperature and slice thickness (that have been
normalised so they are on a unit-less scale).
(i) Slice thickness (st) and temperature (temp) have been set every day at noon over ten
days. The raw data:
day 0 1 2 3 4 5 6 7 8 9
st 0.00 0.80 0.90 0.10 -0.80 -1.00 -0.30 0.70 1.00 0.40
temp 0.07 -0.84 -0.84 -0.02 0.88 0.97 0.14 -0.72 -1.08 -0.45
Make a sketch of a scatter plot of scale thickness (”x-axis”) vs temperature (”y-axis”)
and include an indication of the two principle components. Define what principal
components are and how they can be computed.
[5]
Solution: (Bookwork for defining the concepts, Problem solving otherwise )
- Principal components (PCs) indicate the major axes/direction of variation (other
sensible explanation also accepted) [1 for concept and explanation]
- PCs can be found by first computing the empirical covariance matrix, then
computing the eigenvectors (and values) of the covariance matrix. [1 for correct
explanation]
Correct sketch with the data points [1] and eigenvectors [2]:
2 CONTINUED OVERLEAF
(ii) You manger has formulated the optimisation problem and it turns out the optimisation
requires some vectorised code, which uses a matrix multiplication and inversion to
compute the equation:
y= A−1x
At the moment, the implementation uses the SVD. However, you have learned that A
is in fact a square diagonal matrix. Suggest a much faster approach, and write (a one
line) vectorised NumPy code that computes the value of y from A and x. The diagonal
of a matrix A is returned by np.diag(A). [4]
Solution:
The inverse of a diagonal matrix is just the reciprocal of its entries [1]. And
the product of a diagonal matrix with a vector is the elementwise product of
the diagonal with that vector. [1] Therefore a more efficient solution would be:
y = 1.0/np.diag(A) * x [2]
3 CONTINUED OVERLEAF
2. Linear algebra, visualisation and optimisation
Your data science team has been asked to analyse a subsystem for a car manufacturer. After
some experimentation it is clear that the system you are considering can be described by
the following set of coupled equations:
−14+ xα+ zγ = −yβ
2xα− yzβ +8 = −xγ+ xα
−zγ = −5− yβ
(1)
where x = 1,y = 2,z = 3 are scalar inputs to the system and the output of the system is de-
noted by c= f
(
[x,y,z]T , [α,β ,γ]T
)
= [14,−8,−5]>. b= [α,β ,γ]> is a vector containing
the parameters of the system.
(a) Convert the set of coupled equations in Eq. (1) into the matrix form Ab= c. [3]
Solution:
The question is a variation of (several) similar problems seen in class.
The general solution (and expected answer):x y zx −yz x
0 y −z
αβ
γ
=
14−8
−5
The following (plug-in) solution will be awarded partial marks (2 marks):1 2 31 −6 1
0 2 −3
αβ
γ
=
14−8
−5
(2)
Alternative solutions will be awarded partial marks proportial to their their correctness
and completeness.
(b) You are now asked to find the parameters, b, of the system using a numerical optimization
method without the availability of standard solvers and matrix inversion.
(i) Define an optimization problem that would allow you to solve a problem of the type
Ab= c with respect to b with the constraint that you cannot use matrix inversion but
have access to partial derivatives of A, b and c with respect to b. [2]
Solution:
This is a relatively easy question and follows the general form used in the material
and lectures.
L(b) = ‖Ab− c‖2
4 CONTINUED OVERLEAF
Thus, the optimizaiton problem is,
argmin
b∗
L(b)
Any sensible variations of the above model-solution will be accepted (including
other norms). Partial credit will be awarded for solutions with only minor problems.
(ii) State a form of the update equations for standard gradient descent which will allow
you to solve the optimization problem outlined in the previous question and explain
under which conditions your gradient descent optimization algorithm is guaranteed to
converge. [3]
Solution:
Largely bookwork (slight change of notation).
bk(t+1)← bk(t)+µt ∂L(b)∂bk ∀k (1 marks).
Gradient descent will approach a local minima if: the function is Lipschitz continu-
ous (1 mark) and the value of µ is adjusted appropriately w.r.t t (1 mark).
15 10 5 0 5 10
x
10
5
0
5
10
y
two datasets
Dataset 1
Dataset 2
Figure 1: A scatter plot illustrating two datasets. The two different datasets can clearly be identified
as two distinct clusters (as validated by the manager)
5 CONTINUED OVERLEAF
(c) Your manager has provided you with two datasets obtained on two different days each
containing several observation of x and y.Your manager has illustrated the observations in
Figure 1.
(i) Your manager asks you to summarise each of the datasets using a separate Normal
distribution for each dataset. Explain how you would parameterise the Normal distribu-
tions needed to model the (x,y) values from the two individual datasets, including a
description of the array shape of any parameters that the distribution would have. Give
rough number for their values. [7]
Solution:
Two normal distribution (one for each dataset) estimated from 2D grid positions
would have a 2 element (1/2 mark) mean vector (1 mark) and a (2,2) (1/2 mark)
element covariance matrix (1 mark).
D1: N( [7.5,7.5], [[2,0 ],[0,2]]) [2 mark]
D2: N( [-5,-5], [[3.5, 3], [3, 3.5]]) [2 mark]
(ii) Explain how eigendecomposition could be used on the parameters estimated in the
previous question to identify the major axis of variation. Draw a simple sketch to show:
the data points; the estimated normal distributions; the relevant eigenvectors (for each
dataset separately). [5]
Solution:
The eigendecomposition of the covariance matrices (1 mark) would reveal the
eigenvectors/principal components (1 mark). The eigenvector with largest absolute
corresponding eigenvector would correspond to the major axis/longest direction (1
mark). Diagram should have: an reproduction of the graf with some data points
(for each dataset) as dots or similar markers (1 mark); a normal distribution as an
ellipse or nested ellipses (1/2 mark per datset); the eigenvectors as a cross aligned
with the ellipse (1/2 mark per datset).
6 CONTINUED OVERLEAF
3. Database systems
(a) Consider a relation Student S(MatricNo, Name, Age) where the primary key (MatricNo) is
a string of variable length with the lengths of values uniformly distributed in the interval
[10,16] bytes, Age is a 32-bit integer, and Name is also a variable-length string with values
of length uniformly distributed in the interval [24,50] bytes. Assume that the relation
has rS = 1000 tuples, stored in a file on disk organised in 512-byte blocks. Assume that
variable-length fields have their size in bytes prepended to them in a 8-bit integer form.
Last, assume that student ages are uniformly distributed in the range [17,40].
(i) Compute the blocking factor and the number of blocks required to store this relation?
[3]
Solution: (Bookwork for knowledge of the concepts, Problem solving otherwise)
Records (tuples) are of variable length, uniformly distributed in the intervals given
above. To compute the blocking factor, we will then consider the average case for
each attribute: 13 bytes for each MatricNo on average, 37 bytes for each Name
on average. We will also need to factor in the space taken up by the size of a
variable-length field value being prepended to the field value itself.
Each record (tuple) then has: (1+13)+4+(1+37) = 56 bytes on average. Thus,
the blocking factor will be:⌊ block size
record size
⌋
=
⌊512
56
⌋
= 9 records per block.
The file will then contain:
nS =
⌈
num tuples
blocking f actor
⌉
=
⌈1000
9
⌉
= 112 blocks.
(1 mark for using the average size of variable-length fields, 1 mark for the compu-
tation of the blocking factor, 1 mark for the number of file blocks)
(ii) Consider the following SQL query:
SELECT Name FROM Student WHERE Age >= 25 and Age <= 30
Estimate the expected query processing cost in terms of number of block accesses,
when the relation file is organised as (a) a heap file, (b) a sequential file, ordered by
Age, and (c) a hash file using external hashing with Name as the hash key field, 128
buckets each containing one block, and no overflow buckets. [4]
Solution: (Bookwork for knowledge of the concepts, Problem solving otherwise)
Heap File:
We need to scan the entire file to find all the records that satisfy the range query;
hence, we need to access all 112 blocks of the file.
Sequential file:
The file is ordered on the query constraint attribute, hence we can use binary
search over the file blocks. We first need to locate the block storing the first record
7 CONTINUED OVERLEAF
with Age = 25. Doing so using binary search will result in a cost of on average
log2(112)≈ 7 block accesses. We then need to scan subsequent blocks sequentially,
until we find the block storing the last record with Age = 30. The values for the
Age field are distributed uniformly over [17,40], that is over 24 distinct values.
The range predicate constraint queries the range [25,30] or 6 values. Based on the
uniformity assumption, we expect 6/24 or 25% of all records to be returned by this
query; that is, we will need to read 25%×1000 = 250 records. Given a blocking
factor of 9 this number translates to 28 block accesses. The overall cost is then
expected to be 35 block accesses.
Hash file:
The file is hashed on Name, but the query constraint predicate is on Age. As such,
we cannot take advantage of the file’s hash-based organisation and we’ll thus need
to revert to a full scan. That is, this is equivalent to the data being stored in a heap
file, for an overall cost of 112 block accesses.
(1 mark for the answer for heap files, 2 marks for sequential files, 1 mark for hash
files.)
(b) Consider another relation Enrolments E with nE = 200 blocks and rE = 12000 records.
Assume that E has two attributes: a foreign key to Student’s MatricNo primary key (i.e.,
same type, format and size as S.MatricNo) and a foreign key to Course’s CourseNo primary
key (again, assume same type, format and size as S.MatricNo). Further assume that all
students have enrolled to at least one course, and that all courses have at least one student
enrolled. Last, assume that the memory of the database system can accommodate nB = 22
blocks for processing and the blocking factor for the join-results block is b f rRS = 5 records
per block. Consider the following equi-join query:
SELECT * FROM E, S WHERE E.MatricNo = S.MatricNo
(i) Assume that the query is processed using the nested-loop join algorithm. Estimate
the total expected cost (in number of block accesses) for the various strategies and
conclude which strategy is the most efficient. Show your work. [6]
Solution: (Bookwork for knowledge of the concepts/equations, Problem solving
otherwise)
Let NDV (a,A) be the number of distinct values of attribute a in relation A.
The join selectivity js of the above join query is:
js = 1/max(NDV (MatricNo,Student),NDV (MatricNo,Enrolments)),
with NDV (MatricNo,Student)= 1000 and NDV (MatricNo,Enrolments)= 1000.
Hence:
js = 1/max(1000,1000) = 1/1000 = 0.001.
8 CONTINUED OVERLEAF
If the outer loop in the nested-loop join scans over relation S, then the cost is:
nS+nE ×
⌈
nS
nB−2
⌉
+ js×rE×rSb f rRS = 112+200×
⌈ 112
22−2
⌉
+ 0.001×1000×120005 .
If the outer loop in the nested-loop join scans over relation E, then the cost is:
nE +nS×
⌈
nE
nB−2
⌉
+ js×rE×rSb f rRS = 200+112
⌈×20020 ⌉+ 0.001×1000×120005 .
The best strategy here is to scan relation S at the outer loop.
(2 marks for the computation of the join selectivity, 2 marks for each subsequent
cost estimation.)
(ii) Assume that there is a 6-level Clustering Index over the relation Enrolments for the key
MatricNo; i.e., level xE = 6. Also assume that there is no index over the relation Student
but that said relation is stored in a sequential file with MatricNo as the ordering field.
Propose two nested-loop-based join strategies using either the index or the physical
organisation of the files and explain which one is the best. [7]
Solution:
[Strategy 1]
Retrieve each tuple e from relation E and use binary search over the sequential file
to find the matching tuple s in relation S.
The number of blocks required per search is: log2(nS) = 7 block accesses.
Cost:
nE + rE × (log2(nS))+ js×rE×rSb f rRS = 200+12000×7+
0.001×1000×12000
5 = 86,600
block accesses.
[Strategy 2]
Retrieve each tuple s from relation S and use the Clustering index on E.MatricNo
to find the matching tuple e in relation E.
b f rE =
⌊ block size
record size
⌋
=
⌊
512
(1+13)+(1+13)
⌋
= 18
Cost: nS+ rS× (xE + dsE/b f rEe)+ js×rE×rSb f rRS =
nS+ rS× (xE + d(drE/NDV (MatricNo)e)/b f rE)e+ js×rE×rSb f rRS =
112+1000× (6+1)+ 0.001×1000×120005 = 9,512 block accesses.
Explanation: The cost to write out the result set is the same for both strategies, as
they produce the exact same result set, hence we can focus on the part that produces
this result set. In general we expect the strategy that uses the smallest of the two
tables in the outer loop to be the winner. What muddles the waters in this case
is that one relation has a 6-level index while the other needs to resort to binary
searching over its file. Querying S for equality via binary searching takes 7 block
accesses, whereas querying the primary index of E also takes 7 block accesses. As
9 CONTINUED OVERLEAF
such, individual equality queries over the two access paths are of equal cost, hence
using the smallest of the two relations (S) at the outer loop is expected to be the
winner, and so is in this case.
(2 marks for Strategy 1, 2 marks for Strategy 2, 3 marks for the explanation.)
10 END OF QUESTION PAPER
学霸联盟