xuebaunion@vip.163.com

3551 Trousdale Rkwy, University Park, Los Angeles, CA

留学生论文指导和课程辅导

无忧GPA：https://www.essaygpa.com

工作时间：全年无休-早上8点到凌晨3点

扫码添加客服微信

扫描添加客服微信

程序代写案例-EXAM 2020-2021

时间：2021-04-24

TODO 2021,

Available from TODO:TODO BST,

Expected Duration: 2 hours,

Time Allowed: 4 hours,

Timed exam within 24 hours

DEGREE of MSc

—!!!— MOCK EXAM 2020-2021 —!!!—

INTRODUCTION TO DATA SCIENCE AND

SYSTEMS (M)

Answer all 3 questions

This examination paper is worth a total of 60 marks.

1. Linear algebra and probability

In the fabrication of semiconductors, an industrial manufacturer has a process that produces

silicon wafers. Sometimes, the wafers come out defective, and have to be discarded.

(a) Every wafer has a production cost c. Every correct wafer can be sold for v; defective wafers

cannot be sold and bring in no money. Assuming the probability of a defective wafer is

written P(D = 1), where D is a random variable that may take on the values 0 and 1, write

down the expression for the expected profit of manufacturing a single wafer. [4]

Solution:

(Bookwork for knowledge of the concepts, Problem solving otherwise)

Expectation is: sum over every state (product of (probability of states, value of states))

E[ f (X)] = ∑

x

P(X = x) f (x) [1 mark for correct concept of expectation]

The profit of a correct wafer is v− c. The profit of an incorrect wafer is −c. [1 mark for

correct idea of profit and loss]

E[ f (D)]= (correct probability)*(correct profit) + (defect probability)*(defect product)

E[ f (D)] = (1−P(D = 1))(v− c)+P(D = 1)(−c)

E[ f (D)] = (1−P(D = 1))(v− c)−P(D = 1)c [2 mark]

(b) The defect rate in one production line is 1:100000. The company learns of a new device that

can predict whether the wafer will be defective after only a few seconds with a reliability of

99%. This will allow the fabricator to pull the wafer and abort the production process before

a more expensive process begins. Explain how Bayes’ rule would help the fabricator decide

if this device is a worthwhile investment and give your recommendation based on these

figures. You do not have to provide an exact calculation but you should give approximate

figures. State any assumptions you make.

[7]

Solution: (Problem solving)

Writing T=0 to mean test is negative, and T=1 to mean test is positive (T=1 means that

the test indicates an error). Assuming that 99% reliability means that: [note: this is an

assumption and must be stated; other assumptions are valid answers]

P(T = 1|D = 1) = 0.99

P(T = 0|D = 0) = 0.99

1 CONTINUED OVERLEAF

then

P(T = 1|D = 0) = 0.01

P(T = 0|D = 1) = 0.01 [1 marks]

The probability that a wafer is defective given that the test is positive is:

P(D = 1|T = 1) = P(T = 1|D = 1)P(D = 1)/P(T = 1) [2 mark for Bayes’ Rule]

P(D= 1|T = 1) = 0.99∗0.00001/[P(T = 1|D= 1)P(D= 1)+P(T = 1|D= 0)P(D=

0)]

≈ 0.00001/[0.00001+0.01] [note: 0.99 has been rounded off to 1]

≈ 0.00001/0.01 [rounding 0.00001 off to 0.0]

≈ 0.001 [1 mark for any answer within an order of magnitude]

For every thousand wafers pulled when the test was positive, just one would be expected

to actually be faulty [1 mark]. Unless the cost of pulling the wafer is extremely small,

this test is nearly useless given the rarity of defects [2 mark].

(c) The production system is to be optimized to minimize the number of defects. The fabrication

process has two adjustable parameters: temperature and slice thickness (that have been

normalised so they are on a unit-less scale).

(i) Slice thickness (st) and temperature (temp) have been set every day at noon over ten

days. The raw data:

day 0 1 2 3 4 5 6 7 8 9

st 0.00 0.80 0.90 0.10 -0.80 -1.00 -0.30 0.70 1.00 0.40

temp 0.07 -0.84 -0.84 -0.02 0.88 0.97 0.14 -0.72 -1.08 -0.45

Make a sketch of a scatter plot of scale thickness (”x-axis”) vs temperature (”y-axis”)

and include an indication of the two principle components. Define what principal

components are and how they can be computed.

[5]

Solution: (Bookwork for defining the concepts, Problem solving otherwise )

- Principal components (PCs) indicate the major axes/direction of variation (other

sensible explanation also accepted) [1 for concept and explanation]

- PCs can be found by first computing the empirical covariance matrix, then

computing the eigenvectors (and values) of the covariance matrix. [1 for correct

explanation]

Correct sketch with the data points [1] and eigenvectors [2]:

2 CONTINUED OVERLEAF

(ii) You manger has formulated the optimisation problem and it turns out the optimisation

requires some vectorised code, which uses a matrix multiplication and inversion to

compute the equation:

y= A−1x

At the moment, the implementation uses the SVD. However, you have learned that A

is in fact a square diagonal matrix. Suggest a much faster approach, and write (a one

line) vectorised NumPy code that computes the value of y from A and x. The diagonal

of a matrix A is returned by np.diag(A). [4]

Solution:

The inverse of a diagonal matrix is just the reciprocal of its entries [1]. And

the product of a diagonal matrix with a vector is the elementwise product of

the diagonal with that vector. [1] Therefore a more efficient solution would be:

y = 1.0/np.diag(A) * x [2]

3 CONTINUED OVERLEAF

2. Linear algebra, visualisation and optimisation

Your data science team has been asked to analyse a subsystem for a car manufacturer. After

some experimentation it is clear that the system you are considering can be described by

the following set of coupled equations:

−14+ xα+ zγ = −yβ

2xα− yzβ +8 = −xγ+ xα

−zγ = −5− yβ

(1)

where x = 1,y = 2,z = 3 are scalar inputs to the system and the output of the system is de-

noted by c= f

(

[x,y,z]T , [α,β ,γ]T

)

= [14,−8,−5]>. b= [α,β ,γ]> is a vector containing

the parameters of the system.

(a) Convert the set of coupled equations in Eq. (1) into the matrix form Ab= c. [3]

Solution:

The question is a variation of (several) similar problems seen in class.

The general solution (and expected answer):x y zx −yz x

0 y −z

αβ

γ

=

14−8

−5

The following (plug-in) solution will be awarded partial marks (2 marks):1 2 31 −6 1

0 2 −3

αβ

γ

=

14−8

−5

(2)

Alternative solutions will be awarded partial marks proportial to their their correctness

and completeness.

(b) You are now asked to find the parameters, b, of the system using a numerical optimization

method without the availability of standard solvers and matrix inversion.

(i) Define an optimization problem that would allow you to solve a problem of the type

Ab= c with respect to b with the constraint that you cannot use matrix inversion but

have access to partial derivatives of A, b and c with respect to b. [2]

Solution:

This is a relatively easy question and follows the general form used in the material

and lectures.

L(b) = ‖Ab− c‖2

4 CONTINUED OVERLEAF

Thus, the optimizaiton problem is,

argmin

b∗

L(b)

Any sensible variations of the above model-solution will be accepted (including

other norms). Partial credit will be awarded for solutions with only minor problems.

(ii) State a form of the update equations for standard gradient descent which will allow

you to solve the optimization problem outlined in the previous question and explain

under which conditions your gradient descent optimization algorithm is guaranteed to

converge. [3]

Solution:

Largely bookwork (slight change of notation).

bk(t+1)← bk(t)+µt ∂L(b)∂bk ∀k (1 marks).

Gradient descent will approach a local minima if: the function is Lipschitz continu-

ous (1 mark) and the value of µ is adjusted appropriately w.r.t t (1 mark).

15 10 5 0 5 10

x

10

5

0

5

10

y

two datasets

Dataset 1

Dataset 2

Figure 1: A scatter plot illustrating two datasets. The two different datasets can clearly be identified

as two distinct clusters (as validated by the manager)

5 CONTINUED OVERLEAF

(c) Your manager has provided you with two datasets obtained on two different days each

containing several observation of x and y.Your manager has illustrated the observations in

Figure 1.

(i) Your manager asks you to summarise each of the datasets using a separate Normal

distribution for each dataset. Explain how you would parameterise the Normal distribu-

tions needed to model the (x,y) values from the two individual datasets, including a

description of the array shape of any parameters that the distribution would have. Give

rough number for their values. [7]

Solution:

Two normal distribution (one for each dataset) estimated from 2D grid positions

would have a 2 element (1/2 mark) mean vector (1 mark) and a (2,2) (1/2 mark)

element covariance matrix (1 mark).

D1: N( [7.5,7.5], [[2,0 ],[0,2]]) [2 mark]

D2: N( [-5,-5], [[3.5, 3], [3, 3.5]]) [2 mark]

(ii) Explain how eigendecomposition could be used on the parameters estimated in the

previous question to identify the major axis of variation. Draw a simple sketch to show:

the data points; the estimated normal distributions; the relevant eigenvectors (for each

dataset separately). [5]

Solution:

The eigendecomposition of the covariance matrices (1 mark) would reveal the

eigenvectors/principal components (1 mark). The eigenvector with largest absolute

corresponding eigenvector would correspond to the major axis/longest direction (1

mark). Diagram should have: an reproduction of the graf with some data points

(for each dataset) as dots or similar markers (1 mark); a normal distribution as an

ellipse or nested ellipses (1/2 mark per datset); the eigenvectors as a cross aligned

with the ellipse (1/2 mark per datset).

6 CONTINUED OVERLEAF

3. Database systems

(a) Consider a relation Student S(MatricNo, Name, Age) where the primary key (MatricNo) is

a string of variable length with the lengths of values uniformly distributed in the interval

[10,16] bytes, Age is a 32-bit integer, and Name is also a variable-length string with values

of length uniformly distributed in the interval [24,50] bytes. Assume that the relation

has rS = 1000 tuples, stored in a file on disk organised in 512-byte blocks. Assume that

variable-length fields have their size in bytes prepended to them in a 8-bit integer form.

Last, assume that student ages are uniformly distributed in the range [17,40].

(i) Compute the blocking factor and the number of blocks required to store this relation?

[3]

Solution: (Bookwork for knowledge of the concepts, Problem solving otherwise)

Records (tuples) are of variable length, uniformly distributed in the intervals given

above. To compute the blocking factor, we will then consider the average case for

each attribute: 13 bytes for each MatricNo on average, 37 bytes for each Name

on average. We will also need to factor in the space taken up by the size of a

variable-length field value being prepended to the field value itself.

Each record (tuple) then has: (1+13)+4+(1+37) = 56 bytes on average. Thus,

the blocking factor will be:⌊ block size

record size

⌋

=

⌊512

56

⌋

= 9 records per block.

The file will then contain:

nS =

⌈

num tuples

blocking f actor

⌉

=

⌈1000

9

⌉

= 112 blocks.

(1 mark for using the average size of variable-length fields, 1 mark for the compu-

tation of the blocking factor, 1 mark for the number of file blocks)

(ii) Consider the following SQL query:

SELECT Name FROM Student WHERE Age >= 25 and Age <= 30

Estimate the expected query processing cost in terms of number of block accesses,

when the relation file is organised as (a) a heap file, (b) a sequential file, ordered by

Age, and (c) a hash file using external hashing with Name as the hash key field, 128

buckets each containing one block, and no overflow buckets. [4]

Solution: (Bookwork for knowledge of the concepts, Problem solving otherwise)

Heap File:

We need to scan the entire file to find all the records that satisfy the range query;

hence, we need to access all 112 blocks of the file.

Sequential file:

The file is ordered on the query constraint attribute, hence we can use binary

search over the file blocks. We first need to locate the block storing the first record

7 CONTINUED OVERLEAF

with Age = 25. Doing so using binary search will result in a cost of on average

log2(112)≈ 7 block accesses. We then need to scan subsequent blocks sequentially,

until we find the block storing the last record with Age = 30. The values for the

Age field are distributed uniformly over [17,40], that is over 24 distinct values.

The range predicate constraint queries the range [25,30] or 6 values. Based on the

uniformity assumption, we expect 6/24 or 25% of all records to be returned by this

query; that is, we will need to read 25%×1000 = 250 records. Given a blocking

factor of 9 this number translates to 28 block accesses. The overall cost is then

expected to be 35 block accesses.

Hash file:

The file is hashed on Name, but the query constraint predicate is on Age. As such,

we cannot take advantage of the file’s hash-based organisation and we’ll thus need

to revert to a full scan. That is, this is equivalent to the data being stored in a heap

file, for an overall cost of 112 block accesses.

(1 mark for the answer for heap files, 2 marks for sequential files, 1 mark for hash

files.)

(b) Consider another relation Enrolments E with nE = 200 blocks and rE = 12000 records.

Assume that E has two attributes: a foreign key to Student’s MatricNo primary key (i.e.,

same type, format and size as S.MatricNo) and a foreign key to Course’s CourseNo primary

key (again, assume same type, format and size as S.MatricNo). Further assume that all

students have enrolled to at least one course, and that all courses have at least one student

enrolled. Last, assume that the memory of the database system can accommodate nB = 22

blocks for processing and the blocking factor for the join-results block is b f rRS = 5 records

per block. Consider the following equi-join query:

SELECT * FROM E, S WHERE E.MatricNo = S.MatricNo

(i) Assume that the query is processed using the nested-loop join algorithm. Estimate

the total expected cost (in number of block accesses) for the various strategies and

conclude which strategy is the most efficient. Show your work. [6]

Solution: (Bookwork for knowledge of the concepts/equations, Problem solving

otherwise)

Let NDV (a,A) be the number of distinct values of attribute a in relation A.

The join selectivity js of the above join query is:

js = 1/max(NDV (MatricNo,Student),NDV (MatricNo,Enrolments)),

with NDV (MatricNo,Student)= 1000 and NDV (MatricNo,Enrolments)= 1000.

Hence:

js = 1/max(1000,1000) = 1/1000 = 0.001.

8 CONTINUED OVERLEAF

If the outer loop in the nested-loop join scans over relation S, then the cost is:

nS+nE ×

⌈

nS

nB−2

⌉

+ js×rE×rSb f rRS = 112+200×

⌈ 112

22−2

⌉

+ 0.001×1000×120005 .

If the outer loop in the nested-loop join scans over relation E, then the cost is:

nE +nS×

⌈

nE

nB−2

⌉

+ js×rE×rSb f rRS = 200+112

⌈×20020 ⌉+ 0.001×1000×120005 .

The best strategy here is to scan relation S at the outer loop.

(2 marks for the computation of the join selectivity, 2 marks for each subsequent

cost estimation.)

(ii) Assume that there is a 6-level Clustering Index over the relation Enrolments for the key

MatricNo; i.e., level xE = 6. Also assume that there is no index over the relation Student

but that said relation is stored in a sequential file with MatricNo as the ordering field.

Propose two nested-loop-based join strategies using either the index or the physical

organisation of the files and explain which one is the best. [7]

Solution:

[Strategy 1]

Retrieve each tuple e from relation E and use binary search over the sequential file

to find the matching tuple s in relation S.

The number of blocks required per search is: log2(nS) = 7 block accesses.

Cost:

nE + rE × (log2(nS))+ js×rE×rSb f rRS = 200+12000×7+

0.001×1000×12000

5 = 86,600

block accesses.

[Strategy 2]

Retrieve each tuple s from relation S and use the Clustering index on E.MatricNo

to find the matching tuple e in relation E.

b f rE =

⌊ block size

record size

⌋

=

⌊

512

(1+13)+(1+13)

⌋

= 18

Cost: nS+ rS× (xE + dsE/b f rEe)+ js×rE×rSb f rRS =

nS+ rS× (xE + d(drE/NDV (MatricNo)e)/b f rE)e+ js×rE×rSb f rRS =

112+1000× (6+1)+ 0.001×1000×120005 = 9,512 block accesses.

Explanation: The cost to write out the result set is the same for both strategies, as

they produce the exact same result set, hence we can focus on the part that produces

this result set. In general we expect the strategy that uses the smallest of the two

tables in the outer loop to be the winner. What muddles the waters in this case

is that one relation has a 6-level index while the other needs to resort to binary

searching over its file. Querying S for equality via binary searching takes 7 block

accesses, whereas querying the primary index of E also takes 7 block accesses. As

9 CONTINUED OVERLEAF

such, individual equality queries over the two access paths are of equal cost, hence

using the smallest of the two relations (S) at the outer loop is expected to be the

winner, and so is in this case.

(2 marks for Strategy 1, 2 marks for Strategy 2, 3 marks for the explanation.)

10 END OF QUESTION PAPER

学霸联盟

Available from TODO:TODO BST,

Expected Duration: 2 hours,

Time Allowed: 4 hours,

Timed exam within 24 hours

DEGREE of MSc

—!!!— MOCK EXAM 2020-2021 —!!!—

INTRODUCTION TO DATA SCIENCE AND

SYSTEMS (M)

Answer all 3 questions

This examination paper is worth a total of 60 marks.

1. Linear algebra and probability

In the fabrication of semiconductors, an industrial manufacturer has a process that produces

silicon wafers. Sometimes, the wafers come out defective, and have to be discarded.

(a) Every wafer has a production cost c. Every correct wafer can be sold for v; defective wafers

cannot be sold and bring in no money. Assuming the probability of a defective wafer is

written P(D = 1), where D is a random variable that may take on the values 0 and 1, write

down the expression for the expected profit of manufacturing a single wafer. [4]

Solution:

(Bookwork for knowledge of the concepts, Problem solving otherwise)

Expectation is: sum over every state (product of (probability of states, value of states))

E[ f (X)] = ∑

x

P(X = x) f (x) [1 mark for correct concept of expectation]

The profit of a correct wafer is v− c. The profit of an incorrect wafer is −c. [1 mark for

correct idea of profit and loss]

E[ f (D)]= (correct probability)*(correct profit) + (defect probability)*(defect product)

E[ f (D)] = (1−P(D = 1))(v− c)+P(D = 1)(−c)

E[ f (D)] = (1−P(D = 1))(v− c)−P(D = 1)c [2 mark]

(b) The defect rate in one production line is 1:100000. The company learns of a new device that

can predict whether the wafer will be defective after only a few seconds with a reliability of

99%. This will allow the fabricator to pull the wafer and abort the production process before

a more expensive process begins. Explain how Bayes’ rule would help the fabricator decide

if this device is a worthwhile investment and give your recommendation based on these

figures. You do not have to provide an exact calculation but you should give approximate

figures. State any assumptions you make.

[7]

Solution: (Problem solving)

Writing T=0 to mean test is negative, and T=1 to mean test is positive (T=1 means that

the test indicates an error). Assuming that 99% reliability means that: [note: this is an

assumption and must be stated; other assumptions are valid answers]

P(T = 1|D = 1) = 0.99

P(T = 0|D = 0) = 0.99

1 CONTINUED OVERLEAF

then

P(T = 1|D = 0) = 0.01

P(T = 0|D = 1) = 0.01 [1 marks]

The probability that a wafer is defective given that the test is positive is:

P(D = 1|T = 1) = P(T = 1|D = 1)P(D = 1)/P(T = 1) [2 mark for Bayes’ Rule]

P(D= 1|T = 1) = 0.99∗0.00001/[P(T = 1|D= 1)P(D= 1)+P(T = 1|D= 0)P(D=

0)]

≈ 0.00001/[0.00001+0.01] [note: 0.99 has been rounded off to 1]

≈ 0.00001/0.01 [rounding 0.00001 off to 0.0]

≈ 0.001 [1 mark for any answer within an order of magnitude]

For every thousand wafers pulled when the test was positive, just one would be expected

to actually be faulty [1 mark]. Unless the cost of pulling the wafer is extremely small,

this test is nearly useless given the rarity of defects [2 mark].

(c) The production system is to be optimized to minimize the number of defects. The fabrication

process has two adjustable parameters: temperature and slice thickness (that have been

normalised so they are on a unit-less scale).

(i) Slice thickness (st) and temperature (temp) have been set every day at noon over ten

days. The raw data:

day 0 1 2 3 4 5 6 7 8 9

st 0.00 0.80 0.90 0.10 -0.80 -1.00 -0.30 0.70 1.00 0.40

temp 0.07 -0.84 -0.84 -0.02 0.88 0.97 0.14 -0.72 -1.08 -0.45

Make a sketch of a scatter plot of scale thickness (”x-axis”) vs temperature (”y-axis”)

and include an indication of the two principle components. Define what principal

components are and how they can be computed.

[5]

Solution: (Bookwork for defining the concepts, Problem solving otherwise )

- Principal components (PCs) indicate the major axes/direction of variation (other

sensible explanation also accepted) [1 for concept and explanation]

- PCs can be found by first computing the empirical covariance matrix, then

computing the eigenvectors (and values) of the covariance matrix. [1 for correct

explanation]

Correct sketch with the data points [1] and eigenvectors [2]:

2 CONTINUED OVERLEAF

(ii) You manger has formulated the optimisation problem and it turns out the optimisation

requires some vectorised code, which uses a matrix multiplication and inversion to

compute the equation:

y= A−1x

At the moment, the implementation uses the SVD. However, you have learned that A

is in fact a square diagonal matrix. Suggest a much faster approach, and write (a one

line) vectorised NumPy code that computes the value of y from A and x. The diagonal

of a matrix A is returned by np.diag(A). [4]

Solution:

The inverse of a diagonal matrix is just the reciprocal of its entries [1]. And

the product of a diagonal matrix with a vector is the elementwise product of

the diagonal with that vector. [1] Therefore a more efficient solution would be:

y = 1.0/np.diag(A) * x [2]

3 CONTINUED OVERLEAF

2. Linear algebra, visualisation and optimisation

Your data science team has been asked to analyse a subsystem for a car manufacturer. After

some experimentation it is clear that the system you are considering can be described by

the following set of coupled equations:

−14+ xα+ zγ = −yβ

2xα− yzβ +8 = −xγ+ xα

−zγ = −5− yβ

(1)

where x = 1,y = 2,z = 3 are scalar inputs to the system and the output of the system is de-

noted by c= f

(

[x,y,z]T , [α,β ,γ]T

)

= [14,−8,−5]>. b= [α,β ,γ]> is a vector containing

the parameters of the system.

(a) Convert the set of coupled equations in Eq. (1) into the matrix form Ab= c. [3]

Solution:

The question is a variation of (several) similar problems seen in class.

The general solution (and expected answer):x y zx −yz x

0 y −z

αβ

γ

=

14−8

−5

The following (plug-in) solution will be awarded partial marks (2 marks):1 2 31 −6 1

0 2 −3

αβ

γ

=

14−8

−5

(2)

Alternative solutions will be awarded partial marks proportial to their their correctness

and completeness.

(b) You are now asked to find the parameters, b, of the system using a numerical optimization

method without the availability of standard solvers and matrix inversion.

(i) Define an optimization problem that would allow you to solve a problem of the type

Ab= c with respect to b with the constraint that you cannot use matrix inversion but

have access to partial derivatives of A, b and c with respect to b. [2]

Solution:

This is a relatively easy question and follows the general form used in the material

and lectures.

L(b) = ‖Ab− c‖2

4 CONTINUED OVERLEAF

Thus, the optimizaiton problem is,

argmin

b∗

L(b)

Any sensible variations of the above model-solution will be accepted (including

other norms). Partial credit will be awarded for solutions with only minor problems.

(ii) State a form of the update equations for standard gradient descent which will allow

you to solve the optimization problem outlined in the previous question and explain

under which conditions your gradient descent optimization algorithm is guaranteed to

converge. [3]

Solution:

Largely bookwork (slight change of notation).

bk(t+1)← bk(t)+µt ∂L(b)∂bk ∀k (1 marks).

Gradient descent will approach a local minima if: the function is Lipschitz continu-

ous (1 mark) and the value of µ is adjusted appropriately w.r.t t (1 mark).

15 10 5 0 5 10

x

10

5

0

5

10

y

two datasets

Dataset 1

Dataset 2

Figure 1: A scatter plot illustrating two datasets. The two different datasets can clearly be identified

as two distinct clusters (as validated by the manager)

5 CONTINUED OVERLEAF

(c) Your manager has provided you with two datasets obtained on two different days each

containing several observation of x and y.Your manager has illustrated the observations in

Figure 1.

(i) Your manager asks you to summarise each of the datasets using a separate Normal

distribution for each dataset. Explain how you would parameterise the Normal distribu-

tions needed to model the (x,y) values from the two individual datasets, including a

description of the array shape of any parameters that the distribution would have. Give

rough number for their values. [7]

Solution:

Two normal distribution (one for each dataset) estimated from 2D grid positions

would have a 2 element (1/2 mark) mean vector (1 mark) and a (2,2) (1/2 mark)

element covariance matrix (1 mark).

D1: N( [7.5,7.5], [[2,0 ],[0,2]]) [2 mark]

D2: N( [-5,-5], [[3.5, 3], [3, 3.5]]) [2 mark]

(ii) Explain how eigendecomposition could be used on the parameters estimated in the

previous question to identify the major axis of variation. Draw a simple sketch to show:

the data points; the estimated normal distributions; the relevant eigenvectors (for each

dataset separately). [5]

Solution:

The eigendecomposition of the covariance matrices (1 mark) would reveal the

eigenvectors/principal components (1 mark). The eigenvector with largest absolute

corresponding eigenvector would correspond to the major axis/longest direction (1

mark). Diagram should have: an reproduction of the graf with some data points

(for each dataset) as dots or similar markers (1 mark); a normal distribution as an

ellipse or nested ellipses (1/2 mark per datset); the eigenvectors as a cross aligned

with the ellipse (1/2 mark per datset).

6 CONTINUED OVERLEAF

3. Database systems

(a) Consider a relation Student S(MatricNo, Name, Age) where the primary key (MatricNo) is

a string of variable length with the lengths of values uniformly distributed in the interval

[10,16] bytes, Age is a 32-bit integer, and Name is also a variable-length string with values

of length uniformly distributed in the interval [24,50] bytes. Assume that the relation

has rS = 1000 tuples, stored in a file on disk organised in 512-byte blocks. Assume that

variable-length fields have their size in bytes prepended to them in a 8-bit integer form.

Last, assume that student ages are uniformly distributed in the range [17,40].

(i) Compute the blocking factor and the number of blocks required to store this relation?

[3]

Solution: (Bookwork for knowledge of the concepts, Problem solving otherwise)

Records (tuples) are of variable length, uniformly distributed in the intervals given

above. To compute the blocking factor, we will then consider the average case for

each attribute: 13 bytes for each MatricNo on average, 37 bytes for each Name

on average. We will also need to factor in the space taken up by the size of a

variable-length field value being prepended to the field value itself.

Each record (tuple) then has: (1+13)+4+(1+37) = 56 bytes on average. Thus,

the blocking factor will be:⌊ block size

record size

⌋

=

⌊512

56

⌋

= 9 records per block.

The file will then contain:

nS =

⌈

num tuples

blocking f actor

⌉

=

⌈1000

9

⌉

= 112 blocks.

(1 mark for using the average size of variable-length fields, 1 mark for the compu-

tation of the blocking factor, 1 mark for the number of file blocks)

(ii) Consider the following SQL query:

SELECT Name FROM Student WHERE Age >= 25 and Age <= 30

Estimate the expected query processing cost in terms of number of block accesses,

when the relation file is organised as (a) a heap file, (b) a sequential file, ordered by

Age, and (c) a hash file using external hashing with Name as the hash key field, 128

buckets each containing one block, and no overflow buckets. [4]

Solution: (Bookwork for knowledge of the concepts, Problem solving otherwise)

Heap File:

We need to scan the entire file to find all the records that satisfy the range query;

hence, we need to access all 112 blocks of the file.

Sequential file:

The file is ordered on the query constraint attribute, hence we can use binary

search over the file blocks. We first need to locate the block storing the first record

7 CONTINUED OVERLEAF

with Age = 25. Doing so using binary search will result in a cost of on average

log2(112)≈ 7 block accesses. We then need to scan subsequent blocks sequentially,

until we find the block storing the last record with Age = 30. The values for the

Age field are distributed uniformly over [17,40], that is over 24 distinct values.

The range predicate constraint queries the range [25,30] or 6 values. Based on the

uniformity assumption, we expect 6/24 or 25% of all records to be returned by this

query; that is, we will need to read 25%×1000 = 250 records. Given a blocking

factor of 9 this number translates to 28 block accesses. The overall cost is then

expected to be 35 block accesses.

Hash file:

The file is hashed on Name, but the query constraint predicate is on Age. As such,

we cannot take advantage of the file’s hash-based organisation and we’ll thus need

to revert to a full scan. That is, this is equivalent to the data being stored in a heap

file, for an overall cost of 112 block accesses.

(1 mark for the answer for heap files, 2 marks for sequential files, 1 mark for hash

files.)

(b) Consider another relation Enrolments E with nE = 200 blocks and rE = 12000 records.

Assume that E has two attributes: a foreign key to Student’s MatricNo primary key (i.e.,

same type, format and size as S.MatricNo) and a foreign key to Course’s CourseNo primary

key (again, assume same type, format and size as S.MatricNo). Further assume that all

students have enrolled to at least one course, and that all courses have at least one student

enrolled. Last, assume that the memory of the database system can accommodate nB = 22

blocks for processing and the blocking factor for the join-results block is b f rRS = 5 records

per block. Consider the following equi-join query:

SELECT * FROM E, S WHERE E.MatricNo = S.MatricNo

(i) Assume that the query is processed using the nested-loop join algorithm. Estimate

the total expected cost (in number of block accesses) for the various strategies and

conclude which strategy is the most efficient. Show your work. [6]

Solution: (Bookwork for knowledge of the concepts/equations, Problem solving

otherwise)

Let NDV (a,A) be the number of distinct values of attribute a in relation A.

The join selectivity js of the above join query is:

js = 1/max(NDV (MatricNo,Student),NDV (MatricNo,Enrolments)),

with NDV (MatricNo,Student)= 1000 and NDV (MatricNo,Enrolments)= 1000.

Hence:

js = 1/max(1000,1000) = 1/1000 = 0.001.

8 CONTINUED OVERLEAF

If the outer loop in the nested-loop join scans over relation S, then the cost is:

nS+nE ×

⌈

nS

nB−2

⌉

+ js×rE×rSb f rRS = 112+200×

⌈ 112

22−2

⌉

+ 0.001×1000×120005 .

If the outer loop in the nested-loop join scans over relation E, then the cost is:

nE +nS×

⌈

nE

nB−2

⌉

+ js×rE×rSb f rRS = 200+112

⌈×20020 ⌉+ 0.001×1000×120005 .

The best strategy here is to scan relation S at the outer loop.

(2 marks for the computation of the join selectivity, 2 marks for each subsequent

cost estimation.)

(ii) Assume that there is a 6-level Clustering Index over the relation Enrolments for the key

MatricNo; i.e., level xE = 6. Also assume that there is no index over the relation Student

but that said relation is stored in a sequential file with MatricNo as the ordering field.

Propose two nested-loop-based join strategies using either the index or the physical

organisation of the files and explain which one is the best. [7]

Solution:

[Strategy 1]

Retrieve each tuple e from relation E and use binary search over the sequential file

to find the matching tuple s in relation S.

The number of blocks required per search is: log2(nS) = 7 block accesses.

Cost:

nE + rE × (log2(nS))+ js×rE×rSb f rRS = 200+12000×7+

0.001×1000×12000

5 = 86,600

block accesses.

[Strategy 2]

Retrieve each tuple s from relation S and use the Clustering index on E.MatricNo

to find the matching tuple e in relation E.

b f rE =

⌊ block size

record size

⌋

=

⌊

512

(1+13)+(1+13)

⌋

= 18

Cost: nS+ rS× (xE + dsE/b f rEe)+ js×rE×rSb f rRS =

nS+ rS× (xE + d(drE/NDV (MatricNo)e)/b f rE)e+ js×rE×rSb f rRS =

112+1000× (6+1)+ 0.001×1000×120005 = 9,512 block accesses.

Explanation: The cost to write out the result set is the same for both strategies, as

they produce the exact same result set, hence we can focus on the part that produces

this result set. In general we expect the strategy that uses the smallest of the two

tables in the outer loop to be the winner. What muddles the waters in this case

is that one relation has a 6-level index while the other needs to resort to binary

searching over its file. Querying S for equality via binary searching takes 7 block

accesses, whereas querying the primary index of E also takes 7 block accesses. As

9 CONTINUED OVERLEAF

such, individual equality queries over the two access paths are of equal cost, hence

using the smallest of the two relations (S) at the outer loop is expected to be the

winner, and so is in this case.

(2 marks for Strategy 1, 2 marks for Strategy 2, 3 marks for the explanation.)

10 END OF QUESTION PAPER

学霸联盟