xuebaunion@vip.163.com

3551 Trousdale Rkwy, University Park, Los Angeles, CA

留学生论文指导和课程辅导

无忧GPA：https://www.essaygpa.com

工作时间：全年无休-早上8点到凌晨3点

微信客服：xiaoxionga100

微信客服：ITCS521

程序代写案例-STAT 2132

时间：2021-04-22

STAT 2132 take-home final exam

Due Thursday, 4/29/21 at 11:59pm on Canvas.

This exam is to be completed on your own. Since no one is watching you to make sure you

follow that direction, I am trusting you to abide by the honor code and not collaborate with your

classmates. I will be available via email and will hold my regularly scheduled office hours on

Monday and Tuesday next week.

Please be concise with your answers, and do not turn in any R code (I will not read it if you

do). All numerical answers should be rounded to three decimal places. The exam contains three

questions, and all are related to the experiment described on the next page. Good luck!!

1

Anthracyclines are a class of important chemotherapy drugs used to treat many different

types of cancers. However, anthracyclines have several non-trivial side effects, one of which is

anthracycline-induced cardiotoxicity (ACT). Some patients with ACT do not manifest symptoms,

while ACT in others can lead to congestive heart failure (CHF). The likelihood of a patient develop-

ing ACT-induced CHF is dependent on the dose of anthracycline administered during chemother-

apy, and is as low as 5% for small doses and as high as 48% for larger doses (Swain et al., 2003).

Assume your collaborator treated cardiomyocytes (heart cells) derived from m = 46 unrelated

individuals with five dosages of an anthracycline to study how the expression of various genes

in the heart changes in response to anthracycline. While the full data originally included 230

samples (five samples for each of the 46 individuals), 13 were removed after an initial QC step.

The RData file “Genes.RData”, which you can load into R using the command load(file =

"Genes.RData"), contains two objects:

• Data: A 217 × 2 data frame.

– cell line: The cell line from which the sample was derived. Each cell line comes

from a unique individual.

– conc: The concentration (in micromoles) of anthracycline administered to the car-

diomyocyte sample. A concentration of 0 indicates no anthracycline was administered.

• Y: A 12, 317 × 217 matrix containing the log-transformed expression of p = 12, 317 genes

from the n = 217 samples. The columns of Y align with the rows of Data.

Throughout this problem, let Y ∈ Rp×n be the expression data matrix, and let g = 1, . . . , p index

gene and i = 1, . . . , n index sample. For each sample, define r(i) ∈ {1, . . . ,m = 46} to be the

individual used to generate sample i. The gth row, ith column and (g, i)th element of Y will be

denoted as Yg∗ ∈ Rn, Y∗i ∈ Rp and Ygi ∈ R, respectively.

1) Consider the model for a single gene g. Assume individuals are randomly drawn from a pop-

ulation and treat dose as a factor variable with five ordered levels, where level 1 corresponds

to a dose of 0 and level 5 corresponds to a dose of 5 micromoles. Suppose

Ygi =

5∑

d=1

1

{

sample i was given dose d

}

βgd + δgr(i) + egi

= xTi βg + δgr(i) + egi, i = 1, . . . , n (1)

where βg =

(

βg1 · · · βg5

)T ∈ R5, δg1, . . . , δgm i.i.d∼ N (0, τ2g), eg1, . . . , egn i.i.d∼ N (0, σ2g) and egi is

independent of δg j for all i = 1, . . . , n and j = 1, . . . ,m. Note that βg, τ2g and σ

2

g all depend

on the gene g, but that the covariates xi are shared across genes.

(a) What is the interpretation of βgd, τ2g and σ

2

g? Is βgd identifiable in (1)?

(b) Express the intraclass correlation for each gene g in terms of the parameters in Model

(1).

2

(c) For parts (c)-(f), use only the individuals with 5 dosage measurements (there should

be 37 such individuals). Let β¯g = 15

∑5

d=1 βgd be the mean effect for gene g. What is

the generalised least squares (GLS) estimate for the mean-centered coefficients ∆g =

βg−15β¯g, ∆ˆg, in terms of the Ygi’s? Your solution SHOULD NOT depend on τ2g or σ2g.

(d) Express Var

(

∆ˆg

)

in terms of the parameters in Model (1), and show that it does not

depend on τ2g.

(e) Derive an analytic, unbiased estimator for σ2g, and prove it is unbiased. Use this to

derive an expression for V̂ar

(

∆ˆg

)

, an estimate for Var

(

∆ˆg

)

. (Hint: consider each

individual’s expression Ygr∗ = (Ygi){i:r(i)=r∗} ∈ R5 for each r∗ ∈ {1, . . . ,m}, and derive

the model for Q5Ygr∗ , where Q5 ∈ R5×5 is the orthogonal projection matrix for 15’s

orthogonal compliment.)

(f) Using only the 37 individuals with 5 dosage measurements and your solutions to (c)

and (e), compute ∆ˆg and V̂ar

(

∆ˆg

)

for g = 1, . . . , p in R. Use these to derive a P-value

Pg for the null hypothesis H0,g : βg1 = · · · = βgd for each gene g = 1, . . . , p. Using

Bonferroni to control the FWER at a level α = 0.05, how many of H0,g do you reject?

What does this tell you about the relationship between of anthracycline dosage and

gene expression?

(g) Additional analyses reveal that there is substantial correlation between the p genes. If

the P-values Pg are super-uniform when H0,g is true (i.e. P(Pg ≤ u) ≤ u under H0,g),

will the Bonferroni procedure control the FWER at the desired level α? If yes, prove

it. If not, give a counter-example.

2) Use all n = 217 observations to answer all remaining questions. Your collaborator believes

that one of the mechanisms through which anthracycline induces ACT is by inhibiting apop-

tosis (i.e. cell death). You therefore decide to investigate the expression of the gene JADE1

(g = 3884), which is known to promote apoptosis in healthy patients.

(a) Using whatever method you deem most appropriate, estimate τ2g, σ

2

g and the intraclass

correlation for g = 3884.

(b) Consider the null hypothesis H0 : βg1 = · · · = βg5, and take the alternative hypothesis

to be the negation of H0. Use a likelihood ratio test to test H0 for g = 3884. What do

you conclude?

(c) An important assumption you made in the previous part was that Yg∗ for g = 3884

was normally distributed. Does that assumption seem reasonable? Use a QQ-plot to

justify your answer. (Hint: make sure that the QQ-plot is not corrupted by the possible

correlation in the data. Try multiplying the residuals with an appropriate matrix to

mitigate possible correlation.)

(d) Using whatever method you deem most appropriate, estimate and find a 95% confi-

dence interval for βg5−βg1 for g = 3884. What can you conclude about the relationship

between anthracycline dosage and JADE1 expression? Does this support your collab-

orator’s hypothesis that anthracycline induces ACT by inhibiting apoptosis?

3

(e) Now suppose you want to determine simultaneous 95% confidence intervals for all

points in

S =

cTβg : 5∑

d=1

ci = c

T15 = 0

for g = 3884.

(i) Suppose Yg∗ ∼ N

(

Xβg,Σg

)

, where Σg is a known positive definite matrix. Show

that there exists a matrixA such thatAYg∗ ∼ N

(

(AX)βg, In

)

.

(ii) By plugging in the estimate for Σg you determined in part (a) for the true Σg, use

part (i) and Scheffe´’s procedure to obtain simultaneous 95% confidence intervals

for all points in S.

(iii) Part (ii) ignores the uncertainty in the estimate for Σg when computing the simul-

taneous confidence intervals. Would you expect a method that properly accounts

for the uncertainty in said estimate to return wider or narrower confidence intervals

than those computed in part (ii)? Explain.

(f) For interpretability purposes, your collaborator would prefer a model where E(Yg∗)

is linear in dose. Use a likelihood ratio test to test the null hypothesis that E(Ygi) =

µg + dose(i)γg for g = 3884 and i = 1, . . . , n, where dose(i) ∈ {0, 0.625, 1.25, 2.5, 5} is

the anthracycline dosage applied to sample i. What is your conclusion?

3) Assume Model (1) is correct. Model (1) implies Yg∗ can be written as

Yg∗ ∼ N

(

Xβg, τ

2

gB + σ

2

gIn

)

, g = 1, . . . , p.

(a) Derive an expression forB.

(b) Estimating τ2g and σ

2

g for each gene g = 1, . . . , p is very computationally expensive.

The rate limiting steps are computing the inverse and determinant of Var

(

Yg∗

)

∈ Rn×n,

both of which require O

(

n3

)

operations for unstructured matrices. This implies naive

algorithms to estimate τ21, σ

2

1, . . . , τ

2

p, σ

2

p, like ML and REML, will have O

(

pn3

)

run-

times. Here you will construct a more computationally efficient algorithm to estimate

τ2g, σ

2

g for all g = 1, . . . , p.

(i) Show that there exists a non-random matrixQ ∈ Rn×(n−5) with orthonormal columns

such that

Y˜g∗ = QTYg∗ ∼ N

(

0, τ2gB˜ + σ

2

gIn−5

)

, g = 1, . . . , p.

What is B˜ in terms ofQ andB?

(ii) Find a non-random unitary matrix U ∈ R(n−5)×(n−5) such that U T Y˜g∗ ∈ Rn−5 is nor-

mally distributed with independent entries for all g = 1, . . . , p. Derive expressions

for E

(

U T Y˜g∗

)

and Var

(

U T Y˜g∗

)

(the latter should be a diagonal matrix).

4

(iii) If p >> n, explain why estimating τ2g, σ

2

g for all g = 1, . . . , p using U

T Y˜g∗ is more

efficient than just using Yg∗ (or Y˜g∗). When answering this question, consider how

many operations it requires to invert and calculate the determinant of Var

(

U T Y˜g∗

)

.

5

学霸联盟

Due Thursday, 4/29/21 at 11:59pm on Canvas.

This exam is to be completed on your own. Since no one is watching you to make sure you

follow that direction, I am trusting you to abide by the honor code and not collaborate with your

classmates. I will be available via email and will hold my regularly scheduled office hours on

Monday and Tuesday next week.

Please be concise with your answers, and do not turn in any R code (I will not read it if you

do). All numerical answers should be rounded to three decimal places. The exam contains three

questions, and all are related to the experiment described on the next page. Good luck!!

1

Anthracyclines are a class of important chemotherapy drugs used to treat many different

types of cancers. However, anthracyclines have several non-trivial side effects, one of which is

anthracycline-induced cardiotoxicity (ACT). Some patients with ACT do not manifest symptoms,

while ACT in others can lead to congestive heart failure (CHF). The likelihood of a patient develop-

ing ACT-induced CHF is dependent on the dose of anthracycline administered during chemother-

apy, and is as low as 5% for small doses and as high as 48% for larger doses (Swain et al., 2003).

Assume your collaborator treated cardiomyocytes (heart cells) derived from m = 46 unrelated

individuals with five dosages of an anthracycline to study how the expression of various genes

in the heart changes in response to anthracycline. While the full data originally included 230

samples (five samples for each of the 46 individuals), 13 were removed after an initial QC step.

The RData file “Genes.RData”, which you can load into R using the command load(file =

"Genes.RData"), contains two objects:

• Data: A 217 × 2 data frame.

– cell line: The cell line from which the sample was derived. Each cell line comes

from a unique individual.

– conc: The concentration (in micromoles) of anthracycline administered to the car-

diomyocyte sample. A concentration of 0 indicates no anthracycline was administered.

• Y: A 12, 317 × 217 matrix containing the log-transformed expression of p = 12, 317 genes

from the n = 217 samples. The columns of Y align with the rows of Data.

Throughout this problem, let Y ∈ Rp×n be the expression data matrix, and let g = 1, . . . , p index

gene and i = 1, . . . , n index sample. For each sample, define r(i) ∈ {1, . . . ,m = 46} to be the

individual used to generate sample i. The gth row, ith column and (g, i)th element of Y will be

denoted as Yg∗ ∈ Rn, Y∗i ∈ Rp and Ygi ∈ R, respectively.

1) Consider the model for a single gene g. Assume individuals are randomly drawn from a pop-

ulation and treat dose as a factor variable with five ordered levels, where level 1 corresponds

to a dose of 0 and level 5 corresponds to a dose of 5 micromoles. Suppose

Ygi =

5∑

d=1

1

{

sample i was given dose d

}

βgd + δgr(i) + egi

= xTi βg + δgr(i) + egi, i = 1, . . . , n (1)

where βg =

(

βg1 · · · βg5

)T ∈ R5, δg1, . . . , δgm i.i.d∼ N (0, τ2g), eg1, . . . , egn i.i.d∼ N (0, σ2g) and egi is

independent of δg j for all i = 1, . . . , n and j = 1, . . . ,m. Note that βg, τ2g and σ

2

g all depend

on the gene g, but that the covariates xi are shared across genes.

(a) What is the interpretation of βgd, τ2g and σ

2

g? Is βgd identifiable in (1)?

(b) Express the intraclass correlation for each gene g in terms of the parameters in Model

(1).

2

(c) For parts (c)-(f), use only the individuals with 5 dosage measurements (there should

be 37 such individuals). Let β¯g = 15

∑5

d=1 βgd be the mean effect for gene g. What is

the generalised least squares (GLS) estimate for the mean-centered coefficients ∆g =

βg−15β¯g, ∆ˆg, in terms of the Ygi’s? Your solution SHOULD NOT depend on τ2g or σ2g.

(d) Express Var

(

∆ˆg

)

in terms of the parameters in Model (1), and show that it does not

depend on τ2g.

(e) Derive an analytic, unbiased estimator for σ2g, and prove it is unbiased. Use this to

derive an expression for V̂ar

(

∆ˆg

)

, an estimate for Var

(

∆ˆg

)

. (Hint: consider each

individual’s expression Ygr∗ = (Ygi){i:r(i)=r∗} ∈ R5 for each r∗ ∈ {1, . . . ,m}, and derive

the model for Q5Ygr∗ , where Q5 ∈ R5×5 is the orthogonal projection matrix for 15’s

orthogonal compliment.)

(f) Using only the 37 individuals with 5 dosage measurements and your solutions to (c)

and (e), compute ∆ˆg and V̂ar

(

∆ˆg

)

for g = 1, . . . , p in R. Use these to derive a P-value

Pg for the null hypothesis H0,g : βg1 = · · · = βgd for each gene g = 1, . . . , p. Using

Bonferroni to control the FWER at a level α = 0.05, how many of H0,g do you reject?

What does this tell you about the relationship between of anthracycline dosage and

gene expression?

(g) Additional analyses reveal that there is substantial correlation between the p genes. If

the P-values Pg are super-uniform when H0,g is true (i.e. P(Pg ≤ u) ≤ u under H0,g),

will the Bonferroni procedure control the FWER at the desired level α? If yes, prove

it. If not, give a counter-example.

2) Use all n = 217 observations to answer all remaining questions. Your collaborator believes

that one of the mechanisms through which anthracycline induces ACT is by inhibiting apop-

tosis (i.e. cell death). You therefore decide to investigate the expression of the gene JADE1

(g = 3884), which is known to promote apoptosis in healthy patients.

(a) Using whatever method you deem most appropriate, estimate τ2g, σ

2

g and the intraclass

correlation for g = 3884.

(b) Consider the null hypothesis H0 : βg1 = · · · = βg5, and take the alternative hypothesis

to be the negation of H0. Use a likelihood ratio test to test H0 for g = 3884. What do

you conclude?

(c) An important assumption you made in the previous part was that Yg∗ for g = 3884

was normally distributed. Does that assumption seem reasonable? Use a QQ-plot to

justify your answer. (Hint: make sure that the QQ-plot is not corrupted by the possible

correlation in the data. Try multiplying the residuals with an appropriate matrix to

mitigate possible correlation.)

(d) Using whatever method you deem most appropriate, estimate and find a 95% confi-

dence interval for βg5−βg1 for g = 3884. What can you conclude about the relationship

between anthracycline dosage and JADE1 expression? Does this support your collab-

orator’s hypothesis that anthracycline induces ACT by inhibiting apoptosis?

3

(e) Now suppose you want to determine simultaneous 95% confidence intervals for all

points in

S =

cTβg : 5∑

d=1

ci = c

T15 = 0

for g = 3884.

(i) Suppose Yg∗ ∼ N

(

Xβg,Σg

)

, where Σg is a known positive definite matrix. Show

that there exists a matrixA such thatAYg∗ ∼ N

(

(AX)βg, In

)

.

(ii) By plugging in the estimate for Σg you determined in part (a) for the true Σg, use

part (i) and Scheffe´’s procedure to obtain simultaneous 95% confidence intervals

for all points in S.

(iii) Part (ii) ignores the uncertainty in the estimate for Σg when computing the simul-

taneous confidence intervals. Would you expect a method that properly accounts

for the uncertainty in said estimate to return wider or narrower confidence intervals

than those computed in part (ii)? Explain.

(f) For interpretability purposes, your collaborator would prefer a model where E(Yg∗)

is linear in dose. Use a likelihood ratio test to test the null hypothesis that E(Ygi) =

µg + dose(i)γg for g = 3884 and i = 1, . . . , n, where dose(i) ∈ {0, 0.625, 1.25, 2.5, 5} is

the anthracycline dosage applied to sample i. What is your conclusion?

3) Assume Model (1) is correct. Model (1) implies Yg∗ can be written as

Yg∗ ∼ N

(

Xβg, τ

2

gB + σ

2

gIn

)

, g = 1, . . . , p.

(a) Derive an expression forB.

(b) Estimating τ2g and σ

2

g for each gene g = 1, . . . , p is very computationally expensive.

The rate limiting steps are computing the inverse and determinant of Var

(

Yg∗

)

∈ Rn×n,

both of which require O

(

n3

)

operations for unstructured matrices. This implies naive

algorithms to estimate τ21, σ

2

1, . . . , τ

2

p, σ

2

p, like ML and REML, will have O

(

pn3

)

run-

times. Here you will construct a more computationally efficient algorithm to estimate

τ2g, σ

2

g for all g = 1, . . . , p.

(i) Show that there exists a non-random matrixQ ∈ Rn×(n−5) with orthonormal columns

such that

Y˜g∗ = QTYg∗ ∼ N

(

0, τ2gB˜ + σ

2

gIn−5

)

, g = 1, . . . , p.

What is B˜ in terms ofQ andB?

(ii) Find a non-random unitary matrix U ∈ R(n−5)×(n−5) such that U T Y˜g∗ ∈ Rn−5 is nor-

mally distributed with independent entries for all g = 1, . . . , p. Derive expressions

for E

(

U T Y˜g∗

)

and Var

(

U T Y˜g∗

)

(the latter should be a diagonal matrix).

4

(iii) If p >> n, explain why estimating τ2g, σ

2

g for all g = 1, . . . , p using U

T Y˜g∗ is more

efficient than just using Yg∗ (or Y˜g∗). When answering this question, consider how

many operations it requires to invert and calculate the determinant of Var

(

U T Y˜g∗

)

.

5

学霸联盟