Date: 2023-05-08

BENV0139 Statistics for Heritage Science
Academic Year 2021-22
All unseen. Open book exam. Time: two hours. The results are submitted as a single PDF, which can include
photographs or scans of written answers, as well as calculations and text typed on a computer. Total points:
20.
Q1. Pest Traps
Pests are routinely monitored in museums using sticky traps. We have a dataset of pest monitoring in historic
houses in England that includes the following information: traps is the number of traps inside each house.
catches is the average number of pests caught per month. NT indicates if the house is managed by the
National Trust (1) or not (0). This table shows the head (the first few lines) of the data set:
## traps catches NT
## 1 24 41.79120 0
## 2 31 44.54088 0
## 3 10 32.46029 0
## 4 26 41.55216 0
## 5 26 47.01443 1
## 6 42 40.09175 0
a. What type of variable is each data column in the table above? [1]
traps is discrete (a count),
catches is continuous,
NT is categorical. "Boolean" would also be accepted.
b. The mean of catches is 40.093, with a variance of 32.097. How many decimal places should be used to
precisely report catches? Justify your answer including any assumptions made, calculations undertaken,
and logical reasoning. [2]
First, we need to obtain the standard deviation from the variance: $\sigma = \sqrt{\sigma^2} = \sqrt{32.097} = 5.665$.
A good rule of thumb is to quote the standard deviation to one significant figure and the mean to the same
number of decimal places as the standard deviation. In this case, you would report (40 ± 6). However, there is
some flexibility, and (40.1 ± 5.7) would also be acceptable.
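The rounding rule above can be sketched in a few lines of Python. This is only an illustration: the helper names `decimals_for_sig_figs` and `report` are made up for this example and are not part of the course material.

```python
import math

mean = 40.093      # mean of catches, as given in the question
variance = 32.097  # variance of catches, as given in the question

sd = math.sqrt(variance)  # standard deviation, about 5.665


def decimals_for_sig_figs(value, sig_figs):
    """Decimal places needed to show `value` with `sig_figs` significant figures."""
    exponent = math.floor(math.log10(abs(value)))
    return max(sig_figs - 1 - exponent, 0)


def report(mean, sd, sig_figs=1):
    """Format mean and sd with the number of decimals dictated by the sd."""
    places = decimals_for_sig_figs(sd, sig_figs)
    return f"{mean:.{places}f} ± {sd:.{places}f}"


one_sf = report(mean, sd, sig_figs=1)
two_sf = report(mean, sd, sig_figs=2)
print(one_sf)  # 40 ± 6
print(two_sf)  # 40.1 ± 5.7
```

With one significant figure on the standard deviation no decimal places are needed; allowing two significant figures gives one decimal place.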
c. Calculate the likelihood of catching 44.10 insects in a month. Justify your answer including any
assumptions made, calculations undertaken, and logical reasoning. [2]
Strictly, the probability of catching any exact number of insects is zero, because catches is treated as a
continuous variable and the area under a probability distribution at a single point is zero. A better-posed
question would ask for the probability of catching x insects or more. Assuming that catches is normally
distributed, and knowing the mean µ and the standard deviation σ, we want the probability of a value equal to
or higher than x = 44.102. To that end, we calculate $z = \frac{x - \mu}{\sigma} = 0.708$. Because we are
interested in a difference in a single direction (greater than or equal to x), this is a one-tailed test. We then
look up z in a table for the z-statistic, such as the one provided in the course notes. Or, if you use R, you can
use "1-pnorm(z)". You will find that the probability of catching x insects or more is 0.240.
d. You want to find out if the number of catches is significantly different in properties managed by the
National Trust as compared to those that are not. Which statistical test would you use, and why? [1]
Student's t-test is well suited to this problem, because it compares the means of two samples. In
this case, it would be necessary to split the dataset in two using the categorical variable NT.
Q2. RH before and after
A museum has recently installed humidity control in a storage room. We have measurements of the relative
humidity (RH, in percent) before (freq.before) and after (freq.after) the installation. This data has been
summarised in a frequency table:
## RH freq.before freq.after
## 1 1 to 10 0 0
## 2 11 to 20 0 0
## 3 21 to 30 0 0
## 4 31 to 40 1 0
## 5 41 to 50 5 3
## 6 51 to 60 9 26
## 7 61 to 70 10 1
## 8 71 to 80 5 0
## 9 81 to 90 0 0
## 10 91 to 100 0 0
a. Find the mode and the range of the relative humidity before and after heating was implemented. [1]
The mode is the most frequent value, which corresponds to the row with the highest frequency in the before
or after column. This value can be reported as the bin (e.g. 51 to 60) or as the centerpoint of this bin
(e.g. 55.5). To report the range, you need to indicate the minimum and the maximum values, also using the
centerpoints of the bins.
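The procedure can be sketched in Python using the frequency table from the question. The function name `mode_and_range` is hypothetical, and the range is taken between the centerpoints of the lowest and highest occupied bins, as the answer suggests.

```python
# Bins 1 to 10, 11 to 20, ..., 91 to 100, as in the frequency table
bins = [(10 * i + 1, 10 * i + 10) for i in range(10)]
freq_before = [0, 0, 0, 1, 5, 9, 10, 5, 0, 0]
freq_after = [0, 0, 0, 0, 3, 26, 1, 0, 0, 0]


def centre(bin_limits):
    """Centerpoint of a bin given as (lower, upper)."""
    lower, upper = bin_limits
    return (lower + upper) / 2


def mode_and_range(freqs):
    """Return the modal bin and the (min, max) centerpoints of occupied bins."""
    modal_bin = bins[freqs.index(max(freqs))]
    occupied = [b for b, n in zip(bins, freqs) if n > 0]
    return modal_bin, (centre(occupied[0]), centre(occupied[-1]))


mode_before, range_before = mode_and_range(freq_before)
mode_after, range_after = mode_and_range(freq_after)
print(mode_before, range_before)  # (61, 70) (35.5, 75.5)
print(mode_after, range_after)    # (51, 60) (45.5, 65.5)
```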
b. Estimate the mean and standard deviation of the distribution of relative humidity in the museum before
and after the installation. [2]
The best way to do this is to calculate the centerpoint of each bin, $x_i$, as well as the total number of
measurements, $N = \sum n_i$, where $n_i$ is the frequency in bin $i$. Then, the average will be
$\mu = \frac{\sum x_i n_i}{N}$. To calculate the standard deviation, it is useful to first calculate
$(x_i - \mu)^2$ for each row, and then calculate $\sigma = \sqrt{\frac{1}{N-1} \sum n_i (x_i - \mu)^2}$,
weighting each row by its frequency. Because you are using the centerpoints, the results will not be as precise
as if you were using the raw data, which you don't have. But they should be close to (58.9 ± 9.9) before heating
and (53.4 ± 3.8) after heating.
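A minimal Python sketch of the grouped-data calculation follows, with each squared deviation weighted by its bin frequency. The function name `grouped_mean_sd` is made up for this example. On the binned data it gives about 59.8 ± 10.7 before and 54.8 ± 3.7 after, close to (but not identical with) the values quoted above.

```python
import math

# Centerpoints of the bins 1 to 10, 11 to 20, ..., 91 to 100
centres = [10 * i + 5.5 for i in range(10)]
freq_before = [0, 0, 0, 1, 5, 9, 10, 5, 0, 0]
freq_after = [0, 0, 0, 0, 3, 26, 1, 0, 0, 0]


def grouped_mean_sd(centres, freqs):
    """Mean and sample standard deviation estimated from binned data."""
    n = sum(freqs)
    mean = sum(x * f for x, f in zip(centres, freqs)) / n
    # Each squared deviation counts once per measurement in the bin
    variance = sum(f * (x - mean) ** 2 for x, f in zip(centres, freqs)) / (n - 1)
    return mean, math.sqrt(variance), n


mean_before, sd_before, n_before = grouped_mean_sd(centres, freq_before)
mean_after, sd_after, n_after = grouped_mean_sd(centres, freq_after)
print(round(mean_before, 1), round(sd_before, 1), n_before)
print(round(mean_after, 1), round(sd_after, 1), n_after)
```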
c. Determine whether the installation of environmental control has significantly reduced the humidity.
Justify your answer including any assumptions made, calculations undertaken, and logical reasoning. [3]
This is a direct application of a t-test. It is based on the assumption that the data is normally distributed. It
is used when we want to know if two distributions are different. The Null Hypothesis is that they are the
same, so we test to determine whether that is likely to be true or not. The first step is to calculate the t-value
$t = \frac{\mu_1 - \mu_2}{S \sqrt{\frac{1}{N_1} + \frac{1}{N_2}}}$, where
$S = \sqrt{\frac{(N_1 - 1)\sigma_1^2 + (N_2 - 1)\sigma_2^2}{N_1 + N_2 - 2}}$.
With these equations you will obtain a t-value of t = 2.8. You will also need the degrees of freedom, which are
$N_1 + N_2 - 2 = 58$. In this case, you are testing a difference between two means and therefore you need
to use a two-tailed distribution. You can use any table of t-values or any statistical software to find the
probability that your t-value corresponds to. You will find a value of p ≈ 0.0072. You need to
compare this with a threshold of significance; p < 0.05 is commonly used. Therefore, you can say that the
difference is significant at the p < 0.05 level. It is likely that heating has reduced the relative humidity.
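The calculation can be sketched in Python with the standard library only. The inputs are assumptions for illustration: the rounded summary values quoted in part b, with N1 = N2 = 30 taken from the column totals of the frequency table. The two-tailed p-value is obtained by numerically integrating the Student t density with Simpson's rule instead of using a table, so it will be near the quoted value but not identical to it.

```python
import math


def pooled_t(mean1, sd1, n1, mean2, sd2, n2):
    """Two-sample t statistic with pooled standard deviation, plus degrees of freedom."""
    df = n1 + n2 - 2
    pooled_sd = math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / df)
    t = (mean1 - mean2) / (pooled_sd * math.sqrt(1 / n1 + 1 / n2))
    return t, df


def t_pdf(x, df):
    """Probability density of Student's t distribution."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def two_tailed_p(t, df, steps=10_000):
    """Two-tailed p-value via Simpson's rule on [0, |t|]; steps must be even."""
    upper = abs(t)
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_area = total * h / 3  # area between 0 and |t|
    return 1 - 2 * central_area


# Rounded summary values quoted in part b (assumed inputs)
t_value, df = pooled_t(58.9, 9.9, 30, 53.4, 3.8, 30)
p_value = two_tailed_p(t_value, df)
print(round(t_value, 2), df, p_value < 0.05)
```

With these inputs the t-value is about 2.84 on 58 degrees of freedom, and the p-value falls below the 0.05 threshold.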
Q3. Cracks in Paintings
The scatterplot below presents pairwise experimental relationships between several variables (organised in
a data.frame, df ) relevant to crack formation. We want to predict the number of cracks that will appear
as a function of three variables: 1) the age of the painting (age, in years), 2) the tangential tension within
the stretcher (tension, in mN), and 3) the thickness of the paint layer (thickness, in cm).
[Figure: scatterplot matrix showing the pairwise relationships between cracks, age, tension and thickness.]
We conduct a multiple linear regression on df and obtain the following output:
##
## Call:
## lm(formula = cracks ~ age + tension + thickness, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -66.696 -16.550 -0.324 21.563 54.395
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 106.2323 9.5799 11.089 1.37e-14 ***
## age 1.3495 0.1710 7.892 4.27e-10 ***
## tension -0.6231 4.4948 -0.139 0.890349
## thickness -1228.1336 300.7319 -4.084 0.000175 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 27.61 on 46 degrees of freedom
## Multiple R-squared: 0.744, Adjusted R-squared: 0.7273
## F-statistic: 44.56 on 3 and 46 DF, p-value: 1.172e-13
a. Write the equation that can be used to predict the number of cracks that will form. [2]
The equation is:
cracks = 106.2 + 1.3 × age - 0.6 × tension - 1228.1 × thickness
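As an illustration, the fitted equation can be written as a small Python function. The coefficients are copied from the Estimate column of the regression output; the function name `predict_cracks` and the input values in the example are hypothetical.

```python
# Coefficients from the Estimate column of the lm() output
INTERCEPT = 106.2323
COEF_AGE = 1.3495
COEF_TENSION = -0.6231
COEF_THICKNESS = -1228.1336


def predict_cracks(age, tension, thickness):
    """Predicted number of cracks from the fitted linear model."""
    return (INTERCEPT
            + COEF_AGE * age
            + COEF_TENSION * tension
            + COEF_THICKNESS * thickness)


# Hypothetical painting: 50 years old, tension 0, paint layer 0.02 thick
example = predict_cracks(age=50, tension=0.0, thickness=0.02)
print(round(example, 1))  # 149.1
```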
b. Classify each of the variables as independent or dependent. [1]
The variable "cracks" is dependent on all the other ones, which are independent.
c. Which variables are positively and/or negatively associated with the dependent variable? Justify your
answer. [1]
There are two ways to answer this question. One is to look at the plot and observe the direction of the
correlations. The other is to look at the signs of the estimated coefficients, classifying those with a negative
sign as negatively associated and those with a positive sign as positively associated. Here, age has a positive
coefficient, while tension and thickness have negative ones.
d. How good is this correlation? Justify your answer. [2]
This question can be answered by evaluating the R-squared value. A very good correlation would have an
R-squared close to 1. In this case, the R-squared is 0.744, lower than 0.8, and therefore it is not excellent.
You can also comment on the linearity of the relationships shown in the plot: it is well known that non-linear
data can sometimes lead to models with large R-squared values, which look misleadingly good. In this case, most
of the data shows linear relationships, except the thickness, which may be logarithmic.
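The summary figures in the regression output are linked, which can be checked with a short Python sketch. It uses only values printed in the output (R-squared, residual degrees of freedom, number of predictors) and the standard formulas for adjusted R-squared and the F-statistic. Because the printed R-squared is rounded, the results agree with the output only to the displayed precision.

```python
r_squared = 0.744   # Multiple R-squared from the output
residual_df = 46    # residual degrees of freedom from the output
n_predictors = 3    # age, tension, thickness

# Observations = residual df + predictors + intercept
n_obs = residual_df + n_predictors + 1

# Adjusted R-squared penalises the number of predictors
adjusted_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / residual_df

# F-statistic: explained variance per predictor over residual variance per df
f_statistic = (r_squared / n_predictors) / ((1 - r_squared) / residual_df)

print(n_obs, round(adjusted_r_squared, 4), round(f_statistic, 2))
```

The adjusted R-squared (0.7273 in the output) is slightly lower than the multiple R-squared because it accounts for the three predictors in a sample of 50 observations.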
e. Suggest a way to improve the model, assuming that you are not able to collect any more data. Justify
your answer including any assumptions made, and logical reasoning. Calculations are not required. [2]
This open-ended question has many possible answers. You only need to propose solutions; there is no need
to perform any calculations. You could propose creating a new model after eliminating variables that are
poorly correlated, such as the tension (p = 0.89 in the regression output). You could propose a method to
identify which variables should be part of the model, for example, running several regressions and observing
the changes in R-squared. You could also propose modifying some of the variables. Note, for example, that the
relationship between thickness and cracks shown in the plot is non-linear. One possibility would be to create
another model where thickness is replaced by log(thickness).
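The effect of such a transformation can be illustrated with a small Python sketch. The data here are synthetic and deliberately constructed to be exactly logarithmic; they are not the exam data, and the helper `pearson_r` is written out only to keep the example self-contained. Note that a log transform requires strictly positive thickness values.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Synthetic, hypothetical data: cracks decrease with log(thickness)
thickness = [0.005, 0.01, 0.02, 0.03, 0.04, 0.06, 0.08]
cracks = [20 - 30 * math.log(t) for t in thickness]

# For a single predictor, R-squared equals the squared correlation
r2_linear = pearson_r(thickness, cracks) ** 2
r2_log = pearson_r([math.log(t) for t in thickness], cracks) ** 2

print(r2_log > r2_linear)  # True
```

On real data the improvement would be smaller and would need to be judged alongside the other predictors, for example by comparing the adjusted R-squared of the two models.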