ENVS3023/6034 – Advanced Quantitative Methods
Long answer question practice – some tips
Last edited 5 May 2021
Quail.xls
The quail dataset provides records of presence or absence (coded as 1 or 0 in variable Presence). You have been asked by a hunting group who like to shoot quail (a small game bird) whether topographic variation (variables coded topov1 to topov10 for fine to large scale variations) affects quail presence. Their plan is to make the ground more variable so that they have places to hide when shooting the bird, but they don’t want to do this if the quail will avoid the area. What is your advice to them?
What the question says
• …records of presence or absence
• …variables coded topov1 to topov10 for fine to large scale variations
• [want] …places to hide when shooting the bird
• What is your advice to them?
Creating a research question
• Is the presence of quail influenced by topographic variation when everything else is taken into account?
• We are just asked for advice, not a model or equation
• Interpretation is important but mainly by the researcher, not the hunter
Possible approaches?
• GLM – if assumptions can be met
• GAM – good but possibly slow
• MARS – binomial response is possible (read the earth pdf)
• CART – not suitable as single tree
• RF and SGB – possible because n = 2154, but you need to know how to specify a binary response in RF
• NN – not really interpretable unless you add PDPs (partial dependence plots)
Results from GLM model
Coefficients:
              Estimate  Std. Error  z value  Pr(>|z|)
(Intercept)  5.7954362   0.9979657    5.807  6.35e-09  ***
meanNDVI    -0.0789547   0.0096221   -8.206  2.30e-16  ***
spriNDVI     0.0564619   0.0052579   10.739   < 2e-16  ***
wintNDVI    -0.0281757   0.0070006   -4.025  5.70e-05  ***
topov1       0.2092300   0.3059940    0.684  0.494120
topov2       1.0601105   0.7800388    1.359  0.174131
topov5      -1.0383771   1.4579828   -0.712  0.476340
topov10     -2.5709852   1.2681614   -2.027  0.042628  *
alt          0.0018815   0.0001809   10.401   < 2e-16  ***
townden     -2.7884717   0.7536330   -3.700  0.000216  ***
rivden      -1.1856096   0.4275608   -2.773  0.005555  **
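A minimal sketch of how a coefficient table like this can be produced and read, assuming simulated data in place of Quail.xls. The data frame `quail_sim`, its values and the reduced predictor set are invented for illustration. It fits a binomial GLM with `glm()` and converts the log-odds coefficients to odds ratios. The last step applies the same conversion to the topov10 estimate from the table.

```r
# Simulated stand-in for the quail data (hypothetical values, not Quail.xls)
set.seed(42)
n <- 500
quail_sim <- data.frame(
  topov10 = runif(n),          # large-scale topographic variation
  alt     = runif(n, 0, 2000)  # altitude
)

# Assumed "true" relationship on the log-odds scale, for the simulation only
eta <- -1 - 2.5 * quail_sim$topov10 + 0.002 * quail_sim$alt
quail_sim$Presence <- rbinom(n, 1, plogis(eta))

# Binomial GLM (logit link is the default for family = binomial)
fit <- glm(Presence ~ topov10 + alt, family = binomial, data = quail_sim)
print(summary(fit)$coefficients)

# Coefficients are log-odds; exponentiate to get odds ratios
odds_ratios <- exp(coef(fit))
print(odds_ratios)

# The same conversion applied to the topov10 estimate in the table:
# an odds ratio of about 0.08 per unit increase in topov10
or_topov10_slide <- exp(-2.5709852)
print(or_topov10_slide)
```

An odds ratio below 1 means the odds of presence fall as the predictor increases, which is the direction that matters for the advice to the hunters.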
Running randomForest on binary response
randomForest(as.factor(Presence) ~ meanNDVI...
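The point of `as.factor()` in the call above can be sketched as follows, assuming the randomForest package is installed and using simulated data (`quail_sim` and its two predictors are invented for illustration). With a factor response the forest is grown for classification. With the same 0/1 response left numeric it is grown for regression.

```r
# Runs only if the randomForest package is available
if (requireNamespace("randomForest", quietly = TRUE)) {
  set.seed(1)
  n <- 300
  # Simulated stand-in for the quail data (hypothetical values)
  quail_sim <- data.frame(meanNDVI = rnorm(n), topov10 = runif(n))
  quail_sim$Presence <- rbinom(
    n, 1, plogis(0.8 * quail_sim$meanNDVI - 2 * quail_sim$topov10)
  )

  # Factor response: classification forest
  rf <- randomForest::randomForest(
    as.factor(Presence) ~ meanNDVI + topov10,
    data = quail_sim, ntree = 200
  )
  print(rf$type)

  # Numeric 0/1 response: regression forest (randomForest warns about this)
  rf_num <- suppressWarnings(randomForest::randomForest(
    Presence ~ meanNDVI + topov10,
    data = quail_sim, ntree = 200
  ))
  print(rf_num$type)
}
```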
To answer the question
• It’s easy to forget this!
• Go back and read it again to check
• What would your advice be to the hunters?
Body fat in men.csv
Excess body fat is bad for health but is not simple to
measure directly because the worst place to carry it
is around the internal organs, hidden from view.
Your task is to explore the data provided to see
whether it is possible to predict body fat (variable
Percent body fat) from various characteristics and
body measurements (variables Age to Wrist
circumference). The aim is to develop a simple way
(perhaps a rule of thumb?) for medical practitioners
and affected individuals to assess body fat.
What the question says
• …explore the data provided to see whether it is possible to predict body fat
• The aim is to develop a simple way for medical practitioners and affected individuals to assess body fat
Creating a research question
• Can we predict body fat from the variables provided?
• We need to do this simply enough to be able to communicate the results
• What two techniques stand out as appropriate?
Possible approaches?
• (G)LM – if assumptions can be met. The link is the identity function, so this is just linear regression. n = 252
• GAM – too complex
• MARS – possible but complex
• CART – good for visualization but low predictive power?
• RF and SGB – too complex
• NN – too complex
Run as LM
• % body fat = 0.996*abdomen + 0.473*forearm - 1.506*wrist - 0.136*weight - 34.854
• Why not round the coefficients to simplify?
• % body fat = abdomen + ½ forearm - 1½ wrist - 1½ (weight/10) - 35
• The correlation between predicted and measured body fat is r = 0.857 in both cases
• My simplified formula is much easier for practitioners to use.
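The two formulas above can be compared directly. This is a small sketch with both written as R functions and applied to one invented set of measurements. The function names and the example values are hypothetical, and the inputs are assumed to be in the same units as the data file.

```r
# Fitted linear model, coefficients as reported above
bodyfat_exact <- function(abdomen, forearm, wrist, weight) {
  0.996 * abdomen + 0.473 * forearm - 1.506 * wrist - 0.136 * weight - 34.854
}

# Rule-of-thumb version with rounded coefficients
bodyfat_rounded <- function(abdomen, forearm, wrist, weight) {
  abdomen + 0.5 * forearm - 1.5 * wrist - 1.5 * (weight / 10) - 35
}

# One hypothetical individual (values invented for illustration)
exact   <- bodyfat_exact(abdomen = 90, forearm = 29, wrist = 18, weight = 180)
rounded <- bodyfat_rounded(abdomen = 90, forearm = 29, wrist = 18, weight = 180)

print(c(exact = exact, rounded = rounded, difference = exact - rounded))
```

For these inputs the two versions differ by about 1.4 percentage points of body fat. Applying both functions to every row of the real data and correlating each with the measured values would reproduce the comparison on the slide.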
Speeding violations.csv
These are data from the US and the focus is
on the police allegedly stopping people on the
basis of race. Your task is to discover
whether there is evidence for a difference in
tendency to drive too fast (variables Speed
and Overlimit) according to race (variable
Race).
What the question says
• …discover whether there is evidence for a difference in tendency to drive too fast according to race...
• In other words, is race a significant predictor of speed (the response)?
• There are four categorical predictors plus date and time. Speed could also be a continuous predictor but it must be related to “overlimit”, so it is probably best to omit it.
Data cleaning
• A possible problem here is that missing values have been coded as **
• Search and replace does not work for this in Excel, so open the .csv directly and do it in Notepad
• Or in R use Speedy_clean <- Speeders[(Speeders$Speed!="**"),]
• Otherwise leave it and rely on listwise deletion to take care of it.
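One alternative to editing the file by hand, sketched here on a tiny invented CSV (the column values and the object names `speeders_raw`, `speeders` and `speedy_clean` are hypothetical), is to tell `read.csv()` that `**` means missing. A column containing `**` is otherwise read as text rather than numbers. With `na.strings` it stays numeric and the `**` entries become `NA`, which listwise deletion can then handle.

```r
# Tiny stand-in for the speeding data (invented values)
csv_text <- "Speed,Race
72,A
**,B
65,A
80,C
**,C"

# Default import: the ** entries force Speed to be read as text
speeders_raw <- read.csv(text = csv_text)
print(class(speeders_raw$Speed))

# Declare ** as the missing-value code: Speed is numeric with NAs
speeders <- read.csv(text = csv_text, na.strings = "**")
str(speeders)

# Explicit listwise deletion (lm() and glm() drop NA rows by default anyway)
speedy_clean <- na.omit(speeders)
print(nrow(speedy_clean))
```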
Possible approaches?
• GLM – possible using dummy coding (automatic in R)
• GAM – possible using dummy coding (automatic in R)
• MARS – also possible if no missing values
• CART – not suitable as single tree
• RF and SGB – possible if we specify categorical predictors as factors (n = 6929)
• NN – possible if predictors standardized.
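What "dummy coding (automatic in R)" means can be seen with `model.matrix()`, which shows the design matrix that `lm()` and `glm()` build from a factor. This is a small sketch with invented data. The race levels A, B, C and the speeds are placeholders, not values from the speeding file.

```r
# Invented example: a three-level factor and a numeric response
race  <- factor(c("A", "B", "C", "A", "B", "C"))
speed <- c(70, 75, 72, 68, 77, 71)

# The design matrix R builds behind the scenes: an intercept plus one
# 0/1 column per non-reference level (the first level, A, is the reference)
X <- model.matrix(~ race)
print(X)

# lm() uses the same coding, so each race coefficient is the difference
# in mean speed from the reference level
fit <- lm(speed ~ race)
print(coef(fit))
```

The intercept is the mean of the reference group, so the test of each remaining coefficient is a test of whether that group differs from the reference.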
Example prediction from lm in R
Conclusion
• The linear model and even neural nets have low predictive power for this problem
• Little evidence that race has an effect on speeding over the limit
• This might suggest that the observed disproportionate stopping of certain groups is racism.
Mammal sleep.xlsx
How much do different species of mammal
sleep and why are there differences? Explore
the data provided to determine which
variables seem to influence sleep duration
(three possible variables) and come up with
hypotheses to explain your findings.
What the question says
• …explore the data provided to determine which variables seem to influence sleep duration
• …come up with hypotheses to explain your findings
Creating a research question
• How well can we predict sleep duration from the variables provided?
• Which variables are most important as predictors?
• So we need both predictive power and interpretability to some extent (enough to create hypotheses).
Possible approaches?
• (G)LM – best if assumptions can be met
• GAM – slow and at the limit as n = 62
• MARS – possible even with n = 62
• CART – not suitable (unreliable) as single tree
• RF and SGB – too few points
• NN – too few points
• We should try a linear model because it is the most interpretable and n is small.
Issues
• The information is split across two sheets, so it is awkward and needs tidying
• Excel is good for sorting this out and getting a feel for the data
• Remember: we need to check for linearity if we are to build a simple linear model, so make some plots.
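The linearity check can be sketched as below, using simulated data rather than Mammal sleep.xlsx. The data frame `sleep_sim`, the predictor `body_mass_kg` and the assumed relationship are invented for illustration. It plots the response against a predictor on the raw and log scales, compares the two linear fits, and plots residuals against fitted values.

```r
# Simulated stand-in for the mammal data (hypothetical, n = 62 as in the question)
set.seed(7)
n <- 62
sleep_sim <- data.frame(
  body_mass_kg = exp(runif(n, log(0.01), log(5000)))  # spans several orders of magnitude
)
# Assumed relationship for the simulation: sleep is linear in log(body mass)
sleep_sim$sleep_hours <- 14 - 0.9 * log(sleep_sim$body_mass_kg) + rnorm(n, sd = 1.5)

# Scatterplots on the raw and log scales
op <- par(mfrow = c(1, 2))
plot(sleep_hours ~ body_mass_kg, data = sleep_sim, main = "Raw scale")
plot(sleep_hours ~ log(body_mass_kg), data = sleep_sim, main = "Log scale")
par(op)

# Compare straight-line fits on each scale
fit_raw <- lm(sleep_hours ~ body_mass_kg, data = sleep_sim)
fit_log <- lm(sleep_hours ~ log(body_mass_kg), data = sleep_sim)
r2_raw <- summary(fit_raw)$r.squared
r2_log <- summary(fit_log)$r.squared
print(c(raw = r2_raw, log = r2_log))

# Residuals vs fitted: look for curvature or a funnel shape
plot(fitted(fit_log), resid(fit_log), xlab = "Fitted", ylab = "Residual")
abline(h = 0, lty = 2)
```

If the scatter only looks straight after a transformation, fit the linear model on the transformed scale and say so when interpreting the coefficients.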
General tips
• Make sure you read and then answer the question!
• Answers need to include text, R code, plots etc.
• Use resampling techniques (e.g. cross-validation) wherever appropriate
• Look at model fits and residuals when appropriate
• Interpret any output as fully as you can
• Write a conclusion.
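As one illustration of the resampling tip, here is a minimal sketch of 5-fold cross-validation for a linear model in base R. The data are simulated, and the names `d`, `x1`, `x2` and `y` are placeholders. Each fold is held out in turn, the model is fitted to the rest, and the prediction error on the held-out fold is recorded.

```r
# Simulated data (hypothetical): two predictors and a continuous response
set.seed(1)
n <- 100
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- 2 + 1.5 * d$x1 - 0.5 * d$x2 + rnorm(n)

# Randomly assign each row to one of k folds
k <- 5
fold <- sample(rep(1:k, length.out = n))

# Fit on k - 1 folds, predict the held-out fold, store its RMSE
rmse <- numeric(k)
for (i in 1:k) {
  train <- d[fold != i, ]
  test  <- d[fold == i, ]
  fit   <- lm(y ~ x1 + x2, data = train)
  pred  <- predict(fit, newdata = test)
  rmse[i] <- sqrt(mean((test$y - pred)^2))
}

print(rmse)
print(mean(rmse))  # cross-validated estimate of out-of-sample error
```

The same loop works for other model types by swapping the `lm()` call, which makes the cross-validated error a common yardstick when comparing approaches.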