
STAT 503 – Statistical Methods for Biology
Tutorial 5: Probability distributions and simulation

Tutorial 5 introduces R functions for probability distributions and random simulation (sometimes called Monte Carlo simulation). These functions include the probability mass or density functions for different distributions, as well as the corresponding cumulative distribution functions, quantile functions, and functions that simulate random draws from a random variable. Working through this tutorial, you will learn:

1) The nomenclature of probability distribution functions in R
2) How to obtain values from the cumulative distribution function and mass function of a probability distribution, given values for its parameters
3) How to find the quantiles of a probability distribution
4) How to simulate data from a variety of distributions
5) About the sample() function, which allows you to randomly sample from a dataset
6) How to use the replicate() function to repeat a simulation experiment
7) How to use sample() and replicate() together for bootstrap resampling
8) (OPTIONAL) How to set up a simulation to help plan data collection
9) (OPTIONAL) How to simulate the mechanism underlying the normal distribution

You will also get more practice using mutate() and generating plots, and will use a simulation experiment to explore the origins of the normal distribution. Please attempt all of the exercises and ask for help if you encounter any problems or confusion.

1. Getting started

In this tutorial, we will use the dplyr and ggplot2 packages. Remember that when you attach the dplyr package, it may mask a few items from the base and stats packages. This is okay. There is no data to load for this tutorial.

2. Probability distributions in R

R contains probability and simulation functions for many different distributions, and all of the distribution functions follow a standard naming convention: a one-letter prefix that tells you what the function does, followed by a root name that identifies the distribution. For instance, the root name for the normal distribution in R is norm, and for the binomial distribution it is binom. R has four different functions for each distribution. These are introduced below using the binomial distribution, but the syntax is similar for other distributions as well.

2.1 An introduction using the binomial distribution

For each distribution, R has four different functions, which are listed for the binomial distribution below:

• pbinom() – the cumulative distribution function. The prefix "p" is for "probability." It is used for the cdf rather than the mass function because the cdf can be used to find probabilities for any distribution, whether it is discrete or continuous.
• dbinom() – the "density" function. For discrete distributions, the "d" function returns the value of the probability mass function, which is conceptually analogous to the density function of a continuous distribution.
• qbinom() – the "quantile" function. This function returns the value of the variable that corresponds to a given quantile of the distribution.
• rbinom() – the "random" simulation function. The random simulator draws random numbers from the distribution. We will use it quite a bit later in this tutorial.
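Putting the prefix and root together gives the full function name. As a quick preview of all four binomial functions (a minimal sketch; the first three values are exact, while the rbinom() draws are random and will differ on your machine):

> dbinom(2, size = 6, prob = 0.3)     # mass function: P(X = 2)
[1] 0.324135
> pbinom(2, size = 6, prob = 0.3)     # cdf: P(X <= 2)
[1] 0.74431
> qbinom(0.75, size = 6, prob = 0.3)  # smallest x with P(X <= x) >= 0.75
[1] 3
> rbinom(4, size = 6, prob = 0.3)     # four random draws
[1] 2 1 3 0

Each of these functions is explored in detail below.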
To see how each of these functions works, we'll set up an example using the binomial distribution with sample size n = 6 and probability of success p = 0.3. First, we'll calculate the probability mass function and cumulative distribution function manually. Then we'll see how to get the same results using the R functions listed above.

2.1.1 Probability functions – Start by loading dplyr. It is not required for the distribution functions, but we will use it to organize our work. Once it is loaded, create a data frame with one column, named x, containing the integers from 0 to 6 (see code below). Next, add a column for the probability mass function. We'll call this column px_manual. You can add it to bin_example using the mutate() function:

library(dplyr)

n <- 6
p <- 0.3
bin_example <- data.frame(x = 0:n)
bin_example <- mutate(
  bin_example,
  px_manual = choose(n, x) * p^x * (1 - p)^(n - x)
)

(if you need a refresher on mutate(), see Tutorial 2, Section 3.5). In the code above, I defined objects for n and p, and set them equal to 6 and 0.3, respectively. I could have put 6 and 0.3 directly into the code wherever I needed them, but defining them as objects means that if I want to change the values, I only need to edit the code in one place. This minimizes my opportunity to make mistakes. For example, I cannot accidentally change n in the equation for the mass function but forget to change it in the original definition of the variable x.

As a check that we've done everything correctly, we can take the sum of p(x). It should be 1 (and it is, more or less – if you actually test the equality, R will return FALSE; it is off by about 3 × 10^-16 due to the limitations of computer precision):

> bin_example
  x px_manual
1 0  0.117649
2 1  0.302526
3 2  0.324135
4 3  0.185220
5 4  0.059535
6 5  0.010206
7 6  0.000729
> sum(bin_example$px_manual)
[1] 1

Next, we will use the cumsum() function to calculate the cumulative distribution function (cdf). Recall that the cdf of a random variable X is defined as F(x) = P(X <= x) and describes the probability that the random variable will take a value less than or equal to x. The name cumsum stands for cumulative sum; the function adds up the values of a vector in order of appearance, returning a running total:

bin_example <- mutate(
  bin_example,
  Fx_manual = cumsum(px_manual)
)

Running this and looking at the result, we have:

> bin_example
  x px_manual Fx_manual
1 0  0.117649  0.117649
2 1  0.302526  0.420175
3 2  0.324135  0.744310
4 3  0.185220  0.929530
5 4  0.059535  0.989065
6 5  0.010206  0.999271
7 6  0.000729  1.000000

Notice that in the first row, px_manual and Fx_manual are equal, but in the second row, Fx_manual is the sum of the first two values of px_manual, the third value of Fx_manual is the sum of the first three px_manual values, and so on. Now we have values for both p(x) and F(x) that we've calculated by hand. Let's compare these to the equivalent functions from R. We'll add the values from both R functions in one mutate() call:

bin_example <- mutate(
  bin_example,
  px_R = dbinom(x = x, size = n, prob = p),
  Fx_R = pbinom(q = x, size = n, prob = p)
)

The arguments in these functions have different names (x for a generic variable value in dbinom(), versus q for a quantile in pbinom()), but they represent the same thing: the value of the variable at which you want to evaluate the function.
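Because these argument names and positions are standardized, the names can also be dropped in favor of positional matching; a small sketch (the two calls are equivalent):

> pbinom(q = 3, size = 6, prob = 0.3)  # fully named
[1] 0.92953
> pbinom(3, 6, 0.3)                    # positional
[1] 0.92953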
In fact, this is always true: the first argument of any probability function in R whose name starts with the prefix d is x, and the first argument of any function starting with the prefix p is q. After the first argument come one or more arguments that specify the parameters of the particular distribution. These differ from distribution to distribution; for the binomial, the parameters are the sample size (n) and the probability of success (p).

Looking at the results for the binomial distribution, we see that the values perfectly match our manual calculations:

> bin_example
  x px_manual Fx_manual     px_R     Fx_R
1 0  0.117649  0.117649 0.117649 0.117649
2 1  0.302526  0.420175 0.302526 0.420175
3 2  0.324135  0.744310 0.324135 0.744310
4 3  0.185220  0.929530 0.185220 0.929530
5 4  0.059535  0.989065 0.059535 0.989065
6 5  0.010206  0.999271 0.010206 0.999271
7 6  0.000729  1.000000 0.000729 1.000000

So, if we get the same answer, why do we need dbinom() or pbinom()? First, they're convenient. If you are working with large datasets, they may also run more quickly and manage your computer's memory better than manual calculations do. Finally, although we can easily perform the manual calculations for the binomial, we cannot do so for most other distributions.

2.1.2 Bad arguments – You may be wondering what happens if you supply an impossible value for x, n, or p. Try different inputs and find out. For example,

> dbinom(x = -2, size = 6, prob = 0.3)
[1] 0
> dbinom(x = 2, size = -1, prob = 0.3)
[1] NaN
Warning message:
In dbinom(x = 2, size = -1, prob = 0.3) : NaNs produced
> dbinom(x = 2, size = 6.3, prob = 0.3)
[1] NaN
Warning message:
In dbinom(x = 2, size = 6.3, prob = 0.3) : NaNs produced
> dbinom(x = 2, size = 6, prob = 2)
[1] NaN
Warning message:
In dbinom(x = 2, size = 6, prob = 2) : NaNs produced

In the first call, I gave R a value of x that falls outside the domain of the distribution, and it correctly returned a probability of 0. In the others, I gave it nonsense parameter values, and it returned NaN, which stands for "Not a Number" and is treated by R like a missing value.

2.1.3 The lower.tail and log arguments – Remember that the p functions return the cumulative distribution function of a variable, F(x). All cdf functions in R have a lower.tail argument with a default value of TRUE. Changing it to lower.tail = FALSE returns 1 - F(x). For example,

> pbinom(q = 3, size = 6, prob = 0.3)
[1] 0.92953
> pbinom(q = 3, size = 6, prob = 0.3, lower.tail = FALSE)
[1] 0.07047

In addition, the density (d) functions have a log argument, and the probability (p) and quantile (q) functions have a corresponding log.p argument; both default to FALSE. If set to TRUE, the natural log of the probability is returned (or expected as input, in the case of the quantile functions). This is especially useful if you need to program a maximum likelihood estimation.

> pbinom(q = 3, size = 6, prob = 0.3, log.p = TRUE)
[1] -0.0730762
> exp(-0.0730762)
[1] 0.92953

2.1.4 Quantile functions – Recall that to find any quantile of a random variable, you can set F(x) = p and then solve for x, where 0 <= p <= 1. This implies that the quantile function is the inverse of the cumulative distribution function. When we call pbinom(), our first argument is a value of x, and we get the probability p = P(X <= x) back. For qbinom() and all quantile functions in R, the direction is reversed: the first argument is the probability p that we want the quantile for, and we get back the smallest value of x for which P(X <= x) >= p.
For example,

# p returned from pbinom() above
> qbinom(p = 0.92953, size = 6, prob = 0.3)
[1] 3

The relationship between the quantiles and the cumulative distribution function is illustrated in Figure 1. Note that, because the binomial is a discrete random variable, we may get the same result for several different values of p,

> qbinom(p = seq(0.86, 0.95, 0.01), size = 6, prob = 0.3)
[1] 3 3 3 3 3 3 3 4 4 4

and the level of resolution will increase if n is bigger,

> qbinom(p = seq(0.86, 0.95, 0.01), size = 600, prob = 0.3)
[1] 192 193 193 194 194 195 196 197 198 199

Figure 1: The curves show cumulative distribution functions for two populations: X1 ~ Bin(100, 0.2) (black) and X2 ~ Bin(100, 0.7) (red). The horizontal dotted line corresponds to a cumulative probability of 0.9. Ninety percent (90%) of the values for the first population fall at or below 25, which is the value of x where the black curve crosses the horizontal line. Similarly, 90% of the second population falls at or below 76. These values can be obtained by running qbinom(0.9, size = 100, prob = 0.2) for the first population, or qbinom(0.9, size = 100, prob = 0.7) for the second. pbinom(25, size = 100, prob = 0.20) reverses the process, returning the value of the y-axis where the black curve crosses the black, vertical, dashed line (the actual value returned is 0.9125). Code for this plot may be found in the Appendix.

2.1.5 Random number generation and simulation – The function rbinom() returns a vector of random draws from X ~ Bin(n, p). We can use simulations like this in many ways. For instance, after we have fit a model, we can use simulations to help us visualize its implications or make predictions. We can also use simulations to help determine whether a particular model is appropriate for a given dataset, and to help plan studies. Finally, we can use simulation to help understand statistical concepts. The first argument of every random generation function is named n and specifies the number of draws that you want to generate. For example,

> rbinom(n = 5, size = 6, prob = 0.3)
[1] 3 0 2 3 1

gives me a vector of 5 numbers drawn from the binomial distribution with sample size n = 6 and probability of success p = 0.3. Since these are random numbers, you will probably not get the same vector that I got.

2.2 The normal distribution

The probability, density, quantile, and random generation functions for other distributions work very much like pbinom(), dbinom(), qbinom(), and rbinom(). The only difference, other than the distribution name, is that each distribution has its own parameters. For the normal distribution, these are the mean (mean) and standard deviation (sd). Thus,

> dnorm(3, mean = 1.2, sd = 1)
[1] 0.07895016

evaluates the probability density function, f(x), of the normal distribution with a mean of 1.2 and a standard deviation of 1 at x = 3. The corresponding cumulative distribution function,

> pnorm(3, mean = 1.2, sd = 1)
[1] 0.9640697

returns the probability that a value drawn at random from X ~ N(1.2, 1) would be less than or equal to 3 (in other words, it returns the value of F(x)). The values returned by the three functions are summarized in Figure 2.

Figure 2: Illustration of the normal density, probability, and quantile functions, dnorm(), pnorm(), and qnorm(), for x ~ N(mu = 1.2, sigma = 1). In the figure, pnorm(3, 1.2, 1) is the area of the shaded region, dnorm(3, 1.2, 1) is the height of the curve at x = 3, and qnorm(0.9641, 1.2, 1) is the upper bound of the shaded region. Code for the figure is in the Appendix.
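You can confirm the relationships illustrated in Figure 2 directly in the console; a quick sketch (qnorm() inverts pnorm(), so feeding the cdf value back in recovers x = 3):

> pnorm(3, mean = 1.2, sd = 1)
[1] 0.9640697
> qnorm(0.9640697, mean = 1.2, sd = 1)
[1] 3
> dnorm(3, mean = 1.2, sd = 1)
[1] 0.07895016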
The quantile function works as it did for the binomial distribution. Applying it, we can see that the empirical rule (95% of outcomes within 2 standard deviations of the mean and 68% within 1 standard deviation) is only a rough approximation:

> qnorm(c(0.025, 0.16, 0.5, 0.84, 0.975), mean = 0, sd = 1)
[1] -1.9599640 -0.9944579  0.0000000  0.9944579  1.9599640

We can simulate values from the normal distribution using rnorm(). For example,

> x <- rnorm(100000, mean = 1.2, sd = 1)
> qplot(x, col = I('black'), fill = I('cornflowerblue'))

creates the histogram in Figure 3. All of the normal distribution functions in R use the default parameter values mean = 0 and sd = 1. Therefore, rnorm(1000) draws 1000 random values from the standard normal distribution.

Figure 3: Histogram of 100,000 random draws from the normal distribution with a mean of 1.2 and a standard deviation of 1.

2.3 Other distributions

R provides functions for a large number of other distributions. Those that we may encounter in this class include the Poisson distribution, which is used to model the number of events that occur in some interval or unit (e.g., the number of tornados in a year, or the number of trees in a hectare of land); the chi-square distribution, which describes the distribution of sample sums-of-squares (the numerator of the sample variance) for draws from a standard normal distribution; Student's t distribution; and the F distribution. Brief descriptions of several of these distributions appear in Table 1. Although they are here primarily for future reference, I encourage you to spend some time playing with the simulation functions to see how each one behaves under different parameterizations (e.g., generate 10,000 random draws, and then draw a histogram, as I did for the normal distribution above; see the sketch after Table 1).

Table 1: Some important statistical distributions and their R representations.

Uniform (R name: unif)
  Parameters: min – the minimum possible value; max – the maximum possible value
  Application: helpful for randomizing sample selection and treatments, and for simulation

Poisson (R name: pois)
  Parameters: lambda – the expected number of events in a given interval
  Application: a discrete distribution that models count data

Student's t (R name: t)
  Parameters: df – degrees of freedom (n - 1, inherited from the sample variance)
  Application: the distribution of the sample mean given a normal population and unknown variance

Chi-square (R name: chisq)
  Parameters: df – degrees of freedom
  Application: models the distribution of a sum-of-squares; appears in several statistical tests

F (R name: f)
  Parameters: df1 – numerator degrees of freedom; df2 – denominator degrees of freedom
  Application: the ratio of two chi-square random variables; along with Student's t, the main distribution used in ANOVA and regression
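As one such exploration, here is one way to look at the Poisson distribution (a minimal sketch; the value of lambda and the colors are arbitrary choices):

library(ggplot2)
x <- rpois(10000, lambda = 4)  # 10,000 random draws, averaging 4 events
qplot(x, binwidth = 1, col = I('black'), fill = I('cornflowerblue'))

Try a few different values of lambda and notice how the histogram becomes more symmetric as lambda grows.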
3. Random sampling from a fixed set of values: sample()

In addition to functions that allow random simulation from a known distribution, base R includes the sample() function, which provides a simple mechanism for randomly sampling from a particular set of values, either with or without replacement.

3.1 Random sampling without replacement

Suppose that we have a vector of the integers from 1 to 10:

x <- 1:10
> x
 [1]  1  2  3  4  5  6  7  8  9 10

To simulate a simple random sample from x with sample size 3, we can call,

> sample(x, size = 3)
[1] 9 4 6
> sample(x, size = 3)
[1] 7 4 9

(this is a random simulation, so your results will vary). Each time we call sample(x, size = 3), a new set of 3 values is randomly sampled from the vector x. This works on non-numeric values as well. For example:

> sample(letters, size = 3)
[1] "q" "f" "r"

3.2 Permutation

If the size argument is set equal to the length of x, or if it is omitted, then the result will be a vector that contains all of the values in x, but in a new, randomly selected order:

> sample(x, size = length(x))
 [1]  6  3  7  2 10  8  9  1  5  4
> sample(x)
 [1]  5  4  6 10  9  1  8  7  2  3

This type of sampling, in which we select a sample of the same size as the original set of values, and do so without replacement so that the result is a random reordering of the original set, is called permutation.

3.3 Sampling with replacement

By default, sample() draws values without replacement, so attempting to set size to a value greater than the length of x will generate an error. However, setting the optional argument replace = TRUE instructs sample() to draw values with replacement. When sampling with replacement, we may or may not get duplicate values when size <= length(x),

> sample(x, size = 3, replace = TRUE)
[1] 8 1 2
> sample(x, size = 3, replace = TRUE)
[1] 1 2 2

but duplicates become increasingly likely as size grows, and are guaranteed if size > length(x):

> sample(x, size = 10, replace = TRUE)
 [1] 5 5 4 1 3 8 1 4 4 4
> sample(x, size = 13, replace = TRUE)
 [1]  7  4 10 10  3 10  9 10  2  2  1 10  8

3.4 Probability-weighted samples

All of the examples thus far have used simple random samples, either with or without replacement. Although we will not use it in this class, the sample() function also allows you to set the probability or frequency associated with each element of the vector being sampled. For instance,

> sample(1:4, size = 20, replace = TRUE, prob = (1:4)/10)
[1] 4 3 4 2 4 4 1 3 2 4 3 4 4 2 3 4 2 4 3 1

(again, your random draws will differ). Here, I've instructed sample() to give me a vector of 20 values, where each element is an integer from 1 to 4, but the value 1 only has a 10% chance of occurring, whereas 2 has a 20% chance, 3 has a 30% chance, and 4 has a 40% chance. This can be seen by evaluating the expression given to the prob argument:

> (1:4)/10
[1] 0.1 0.2 0.3 0.4
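To check that the weights behave as intended, you can tabulate a large weighted sample; a sketch (table() counts how often each value occurs, and the proportions should come out near 0.1, 0.2, 0.3, and 0.4, though your exact values will differ):

> table(sample(1:4, size = 10000, replace = TRUE, prob = (1:4)/10)) / 10000

     1      2      3      4 
0.0997 0.2015 0.2980 0.4008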
In the example that we've just looked at, I specified a proper probability mass function (the values sum to 1), but this is not strictly necessary. For instance, suppose we have data on ABO blood types in 5000 individuals, where 130 are type O, 2654 are type A, 1842 are type AB, and 374 are type B, and we would like to simulate a draw of 20 individuals from this population. One option would be to create a vector of 5000 values and take a simple random sample:

abo <- c(
  rep("O", 130),
  rep("A", 2654),
  rep("AB", 1842),
  rep("B", 374)
)

> sample(abo, size = 20)  # note that replace = FALSE by default
 [1] "A"  "AB" "AB" "B"  "A"  "A"  "AB" "O"  "O"  "AB" "A"  "AB"
[13] "A"  "A"  "AB" "AB" "A"  "AB" "A"  "A"

Alternatively, we can get an essentially equivalent simulation (with only 20 draws from a population of 5000, sampling with replacement is a very close approximation to sampling without) by calling,

sample(
  c("O", "A", "AB", "B"),         # values
  size = 20,                      # desired sample size
  replace = TRUE,                 # required b/c size > length(x)
  prob = c(130, 2654, 1842, 374)  # frequencies in the population
)

3.5 Example applications for sample()

The sample() function is useful in a variety of situations. For example, as we have just seen, it can be used to simulate sampling from a known population, which lets us preview what our data might look like. As demonstrated in lecture, sample() can also be combined with the replicate() function (see Section 4) to explore how a particular estimation method is affected by sample size or other aspects of sample design.

3.5.1 Randomizing treatment assignments – In experiments, sample() can be used to randomly assign individuals to treatments. For example, suppose that we are conducting a drug study in which we are looking at four different dosage levels, one of which is a control; we therefore have four treatments. For the sake of argument, suppose that we also have 100 subjects selected to participate in the study, each assigned a unique ID number from 1 to 100. The following code randomly assigns each individual to a treatment:

treatment_levels <- c("control", "group 1", "group 2", "group 3")
data <- data.frame(
  id = sample(1:100),
  treatment = factor(treatment_levels)
)
write.csv(data, file = "treatment_assignments.csv", row.names = FALSE)

The first statement creates a character vector that identifies the four treatment levels. In the second statement, a data frame is created with the subject IDs in the first column and a factor indicating treatment group in the second column. However, sample() has been called on the ID numbers to randomly permute them (see Section 3.2). The vector treatment_levels is converted into a factor and recycled so that there are a total of 100 rows in the data frame (note that this would not work if 100/4 were not a whole number; recycling would produce an error). Finally, write.csv() writes the result to a CSV file (use the standard procedures to make sure the file lands in the directory where you want it). If desired, arrange() could be called to sort the data frame by treatment and ID before saving it.
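Because the four treatment levels are recycled in order down the 100 permuted IDs, each treatment receives exactly 25 subjects. One way to confirm the balance:

> table(data$treatment)

control group 1 group 2 group 3 
     25      25      25      25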
3.5.2 Bootstrap resampling – One common use of the sample() function is in bootstrap resampling. The idea underlying this methodology is that, provided you have a representative sample of independent units from a population, you can obtain an approximate confidence interval for almost any population parameter by repeatedly resampling from the observed data to build an empirical representation (i.e., a histogram) of the sampling distribution of the parameter's estimator.

To generate a single bootstrap sample, we begin with some data. Here, I'll simulate some "data" consisting of 25 observations of a normally distributed variable with mean 34.9 and standard deviation 2.71 (feel free to supply your own units):

x <- rnorm(25, mean = 34.9, sd = 2.71)
> x
 [1] 30.68825 31.84423 33.18473 34.61263 33.14514 38.19584
 [7] 40.99194 34.96162 32.17812 31.61524 40.35282 33.64944
[13] 30.15063 35.29006 36.36103 36.80707 41.95605 34.12107
[19] 30.75378 36.62187 37.20257 30.07368 33.98218 33.51193
[25] 31.98624

Once we have the data, we can sample from it with replacement to generate a new, simulated data vector that represents a plausible alternative dataset; that is, one that we could reasonably imagine having drawn from the population instead of the data that we actually saw:

xboot <- sample(x, replace = TRUE)
> xboot
 [1] 30.68825 34.96162 37.20257 34.61263 35.29006 34.61263
 [7] 35.29006 34.12107 31.84423 36.36103 33.18473 40.35282
[13] 35.29006 34.96162 34.61263 36.80707 32.17812 30.07368
[19] 34.61263 34.12107 34.61263 36.36103 33.98218 33.51193
[25] 36.62187

(compare these values to the original values of x, above). Taking the mean and standard deviation of the bootstrap sample gives us one new "observation" of the sample mean and sample standard deviation. The statistics calculated from the bootstrap sample are similar to the original values, but not identical to them:

> mean(x)
[1] 34.56953
> sd(x)
[1] 3.334223
> mean(xboot)
[1] 34.65073
> sd(xboot)
[1] 2.140218

By repeating this process many times, we can build a frequency distribution of plausible values for the estimators (the sample mean and sample standard deviation). The easiest way to accomplish this is the replicate() function, as demonstrated in the next section.

4. Repetition of a random simulation: replicate()

In most situations, it is not very useful to simulate a set of values only one time. For example, by itself, the single bootstrap sample generated in Section 3.5.2 does not provide us with much useful information. To generate confidence intervals, we would need to repeat the process of drawing a new bootstrap sample many times, obtain estimates of the mean and standard deviation for each one, and then find the empirical quantiles of those means and standard deviations. The replicate() function makes it very simple to code this kind of exercise. However, before seeing how to use replicate(), it is helpful to see how this might be done if we didn't have replicate() available.

4.1 The for loop

A standard approach to this sort of task that works in almost every computer language (including R) is the for loop. As an example, look at the following code and its output:

for(i in 1:3){
  print(i + 1)
}
[1] 2
[1] 3
[1] 4

Notice that in the output, we have three separate 1-element vectors, equal to 1 + 1, 2 + 1, and 3 + 1. The for loop specifies a variable called an iterator, which I've named i, that takes on each value in a specified vector (here, the integers from 1 to 3), one at a time. Each time the iterator is assigned a new value, the block of code inside the curly braces is executed. In this code, the first time through the loop, i is set to 1, so R prints i + 1 = 1 + 1 = 2. The second time through, it uses i = 2, and so on until the last value of i is reached. It is worth noting that the words "for" and "in" are reserved words in R; you cannot assign objects to these names.
To generate a set of bootstrap samples using the for loop, we first create a list object to store our results, and then run the bootstrap sampling code from Section 3.5.2 inside the loop, with a few minor modifications:

xboot <- list()  # an empty list named xboot
for(i in 1:3){
  xboot[[i]] <- sample(x, replace = TRUE)
}
> length(xboot)
[1] 3

In the code above, we repeat the process of sampling x with replacement 3 times, and append each of the resulting vectors of 25 simulated observations to the list xboot. Notice the double square brackets used to index xboot; this notation is necessary for the indexing to work properly with lists. The result is a list of 3 vectors, each with 25 elements:

> xboot
[[1]]
 [1] 38.19584 36.36103 34.96162 34.96162 34.12107 36.36103
 [7] 31.61524 30.07368 33.51193 40.35282 31.61524 30.07368
[13] 31.61524 31.84423 30.07368 33.14514 34.12107 33.51193
[19] 31.98624 35.29006 36.36103 34.61263 37.20257 40.35282
[25] 33.64944

[[2]]
 [1] 33.98218 38.19584 34.61263 35.29006 36.62187 36.62187
 [7] 37.20257 31.98624 40.35282 36.80707 33.14514 38.19584
[13] 38.19584 34.61263 31.84423 34.12107 40.35282 37.20257
[19] 30.07368 34.96162 36.62187 36.80707 34.61263 35.29006
[25] 36.36103

[[3]]
 [1] 31.84423 34.61263 30.15063 33.51193 30.07368 33.64944
 [7] 30.68825 34.61263 40.99194 37.20257 31.98624 35.29006
[13] 34.12107 36.36103 40.35282 41.95605 33.18473 33.98218
[19] 35.29006 31.98624 30.75378 35.29006 31.84423 34.96162
[25] 32.17812

To make xboot more compact, we can rearrange it into a matrix using the simplify2array() function. This will give us a 25 × 3 matrix in which each column represents one of our 3 replicated samples:

xboot <- simplify2array(xboot)
> xboot
          [,1]     [,2]     [,3]
 [1,] 38.19584 33.98218 31.84423
 [2,] 36.36103 38.19584 34.61263
 [3,] 34.96162 34.61263 30.15063
 [4,] 34.96162 35.29006 33.51193
 [5,] 34.12107 36.62187 30.07368
 [6,] 36.36103 36.62187 33.64944
 [7,] 31.61524 37.20257 30.68825
 [8,] 30.07368 31.98624 34.61263
 [9,] 33.51193 40.35282 40.99194
[10,] 40.35282 36.80707 37.20257
[11,] 31.61524 33.14514 31.98624
[12,] 30.07368 38.19584 35.29006
[13,] 31.61524 38.19584 34.12107
[14,] 31.84423 34.61263 36.36103
[15,] 30.07368 31.84423 40.35282
[16,] 33.14514 34.12107 41.95605
[17,] 34.12107 40.35282 33.18473
[18,] 33.51193 37.20257 33.98218
[19,] 31.98624 30.07368 35.29006
[20,] 35.29006 34.96162 31.98624
[21,] 36.36103 36.62187 30.75378
[22,] 34.61263 36.80707 35.29006
[23,] 37.20257 34.61263 31.84423
[24,] 40.35282 35.29006 34.96162
[25,] 33.64944 36.36103 32.17812

And finally, we can use the apply() function to calculate the mean and standard deviation of each column (for the means, colMeans() is actually faster, but we'll use apply() here). apply() executes a named function (FUN) on each row (MARGIN = 1) or column (MARGIN = 2) of a matrix:

xboot_means <- apply(xboot, MARGIN = 2, FUN = mean)
xboot_sds <- apply(xboot, MARGIN = 2, FUN = sd)
> xboot_means
[1] 34.23884 35.76285 34.27505
> xboot_sds
[1] 2.894935 2.487915 3.205830
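If MARGIN is new to you, it is worth trying apply() on a small matrix where the answer is easy to verify by eye; a minimal sketch:

m <- matrix(1:6, nrow = 2)  # 2 rows, 3 columns, filled column-wise
> m
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
> apply(m, MARGIN = 1, FUN = sum)  # row sums
[1]  9 12
> apply(m, MARGIN = 2, FUN = sum)  # column sums
[1]  3  7 11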
4.2 replicate()

If all we want to do is repeat the same random process many times, then the replicate() function simplifies our code considerably. It takes two arguments: n, which specifies the number of times the code should be replicated, and the block of code to replicate. If the code includes more than one line, it must be enclosed in curly braces ({}).

Using replicate(), only one line of code is required to generate 3 bootstrap samples and reformat them into a matrix:

xboot2 <- replicate(3, sample(x, replace = TRUE))
> xboot2
          [,1]     [,2]     [,3]
 [1,] 33.64944 33.51193 36.36103
 [2,] 37.20257 35.29006 37.20257
 [3,] 33.51193 30.15063 31.98624
 [4,] 36.80707 36.80707 32.17812
 [5,] 30.15063 36.80707 30.75378
 [6,] 30.15063 33.18473 33.64944
 [7,] 33.14514 37.20257 36.62187
 [8,] 30.75378 30.15063 30.68825
 [9,] 36.36103 33.51193 41.95605
[10,] 31.61524 33.14514 36.62187
[11,] 36.80707 31.61524 36.36103
[12,] 31.84423 30.68825 34.12107
[13,] 31.61524 30.68825 37.20257
[14,] 30.75378 36.80707 41.95605
[15,] 33.98218 31.84423 33.14514
[16,] 32.17812 30.07368 36.36103
[17,] 31.98624 33.18473 34.12107
[18,] 33.51193 31.61524 35.29006
[19,] 30.07368 36.80707 33.51193
[20,] 33.64944 34.12107 41.95605
[21,] 33.18473 38.19584 38.19584
[22,] 40.99194 36.62187 33.98218
[23,] 34.61263 40.99194 36.80707
[24,] 40.35282 38.19584 34.96162
[25,] 36.80707 33.14514 31.61524

xboot2_means <- apply(xboot2, MARGIN = 2, FUN = mean)
xboot2_sds <- apply(xboot2, MARGIN = 2, FUN = sd)
> xboot2_means
[1] 33.82794 34.17429 35.50429
> xboot2_sds
[1] 3.027593 3.027248 3.215012

In this example, replicate() lets us write cleaner, more readable code, and with more complex processes it can also produce performance gains. In R, for loops that grow an object one element at a time, as ours did, tend to be slow. Whenever possible, take advantage of R's native vectorization (its behavior of executing a function elementwise on a vector), or use functions like replicate() that are designed to handle this kind of repetition efficiently.

4.3 Generating a bootstrap confidence interval

Bootstrap resampling is typically used to generate approximate confidence intervals when other methods (e.g., Student's t distribution or maximum likelihood) won't work. However, to complete this section, we will calculate bootstrap intervals for the mean and standard deviation of x, and then compare them with the intervals that we get using the classical Student's t and chi-square sampling distributions for the mean and variance.

# use replicate and sample to resample x 5000 times
xboot3 <- replicate(5000, sample(x, replace = TRUE))

# calculate the mean and standard deviation of each resample
xb3mean <- colMeans(xboot3)  # alternative to apply() for means
xb3sd <- apply(xboot3, MARGIN = 2, FUN = sd)

# visualize the approximate sampling distributions
qplot(xb3mean, fill = I('steelblue'), col = I('black'), xlab = 'Mean')
qplot(xb3sd, fill = I('steelblue'), col = I('black'),
      xlab = 'Standard deviation')

# bootstrap confidence intervals
ci_mean_boot <- quantile(xb3mean, c(0.025, 0.975))
ci_sd_boot <- quantile(xb3sd, c(0.025, 0.975))

# classical confidence intervals
ci_mean <- mean(x) + qt(c(0.025, 0.975), df = 24) * sd(x)/sqrt(25)
ci_sd <- sqrt((var(x) * 24) / qchisq(c(0.975, 0.025), df = 24))

# combine the different CIs into a matrix using the rbind()
# (for "row bind") function
results <- rbind(ci_mean, ci_mean_boot, ci_sd, ci_sd_boot)

# add column names
colnames(results) <- c('2.5%', '97.5%')

# print the results
results

In the console, we get,

> results
                  2.5%     97.5%
ci_mean      33.193228 35.945827
ci_mean_boot 33.365305 35.888282
ci_sd         2.603456  4.638411
ci_sd_boot    2.366881  4.029368

As you can see, the bootstrap confidence intervals are slightly narrower than the classical intervals, but are approximately correct for both the mean and the standard deviation.
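One practical note: because every run of sample() produces different draws, your intervals will not exactly match the ones above. If you need a simulation to be reproducible, you can fix the state of R's random number generator before resampling; a minimal sketch (the seed value 503 is an arbitrary choice):

set.seed(503)  # the same seed reproduces the same sequence of draws
xboot3 <- replicate(5000, sample(x, replace = TRUE))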
The histograms for the approximated sampling distributions appear in Figures 4 and 5.

Figure 4: Histogram of 5000 bootstrap estimates of the mean, from an original dataset of 25 observations drawn from x ~ N(34.9, 2.71).

Figure 5: Histogram of 5000 bootstrap estimates of the standard deviation, from an original dataset of 25 observations drawn from x ~ N(34.9, 2.71).

5. (OPTIONAL) Simulating random processes

At the beginning of Section 2.1.5, I listed several examples of ways that simulations can be used to address problems in biology. This section provides two brief demonstrations that can be used to start exploring R's simulation tools. The coding is somewhat more advanced than in previous sections. Working through this part of the tutorial is entirely optional.

5.1 Planning a study

Suppose that we are conducting a study on immune response, and need to quantify the average white blood cell (WBC) count for each of several patients. For each patient, we plan to collect a blood sample and then prepare several slides by diluting the blood in a buffer to a ratio of 1:20 and placing an aliquot into a 0.1 uL well on the slide. Then we will manually count the number of WBCs on each slide in our microscope's field of view at a fixed magnification. By preparing several slides in this way, we hope to estimate each patient's true WBC count to within ±500 cells/uL at a confidence level of 95%. We will then use the point estimate at the center of this interval as the patient's response value in later analyses.

From the Law of Large Numbers, we know that the precision of our estimate will improve (i.e., the margin of error of our confidence interval will shrink) as the number of slides per patient increases. The question we have when planning the study is: how many slides do we need to prepare for each patient to be reasonably certain of achieving our target precision? One way to answer this question is to make some assumptions about the variation that we will see from slide to slide, and then simulate the experiment.

5.1.1 Setting up the simulation – To plan our simulation, we need to select a distribution that will represent our raw data, and then mimic any manipulations that we expect to perform on the raw data. In this study, we will multiply our raw counts by 200 to account for the dilution and to convert the volume units to microliters:

w (cells/uL) = x × 20 / 0.1 = 200 × x

Typical WBC counts range from w = 4500-11000 cells/uL, or approximately x = 22-55 cells per slide. Assuming that the blood is well mixed, so that WBCs occur randomly over the surface of the slide, the number of WBCs in a 0.1 uL aliquot is a good example of a variable that we can expect to follow a Poisson distribution. Therefore, our simulation can use the following steps:

1. Make up a "true" WBC count for the simulated patient, mu, and select the number of slides that will be examined, n.
2. Use the rpois() function to randomly generate a set of n simulated observations, where the expected number of cells per slide is lambda = mu/200. The variance of the Poisson distribution is equal to its mean, so the variance will increase as lambda increases.
3. Calculate the margin of error for the mean count.
4. Multiply by 200 to convert the margin of error on the count into a margin of error on the estimated WBC count.
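As a quick sanity check on the conversion, you can evaluate the expected per-slide counts at the extremes of the typical range (a sketch; these are just the formula above solved for x):

> 4500 * 0.1 / 20   # lowest typical WBC count
[1] 22.5
> 11000 * 0.1 / 20  # highest typical WBC count
[1] 55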
Executing the four steps listed above will produce one simulated margin of error. To be reasonably certain that our real margin of error will fall below 500 cells/uL, we will want to repeat steps 2-4 many times for each of several possible values of n, and then look at an upper quantile of the simulated margins of error (we want an upper quantile because we want to be confident that our real margin of error will fall below 500 cells/uL). In addition, it would be wise to repeat the procedure for a number of different values of mu.

The following code defines a function, count_wbc(), that will execute steps 1-4. I have given the function four arguments: n_slide is the number of slides, true_wbc is the "true" WBC count, dilute is the dilution factor, and vol_adj is the volume conversion factor. Notice that I've assigned default values for the last two arguments, but not for n_slide and true_wbc.

count_wbc <- function(
  n_slide,        # number of slides to count
  true_wbc,       # patient's true WBC count (cells/uL)
  dilute = 20,    # dilution factor
  vol_adj = 0.1   # well volume (uL)
){
  # Step 1: n_slide and true_wbc defined by the user

  # Step 2a. Convert true_wbc into the mean raw count
  lambda <- true_wbc * (vol_adj/dilute)

  # Step 2b. Simulate counts from n slides using
  #          the Poisson distribution
  raw_counts <- rpois(n = n_slide, lambda = lambda)

  # Step 3. Calculate the margin of error for a CI on the mean.
  #         Here, I'm using the t-distribution (there are other
  #         options, but this is simple and should work okay
  #         in this situation)
  margin <- qt(0.975, df = n_slide - 1) * sd(raw_counts)/sqrt(n_slide)

  # Step 4. Convert the margin of error on raw counts
  #         into standard WBC units, and return the result
  margin * (dilute/vol_adj)
}

To test our simulation function, we can give it some inputs and make sure it runs (more detailed tests are also possible; for instance, we could comment out steps 3 and 4 and return raw_counts, to make sure the counts are being generated properly).

> count_wbc(5, 5000)
[1] 2043.283

The function works. If we rerun it, we'll get a different result (try it).

5.1.2 Running the simulation – Our goal is to estimate an upper quantile for the margin of error under a range of sample sizes, and for several true WBC count values. To do this, we'll use a pair of nested for loops. The outer loop will iterate over a range of true WBC counts, and for each true count, the inner loop will iterate over different numbers of slides. First, we need to create a data frame to hold our results.

nslides <- 5:50
wbc_counts <- seq(from = 4500, to = 11000, length = 5)
wbc_results <- expand.grid(
  list(slides = nslides, true = wbc_counts, margin = NA)
)

The first two statements define vectors that describe the possible values for the number of slides and the true WBC counts that we will use in the simulation. Then the expand.grid() function creates a data frame with two columns, named slides and true, containing one row for every possible combination of the two variables. A column for the margin is also added, but all of its values are currently missing (NA).

> head(wbc_results)  # view the first few rows
  slides true margin
1      5 4500     NA
2      6 4500     NA
3      7 4500     NA
4      8 4500     NA
5      9 4500     NA
6     10 4500     NA
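If expand.grid() is unfamiliar, its behavior is easiest to see on a tiny input; a minimal sketch (every combination of the two vectors appears exactly once, with the first vector varying fastest):

> expand.grid(slides = 1:2, true = c(100, 200))
  slides true
1      1  100
2      2  100
3      1  200
4      2  200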
Next, we'll set up the loops to fill in values for the margins. For each combination of slide number and true count, we'll use replicate() to find the 90th percentile of 200 simulated margins of error. Notice that I reuse nslides and wbc_counts here to supply the values that we loop over. To change the simulation, I only need to change the definitions of these vectors above; no other changes to the code are required.

for(w in wbc_counts){
  for(n in nslides){
    # run 200 simulations for this number of slides and true count
    margins <- replicate(
      200,
      count_wbc(n_slide = n, true_wbc = w)
    )
    # identify the row for this number of slides and true count
    # (row_to_fill is a logical vector with one element for each
    # row, only one of which is TRUE)
    row_to_fill <- with(wbc_results, slides == n & true == w)
    # assign the 90th percentile of margins to the margin column
    # in this row of wbc_results
    wbc_results$margin[row_to_fill] <- quantile(margins, 0.9)
  }
}

Finally, we'll plot the results. When we define the aesthetics in ggplot(), we can set x = slides and y = margin to see how the upper quantile of the margin of error changes as a function of sample size. Setting color = factor(true) draws a different line for each true WBC count and color codes them (see Figure 6):

ggplot(
  wbc_results,
  aes(x = slides, y = margin, color = factor(true))
) +
  geom_line() +
  geom_hline(yintercept = 500) +
  labs(
    x = "Number of slides per patient",
    y = "90% of simulated margins of error (cells/uL)",
    color = "True\ncount"  # "\n" means "new line"
  ) +
  theme_grey(base_size = 16)

5.1.3 Interpreting the results – Based on Figure 6, we would need to plan on about 45 slides per patient to achieve a precision of ±500 cells/uL for the upper range of WBC counts. If this is not logistically feasible, then we might consider whether this level of precision is truly necessary for our study. For instance, if we are willing to accept a precision of ±1000 cells/uL, then we could use as few as 15 slides per patient. In addition, it is worth remembering that Figure 6 shows the 90th percentile of the simulated margins: following its guidance, we would still have approximately a 10% chance of missing our precision target for any given patient.

Figure 6: Upper (90%) quantiles of the margins of error for the mean per-patient WBC count (cells/uL), as a function of the number of slides counted per patient and the patient's true WBC count (line colors: 4500, 6125, 7750, 9375, and 11000 cells/uL).
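You can also spot-check this conclusion for a single design point without rerunning the whole grid; a sketch (the margins are random, so your 90th percentile will land only approximately on the target):

margins <- replicate(200, count_wbc(n_slide = 45, true_wbc = 11000))
quantile(margins, 0.9)  # should fall near 500 cells/uL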
5.2 Why is the normal distribution so common in nature?

The normal distribution is one of the most commonly encountered distributions in biology. That is partly because it has many mathematically useful properties, but it is also because many, many natural phenomena at least approximate a normal distribution. In this exercise, we'll use R's simulation functions to explore why this happens.

5.2.1 Simulating multiple additive sources of variation

For this exercise, we will start by drawing random numbers from a uniform distribution covering the range from -1 to 1. In the console, type,

> u <- runif(1000, min = -1, max = 1)
> qplot(u, col = I('black'), fill = I('cornflowerblue'))

(feel free to use different colors). This creates a vector of 1000 numbers uniformly distributed between -1 and 1, saves them to an object named u, and then draws a histogram of them.

Next, we'll use replicate() to generate several uniform random variables. For example,

> u <- replicate(3, runif(5, min = -1, max = 1))

yields five observations from each of three uniform random variables (again, all between -1 and 1). Looking at u, we see a matrix, as described in Section 4.2:

> u
           [,1]        [,2]       [,3]
[1,]  0.3554091 -0.12545295 -0.3769121
[2,] -0.9152614 -0.43220805  0.8321133
[3,] -0.2715022 -0.83193156  0.2449853
[4,] -0.3156245 -0.02044867  0.8503464
[5,] -0.6154813  0.43419977  0.2871340

Remember that because these are random numbers, your numbers will not match mine. Right now we have 3 different sources of variability, each of which is uniformly distributed. We can think of each row as an individual (for example, a single plant) and each column as a different input to some response variable (e.g., water availability, phosphorus, sunlight, etc.). The next step is to add the inputs together. To do this, we can use the function rowSums() to get the sum across each row of u (there is also a colSums() function):

> x <- rowSums(u)
> x
[1] -0.1469560 -0.5153562 -0.8584485  0.5142733  0.1058524

Here, the variable x has five values, each of which is equal to the sum of three random uniform draws. This approximates the process that we might see in nature when one value (plant biomass, for instance) is the result of many small additions and subtractions.
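To scale this idea up from 5 individuals to the 1000 needed for a histogram, the same two functions can be combined; a minimal sketch for the sum of 5 uniform sources (ggplot2 must be loaded; this also serves as a template for the exercise in the next section):

u <- replicate(5, runif(1000, min = -1, max = 1))  # 1000 rows, 5 columns
x5 <- rowSums(u)  # each element is the sum of 5 uniform draws
qplot(x5, col = I('black'), fill = I('cornflowerblue'))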
5.2.2 Additive variation and the normal distribution

Make a new script, and write code to generate the following histograms:

1. 1000 draws of x1, where x1 follows a uniform distribution between -1 and 1 (see above).
2. 1000 values of x2, where x2 is the sum of 2 Unif(-1, 1) variables.
3. 1000 values of x5, where x5 is the sum of 5 Unif(-1, 1) variables.
4. 1000 values of x50, where x50 is the sum of 50 Unif(-1, 1) variables.

Based on these observations, what happens to the distribution of X as new sources of variation are added?

Repeat the simulation experiment, this time using draws from the exponential distribution with rate 1 instead of the uniform distribution. To do this, replace the code that says,

runif(1000, min = -1, max = 1)

with code reading,

rexp(1000, rate = 1)

How do your results this time compare with the first experiment?

Finally, repeat the experiment using normal distributions that have different means and variances. To do this, we will use a more complicated set of code inside the replicate() function:

u <- replicate(3, {
  m <- runif(1, min = -5, max = 5)
  s <- sqrt(rchisq(1, df = 1))
  rnorm(1000, mean = m, sd = s)
})

Here, I've written three lines of code into the replicate() call. The curly brackets ({}) that surround the code tell R that it should all be treated as one block of instructions and processed together, so the whole set of three lines will be replicated. The first line draws a random uniform value between -5 and 5, which will be used as a mean. The second line draws a random value from the chi-square distribution to use as a variance, and then takes its square root to get a standard deviation. The third line uses the results of the first two statements to generate 1000 random normal draws with mean m and standard deviation s. If you do not understand what is happening here, make histograms for a few columns of u, and pay close attention to the x-axis.

Generate the same four histograms, defining x as the sum of 1, 2, 5, or 50 sources of variability. Do you notice any patterns as the number of sources increases (hint: watch the x-axis, and remember the expectation and variance of the sum of two random variables)?

Appendix: Code for Figures 1 and 2

The following ggplot2 code was used to generate Figure 1.

df <- data.frame(
  x = 0:100,
  Fx1 = pbinom(0:100, 100, 0.2),
  Fx2 = pbinom(0:100, 100, 0.7)
)
ggplot(df) +
  labs(
    x = expression(italic(x)),
    y = expression(italic(F)(italic(x)))
  ) +
  geom_step(aes(x = x, y = Fx1)) +
  geom_step(aes(x = x, y = Fx2), col = 'red') +
  annotate('text', x = 10, y = 0.78, size = 5,
           label = expression(italic(X)[1] %~% Bin(100, 0.2))) +
  annotate('text', x = 53, y = 0.20, size = 5, col = 'red',
           label = expression(italic(X)[2] %~% Bin(100, 0.7))) +
  geom_hline(yintercept = 0.9, linetype = 'dotted') +
  geom_vline(xintercept = qbinom(0.9, 100, 0.2), linetype = 'dashed') +
  geom_vline(xintercept = qbinom(0.9, 100, 0.7), linetype = 'dashed',
             col = 'red') +
  theme_bw(base_size = 18)

The following ggplot2 code was used to generate Figure 2.

norm_example <- data.frame(x = seq(-3.8, 5.2, 0.01)) %>%
  mutate(
    fx = dnorm(x, mean = 1.2, sd = 1),
    Fx = pnorm(x, mean = 1.2, sd = 1)
  )
ggplot(norm_example, aes(x = x, y = fx)) +
  labs(
    y = expression(paste(italic(f), group("(", italic(x), ")"))),
    x = expression(italic(x))
  ) +
  lims(x = c(-4, 6), y = c(0, 0.4)) +
  geom_area(mapping = aes(x = ifelse(x < 3, x, 0)), fill = 'steelblue2') +
  geom_line(color = 'grey20', lwd = 1) +
  geom_segment(aes(x = 3, y = 0, xend = 3, yend = dnorm(3, 1.2, 1))) +
  geom_point(aes(x = 3, y = dnorm(3, 1.2, 1)), color = 'red', size = 3) +
  annotate('text', x = -1.6, y = 0.32, size = 5, col = 'steelblue4',
           label = 'pnorm(3, 1.2, 1) is the area\n of the shaded region') +
  annotate('text', x = 4.3, y = 0.1, size = 5, col = 'red',
           label = 'dnorm(3, 1.2, 1) is\n the height of the\n curve at x = 3') +
  annotate('text', x = 1.2, y = 0.04, size = 5, col = 'black',
           label = 'qnorm(0.9641, 1.2, 1)\n is the upper bound of\n the shaded region') +
  theme_bw(base_size = 18)