STAT 778 Lab 4
Make utilities
Go along with the notes at your own pace. Seek help from me or the TA at our office hours, or
send us an e-mail message if you have any questions. All code referenced in this lab is in
/usr/local/shared/STAT778/lab4/
1 What is the default RNG in R
Search Google with a suitable key phrase, such as “What is the default RNG in R”; you should
then see https://astrostatistics.psu.edu/su07/R/html/base/html/Random.html in the results. Read it
at your leisure and summarize it for later use.
2 Velleman’s Experiment and Make Utility
In this section we mimic parts of the simulation studies of Johnstone and Velleman (1985, JASA) for the
comparison of three regression estimates of the slope of a simple linear regression, and we introduce the make
utility. Most source code and write-ups are modified from Faraway’s. The design for the regression is
    Y_i = X_i + σ ε_i,

where X_i = i/n and Var(ε_i) = 1. The estimators are least squares (LS), least absolute deviations (LAD)
and a Huber-type estimator (Huber), where β̂ is computed by minimizing

    Σ ρ(Y_i − α − β X_i),   where ρ(x) = { (x/σ̂)^2   if |x| < σ̂
                                          { |x|/σ̂     otherwise.

In other words, the Huber-type estimator is a hybrid of the first two estimators, and in this demonstration
we substitute the Huber estimator for the “line resistant estimator”. (σ̂ is an estimate of the error SD.)
Since we want to check the performance of three estimators over a fairly “complete crossing”, there will
be many subroutines in separate files for this simulation experiment. Now copy the relevant code to your
home directory.
mkdir simul — make a new directory
cp /usr/local/shared/STAT778/lab4/* simul — copy the contents of “lab4” to your “simul” directory
cd simul; ls — two Unix commands (separated by “;”) on one line
The code is written in C, but there is nothing special about it and it could just as easily be written in
Fortran or any practical language. First take a look at the file rgenl.c, which uses the system random
number generator drand48() (period 2^48) available on Unix workstations such as zeus.vse.gmu.edu.
You are advised to use your own RNG, or one from “Numerical Recipes” (NR, hereafter), if you don’t trust
a built-in RNG. Seven different error distributions are allowed for. The file reg.c has the least squares
code, which uses a naive algorithm. huber.c has the Huber-type estimator, computed by iteratively
reweighted least squares. The LAD code is in medfit.c, rofunc.c and shell.c; LAD regression is difficult
because the absolute value function is not differentiable at zero, and this code is adapted from one in NR.
The main program is in main.c, which reads in the input values, sets up the x and y, and calls the three
estimating subroutines. Finally, constant.h is a header file which defines the maximum allowable sample
size.
The make utility comes in handy¹ for a program of this size, with many subroutines in separate files.
It keeps track of which files have changed since the last compile, which commands are used to
compile them, and so on. The necessary information is recorded in a file, usually called
Makefile. Read the Makefile in the directory simul and type
more Makefile
make
to see what happens. make builds the first target (here prog) by default; other targets can be specified, e.g.
make clean
cleans out the files generated by make. Some of the source files depend on the file constant.h, which specifies
the maximum allowable sample size. These files are main.c, huber.c and rofunc.c. Now if you change
constant.h, these three files will be automatically recompiled when you run make. This beats searching through
the source files for all the necessary dependencies! To demonstrate, type
make
Now edit constant.h to change the maximum sample size and type make again. Type man make for more
details on make.
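The dependency tracking just described is driven by rules in the Makefile. As a rough, hypothetical sketch built from the file names above (the actual Makefile in simul may differ, so compare with what more Makefile shows):

```make
CC     = cc
CFLAGS = -g
OBJS   = main.o reg.o huber.o medfit.o rofunc.o shell.o rgenl.o

# first target, built by plain "make"
prog: $(OBJS)
	$(CC) $(CFLAGS) -o prog $(OBJS) -lm

# these objects also depend on the header, so touching constant.h
# forces them to be recompiled
main.o huber.o rofunc.o: constant.h

# "make clean" removes the generated files
clean:
	rm -f prog $(OBJS)
```

Note that each recipe line (a command under a target) must begin with a tab character, not spaces; make relies on its built-in rule to compile each .c file into the corresponding .o.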
Let’s try running the program. Input is expected in four fields: the scale of the added error σ, the sample
size, the choice of error distribution (see rgenl.c for which number corresponds to which distribution),
and the number of replications. Input can be entered at the terminal or from a file. Now try running it:
prog 1.0 10 2 40
What error distribution does the choice 2 correspond to? If we want to try a Beta(2,2) error distribution,
what should the third field be?
It is often a good idea to first run a program with a small simulation size (i.e., number of replications) to
check that it is working as expected, and to time it so you can predict how long a large number of
replications will take; that way potential problems can be fixed before wasting a lot of your time and the
computer’s.
Remark: Sometimes you may need to replace prog by ./prog to run the executable prog,
depending on how your default path is set up. On zeus.vse.gmu.edu we should not need to; likewise,
every ./prog below can be replaced by prog.
Timing can be done using the time command. Try a sample size of 1000 and, say, 10 replications, and
then
time ./prog 1.0 1000 2 10 > /dev/null
Redirecting the output to /dev/null effectively throws the output away, since you are only interested
in the timing here. The real (elapsed) time is the third field in the output. For more information on time,
type man time. You can now estimate how long a larger number of replications will take. One way to
speed up the program is to compile it with optimization: edit the Makefile so that CFLAGS= -g reads
CFLAGS= -O, clear out the old object code and executable with make clean, then make again and retime
the code. The options -g and -O work similarly for Fortran code.²
We now want to analyze the output. Right now the estimates for each replication are just printed. We
could write some extra code to compute some summary statistics like means and variances but a more
flexible approach is to read the output data into R/Splus. Change the number of replications to a larger
number and write the data to a file instead of the terminal, say,
./prog 1.0 1000 2 100 > outfile
module add r3 or module add r4
¹Even if you do not write a software package using make, it is still worthwhile to know its basics, since many archived
software/freeware packages use it.
²Warning: sometimes -O does funny things and may not speed up the running time, or may even give an incorrect answer,
so try it out before running a very large simulation. You can also use other tricks/good algorithms learned in class to speed
up the program.
R — enter R. You can also use RStudio instead.
In the ./prog command above, “1.0 1000 2 100” means the simulation size is 100 (you now know how long
it will take; you can increase the simulation size if you wish, but don’t overdo it). We now read the data
into R and compute some summary statistics.
d <- read.table("outfile") — assuming you started R in the directory where “outfile” resides
apply(d,2,mean) — the column means, an estimate of the expected value, which should be 1
apply(d,2,var) — the column variances: which is smallest, which is largest?
sqrt(apply(d,2,var)/length(d[,1])) — se for the means
Since we know the true value of β is 1, the MSE is a useful summary of the performance of an estimator:
mse <- function(x) sum((x-1)^2)/length(x) — a function that estimates the MSE about 1
apply(d,2,mse) — the column MSEs: which is smallest?
We can also use summary in R to get summary statistics: Min, 1st Qu, Median, Mean, 3rd Qu and Max.
Since I’d also like the se of the mean and the mse, I can modify summary by attaching them to its
output:
mysummary<-function(x){c(summary(x), sqrt(var(x)/length(x)),mse(x))}
mysummary(runif(100)) — see how it works.
apply(d,2,mysummary) — apply it to what we are interested in.
Looking at the output of mse and mysummary, we don’t know whether the observed differences are real or
just simulation sampling error from having a finite number of replications. We need to estimate the se of
the estimated MSE and/or of the output of mysummary. Let’s estimate the se of the MSE using, say, a
batching method.
z <- matrix(d[,1],ncol=10,byrow=T) — divide the LS data into 10 batches of size 10
ms <- apply(z,1,mse) — apply mse to each row
ms — these are independent estimates of the MSE
sqrt(var(ms)/length(ms)) — the estimated se of the mean of the batch MSEs for LS; you will need the
same for the other two estimators
motif() — this step is only needed in Splus or in older versions of R
par(mfrow=c(3,1)) — if you’d like to see all three histograms together
apply(d, 2 ,hist)
par(mfrow=c(1,1)) — reset to one plot per window
boxplot(d[,1],d[,2],d[,3],names=c("LS","HUBER","LAD")) — side-by-side boxplots, useful for
identifying outliers and for distributional comparison
pairs(d,labels=c("LS","HUBER","LAD")) — a matrix of scatterplots; notice the correlations between the
estimates
Exercises: Conduct experiments for all the error distributions provided in rgenl.c, summarize the
results in a clean, well-organized table, and make any sensible comments you wish.