ST 307 Final Project (50 pts)
In this project you will create a SAS program, save it as a .sas file, and upload that file to
Moodle on the assignment link.
● Everyone must submit their own code.
● You may not work with others on this project! You may obtain help from a TA
only. You should not post to the discussion board about this project. Failure to
adhere to these guidelines will result in an academic integrity violation.
● Be sure that your SAS file adheres to the SAS file submission guidelines (available
on Moodle in the “Resources and Information” Section).
● All code you write should be similar to the code we wrote in class. For
instance, if you are asked to create a histogram, you should use PROC SGPLOT, not
something like PROC PLOT or PROC UNIVARIATE.
The dataset for this project comes from the UCI Machine Learning Repository. It is a
dataset concerning the use of a bike share. The information below comes from the
README file associated with the dataset.

Bike sharing systems are new generation of traditional bike rentals where whole
process from membership, rental and return back has become automatic. Through
these systems, the user is able to easily rent a bike from a particular position and
return it back at another position. There exists great interest in these systems due
to their important role in traffic, environmental, and health issues.

Apart from interesting real-world applications of bike sharing systems, the
characteristics of data being generated by these systems make them attractive for
the research. Opposed to other transport services such as bus or subway, the
duration of travel, departure and arrival position is explicitly recorded in these
systems. This feature turns bike sharing system into a virtual sensor network that
can be used for sensing mobility in the city.

The response variables for the dataset (aggregated for each day):
- casual: count of casual users
- registered: count of registered users
The predictor variables were:
- instant: record index
- dteday : date
- season : season (1:spring, 2:summer, 3:fall, 4:winter)
- yr : year (0: 2011, 1:2012)
- mnth : month ( 1 to 12)
- holiday : whether a particular day is a holiday or not
- weekday : day of the week
- workingday : if a day is neither a weekend nor a holiday the variable takes
on 1, otherwise it is 0
- temp : Normalized temperature in Celsius. The values are divided by 41
- atemp: Normalized feeling temperature in Celsius. The values are divided
by 50 (max)
- hum: Normalized humidity. The values are divided by 100 (max)
- windspeed: Normalized wind speed. The values are divided by 67 (max)
- weathersit :
- 1: Clear, Few clouds, Partly cloudy, Partly cloudy
- 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
- 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light
Rain + Scattered clouds
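
The normalized predictors in the list above can be converted back to their original units by multiplying each one by its stated divisor. The short sketch below only illustrates that arithmetic and is not part of any graded question. The dataset name work.denorm_demo, the new variable names (temp_c, atemp_c, hum_pct, windspeed_raw), and the single made-up record are all assumptions chosen for the example.

```sas
/* Illustration only: one made-up record holding normalized values. */
data work.denorm_demo;
    temp = 0.5;
    atemp = 0.5;
    hum = 0.5;
    windspeed = 0.5;

    /* Multiply each value by the divisor given in the variable list. */
    temp_c = temp * 41;              /* temperature in Celsius */
    atemp_c = atemp * 50;            /* feeling temperature in Celsius */
    hum_pct = hum * 100;             /* humidity on a 0 to 100 scale */
    windspeed_raw = windspeed * 67;  /* wind speed before normalization */
run;

/* Print the record to compare normalized and original-scale values. */
proc print data=work.denorm_demo;
run;
```

For example, a normalized temp of 0.5 corresponds to 0.5 * 41 = 20.5 degrees Celsius.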
Programming questions
You will now write code corresponding to each question/output/etc. below (we don’t
need the output at all, your code can recreate it!). That is, do not simply modify the
code used for question 1 to do question 2. You can copy and paste the previous code if
needed, but we need to see the code used to answer each question below. Don’t forget
to add comments prior to your SAS steps describing what you are doing!

1. (1pt) Create a permanent library called project using a LIBNAME statement.

2. (1pt) Read in the day.csv file and place the SAS dataset into your project library.

3. (5pts total) (1pt) Use a DATA step to make a copy of the dataset (overwrite the data)
that does the following:
• (1pt) Uses a dataset option to rename the yr, mnth, and hum variables to year,
monthw, and humidity, respectively
• (1pt) Uses a statement to remove the “instant” variable
• (1pt) Uses an IF statement to remove all observations from January or where
atemp is less than 0.2
• (1pt) Creates a new variable (name of your choice) that is the number of casual
users added to the number of registered users

4. (9pts total) (4pts) Create two-way contingency tables between the variables below.
Your tables should show only the frequencies and no other information like
percentages, etc.
• Season and workingday
• Season and weathersit
(4pts) Also create two bar plots to display these two-way tables. One of your bar
plots should be stacked and one should be side-by-side.
After your code, you should have a comment that answers the following question(s):
a) (1pt) Report the season with the most ‘Clear, Few clouds, Partly cloudy, Partly
cloudy’ days.
5. (12pts total) (5pts) Create numeric summaries for windspeed, atemp, and the variable
you created in your DATA step in question 3. You should find only the following
statistics for each level of the weekday variable:
• Mean, median, standard deviation, 1st, 10th, 90th, and 99th percentile (quantile)
(6pts) You should then create
• side-by-side box plots for the first numeric variable listed (using the categorical
variable as your grouping variable) and overlay the jittered points
• a histogram for the second numeric variable listed with a smoothed histogram
overlaid (that uses the kernel type, not the normal curve)
After your code, you should have a comment that answers the following question:
a) (1pt) How many modes are there on the smoothed histogram? That is, how
many times does a peak appear?

6. (6pts total) (4pts) Write code to run an analysis to see if the average number of casual
users differs depending on the year variable. Output plots to inspect assumptions and
95% confidence interval(s) for any difference of means that makes sense for the
problem. Then report the following in a comment after the code:
a) (1pt) Report all confidence intervals
b) (1pt) Interpret one of the intervals in the context of the problem.
Note: If you have a choice between an equal variance interval or an unequal variance
interval, report the equal variance interval – not everyone will have this choice.

7. (5pts total) (3pts) Write code to run a correlation analysis between the temp, casual
users, and registered users variables. Your code should output scatterplots and
histograms using the same step that creates the correlations.
After your code, you should have a comment that answers the following question(s):
a) (1pt) Report all correlations
b) (1pt) Make a note of which correlations differ significantly from 0 using a
significance level of 0.05.

8. (11pts total) (8pts) Write code to fit a multiple linear regression analysis using total
users as your response and temp, humidity, and their interaction as predictors. Your
analysis should output 90% confidence intervals for the slope parameters, produce
diagnostic plots for checking assumptions, and predict the average response given the
following values of your predictors:
• Temp = 0.6, humidity = 0.6
• Temp = 0.1, humidity = 0.1
After your code, you should have a comment that answers the following question(s):
a) (1pt) Report all confidence intervals for the slopes
b) (1pt) Note which slope terms differ significantly from 0 at the 0.1 significance level
c) (1pt) Report the confidence intervals for the means predicted on your new

Save this program and upload it to wolfware! Great work!