Date: 2021-11-06

Professor Kate Ashley
MISM 6202
Foundations of Data Analysis for Business
PROBLEM SET 1
Use the file ‘start-data.csv’ for all questions. Begin your work on this problem set by importing
the file ‘start-data.csv’ into RStudio.
Bluebikes is a public bike share system with stations located throughout the Greater Boston
area. Users can purchase a monthly or annual membership that offers unlimited rides of up to
30 minutes, or buy one-time-use passes for either a single 30-minute ride or unlimited 2-hour
rides within a 24-hour period (the “adventure pass”). Rides that extend beyond their time limit
incur additional charges. After purchasing a pass or membership, riders use a kiosk or mobile
app to unlock a bike at any station in the system, bike along their desired route, and dock the
bike at any station in the system that has open spaces available. Bluebikes staff occasionally
redistribute bikes across the various docking stations in the city in order to optimize bike and
dock availability.
The Northeastern Student Affairs team is interested in understanding current usage statistics
and trends for the Bluebikes stations closest to campus. The team has identified five stations
(Mass Ave T Station; Northeastern University North Parking Lot; Ruggles T Station; Tremont St
at Northampton St; and Wentworth University) and collected data for all rides originating at or
terminating at these five stations during the month of August. Data captured include ride date,
trip duration in minutes, start and end station information (name, internal station ID, and latitude
and longitude coordinates), the internal ID number of the bike used in the trip, user type
(Subscriber if the user has a membership; Customer if the user purchased a one-time pass) and
the rider’s ZIP code. Student Affairs has asked you to help with some preliminary data analysis
for the rides originating at the stations near campus, which are contained in the file
‘start-data.csv’.
Use RStudio to answer the following questions. Provide your written answers, along with any
relevant tables and charts, in a single PDF file. Any charts or tables included in your report
should be properly labeled and formatted for an audience of company executives. Screenshots
from an R output are not appropriate for this assignment. Additionally, your answers to each
item listed below must be clearly numbered in order to receive full credit. You should also submit
a single .R script file with your code for the analysis. No other file formats will be accepted for
this assignment, and you should submit just one file of each type.
Data Wrangling.
1. Which variables, if any, appear to contain missing, null, or incorrect values? After
reviewing the rest of the assignment instructions, briefly describe the impact you expect
missing values to have on your analysis for this problem set, and your approach to
handling these data errors.
2. Using the z-score method (with z-score threshold of +/- 3), what are the cutoff values for
identifying outliers in the ‘tripduration’ variable? Do you believe this method of identifying
outliers is well-suited to the ‘tripduration’ data? Explain.
3. Using the boxplot/IQR method, what are the cutoff values for identifying outliers in the
‘tripduration’ variable? Do you believe this method of identifying outliers is well-suited to
the ‘tripduration’ data? Explain.
4. Student Affairs has decided that only rides of one hour or less should be included in your
analysis, as they are most interested in commuting trends and longer rides are more
likely to be for leisure purposes. Subset the data, creating a dataframe with only rides
that are up to 60 minutes in duration. How many rides are in the resulting dataframe?
For all remaining analyses, use only data for rides that are 60 minutes or less in duration.
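The data-wrangling steps in items 1-4 can be sketched in R as follows. This is a minimal illustration on synthetic data, not the contents of ‘start-data.csv’: the values generated below, the object names (rides, missing_counts, z_cut, iqr_cut, short_rides), and the treatment of empty strings as missing are all assumptions made for the example.

```r
# Synthetic stand-in for the ride data (NOT the real file contents).
set.seed(1)
rides <- data.frame(
  tripduration = c(round(rexp(200, rate = 1 / 15), 1), 600, NA),
  usertype = sample(c("Subscriber", "Customer", ""), 202, replace = TRUE,
                    prob = c(0.70, 0.25, 0.05)),
  stringsAsFactors = FALSE
)

# Item 1: count NA or empty-string entries in each column.
missing_counts <- sapply(rides, function(col) sum(is.na(col) | col == ""))
print(missing_counts)

# Work with the non-missing durations for the outlier cutoffs.
x <- rides$tripduration[!is.na(rides$tripduration)]

# Item 2: z-score cutoffs are mean +/- 3 standard deviations.
z_cut <- mean(x) + c(-3, 3) * sd(x)

# Item 3: boxplot/IQR cutoffs are Q1 - 1.5 * IQR and Q3 + 1.5 * IQR.
q <- quantile(x, c(0.25, 0.75))
iqr_cut <- unname(c(q[1] - 1.5 * IQR(x), q[2] + 1.5 * IQR(x)))

# Item 4: keep only rides of 60 minutes or less (dropping missing durations).
short_rides <- subset(rides, !is.na(tripduration) & tripduration <= 60)
print(nrow(short_rides))
```

With the real file, rides would instead come from read.csv("start-data.csv"), and the missing-value check may need to cover other placeholder codes that appear in the data.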
Visualization & Descriptive Statistics.
5. Create a histogram for the ‘tripduration’ variable. Describe the shape of the distribution
and provide a brief (1-2 sentence) explanation of why the observed distribution shape
might be expected for the ‘tripduration’ variable.
6. Create a contingency table and an accompanying stacked or clustered column chart to
summarize and visualize the variables ‘start.station.name’ and ‘usertype’. In 1-2
sentences, describe any noteworthy patterns or insights you observe from this table and
chart.
7. Create a table showing the average trip duration for rides originating at each of the start
stations in the dataframe. In 1-2 sentences, describe any noteworthy patterns or insights
you observe from this table.
8. Create a table showing the average trip duration for rides taken by each ‘usertype’ in the
dataframe (Customer and Subscriber). In 1-2 sentences, describe any noteworthy
patterns or insights you observe from this table.
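One possible way to produce the chart and tables in items 5-8 with base R is sketched below. The data are synthetic and the station labels ("Station A" and so on) are placeholders, not the five stations in the file; only the column names ‘start.station.name’, ‘usertype’, and ‘tripduration’ come from the assignment.

```r
# Synthetic rides, already limited to 60 minutes or less (NOT the real data).
set.seed(2)
n <- 300
rides <- data.frame(
  start.station.name = sample(c("Station A", "Station B", "Station C"),
                              n, replace = TRUE),
  usertype = sample(c("Subscriber", "Customer"), n, replace = TRUE,
                    prob = c(0.8, 0.2)),
  tripduration = pmin(round(rexp(n, rate = 1 / 12), 1), 60)
)

# Item 5: histogram with report-ready labels.
hist(rides$tripduration, breaks = 20,
     main = "Trip duration (synthetic data)",
     xlab = "Minutes", ylab = "Number of rides")

# Item 6: contingency table (user type by start station) and a stacked chart.
tab <- table(rides$usertype, rides$start.station.name)
print(tab)
barplot(tab, legend.text = TRUE,
        main = "Rides by start station and user type (synthetic data)",
        xlab = "Start station", ylab = "Number of rides")

# Items 7 and 8: average trip duration per start station and per user type.
mean_by_station <- aggregate(tripduration ~ start.station.name,
                             data = rides, FUN = mean)
mean_by_user <- aggregate(tripduration ~ usertype, data = rides, FUN = mean)
print(mean_by_station)
print(mean_by_user)
```

Passing beside = TRUE to barplot() gives a clustered chart instead of a stacked one.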
Probability.
9. Suppose you randomly select a ride from the data set. What is the probability that the
selected ride was taken by a Subscriber, as defined by the ‘usertype’ variable?
10. Is the probability of selecting a Subscriber independent of the start station? Briefly
describe, citing at least two conditional probabilities in your explanation.
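The independence check in item 10 compares the marginal probability P(Subscriber) with conditional probabilities P(Subscriber | station); if the events were independent, these would be equal. A small sketch using hypothetical counts (the numbers and station labels below are invented for illustration):

```r
# Hypothetical ride counts by start station and user type (NOT real data).
counts <- matrix(c(80, 20,
                   45, 55),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(station = c("Station A", "Station B"),
                                 usertype = c("Subscriber", "Customer")))

# Marginal probability that a randomly selected ride is by a Subscriber.
p_sub <- sum(counts[, "Subscriber"]) / sum(counts)

# Conditional probabilities P(Subscriber | station): row proportions.
p_sub_given_station <- prop.table(counts, margin = 1)[, "Subscriber"]

print(p_sub)                # 125 / 200 = 0.625
print(p_sub_given_station)  # Station A: 0.80, Station B: 0.45
```

In this invented example the conditional probabilities differ from the marginal one, so user type and start station would not be independent.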
Sampling Distributions.
11. Treating the data you have collected as the population of all rides taken from the
selected stations in August, suppose you repeatedly selected random samples of 50
rides and calculated mean trip duration and proportion of users of type ‘Customer’ in
each sample. (a) What are the mean and standard deviation of the resulting sampling
distribution of sample mean? (b) What are the mean and standard deviation of the
resulting sampling distribution of sample proportion?
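For item 11, the standard results are that the sample mean has expected value mu and standard error sigma / sqrt(n), and the sample proportion has expected value p and standard error sqrt(p(1 - p) / n). The sketch below applies these formulas to a synthetic population and checks them by simulation. It assumes the population standard deviation is computed by dividing by N, and it ignores the finite population correction, which is small when n is a small fraction of N.

```r
# Synthetic "population" of rides (NOT the real data).
set.seed(3)
N <- 5000
pop <- data.frame(
  tripduration = pmin(rexp(N, rate = 1 / 12), 60),
  usertype = sample(c("Subscriber", "Customer"), N, replace = TRUE,
                    prob = c(0.8, 0.2))
)
n <- 50

# (a) Sampling distribution of the sample mean.
mu <- mean(pop$tripduration)
sigma <- sqrt(mean((pop$tripduration - mu)^2))  # population SD (divide by N)
se_mean <- sigma / sqrt(n)

# (b) Sampling distribution of the sample proportion of 'Customer'.
p <- mean(pop$usertype == "Customer")
se_prop <- sqrt(p * (1 - p) / n)

# Simulation check: draw 2000 samples of size n and summarize each one.
sims <- replicate(2000, {
  idx <- sample(N, n)
  c(xbar = mean(pop$tripduration[idx]),
    phat = mean(pop$usertype[idx] == "Customer"))
})
print(c(mu = mu, se_mean = se_mean, p = p, se_prop = se_prop))
print(rowMeans(sims))      # should be close to mu and p
print(apply(sims, 1, sd))  # should be close to se_mean and se_prop
```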
Statistical Inference.
12. Choose a random sample of 50 rides from the data and store these observations in a
new dataframe called ‘sample.df’. Estimate a 95% confidence interval for population
mean trip duration based on your sample. Does the confidence interval include the true
population mean trip duration? Show your work.
13. Using the observations in ‘sample.df’, calculate the sample proportion of rides taken by
user type ‘Customer’ and estimate a 95% confidence interval for the population
proportion of ‘Customer’ user types. Does the confidence interval include the true
population proportion? Show your work.
14. Calculate the average trip duration for all rides taken during the first week of August
(8/1-8/7) and store this value as ‘mu0’. Next, create a dataframe called ‘week4’ that
includes only rides taken during the final week of August (8/25-8/31). Run the command
set.seed(999) and then take a sample of size 100 from the ‘tripduration’ variable in
the ‘week4’ dataframe (i.e., sample from week4$tripduration).
(a) State the null and alternative hypotheses to test whether the average trip length
during the final week of August is higher than the average trip length in the first week
(mu0). Then, perform the hypothesis test using significance level α = 0.05. Show your
work and clearly state your decision in the test.
(b) Without running set.seed() again, re-run the lines of code that select a sample and
calculate sample statistics (xbar and s) and the test statistic (tstat) several times. What
do you observe in repeating the sampling and hypothesis test procedure? Optional:
Write code to take 100 different samples, calculating the test statistic for each one, and
briefly summarize the resulting hypothesis test decisions.
15. [Open-ended]. What other analyses would you be interested in performing with these
data? Think about the data fields you have and what would be interesting to know. In 1-2
short paragraphs, describe at least two specific analyses that could be done using the
variables in this dataset. Optional: Perform these analyses and create visualizations to
show the results!
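The interval estimates and the one-sided test in items 12-14 can be sketched as follows. This is an illustration under stated assumptions, not a worked answer: the population is synthetic, mu0 is a hypothetical value rather than the first-week average, and the test sample is drawn from the whole synthetic population because no ‘week4’ subset exists here. The names xbar, s, and tstat follow the assignment text; the other names are invented.

```r
# Synthetic "population" of rides (NOT the real data).
set.seed(999)
N <- 3000
pop <- data.frame(
  tripduration = pmin(rexp(N, rate = 1 / 12), 60),
  usertype = sample(c("Subscriber", "Customer"), N, replace = TRUE,
                    prob = c(0.8, 0.2))
)

# Item 12: 95% t-interval for the population mean from a sample of 50.
n <- 50
sample.df <- pop[sample(N, n), ]
xbar <- mean(sample.df$tripduration)
s <- sd(sample.df$tripduration)
ci_mean <- xbar + c(-1, 1) * qt(0.975, df = n - 1) * s / sqrt(n)
mu <- mean(pop$tripduration)
covers_mu <- ci_mean[1] <= mu && mu <= ci_mean[2]

# Item 13: 95% normal-approximation interval for the 'Customer' proportion.
phat <- mean(sample.df$usertype == "Customer")
ci_prop <- phat + c(-1, 1) * qnorm(0.975) * sqrt(phat * (1 - phat) / n)

# Item 14(a): H0: mu <= mu0 versus Ha: mu > mu0, at alpha = 0.05.
mu0 <- 11                        # hypothetical stand-in for the week-1 mean
x <- sample(pop$tripduration, 100)
tstat <- (mean(x) - mu0) / (sd(x) / sqrt(length(x)))
pval <- pt(tstat, df = length(x) - 1, lower.tail = FALSE)
reject <- pval < 0.05

# Cross-check against the built-in one-sample t-test.
tt <- t.test(x, mu = mu0, alternative = "greater")

# Item 14(b), optional part: repeat the test on 100 fresh samples.
decisions <- replicate(100, {
  xi <- sample(pop$tripduration, 100)
  t.test(xi, mu = mu0, alternative = "greater")$p.value < 0.05
})
print(list(ci_mean = ci_mean, covers_mu = covers_mu, ci_prop = ci_prop,
           tstat = tstat, pval = pval, reject = reject))
print(table(decisions))
```

Because set.seed() is called only once, each repeated sample differs, so the test statistic and sometimes the decision change from run to run.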