ALY2010 Project 6
Instructor: Dr. Dee Chiluiza, PhD
Correlation, regression analysis and chi-square test
Overview and Rationale
This project will help you measure your understanding of basic concepts in analytics.
It will help you measure your skills in R, RStudio, and R Markdown.
It will help you measure your understanding of correlation, regression analysis, and the chi-square test.
It will help you measure your ability to apply critical thinking to make meaningful observations about your data
analysis results.
Support file
Use the attached R Markdown file (Project_6_Template.Rmd) as a template for your answers.
Assignment
Part 1. Title and Introduction
Prepare your report using R Markdown, and present your report using an HTML file.
1. Title: Present a title to your report.
2. Introduction: Present an informative introduction section. This will measure your understanding of
the topic and of the analytical processes used for data analysis.
Your introduction needs good information and good organization. This applies to any report you make.
Present each topic in its own paragraph.
• Regression: Using your own words, talk about the significance of the regression analysis. Provide a practical
example from the financial or market industries.
• Chi-Square: Using your own words, talk about the significance of using chi-square tests and their application
in the industry.
Use Bluman as a reference. Also present at least one additional academic reference for each topic.
Part 2. Analysis section
Task 1. Correlation and regression analysis
1.1 Data set description.
Run ?faithful in the console and read the documentation for the faithful data set. This is a public data set.
Using your own words, describe the data set.
1.2 What is the coefficient of correlation between eruptions and waiting? Create an object named: corr_coef =
1.3 Explain the meaning of the coefficient of correlation.
1.4 What is the coefficient of determination between eruptions and waiting? Create an object named:
determ_coef =
1.5 Explain the meaning of the coefficient of determination.
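As a minimal sketch of 1.2 and 1.4, assuming Pearson's correlation (the default of cor()) is the intended measure, the two coefficients can be computed from the built-in faithful data as follows. Squaring r to obtain the coefficient of determination holds for simple linear regression with a single predictor.

```r
# faithful ships with base R (datasets package), so no download is needed.
# Assumption: Pearson's r, which is the default method of cor().
corr_coef = cor(faithful$eruptions, faithful$waiting)

# For one predictor, the coefficient of determination is r squared.
determ_coef = corr_coef^2
```

Because both objects are assigned to names, nothing is printed when the chunk is knitted; the values can then be reported with inline R code.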
1.6 Obtain the linear regression model for eruptions and waiting. Create an object named: Linear_reg =
1.7 Write the linear regression formula.
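One way to approach 1.6 and 1.7 is sketched below. It assumes that eruptions is the response and waiting is the predictor, which matches the axes requested for the scatter plot but is not stated explicitly in the task. The names intercept, slope, and reg_formula are illustrative helpers, not names required by the assignment.

```r
# Assumption: eruptions (y) is modeled as a linear function of waiting (x).
Linear_reg = lm(eruptions ~ waiting, data = faithful)

# Extract the fitted coefficients for the formula
#   eruptions = intercept + slope * waiting
intercept = coef(Linear_reg)[["(Intercept)"]]
slope = coef(Linear_reg)[["waiting"]]

# Build the formula as text, rounded to three decimals (illustrative helper)
reg_formula = sprintf("eruptions = %.3f + %.3f * waiting", intercept, slope)
```

If the roles of the two variables are swapped, the coefficients change, so state in your report which variable you treated as the response.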
1.8 Present a scatter plot of eruptions versus waiting and, using the regression model you obtained in 1.6,
add the regression line to the plot.
The plot should have a good title and good x- and y-axis labels, the data points must be drawn as triangles (check the pch
codes), and the regression line must have a color.
Check this page for pch codes:
http://www.sthda.com/english/wiki/r-plot-pch-symbols-the-different-point-shapes-available-in-r
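A minimal sketch of the plot requested in 1.8, using base R graphics. The title, axis labels, and colors are placeholders to replace with your own. The model is refit here, assuming eruptions is the response, so that the chunk runs on its own.

```r
# Assumption: eruptions is the response, so it goes on the y-axis.
Linear_reg = lm(eruptions ~ waiting, data = faithful)

plot(faithful$waiting, faithful$eruptions,
     pch = 17,                                   # 17 = filled triangle, 2 = open triangle
     col = "gray30",
     main = "Eruption duration vs. waiting time (placeholder title)",
     xlab = "Waiting time to next eruption (min)",
     ylab = "Eruption time (min)")

# Draw the fitted regression line from the lm object in color
abline(Linear_reg, col = "red", lwd = 2)
```

abline() reads the intercept and slope directly from the lm object, so the line always matches the model from 1.6.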
1.9 Describe the direction of the regression line and explain what it tells you about your data.
Important: Notice that the template Rmd file already contains an R chunk where you can enter your
R code.
Task 2. Chi-square Goodness-of-fit test
Customers per day in store. Imagine that you own a store and you want to know whether the
number of customers who visit your store differs from day to day, Monday through Saturday. To answer this question,
you collect data for three weeks. Your data are the following:
Prepare one single R chunk to enter all your code. Remember to assign every result to a named object to prevent
its output from being displayed in your report; you will present your answers using inline R code.
Important: all the following tasks must be prepared in the same R chunk.
You will apply the following formula:
χ² = Σ (O − E)² / E, where O is the observed value and E is the expected value.
Check M12 Lecture ChiSQ.pptx; I made modifications to slides 10, 11 and 12.
2.1 Create vectors to enter the data for each day.
2.2 Create an object named table1 that holds a matrix with the data. If your matrix is well done, it should look like
this:
Do not present this table in your report; just create the object: table1 = matrix()
2.3 Create a vector for the days of the week:
days = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday")
2.4 Create a vector for the week numbers:
weeks = c("Week1", "Week2", "Week3")
2.5 Read my R file “Vectors and Matrices. R”.
Use the vector days to provide names to the rows and the vector weeks to provide names to the columns.
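Steps 2.1 to 2.5 could be sketched as follows. The customer counts are HYPOTHETICAL placeholders invented for illustration, not the assignment data; replace them with the values from the assignment table. The day-vector names (monday, tuesday, and so on) are also illustrative.

```r
# HYPOTHETICAL counts (Week1, Week2, Week3) for each day; not the real data.
monday    = c(10, 12, 14)
tuesday   = c(8, 10, 12)
wednesday = c(9, 11, 13)
thursday  = c(11, 13, 15)
friday    = c(14, 16, 18)
saturday  = c(20, 22, 24)

# byrow = TRUE makes each day vector one row: 6 days x 3 weeks
table1 = matrix(c(monday, tuesday, wednesday, thursday, friday, saturday),
                nrow = 6, byrow = TRUE)

days = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday")
weeks = c("Week1", "Week2", "Week3")

# Name the rows after the days and the columns after the weeks
rownames(table1) = days
colnames(table1) = weeks
```

Filling by row matters here: with the default column-wise fill, the three values of each day would be spread across different rows.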
2.6 Transform your table1 into a data frame. If your table is well done, it should look like this:
Do not present this table in your report; just create the object: table1 = data.frame()
2.7 Read my R Markdown file “1 Calculated_field. Rmd”.
Create a new object named table1a and use mutate() to create a new column with the mean of
Week1, Week2, and Week3. Name this column Observed.
In the same mutate() call, create a new column named Expected. This is the sum of the observed values divided by
the number of days, so it is the same in every row. Remember, if customers have no preference for any day, then the number of
customers visiting the store is the same each day.
In the same mutate() call, create a new column named OmE to calculate Observed minus Expected.
In the same mutate() call, create a new column named "(OmE)^2" to calculate the squares of OmE. Notice
that this name is in quotation marks because it contains special characters.
In the same mutate() call, create a new column named "((OmE)^2)/E" to divide (OmE)^2 by the Expected
values. Notice that this name is in quotation marks because it contains special characters.
Your table should now look like this:
I hid the values; you must calculate them.
Important. Sometimes the row names are lost when applying mutate(). If this happens to you, use
pipes (%>%) to process the data, and use rownames_to_column() (from the tibble package) to keep the row names as a column. I will help
you with this strategy:
table1a = table1 %>%
  rownames_to_column('Days') %>%
  mutate( )
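A hedged sketch of how the mutate() call in 2.7 might be filled in, assuming the dplyr and tibble packages are installed. The counts are HYPOTHETICAL placeholders chosen so the arithmetic is easy to follow; they are not the assignment data.

```r
library(dplyr)   # mutate(), n(), %>%
library(tibble)  # rownames_to_column()

days = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday")

# HYPOTHETICAL counts; replace with the assignment data.
table1 = data.frame(Week1 = c(10, 8, 9, 11, 14, 20),
                    Week2 = c(12, 10, 11, 13, 16, 22),
                    Week3 = c(14, 12, 13, 15, 18, 24),
                    row.names = days)

table1a = table1 %>%
  rownames_to_column("Days") %>%                 # keep the day names as a column
  mutate(
    Observed = (Week1 + Week2 + Week3) / 3,      # mean of the three weeks
    Expected = sum(Observed) / n(),              # same value in every row
    OmE = Observed - Expected,
    "(OmE)^2" = OmE^2,                           # quoted: name has special characters
    "((OmE)^2)/E" = `(OmE)^2` / Expected         # backticks refer to the quoted column
  )
```

Columns created earlier in a mutate() call are available to later expressions in the same call, which is why all five columns fit in one step. With these placeholder counts, Expected is 14 in every row.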
2.8 Create an object named chisq_value and use it to calculate the chi-square test value. It must be calculated
from the table you created. Remember that it is the sum of all (O-E)^2/E values, the last column you created.
2.9 Create an object named alpha to enter the value α = 0.01
2.10 Create an object named df to calculate the degrees of freedom.
2.11 Create an object named cv to calculate the critical value of your test.
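Steps 2.8 to 2.11 can be sketched in base R as below. The Observed values are HYPOTHETICAL placeholders, and the small table is rebuilt here only so the chunk runs on its own; in the report these columns would come from table1a. The sketch assumes a goodness-of-fit test with k - 1 degrees of freedom, where k is the number of days.

```r
# HYPOTHETICAL observed daily means; replace with your own table1a.
table1a = data.frame(
  Days = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"),
  Observed = c(12, 10, 11, 13, 16, 22)
)
table1a$Expected = sum(table1a$Observed) / nrow(table1a)
table1a[["((OmE)^2)/E"]] = (table1a$Observed - table1a$Expected)^2 / table1a$Expected

# Chi-square value: sum of the last column
chisq_value = sum(table1a[["((OmE)^2)/E"]])

alpha = 0.01

# Goodness-of-fit: degrees of freedom = number of categories - 1
df = nrow(table1a) - 1

# Right-tailed critical value: the (1 - alpha) quantile of chi-square with df
cv = qchisq(1 - alpha, df)

# Optional cross-check against R's built-in test (equal expected proportions)
check_value = unname(chisq.test(table1a$Observed)$statistic)
```

The cross-check works because chisq.test() applied to a single vector runs a goodness-of-fit test with equal expected proportions by default; check_value is an illustrative name, not part of the assignment.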
2.12 Create an object named table1b to prepare your table using the knitr::kable() code. Make sure to use
only one decimal place for your data.
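A minimal sketch of 2.12, assuming the knitr package is installed. The small data frame holds HYPOTHETICAL values and stands in for your own table1a; the caption text is a placeholder.

```r
# HYPOTHETICAL stand-in for table1a; use your own table in the report.
table1a = data.frame(
  Days = c("Monday", "Tuesday", "Wednesday"),
  Observed = c(12.333, 10.667, 11.5),
  Expected = c(11.5, 11.5, 11.5)
)

# digits = 1 rounds numeric columns to one decimal place in the rendered table
table1b = knitr::kable(table1a, digits = 1,
                       caption = "Customers per day (placeholder caption)")
```

Rounding happens only in the displayed table, so chisq_value and the other objects keep their full precision.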
Important:
At this point, if you knit your document, the R chunk should appear without any white output box. This is because
you assigned names to all your values, calculations, vectors, and tables.
Using inline R code with two `` at each side (``r ``), complete the following:
2.13 α = ``r ``
2.14 Critical value = ``r ``
2.15 Chi-square value = ``r ``
2.16 Is Chi-Square higher than critical value? = ``r ``
2.17 Based on the answer obtained in 2.16, do you have enough evidence to reject H0?
2.18 Present your table1b here: ``r ``
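The comparison in 2.16 and the decision in 2.17 could be prepared as named objects like this. The chi-square value is a HYPOTHETICAL placeholder, and is_higher and decision are illustrative names that the assignment does not require.

```r
# HYPOTHETICAL test statistic; in the report, use the chisq_value from your table.
chisq_value = 7
alpha = 0.01
df = 5                      # six days minus one
cv = qchisq(1 - alpha, df)  # right-tailed critical value

# TRUE when the statistic falls in the rejection region
is_higher = chisq_value > cv

# Text that could be reported with inline R code
decision = ifelse(is_higher,
                  "Reject H0",
                  "Fail to reject H0")
```

With these placeholder numbers the statistic is below the critical value, so is_higher is FALSE; your own data may lead to the opposite decision.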
Important: Notice that the template Rmd file already contains an R chunk where you can enter your R
code.
Part 3. Conclusions
Write your conclusions.
Part 4. Bibliography
Write your Bibliography section to present all your references.
Due date
Tuesday April 20 at 11:59 PM
Grade
50 points.