Final Project Part 1
Research Proposal and Data Introduction
Due: October 8, 2023 by 11:59PM ET
Goal of the Assessment:
• To have the opportunity to work on a topic of interest to them and to be creative about
this topic.
• To experience the process of conducting a small literature review and incorporating
knowledge gained into analysis.
• To think about whether a research question and/or a dataset is appropriate for use with
linear regression.
• To create a draft of the components to be included in an introduction section of a report,
as well as summary figures and/or tables for a results section.
Learning Outcomes being Assessed:
• Apply multiple linear models on various datasets using R statistical software.
• Differentiate the relationships modelled using qualitative predictors, interactions between
predictors, and continuous predictors.
• Create appropriate residual plots to evaluate model assumptions for a given data set using
software.
• Recognize distinct patterns in appropriate residual plots and correctly conclude which
assumption is violated.
• Report the results of a residual plot analysis and recommend a course of action.
Instructions:
1. Students will need to locate open-source data in an area of interest to them that meets
the data requirements listed below. Some examples could be (but are certainly not
limited to) sports, medicine, public health, economics, video games, literature, etc.
2. Students will need to then define an explicit research question using the information in
that dataset. Note that students will need to ensure and show that linear regression can
be used to answer this question with this dataset.
3. Students must also locate 3 peer-reviewed academic papers related to their specific
research question or topic of interest. Students will need to describe how each article
relates back to their proposed research question, as well as rank it for its usefulness in
informing them about the population relationship being estimated.
4. Students will need to select at least 9 variables from their dataset to be predictors in a
multiple linear model, with at least one of these being categorical in nature. A
justification for why each variable is chosen must be provided. The model will then be
fit, and a complete residual analysis will be conducted to assess the model assumptions.
5. Lastly, students will provide a table that numerically summarizes each variable used in
their preliminary model, with an informative caption that highlights any interesting
features of the variables (e.g., skew, possible outliers or nonsensical observations, high
spread, missing values).
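As a rough illustration of instruction 4, the sketch below fits a multiple linear model that includes one categorical predictor and draws the usual residual plots. The data are simulated and every variable name (`y`, `x1`, `x2`, `group`) is a hypothetical placeholder, so this is only one possible layout and not a required approach; a real proposal would use the group's own dataset and its full set of predictors.

```r
# Simulated stand-in data; all variable names are hypothetical placeholders.
set.seed(302)
n <- 1000
dat <- data.frame(
  x1    = rnorm(n, mean = 50, sd = 10),
  x2    = runif(n, min = 0, max = 5),
  group = factor(sample(c("A", "B", "C"), n, replace = TRUE))
)
dat$y <- 2 + 0.5 * dat$x1 - 1.2 * dat$x2 + 3 * (dat$group == "B") +
  rnorm(n, sd = 4)

# Fit the model; lm() creates indicator variables for the factor automatically.
fit <- lm(y ~ x1 + x2 + group, data = dat)

# Residuals and fitted values used in the diagnostic plots.
res         <- resid(fit)
fitted_vals <- fitted(fit)

# Four diagnostic panels: residuals vs fitted, residuals vs each numeric
# predictor, and a normal QQ plot of the residuals.
par(mfrow = c(2, 2))
plot(fitted_vals, res, xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs fitted")
abline(h = 0, lty = 2)
plot(dat$x1, res, xlab = "x1", ylab = "Residuals", main = "Residuals vs x1")
abline(h = 0, lty = 2)
plot(dat$x2, res, xlab = "x2", ylab = "Residuals", main = "Residuals vs x2")
abline(h = 0, lty = 2)
qqnorm(res)
qqline(res)
par(mfrow = c(1, 1))
```

In plots like these, a curved trend usually points to a linearity problem, a fanning pattern to non-constant variance, and strong departures from the QQ line to non-normal errors.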
Dataset Requirements:
• Dataset must be open-source and the website where it was found/downloaded from
must be provided.
• MUST contain at least 1000 observations (i.e., rows).
• MUST contain 1 response variable suitable for linear regression and at least 9 predictor
variables. Categorical variables with multiple levels count as 1 variable.
• Since at least one predictor will need to be categorical, you may convert one of your
numerical variables to categorical if no such variable is available in your downloaded
dataset. However, you will need to justify your choice of variable and categorization in
part B.2. of the proposal.
• Should NOT be from an educational resource, such as a textbook dataset. If you’re not
sure, please ask the instructor or one of the TAs.
• If the dataset was found in a data repository (e.g., Kaggle, UCI Repository, etc.), you
MUST ensure that your research question is different from the original usage of the
data.
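A quick check of the size requirements, and one way to turn a numerical variable into a categorical one, might look like the sketch below. The data frame `raw`, the column names, and the age cut-points are all made up for illustration; the choice of variable and categories for a real dataset still needs its own justification.

```r
# Simulated stand-in for a downloaded dataset; names and values are hypothetical.
set.seed(1)
raw <- data.frame(
  response = rnorm(1200),
  age      = sample(18:90, 1200, replace = TRUE)
)

# Check the row requirement and count the candidate predictors
# (every column other than the response).
meets_rows   <- nrow(raw) >= 1000
n_predictors <- ncol(raw) - 1

# Convert a numerical variable into a categorical one with cut().
# The break points here are arbitrary and chosen only for the example.
raw$age_group <- cut(
  raw$age,
  breaks = c(18, 35, 65, 90),
  labels = c("18-35", "36-65", "66-90"),
  include.lowest = TRUE
)

# Inspect how many observations fall into each category.
table(raw$age_group)
```

Setting `include.lowest = TRUE` keeps observations equal to the smallest break (here, age 18) from being turned into missing values.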
Proposal Format:
Groups must complete each portion of the Final Project Part 1 Template. The proposal
document should be no more than 5 pages in length, which includes plots and tables. Keep
responses brief and to the point while ensuring that you address each point noted in the rubric
requirements for that portion of the proposal.
What to Submit:
Only ONE member of the group should submit ALL required submission components. A
complete submission to Quercus will include:
1. The original downloaded dataset as a CSV file (if file is too large, save it on a cloud-based
storage service (e.g. OneDrive) and include a shareable link as a comment on your
submission).
2. The cleaned dataset containing the variables used in your preliminary model and data
summary as a CSV file (if file is too large, save it on a cloud-based storage service (e.g.
OneDrive) and include a shareable link as a comment on your submission).
3. The completed proposal template, saved as a PDF.
4. The R code should be provided in the appendix or in a separate R Markdown file
containing the code used to subset and clean the data, fit the model, produce a
summary table, and conduct the residual analysis for checking assumptions.
Failure to meet these submission requirements, including incorrect format of components,
missing components, and cloud links that do not allow shared access, will result in a one-mark
deduction on the grade of the proposal.
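For the cleaned dataset in item 2, the subsetting and cleaning code could follow a pattern like the one sketched here. The data frame `original`, its columns, and the rule treating negative `hours` as invalid are hypothetical; the file is written to a temporary folder only so the example runs anywhere, whereas a real project would save it alongside the other submission files.

```r
# Simulated stand-in for the original download; all names are hypothetical.
set.seed(7)
original <- data.frame(
  id      = 1:1200,
  outcome = rnorm(1200, mean = 100, sd = 15),
  hours   = c(rnorm(1195, mean = 40, sd = 5), rep(-1, 5)),  # -1 is impossible
  notes   = "unused"
)

# Keep only the variables used in the preliminary model, and drop rows
# with an impossible value for hours.
keep    <- c("outcome", "hours")
cleaned <- original[original$hours >= 0, keep]

# Save the cleaned data as a CSV file without the row-name column.
out_path <- file.path(tempdir(), "cleaned_data.csv")
write.csv(cleaned, out_path, row.names = FALSE)

# Read the file back to confirm it was written as expected.
check <- read.csv(out_path)
nrow(check)
```

It is worth confirming how many rows remain after cleaning, since removing observations changes the size of the dataset used in the model.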
Dataset Resources:
Should your group have difficulty locating a suitable dataset that meets the group's interests
and the dataset requirements, your group can consider using one of the datasets below:
• Ames Housing dataset
• NHANES survey dataset
• AirBnB dataset (needs you to create a free account)
• Million Song dataset
• NBA player dataset
Library Resources (for locating and citing academic papers):
• How to search for academic articles
• Using search operators to find articles
• Limiting search to peer-reviewed articles
• Why and how to cite your references
• Help getting the correct citation format
• Exporting a citation
RMarkdown Resources:
• Settings for displaying or not displaying R code in knitted document
• Adding captions and other plotting features
• Creating tables in RMarkdown using kable or manually
• Exporting plots in RStudio
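A numerical summary table with a caption can be assembled in several ways; the sketch below builds one with base R and prints it with `knitr::kable()` when the knitr package is installed. The data frame `clean`, its columns, and the caption text are invented for the example, and only numerical variables are handled here (a categorical variable would more naturally be summarized by counts per level).

```r
# Simulated stand-in for a cleaned dataset; names and values are hypothetical.
set.seed(42)
clean <- data.frame(
  price = rlnorm(1000, meanlog = 12, sdlog = 0.4),
  area  = rnorm(1000, mean = 1500, sd = 300)
)
clean$area[sample(1000, 15)] <- NA  # introduce some missing values

# Summary statistics for one numerical variable, ignoring missing values.
summarise_var <- function(x) {
  c(
    Mean    = mean(x, na.rm = TRUE),
    SD      = sd(x, na.rm = TRUE),
    Min     = min(x, na.rm = TRUE),
    Median  = median(x, na.rm = TRUE),
    Max     = max(x, na.rm = TRUE),
    Missing = sum(is.na(x))
  )
}

# One row per variable, one column per statistic.
summary_tab <- round(t(sapply(clean, summarise_var)), 2)

# Print with a caption via kable if knitr is available, otherwise as a matrix.
if (requireNamespace("knitr", quietly = TRUE)) {
  print(knitr::kable(
    summary_tab,
    caption = "Numerical summaries of model variables (example caption)."
  ))
} else {
  print(summary_tab)
}
```

Comparing the mean with the median in such a table gives a quick hint of skew, and the Missing column shows where values are absent.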