SOC470 Homework 2
Due Date: Feb 23, 2022, 11:59pm (later than what the syllabus says!)
Preview the notebook, and upload the .html to Canvas. You must show all your work in this file. I should be
able to run all the chunks and get the same results you do.
Note: whenever I say correlation, I mean Pearson’s correlation.
Getting ready
Load the packages we’ll need
library(lsr)
library(descr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(haven)
library(psych)
library(stringr)
Load the GSS data
d <- readRDS("gss2021.rds")
Note: the GSS has missing data! Not all respondents answered every question. Thus, you have to adjust
your programming to deal with this. See notes on 2-1-2022.
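As a minimal sketch of what "adjusting for missing data" can look like in base R: many summary functions return NA when any value is missing, unless you tell them to drop the missing values. The vectors x and y below are toy values invented for illustration, not GSS data.

```r
# Toy vectors made up for illustration; NA marks a missing response
x <- c(1, 2, NA, 4, 5)
y <- c(2, 4, 6, NA, 10)

# Without na.rm, a single NA makes the whole result NA
mean_with_na <- mean(x)

# na.rm = TRUE drops missing values before computing the mean
mean_no_na <- mean(x, na.rm = TRUE)

# cor() uses the `use` argument instead of na.rm;
# "complete.obs" keeps only rows where both x and y are observed
r <- cor(x, y, use = "complete.obs")

print(mean_with_na)
print(mean_no_na)
print(r)
```

The same na.rm and use arguments apply when the inputs are columns of a data frame.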
Part 1: Educational degree and concern for the environment.
The gss variable “degree” is the respondent’s highest educational degree. “grncon” is their answer to the
question, “Generally speaking, how concerned are you about environmental issues? Please tell me what you
think, where 1 means you are not at all concerned and 5 means you are very concerned.”
Question 1a
What is the mean of educational degree (assuming we treat educational degree as a numeric scale, like the
GSS does)?
# your code here
YOUR ANSWER to 1a:
Question 1b
What is the mean of concern for the environment (assuming we treat concern as a 1 to 5 scale, like the GSS
does)?
# your code here
YOUR ANSWER to 1b:
Question 1c: Scatterplot
Show a scatterplot of educational degree (as your X variable) and concern for the environment (as your
Y variable). Then tell me, based only on the scatterplot, whether you see a positive, negative, or null
association, and why.
# your code here
YOUR ANSWER to 1c:
Question 1d: Correlations
What is the Pearson’s correlation between educational degree and concern for the environment, and
does this represent a positive, negative, or null association between educational degree and concern for the
environment?
# your code here
YOUR ANSWER to 1d:
Question 1e: Scatterplot versus Correlation
Are your answers to 1c and 1d the same or different, and why?
YOUR ANSWER to 1e:
Part 2: Correlation matrices
To quickly examine many correlations, create a subset dataframe of the GSS data, called d2, that has
only the following variables: age (respondent’s age in years), tvhours (how much tv respondent watches, in
hours), degree (highest educational degree, on 0-4 scale), realinc (respondent’s income in dollars) and happy
(happiness scale, from 1-3). Hint: use tidy syntax here!
Note: the GSS happiness scale is as follows: VERY HAPPY = 1, PRETTY HAPPY = 2, NOT TOO HAPPY
= 3
Using the d2 dataframe you created, run a correlation matrix on these variables, using “pairwise.complete.obs”
to handle missing data.
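The general pattern (subset the columns, then pass the subset to cor()) can be sketched on toy data as follows. The data frame toy and its columns a, b, c, and extra are hypothetical placeholders, not GSS variables; the sketch assumes dplyr is installed, as in the setup above.

```r
library(dplyr)

# Hypothetical toy data; column names are placeholders for illustration
toy <- data.frame(
  a = c(1, 2, 3, 4, NA),
  b = c(2, 4, 6, NA, 10),
  c = c(5, 3, 4, 1, 2),
  extra = c("u", "v", "w", "x", "y")
)

# Keep only the numeric columns of interest
toy2 <- toy %>% select(a, b, c)

# With "pairwise.complete.obs", each correlation uses every row
# in which both variables of that pair are observed
m <- cor(toy2, use = "pairwise.complete.obs")
print(round(m, 2))
```

Because rows are dropped pair by pair, different cells of the matrix may be based on different numbers of observations.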
Question 2a: strongest correlation with happiness
Create the d2 dataframe as described above and tell me which variable has the strongest correlation with
happiness.
# your code here
YOUR ANSWER to 2a:
Question 2b: interpreting the correlation coefficient
Carefully consider the correlation coefficient, the variables, and the numbers of the happiness scale, as listed
above in my note. Do you think the association between happiness and the variable in 2a is positive, negative,
or null – and why?
YOUR ANSWER to 2b:
Part 3: Validity of correlations
Estimate a correlation between “relig” and “tvhours”. tvhours is the amount of TV watched, in hours. relig is
the respondent’s religious affiliation, where:
value label
1 protestant
2 catholic
3 jewish
4 none
5 other
6 buddhism
7 hinduism
8 other eastern religions
9 muslim/islam
10 orthodox-christian
11 christian
12 native american
13 inter-nondenominational
Is the correlation you calculated valid? Can we draw any conclusions about a respondent’s religious affiliation
and their TV-watching time? Why or why not?
# your code here
YOUR ANSWER to 3:
Part 4: Human performance
This dataset on human performance is slightly modified from
https://www.kaggle.com/kukuroo3/body-performance-data
p <- readRDS("performance.rds")
Question 4a: boxplots
In a single plot, show boxplots comparing men’s and women’s performance on sit-ups. The variables you need
are called “sex” (men = 0, women = 1) and “sit_up_counts” (a count of how many sit-ups the person could
do).
Based on what you see in the boxplot, which sex has a higher median sit-up count?
# your code here
YOUR ANSWER to 4a:
Question 4b: Correlations
Which performance outcome has a stronger correlation with sex: gripForce (how much force someone can grip with;
higher values mean a stronger grip) or broad_jump_cm (how far someone can jump, in cm)? Use correlations to
answer this question and explain how you came to your conclusion.
# your code here
YOUR ANSWER to 4b:
Question 4c: Plots with custom axes
Do a scatterplot of “age” and “body_fat_percent”, but change the y-axis so it goes from 0 to 100. Based on
the plot, do you think the association between age and body fat is positive, negative, or null, and why?
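As a hedged illustration of setting axis limits in base R graphics: plot() accepts a ylim argument of the form c(lower, upper). The vectors x and y below are invented toy values, not columns from the performance data.

```r
# Made-up values for illustration only; not the performance data
x <- c(21, 30, 42, 55, 63)
y <- c(18, 22, 25, 29, 31)

# ylim takes c(lower, upper) and fixes the range of the y-axis
y_limits <- c(0, 100)
plot(x, y, ylim = y_limits, xlab = "x (toy)", ylab = "y (toy)")
```

Widening the axis changes how steep a pattern looks, so compare what you see against the default plot.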
# your code here
YOUR ANSWER to 4c: