ST300 Assessed Coursework 2012

Earnings Prediction

Candidate Number: 59342

Date Submitted: 16/03/2012

Time Submitted: 11:00 AM

1 Introduction and Summary
1.1 Objective
We are given a dataset comprising certain characteristics of 27,504 individuals from the United States.
These characteristics include the following:
Continuous variables: age, capital_gain, and hrs_wkly.
Categorical variables: workclass, education, marital_status, occupation, relationship, race and gender.
Response variable (binary): Earnings.
The variable Earnings takes value 1 if an individual earns more than $50k in a year and 0 otherwise. Our
aim is to use the remaining ten variables to predict if an individual makes more than $50k or not.
We use R to come up with a satisfactory logistic regression model for Earnings in terms of the other
variables. In doing so, we aim to obtain as parsimonious a model as possible – one that is simple, fits the
data well and can be easily interpreted.
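As a sketch of the modelling approach described above (the file name and data-frame name here are hypothetical; the variables are as listed in Section 1.1):

```r
earn <- read.csv("earnings.csv", stringsAsFactors = TRUE)  # hypothetical file name

# Baseline additive logistic regression of Earnings on all ten covariates
fit <- glm(Earnings ~ age + capital_gain + hrs_wkly + workclass + education +
             marital_status + occupation + relationship + race + gender,
           family = binomial, data = earn)
summary(fit)  # coefficient estimates, standard errors and Wald z-tests
```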
1.2 Summary of Results
We present our proposed model as an equation involving the true probability π of an individual earning
more than $50k, the linear predictor η, the intercept β₀, the explanatory variables and their coefficients:

logit(π) = η = β₀ + β₁∗age + β₂∗log(age) + β₃∗hrs_wkly + β₄∗hrs_wkly²
             + β₅∗capital_gain + β₆∗I(capital_gain==0) + β₇∗I(capital_gain<0)
             + β_education + β_relationship + β_occupation + β_relationship:hrs_wkly∗hrs_wkly
For each of the categorical variables education, relationship and occupation, we have merged some levels
together and given them new names. (See Section 2.2.1 for more details.) Using the merged levels, we
obtain the following estimates for the coefficients:
β₀ = -0.330, β₁ = -0.187, β₂ = 9.16, β₃ = 0.0957, β₄ = -0.000768, β₅ = 0.000633, β₆ = 2.12,
β₇ = 4.37,
together with the level-specific coefficients β_education, β_relationship, β_occupation and
β_relationship:hrs_wkly, each of which takes one value per merged level of its factor (listed below), with
the baseline levels absorbed into the intercept.
The intercept term corresponds to education = 12th-grade-and-below, relationship = Husband and
occupation = AC-AF-FF-MOI (which is a merged level for Adm-clerical, Armed-Forces, Farming-fishing and
Machine-op-inspct).
At this juncture, we note the important point that if an individual’s capital gain in a year is more than $50k,
we will definitely have Earnings = 1 for this individual. We explore the implications of this for our analysis in
Section 2.1. Our model here predicts the probability of earning more than $50k for an individual whose
capital gain does not exceed $50k.
The level-specific coefficients take one value per merged level: β_education for Assoc-degree, Bachelors,
Professional-or-doctorate, HS-grade, Masters and Some-college; β_relationship and β_relationship:hrs_wkly
for Not-in-family, Other-relative, Own-child, Unmarried and Wife; and β_occupation for Craft-repair,
Exec-managerial, Handlers-cleaners, Other-service and Priv-house-serv, together with 0.651 for
Prof-specialty, 0.867 for Protective-serv, 0.613 for Sales, 0.943 for Tech-support and 0.181 for
Transport-moving.

1.3 Interpretation of Results
We give a few interesting observations and implications arising from our model:

• The coefficients for relationship are largest in magnitude, followed by those for education. These
suggest that relationship and education are important characteristics in determining the probability
of earning more than $50k.
• From Chart 1, we see that the estimated probability π̂ appears to first increase and then decrease with
age. It reaches a peak at about age = 45, suggesting that people of around that age are more likely to earn
more than $50k. The validity of this statement can, however, be challenged by the observation that the
variability of π̂ appears to increase with age.
• The above observations about age also apply to hrs_wkly, with π̂ appearing to peak at around
hrs_wkly = 70. (See Chart 2.)
• The Wald test statistic for the term I(capital_gain==0) is associated with a p-value less than 2E-16,
indicating that the effect of this term is statistically significant. This makes intuitive sense, since the
characteristics of individuals with zero capital gain tend to differ from those of individuals whose capital
gain is non-zero.
• As discussed in Section 2.3.2, the overall effect of the interaction term relationship:hrs_wkly is
statistically significant. This result suggests that each additional hour of work per week has a different
impact on π for individuals with different relationship statuses.
2 Analysis
We now proceed to describe the steps taken in the analysis that led to the chosen model.
2.1 Data Cleaning and Checking
Before any analysis, we first examine the structure of the dataset.
We find that there are 139 individuals whose capital gain alone exceeds $50k. Their probability of earning
more than $50k in a year is therefore exactly one. If we carry out logistic regression directly on the data, we
will obtain very large standard errors for some of our coefficient estimates, resulting in a distorted fit.
We have therefore removed these 139 data points from our subsequent analysis. It is worth mentioning
again that the model derived without these data points applies only to individuals with at most $50k in
capital gain.
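In R, this removal is a one-line subset (the data-frame name earn and the dollar units of capital_gain are our assumptions):

```r
# Drop the 139 individuals whose capital gain alone exceeds $50k,
# since Earnings = 1 is guaranteed for them
earn2 <- subset(earn, capital_gain <= 50000)
```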
We also find that no individual with occupation = Priv-house-serv or workclass = Without-pay earns more
than $50k in a year. Unlike with capital_gain, however, there is no structural reason why these patterns
could not have arisen by chance. Hence, there is no strong justification for removing these data points
from the dataset.
Since the number of distinct covariate patterns is large relative to the number of observations, grouping
the data into binomial counts is impractical, so we work directly with the binary responses.
2.2 Grouping Factor Levels and Transforming Continuous Variables
2.2.1 Categorical Variables
In order to group levels of a categorical variable together, we first regress Earnings on each variable
separately. We then test for differences among levels using the Wald test at the 1% significance level. For
example, suppose reg.occupation is the name of the glm model with Earnings regressed on occupation.
We execute the following command (linearHypothesis from the car package):

linearHypothesis(reg.occupation, c("occupationArmed-Forces",
  "occupationArmed-Forces=occupationFarming-fishing",
  "occupationFarming-fishing=occupationMachine-op-inspct"))

This command carries out a Wald test of the equality of the four occupation levels. The test statistic of
0.2274 is compared to the chi-squared critical value on 3 degrees of freedom. The associated p-value is
0.973, indicating that the larger model (here, the one without the specified restrictions) is not significantly
better. We proceed to group Adm-clerical, Armed-Forces, Farming-fishing and Machine-op-inspct together.


Repeating this procedure for each factor (except gender, which has only two levels), we end up with the following:
occupation: • Adm-clerical, Armed-Forces, Farming-fishing and Machine-op-inspct have been
merged into AC-AF-FF-MOI.
education: • Preschool, 1st-4th, 5th-6th, 7th-8th, 9th, 10th, 11th and 12th have been merged into
12th-grade-and-below.
• Assoc-acdm and Assoc-voc have been merged into Assoc-degree.
• Prof-school and Doctorate have been merged into Professional-or-Doctorate.
marital_status: • Married-AF-spouse and Married-civ-spouse have been merged into Married-spouse-
race: • Asian-Pac-Islander and White have been merged into Asian-Pac-Islander-White.
• Amer-Indian-Eskimo, Black and Other have been merged into Other.
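A merge of this kind can be carried out in base R by reassigning factor levels, e.g. for occupation (the data-frame name is assumed):

```r
# Collapse four occupation levels into the single merged level AC-AF-FF-MOI;
# assigning the same name to several levels merges them into one
levels(earn2$occupation)[levels(earn2$occupation) %in%
    c("Adm-clerical", "Armed-Forces",
      "Farming-fishing", "Machine-op-inspct")] <- "AC-AF-FF-MOI"
```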

2.2.2 Continuous Variables
For age and hrs_wkly, we try two transformations, namely square and logarithmic, as well as linear
combinations of them. We regress Earnings on each variable, and then add the transformed variables one-
by-one to the model. We then compare between nested models using the change in deviance to decide if a
transformed variable should remain in the model.
We also compare the deviances informally between two models that are not nested. We choose the model
that leads to the largest drop in residual deviance without increasing standard errors of the coefficient
estimates by too much.
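Such a nested comparison can be sketched as follows, taking the age transformations as an example (the object names are ours):

```r
fit.a1 <- glm(Earnings ~ age, family = binomial, data = earn2)
fit.a2 <- update(fit.a1, . ~ . + I(log(age)))  # add the log transformation

# Change in deviance between the nested models, referred to a
# chi-squared distribution
anova(fit.a1, fit.a2, test = "Chisq")
```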
For capital_gain, we plot observed proportions of individuals who earn more than $50k (see Chart 3), and
notice that they seem to be split into three distinct groups: those corresponding to individuals with negative,
zero and positive capital gain. Hence, we try adding the two indicator variables I(capital_gain==0) and
I(capital_gain<0) to the model that has only capital_gain. We carry out Wald tests to determine the
significance of the effect of each added variable, as well as the significance of the combined effect of both
added variables. We find that the two added categories significantly enhance the fit of our model.
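In R, the two indicators can be written directly in the model formula (data-frame name assumed):

```r
# capital_gain together with indicators for zero and negative capital gain
fit.cg <- glm(Earnings ~ capital_gain + I(capital_gain == 0) + I(capital_gain < 0),
              family = binomial, data = earn2)
summary(fit.cg)  # Wald z-tests for the individual indicator terms
```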
After we obtain a satisfactory model with transformed variables and/or categories, we examine plots of
fitted and observed proportions of individuals earning more than $50k to ensure visual goodness-of-fit:

Chart 1: Fitted and observed proportions for a model including
age, age^2 and log(age).

Chart 2: Fitted and observed proportions for a model including
hrs_wkly and hrs_wkly^2.


Chart 3: Fitted and observed proportions for a model including
capital_gain, I(capital_gain==0) and I(capital_gain<0). (Note:
Observed proportions here have been tabulated in $1,000
intervals of capital_gain.)
We conclude that our models with transformed variables fit the observed data very well.
2.3 Variable Selection
2.3.1 Main Effects
We proceed to select the grouped/transformed variables to include in our final model. As a preliminary step,
we conduct automatic stepwise variable selection using the step command, running it in both directions
(forward and backward). This method yields a model containing all ten variables but without the age^2
term. Even had the automatic search picked out fewer variables, we would remain wary of this method, as
it excludes human discretion and subject-area expertise from model selection.
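The automatic search can be reproduced along these lines (the data-frame and model names are ours; the dot expands to all remaining variables):

```r
# Full additive model, then stepwise search in both directions
fit.full <- glm(Earnings ~ ., family = binomial, data = earn2)
fit.step <- step(fit.full, direction = "both", trace = FALSE)
formula(fit.step)  # the model the automatic search settles on
```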
We manually select variables in an iterative manner, similar to forward selection. We start by considering all
45 combinations of two variables and choosing the “best” model among these. To generate models with
one more variable, we combine the variables in this “best” model with each of the remaining variables. We
then choose the “best” model among the ones that arise from these combinations. By repeating the last two
steps, we keep adding one more variable each time until either all variables are exhausted, or until a
significant drop in residual deviance can no longer be achieved.
When we choose the “best” model, we are effectively selecting the model that best satisfies the following
criteria, in decreasing order of importance:
1. Large, significant drop in residual deviance when moving from a smaller model to a larger one. (The
magnitude of the drop is compared informally across models, while the significance of the drop is
checked using the likelihood ratio test for nested models.)
2. Small standard errors for estimated coefficients.
3. Large p-value obtained using the residuals.lrm command. (See Section 2.4.2.)
4. Low AIC.
5. Large number of residual degrees of freedom.
6. Small Pearson chi-squared statistic and large associated p-value. (We recognise that since the
data are binary, this test is not strictly valid; we therefore use this measure sparingly in the variable
selection process.)
We hope to arrive at a desirably small model using this procedure. However, it turns out that our procedure
eventually leads to a ten-variable model. As such, we exercise our judgement regarding which variables to
include. We exclude gender, race and age^2 since their overall effects are not as significant as others. We
exclude marital_status since it is very similar, both in terms of intrinsic meaning and estimated coefficients,
to relationship. We exclude workclass since we cannot justifiably group the level Without-pay to any other
level (if we include workclass without grouping Without-pay with any other levels, we get large standard
errors for coefficient estimates).

On these grounds, our final model includes the following variables: education, relationship, occupation,
capital_gain, I(capital_gain==0), I(capital_gain<0), age, log(age), hrs_wkly and hrs_wkly^2.
2.3.2 Interactions
We adapt the above methodology to choose the interaction that we want to include. (In order to preserve
model simplicity, we have chosen to include only one interaction term. We are also considering only
interactions between two variables.) We find that relationship:hrs_wkly has the smallest p-value
associated with the Wald test for its overall effect (this corresponds to criterion 1 in the above
methodology). Including relationship:hrs_wkly also produces one of the largest drops in residual deviance
among the candidate interaction terms. The resulting p-value of 5.166e-14 is highly significant:
Analysis of Deviance Table

Model 1: Earnings ~ education + relationship + occupation + capital_gain + I(capital_gain == 0) +
I(capital_gain < 0) + age + I(log(age)) + hrs_wkly + I(hrs_wkly^2)
Model 2: Earnings ~ education + relationship + occupation + capital_gain + I(capital_gain == 0) +
I(capital_gain < 0) + age + I(log(age)) + hrs_wkly + I(hrs_wkly^2) + relationship:hrs_wkly

Resid. Df Resid. Dev Df Deviance P(>|Chi|)
1 27336 17858
2 27331 17787 5 71.429 5.166e-14 ***
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Table 1: Analysis of deviance table showing the significance of the interaction term relationship:hrs_wkly.
We conduct a visual inspection of the interaction plots, and indeed, we see that several lines are not
parallel, indicating the presence of interaction effects.

Chart 4: Interaction plot between relationship and hrs_wkly.
(Note: Observed proportions here have been tabulated in ten-
hourly intervals of hrs_wkly.)

Chart 5: Interaction plot between hrs_wkly and relationship.
(Note: Observed proportions here have been tabulated in ten-
hourly intervals of hrs_wkly.)
2.4 Regression Diagnostics
2.4.1 Residual Plots and Removal of Outliers

Chart 6: Pearson’s residual plot against index of the data. Note the two
extreme residuals.

A quick glance at Chart 6 reveals the existence of two extreme Pearson residuals. After analysing the
Pearson residuals r_i further, we find that 19 data points have |r_i| > 10. The same 19 data points, plus
one additional data point, have deviance residuals d_i such that |d_i| > 3. We acknowledge that the
common rule of thumb for Pearson and deviance residuals is to examine those with |r_i| > 2 and |d_i| > 2.
However, we provide reasons later for why we think it is very difficult to achieve |r_i| < 2 and |d_i| < 2 for
most i.
The indices of the 20 data points for which |d_i| > 3 are as follows:
1843, 3014, 3155, 3813, 4320, 5514, 6024, 6499, 7882, 8147, 12964, 13128, 13784,
17049, 19364, 19671, 19742, 20254, 25053, 26623.
(Data points 12964 and 17049 correspond to points 12902 and 16966 in the Pearson’s residual plot in
Chart 6 respectively. This shift in index is due to the removal of data corresponding to individuals with more
than $50k in capital gain.)
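The residual screening described here can be carried out as follows (fit.final is our name for the proposed model):

```r
r <- residuals(fit.final, type = "pearson")   # Pearson residuals r_i
d <- residuals(fit.final, type = "deviance")  # deviance residuals d_i
which(abs(r) > 10)  # indices of the 19 extreme Pearson residuals
which(abs(d) > 3)   # indices of the 20 extreme deviance residuals
```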
Most of these data points give large residuals because the response variable takes on a value opposite to
what our model predicts. For instance, some data points, such as 3813, 6024, 12964 and 17049,
correspond to individuals with characteristics that our model predicts will give them a very high chance of
earning more than $50k (such as high capital gain), but they eventually do not. Conversely, other data
points, like 4320, 6499, 13128 and 19671, correspond to individuals with characteristics that our model
predicts will give them a very low probability of earning more than $50k (such as education = 12th-grade-
and-below, relationship = unmarried etc.), but it turns out they earn more than $50k.
We propose that either higher-order interactions need to be considered, or factors not included among the
ten given variables may be at play. This is because even when all ten variables are fitted in an additive
model, we still do not achieve |r_i| < 2 and |d_i| < 2 for most i (although we do not show the residual plots
for the ten-variable additive model here, they look very similar to those of our proposed model, in that most
Pearson residuals satisfy |r_i| < 10 and most deviance residuals satisfy |d_i| < 3). After removing the 20
data points that our proposed model cannot accurately predict (for the reasons discussed above), the
following diagnostic plots are the best we can achieve for our model:

Chart 7: Pearson’s residual plot against index of the data.

Chart 8: Deviance residual plot against index of the data.
2.4.2 Goodness-of-Fit Test
Since the number of covariate patterns is large relative to the number of observations, Pearson's chi-
squared test is unsuitable for checking goodness-of-fit. Moreover, as the data are binary, we cannot use
the residual deviance to test goodness-of-fit either. We instead use the residuals.lrm command to test
whether the current model is adequate against a more general alternative. The p-value associated with
this test is 0.08213949 > 0.05, which suggests that there is no significant lack of fit.
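The residuals.lrm command comes from the rms package and, with type = "gof", implements le Cessie and van Houwelingen's unweighted-sum-of-squares goodness-of-fit test; using it requires refitting the chosen model with lrm and retaining the design matrix (the data-frame and object names are ours):

```r
library(rms)
fit.lrm <- lrm(Earnings ~ education + relationship + occupation + capital_gain +
                 I(capital_gain == 0) + I(capital_gain < 0) + age + I(log(age)) +
                 hrs_wkly + I(hrs_wkly^2) + relationship:hrs_wkly,
               data = earn2, x = TRUE, y = TRUE)  # keep design matrix and response
resid(fit.lrm, type = "gof")  # reports the test statistic and its p-value
```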