
ST300 Assessed Coursework 2012

Earnings Prediction

Candidate Number: 59342

Date Submitted: 16/03/2012

Time Submitted: 11:00 AM


1 Introduction and Summary

1.1 Objective

We are given a dataset comprising certain characteristics of 27,504 individuals from the United States.

These characteristics include the following:

Continuous variables: age, capital_gain, and hrs_wkly.

Categorical variables: workclass, education, marital_status, occupation, relationship, race and gender.

Response variable (binary): Earnings.

The variable Earnings takes value 1 if an individual earns more than $50k in a year and 0 otherwise. Our

aim is to use the remaining ten variables to predict if an individual makes more than $50k or not.

We use R to come up with a satisfactory logistic regression model for Earnings in terms of the other

variables. In doing so, we aim to obtain as parsimonious a model as possible – one that is simple, fits the

data well and can be easily interpreted.

1.2 Summary of Results

We present our proposed model as an equation involving the true probability π of an individual earning more than $50k, the linear predictor η, the intercept β₀, the explanatory variables and their coefficients:

logit(π) = η = β₀ + β₁·age + β₂·log(age) + β₃·hrs_wkly + β₄·hrs_wkly²
             + β₅·capital_gain + β₆·I(capital_gain = 0) + β₇·I(capital_gain < 0)
             + β_education + β_relationship + β_occupation + β_(relationship:hrs_wkly)·hrs_wkly

For each of the categorical variables education, relationship and occupation, we have merged some levels together and given them new names. (See Section 2.2.1 for more details.) Using the merged levels, we obtain the following estimates for the coefficients:

β₀ = -0.330, β₁ = -0.187, β₂ = 9.16, β₃ = 0.0957, β₄ = -0.000768, β₅ = 0.000633, β₆ = 2.12, β₇ = 4.37,

β_education =
   1.27   for Assoc-degree
   1.95   for Bachelors
   2.95   for Professional-or-doctorate
   0.814  for HS-grade
   2.21   for Masters
   1.17   for Some-college

β_relationship =
  -3.36   for Not-in-family
  -1.08   for Other-relative
  -3.47   for Own-child
  -4.10   for Unmarried
   1.78   for Wife

β_occupation =
   0.303  for Craft-repair
   1.03   for Exec-managerial
  -0.389  for Handlers-cleaners
  -0.625  for Other-service
  -6.115  for Priv-house-serv
   0.651  for Prof-specialty
   0.867  for Protective-serv
   0.613  for Sales
   0.943  for Tech-support
   0.181  for Transport-moving

β_(relationship:hrs_wkly) =
   0.0278  for Not-in-family:hrs_wkly
  -0.0280  for Other-relative:hrs_wkly
   0.0132  for Own-child:hrs_wkly
   0.0363  for Unmarried:hrs_wkly
  -0.0305  for Wife:hrs_wkly

The intercept term corresponds to education = 12th-grade-and-below, relationship = Husband and occupation = AC-AF-FF-MOI (which is a merged level for Adm-clerical, Armed-Forces, Farming-fishing and Machine-op-inspct).

At this juncture, we note the important point that if an individual’s capital gain in a year is more than $50k, we will definitely have Earnings = 1 for this individual. We explore the implications of this for our analysis in Section 2.1. Our model here predicts the probability of earning more than $50k for an individual whose capital gain does not exceed $50k.

1.3 Interpretation of Results

We give a few interesting observations and implications arising from our model:


• The coefficients for relationship are largest in magnitude, followed by those for education. This suggests that relationship and education are important characteristics in determining the probability of earning more than $50k.

• From Chart 1, we see that π appears to first increase and then decrease with age. It reaches a peak at about age = 45, suggesting that people of around that age are more likely to earn more than $50k. The validity of this statement can, however, be challenged by the observation that the variability of π appears to increase with age.

• The above observations about age also apply to hrs_wkly, with π appearing to peak at around hrs_wkly = 70. (See Chart 2.)

• The Wald test statistic for the term I(capital_gain==0) is associated with a p-value less than 2E-16. This indicates that the effect of the term is statistically significant. This makes intuitive sense, since the characteristics of the individuals in the category I(capital_gain==0) tend to be different from those whose capital gain is non-zero.

• As discussed in Section 2.3.2, the overall effect of the interaction term relationship:hrs_wkly is statistically significant. This result suggests that every additional hour of work each week has a different impact on π for individuals with different relationship statuses.
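These probabilities are linked to the linear predictor through the inverse-logit transformation; the sketch below shows the mechanics in base R (the value of eta is illustrative, not one fitted above):

```r
# Inverse-logit: recover the probability pi from a linear predictor eta.
# plogis() is base R's logistic CDF, i.e. exp(eta) / (1 + exp(eta)).
eta    <- -0.5                 # illustrative value of the linear predictor
pi_hat <- plogis(eta)

# For a fitted glm object the same quantities are available directly:
#   predict(fit, type = "link")       # the linear predictor eta
#   predict(fit, type = "response")   # the fitted probability pi
pi_hat
```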

2 Analysis

We now proceed to describe the steps taken in the analysis that led to the chosen model.

2.1 Data Cleaning and Checking

Before any analysis, we first examine the structure of the dataset.

We find that there are 139 individuals who earn more than $50k in capital gain. Their probability of earning

more than $50k in a year is therefore exactly one. If we carry out logistic regression directly on the data, we

will obtain very large standard errors for some of our estimates of coefficients, resulting in a distorted

model.

Thus, we have taken the liberty of removing these 139 data points from our subsequent analysis. It is worth

mentioning again that the model derived without these data points is one that applies only to individuals

with less than $50k in capital gain.

We also find that no individual with occupation = Priv-house-serv or workclass = Without-pay earns more than $50k in a year. Unlike with capital_gain, however, there is no reason why these patterns could not have arisen by chance. Hence, there is no strong justification for removing these data points from the dataset.

Since the number of covariate patterns in the dataset is large relative to the number of observations, it is more feasible to work directly with the binary data.

2.2 Grouping Factor Levels and Transforming Continuous Variables

2.2.1 Categorical Variables

In order to group levels of a categorical variable together, we first regress Earnings on each variable

separately. We then test for differences among levels using the Wald test at 1% significance level. For

example, suppose reg.occupation is the name of the glm model with Earnings regressed on occupation.

We execute the following command:

    linearHypothesis(reg.occupation,
                     c("occupationAdm-clerical = occupationArmed-Forces",
                       "occupationArmed-Forces = occupationFarming-fishing",
                       "occupationFarming-fishing = occupationMachine-op-inspct"))

This command essentially carries out a Wald test. Here, the test statistic of 0.2274 is compared to the chi-squared critical value on 3 degrees of freedom (one for each equality restriction). The associated p-value is 0.973, indicating that the larger model (here the one without the specified restrictions) is not significantly better. We proceed to group Adm-clerical, Armed-Forces, Farming-fishing and Machine-op-inspct together.
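Since three equality restrictions are imposed, the Wald statistic is referred to a chi-squared distribution on 3 degrees of freedom; the quoted p-value can be reproduced in base R:

```r
# Upper-tail chi-squared probability for the Wald statistic 0.2274
# on 3 degrees of freedom (one per equality restriction).
p <- pchisq(0.2274, df = 3, lower.tail = FALSE)
round(p, 3)   # 0.973
```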

ST300 Assessed Coursework 2012 Candidate Number: 59342

4

Repeating this procedure for each factor (except gender), we end up with the following:

occupation: • Adm-clerical, Armed-Forces, Farming-fishing and Machine-op-inspct have been merged into AC-AF-FF-MOI.

education: • Preschool, 1st-4th, 5th-6th, 7th-8th, 9th, 10th, 11th and 12th have been merged into

12th-grade-and-below.

• Assoc-acdm and Assoc-voc have been merged into Assoc-degree.

• Prof-school and Doctorate have been merged into Professional-or-Doctorate.

marital_status: • Married-AF-spouse and Married-civ-spouse have been merged into Married-spouse-

present.

race: • Asian-Pac-Islander and White have been merged into Asian-Pac-Islander-White.

• Amer-Indian-Eskimo, Black and Other have been merged into Other.
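Merges of this kind can be performed in base R by reassigning factor levels, since giving two old levels the same new name collapses them into one; a minimal sketch with a toy factor (the data here are illustrative):

```r
# Toy education factor; in the real data this would be a column of the dataset.
education <- factor(c("Assoc-acdm", "Assoc-voc", "Bachelors",
                      "Doctorate", "Prof-school"))

# Assigning a duplicated level name merges the corresponding levels.
lv <- levels(education)
lv[lv %in% c("Assoc-acdm", "Assoc-voc")]  <- "Assoc-degree"
lv[lv %in% c("Prof-school", "Doctorate")] <- "Professional-or-Doctorate"
levels(education) <- lv

table(education)
```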

2.2.2 Continuous Variables

For age and hrs_wkly, we try two transformations, namely square and logarithmic, as well as linear

combinations of them. We regress Earnings on each variable, and then add the transformed variables one-

by-one to the model. We then compare between nested models using the change in deviance to decide if a

transformed variable should remain in the model.

We also compare the deviances informally between two models that are not nested. We choose the model

that leads to the largest drop in residual deviance without increasing standard errors of the coefficient

estimates by too much.
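The nested-model comparison described above can be carried out with anova(); the following self-contained sketch uses simulated data (the data-generating step is illustrative, not the coursework dataset):

```r
set.seed(1)
n   <- 500
age <- runif(n, 18, 80)
y   <- rbinom(n, 1, plogis(-8 + 3 * log(age)))   # true model uses log(age)

m1 <- glm(y ~ age, family = binomial)                # smaller model
m2 <- glm(y ~ age + I(log(age)), family = binomial)  # adds transformed term

# The change in deviance between nested models is referred to chi-squared.
anova(m1, m2, test = "Chisq")
```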

For capital_gain, we plot observed proportions of individuals who earn more than $50k (see Chart 3), and

notice that they seem to be split into three distinct groups: those corresponding to individuals with negative,

zero and positive capital gain. Hence, we try adding the two categorical variables I(capital_gain==0) and

I(capital_gain<0) to the model that has only capital_gain. We carry out Wald tests to determine the

significance of the effect of each added variable, as well as the significance of the combined effect of both

added variables. We find that the two added categories significantly enhance the fit of our model.
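The two indicator variables can be added directly inside the glm formula with I(); a self-contained sketch on simulated data (the variable values are illustrative):

```r
set.seed(2)
n            <- 400
capital_gain <- sample(c(-50, 0, 200, 1500), n, replace = TRUE)
p            <- plogis(-1 + 0.002 * capital_gain + 1.5 * (capital_gain == 0))
Earnings     <- rbinom(n, 1, p)

# I() lets logical indicators enter the formula as additional regressors.
fit <- glm(Earnings ~ capital_gain + I(capital_gain == 0) + I(capital_gain < 0),
           family = binomial)
summary(fit)$coefficients
```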

After we obtain a satisfactory model with transformed variables and/or categories, we examine plots of

fitted and observed proportions of individuals earning more than $50k to ensure visual goodness-of-fit:

Chart 1: Fitted and observed proportions for a model including

age, age^2 and log(age).

Chart 2: Fitted and observed proportions for a model including

hrs_wkly and hrs_wkly^2.


Chart 3: Fitted and observed proportions for a model including

capital_gain, I(capital_gain==0) and I(capital_gain<0). (Note:

Observed proportions here have been tabulated in $1,000

intervals of capital_gain.)

We conclude that our models with transformed variables fit the observed data very well.

2.3 Variable Selection

2.3.1 Main Effects

We proceed to select the grouped/transformed variables to include in our final model. As a preliminary step,

we conduct automatic stepwise variable selection using the step command. We do it both ways (i.e.

forward and backward). Using this method, we end up with a model that contains all ten variables and excludes only the age^2 term. Even if the automatic selection method had picked out fewer than ten variables, we would remain wary of it, as it excludes human discretion and subject-area expertise from model selection.
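The automatic search can be reproduced with the step command; a self-contained sketch on simulated data (the variables x1 to x3 are illustrative):

```r
set.seed(3)
n  <- 300
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)   # x3 carries no signal
y  <- rbinom(n, 1, plogis(x1 + 0.5 * x2))

full <- glm(y ~ x1 + x2 + x3, family = binomial)

# Stepwise selection in both directions, scored by AIC.
chosen <- step(full, direction = "both", trace = 0)
formula(chosen)
```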

We manually select variables in an iterative manner, similar to forward selection. We start by considering all

45 combinations of two variables and choosing the “best” model among these. To generate models with

one more variable, we combine the variables in this “best” model with each of the remaining variables. We

then choose the “best” model among the ones that arise from these combinations. By repeating the last two

steps, we keep adding one more variable each time until either all variables are exhausted, or until a

significant drop in residual deviance can no longer be achieved.

When we choose the “best” model, we are effectively selecting the model that best satisfies the following

criteria, in decreasing order of importance:

1. Large, significant drop in residual deviance when moving from a smaller model to a larger one. (The

magnitude of the drop is compared informally across models, while the significance of the drop is

checked using the likelihood ratio test for nested models.)

2. Small standard errors for estimated coefficients.

3. Large p-value obtained using the residuals.lrm command. (See Section 2.4.2.)

4. Low AIC.

5. Large number of residual degrees of freedom.

6. Small Pearson’s chi-squared statistic and large associated p-value. (We recognise that since the data is binary, this test is not advisable; we use this measure sparingly in the variable selection process.)

We hope to arrive at a desirably small model using this procedure. However, it turns out that our procedure

eventually leads to a ten-variable model. As such, we exercise our judgement regarding which variables to

include. We exclude gender, race and age^2, since their overall effects are not as significant as those of the other variables. We exclude marital_status since it is very similar, both in terms of intrinsic meaning and estimated coefficients, to relationship. We exclude workclass since we cannot justifiably group the level Without-pay with any other level (if we include workclass without grouping Without-pay with another level, we get large standard errors for the coefficient estimates).


Following this line of argument, we decide that our final model will include the following variables: education,

relationship, occupation, capital_gain, I(capital_gain==0), I(capital_gain<0), age, log(age), hrs_wkly and

hrs_wkly^2.

2.3.2 Interactions

We adapt the above methodology to choose the interaction that we want to include. (In order to preserve

model simplicity, we have chosen to include only one interaction term. We are also considering only

interactions between two variables.) We find that relationship:hrs_wkly has the most significant p-value associated with the Wald test for its overall effect (this corresponds to criterion 1 in the above methodology). Inclusion of the relationship:hrs_wkly term also results in one of the largest drops in residual deviance, when compared to the inclusion of other interaction terms. The p-value of 5.166e-14 shown below is highly significant:

Analysis of Deviance Table

Model 1: Earnings ~ education + relationship + occupation + capital_gain + I(capital_gain == 0) +

I(capital_gain < 0) + age + I(log(age)) + hrs_wkly + I(hrs_wkly^2)

Model 2: Earnings ~ education + relationship + occupation + capital_gain + I(capital_gain == 0) +

I(capital_gain < 0) + age + I(log(age)) + hrs_wkly + I(hrs_wkly^2) + relationship:hrs_wkly

Resid. Df Resid. Dev Df Deviance P(>|Chi|)

1 27336 17858

2 27331 17787 5 71.429 5.166e-14 ***

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Table 1: Analysis of deviance table showing the significance of the interaction term relationship:hrs_wkly.

We conduct a visual inspection of the interaction plots, and indeed, we see that several lines are not

parallel, indicating the presence of interaction effects.
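Plots like Charts 4 and 5 can be drawn with interaction.plot in base R; a sketch on toy proportions (the numbers are illustrative, not the observed ones):

```r
# Toy observed proportions earning > $50k by relationship and banded hours.
relationship <- rep(c("Husband", "Wife", "Own-child"), each = 4)
hrs_band     <- rep(c("0-20", "21-40", "41-60", "61-80"), times = 3)
prop_high    <- c(0.20, 0.45, 0.55, 0.60,   # Husband
                  0.15, 0.35, 0.30, 0.25,   # Wife
                  0.02, 0.05, 0.08, 0.10)   # Own-child

# Non-parallel traces suggest an interaction between the two variables.
interaction.plot(x.factor = hrs_band, trace.factor = relationship,
                 response = prop_high,
                 xlab = "hrs_wkly (banded)",
                 ylab = "Observed proportion earning > $50k")
```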

Chart 4: Interaction plot between relationship and hrs_wkly.

(Note: Observed proportions here have been tabulated in ten-

hourly intervals of hrs_wkly.)

Chart 5: Interaction plot between hrs_wkly and relationship.

(Note: Observed proportions here have been tabulated in ten-

hourly intervals of hrs_wkly.)

2.4 Regression Diagnostics

2.4.1 Residual Plots and Removal of Outliers

Chart 6: Pearson’s residual plot against index of the data. Note the two

extreme residuals.


A quick glance at Chart 6 reveals the existence of two extreme Pearson’s residuals. After analysing the Pearson’s residuals r_i further, we find that 19 data points have |r_i| > 10. The same 19 data points, plus one additional data point, have deviance residuals d_i such that |d_i| > 3. We acknowledge that the common rule of thumb for Pearson’s and deviance residuals is to look at those r_i’s and d_i’s such that |r_i| > 2 and |d_i| > 2. However, we provide reasons later for why we think it is very difficult to achieve |r_i| < 2 and |d_i| < 2 for most i.

The indices of the 20 data points for which |d_i| > 3 are as follows:

1843, 3014, 3155, 3813, 4320, 5514, 6024, 6499, 7882, 8147, 12964, 13128, 13784, 17049, 19364, 19671, 19742, 20254, 25053, 26623.

(Data points 12964 and 17049 correspond to points 12902 and 16966 in the Pearson’s residual plot in

Chart 6 respectively. This shift in index is due to the removal of data corresponding to individuals with more

than $50k in capital gain.)
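Residuals of both types can be extracted from a fitted glm object and screened against a cut-off; a self-contained sketch (the toy model and data are illustrative; the cut-off of 3 matches the one used above):

```r
set.seed(4)
x   <- rnorm(200)
y   <- rbinom(200, 1, plogis(x))
fit <- glm(y ~ x, family = binomial)

r_pearson  <- residuals(fit, type = "pearson")
r_deviance <- residuals(fit, type = "deviance")

# Indices whose deviance residual exceeds the cut-off in magnitude; the model
# without these points can then be refitted with update(fit, subset = -idx).
idx <- which(abs(r_deviance) > 3)
idx
```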

Most of these data points give large residuals because the response variable takes on a value opposite to

what our model predicts. For instance, some data points, such as 3813, 6024, 12964 and 17049,

correspond to individuals with characteristics that our model predicts will give them a very high chance of

earning more than $50k (such as high capital gain), but they eventually do not. Conversely, other data

points, like 4320, 6499, 13128 and 19671, correspond to individuals with characteristics that our model

predicts will give them a very low probability of earning more than $50k (such as education = 12th-grade-and-below, relationship = Unmarried, etc.), but it turns out they earn more than $50k.

We propose that either higher-order interactions need to be considered, or other factors not included among the ten variables given may be at play. This is because even when all ten variables are fitted in an additive model, we still do not achieve |r_i| < 2 and |d_i| < 2 for most i (although we do not show the residual plots for the ten-variable additive model here, they look very similar to those of our proposed model, in that most Pearson’s residuals satisfy |r_i| < 10 and most deviance residuals satisfy |d_i| < 3). After removing the 20 data points that our proposed model cannot accurately predict (reasons for removal as discussed above), the following diagnostic plots are the best we can achieve for our model:

Chart 7: Pearson’s residual plot against index of the data.

Chart 8: Deviance residual plot against index of the data.

2.4.2 Goodness-of-Fit Test

Since the number of covariate patterns is large relative to the number of observations, Pearson’s chi-

square test is not good for checking goodness-of-fit. Moreover, as the dataset is binary, we cannot use

residual deviance to test goodness-of-fit either. We use the residuals.lrm command to test if the current

model is alright compared to a more general alternative. The p-value associated with this test is

0.08213949 > 0.05, which suggests that there is no significant lack of fit.
