STAT6118 Assignment 2021-22
STAT6118 Complex Survey Data Analysis
Assignment 2021-22
You must submit one electronic copy of your report in PDF by 11.59pm on Thursday 18th May
2022. Submit the electronic copy (a single file) via the STAT6118 Blackboard website using
the TurnItIn software (in the Assignments folder, select View/Complete to submit your report). A
scanned handwritten document is not allowed. You are allowed to submit only one document to
TurnItIn! Note that the file has to be smaller than 10 MB. If your Word file is larger than this, try
converting it to a readable PDF or saving your images (graphs, plots, …) as JPEG (instead of BMP).
It is the policy of the Department of Social Statistics that coursework is anonymous. Your
Student ID Number must appear on the first page of your Word or PDF document. To maintain
anonymity, please do not put your name on any part of your submission except the lower part of the
Submission Form, which is removed before the marker sees the coursework.
Students are encouraged to discuss and exchange ideas, since this is an important part of the
educational process. However, it is not acceptable to read and take ideas for your coursework
from another student’s finished work. It is very important that you carefully read Section 5
(Academic Integrity and Referencing) of the module outline (available on Blackboard).
Make sure that your assignment fits in a single PDF document. Make sure that you have 3 sections
called Part 1, Part 2 and Part 3. The subsections 1a), 1b), 2a), 2b) and 2c) should also be
clearly labelled. The maximum number of words is 6000.
Information about coursework submission, the penalty for late submission, the policy for over-length
work, the procedure for coursework extensions, feedback, and academic integrity and referencing can
be found in the module outline (available on Blackboard). It is very important that you read
the module outline carefully, because it contains additional important information about this assignment.
It is recommended to use STATA for this assignment. However, you can use R instead of STATA
(at your own risk), if you prefer.
Your Assignment:
The data file called samprj.dta contains an extract from the Brazilian Family Budget Survey
2002/2003 for the state of Rio de Janeiro, in Brazil. Observations in this file correspond to residents
in the participating households. The original dataset has been pre-processed to remove a few cases
of households containing records for absent members and to select the relevant variables for
analysis. But otherwise, these are the real survey data for the target region.
The variable person contains the label of each person within the household, and the reference
person of the household is always labelled 1 in this variable. A description of the variables can be
found in the Excel file “samprj_variables.xls” available on Blackboard.
The sampling design is a stratified, two-stage sampling of households:
• Stratification of PSUs by State & Education of heads of households (average at PSU level).
• Primary sampling units (PSUs) are census enumeration areas. PSUs sampled with PPS – size =
number of households in census.
• Secondary sampling units (SSUs) are households. SSUs sampled with SRS within each PSU.
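As an illustration of what this design implies for the weights, the following is a minimal sketch of the household inclusion probability. It assumes PPS sampling of PSUs without replacement, with inclusion probabilities exactly proportional to size, and it ignores any nonresponse or calibration adjustment that the survey weights in samprj.dta may already contain. All symbols are notation introduced here, not taken from the lecture slides: n_h is the number of PSUs sampled in stratum h, M_{hi} is the census household count of PSU i, M_h is the stratum total of these counts, and m_{hi} is the number of households sampled out of the M'_{hi} households listed in PSU i.

```latex
\begin{align*}
\pi_{hi} &= \frac{n_h \, M_{hi}}{M_h}, \qquad M_h = \sum_{i \in U_h} M_{hi}
  && \text{(stage 1: PPS selection of PSU $i$ in stratum $h$)} \\
\pi_{j \mid hi} &= \frac{m_{hi}}{M'_{hi}}
  && \text{(stage 2: SRS of households within PSU $i$)} \\
w_{hij} &= \frac{1}{\pi_{hi} \, \pi_{j \mid hi}}
  = \frac{M_h}{n_h \, M_{hi}} \cdot \frac{M'_{hi}}{m_{hi}}
  && \text{(design weight of household $j$)}
\end{align*}
```

If the listed count M'_{hi} equals the census count M_{hi}, the weight reduces to M_h / (n_h m_{hi}), which is constant within a stratum whenever the same number of households is taken in every PSU.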
Achieved sample sizes:
Part 1 (Descriptive statistics)
1a) Consider the proportion of households having microwave ovens (microwav) by education level
(educatio) of the reference person (person = 1). In particular, consider the two proportions for
heads with the highest and lowest education levels, respectively. You need to address both of the
following:
1. Estimate the difference between the two proportions, taking the sampling design into
account.
2. Test the hypothesis “the two proportions are equal” against the alternative “the two
proportions are unequal”, taking the sampling design into account.
[8]
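One possible way to write the quantities in 1a) analytically is sketched below. This is a generic design-weighted formulation, not necessarily the notation of the lecture slides. The symbols are assumptions introduced here: w_i is the survey weight of household i, y_i indicates a microwave oven, z_{di} indicates that the reference person has education level d, and H and L denote the highest and lowest levels.

```latex
\begin{align*}
\hat{p}_d &= \frac{\sum_{i \in s} w_i \, z_{di} \, y_i}{\sum_{i \in s} w_i \, z_{di}},
  \qquad d \in \{H, L\} \\
\hat{\Delta} &= \hat{p}_H - \hat{p}_L \\
t &= \frac{\hat{\Delta}}{\sqrt{\hat{V}(\hat{\Delta})}}
\end{align*}
```

Here \hat{V}(\hat{\Delta}) would be a design-based variance estimate (for example by linearization, using PSU totals within strata), and t would usually be compared with a t distribution whose degrees of freedom equal the number of sampled PSUs minus the number of strata.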
1b) Apply two different tests of independence between the number of bathrooms in the
household (nbathrms) and the ethnic group of the reference person (ethngrp). If you think it is
necessary, you can recode ethngrp by combining groups.
[8]
For tasks 1a) and 1b) above, you should explain how you took the sampling design into account,
by defining the appropriate estimator and test, using analytic expressions (as in the lecture slides).
You should use the STATA “svy” procedures. You should also briefly describe your STATA code.
Part 2 (Modelling):
2a) Fit a logistic regression model to the indicator of having a credit card (crdcard) for persons
aged 20 or above, using as predictors educatio, ethngrp, sex and income. You should use an
aggregated approach that takes the design into account.
1. Is sex a relevant determinant after you control for the other covariates?
2. Re-estimate the final fitted model without allowing for the sampling design. How do the
results change?
[20]
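For orientation, a design-weighted (pseudo-likelihood) logistic regression can be written as the estimating equation below. This is a generic sketch rather than the formulation of the lecture slides. The notation is assumed: y_i is the crdcard indicator, x_i is the covariate vector, w_i is the survey weight, and s is the sample restricted to persons aged 20 or above.

```latex
\begin{align*}
\operatorname{logit}\, p_i(\beta) &= x_i^{\top} \beta \\
\sum_{i \in s} w_i \, x_i \, \bigl( y_i - p_i(\beta) \bigr) &= 0
\end{align*}
```

Setting every w_i to 1 and using the usual model-based standard errors gives the fit that ignores the sampling design, which is the comparison asked for in item 2.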
2b) Fit and interpret a model for the total monthly income (totincom) using as independent
variables sex, ethngrp, educatio and age. Note that
• ‘no income’ is represented by ‘totincom = 0’ in this dataset,
• the dependent variable of your model may be a transformation of totincom,
• and different predictors may be constructed based on the given independent variables.
You should use an aggregated approach that takes the sampling design into account.
[15]
For tasks 2a) and 2b), you should explain how you took the design into account, by defining the
appropriate estimators, using analytic expressions (as in the lecture slides). You should use the
STATA “svy” procedures (except for the fits which do not allow for the survey design). You
should also briefly describe your STATA code.
2c) By using the STATA procedures “svy” in 2b), you should have used an aggregated approach to
fit your regression model.
1. Describe a disaggregated approach that could have been used for the total monthly income.
By using the disaggregated approach you propose, fit your disaggregated model with the
effects of your final model obtained in 2b). Compare briefly your results with 2b).
[15]
2. Describe a model-based aggregated approach that could have been used for the total
monthly income. Compare this approach with the one used in 2b). Discuss the advantages
and disadvantages. Fit this model and compare it with 2b).
[9]
Part 3 (Nonresponse):
The data (DataCPS.CSV) is extracted from the September 1976 Current Population Survey in the
USA. The units are individual persons. We assume that stratified simple random sampling has
been used. The population size is N = 46049. This does not correspond to any sub-population of the
USA. It should be viewed as a fictitious population for the purpose of this assignment. The
variables are
• “stratum”: the stratum label. There are 3 geographical strata.
• “area”: represents compact geographic areas.
• “person”: person number
• “age”: the age of the persons.
• “agecat”: age category. 1 = 19 years and under; 2 = 20-24; 3 = 25-34; 4 = 35-64; 5 = 65
years and over.
• “race”: 1 = non-black; 2 = black
• “sex”: 1 = male; 2 = female
• “hour”: usual number of hours worked per week
• “wage”: usual amount of weekly wages (in 1976 US $). Contains missing values, labelled
NA.
The variable “wage” contains missing values. Your aim is to estimate the population average of the
variable “wage”. The stratum sizes are 12279 for stratum 1, 18420 for stratum 2 and 15350 for
stratum 3. Some population counts are given in the following table.
               Non-black            Black
Age          Male   Female      Male   Female     Total
≤ 19          801     1700       296      184      2981
20 - 24      2459     1980       864     1377      6680
25 - 34     12313     3133       497      137     16080
35 - 64      9349     6396       810     2624     19179
≥ 65          503      365       167       94      1129
Total       25425    13574      2634     4416     46049
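As a quick consistency check on the counts above, the short Python sketch below re-enters the table and confirms that the cell counts reproduce the stated margins, and that the three stratum sizes add up to N = 46049. The variable names are illustrative choices made here; the numbers are copied from the text.

```python
# Population counts from the table: one row per age category, columns are
# (non-black male, non-black female, black male, black female).
AGE_LABELS = ["19 and under", "20-24", "25-34", "35-64", "65 and over"]
COUNTS = [
    [801, 1700, 296, 184],
    [2459, 1980, 864, 1377],
    [12313, 3133, 497, 137],
    [9349, 6396, 810, 2624],
    [503, 365, 167, 94],
]
ROW_TOTALS = [2981, 6680, 16080, 19179, 1129]      # "Total" column
COLUMN_TOTALS = [25425, 13574, 2634, 4416]         # "Total" row
STRATUM_SIZES = {1: 12279, 2: 18420, 3: 15350}     # from the text
N = 46049

# Recompute the margins from the cells.
row_sums = [sum(row) for row in COUNTS]
column_sums = [sum(column) for column in zip(*COUNTS)]
grand_total = sum(row_sums)

margins_ok = row_sums == ROW_TOTALS and column_sums == COLUMN_TOTALS
strata_ok = sum(STRATUM_SIZES.values()) == N

print("margins consistent:", margins_ok)
print("grand total equals N:", grand_total == N)
print("stratum sizes sum to N:", strata_ok)
```

The same cell counts can serve as known population totals if a post-stratification or calibration adjustment is chosen for the weights.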
Using the data above, create weights that take account of the design and non-response. Your
aim is to estimate the population average of the variable “wage”. The assumptions about the
response mechanism must be clearly stated and justified. You should describe and justify the
approach you adopted. Provide your weighted estimate for the population average of the variable
“wage”. Any statistical package can be used.
[25]
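To make the mechanics of such weights concrete, here is a minimal Python sketch of one possible approach, a weighting-class adjustment. It is not a model answer. It assumes that the design weight in stratum h is N_h / n_h (stratified simple random sampling) and that response is random within adjustment cells. The choice of cells is left open, the function names are hypothetical, and the six toy records are invented for illustration and are not taken from DataCPS.CSV.

```python
from collections import defaultdict

# Stratum population sizes given in the assignment text.
STRATUM_SIZES = {1: 12279, 2: 18420, 3: 15350}


def nonresponse_weights(records, stratum_sizes):
    """Design weights N_h / n_h, inflated within each adjustment cell.

    Each record is a dict with keys 'stratum', 'cell' and 'wage'
    (wage is None when missing). Nonrespondents receive weight 0.
    """
    # n_h: number of sampled persons per stratum.
    sample_counts = defaultdict(int)
    for record in records:
        sample_counts[record["stratum"]] += 1
    design = [
        stratum_sizes[r["stratum"]] / sample_counts[r["stratum"]] for r in records
    ]

    # Sum of design weights per cell, for all sampled and for respondents.
    sampled = defaultdict(float)
    responded = defaultdict(float)
    for record, d in zip(records, design):
        sampled[record["cell"]] += d
        if record["wage"] is not None:
            responded[record["cell"]] += d

    weights = []
    for record, d in zip(records, design):
        cell = record["cell"]
        if responded[cell] == 0:
            # A cell without respondents has to be merged with another cell.
            raise ValueError(f"cell {cell!r} has no respondents")
        if record["wage"] is None:
            weights.append(0.0)
        else:
            weights.append(d * sampled[cell] / responded[cell])
    return weights


def weighted_mean(records, weights):
    """Ratio estimator sum(w * wage) / sum(w) over respondents."""
    pairs = [(w, r["wage"]) for r, w in zip(records, weights) if r["wage"] is not None]
    return sum(w * y for w, y in pairs) / sum(w for w, _ in pairs)


# Invented toy records (NOT real data), two sampled persons per stratum.
toy = [
    {"stratum": 1, "cell": "A", "wage": 100.0},
    {"stratum": 1, "cell": "A", "wage": None},
    {"stratum": 2, "cell": "A", "wage": 200.0},
    {"stratum": 2, "cell": "B", "wage": 150.0},
    {"stratum": 3, "cell": "B", "wage": None},
    {"stratum": 3, "cell": "B", "wage": 250.0},
]
toy_weights = nonresponse_weights(toy, STRATUM_SIZES)
toy_mean = weighted_mean(toy, toy_weights)
print("sum of weights:", round(sum(toy_weights), 6))
print("weighted mean wage:", round(toy_mean, 2))
```

Because the adjustment redistributes the design weight of nonrespondents within each cell, the final weights still sum to the population size. Replacing the sampled cell totals with known population counts would turn the same code into a post-stratification adjustment.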
Dr Yves Berger, 28th March 2022