R assignment writing: FEM11149 | 学霸联盟

Date: 2021-10-23

Final assignment: income inequality in Brazil

FEM11149 - Introduction to Data Science
Instructor: dr. A. Tetereva
TA: A. Schmidt
2021/2022

Introduction

This is an instruction booklet for the Introduction to Data Science course from the Data Science and Marketing Analytics Master program at the Erasmus School of Economics.

In this final assignment you will not be given step-by-step instructions. You are expected to know which techniques are needed to clean and subset the data (if needed), run the models, and carry out their respective diagnostics.

What you should hand in

• A PDF file, generated using R Markdown, containing
  – A business report (max 4 pages) with your analysis
  – The report should follow the guidelines specified on Canvas, except for the code appendix (see below)
• A *.rmd file, used to generate the PDF file above

No attachments to the PDF file are allowed.

Deadline: 24/10, 23:59.

Final Assignment: investigating income inequality in Brazil

Congratulations! Because of your hard work for Gelukshuisje, you are now a data scientist known worldwide. The United Nations Office in Brazil is looking for a consultant for a data science job, and your former intern from Gelukshuisje, Ahsia, recommended you. You are trusted with sample data from the last Brazilian National Census (2010).

The dataset Brazil_data_census contains 4500 rows. Rows correspond to Brazilian cities, which are divided among the 26 states (estados/UF) plus the Federal District, where Brasília is located. In the 30 columns, besides the identification of the city, you will find several numerical indicators, ranging from total population to the proportion of houses with electricity.

The outcome of interest is the income appropriated by the richest 10% divided by the income of the poorest 40% within a city, identified as R1040 in the dataset (hereafter referred to as the 10/40 ratio). This compares the per capita income of the richest decile with that of the poorest two fifths and gives a notion of inequality.
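As a small worked illustration of how a 10/40 ratio behaves, the sketch below computes one from a toy vector of incomes. It assumes the ratio compares mean per capita incomes of the two groups, which is one reading of the description above; the function name ratio_10_40 and the income values are hypothetical and do not come from the dataset, where R1040 is already provided.

```r
# Toy illustration only: R1040 is already a column in the dataset.
# Assumption: the ratio compares MEAN incomes of the two groups.
ratio_10_40 <- function(income) {
  income <- sort(income)
  n <- length(income)
  poorest <- head(income, max(1, round(0.4 * n)))  # poorest 40%
  richest <- tail(income, max(1, round(0.1 * n)))  # richest 10%
  mean(richest) / mean(poorest)
}

# Hypothetical per capita incomes of ten people in one city
incomes <- c(100, 100, 200, 200, 300, 300, 400, 500, 900, 2000)
ratio_10_40(incomes)
```

A larger value means the richest decile earns more relative to the poorest two fifths, so the ratio grows with inequality.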
Column descriptions are in the file Brazilian_census_databook.xlsx.

The plan

You need to investigate what explains the 10/40 ratio. For that, you will build and compare two regression models:

• The first model is a penalized regression using LASSO;
• The second model is a regression using PCA scores as explanatory variables, that is, principal component scores are used instead of the original variables to explain your outcome of interest in a linear regression model.

Minimum Requirements

You need to set aside a sample of 10 municipalities, for which you will perform an out-of-sample prediction exercise to compare the results from both models.

For both models, use the best practices you saw in the lectures. Specifically for PCA, in addition to the three simple criteria, please use a permutation test to select the meaningful number of principal components. Moreover, apply a bootstrap procedure to Kaiser's rule, i.e. test whether the variance explained by each component is significantly larger than 1. Use the results of your analysis to name and interpret the selected components.

Note that these are partial requirements and are not sufficient for a full grade. Everything should be explained and interpreted. Single results without interpretation will not be considered. Be aware that different criteria for selecting the principal components might differ in their conclusions; the final decision is up to you and needs to be justified.

This is an individual assignment, and all students have different datasets.

Tips on PCA

• To extract your new variables Zj, j = 1, ..., p, containing the scores of the principal components, you can use the following R command:

    # Estimate the PCA model
    my_pca <- princomp(x, ...)
    # my_scores is an object with the same number of rows and columns
    # as x and contains scores of principal components.
    my_scores <- my_pca$scores

• The biplot() command is very limited and will have a hard time with this dataset.
  – If you want something more customizable, you might want to check the function fviz_pca_biplot() from the package factoextra. Note that you can add an extra layer of segmentation by using col.ind. If used wisely, you can have nice geographical inputs for your analysis (experiment with the column Estados in this argument).
  – If you want to use only the basic plot functions from R, here is a nice reference: https://www.benjaminbell.co.uk/2018/02/principal-components-analysis-pca-in-r.html.

Table 1: Brazilian states and respective regions

State_letter  State_name           Region_letter  Region_name
AC            Acre                 NO             North
AP            Amapá                NO             North
AM            Amazonas             NO             North
PA            Pará                 NO             North
RO            Rondônia             NO             North
RR            Roraima              NO             North
TO            Tocantins            NO             North
AL            Alagoas              NE             Northeast
BA            Bahia                NE             Northeast
CE            Ceará                NE             Northeast
MA            Maranhão             NE             Northeast
PB            Paraíba              NE             Northeast
PI            Piauí                NE             Northeast
PE            Pernambuco           NE             Northeast
RN            Rio Grande do Norte  NE             Northeast
SE            Sergipe              NE             Northeast
DF            Distrito Federal     CW             Central West
GO            Goiás                CW             Central West
MT            Mato Grosso          CW             Central West
MS            Mato Grosso do Sul   CW             Central West
ES            Espírito Santo       SE             Southeast
MG            Minas Gerais         SE             Southeast
RJ            Rio de Janeiro       SE             Southeast
SP            São Paulo            SE             Southeast
PR            Paraná               SO             South
SC            Santa Catarina       SO             South
RS            Rio Grande do Sul    SO             South

Some 'insider' information from Brazil

Geographical division in Brazil starts from neighbourhoods, which are located within cities. Cities, in turn, are located in one of the 26 states or in the Federal District. In the dataset, the column Estados holds the letters of the state to which the city belongs. Each state is located in a region, and there are 5 regions. Table 1 shows the division of states among these five regions, and Figure 1 shows how they are divided on the map.

Regions are economically and socially unequal. The state of São Paulo, in which several companies and factories are located, is in the Southeast. This is the region with the highest GDP of the five.
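The state-to-region division in Table 1 amounts to a lookup table. A minimal sketch of encoding it in R follows, assuming only that the data has a column Estados with the state letters, as described above. The data frame cities and the column name Region are hypothetical stand-ins for illustration.

```r
# Lookup from Table 1: state letters -> region names
state_region <- c(
  AC = "North", AP = "North", AM = "North", PA = "North",
  RO = "North", RR = "North", TO = "North",
  AL = "Northeast", BA = "Northeast", CE = "Northeast",
  MA = "Northeast", PB = "Northeast", PI = "Northeast",
  PE = "Northeast", RN = "Northeast", SE = "Northeast",
  DF = "Central West", GO = "Central West",
  MT = "Central West", MS = "Central West",
  ES = "Southeast", MG = "Southeast", RJ = "Southeast", SP = "Southeast",
  PR = "South", SC = "South", RS = "South"
)

# Hypothetical toy data frame standing in for the real dataset
cities <- data.frame(Estados = c("SP", "BA", "RS", "AM", "DF"))

# as.character() guards against Estados being stored as a factor
cities$Region <- factor(unname(state_region[as.character(cities$Estados)]))
table(cities$Region)
```

Region names are used rather than region letters because SE is both the state letter of Sergipe and the region letter of the Southeast. A factor built this way could be supplied wherever the text suggests Estados, for example in col.ind.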
You can see some indicators per region in Table 2. One interesting analysis you can do is to color your biplot by region. For that, you need to add this new variable to your dataset.

Data credits

Data comes from Atlas Brasil and was pre-processed for the Assignment.

Figure 1: Brazilian regions

Table 2: Brazilian states and respective regions

Region_letter  Region_name   Population¹  GDP²
NO             North         17.7         94.8
NE             Northeast     56.9         273.1
CW             Central West  15.6         174.3
SE             Southeast     86.3         803.0
SO             South         29.4         313.0

Note: Source: Wikipedia.
¹ Population as of 2016, in millions.
² GDP in billions of US dollars (2016).
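Returning to the PCA requirements in the plan, the two less common procedures (a permutation test for the number of components and a bootstrap version of Kaiser's rule) can be sketched in base R. This is one possible reading of those procedures, run on simulated data, and not necessarily the version taught in the lectures. The names x, eigenvalues, n_perm and n_boot, the 5% level, and the choice of a correlation-based PCA are all assumptions made for illustration.

```r
# Sketch on SIMULATED data; nothing here comes from the assignment dataset.
set.seed(1)
n <- 200
p <- 6
f <- rnorm(n)                                    # one common factor
x <- cbind(sapply(1:3, function(j) f + rnorm(n, sd = 0.5)),
           matrix(rnorm(n * 3), n, 3))           # 3 correlated + 3 noise columns

# Variances of the components of a correlation-based PCA
eigenvalues <- function(m) {
  eigen(cor(m), symmetric = TRUE, only.values = TRUE)$values
}
obs <- eigenvalues(x)

# Permutation test: shuffling each column separately breaks the
# correlations between variables but keeps each marginal distribution
n_perm <- 500
perm <- replicate(n_perm, eigenvalues(apply(x, 2, sample)))  # p x n_perm
p_perm <- rowMeans(perm >= obs)       # share of permuted values >= observed
keep_perm <- which(p_perm < 0.05)

# Bootstrap for Kaiser's rule: resample rows, recompute the eigenvalues,
# and keep components whose one-sided 95% lower bound exceeds 1
n_boot <- 500
boot <- replicate(n_boot, eigenvalues(x[sample(n, replace = TRUE), ]))
lower <- apply(boot, 1, quantile, probs = 0.05)
keep_kaiser <- which(lower > 1)

# The same variances come out of princomp() on the correlation matrix
my_pca <- princomp(x, cor = TRUE)
all.equal(unname(my_pca$sdev^2), obs)
my_scores <- my_pca$scores[, keep_perm, drop = FALSE]
```

The two criteria need not agree: bootstrap eigenvalues tend to be spread out more than the sample ones, so the bootstrap Kaiser rule can retain a component that the permutation test rejects. This is the kind of disagreement the text asks you to discuss and resolve with a justified decision.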
