Department of Economics
The University of Melbourne
ECOM30001/ECOM90001: Basic Econometrics
Semester 1, 2023
Assignment 2

Introduction

There are two assignments in the subject, each contributing 10% towards your final grade. The assignments are linked together and will involve completing an econometric analysis of a topic, chosen by you, on one of the real-world data sets that have been provided to you.

The principal purpose of the first assignment was to provide you with extensive feedback on the feasibility of your proposed project. Specifically, you were required to provide:

- A description of your research question to be examined.
- Data: a description of the data used.
- Model: a description of the model to be estimated. This should include an explicit definition of your dependent variable, as well as a list of your intended explanatory variables.
- Analysis: a description of the proposed estimation methodology to be used, as well as a statement of any identifying assumptions required for the methodology to be appropriate.

The first requirement for Assignment 2 will be to re-write this section of your report, incorporating the feedback that you have received. This might involve:

- a refinement and narrowing of your research question to become more specific and focused.
- a more precise description of the data used, including details of, and the motivation for, any additional sample restrictions that you impose.
- a more precise description of your empirical model, including a discussion of the functional form for your dependent variable (in logs or levels), the functional form of your conditional mean function, and whether you have included any quadratic terms or other interactions in your model.
- a refinement of your proposed methodology, incorporating any material that we have covered since you submitted the first assignment.

The second requirement for Assignment 2 will be to include:

- Summary Statistics: a description and interpretation of the summary statistics associated with your chosen sample. This would involve a table of means and standard deviations of the variables used in the analysis. A key component will be a discussion of the characteristics of your sample.
- Results: a description and interpretation of your main results, which will need to be presented in table(s). Your report should also include a discussion of the 'robustness' of your results to different modelling assumptions, if applicable, such as functional forms, the set of included explanatory variables, and/or robust versus ordinary standard errors.
- Conclusions: a discussion of your main conclusions, as well as a discussion of the limitations of your project.

Word Limit: Although your submitted assignment will include a rewritten section of the material submitted for Assignment 1, the 650 word limit applies only to the 'new material' submitted: the second section outlined above, which contains your analysis, estimation results, and conclusions.

Assignment Weighting: The principal aim of the first assignment was to provide you with extensive feedback on your proposed research project that will be completed in Assignment 2. Since the two assignments are linked together, I will be awarding you final marks based on the maximum of (a) your grade in Assignment 2 only, or (b) the sum of the grades in Assignment 1 and the second component (analysis) of Assignment 2, whichever gives you a higher grade. This provides you with an excellent opportunity to considerably improve your final report by incorporating your feedback from Assignment 1.

Extra Help: I will be holding a 'drop-in' session on Wednesdays from 10:00am - 12:00pm (or by appointment) if you would like to obtain some specific advice on your research project. Please feel free to drop by my office during these times.

UK Census Data

The data file uk tidy.csv provides 445,638 observations from the 2011 United Kingdom Population Census, covering England and Wales. This file contains a 'cleaned' version of a random 1% sample of the 2011 census data. Some further information about the data file can be found here:

https://www.ons.gov.uk/census/2011census/2011censusdata/censusmicrodata/microdatateachingfile

I have already cleaned the data file, so I would strongly recommend that you use the cleaned data file uk tidy.csv rather than downloading the raw (uncleaned) data file from the above web-site.

The data file uk tidy.csv contains the following variables:

id = Unique individual identifier
region = Region of residence on census night
familycomposition = Family composition
gender = Self-identified gender
age = Age in years on census night (aged over 15)
maritalstatus = Marital status on census night
student = Currently studying?
countryofbirth = Country of birth (born in the United Kingdom?)
health = Self-reported health
occupation = Last reported occupation (current occupation if employed)
industry = Last reported industry of employment (current industry if employed)
hours = Hours worked per week (if employed on census night)
lfstatus = Labour force status (on census night)

You will notice that the data file does not contain any continuous variables. All variables are categorical variables. Further information on the coding of these variables can be found in the file uk variable listing.xls. The R-script file uk tidy.R provides R code to include the value labels in the data file.

Note that it is not recommended to include categorical variables directly in an econometric model to be estimated. You will first need to create indicator variables ('dummy' variables) for each value of the categorical variables. You will also need to omit one of these indicator variables to avoid the 'dummy variable trap'.
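
The indicator-variable step just described can be sketched in base R. This is a minimal illustration on a small made-up data frame, not the contents of uk tidy.csv: the data frame `toy`, its region labels, and the choice of which category ends up omitted are all assumptions made for the example.

```r
# Toy data standing in for a categorical census variable (values are made up).
toy <- data.frame(
  id     = 1:6,
  region = c("North East", "London", "Wales", "London", "Wales", "North East")
)

# Treat region as a factor; its first level becomes the omitted (base) category.
toy$region <- factor(toy$region)

# model.matrix() expands the factor into 0/1 indicator columns.
# The intercept column stands in for the omitted category, so it is dropped
# here to leave only the J - 1 indicators that would enter a regression.
dummies <- model.matrix(~ region, data = toy)[, -1, drop = FALSE]

# Attach the indicator columns to the original data.
toy_d <- cbind(toy, dummies)
print(toy_d)
```

With J categories, only J - 1 indicator columns are kept; the omitted category is the reference group against which the remaining coefficients are interpreted. R's lm() and glm() perform the same expansion automatically when a factor appears in the model formula.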
Please refer to Week 4, Lecture 2: Dummy Variables I for more details on this. The R-script file uk tidy.R provides some sample R code to create these indicator variables from categorical variables, using the variable region as an example. You might find the R package fastDummies useful.

Some specific issues you may want to explore:

- The potential dependent variables or outcomes are indicator variables. Should you use a Linear Probability Model (LPM) or a Probit model? Your report should include a discussion of the main advantages and disadvantages associated with your modelling choice.
- The errors in the Linear Probability Model (LPM) are inherently heteroskedastic, so you will need to use robust standard errors.
- Be careful with your interpretation of marginal effects when using the Probit model. Unlike LPM coefficients, Probit coefficients are not themselves marginal effects; the marginal effect depends on the values of the explanatory variables at which it is evaluated.
- All of the possible outcomes or explanatory variables are categorical variables. However, you may want to explore interactions between these variables. For example, you may want to allow for a differing effect of country of birth on your outcome, by gender. In this case, you could interact (multiply) the corresponding indicator variables in your model and then test whether these interaction terms are statistically significant.

Melbourne House Prices

The data file houseprices.csv contains data on the selling prices of 4,238 houses sold in Melbourne during the period April 2016 to March 2018. You have already seen some of this data, since a subset was used in Tutorial 2.

The data file houseprices.csv contains the following variables:

year = Year of sale
month = Month of sale
day = Day of sale
price = Selling price, in dollars
rooms = Number of rooms
bedroom = Number of bedrooms
bathroom = Number of bathrooms
car = Number of car spaces
buildingarea = Building area of property, in square metres
landsize = Land size of property, in square metres
large = 1 if landsize ≥ 650 square metres, 0 otherwise
yearbuilt = Year property was built
distance = Distance from Melbourne C.B.D., in kilometres
propertycount = Number of properties in postcode
regionname = Region location of property

You will notice that the variable regionname is a categorical variable. Further information on the coding of this variable can be found in the file houseprices variable listing.xls. The R-script file houseprices tidy.R provides R code to include the value labels for this variable in the data file.

As noted above, it is not recommended to include categorical variables directly in an econometric model to be estimated using the method of Ordinary Least Squares (OLS). You will first need to create indicator variables ('dummy' variables) for each value of the categorical variables. You will also need to omit one of these indicator variables to avoid the 'dummy variable trap'. Please refer to Week 4, Lecture 2: Dummy Variables I for more details on this. You might find the R package fastDummies useful.

Some specific issues you may want to explore:

- The data contain a variable indicating the year of sale and also a variable indicating the month of sale. While it is feasible to create dummy variables for the year of sale, there may not be much variation in prices over the two years of data. However, it might be worthwhile creating a set of season (or quarter of sale) dummy variables to explore or control for seasonal variation in prices. Please review Question 2 in Tutorial 5.
- Heteroskedasticity is likely an issue in this data, so you may want to explore different ways of detecting and addressing it: (1) the White test for heteroskedasticity; (2) Huber-White (robust) standard errors that correct for heteroskedasticity of an unknown form; or (3) Feasible Generalised Least Squares (FGLS).
- Should your dependent variable be expressed in levels or in natural logarithms? Your report should provide a brief discussion motivating your particular choice of functional form for your dependent variable and the advantages of your choice. Remember that the interpretation of the marginal effects will change, depending on whether your outcome is expressed in levels or natural logarithms.
- Model Specification: You may want to explore different functional forms for your continuous explanatory variables, such as quadratic functions for the age of the property, building area, and/or distance. Alternatively, you may also wish to explore interactions between the variables. For example, you may want to allow for a differing effect of distance from the C.B.D. on prices, by region. In this case, you could interact (multiply) these variables together in your model and then test whether these interaction terms are statistically significant.
- Multicollinearity: The variables rooms and buildingarea likely both identify the variation in prices associated with the size of the house. You may want to explore whether multicollinearity is an issue when both variables are included in your model. Of course, in applied work there is always a trade-off between including as many variables as feasible to avoid the omitted variables problem and avoiding the multicollinearity problem.
- Omitted Variable Bias: You may want to consider the likely impact or direction of the bias on your estimated effect of interest when there are important variables excluded from your empirical model.
While the limited number of variables in the data precludes any feasible solution to this issue, you may want to acknowledge the possibility of omitted variable bias.
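
The direction of omitted variable bias can be reasoned through using the standard two-regressor case. The following is a sketch under simplifying assumptions: a single included regressor, a single omitted regressor, and symbols (the betas and delta) introduced here purely for illustration. Suppose the true model contains both regressors, but the second is left out of the estimated model:

```latex
% True model (x_2 is unobserved or excluded from the estimated model)
\text{price}_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + u_i

% Auxiliary relationship between the omitted and the included regressor
x_{2i} = \delta_0 + \delta_1 x_{1i} + v_i

% Probability limit of the OLS slope from the short regression of price on x_1 only
\operatorname{plim}\, \tilde{\beta}_1 = \beta_1 + \beta_2 \, \delta_1
```

The bias term is the product of the omitted variable's effect on the outcome and its association with the included regressor. It is positive when the two have the same sign and negative when they have opposite signs. As a purely hypothetical example, if an unobserved measure of build quality raised prices and were positively correlated with building area, the estimated effect of building area would be biased upwards.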
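
Several of the house-price issues listed above (quarter-of-sale dummies, a log dependent variable, a quadratic term, and robust standard errors) can be combined in one short base R sketch. The data here are simulated rather than read from houseprices.csv, so the data frame `sim`, the numbers used to generate it, and the choice of regressors are all assumptions for illustration, not a recommended specification.

```r
set.seed(1)
n <- 500

# Simulated stand-in for the house price data (all values are made up).
sim <- data.frame(
  month        = sample(1:12, n, replace = TRUE),
  distance     = runif(n, 1, 40),
  buildingarea = runif(n, 60, 400)
)

# Prices whose error variance grows with building area (heteroskedastic by design).
sim$price <- exp(13 + 0.004 * sim$buildingarea - 0.03 * sim$distance +
                 0.0003 * sim$distance^2 +
                 rnorm(n, sd = 0.001 * sim$buildingarea))

# Quarter-of-sale factor built from month; Q1 becomes the omitted category.
sim$quarter <- factor(paste0("Q", (sim$month - 1) %/% 3 + 1))

# Log-price model with a quadratic in distance and quarter dummies.
fit <- lm(log(price) ~ buildingarea + distance + I(distance^2) + quarter,
          data = sim)

# Huber-White (HC0) covariance: (X'X)^-1 X' diag(u^2) X (X'X)^-1.
X     <- model.matrix(fit)
u     <- resid(fit)
bread <- solve(crossprod(X))
meat  <- crossprod(X * u)   # each row of X is scaled by its own residual
V_hc0 <- bread %*% meat %*% bread

# Compare ordinary and robust standard errors side by side.
results <- cbind(
  estimate  = coef(fit),
  se_ols    = sqrt(diag(vcov(fit))),
  se_robust = sqrt(diag(V_hc0))
)
print(round(results, 5))

# Marginal effect of distance on log(price), evaluated at 10 km:
# d log(price) / d distance = b_distance + 2 * b_distance2 * distance.
b <- coef(fit)
me_10km <- unname(b["distance"] + 2 * b["I(distance^2)"] * 10)
print(me_10km)
```

Because the dependent variable is in logs, each slope is read approximately as a proportional change in price: a one-unit increase in a regressor changes price by roughly 100 times its coefficient, in percent, holding the other regressors fixed. With the quadratic term, the effect of distance is not constant and has to be evaluated at a chosen distance, as in the last lines. The robust covariance is computed by hand here only to keep the sketch free of package dependencies; packages such as sandwich provide the same calculation along with small-sample corrections.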