Tutorial 5

QUESTION 1

The file Earnings.dta contains a random sample of data on individuals’ hourly earnings, age, gender and marital status (bachelor).

Summarize the data. How much variation is there in earnings across the sample? What percentage of the sample are female and bachelors?

sum ahe age female bachelor

* average earnings in the sample are $18.97 per hour, but there is quite a lot of variation (point out the standard deviation and the min and max values)
* average age is about 30
* 43% of the sample are female and 48% are bachelors

Run a regression of earnings (ahe) on age. What is the intercept, and what does this imply about earnings?

reg ahe age

* 1.08: taken literally, this implies that a person aged 0 would on average be paid $1.08 an hour. No worker in the sample is anywhere near age 0, so the intercept is an extrapolation; it mainly fixes the level of the regression line.

What is the slope? Why might this relationship exist?

* 0.60: a one-year increase in age is associated with an increase in earnings of approximately 60 cents per hour, on average.
* The longer workers are in a job, the more experience they accumulate, and they become more productive
* For example, they learn how to do their job better / more efficiently

James is a 34 year old worker. Use the regression output to predict his earnings

* Earnings = 1.08 + 34(0.6049863) = $21.65

How do being female and bachelor status affect earnings?

reg ahe age bachelor female

* Women earn $3.66 less per hour than men, holding age and bachelor status constant.
* Bachelors earn $8.08 more per hour than other employees, holding age and gender constant.

Why might these patterns exist?

* Women may be paid less for two reasons: 1) the ‘pregnancy premium’: there is a possibility they may fall pregnant and therefore have to take a sustained leave of absence from work, and employers respond by paying lower wages; 2) outright discrimination: men are more likely to hold managerial positions and may discriminate against women.
* Bachelors may be paid a premium because they are able to devote more time to their job: they do not have a family to look after.
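The prediction for James in Question 1 can also be computed from the stored regression results instead of typing rounded coefficients by hand. The lines below are a minimal sketch, assuming Earnings.dta sits in the current working directory; because they use the unrounded intercept, the result may differ from the hand calculation in the last decimal place.

```stata
* Sketch: predicted hourly earnings for a 34-year-old (James).
* Assumes Earnings.dta is in the current working directory.
use Earnings.dta, clear
quietly regress ahe age

* Intercept plus 34 times the slope, using the stored coefficients
display "Predicted ahe at age 34: " _b[_cons] + 34*_b[age]

* Same linear combination, reported with a standard error and confidence interval
lincom _cons + 34*age
```

lincom is used here only because it reports the uncertainty around the prediction; the display line on its own reproduces the point estimate.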
(note that bachelors are defined as all single people, not just men)

QUESTION 2

The file Teacher Ratings.dta contains data on course evaluations, course characteristics, and professor characteristics for 463 courses for the academic years 2000-2002 at the University of Texas at Austin.

Variable      Definition
Course_eval   “Course overall” teaching evaluation score, on a scale of 1 (very unsatisfactory) to 5 (excellent)
Beauty        Rating of instructor physical appearance by a panel of six students, averaged across the six panelists, shifted to have mean zero
Female
Minority
NNenglish
intro
onecredit
age           Professor’s age

Construct a scatterplot of average course evaluations (course_eval) on the professor’s beauty (beauty). Does there appear to be a relationship between the variables?

twoway (scatter course_eval beauty)

Fit a linear regression line to the scatterplot

twoway (scatter course_eval beauty)(lfit course_eval beauty)

Run a regression of average course evaluations on the professor’s beauty. What is the estimated slope? What is the estimated intercept?

reg course_eval beauty

* What is the estimated slope? 0.133, so an increase in the beauty index of 1 point is estimated to increase course evaluation scores by 0.13 points
* What is the estimated intercept? 4.00: the average course evaluation score is 4 out of 5

Explain why the estimated intercept is equal to the sample mean of course_eval

sum course_eval beauty

* We see that the sample mean of beauty is 0 (4.75e-08 is effectively zero). The OLS line passes through the sample means, so intercept = mean(course_eval) - slope × mean(beauty), which reduces to mean(course_eval) when mean(beauty) = 0.
* Why is beauty’s mean equal to zero? Because it has been shifted to have mean zero, i.e. measured relative to the average beauty in the sample

Professor Stock has an average value of beauty, while Dr Watson’s value of beauty is one standard deviation above the average. Predict the difference between their evaluation scores.
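The two inputs this prediction needs, the standard deviation of beauty and the slope from reg course_eval beauty, can be read straight from Stata's stored results. The following is a sketch only, assuming Teacher Ratings.dta is in the current working directory; the scalar name sd_beauty is chosen here for illustration.

```stata
* Sketch: predicted evaluation gap between Watson and Stock.
* Assumes "Teacher Ratings.dta" is in the current working directory.
use "Teacher Ratings.dta", clear

* Standard deviation of beauty, saved from summarize's stored results
quietly summarize beauty
scalar sd_beauty = r(sd)

* The intercept is common to both professors, so it cancels in the difference:
* gap = slope x (one standard deviation of beauty)
quietly regress course_eval beauty
display "Predicted difference: " _b[beauty] * sd_beauty
```

Because Stock sits at the mean of beauty (zero), his predicted score is just the intercept, and Watson's is the intercept plus the gap displayed above.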
* standard deviation of beauty is 0.7886477
* Stock: 3.998272 + 0(0.1330014) = 3.998272
* Watson: 3.998272 + 0.7886477(0.1330014) = 4.10316
* Difference: 0.1049

Include the onecredit and age variables in the regression. What does this do to the estimated coefficient on the beauty variable? What might this tell you about the assumptions underlying the OLS model?

reg course_eval beauty age onecredit

* Controlling for the type of course and the age of the instructor, the coefficient on beauty is now bigger: 0.15 compared to 0.13 previously.
* This suggests that in the previous, more parsimonious regression there were other factors in the error term that explain course evaluation scores and are also correlated with the explanatory variable of interest (beauty). The assumption that E(u|X)=0 therefore might not hold, in which case the previous slope coefficient was biased.

QUESTION 3

The file Growth and Trade.dta contains information on the average annual economic growth rate of GDP and trade shares (defined as exports divided by GDP) for 65 countries.

Construct a scatterplot of the average annual growth rate (growth) on average trade share (tradeshare). Does there appear to be a relationship between the variables?

twoway (scatter growth tradeshare)

From the scatterplot there appears to be an outlier. Identify which country this is using the scatterplot

twoway (scatter growth tradeshare, mlabel(country_name))

Next identify the outlier country using summary statistics

tabstat growth tradeshare, by(country)

Run a regression of growth on tradeshare including the outlier (Malta)

reg growth tradeshare

Run a regression of growth on tradeshare excluding the outlier (Malta). What effect does this have on the estimated slope of the relationship?

reg growth tradeshare if country_name!="Malta"

* excluding the outlier lowers the estimated coefficient.
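To see the two slope estimates next to each other rather than scrolling between two regression outputs, the results can be stored and tabulated. This is a minimal sketch, assuming Growth and Trade.dta is in the current working directory; the names with_malta and without_malta are labels made up here for the stored estimates.

```stata
* Sketch: compare the tradeshare slope with and without the outlier.
* Assumes "Growth and Trade.dta" is in the current working directory.
use "Growth and Trade.dta", clear

* Full sample, including Malta
quietly regress growth tradeshare
estimates store with_malta

* Same regression with Malta dropped (note the straight quotes)
quietly regress growth tradeshare if country_name != "Malta"
estimates store without_malta

* Coefficients and standard errors side by side, plus N and R-squared
estimates table with_malta without_malta, b(%9.4f) se stats(N r2)
```

The N row is a quick check that exactly one observation was excluded from the second regression.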
Graphically fit regression lines to the data with and without the outlier

twoway (lfit growth tradeshare)(lfit growth tradeshare if country_name!="Malta")

* blue line is the estimated relationship with Malta included
* red is the regression line excluding Malta
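Line colors depend on the graph scheme in use, so the blue/red description may not match every installation. One way around this, sketched below under the assumption that Growth and Trade.dta is in the current working directory, is to label the two fitted lines in the legend. The axis titles are wording chosen here, not taken from the dataset, and the /// line continuations mean the command should be run from a do-file.

```stata
* Sketch: scatter plus both fitted lines, labelled in the legend.
* Assumes "Growth and Trade.dta" is in the current working directory.
use "Growth and Trade.dta", clear

* Plot 1 = scatter, plot 2 = fit on all countries, plot 3 = fit without Malta
twoway (scatter growth tradeshare) ///
       (lfit growth tradeshare) ///
       (lfit growth tradeshare if country_name != "Malta"), ///
       legend(order(2 "Fit including Malta" 3 "Fit excluding Malta")) ///
       ytitle("Average annual growth rate") xtitle("Average trade share")
```

Keeping the scatter in the same graph also shows how far Malta sits from the other countries and how much it pulls the full-sample line.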
