A homework on the multiway fixed effect approach
In this homework we are going to consider fictitious grading data at a given university and try to see what we can learn from grade variability and how it affects earnings.
In the first data set we have the grades collected by a group of students within a given semester. Each student attended 2 courses. From this data we are going to try to extract the ability of each student while allowing for a course-specific intercept. We can then use this to evaluate how much of the grade variation is due to student differences versus course differences and other factors (residuals).
Given these ability measures, we then merge a second file which has the earnings of each student at age 35. We then evaluate the effect of academic ability on individual earnings. Here again we will worry about the effect of overfitting.
Of course this requires, as we saw in class, estimating many parameters, hence we will look into overfitting and how to address it! We will make use of sparse matrices, degrees of freedom correction and bias correction.
The two data files you will need are:
 grades: hw4grades.json
 earnings: hw4earnings.json
Useful links:  Sparse linear solver
Explaining the dispersion in grades
Load the grade data from hw4grades.json. Then compute:
 total variance in grades
 the between class variance
 plot the histogram of class size
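The three quantities above can be sketched on a toy table as follows. This is only an illustration: the toy values are made up, the column names are taken from the table shown in this homework, and the choices of `ddof=0` and of a size-weighted between-class variance are assumptions rather than the required definitions.

```python
import pandas as pd

# With the real data one would start from something like
#   df = pd.read_json("hw4grades.json")   # assumed to load as a flat table
# Here we use a small made-up table with the same column names.
df = pd.DataFrame({
    "grade": [1.0, 3.0, 2.0, 4.0, 0.0, 2.0],
    "class_id": ["A", "B", "A", "B", "A", "C"],
    "student_id": [1, 1, 2, 2, 3, 3],
})

# Total variance of grades (population variance, ddof=0).
total_var = df["grade"].var(ddof=0)

# Between-class variance: variance of the class mean attached to each grade,
# which weights each class by its size.
class_mean = df.groupby("class_id")["grade"].transform("mean")
between_var = class_mean.var(ddof=0)

# Class sizes, i.e. the input of the histogram.
class_size = df.groupby("class_id").size()

print(total_var, between_var)
print(class_size)
```

If matplotlib is installed, `class_size.plot.hist()` is one way to draw the histogram of class sizes.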
grade  class_id  student_id  major  firstname  

0  0.843900  GP8471  9  PHYSICAL AND HEALTH EDUCATION TEACHING  Leann 
1  0.926570  IK1731  9  PHYSICAL AND HEALTH EDUCATION TEACHING  Leann 
2  1.695413  GW2045  15  STUDIO ARTS  Marcus 
3  0.038370  ML7772  15  STUDIO ARTS  Marcus 
4  2.129442  BI3547  22  MANAGEMENT INFORMATION SYSTEMS AND STATISTICS  Lonnie 
...  ...  ...  ...  ...  ... 
13459  2.318588  JE6164  49527  PUBLIC POLICY  Lea 
13460  1.458787  GQ0531  49534  PHYSICS  Leonard 
13461  1.560250  IX0276  49534  PHYSICS  Leonard 
13462  2.091195  KX3268  49539  EARLY CHILDHOOD EDUCATION  Addie 
13463  0.105092  BN9468  49539  EARLY CHILDHOOD EDUCATION  Addie 
13464 rows × 5 columns
Constructing the sparse regressor matrices
In a similar fashion to what we covered in class, we want to estimate a two-way fixed effect model of grades. Specifically, we want to fit:
y_{ic} = \alpha_i + \psi_c + \epsilon_{ic}
where i denotes each individual, c denotes each course, and \epsilon_{ic} is an error term that we will assume to be conditionally mean independent of the assignment of students to courses.
We are going to estimate this using least squares. This requires that we construct the matrices that correspond to the equation for y_{ic}. We then want to construct A and J such that
Y = A \alpha + J \psi + \epsilon
where, for n_s students each with n_g grades in different courses and a total of n_c courses, Y is an n_s \cdot n_g \times 1 vector, A is an n_s \cdot n_g \times n_s matrix and J is an n_s \cdot n_g \times n_c matrix. \alpha is a vector of size n_s and \psi is a vector of size n_c.
Each of the n_s \cdot n_g rows corresponds to a grade. In each row, A has a 1 in the column corresponding to the individual of this row. Similarly, J has a 1 in the column corresponding to the class of that row.
So, I ask you to:
 construct these matrices using Python sparse matrices (scipy.sparse.csc.csc_matrix)
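A minimal sketch of how such indicator matrices can be built, assuming students and courses have already been mapped to integer codes 0, 1, 2, ... The helper name `indicator_matrix` and the toy codes are hypothetical, not part of the assignment.

```python
import numpy as np
from scipy.sparse import csc_matrix

def indicator_matrix(codes, n_levels):
    """Sparse (n_rows x n_levels) matrix with a single 1 per row, in column codes[row]."""
    codes = np.asarray(codes)
    rows = np.arange(codes.size)
    data = np.ones(codes.size)
    return csc_matrix((data, (rows, codes)), shape=(codes.size, n_levels))

# Toy example: 3 students with 2 grades each, spread over 3 courses.
student_code = np.array([0, 0, 1, 1, 2, 2])
course_code = np.array([0, 1, 1, 2, 2, 0])

A = indicator_matrix(student_code, 3)   # one column per student
J = indicator_matrix(course_code, 3)    # one column per course

print(A.toarray())
print(J.toarray())
```

With the real data, integer codes for `student_id` and `class_id` could be obtained with `pandas.factorize`, for example.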
Estimating the model
Next we estimate our model using the OLS estimator formula. We first remove the last column of J (since the model we wrote does not pin down a constant, we force the last course to have \psi=0). Solve the linear system using the formula
\hat{\gamma} = (M'M)^{-1} M'Y
where M = [A,J] and \gamma = (\alpha,\psi).
So do the following:
 drop the last column simply by doing J = J[:, :(nc-1)]
 use scipy.sparse.hstack to concatenate the matrices to create M
 use scipy.sparse.linalg.spsolve to solve a sparse linear system
 extract \hat{\alpha} from \hat{\gamma} by selecting the first n_s terms
 merge \hat{\alpha} into df_all
 compute the variance of \hat{\alpha} in df_all
 compute the variance of the residuals
 What share of the total variation in grades can be attributed to differences in students?
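The steps above can be sketched on simulated data as follows. Everything about the toy design is an assumption made for illustration: 200 students, 20 courses, each student taking two neighbouring courses so that all courses are connected, and a noise standard deviation of 0.5.

```python
import numpy as np
from scipy.sparse import csc_matrix, hstack
from scipy.sparse.linalg import spsolve

rng = np.random.default_rng(0)
ns, nc = 200, 20                                   # toy sizes (assumed)
student = np.repeat(np.arange(ns), 2)              # two grades per student
course = np.column_stack([np.arange(ns) % nc,
                          (np.arange(ns) + 1) % nc]).ravel()
N = student.size

# Simulated grades: student effect + course effect + noise.
y = rng.normal(size=ns)[student] + rng.normal(size=nc)[course] \
    + 0.5 * rng.normal(size=N)

rows, ones = np.arange(N), np.ones(N)
A = csc_matrix((ones, (rows, student)), shape=(N, ns))
J = csc_matrix((ones, (rows, course)), shape=(N, nc))

J = J[:, :nc - 1]                                  # drop the last course column
M = hstack([A, J], format="csc")

# Solve the normal equations (M'M) gamma = M'y.
gamma_hat = spsolve((M.T @ M).tocsc(), M.T @ y)
alpha_hat = gamma_hat[:ns]
resid = y - M @ gamma_hat

var_alpha = alpha_hat[student].var()               # variance across grade rows
var_resid = resid.var()
share = var_alpha / y.var()
print(var_alpha, var_resid, share)
```

Solving the normal equations with `spsolve` avoids forming the inverse of M'M, which matters once the number of students is in the thousands.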
grade  class_id  student_id  major  firstname  alpha_hat  

0  0.843900  GP8471  9  PHYSICAL AND HEALTH EDUCATION TEACHING  Leann  1.523048 
1  0.926570  IK1731  9  PHYSICAL AND HEALTH EDUCATION TEACHING  Leann  1.523048 
2  1.695413  GW2045  15  STUDIO ARTS  Marcus  0.668487 
3  0.038370  ML7772  15  STUDIO ARTS  Marcus  0.668487 
4  2.129442  BI3547  22  MANAGEMENT INFORMATION SYSTEMS AND STATISTICS  Lonnie  2.119907 
...  ...  ...  ...  ...  ...  ... 
13459  2.318588  JE6164  49527  PUBLIC POLICY  Lea  1.358326 
13460  1.458787  GQ0531  49534  PHYSICS  Leonard  0.248203 
13461  1.560250  IX0276  49534  PHYSICS  Leonard  0.248203 
13462  2.091195  KX3268  49539  EARLY CHILDHOOD EDUCATION  Addie  1.255938 
13463  0.105092  BN9468  49539  EARLY CHILDHOOD EDUCATION  Addie  1.255938 
13464 rows × 6 columns
A simple evaluation of our estimator
To see what we are dealing with, we are simply going to re-simulate using our estimated parameters, then rerun our estimation and compare the new results to the previous ones. This is in the spirit of a bootstrap exercise, only we will just do it once this time.
Please do:
 create Y_2 = M \hat{\gamma} + \hat{\sigma}_r E where E is a vector of draws from a standard normal.
 estimate \hat{\gamma}_2 from Y_2
 report the new variance terms and compare them to the previously estimated ones
 comment on the results (note that because of the seed and ordering, your numbers don't have to match mine exactly)
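A sketch of this re-simulation on toy data, under the same assumed design as before (200 students, 20 courses, two neighbouring courses per student). The helper `fit` is a hypothetical name introduced only for this example.

```python
import numpy as np
from scipy.sparse import csc_matrix, hstack
from scipy.sparse.linalg import spsolve

rng = np.random.default_rng(0)
ns, nc = 200, 20                                   # toy sizes (assumed)
student = np.repeat(np.arange(ns), 2)
course = np.column_stack([np.arange(ns) % nc,
                          (np.arange(ns) + 1) % nc]).ravel()
N = student.size
y = rng.normal(size=ns)[student] + rng.normal(size=nc)[course] \
    + 0.5 * rng.normal(size=N)

rows, ones = np.arange(N), np.ones(N)
A = csc_matrix((ones, (rows, student)), shape=(N, ns))
J = csc_matrix((ones, (rows, course)), shape=(N, nc))[:, :nc - 1]
M = hstack([A, J], format="csc")

def fit(M, y):
    """Least-squares fit via the normal equations; returns (gamma, residuals)."""
    gamma = spsolve((M.T @ M).tocsc(), M.T @ y)
    return gamma, y - M @ gamma

# First pass on the original outcome.
gamma_hat, resid = fit(M, y)

# Re-simulate with the estimated parameters and the raw residual std, then refit.
y2 = M @ gamma_hat + resid.std() * rng.normal(size=N)
gamma_hat2, resid2 = fit(M, y2)

print("Var(alpha_hat):", gamma_hat[:ns].var(), "->", gamma_hat2[:ns].var())
print("Var(residual): ", resid.var(), "->", resid2.var())
```

In this toy design more than half of the degrees of freedom are used up by dummies, so the residual variance of the second pass comes out well below the value used to simulate it.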
Notice how even the variance of the residual has shrunk? Now is the time to remember STATS 101. We have all heard about the degrees of freedom correction! Indeed, we should correct our raw variance estimates to account for the fact that we have estimated a bunch of dummies. Usually we use n/(n-1) because we only estimate one mean. Here, however, we have estimated n_s + n_c - 1 means! Hence we should use
\hat{\sigma}^2_r = \frac{N}{N - n_s - n_c + 1} \hat{Var}(\hat{\epsilon})
where N = n_s \cdot n_g is the total number of grades.
Please do:
 compute this variance corrected for degrees of freedom using your recomputed residuals
 compare this variance to the variance you estimated in Question 3
 what does this suggest about your estimates in Q3?
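As an illustration, the correction amounts to dividing the residual sum of squares by N minus the number of estimated parameters instead of by N. The sketch below uses an assumed toy design (200 students, 20 courses, true noise variance 0.25); `dof_corrected_variance` is a hypothetical helper name.

```python
import numpy as np
from scipy.sparse import csc_matrix, hstack
from scipy.sparse.linalg import spsolve

def dof_corrected_variance(resid, n_params):
    """Residual variance dividing by (N - number of estimated parameters).

    Assumes the residuals have mean zero, which holds here because the
    student dummies span the constant.
    """
    return resid @ resid / (resid.size - n_params)

rng = np.random.default_rng(0)
ns, nc = 200, 20                                   # toy sizes (assumed)
student = np.repeat(np.arange(ns), 2)
course = np.column_stack([np.arange(ns) % nc,
                          (np.arange(ns) + 1) % nc]).ravel()
N = student.size
y = rng.normal(size=ns)[student] + rng.normal(size=nc)[course] \
    + 0.5 * rng.normal(size=N)                     # true noise variance: 0.25

rows, ones = np.arange(N), np.ones(N)
A = csc_matrix((ones, (rows, student)), shape=(N, ns))
J = csc_matrix((ones, (rows, course)), shape=(N, nc))[:, :nc - 1]
M = hstack([A, J], format="csc")

gamma_hat = spsolve((M.T @ M).tocsc(), M.T @ y)
resid = y - M @ gamma_hat

K = ns + nc - 1                                    # number of estimated means
raw = resid.var()
corrected = dof_corrected_variance(resid, K)
print("raw:", raw, "corrected:", corrected)
```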
Evaluate impact of academic measure on earnings
In this section we load a separate data set that contains, for each student, their earnings at age 35. We are interested in the effect of \alpha on earnings.
Do the following:
 load the earnings data listed in the intro
 merge \alpha into the data
 regress earnings on \alpha.
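A sketch of the merge and of the regression on made-up data. The column names `student_id`, `earnings` and `alpha_hat` follow the tables in this homework; the toy values, the slope of 0.2 and the frame names `df_alpha` and `df_earn` are assumptions for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 500

# One row per student with the estimated ability. With the real df_all
# (two rows per student) one could first keep a single row per student,
# e.g. df_all[["student_id", "alpha_hat"]].drop_duplicates().
df_alpha = pd.DataFrame({"student_id": np.arange(n),
                         "alpha_hat": rng.normal(size=n)})

# Toy earnings file, stored in a different row order.
ids = rng.permutation(n)
df_earn = pd.DataFrame({
    "student_id": ids,
    "earnings": 1.0 + 0.2 * df_alpha["alpha_hat"].to_numpy()[ids]
                + 0.5 * rng.normal(size=n),
})

# Merge ability into the earnings data on the student identifier.
df = df_earn.merge(df_alpha, on="student_id", how="left",
                   validate="one_to_one")

# OLS slope of earnings on alpha_hat (with an intercept): cov(x, y) / var(x).
x = df["alpha_hat"].to_numpy()
e = df["earnings"].to_numpy()
beta_hat = np.cov(x, e)[0, 1] / x.var(ddof=1)
print(beta_hat)
```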
student_id  major  firstname  earnings  alpha_hat  

0  9  PHYSICAL AND HEALTH EDUCATION TEACHING  Leann  1.498308  1.523048 
1  15  STUDIO ARTS  Marcus  2.908033  0.668487 
2  22  MANAGEMENT INFORMATION SYSTEMS AND STATISTICS  Lonnie  0.933555  2.119907 
3  24  MISCELLANEOUS FINE ARTS  Seamus  0.913896  0.703119 
4  35  MEDICAL ASSISTING SERVICES  Wilbert  0.545178  1.874107 
...  ...  ...  ...  ...  ... 
6727  49520  ANIMAL SCIENCES  Amos  1.723242  1.208074 
6728  49525  COMMERCIAL ART AND GRAPHIC DESIGN  Isaac  1.264385  0.518734 
6729  49527  PUBLIC POLICY  Lea  0.368899  1.358326 
6730  49534  PHYSICS  Leonard  1.244166  0.248203 
6731  49539  EARLY CHILDHOOD EDUCATION  Addie  1.735586  1.255938 
6732 rows × 5 columns
Bias correction - construct the Q matrix
We want to apply bias correction to refine our results. As we have seen in class, we can directly compute the bias of the expression of interest under the assumption of homoskedastic errors, and hence we get the following expression for the bias for any Q matrix:
B = \frac{\sigma^2}{N} \text{Tr}[ ( M'M )^{-1} Q]
When computing the variance of the measured ability of the student, we simply use a diagonal matrix on \gamma which selects only the ability part and removes the average.
Do:
 1. Construct such a Q matrix.
 2. Check that \gamma Q \gamma' = \hat{Var}(\hat{\alpha}).
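One possible construction, shown on a tiny example. It assumes that the variance is taken across students with equal weights and `ddof=0` (which coincides with the variance across grade rows when every student has the same number of grades), and it folds the 1/n_s scaling into Q. The helper name `ability_variance_Q` is hypothetical.

```python
import numpy as np

def ability_variance_Q(ns, n_params):
    """Q such that gamma @ Q @ gamma is the ddof=0 variance of the first ns entries.

    The top-left block demeans the student effects and averages their squares;
    the rows and columns for the course effects are left at zero.
    """
    Q = np.zeros((n_params, n_params))
    Q[:ns, :ns] = (np.eye(ns) - np.ones((ns, ns)) / ns) / ns
    return Q

rng = np.random.default_rng(0)
ns, nc = 5, 3                                  # toy sizes (assumed)
gamma = rng.normal(size=ns + nc - 1)           # stand-in for the estimated gamma

Q = ability_variance_Q(ns, gamma.size)
quad_form = gamma @ Q @ gamma
print(quad_form, gamma[:ns].var())
```

A dense Q is convenient for checking the identity on a small example. With several thousand students it becomes large, so it may be preferable to exploit its structure instead of materialising it.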
Bias correction - Variance share
We are now finally in a position to compute our bias. We have our matrix Q. Now we also need the variance of the residual! Given what we have learned in Question 4, we definitely want to use the formula with the degrees of freedom correction.
 Compute \sigma^2_r with the degrees of freedom correction
 Invert M'M using scipy.sparse.linalg
 Compute B = \frac{\sigma^2}{N} \text{Tr}[ ( M'M )^{-1} Q] using np.trace
 Remove this from the original estimate to get the share of variance explained by student differences!
Note that inverting a matrix takes far longer than solving a linear system. You might need to be patient here!
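The steps above can be sketched on toy data as follows. One assumption deserves emphasis: here Q is the unnormalised demeaning block, so that \gamma Q \gamma' / N is the variance of the student effects across grade rows, which is what makes the 1/N factor in B appropriate. If your Q already includes that scaling, the division by N should be dropped. The toy design (200 students, 20 courses, two grades each) is also assumed.

```python
import numpy as np
from scipy.sparse import csc_matrix, hstack
from scipy.sparse.linalg import spsolve, inv

rng = np.random.default_rng(0)
ns, nc, ng = 200, 20, 2                            # toy sizes (assumed)
student = np.repeat(np.arange(ns), ng)
course = np.column_stack([np.arange(ns) % nc,
                          (np.arange(ns) + 1) % nc]).ravel()
N = student.size
y = rng.normal(size=ns)[student] + rng.normal(size=nc)[course] \
    + 0.5 * rng.normal(size=N)

rows, ones = np.arange(N), np.ones(N)
A = csc_matrix((ones, (rows, student)), shape=(N, ns))
J = csc_matrix((ones, (rows, course)), shape=(N, nc))[:, :nc - 1]
M = hstack([A, J], format="csc")
K = M.shape[1]

MtM = (M.T @ M).tocsc()
gamma_hat = spsolve(MtM, M.T @ y)
resid = y - M @ gamma_hat

# Residual variance with the degrees of freedom correction.
sigma2 = resid @ resid / (N - K)

# Unnormalised Q: gamma' Q gamma / N is the variance of alpha over grade rows
# (valid here because every student has exactly ng grades).
Q = np.zeros((K, K))
Q[:ns, :ns] = ng * (np.eye(ns) - np.ones((ns, ns)) / ns)

# Invert M'M (the slow step) and compute the bias term.
MtM_inv = inv(MtM).toarray()
B = sigma2 / N * np.trace(MtM_inv @ Q)

var_alpha_raw = gamma_hat @ Q @ gamma_hat / N
var_alpha_bc = var_alpha_raw - B
print(var_alpha_raw, B, var_alpha_bc, var_alpha_bc / y.var())
```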
Bias correction - Regression coefficient
Finally, we look back at our regression of earnings on estimated academic ability. We have seen in class that when the regressor has measurement error, the estimate suffers from attenuation bias. We now know exactly how much of the variance is noise thanks to our bias correction.
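The attenuation bias is given by:
\hat{\beta} \rightarrow \beta \cdot \frac{Var(\alpha)}{Var(\alpha) + B}
We then decide to compute a correction for our regression using our estimated B. This means computing a corrected parameter as follows:
\hat{\beta}^{BC} = \hat{\beta}^{Q5} \cdot \frac{\hat{Var}(\hat{\alpha})}{\hat{Var}(\hat{\alpha}) - B}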
Do:
 compute the corrected \hat{\beta}^{BC}
 FYI, the true \beta I used to simulate the data was 0.2. Is your final parameter far from it? Is it economically different from the \hat{\beta}^{Q5} we got in Question 5?
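A small simulation can illustrate the correction. The sketch assumes the classical measurement error rescaling, where the raw slope is multiplied by Var(\hat{\alpha}) / (Var(\hat{\alpha}) - B), and it treats the noise variance B as known. The function name `bias_corrected_beta` and all numerical values are made up for the example.

```python
import numpy as np

def bias_corrected_beta(beta_hat, var_alpha_hat, B):
    """Undo attenuation: rescale by total variance over signal variance."""
    return beta_hat * var_alpha_hat / (var_alpha_hat - B)

rng = np.random.default_rng(2)
n, beta, B = 20000, 0.2, 0.5                   # toy values (assumed)

alpha = rng.normal(size=n)                              # true ability, variance 1
alpha_hat = alpha + np.sqrt(B) * rng.normal(size=n)     # noisy measure
earnings = beta * alpha + 0.5 * rng.normal(size=n)

# Raw OLS slope of earnings on the noisy measure (attenuated towards zero).
var_alpha_hat = alpha_hat.var(ddof=1)
beta_hat = np.cov(alpha_hat, earnings)[0, 1] / var_alpha_hat

beta_bc = bias_corrected_beta(beta_hat, var_alpha_hat, B)
print("raw:", beta_hat, "corrected:", beta_bc)
```

In the homework, B is itself an estimate, so the corrected coefficient carries additional sampling noise that this sketch ignores.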
Conclusion
I hope you have learned about the pitfalls of overfitting in this assignment! There are many and they can drastically affect the results and the conclusion of an empirical analysis.
This is the end of the class. I hope you enjoyed it and that you learned a thing or two. Have a nice summer!