OLS as a review

We are going to use the OLS (ordinary least square) to clarify a few terms.

Unless you are doing experimental economics, most projects start with a data-set and a question. Let's then consider a simple data set as $(X_i,Y_i)$ for $i \in (1,...,n)$ . These are two variables and $n$ observations.

A simple question that can directly answer is what is the average $X$ in the sample. This does not require any statistics, it only requires taking a mean: $\bar{X}_n = \frac{1}{n} \sum_i X_i$ . Objects like $\bar{X}_n$ , which are functions of the sample are called estimators.

In general, we are interested in questions that involve values that we didn't directly observe. For instance we might be interested in average value of $Y$ conditional on some given fixed value of $X$ , or we might be interested in a the average value of $X$ in the all U.S. but we only got a sample of size $10,000$ . To make the object of interest precise, we then define a population, a distribution where the sample is coming from.

Population, models and estimands

The population is a construct that allows us to precisely define the objects we are interested in recovering from our data. In concrete terms it will be represent the joint distribution from which the data/sample is drawn from.

In its most simple form, we could write down our population as the joint distribution of the data $f(X_1,Y_1,...,X_n,Y_n)$ . Using this population I can for instance define the average of $X$ in the population, ie $\bar{X} = \mathbb{E}_F[X]$ . $\bar{X} = \mathbb{E}_F[X]$ is different from $\bar{X}_n = \frac{1}{n} \sum_i X_i$ since it defined using the population and not using the sample. Object constructed on the population will be refered to as estimands. One can then start to think about how well we can learn about to learn about $\bar{X}$ from the sample $(X_i,Y_i)$ .

Often however, we want to include variables that are not be observed. For instance we might want to control for some unobserved factors. Indeed I might be interested in the effect of changing $X$ while keeping some characteristics of each individual $i$ fixed, let's call such characteritic $U_i$ . In this case I would define my population as $f(X_1,Y_1,U_1,...,X_n,Y_n,U_n)$ .

In general we will start with a class of such distributions which is indexed by some parameter $\theta$ . We will think of this class of distribution as our model. The parameter space can be finite dimensional in which case we will call it a parametric model, or it could be infinite and we will call it non parametric.

The first interesting point to note that in this new population $\mathbb{E}_F[Y|X=x']$ and $\mathbb{E}_F \big[ \mathbb{E}_F[Y|X=x',U] \big]$ might be quite different objects. Take $Y$ as income and $X$ as college degree, then the first expression asks the difference in income between people with a college degree, and people without a college degree. The second expression asks what is the average effect of changing the degree of each individuals.

To be precise, also imposing iid across $i$ , we defined:

$\begin{aligned} \mathbb{E}_F[Y|X=x'] & = \int Y f_{Y|X}(y,x') \text{d}y \\ \mathbb{E}_F \big[ \mathbb{E}_F[Y|X=x',U] \big] & = \int \Big( \int Y f_{Y|X,U}(y,x',u) \text{d}y \Big) f_U(u) \text{d} u \\ \end{aligned}$

The second important observation is that there might be different $f(X_1,Y_1,U_1,...,X_n,Y_n,U_n)$ distributions that deliver a given $f(X_1,Y_1,...,X_n,Y_n)$ . It is then said that such augmented distributions are observationally equivalent. To the extent that we have 2 such distributions, that would generate the same exact data, but would give us two different values of our parameter of interest, then we would be in trouble.

An example of a model is the linear conditional expectation where we specify

$Y_i = X_i \beta + U_i$

as well as the joint distribution $f(X_1,U_1,...,X_n,U_n)$ where we would probably be interested in $\beta$ . We see if given $\beta$ and $f(X_1,U_1,...,X_n,U_n)$ one knows the joint $f(X_1,Y_1,...,X_n,Y_n)$ . The reverse will require additional assumptions! This leads to our next paragraph.

Identification

An important first step once the model has been defined and an estimand of interest has been expressed is to ask the question whether the observed part of the population $f(X_1,Y_1,...,X_n,Y_n)$ together with the structure we imposed in the model allow recovering a unique estimand. When this is the case, we say that the parameter of interest, or the estimand of interest is identified.

In other words, being identified refers to the ability to construct the estimand using observed data in the context where the full distribution is given to you.

Let's take the linear case again, the assumption that $Y_i = X_i \beta + U_i$ is for instance not sufficient to construct $\beta$ from $f(X_1,Y_1,...,X_n,Y_n)$ .

Let's make an additional familiar assumption, let's assume that $(X_i,Y_i,U_i)$ are indenpendent across $i$ and drawn from a joint where $U_i$ and $X_i$ are conditional mean independent. Hence we impose further that $\mathbb{E}[U_i | X_i ]=0$ and that $\mathbb{E} XX'$ is invertible.

In this case, let's show that $\beta$ is identified. We can show identification by directly constructing $\beta$ from $f(X_1,Y_1,...,X_n,Y_n)$ . Indeed:

$\begin{aligned} \left(\mathbb{E}XX'\right)^{-1}\mathbb{E}XY & = \left(\mathbb{E}XX'\right)^{-1}\mathbb{E}XX'\beta+\left(\mathbb{E}XX'\right)^{-1}\mathbb{E}X U \\ & = \beta+0 \end{aligned}$

Note here that $\mathbb{E}XX'$ is a $k \times k$ matrix.

Estimators

As stated at the begining, an estimator is a function of sample, and as such it is a random object. They are often written either with a hat or with a n $n$ subcript. In the case where the data is given by a vector $Y_n$ of size $n$ and a matrix $X_n$ of size $n \times k$ , the OLS estimator is given by

$\beta_n^\text{ols} = (X_n' X_n)^{-1} X_n Y_n$

Finite sample properties of estimators

Unbiasedness

This is the property that $\mathbb{E} [ \beta_n | X_n] = \beta$ . We can check this is true for the OLS estimator under the assumptions we stated before:

$\begin{aligned} \mathbb{E}[ \beta_{n} | X_n] & =\mathbb{E} [ \left(X_{n}'X_{n}\right)^{-1}X_{n}'Y_{n} | X_n]\\ & =\mathbb{E}[ \left(X_{n}'X_{n}\right)^{-1}X_{n}'(X_{n}\beta + U_{n}) | X_n ] \\ & =\mathbb{E}[ \left(X_{n}'X_{n}\right)^{-1}X_{n}'X_{n}\beta| X_n] + \mathbb{E}[ \left(X_{n}'X_{n}\right)^{-1}X_{n}'U_{n} | X_n] \\ & =\beta \end{aligned}$

Since this is true conditional on $X_n$ , it will also be true unconditionaly.

Finite sample distribution

Let's assume further that $U_n | X_n$ is Normally and independently distributed. In other words:

$U_n | X_n \sim \mathcal{N}(0, \sigma^2_u I_n)$

Then we have that

$\beta_n - \beta = \left(X_{n}'X_{n}\right)^{-1}X_{n}'U_{n}$

and condition on $X_n$ since $U_n$ is normaly distributed, any linear combination is also Normaly distributed. We can then compute the variance covariance matrix of the joint Normal distribution:

$\begin{aligned} \text{Var}( \left(X_{n}'X_{n}\right)^{-1}X_{n}'U_{n} | X_n) &= \left( X_n'X_n \right)^{-1}X'_{n} \; \mathbb{E} [ U_n U_n' | X_n ] \; X_n \left( X_{n}'X_{n}\right)^{-1} \\ &= \sigma_u^2 \left( X_n'X_n \right)^{-1} \end{aligned}$

In addition note that

$X_n'X_n = \sum_i X_i X_i' = n \hat{Var}(X_i)$

So we end up with the following expression for the finite sample distribution of the estimator of $\beta$ :

$\beta_n | X_n \sim \mathcal{N}\left( \beta , \frac{\sigma^2_u}{n} \hat{Var}(X_i)^{-1} \right)$

Let's make a couple of remarks:

we notice that as $n \rightarrow \infty$ , indeed $\beta_n$ concentrates on $\beta.
we also notice that it looks like copy pasting the sample reduces the the variance. Why is that not true?

Asymptotic properties of estimators

Very often, we would rather not have to make the Normality assumption on the error directly. Instead it is common to try to rely on results based on large samples and build on top of the central limit theorem.

If we can specify a sequence of population $f_n(Y_1,X_1,...,Y_n,X_n)$ we can start thinking about deriving properties of estimators in the limit where $n$ grows large.

A common way to generate such a sequence of population is again to assume that observations are iid across $i$ and drawn from $f(Y,X)$ .

We then look at two important properties.

Consistency

An estimator is consistent if $\beta_n \rightarrow \beta$ in probability as $n \rightarrow \infty$ .

We look again at the OLS estimator:

$\begin{aligned} \beta_{n} &= \left(X_{n}'X_{n}\right)^{-1}X_{n}'Y_{n} \\ &=\left(X_{n}'X_{n}\right)^{-1}X_{n}'X_{n}\beta +\left(X_{n}'X_{n}\right)^{-1}X_{n}'U_{n} \\ &=\beta + \left(X_{n}'X_{n}\right)^{-1}X_{n}'U_{n} \\ &=\beta + \left(\frac{1}{n}X_{n}'X_{n}\right)^{-1}\left(\frac{1}{n}X_{n}'U_{n}\right) \\ \end{aligned}$

and hence we get that

$\text{plim}\beta_{n}=\beta+\left(\text{plim}\frac{1}{n}X_{n}'X_{n}\right)^{-1}\left(\text{plim}\frac{1}{n}X_{n}'U_{n}\right)$

where $\text{plim}\frac{1}{n}X_{n}'U_{n}=\text{plim}\frac{1}{n}\sum_{i}X_{i} U_{i}=\mathbb{E} X_i U_i=0$ (under existence of these limits).

Asymptotic distribution

We conclude with the asymptotic distribution of $\beta_n$ . We consider

$\begin{aligned} \sqrt{n}(\beta_{n}-\beta) &=\sqrt{n}\left(X_{n}'X_{n}\right)^{-1}X_{n}'U_{n} \\ &=\left(\frac{1}{n}X_{n}'X_{n}\right)^{-1}\left(\frac{1}{\sqrt{n}}X_{n}'U_{n}\right) \end{aligned}$

we have that $\frac{1}{n}X_{n}'X_{n}\rightarrow\mathbb{E} XX'$ . Now we should look at the second term. By the central limit theorem we will converge to

$\frac{1}{\sqrt{n}}X_{n}'U_{n}=\frac{1}{\sqrt{n}}\sum_{i}X_{i} U_{i}\overset{d}{\rightarrow}\mathcal{N}(0,\mathbb{E} X_{i}U_{i}U_{i}'X_{i}')$

if we are in an iid case then $\mathbb{E}\left[U_i U_i'|X_i\right] = \sigma^{2}_u$ and so

$\sqrt{n}(\beta_{n}-\beta)\overset{d}{\rightarrow}\mathcal{N}(0,\sigma^{2}\left(\mathbb{E} X_iX_i'\right)^{-1})$

Confidence intervals

Beyond point estimates of parameters, we are also interested in forming confidence intervals on parameters $\beta$ . A $1-\alpha$ confidence interval is a combination of two estimators $a_{n},c_{n}$ (function of the sample) such that

$P(\beta\in[a_{n},c_{n}])\geq1-\alpha$

where $beta$ is fixed and $a_{n},c_{n}$ are the random variables. See the example for a normally distributed estimator.

We will often consider asymptotic confidence intervals where we will replace the inequality with a probability limit:

$P(\beta \in[a_{n},c_{n}])\rightarrow 1-\alpha$

Delta method

There are some cases where we are interested in constructing inference on functions of estimates. Consider the case where

$\sqrt{n} ( \beta_n - \beta ) \overset{d}{\rightarrow} \mathcal{N}(0, \Sigma)$

And that we are interested in the limiting distribution of $h(\beta_n)$ for some function $h$ .

We write

$h(\beta_n) \simeq h(\beta) + \nabla h(\beta)' ( \beta_n - \beta )$

which allows us to express the variance of $h(\beta_n)$ as

$\begin{aligned} Var(h(\beta_n)) &\simeq Var( h(\beta) + \nabla h(\beta)' ( \beta_n - \beta ) ) \\ & = Var( \nabla h(\beta)' \beta_n ) \\ & = \nabla h(\beta)' Var( \beta_n ) \nabla h(\beta)\\ & = \nabla h(\beta)' \Sigma \nabla h(\beta)\\ \end{aligned}$

and so we get that

$\sqrt{n} ( h(\beta_n) - h(\beta) ) \overset{d}{\rightarrow} \mathcal{N}(0, \nabla h(\beta)' \Sigma \nabla h(\beta) )$

Clustered standard errors

Please refer to the review paper by Cameron and Miller for more details. The following is a summary of the first part of the paper. Here we will consider Molton forumal, but things go beyond this simple consideration. We use the same notations as before, but for simplicity we suppose that we only have one regressor.

We recall that

$\beta_n = \sum x_i y_i / \sum X_i^2 = \beta + \sum x_i u_i / \sum x_i^2$

Hence in genenral we get that

$Var[\beta_n] = Var[ \sum x_i u_i] / \Big( \sum x_i^2 \Big)^2$

In the simplest case where the errors are uncorrelated across $i$ and homoskedastic, we get $Var[\beta_n] = \sigma^2 / \Big( \sum x_i^2 \Big)^2$ . If instead errors are heteroskedastic we get

$Var[\beta_n] = \Big(\sum x_i^2 \mathbb{E}[u_i^2|x_i]) / \Big( \sum x_i^2 \Big)^2$

Where we could construct an estimator using $\hat{u}_i$ :

$\hat{Var}[\beta_n] = \Big(\sum x_i^2 \hat{u}_i^2]) / \Big( \sum x_i^2 \Big)^2$

Finally what if the errors are corrolated across $i$ ? In the most general case:

$\begin{aligned} V\Big[\sum x_i u_i\Big] &= \sum_i\sum_j Cov[x_i u_i , x_j u_j] \\ &= \sum_i\sum_j x_i x_j \mathbb{E}[u_i u_j] \end{aligned}$

Simply replacing with $\hat{u}_i$ would unfortunately gives $0$ disrectly since $\sum_i x_i \hat{u}_i = 0$ . Instead then we are going to assume that in the population there are known / observed groups such that we allow correlation within cluster, but we assume not correlation between clusters. Then we can compute:

$\begin{aligned} V\Big[\sum x_i u_i\Big] &= \sum_i\sum_j x_i x_j \hat{u}_i \hat{u}_j] 1[i,j\text{ in same cluster}] \\ \end{aligned}$

Clustered errors

Let's assume that their are $g$ clusters and we still write:

$Y_{ig} = X_{ig}'\beta + u_{ig}$

And we assume that $\mathbb{E}[u_{ig}|x_{ig}] = 0$ , and we assume in additio that for $g \neq g'$ :

$\mathbb{E}[ u_{ig} u_{jg'} | x_{ig}, u_{jg'} ] = 0$

Moulton (1990) considered the case where $Cor[u_{ig},u_{jg}] = \rho_u$ and within correlation of the regressor is also written as $\rho_x$ , and $N_g$ is the average size of a cluster. Then the non-clustered variance estimator should be scaled by

$\tau \simeq 1 + \rho_x \rho_u ( N_g - 1 )$

The variance inflation factor, or the Moulton factor is increasing with:

within cluster correlation of the regressor
within cluster correlation of the error
number of observations in each cluster (because really it is between cluster that matters)

In an influential paper, Moulton (1990) pointed out that in many settings the inflation factor $\tau$ can be large even if $\rho_u$ is small. He considered a log earnings regression using March CPS data (𝑁 = 18,946), regressors aggregated at the state level (𝐺 = 49), and errors correlated within state ( $\rho_u= 0.032$ ) . The average group size was 18,946/49 = 387 , $\rho_x= 1$ for a state-level regressor, so the expression yields $\tau = 1 + 1 × 0.032 × 386 = 13.3$ . The weak correlation of errors within state was still enough to lead to cluster-corrected standard errors being $\sqrt{13.3} = 3.7$ times larger than the (incorrect) default standard errors!

Choosing where to cluster

TBD

Bootstrap

Computing confidence intervals and critical values can be tedious when using asymptotic formulation. If we could draw directly from the population we could conduct a Monte-Carlo exercise and recover the distribution of the estimator. In this section we consider such an approach by sampling from the available sample. Considering a given sample $Y_{1}..Y_{n}$ , there are two main re-sampling approach. The first is to re-sample $n$ elements from $(Y_{1}..Y_{n})$ with replacement, the second is to sample $m<n$ from $(Y_{1}..Y_{n})$ without replacement. In both approaches the goal is generate draws from a distribution that reassembles as much as possible to the population distribution.

The theory behind the bootstrap

The data is assumed to be independent draws from $F_{0}(x)=F_{0}(x,\theta_{0})$ and we consider a statistic $T_{n}=T_{n}(X_{1}..X_{n})$ . The distribution of $T_{n}$ is denoted by $G_{n}=G_{n}(t,F_{0})=P_{0}[T_{n}\leq t]$ . Asymptotic theory relies on $G_{\infty}$ , instead the bootstrap relies on plugging in an estimate of $F_{0}$ and uses $G_{n}(\cdot,F_{n})$ . Taking $B$ samples with replacement from $F_{n}$ , computing $T_{n,b}$ in each, we can construct

$\hat{G}_{n}(t,F_{n})=\frac{1}{B}\sum_{b}\mathbf{1}[T_{n,b}\leq t]$

then what we need for the bootstrap procedure to be asymptotically valid is that

$G_{n}(t,F_{n})\overset{p}{\rightarrow}G_{n}(t,F_{0})$

uniformly in $t$ . This requires smoothness in $F_{0}$ as well as in $G_{n}(\cdot,\cdot)$ and consistency of $F_{n}$ for $F_{0}$ . In general we get that if we have $\sqrt{n}$ asymptotic convergence to $G_{\infty}$ , then both $G_{n}(t,F_{0})$ and $G_{n}(t,F_{n})$ do so and so they are also close to each other:

$G_{n}(t,F_{0})=G_{n}(t,F_{n})+O(N^{-1/2})$

which provides no gain when compared to asymptotic standard error besides the simplicity of the computation.

Parametric Bootstrap

Note that the goal is to approximate $F_{0}$ and hence $F_{n}$ is a good candidate however one can use $F(\cdot,\theta_{n})$ where $\theta_{n}$ is a consistent estimator of $\theta_{0}$ . This is referred to as the parametric bootstrap.

It can be shown that in the case where $T_{n}$ is asymptotically pivotal, meaning that is does not depend on the parameters, then the bootstrap achieves:

$G_{n}(t,F_{0})=G_{n}(t,F_{n})+O(N^{-1})$

The idea here is that one can get a better approximation of the finite sample distribution. At every N, $G_{n}(t,F_{n})$ is closer to $G_{n}(t,F_{0})$ than $G_{\infty}(t,F_{0})$ . This can be shown using the Edgeworth expansion which expands $G_{n}(z)$ as a function of $n^{-\frac{1}{2}}$ .

$\begin{aligned} G_{n}(t,F_{n})-G_{n}(t,F_{0}) & =\left[G_{\infty}(t,F_{n})-G_{\infty}(t,F_{0})\right]\\ & +\frac{1}{\sqrt{n}}\left[g_{1}(t,F_{n})-g_{1}(t,F_{0})\right]+O(n^{-1}) \end{aligned}$

and then $G_{\infty}(t,F_{n})-G_{\infty}(t,F_{0})=0$ if $T_{n}$ is asymptotically pivotal, and $g_{1}(t,F_{n})-g_{1}(t,F_{0})=O(n^{-1/2})$ delivering an overall $O(n^{-1})$ .

Failure of bootstrap:

One example of the failure even when the estimator is asymptotic normal is the nearest neighbor estimator (Abadie and Imbens 2008). It is shown that the variance of the bootstrap is either too small or too large. Another example is the estimation of the median.

Bias correction using bootstrap

The bootstrap can be used to correct for the bias of an estimator. In many applications the exact form of the bias $\mathbb{E}_{0}(\theta_{n}-\theta_{0})$ is not known, however if we consider $\bar{\theta}_{n}^{*}$ , the expectation across bootstraps replications, then it gives us an estimate of the bias. We can then consider a bias-corrected estimate $\theta_{n}^{BR}=\theta_{n}-(\bar{\theta}_{n}^{*}-\theta_{n})$ .

Non iid samples

There will cases where the data is not exactly iid. For instance there might be weak spatial correlation. In this case, one might want to bootstrap by resampling clusters of data to replicate the dependence.

Probability refresher

Central limit theorem: given a sequence of iid random variables $(X_1, X_2, ...)$ with $\mathbb{E}X = \mu$ and $Var[X_i]=\sigma^2 < \infty$ , define $S_n = 1/n ( X_1 + ... + X_n)$ , then:

$\sqrt{n} ( S_n - \mu) \overset{d}{\rightarrow} \mathcal{N}(0,\sigma^2)$

Law of large numbers: for the same sequence $S_n \overset{p}{\rightarrow} \mu$

Probability limit: given a sequence $X_1,X_2,...$ we say that $X_n$ converges in probability to $X$ , and write $X_n \overset{p}{\rightarrow} X$ if for every $\epsilon$ we have that

$\lim_{n \rightarrow \infty} Pr \Big[ |X_n - X| \geq \epsilon \Big] = 0$

Convergence in distribution: we write $X_n \overset{d}{\rightarrow} X$ iff $P(X_n \leq x) \rightarrow P(X \leq x)$ for all $x$ .

Expressing matrix products: Let's look at $X_n' X_n$ . We have defined $X_n$ as an $n \times k$ matrix where each row correspond to indivual $i$ regressors $x_i$ which is $k \times 1$ . Hence the element in row $i$ and column j of $X_n$ is $[X_n]_{ij} = [x_i]_j$ , the $j$ component of the regressors of individual $i$ .

Looking at the matrix $C = X_n' X_n$ , by definition of the matrix mulitplication, the elements of $C$ are

$\begin{aligned} [C]_{pq} & = \sum_i [X'_n]_{pi} \cdot [X_n]_{iq} \\ & = \sum_i [X_n]_{ip} \cdot [X_n]_{iq} \\ & = \sum_i [x_i]_{p} \cdot [x_i]_{q} \\ & = \sum_i \sum_{l=1}^1 [ x_i]_{pl} \cdot [x_i]_{ql} \\ & = \sum_i \sum_{l=1}^1 [ x_i]_{pl} \cdot [x'_i]_{lq} \\ & = \sum_i \Big[ x_i x'_i \Big]_{pq} \end{aligned}$

where we recognized the matrix multiplication $D_i = x_i x_i'$ which is $k \times k$ and has for elements $[D_i]_{pq} = [x_i]_p [x_i]_q$ . Hence we do get that $C = \sum_i D_i$ and :

$\begin{aligned} X_n' X_n & = C \\ & = \sum_i D_i \\ & = \sum_i x_i x'_i \\ \end{aligned}$

We used this often in proofs to express the limits as averages.