OLS as a review

We are going to use OLS (ordinary least squares) to clarify a few terms.

Unless you are doing experimental economics, most projects start with a dataset and a question. Let's then consider a simple dataset (X_i,Y_i) for i \in \{1,...,n\}: two variables and n observations.

A simple question we can directly answer is: what is the average of X in the sample? This does not require any statistics; it only requires taking a mean: \bar{X}_n = \frac{1}{n} \sum_i X_i. Objects like \bar{X}_n, which are functions of the sample, are called estimators.

In general, we are interested in questions that involve values that we didn't directly observe. For instance we might be interested in the average value of Y conditional on some given fixed value of X, or we might be interested in the average value of X in the entire U.S. when we only have a sample of size 10,000. To make the object of interest precise, we then define a population: the distribution from which the sample is drawn.

Population, models and estimands

The population is a construct that allows us to precisely define the objects we are interested in recovering from our data. In concrete terms it will represent the joint distribution from which the data/sample is drawn.

In its simplest form, we could write down our population as the joint distribution of the data f(X_1,Y_1,...,X_n,Y_n). Using this population I can for instance define the average of X in the population, i.e. \bar{X} = \mathbb{E}_F[X]. This \bar{X} = \mathbb{E}_F[X] is different from \bar{X}_n = \frac{1}{n} \sum_i X_i since it is defined using the population and not the sample. Objects constructed from the population will be referred to as estimands. One can then start to think about how well we can learn about \bar{X} from the sample (X_i,Y_i).

Often however, we want to include variables that are not observed. For instance we might want to control for some unobserved factors. Indeed I might be interested in the effect of changing X while keeping some characteristic of each individual i fixed; let's call this characteristic U_i. In this case I would define my population as f(X_1,Y_1,U_1,...,X_n,Y_n,U_n).

In general we will start with a class of such distributions indexed by some parameter \theta. We will think of this class of distributions as our model. The parameter space can be finite dimensional, in which case we call it a parametric model, or infinite dimensional, in which case we call it nonparametric.

The first interesting point to note is that in this new population \mathbb{E}_F[Y|X=x'] and \mathbb{E}_F \big[ \mathbb{E}_F[Y|X=x',U] \big] might be quite different objects. Take Y as income and X as having a college degree: the first expression compares the average income of people with a college degree to that of people without one, while the second asks for the average effect of changing each individual's degree, holding their characteristics U fixed.

To be precise, also imposing iid across i, we define:

\begin{aligned} \mathbb{E}_F[Y|X=x'] & = \int y \, f_{Y|X}(y|x') \text{d}y \\ \mathbb{E}_F \big[ \mathbb{E}_F[Y|X=x',U] \big] & = \int \Big( \int y \, f_{Y|X,U}(y|x',u) \text{d}y \Big) f_U(u) \text{d} u \\ \end{aligned}

The second important observation is that different distributions f(X_1,Y_1,U_1,...,X_n,Y_n,U_n) might deliver the same f(X_1,Y_1,...,X_n,Y_n). Such augmented distributions are then said to be observationally equivalent. If two such distributions generate exactly the same data but imply two different values of our parameter of interest, we are in trouble.

An example of a model is the linear conditional expectation where we specify

Y_i = X_i \beta + U_i

as well as the joint distribution f(X_1,U_1,...,X_n,U_n), where we would typically be interested in \beta. Note that given \beta and f(X_1,U_1,...,X_n,U_n), one knows the joint f(X_1,Y_1,...,X_n,Y_n). The reverse requires additional assumptions! This leads to our next paragraph.


Identification

An important first step, once the model has been defined and an estimand of interest has been expressed, is to ask whether the observed part of the population f(X_1,Y_1,...,X_n,Y_n), together with the structure imposed by the model, allows recovering a unique value of the estimand. When this is the case, we say that the parameter of interest, or the estimand of interest, is identified.

In other words, being identified refers to the ability to construct the estimand using observed data in the context where the full distribution is given to you.

Let's take the linear case again: the assumption that Y_i = X_i \beta + U_i is, for instance, not sufficient to construct \beta from f(X_1,Y_1,...,X_n,Y_n).

Let's make an additional familiar assumption: assume that (X_i,Y_i,U_i) are independent across i and drawn from a joint distribution where U_i is conditionally mean independent of X_i, i.e. \mathbb{E}[U_i | X_i ]=0, and where \mathbb{E} XX' is invertible.

In this case, let's show that \beta is identified. We can show identification by directly constructing \beta from f(X_1,Y_1,...,X_n,Y_n). Indeed:

\begin{aligned} \left(\mathbb{E}XX'\right)^{-1}\mathbb{E}XY & = \left(\mathbb{E}XX'\right)^{-1}\mathbb{E}XX'\beta+\left(\mathbb{E}XX'\right)^{-1}\mathbb{E}X U \\ & = \beta+0 \end{aligned}

Note here that \mathbb{E}XX' is a k \times k matrix.


As stated at the beginning, an estimator is a function of the sample, and as such it is a random object. Estimators are often written either with a hat or with an n subscript. In the case where the data is given by a vector Y_n of size n and a matrix X_n of size n \times k, the OLS estimator is given by

\beta_n^\text{ols} = (X_n' X_n)^{-1} X_n' Y_n
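To make the formula concrete, here is a minimal numpy sketch (the sample size, the true \beta, and the distributions are illustrative assumptions, not part of the notes):

```python
import numpy as np

rng = np.random.default_rng(0)

n, k = 1_000, 2
beta = np.array([1.0, -0.5])          # true parameter (illustrative choice)

X = rng.normal(size=(n, k))           # n x k regressor matrix
U = rng.normal(size=n)                # errors with E[U|X] = 0
Y = X @ beta + U

# beta_ols = (X'X)^{-1} X'Y, computed via a linear solve for numerical stability
beta_ols = np.linalg.solve(X.T @ X, X.T @ Y)
print(beta_ols)                       # close to [1.0, -0.5]
```

Using `np.linalg.solve` rather than forming the inverse explicitly is the standard numerically stable way to evaluate (X_n'X_n)^{-1}X_n'Y_n.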

Finite sample properties of estimators


Unbiasedness

This is the property that \mathbb{E} [ \beta_n | X_n] = \beta. We can check that it holds for the OLS estimator under the assumptions stated before:

\begin{aligned} \mathbb{E}[ \beta_{n} | X_n] & =\mathbb{E} [ \left(X_{n}'X_{n}\right)^{-1}X_{n}'Y_{n} | X_n]\\ & =\mathbb{E}[ \left(X_{n}'X_{n}\right)^{-1}X_{n}'(X_{n}\beta + U_{n}) | X_n ] \\ & =\mathbb{E}[ \left(X_{n}'X_{n}\right)^{-1}X_{n}'X_{n}\beta| X_n] + \mathbb{E}[ \left(X_{n}'X_{n}\right)^{-1}X_{n}'U_{n} | X_n] \\ & =\beta \end{aligned}

Since this is true conditional on X_n, it is also true unconditionally.
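We can illustrate unbiasedness with a small Monte-Carlo exercise (a sketch with an illustrative one-regressor design; the true \beta, sample size, and number of replications are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 50, 2_000
beta = 2.0
x = rng.normal(size=(reps, n))        # a fresh sample in each replication
u = rng.normal(size=(reps, n))        # errors with E[u|x] = 0
y = x * beta + u

# one-regressor OLS in each replication: beta_hat = sum(x*y) / sum(x^2)
beta_hats = (x * y).sum(axis=1) / (x * x).sum(axis=1)
print(beta_hats.mean())               # close to 2.0: the estimator is unbiased
```

Each individual \beta_n is noisy, but averaging across replications recovers \beta, which is exactly what \mathbb{E}[\beta_n | X_n] = \beta says.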

Finite sample distribution

Let's assume further that U_n | X_n is Normally and independently distributed. In other words:

U_n | X_n \sim \mathcal{N}(0, \sigma^2_u I_n)

Then we have that

\beta_n - \beta = \left(X_{n}'X_{n}\right)^{-1}X_{n}'U_{n}

and, conditional on X_n, since U_n is Normally distributed, any linear combination of it is also Normally distributed. We can then compute the variance-covariance matrix of the joint Normal distribution:

\begin{aligned} \text{Var}( \left(X_{n}'X_{n}\right)^{-1}X_{n}'U_{n} | X_n) &= \left( X_n'X_n \right)^{-1}X'_{n} \; \mathbb{E} [ U_n U_n' | X_n ] \; X_n \left( X_{n}'X_{n}\right)^{-1} \\ &= \sigma_u^2 \left( X_n'X_n \right)^{-1} \end{aligned}

In addition note that

X_n'X_n = \sum_i X_i X_i' = n \hat{Var}(X_i)

where \hat{Var}(X_i) here denotes the uncentered empirical second moment \frac{1}{n}\sum_i X_i X_i'.

So we end up with the following expression for the finite sample distribution of the estimator of \beta:

\beta_n | X_n \sim \mathcal{N}\left( \beta , \frac{\sigma^2_u}{n} \hat{Var}(X_i)^{-1} \right)

Let's make a couple of remarks:

  • we notice that as n \rightarrow \infty, indeed \beta_n concentrates on \beta.
  • we also notice that it looks like copy-pasting the sample reduces the variance. Why is that not true?
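The copy-pasting puzzle can be checked numerically. The sketch below (with an illustrative one-regressor design) duplicates a sample and recomputes the usual standard-error formula:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
x = rng.normal(size=n)
y = 1.5 * x + rng.normal(size=n)

def ols_with_se(x, y):
    # one-regressor OLS with the textbook homoskedastic standard error
    b = (x * y).sum() / (x * x).sum()
    resid = y - b * x
    s2 = (resid * resid).sum() / (len(x) - 1)
    return b, np.sqrt(s2 / (x * x).sum())

b1, se1 = ols_with_se(x, y)
b2, se2 = ols_with_se(np.tile(x, 2), np.tile(y, 2))

# The point estimate is unchanged, but the formula reports a smaller standard
# error. That reduction is spurious: the duplicated errors are perfectly
# correlated, which violates the independence assumption behind the formula.
print(b1, b2, se1, se2)
```

The reported variance shrinks mechanically because the formula assumes independent draws; duplicating the sample adds no independent information, so the true sampling variance of the estimator is unchanged.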

Asymptotic properties of estimators

Very often, we would rather not have to make the Normality assumption on the error directly. Instead it is common to try to rely on results based on large samples and build on top of the central limit theorem.

If we can specify a sequence of populations f_n(Y_1,X_1,...,Y_n,X_n), we can start thinking about deriving properties of estimators in the limit where n grows large.

A common way to generate such a sequence of populations is again to assume that observations are iid across i and drawn from f(Y,X).

We then look at two important properties.


Consistency

An estimator is consistent if \beta_n \rightarrow \beta in probability as n \rightarrow \infty.

We look again at the OLS estimator:

\begin{aligned} \beta_{n} &= \left(X_{n}'X_{n}\right)^{-1}X_{n}'Y_{n} \\ &=\left(X_{n}'X_{n}\right)^{-1}X_{n}'X_{n}\beta +\left(X_{n}'X_{n}\right)^{-1}X_{n}'U_{n} \\ &=\beta + \left(X_{n}'X_{n}\right)^{-1}X_{n}'U_{n} \\ &=\beta + \left(\frac{1}{n}X_{n}'X_{n}\right)^{-1}\left(\frac{1}{n}X_{n}'U_{n}\right) \\ \end{aligned}

and hence we get that

\text{plim}\,\beta_{n} = \beta + \left(\text{plim}\frac{1}{n}X_{n}'X_{n}\right)^{-1}\left(\text{plim}\frac{1}{n}X_{n}'U_{n}\right) = \beta

where \text{plim}\frac{1}{n}X_{n}'U_{n}=\text{plim}\frac{1}{n}\sum_{i}X_{i} U_{i}=\mathbb{E} X_i U_i=0 (under existence of these limits).
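Consistency can be visualized by comparing the estimator's average error at two sample sizes (an illustrative one-regressor simulation; the sizes and replication count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
beta = 0.7
mean_abs_err = {}
for n in [100, 10_000]:
    draws = []
    for _ in range(200):
        x = rng.normal(size=n)
        y = beta * x + rng.normal(size=n)
        draws.append((x * y).sum() / (x * x).sum())
    # average distance of beta_n from beta across replications
    mean_abs_err[n] = float(np.mean(np.abs(np.array(draws) - beta)))
print(mean_abs_err)   # the error shrinks roughly like 1/sqrt(n)
```

The factor-100 increase in n cuts the typical error by roughly a factor of 10, consistent with the \sqrt{n} rate that the next section makes precise.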

Asymptotic distribution

We conclude with the asymptotic distribution of \beta_n. We consider

\begin{aligned} \sqrt{n}(\beta_{n}-\beta) &=\sqrt{n}\left(X_{n}'X_{n}\right)^{-1}X_{n}'U_{n} \\ &=\left(\frac{1}{n}X_{n}'X_{n}\right)^{-1}\left(\frac{1}{\sqrt{n}}X_{n}'U_{n}\right) \end{aligned}

we have that \frac{1}{n}X_{n}'X_{n}\rightarrow\mathbb{E} XX'. Now we look at the second term; by the central limit theorem,

\frac{1}{\sqrt{n}}X_{n}'U_{n}=\frac{1}{\sqrt{n}}\sum_{i}X_{i} U_{i}\overset{d}{\rightarrow}\mathcal{N}(0,\mathbb{E} X_{i}U_{i}U_{i}'X_{i}')

if we are in an iid case with homoskedastic errors, i.e. \mathbb{E}\left[U_i^2|X_i\right] = \sigma^{2}_u, then \mathbb{E} X_{i}U_{i}U_{i}'X_{i}' = \sigma^{2}_u \, \mathbb{E} X_i X_i' and so

\sqrt{n}(\beta_{n}-\beta)\overset{d}{\rightarrow}\mathcal{N}(0,\sigma^{2}_u\left(\mathbb{E} X_iX_i'\right)^{-1})

Confidence intervals

Beyond point estimates of parameters, we are also interested in forming confidence intervals on parameters \beta. A 1-\alpha confidence interval is a combination of two estimators a_{n},c_{n} (functions of the sample) such that

P(\beta \in[a_{n},c_{n}]) \geq 1-\alpha

where \beta is fixed and a_{n},c_{n} are the random variables. See the example for a normally distributed estimator.

We will often consider asymptotic confidence intervals where we will replace the inequality with a probability limit:

P(\beta \in[a_{n},c_{n}])\rightarrow 1-\alpha
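A coverage check makes the definition concrete. The sketch below (illustrative one-regressor design, nominal 95% level) builds the usual asymptotic interval \beta_n \pm 1.96 \, se in each replication and counts how often it contains the true \beta:

```python
import numpy as np

rng = np.random.default_rng(4)
n, reps, beta = 200, 2_000, 1.0
z = 1.96                               # 97.5% Normal quantile, for alpha = 0.05
covered = 0
for _ in range(reps):
    x = rng.normal(size=n)
    y = beta * x + rng.normal(size=n)
    b = (x * y).sum() / (x * x).sum()
    u_hat = y - b * x
    # asymptotic standard error: sqrt(sigma_hat^2 / sum x_i^2)
    se = np.sqrt((u_hat ** 2).mean() / (x ** 2).sum())
    covered += (b - z * se <= beta <= b + z * se)
coverage = covered / reps
print(coverage)                        # close to the nominal 0.95
```

The empirical coverage hovers around 0.95, which is exactly the statement P(\beta \in [a_n, c_n]) \rightarrow 1-\alpha.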

Delta method

There are some cases where we are interested in constructing inference on functions of estimates. Consider the case where

\sqrt{n} ( \beta_n - \beta ) \overset{d}{\rightarrow} \mathcal{N}(0, \Sigma)

And that we are interested in the limiting distribution of h(\beta_n) for some function h.

We write

h(\beta_n) \simeq h(\beta) + \nabla h(\beta)' ( \beta_n - \beta )

which allows us to express the variance of h(\beta_n) as

\begin{aligned} Var(h(\beta_n)) &\simeq Var( h(\beta) + \nabla h(\beta)' ( \beta_n - \beta ) ) \\ & = Var( \nabla h(\beta)' \beta_n ) \\ & = \nabla h(\beta)' Var( \beta_n ) \nabla h(\beta)\\ & = \nabla h(\beta)' \Sigma \nabla h(\beta)\\ \end{aligned}

and so we get that

\sqrt{n} ( h(\beta_n) - h(\beta) ) \overset{d}{\rightarrow} \mathcal{N}(0, \nabla h(\beta)' \Sigma \nabla h(\beta) )
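The delta-method variance can be checked against simulation for a simple nonlinear map such as h(b) = e^b (the parameter values and the choice of h are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(5)
beta, sigma = 0.5, 0.2
# stand-in draws for beta_n, assumed approximately N(beta, sigma^2)
draws = rng.normal(beta, sigma, size=100_000)

# Delta method for h(b) = exp(b): Var(h(beta_n)) ~= h'(beta)^2 * Var(beta_n)
delta_var = (np.exp(beta) ** 2) * sigma ** 2
sim_var = np.exp(draws).var()
print(delta_var, sim_var)              # the two variances are close
```

The two numbers agree to within a few percent; the residual gap is the higher-order curvature of h that the first-order expansion ignores.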

Clustered standard errors

Please refer to the review paper by Cameron and Miller for more details; the following is a summary of the first part of the paper. Here we will consider the Moulton formula, but things go beyond this simple setting. We use the same notation as before, but for simplicity we suppose that we only have one regressor.

We recall that

\beta_n = \sum x_i y_i / \sum x_i^2 = \beta + \sum x_i u_i / \sum x_i^2

Hence in general we get that

Var[\beta_n] = Var[ \sum x_i u_i] / \Big( \sum x_i^2 \Big)^2

In the simplest case where the errors are uncorrelated across i and homoskedastic, we get Var[\beta_n] = \sigma^2 / \sum x_i^2. If instead errors are heteroskedastic we get

Var[\beta_n] = \Big(\sum x_i^2 \, \mathbb{E}[u_i^2|x_i]\Big) / \Big( \sum x_i^2 \Big)^2

where we could construct an estimator using the residuals \hat{u}_i:

\hat{Var}[\beta_n] = \Big(\sum x_i^2 \hat{u}_i^2\Big) / \Big( \sum x_i^2 \Big)^2
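This plug-in estimator is easy to compute. The sketch below (illustrative design where the error scale grows with |x|) compares it to the homoskedastic formula:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 2_000
x = rng.normal(size=n)
u = rng.normal(size=n) * (1 + np.abs(x))   # heteroskedastic: scale grows with |x|
y = 1.0 * x + u

b = (x * y).sum() / (x * x).sum()
u_hat = y - b * x

# Heteroskedasticity-robust variance: (sum x_i^2 u_hat_i^2) / (sum x_i^2)^2
var_robust = (x ** 2 * u_hat ** 2).sum() / (x ** 2).sum() ** 2
# Homoskedastic formula for comparison: sigma_hat^2 / sum x_i^2
var_homo = (u_hat ** 2).mean() / (x ** 2).sum()
print(var_robust, var_homo)                # robust variance is larger here
```

Because large |x| observations also have large errors in this design, the homoskedastic formula understates the variance, while the robust estimator picks up the x_i^2 u_i^2 interaction.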

Finally, what if the errors are correlated across i? In the most general case:

\begin{aligned} V\Big[\sum x_i u_i\Big] &= \sum_i\sum_j Cov[x_i u_i , x_j u_j] \\ &= \sum_i\sum_j x_i x_j \mathbb{E}[u_i u_j] \end{aligned}

Simply replacing \mathbb{E}[u_i u_j] with \hat{u}_i \hat{u}_j would unfortunately give 0 directly, since \sum_i x_i \hat{u}_i = 0. Instead we are going to assume that in the population there are known/observed groups such that correlation is allowed within clusters, but there is no correlation between clusters. Then we can compute:

\begin{aligned} \hat{V}\Big[\sum x_i u_i\Big] &= \sum_i\sum_j x_i x_j \hat{u}_i \hat{u}_j \, 1[i,j\text{ in same cluster}] \\ \end{aligned}
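A cluster-robust variance along these lines can be sketched as follows (illustrative design: a cluster-level regressor and a common within-cluster error component, so the naive iid formula is badly wrong):

```python
import numpy as np

rng = np.random.default_rng(7)
G, m = 50, 20                          # 50 clusters of 20 observations each
cluster = np.repeat(np.arange(G), m)
x = np.repeat(rng.normal(size=G), m)   # regressor constant within cluster (rho_x = 1)
u = np.repeat(rng.normal(size=G), m) + rng.normal(size=G * m)  # cluster + idiosyncratic
y = 1.0 * x + u

b = (x * y).sum() / (x * x).sum()
u_hat = y - b * x

# Cluster-robust variance: keep only within-cluster cross products of x_i u_hat_i
num = sum((x[cluster == g] @ u_hat[cluster == g]) ** 2 for g in range(G))
var_cluster = num / (x * x).sum() ** 2

# Naive iid formula, for comparison
var_iid = (u_hat ** 2).mean() / (x * x).sum()
print(var_cluster / var_iid)           # large: the naive formula understates the variance
```

The ratio printed is essentially the Moulton inflation factor discussed next: with \rho_x = 1 and a within-cluster error correlation around 0.5, the naive variance is too small by roughly 1 + 0.5 (m-1).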

Clustered errors

Let's assume that there are G clusters and we still write:

Y_{ig} = X_{ig}'\beta + u_{ig}

And we assume that \mathbb{E}[u_{ig}|x_{ig}] = 0, and in addition that for g \neq g':

\mathbb{E}[ u_{ig} u_{jg'} | x_{ig}, x_{jg'} ] = 0

Moulton (1990) considered the case where Corr[u_{ig},u_{jg}] = \rho_u, the within-cluster correlation of the regressor is written \rho_x, and N_g is the average size of a cluster. Then the non-clustered variance estimator should be scaled by

\tau \simeq 1 + \rho_x \rho_u ( N_g - 1 )

The variance inflation factor, or the Moulton factor is increasing with:

  • within cluster correlation of the regressor
  • within cluster correlation of the error
  • number of observations in each cluster (because the effective number of independent observations is closer to the number of clusters)

In an influential paper, Moulton (1990) pointed out that in many settings the inflation factor \tau can be large even if \rho_u is small. He considered a log earnings regression using March CPS data (N = 18,946), regressors aggregated at the state level (G = 49), and errors correlated within state (\rho_u = 0.032). The average group size was 18,946/49 \approx 387, and \rho_x = 1 for a state-level regressor, so the expression yields \tau = 1 + 1 \times 0.032 \times 386 = 13.3. The weak correlation of errors within state was still enough for cluster-corrected standard errors to be \sqrt{13.3} = 3.7 times larger than the (incorrect) default standard errors!
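The arithmetic of the Moulton example is worth redoing once by hand:

```python
# Moulton's CPS example: tau = 1 + rho_x * rho_u * (N_g - 1)
rho_x, rho_u = 1.0, 0.032
N_g = 18_946 / 49                      # average cluster (state) size, ~387
tau = 1 + rho_x * rho_u * (N_g - 1)
print(tau, tau ** 0.5)                 # ~13.3 and ~3.7
```

Even a tiny \rho_u is multiplied by the cluster size minus one, which is what makes the factor so large.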

Choosing where to cluster



The bootstrap

Computing confidence intervals and critical values can be tedious when using asymptotic formulas. If we could draw directly from the population we could conduct a Monte-Carlo exercise and recover the distribution of the estimator. In this section we consider such an approach based on sampling from the available sample. Given a sample Y_{1}..Y_{n}, there are two main re-sampling approaches: the first is to re-sample n elements from (Y_{1}..Y_{n}) with replacement; the second is to sample m<n elements from (Y_{1}..Y_{n}) without replacement. In both approaches the goal is to generate draws from a distribution that resembles the population distribution as closely as possible.

The theory behind the bootstrap

The data is assumed to be independent draws from F_{0}(x)=F_{0}(x,\theta_{0}) and we consider a statistic T_{n}=T_{n}(X_{1}..X_{n}). The distribution of T_{n} is denoted G_{n}=G_{n}(t,F_{0})=P_{0}[T_{n}\leq t]. Asymptotic theory relies on G_{\infty}; the bootstrap instead plugs in an estimate of F_{0} and uses G_{n}(\cdot,F_{n}). Taking B samples with replacement from F_{n} and computing T_{n,b} in each, we can construct

\hat{G}_{n}(t,F_{n})=\frac{1}{B}\sum_{b}\mathbf{1}[T_{n,b}\leq t]

then what we need for the bootstrap procedure to be asymptotically valid is that

G_{n}(t,F_{n})-G_{n}(t,F_{0}) \overset{p}{\rightarrow} 0

uniformly in t. This requires smoothness in F_{0} as well as in G_{n}(\cdot,\cdot), and consistency of F_{n} for F_{0}. In general, if we have \sqrt{n} asymptotic convergence to G_{\infty}, then both G_{n}(t,F_{0}) and G_{n}(t,F_{n}) converge at that rate, and so they are also close to each other:

G_{n}(t,F_{n})-G_{n}(t,F_{0}) = O_{p}(n^{-1/2})

which provides no gain when compared to asymptotic standard errors besides the simplicity of the computation.
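The plug-in procedure above is short to implement. A minimal sketch for the sample mean of a skewed distribution (the data-generating process and B are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(8)
x = rng.exponential(size=200)          # skewed sample; true mean is 1
theta_n = x.mean()

B = 2_000
# B resamples of size n, drawn with replacement from the empirical distribution F_n
boot = np.array([rng.choice(x, size=x.size, replace=True).mean()
                 for _ in range(B)])

# Percentile-style 95% interval read off the bootstrap distribution G_hat
lo, hi = np.quantile(boot, [0.025, 0.975])
print(theta_n, (lo, hi))
```

The empirical quantiles of `boot` play the role of \hat{G}_{n}(t,F_{n}): nothing beyond resampling and recomputing the statistic is required.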

Parametric Bootstrap

Note that the goal is to approximate F_{0}, and hence F_{n} is a good candidate; however one can instead use F(\cdot,\theta_{n}) where \theta_{n} is a consistent estimator of \theta_{0}. This is referred to as the parametric bootstrap.

Asymptotic refinement

It can be shown that in the case where T_{n} is asymptotically pivotal, meaning that its limiting distribution does not depend on unknown parameters, then the bootstrap achieves:

G_{n}(t,F_{n})-G_{n}(t,F_{0}) = O(n^{-1})

The idea here is that one can get a better approximation of the finite sample distribution: at every n, G_{n}(t,F_{n}) is closer to G_{n}(t,F_{0}) than G_{\infty}(t,F_{0}) is. This can be shown using the Edgeworth expansion, which expands G_{n}(t) in powers of n^{-\frac{1}{2}}:

\begin{aligned} G_{n}(t,F_{n})-G_{n}(t,F_{0}) & =\left[G_{\infty}(t,F_{n})-G_{\infty}(t,F_{0})\right]\\ & +\frac{1}{\sqrt{n}}\left[g_{1}(t,F_{n})-g_{1}(t,F_{0})\right]+O(n^{-1}) \end{aligned}

and then G_{\infty}(t,F_{n})-G_{\infty}(t,F_{0})=0 if T_{n} is asymptotically pivotal, and g_{1}(t,F_{n})-g_{1}(t,F_{0})=O(n^{-1/2}) delivering an overall O(n^{-1}).

Failure of the bootstrap

One example of failure, even when the estimator is asymptotically Normal, is the nearest-neighbor matching estimator (Abadie and Imbens 2008): the variance of the bootstrap can be shown to be either too small or too large. Another example is the estimation of the median.

Bias correction using bootstrap

The bootstrap can be used to correct the bias of an estimator. In many applications the exact form of the bias \mathbb{E}_{0}(\theta_{n}-\theta_{0}) is not known; however, \bar{\theta}_{n}^{*}, the expectation of the estimator across bootstrap replications, gives us an estimate \bar{\theta}_{n}^{*}-\theta_{n} of the bias. We can then consider the bias-corrected estimate \theta_{n}^{BC}=\theta_{n}-(\bar{\theta}_{n}^{*}-\theta_{n}).
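A sketch of this correction, using the deliberately biased divide-by-n variance estimator as the target (the sample and B are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(9)
x = rng.normal(size=30)

# Deliberately biased estimator of the variance (divides by n, not n-1)
theta_n = x.var()

B = 5_000
# recompute the biased estimator on B bootstrap resamples
boot = np.array([rng.choice(x, size=x.size, replace=True).var()
                 for _ in range(B)])

bias_hat = boot.mean() - theta_n            # bootstrap estimate of the bias
theta_bc = theta_n - bias_hat               # bias-corrected estimate
print(theta_n, theta_bc)
```

Since the divide-by-n estimator is biased downward, the bootstrap bias estimate is negative and the corrected value is pushed upward, in the direction of the usual n-1 correction.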

Non iid samples

There will be cases where the data is not exactly iid. For instance there might be weak spatial correlation. In this case, one might want to bootstrap by resampling clusters of data to replicate the dependence.

Probability refresher

Central limit theorem: given a sequence of iid random variables (X_1, X_2, ...) with \mathbb{E}X_i = \mu and Var[X_i]=\sigma^2 < \infty, define S_n = \frac{1}{n} ( X_1 + ... + X_n); then:

\sqrt{n} ( S_n - \mu) \overset{d}{\rightarrow} \mathcal{N}(0,\sigma^2)

Law of large numbers: for the same sequence S_n \overset{p}{\rightarrow} \mu

Probability limit: given a sequence X_1,X_2,... we say that X_n converges in probability to X, and write X_n \overset{p}{\rightarrow} X, if for every \epsilon > 0 we have that

\lim_{n \rightarrow \infty} Pr \Big[ |X_n - X| \geq \epsilon \Big] = 0

Convergence in distribution: we write X_n \overset{d}{\rightarrow} X iff P(X_n \leq x) \rightarrow P(X \leq x) for all x.

Expressing matrix products: let's look at X_n' X_n. We defined X_n as an n \times k matrix where each row corresponds to individual i's regressors x_i, which is k \times 1. Hence the element in row i and column j of X_n is [X_n]_{ij} = [x_i]_j, the j-th component of the regressors of individual i.

Looking at the matrix C = X_n' X_n, by definition of matrix multiplication, the elements of C are

\begin{aligned} [C]_{pq} & = \sum_i [X'_n]_{pi} \cdot [X_n]_{iq} \\ & = \sum_i [X_n]_{ip} \cdot [X_n]_{iq} \\ & = \sum_i [x_i]_{p} \cdot [x_i]_{q} \\ & = \sum_i \sum_{l=1}^1 [ x_i]_{pl} \cdot [x_i]_{ql} \\ & = \sum_i \sum_{l=1}^1 [ x_i]_{pl} \cdot [x'_i]_{lq} \\ & = \sum_i \Big[ x_i x'_i \Big]_{pq} \end{aligned}

where we recognized the matrix multiplication D_i = x_i x_i' which is k \times k and has for elements [D_i]_{pq} = [x_i]_p [x_i]_q. Hence we do get that C = \sum_i D_i and :

\begin{aligned} X_n' X_n & = C \\ & = \sum_i D_i \\ & = \sum_i x_i x'_i \\ \end{aligned}

We will use this often in proofs to express limits as averages.
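The identity is easy to verify numerically (dimensions here are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(10)
n, k = 5, 3
X = rng.normal(size=(n, k))

lhs = X.T @ X                                      # the k x k matrix X_n' X_n
rhs = sum(np.outer(X[i], X[i]) for i in range(n))  # sum of outer products x_i x_i'
print(np.allclose(lhs, rhs))                       # True
```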