Welcome. In the previous lecture we introduced the concept of endogeneity. In this lecture we will see how serious this problem really is. We have discussed three main causes of endogeneity: omitted variables, strategic behavior by people in a market, and the presence of measurement error in an explanatory variable. All three lead to a correlation between the explanatory variables X and the unexplained part of the econometric model, epsilon. This violates the standard assumptions underlying OLS estimation. But how bad is this? Let's reconsider the measurement error example where salary, denoted by y, depends on intelligence, denoted by x*. However, in practice we cannot observe intelligence and can only get a noisy measurement, say an IQ score. The noisy measurement is denoted by x. As an illustration, we will use hypothetical data, where we randomly generate intelligence, x*, and generate y as one plus two times the true intelligence x*, plus an error term u. The IQ score x is generated as x* plus noise. Here you see a scatter of y against intelligence x*. In this new graph, I add a scatter of y versus the IQ score x using blue squares. Note that only the x-values are different, such that the points move horizontally. In practice, we would only have these blue squares as data. The OLS regression through this cloud of points is given by this blue line. However, this is not the line we want to have: the true effect of intelligence on salary is stronger! This can clearly be seen from the regression line through the red pluses. This red line shows the true effect we would like to estimate! If we use noisy x variables, we obtain the wrong coefficients. This also holds under endogeneity in general. Can we say anything about the sign of the difference between the true and the estimated effect in case of measurement error? Please think about this by answering this test question. Under measurement error, OLS is biased towards zero.
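To make the example concrete, here is a minimal simulation sketch of this design. The unit variances for intelligence, the model error, and the measurement noise are my assumptions; the lecture's data set may use different values.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200                       # sample size (illustrative choice)
beta0, beta1 = 1.0, 2.0       # y = 1 + 2 * x_star + u, as in the lecture

x_star = rng.normal(0.0, 1.0, n)         # true intelligence (unit variance assumed)
u = rng.normal(0.0, 1.0, n)              # model error
y = beta0 + beta1 * x_star + u
x = x_star + rng.normal(0.0, 1.0, n)     # noisy IQ score: x* plus noise

def ols_slope(regressor, y):
    """Slope of the OLS regression of y on a constant and the regressor."""
    return np.cov(regressor, y, ddof=1)[0, 1] / np.var(regressor, ddof=1)

slope_true = ols_slope(x_star, y)   # close to the true value 2
slope_noisy = ols_slope(x, y)       # attenuated toward zero
print(slope_true, slope_noisy)
```

Fitting the line through the blue squares (noisy x) gives a clearly smaller slope than fitting it through the red pluses (true x*), which is exactly the flattening described above.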
The estimated line is not steep enough. We can intuitively argue this by considering how measurement error moves the noise-free observations. Recall that the points only moved horizontally. As a result, points on the left of the scatter are likely due to negative measurement errors, while points on the right are likely due to positive errors. In other words, measurement error stretches the cloud of points horizontally. This results in a flatter regression line. This example illustrates that OLS is biased under endogeneity. However, we only looked at one particular data set with a small number of observations. Would it help to have more data points or different data sets? Let's consider what happens if we repeat the same experiment many times and for differently sized data sets. For each repetition, we generate a new data set and of course get different estimates. We summarize all obtained estimates in a smoothed histogram. Remember that the true value of the coefficient used to generate the data equals two. This first graph shows the distribution of the OLS slope estimator when we have 20 observations. The distribution shows that OLS tends to produce estimates between a half and one. The center of this distribution is far away from the true value of two. In fact, in none of the repetitions did we get close to two. Things do not get much better for 100 observations: the center of the distribution is still far away from two. Increasing the sample size to 10,000 also does not help; the distribution converges to a point mass at one. These results illustrate that, even for a very large data set, we do not get close to the correct value of the slope parameter. Things are very different if we had the noise-free explanatory variable available. All the graphs that have appeared on the right-hand side are obtained by regressing y on the generated noise-free variable. In the context of our example, we act as if we can perfectly measure intelligence.
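The repeated experiment can be sketched as a small Monte Carlo study. This is not the lecture's exact code; the number of repetitions and the unit variances are my assumptions, but the design (true slope two, x* plus noise) follows the text.

```python
import numpy as np

rng = np.random.default_rng(1)
true_beta = 2.0
reps = 200                      # number of Monte Carlo repetitions (my choice)

def slopes(n, rng):
    """One repetition: OLS slopes with the noisy and the noise-free regressor."""
    x_star = rng.normal(size=n)              # true intelligence
    y = 1.0 + true_beta * x_star + rng.normal(size=n)
    x = x_star + rng.normal(size=n)          # noisy IQ score
    noisy = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
    clean = np.cov(x_star, y, ddof=1)[0, 1] / np.var(x_star, ddof=1)
    return noisy, clean

results = {}
for n in (20, 100, 10_000):
    draws = np.array([slopes(n, rng) for _ in range(reps)])
    results[n] = (draws[:, 0].mean(), draws[:, 1].mean())
    print(n, results[n])   # noisy mean stays near 1; clean mean stays near 2
```

Whatever the sample size, the average estimate with the noisy regressor stays near one rather than two, while regressing on the noise-free variable centers on the true value, mirroring the left- and right-hand graphs in the lecture.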
Here, it is clear that for all sizes of the data set, OLS on average gives the correct value. And for large data sets, it almost exactly gives the correct value. If the assumptions underlying OLS are satisfied, OLS is unbiased and consistent. Under endogeneity, OLS does not have these properties, and the OLS estimator converges to the wrong value as the data sets become larger and larger. So OLS is inconsistent under endogeneity. We can also show this inconsistency mathematically. Let's consider the standard linear model in combination with the OLS estimator. For the y in the formula, we insert the model definition and next work out the matrix multiplications. In the resulting equation, you can see that the first term reduces to beta. So, we can split the OLS estimator into beta plus a random term that depends on X and epsilon. We use this formulation to see what happens to the estimator when the sample size gets very large. The first part, beta, is constant, so we only need to study what happens to the second part. Both the term X'X and the term X'epsilon have sums over the observations as elements, as can be seen on the slide. If the number of observations increases, these terms will therefore diverge. However, we can rewrite the estimator such that we are left with terms that do converge. In this equation, we have inserted two 1-over-n terms that cancel against each other. The result is that the two matrices now have elements that are averages over the observations. Under mild conditions, the term 1/n X'X now converges to its population counterpart, Q. The second term, 1/n X'epsilon, is also an average over observations and converges in general. The OLS estimator, b, will now converge to the true parameter beta if three conditions hold: the term 1/n X'X must converge to Q; Q inverse must exist; and 1/n X'epsilon must converge to zero. OLS is consistent under these conditions.
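The derivation described above can be written out as follows; the notation matches the terms quoted in the lecture.

```latex
\begin{align*}
b &= (X'X)^{-1}X'y
   = (X'X)^{-1}X'(X\beta + \varepsilon)
   = \beta + (X'X)^{-1}X'\varepsilon \\
  &= \beta + \left(\tfrac{1}{n}X'X\right)^{-1}\left(\tfrac{1}{n}X'\varepsilon\right),
\end{align*}
```

so that, taking probability limits,

```latex
\operatorname{plim} b
  = \beta + Q^{-1}\,\operatorname{plim}\tfrac{1}{n}X'\varepsilon,
```

which equals beta exactly when the three conditions hold: plim (1/n) X'X = Q, the inverse of Q exists, and plim (1/n) X'epsilon = 0.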
This third condition is equivalent to X being exogenous, that is, no correlation between X and the error term. If X is endogenous, 1/n X'epsilon does not converge to zero, and the OLS estimator does not converge to beta; that is, OLS is inconsistent. We have seen what happens when n grows large. However, we have not discussed the bias, that is, what happens in small samples. Why do you think this is the case? Please think about this question by answering this test question. To study the bias, we would need the expected value of (X'X)^(-1) X'epsilon. Here, we need to take into account that X is stochastic and perhaps correlated with epsilon. Without further assumptions, we simply cannot simplify this expectation. However, under endogeneity, this expectation is generally nonzero. To summarize: if X is endogenous, some variable in X is correlated with the error term epsilon, and OLS is not consistent. This means that even with an infinite amount of data, OLS will not give useful estimates. In the next lecture we will study an alternative estimation method that solves this problem. Now I invite you to do the training exercise to practice the topics of this lecture. You can find this exercise on the website, and this concludes this lecture.
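As a closing numerical illustration of the failing third condition, the sketch below uses the same simulated design as before (unit variances assumed, true slope two). Once y is rewritten in terms of the observed x, the error term absorbs the measurement noise and becomes correlated with x, so 1/n X'epsilon settles at a nonzero number.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500_000
beta1 = 2.0

x_star = rng.normal(size=n)      # true intelligence (unit variance assumed)
u = rng.normal(size=n)           # model error
eta = rng.normal(size=n)         # measurement noise (unit variance assumed)
x = x_star + eta                 # observed IQ score

# Rewriting y = 1 + beta1*x_star + u in terms of the observed regressor:
# y = 1 + beta1*x + eps  with  eps = u - beta1*eta,
# so the regression error now contains the measurement noise.
eps = u - beta1 * eta

# Sample analogue of (1/n) X'epsilon for the demeaned regressor:
avg = np.mean((x - x.mean()) * eps)
print(avg)   # settles near -beta1 * var(eta) = -2, not at zero
```

This nonzero limit is exactly what drags the OLS slope away from two and down toward one in the experiments above.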