Welcome! The topic of this lecture is parameter estimation in logit models. In the previous lecture, you have seen how to interpret the parameters of the logit model. In practice, the values of the parameters are unknown and need to be estimated from observed data. This lecture will show you how to do this. The slide shows the logit formula for the probability that y_i equals 1. I now use vector notation: X_i is a k-dimensional vector of the k explanatory variables, including the constant term, and beta is a k-dimensional vector of unknown parameters. The logit model provides the probability that y_i equals one. This model cannot be written in a linear regression format, and hence minimizing the sum of squared residuals is not the optimal way to estimate the beta parameters if the logit model is the true model. We therefore need to use another measure of fit for parameter estimation. An often-used measure of fit is the so-called likelihood function. The corresponding estimation technique is called maximum likelihood. The idea of this technique is to consider the probability of getting the actual observed data as a function of the model parameters. This function is called the likelihood function. A maximum likelihood estimator, abbreviated as MLE, is defined as the parameter value that maximizes the probability of getting the actual observed data. In other words, the maximum likelihood estimate is the parameter value that gives you the highest probability of getting your observed data. To construct the likelihood function necessary to estimate the beta parameters, we first consider the probability of getting an observation y_i. This is called the likelihood contribution of observation i. The probability of getting a 1 for this observation is, of course, the logit probability. The probability of getting a 0 for this observation is one minus the logit probability.
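As a small illustration outside the lecture slides, the logit probability and the likelihood contribution of a single observation can be sketched in Python. The numbers below are hypothetical, chosen so that the linear index is exactly zero:

```python
import numpy as np

def logit_prob(x, beta):
    """Logit probability that y_i = 1 given covariates x and parameters beta."""
    return 1.0 / (1.0 + np.exp(-np.dot(x, beta)))

def likelihood_contribution(y, x, beta):
    """Probability of observing y (0 or 1): p^y * (1 - p)^(1 - y)."""
    p = logit_prob(x, beta)
    return p**y * (1.0 - p)**(1.0 - y)

# Hypothetical single observation: a constant term and one explanatory variable
x = np.array([1.0, 2.0])
beta = np.array([-1.0, 0.5])
p = logit_prob(x, beta)   # here the index is -1 + 0.5*2 = 0, so p = 0.5
```

With p equal to 0.5, the likelihood contribution is 0.5 whether the observation is a 1 or a 0, which matches the formula selecting the right probability for each outcome.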
In short, we can write the probability of getting an observation y_i as the product of the probability that y_i equals 1, raised to the power y_i, and the probability that y_i equals 0, raised to the power (1 - y_i). It is easy to check that the right probability is selected given the value of the observation. As we want to estimate the parameters of the model based on our observations, we have to compute the joint probability of getting all y's. In practice, choices are often independent, as individuals often make independent decisions. For independent random variables, the joint probability is simply the product of the individual probabilities. This implies that the total probability of getting the data equals the product of all likelihood contributions. If we consider this product as a function of the model parameters, we have the likelihood function, denoted by capital L(beta). Maximizing this likelihood function with respect to beta provides the maximum likelihood estimator. This needs to be done using numerical methods. The likelihood function is a product of probabilities that are smaller than one, and hence it can give very small values if the number of observations is large. This may cause numerical problems. In practice, one therefore always chooses to maximize the log-likelihood function. The advantage of considering the log-likelihood function is that the product of probabilities becomes the sum of log probabilities, which causes no numerical problems. Using the properties of the logarithmic function, you can easily write the log-likelihood function in the simple form shown on the slide. The maximum likelihood estimator is now defined as the value of beta that maximizes the log-likelihood function. We use MLE as abbreviation for the maximum likelihood estimator. Before we continue, I want you to think about the following: we have now defined the maximum likelihood estimator as the value of beta that maximizes the log-likelihood function.
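The log-likelihood described above can be sketched as a short Python function. The toy data below are my own made-up numbers, not from the lecture; at beta equal to zero every logit probability is 0.5, so the log-likelihood is simply n times log(0.5):

```python
import numpy as np

def log_likelihood(beta, X, y):
    """Logit log-likelihood: sum_i [ y_i*log(p_i) + (1 - y_i)*log(1 - p_i) ]."""
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Toy data: four observations with a constant and one explanatory variable
X = np.array([[1.0, 0.5], [1.0, -1.0], [1.0, 2.0], [1.0, 0.0]])
y = np.array([1.0, 0.0, 1.0, 0.0])
ll = log_likelihood(np.zeros(2), X, y)   # at beta = 0, every p_i equals 0.5
```

Note that the sum of log probabilities stays well within floating-point range even for many observations, which is exactly the numerical advantage mentioned above.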
Is this value of beta also the value that maximizes the likelihood function itself? The answer is yes. As the log function is a monotonically increasing function, the value which maximizes the log-likelihood function also maximizes the likelihood function. Hence, it does not matter whether we maximize the likelihood function or the log-likelihood function. To maximize the log-likelihood function we consider the first-order condition. For this we need the derivative of the log-likelihood function with respect to beta. This first derivative can be computed using the chain rule. You can see that the first-order conditions are non-linear in the parameter beta. It turns out to be impossible to derive a closed-form expression for the beta that solves this equation. In practice, computer packages will do this for you using numerical methods. The solution is the maximum likelihood estimator b. To shed some light on the value of the maximum likelihood estimator, I consider the first-order condition for the intercept, that is, when x equals 1. It is easy to see that the first-order condition implies the result shown on the slide. The interpretation of this formula is intuitively clear. When you evaluate the logit probabilities in the MLE and take the sample average, you obtain the same value as the average value of the y's. The latter is equal to the fraction of observations equal to one in the sample. Hence, the MLE matches the average of the logit probabilities with the sample mean of the y's. I now ask you to think about the following. Suppose that you have a data set where all y_i observations are 0, with different explanatory variables x_i per observation. You want to estimate the parameters of the logit model. What is the value of the maximum likelihood estimator b in this case? As you can see on the slide, the first-order conditions imply that the sum of the probabilities should be 0.
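As a sketch of what such a computer package does internally, the following Python code maximizes the log-likelihood numerically on simulated data and then checks the first-order condition for the intercept. This is my own illustration using a general-purpose optimizer; the data, seed, and parameter values are all assumptions:

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_likelihood(beta, X, y):
    """Negative log-likelihood (minimizing it maximizes the likelihood)."""
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Simulated data: a constant and one explanatory variable
rng = np.random.default_rng(0)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])
p_true = 1.0 / (1.0 + np.exp(-(X @ np.array([-0.5, 1.0]))))
y = (rng.uniform(size=n) < p_true).astype(float)

# Numerical maximization of the log-likelihood
res = minimize(neg_log_likelihood, np.zeros(2), args=(X, y), method="BFGS")
b = res.x                                   # the maximum likelihood estimator
p_hat = 1.0 / (1.0 + np.exp(-(X @ b)))

# First-order condition for the intercept: the average fitted probability
# equals the sample mean of the y's (the fraction of ones in the sample)
```

At the optimum, p_hat.mean() and y.mean() coincide up to the optimizer's tolerance, which is exactly the matching property described above.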
As the logit probability is always strictly larger than zero, there is no solution to the first-order condition, and hence the maximum likelihood estimator does not exist. The maximum likelihood estimator has some nice statistical properties, provided of course that the model is correctly specified. First of all, the estimator is consistent, which means that for a large number of observations, the estimator is close to the true parameter value. Secondly, the MLE is asymptotically efficient, in the sense that the estimator has minimum variance. Thirdly, the MLE is asymptotically normally distributed. This is an important result that we can use for testing hypotheses. For practical purposes, you can use that the MLE has approximately a normal distribution with mean the true beta and covariance matrix V. For the logit model, this matrix V can be estimated as shown on the slide. It is beyond the scope of this lecture to derive the covariance matrix V. You can, however, see that the matrix shows some similarities with the covariance matrix of the least squares estimator, which was sigma squared times X prime X inverse. The first part of the expression is the product of the probabilities times 1 minus the probabilities. This is simply the variance of a Bernoulli distribution. Note that this variance is different across observations. The second part of V contains X prime X, but now in vector notation. The matrix V can be used to compute standard errors of b. In practice, computer packages will do this for you. The above properties can be used to select appropriate explanatory variables to include in the logit model. For a single parameter restriction, you can follow a similar approach as in the linear regression case. You can construct the familiar t-statistic to test for the significance of a single parameter b_j. This t-test statistic is approximately normally distributed, and you reject the null hypothesis when its absolute value is larger than the critical value.
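The estimated covariance matrix and the resulting standard errors and t-statistics can be sketched as follows. This is my own illustration on simulated data; it takes V as the inverse of the sum over observations of p_i(1 - p_i) times x_i x_i prime, combining the Bernoulli variances and the X prime X part discussed above:

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_likelihood(beta, X, y):
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Simulated data, then the MLE b via numerical optimization
rng = np.random.default_rng(1)
n = 1000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
p_true = 1.0 / (1.0 + np.exp(-(X @ np.array([0.2, 0.8]))))
y = (rng.uniform(size=n) < p_true).astype(float)
b = minimize(neg_log_likelihood, np.zeros(2), args=(X, y), method="BFGS").x

# Estimated covariance matrix of b
p = 1.0 / (1.0 + np.exp(-(X @ b)))
W = (p * (1 - p))[:, None]          # Bernoulli variances, one per observation
V = np.linalg.inv(X.T @ (W * X))    # inverse of sum_i p_i(1-p_i) x_i x_i'
se = np.sqrt(np.diag(V))            # standard errors of b
t_stats = b / se                    # t-statistic for H0: beta_j = 0
```

With a strong true effect on the explanatory variable, its t-statistic comfortably exceeds the usual critical values, so the null of no effect would be rejected here.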
If you want to test for the significance of a set of beta parameters, you cannot use the F-test as in a linear regression model, as the model cannot be written in a regression format with errors. Instead, you have to use a testing procedure called the likelihood ratio test. Suppose you have two logit models. The first model contains all variables, and the second model is the same as the first one, but now has some restrictions imposed on the parameters beta. Estimate the parameters of both models using maximum likelihood. Denote the maximum log-likelihood value of the model with all variables by L(b_1), and the maximum log-likelihood value of the model with the restrictions imposed by L(b_0). These values are obtained by evaluating the log-likelihood function in the MLEs. The likelihood ratio statistic can now be computed as minus two times the difference between the latter and the former maximum log-likelihood value. This statistic has approximately a chi-square distribution, where the number of degrees of freedom is equal to the number of parameter restrictions. In case the value of the statistic is larger than the critical value, you reject the restrictions. When you reject, it means that the difference in log-likelihood value between the restricted model and the unrestricted model is statistically too large. Now, I invite you to make the training exercise, to train yourself in the topics of this lecture. You can find this exercise on the website. And this concludes our lecture on parameter estimation for logit models.
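The likelihood ratio test can be sketched on simulated data as follows. This is my own illustration, not from the lecture: the data, the seed, and the 5% significance level are assumptions, and the single restriction sets the last coefficient to zero:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import chi2

def neg_log_likelihood(beta, X, y):
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Simulated data: constant, one relevant variable, one irrelevant variable
rng = np.random.default_rng(2)
n = 800
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
p_true = 1.0 / (1.0 + np.exp(-(X @ np.array([0.0, 1.0, 0.0]))))
y = (rng.uniform(size=n) < p_true).astype(float)

# Unrestricted model uses all columns; the restricted model imposes
# the single restriction that the last coefficient equals zero
logL1 = -minimize(neg_log_likelihood, np.zeros(3), args=(X, y), method="BFGS").fun
logL0 = -minimize(neg_log_likelihood, np.zeros(2), args=(X[:, :2], y), method="BFGS").fun

lr = -2.0 * (logL0 - logL1)            # likelihood ratio statistic
critical = chi2.ppf(0.95, df=1)        # 5% critical value, one restriction
reject = lr > critical                 # reject the restriction if True
```

Because the restricted model is nested in the unrestricted one, logL0 can never exceed logL1, so the statistic is non-negative; with the irrelevant variable truly having a zero coefficient, the restriction would typically not be rejected here.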