In our previous video, we discussed the creation of the FICO score which, just to remind you, tells us the probability that you will be 90 or more days past due in the next 24 months. Credit bureaus collect information on individuals' credit profiles and use a type of regression analysis to calculate the FICO score, which again is the probability that you will be 90 or more days past due in the next 24 months. What we want to do in this video is briefly go over what a regression does and explain why the standard regression that we might be familiar with is not necessarily appropriate for answering this question. Linear regression is a situation in which we see data on two things that we think might be related to one another. As an example, let's think about interest rates and future inflation. We usually think that a higher interest rate is related to the idea that we expect inflation to be higher in the future. So our hypothesis is that inflation over the next 12 months is related to today's interest rate. That is, the interest rate forecasts inflation over the next 12 months. The first thing we can do to look at this hypothesis is simply plot the data. In this figure, I produced a scatter plot. On the x-axis is the current one-year treasury rate, taken from the Federal Reserve, to reflect the constant maturity one-year treasury borrowing rate. On the y-axis, I have inflation over the 12 months subsequent to seeing this particular treasury interest rate. The inflation in this case is the growth in the Consumer Price Index for all urban consumers. What hopefully emerges from looking at this particular graph is that there's an increasing relationship between the treasury rates and future inflation. The relationship is not perfect. I don't see a perfect line, but I can certainly see that as interest rates get higher, future inflation tends to get higher as well. A regression analysis essentially tries to draw a line to fit these data as best as possible.
The result of a regression winds up being a line that says future inflation is equal to 1.08 plus 0.53 times the interest rate. How do we interpret this? The intercept 1.08 says that when interest rates are 0, we expect future inflation to be 1.08 percent. The slope coefficient 0.53 tells us that for each additional one percentage point increase in interest rates, we expect 0.53 percentage points more inflation in the next 12 months. In this chart, I've overlaid the regression line on the original data. What you can see is that the regression line generally fits the data. There are obviously some points that it misses, but in general it captures the upward sloping relationship between interest rates and future inflation. If we were to actually perform this regression in Excel, this is what the output of the regression would look like. There are three things that I want to call to your attention. The first is that down to the left we have the coefficients that represent the line that I mentioned before, the intercept 1.0829 and the slope coefficient 0.5334. The second piece of information is what we call the t-statistic, which tests the hypothesis of whether the coefficient is actually zero or not. A larger t-stat is generally assumed to be better, and we usually think that we can be fairly confident that a coefficient isn't different from zero just by accident if the t-statistic is greater than two in absolute value. The final piece of information is the R-square, which is an estimate of how good a fit the line is to the data. The highest R-square one can get would be one, which would essentially say that the data lines up perfectly in a line. As we saw from our plot before, that is not the case with this particular regression, but we do see an R-square of about 0.5. This is suggesting that about 50 percent of the variation in future inflation is accounted for by the current interest rate. Now let me introduce the concept of a limited dependent variable.
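Before moving on, here is a rough sketch of how the three pieces of the regression output described above (the coefficients, the t-statistics, and the R-square) could be computed by hand. This is only an illustration: the data are simulated rather than the actual treasury and CPI series, the function name `ols_summary` is made up for this example, and the "true" intercept of 1.0 and slope of 0.5 used to generate the data are assumptions chosen to resemble the relationship in the chart.

```python
import numpy as np

def ols_summary(x, y):
    """Fit y = intercept + slope * x by ordinary least squares.

    Hypothetical helper for illustration; returns the coefficients,
    their t-statistics, and the R-square of the fit.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size

    # Design matrix: a column of ones (intercept) next to x (slope).
    X = np.column_stack([np.ones(n), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)

    # Residuals and R-square: share of the variation in y the line explains.
    resid = y - X @ beta
    ss_res = float(resid @ resid)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    r_squared = 1.0 - ss_res / ss_tot

    # Standard errors and t-statistics (coefficient / standard error).
    sigma2 = ss_res / (n - 2)
    std_err = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
    t_stats = beta / std_err

    return {
        "intercept": float(beta[0]),
        "slope": float(beta[1]),
        "t_intercept": float(t_stats[0]),
        "t_slope": float(t_stats[1]),
        "r_squared": r_squared,
    }

# Simulated data (an assumption, NOT the Federal Reserve / CPI series).
rng = np.random.default_rng(0)
rate = rng.uniform(0.0, 15.0, size=200)
inflation = 1.0 + 0.5 * rate + rng.normal(0.0, 2.0, size=200)

result = ols_summary(rate, inflation)
for name, value in result.items():
    print(f"{name}: {value:.4f}")
```

The slope is read the same way as in the lecture: it is the expected change in future inflation for a one percentage point change in the interest rate, and a t-statistic above two suggests the slope is unlikely to be zero.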
The question that motivates this discussion is whether the type of regression analysis that we just performed is appropriate in the case of credit scoring. Again, remember that FICO scores predict one of two outcomes: either the borrower will be 90 days past due in the next two years, or the borrower will not be 90 days past due in the next two years. So this is a discrete statement, basically a true or false statement that says the borrower is past due or the borrower is not past due. When there are only two outcomes, we have what is called a limited dependent variable. What we are trying to predict, our y variable, can only take one of two values. We will say our y variable is one if the borrower goes 90 days past due, and our y variable is zero if the borrower does not. To illustrate, we will look at a publicly available data set from LendingClub. LendingClub is a company that crowdsources loans. The idea here is that they provide a platform where people who are looking for a loan can get that loan from other people who go to the website. Potential lenders screen data on loan applicants, and LendingClub provides data as to how the loans are performing and have performed. We will look at data for loans originated from 2007 to 2010. If you're interested in looking at this data, you can look at the following hyperlink to download the information. For now, we'll focus on two of the variables that LendingClub provides. The first is the loan status, which we will use to create a new variable. In this case, all of the loans have matured. So we will create a variable called the charge indicator, which is an indicator as to whether the loan has been charged off or considered to be a bad loan. Our charge indicator will take the value zero if the status of the loan is fully paid. In contrast, if the loan is charged off, that is, it becomes a bad loan, we will let the charge indicator take on a value of one. The other variable that we will take a look at is the ratio of debt payments to income.
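As a small sketch of how the charge indicator just described might be built, the snippet below maps a loan status to zero or one. The field names `loan_status` and `dti`, the exact status strings "Fully Paid" and "Charged Off", and the four example rows are all assumptions made for illustration; they are not taken from the LendingClub file itself.

```python
def charge_indicator(loan_status):
    """Return 1 if the loan was charged off, 0 if it was fully paid.

    The status strings are assumed spellings; any other status raises
    an error, since every loan in the sample is taken to have matured.
    """
    status = loan_status.strip().lower()
    if status == "charged off":
        return 1
    if status == "fully paid":
        return 0
    raise ValueError(f"unexpected loan status: {loan_status!r}")

# Made-up example rows (NOT real LendingClub records).
loans = [
    {"loan_status": "Fully Paid", "dti": 8.7},
    {"loan_status": "Charged Off", "dti": 21.3},
    {"loan_status": "Fully Paid", "dti": 14.0},
    {"loan_status": "Fully Paid", "dti": 27.5},
]

for loan in loans:
    loan["charge_indicator"] = charge_indicator(loan["loan_status"])

# Because the indicator is 0 or 1, its average is the charge-off rate.
charge_off_rate = sum(loan["charge_indicator"] for loan in loans) / len(loans)
print(f"charge-off rate: {charge_off_rate:.2f}")
```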
Again, as we discussed in a previous video, this ratio of debt payments to income was often used as a guideline to think about the credit underwriting process. First, let's plot these variables like we did before. On the x-axis, I've plotted the debt-to-income ratio, ranging from 0 to 30 percent. On the y-axis, I've plotted the charge indicator. Although you can't see it very well because these look like a line, there are a number of dots that represent all of the different combinations of debt-to-income ratio and whether the loan was charged off or not. However, as you can see from this particular plot, there are only two values that our y variable can take on: zero or one. We could try to fit a regression to this data. If we did, we would get what looks like the red line. However, as we've discussed with linear regression, we typically want to see the line fit the data fairly well, and hopefully the line would actually go through some of the data points. Here you can see that the line doesn't actually go through any of the data points. We can get a sense of how poorly this particular regression fits the data by looking at the R-squared. The R-squared in this particular case is 0.0017, which indicates that debt to income in a linear regression is explaining only about 0.17 percent of the variation in charge-off status. How do we deal with limited dependent variables? When the y variable has a value of either zero or one, a linear regression will not be appropriate, as the previous example illustrates. We need an alternative approach in order to perform this sort of data analysis. In the next video, we will discuss logistic regression, which is a version of regression analysis that is suited for this type of data. Logistic regression permits us to have zero or one as our y variable values and something else as our x variable values.
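To see the problem with fitting a straight line to a zero-or-one outcome, here is a minimal sketch on simulated data. The numbers are assumptions chosen only for illustration, not the LendingClub sample: debt-to-income is drawn uniformly between 0 and 30 percent, and the chance of a charge-off is assumed to start at 10 percent and rise by 0.3 percentage points for each point of debt-to-income.

```python
import numpy as np

# Simulated loans (an assumption, NOT the LendingClub data).
rng = np.random.default_rng(1)
n_loans = 5000
dti = rng.uniform(0.0, 30.0, size=n_loans)

# Assumed charge-off probability, then a 0/1 outcome drawn from it.
prob_charge_off = 0.10 + 0.003 * dti
charged_off = (rng.random(n_loans) < prob_charge_off).astype(int)

# Ordinary linear regression of the 0/1 indicator on debt-to-income.
slope, intercept = np.polyfit(dti, charged_off, 1)
fitted = intercept + slope * dti

# R-squared: share of the variation in the 0/1 outcome the line explains.
ss_res = float(((charged_off - fitted) ** 2).sum())
ss_tot = float(((charged_off - charged_off.mean()) ** 2).sum())
r_squared = 1.0 - ss_res / ss_tot

print(f"fitted line: {intercept:.4f} + {slope:.4f} * dti")
print(f"fitted values range from {fitted.min():.3f} to {fitted.max():.3f}")
print(f"R-squared: {r_squared:.4f}")
```

In this simulation the fitted values sit strictly between zero and one, so the line never touches an actual data point, and the R-squared comes out very small, which mirrors what the plot in the lecture shows.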