So today, we're going to look at traditional approaches for estimating crash periods, in particular, recessions. The first method, which has been used for decades, is to develop indexes of related indicators. The government has its own index. These are very important for firms. In particular, if you are an expert in an area, one wants to know: what's the chance of a recession occurring? What's the state of the economy today? Will there be a recession? Will it be normal? Will it be high-growth or low-growth? This is very important because, first of all, markets need reliable information. If they're going to operate in an efficient manner, they have to have reliable, open, transparent information. So those judgments are very important. Secondly, if we do get indications that there's a problem, then one can take action. I mentioned in the previous section, if there's a hurricane coming, people want to close their shutters and start taking some action. It should be the same if we believe there is some problem up ahead. Let's now look at one of those indices: the Goldman Sachs current activity index. This index was published in the last six months or so, in late 2017 and early 2018, and it's typical of these things. Now, there are some challenges for estimating the future state of the economy. First, indicators like the stock market may be leading, but they could also be coincident with the drop in the economy: the market starts going down, the economy starts contracting, and these things are coincident, not leading. So that's one thing. Secondly, it's often the case that conditions change, and, say, as international trade increases, the things that may cause trouble are not necessarily the ones that caused trouble in the past. So we have change, in addition, to deal with. In particular, we want to know how these series are selected. 
Here, we see the 22 features, or factors in the investment lingo, that feed into the Goldman index. These come from various sources. The first collection at the top involves sentiment and other information from various countries: the US, Canada, and outside the US. The second section deals with flows: you can see a whole series of production and trade flows, orders, things of that sort. This is the way the index was made up by Goldman, with some lag. So we have to be careful that the index is not using information from ahead of time. We can't anticipate the future; we've got to use information that's current today. So we can only use the information that we have at each point in time. Secondly, how did we get these indices? In the case of machine learning, we're going to use a systematic approach for estimating these factors, or features. But in the case of this index, we really don't have a good sense of how those features were worked out. We don't know, in particular, how reliable they are over historical time periods. So we don't have the out-of-sample analysis that we need for machine learning. If you look at the performance of this index over the last seven or eight years, you see a pretty good indication in the right panel that it apparently does lead the stock market. This is the S&P 500. We see that the index, the darker line, leads the market somewhat on the way up, and when it starts going down, even though the market keeps going up for a while, eventually the index leads it. So it does seem to have some ability to lead. In the left panel, we see interest rates, and we're looking to see how the index is related to interest rates over time. There we see some modest relationship. But we don't have a very good estimate of statistical reliability. In that sense, it is not using the theories and techniques that we want for machine learning. 
So let's now turn back to our question: is there a traditional statistical method for this? The method we're going to look at is called logistic regression. In particular, we're going to try to estimate a zero-one variable, a binary: are we going to have a crash ahead, or are we not? We're going to look at a set of observations, in this case months, going back to 1976 in the study that we're going to evaluate a bit later, and we're going to look at a group of features. In this case, we're looking at 14 features. We could look at many others, with lags, but we're going to look at 14 features over some number of months, something like 500 months, for the analysis here. So now we turn to the idea of how to estimate a zero-one variable with a regression-type approach, if we can. The way this is done is to look at the odds of a recession versus normal. Let's assume that we have a 15 percent chance of a recession in the next quarter; that is, in effect, the average chance of a recession in any particular quarter going ahead. The odds of a recession going ahead would be the 15 percent chance of recession versus the 85 percent chance of no recession: 0.15 divided by 0.85, which is about 0.176, or 17.6 percent. Now if we flip that and look at the odds of a normal environment, that's the 85 percent chance of normal versus the 15 percent chance of a crash, and that's a number above five, about 5.67. Those are the odds that we use. This is going to be the basis for our logistic regression. Now, for many people who are not gamblers, who don't go to race tracks, odds are an odd thing. Does everybody know what the odds are? I think that's odd. But in our analysis, we're going to first make a transformation. We're going to take what's called the log-odds, so that at least the positives and negatives are symmetric. 
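The odds arithmetic above, and the symmetry of the log-odds, can be checked in a few lines. A minimal sketch in Python, using the lecture's 15 percent recession probability:

```python
import math

# The lecture's example: 15% chance of recession next quarter.
p_recession = 0.15

# Odds of recession vs. normal, and the flipped odds of normal vs. recession.
odds_recession = p_recession / (1 - p_recession)   # 0.15 / 0.85
odds_normal = (1 - p_recession) / p_recession      # 0.85 / 0.15

print(round(odds_recession, 3))  # 0.176
print(round(odds_normal, 2))     # 5.67

# The log-odds are symmetric: one is exactly the negative of the other.
print(round(math.log(odds_recession), 3))  # -1.735
print(round(math.log(odds_normal), 3))     # 1.735
```

That symmetry (plus or minus the same number for the two outcomes) is exactly why the log transformation is taken before fitting.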
We're going to look at the log of the odds: the log-odds will be one number for a recession, and it will be minus that same number for a normal environment. So the log-odds have symmetry. They also have a nice smooth path, which is what we're going to estimate. The assumption for a logistic regression is that we can estimate the log-odds using a set of linear terms, just like we did in the factor analysis: Beta_0 plus Beta_1 times x_1, plus Beta_2 times x_2, and so on, where the x's are the features at any particular time. Of course, we want to look ahead, so we have this lag that we need to worry about. Let's now look at a function that will help us estimate this. Once we have the log, we know that its inverse is the exponential. So we're going to define a function, h_Beta of x, which says: given some Betas and given some features x, this tells me the probability of y being one. In particular, will there be a recession or not? y equal to one will be a recession; y equal to zero will be normal. We define h as one over the quantity one plus the exponential of minus the linear Beta terms. We're going to use this in a training context: we'll have some training observations, and we'll try to fit the log-odds with this linear term. To do that in a systematic way, we need a loss function, and we're going to use the maximum likelihood loss function, which is the formula y times the log of h, plus (1 minus y) times the log of (1 minus h). We're going to maximize that over the number of observations. If we have 500 months, we'll have the sum of these over 500 periods. So we're going to optimize this function and find the maximum likelihood. That's the problem we'll solve. 
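A minimal sketch of the function h and the log-likelihood described above, in Python. The beta values and the tiny two-observation dataset here are made up purely for illustration; training would search over the betas to maximize this quantity:

```python
import math

def h(beta, x):
    """Probability that y = 1 given features x; beta[0] is the intercept."""
    z = beta[0] + sum(b * xi for b, xi in zip(beta[1:], x))
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(beta, X, y):
    """Sum over observations of y*log(h) + (1 - y)*log(1 - h)."""
    total = 0.0
    for xi, yi in zip(X, y):
        p = h(beta, xi)
        total += yi * math.log(p) + (1 - yi) * math.log(1 - p)
    return total

# Two made-up monthly observations, each with a single feature.
X = [[1.0], [-1.0]]
y = [1, 0]                 # 1 = recession, 0 = normal
beta = [0.0, 1.0]          # hypothetical intercept and slope

print(round(h(beta, X[0]), 4))               # 0.7311
print(round(log_likelihood(beta, X, y), 4))  # -0.6265
```

Note the minus sign inside the exponential: with it, a large positive linear term pushes the probability toward one, matching the log-odds interpretation.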
It's not a complicated optimization problem, but it's more complicated than it would be for ordinary regression, where we have a quadratic fit. Here's a simple example of what we get out of logistic regression. Here we're going to look at someone's sex and try to estimate it from a single feature, their height. So if you are my height, 71 inches, then one can estimate the probability by fitting this S-shaped curve to the data, a large number of people in this case. So we come up with a probability of being male or female based on height. This is the type of analysis that we're going to do with our logistic regressions. We make the assumption that we can estimate these characteristics with this type of S-shaped curve: in this case, as people are taller, they're more likely to be male, the shorter ones are more likely to be female, and we maximize the log-likelihood. In this case, we'll have a large dataset from the population. So what are the take-home points from this section? First of all, there's a lot of attention to estimating the economy ahead. We saw two approaches. We saw leading indicators: many companies like Goldman and others, and the government, have leading indicators. Secondly, we turned to the statistical method called logistic regression. This approach is used widely in medicine and many other areas. But it depends strongly on the assumption that the log-odds are a linear function of the features, a linear combination with the Betas, like we had with traditional regression. This may be reasonable with limited data. But we're going to find that when we have more data, we don't have to make that assumption. We can use other approaches which don't make those assumptions, and that's what we're going to do with machine learning. In the end, we want to discover features that are leading and that are reliable over time. In some cases, these are just hiding in plain sight. 
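The height example can be reproduced end to end with a small gradient-ascent fit of the log-likelihood. A sketch in Python on synthetic data; the group means, spreads, centering point, and learning rate are all assumptions for illustration, not the lecture's actual dataset:

```python
import math
import random

random.seed(0)
# Synthetic heights in inches: two overlapping groups (assumed means/spreads).
heights = [random.gauss(64, 3) for _ in range(500)]   # label 0 ("female")
labels = [0] * 500
heights += [random.gauss(70, 3) for _ in range(500)]  # label 1 ("male")
labels += [1] * 500

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Maximize the log-likelihood by gradient ascent on (b0, b1).
# Heights are centered at an assumed 67 inches to keep the intercept small.
b0, b1 = 0.0, 0.0
lr = 0.1
n = len(heights)
for _ in range(2000):
    g0 = g1 = 0.0
    for x, yv in zip(heights, labels):
        p = sigmoid(b0 + b1 * (x - 67.0))
        g0 += (yv - p)               # gradient of log-likelihood w.r.t. b0
        g1 += (yv - p) * (x - 67.0)  # gradient w.r.t. b1
    b0 += lr * g0 / n
    b1 += lr * g1 / n

# The fitted S-curve: probability of label 1 at 71 inches.
p71 = sigmoid(b0 + b1 * (71.0 - 67.0))
print(round(b1, 3), round(p71, 3))
```

The fitted slope comes out positive (taller means more likely label 1), and the probability at 71 inches lands well above one half, which is the S-shaped behavior the lecture describes.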
In other words, they might be obvious after the fact but, before the fact, not well understood. So we have the issue of trying to make sure we understand which features are leading, and how we can use that estimated probability to make better decisions going forward.