Now, I'd like to discuss linear models, or linear regression, as a form of forecasting. I know you've seen linear regression before, and there are some subtle differences when you look at it as a forecasting model, so let's dive in. Regression analysis is a model to examine the relationship between a variable Y and other variables X_1 through X_p. Y is often called the response variable or the dependent variable, and the Xs are known as the predictors, or sometimes the independent variables. For linear regression, we assume a linear relationship in the data, and we use a linear equation to model that relationship. Y_t denotes the response at time t, so Y is the dependent variable at time t, and we have some sort of relationship:

Y_t = Beta_0 + Beta_1 X_1t + Beta_2 X_2t + ... + Beta_k X_kt + epsilon_t

If we focus on the right-hand side, we have Beta_0 as the intercept term, we have a variable X_1 at time t, and then other variables up to X_k, so we can have multiple predictors in there. To generalize this equation, we can write it with a summation:

Y_t = Beta_0 + sum from i = 1 to k of Beta_i X_it + epsilon_t

You should stare at that summation and convince yourself it's equivalent to the written-out equation above. If you set i equal to 1, you get Beta_1 X_1t; if i is equal to 2, you have Beta_2 X_2t, and so on and so forth until you get to Beta_k X_kt, and then plus your error term. The Betas are called the coefficients, Beta_0 is called the intercept term, and finally the epsilons are known as the error terms.

Now, the assumptions of linear regression. We have the following assumptions. First, the Xs and Y, the predictors and the response, have some linear relationship. What does that mean? It means that we can draw a straight line through the data and it looks pretty good. The relationship isn't curvilinear, say arc-shaped, or dipping down and back up; it's a straight line. That linear relationship in the data can be negative or it can be positive. Second, the independent variables are considered to be multivariate normal; they're normally distributed.
The independent variables also have no multicollinearity, and there's no autocorrelation; the error terms are independent. So the covariance between the error term at time i and the error term at time j is equal to zero, no matter what those two time periods are. We also assume that the errors are distributed around a mean of zero with some constant variance. So what does that mean? The error term is always going to be about zero, give or take some little bit, and that little bit is constant: it's plus or minus some sigma, some number.

Another way to look at the assumptions, as a manager of an analytics team, is that these are questions you can use to verify or validate any statistical analysis that has been performed in your organization. So turn these into questions if you're reviewing a document. What are the responses and predictors? Is there a linear relationship? How would you check that? You could graph it out. Are the independent variables multivariate normal? Is there no autocorrelation? Do the errors have the same variance? You can look at a graph of the error terms, and so on. You should have seen these in past studies of linear regression.

So here's the simple model. One extension to the model from ordinary linear regression is that the data is indexed by time period t: instead of case 1, case 2, case 3, case 4, it's y at time 1, time 2, time 3, time 4. The parameters Beta_0 and Beta_1 are found by minimizing the squared error. So here we have our equation, y_t = Beta_0 + Beta_1 x_t + epsilon_t, and if we bring the fitted part over to the left-hand side, we get y_t minus Beta_0 minus Beta_1 x_t. We take that difference, square it, and add the squares up over all time periods, and that sum is what we're trying to minimize. The formulas are given here, more for reference, but the estimate for Beta_1 is as follows.
It's the distance away from the mean in the x direction, multiplied by the distance away from the mean in the y direction, added up over all of the individual points, and divided by the sum of the squared distances away from the mean in the x direction:

Beta_1_hat = sum over t of (x_t - x_bar)(y_t - y_bar) / sum over t of (x_t - x_bar)^2

and Beta_0_hat is the average of the ys minus Beta_1_hat times x_bar:

Beta_0_hat = y_bar - Beta_1_hat * x_bar

Those are the parameter estimates. They're really here for reference; I do encourage you to stare at the equations and try to get some feel for them, but really, any software package will print these out.

Here's an illustration of a linear regression. The red line that goes across is the average of the ys: if you took your column of y values and just calculated the average, it would sit right at y_bar on the y-axis. Then we fitted some slope, that's the blue line, and that's our prediction. Recall that we're trying to minimize the squared error terms, that is, the sum of (y_t - Beta_0 - Beta_1 x_t)^2. So we're minimizing the vertical distance from each data point down to the fitted line. You can imagine that if this line were somehow way up the page, the distance from the dots to the line would grow. Likewise, if we dropped it way down below, the distance from the data points to the line would also be very big. So putting the line in the middle helps to minimize that distance. Likewise, if you pivot the line, say so the slope went negative, the distances from the data points at both ends to the line also grow. So you're trying to find the line that minimizes that total squared distance; that's what the least squares criterion is all about. That sum is RSS, your residual sum of squares. Recall, when you see a sum of squares like this, it's just a squared Euclidean distance.
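If you want to see those two estimate formulas in action, here's a minimal Python sketch. The data is made up purely for illustration; the variable names are mine, not from the lecture.

```python
import numpy as np

# Hypothetical example data: y observed at times t = 1..5
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])

x_bar, y_bar = x.mean(), y.mean()

# Beta_1: distance from the mean in x times distance from the mean in y,
# summed over all points, divided by the summed squared distance in x
beta_1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)

# Beta_0: the average of the ys minus Beta_1 times x bar
beta_0 = y_bar - beta_1 * x_bar

print(beta_0, beta_1)
```

As the lecture says, any statistics package will report these same numbers; the point of writing them out is to see that the slope is just a ratio of co-movement in x and y to movement in x alone.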
So when you're starting to learn this stuff, if you see the term sum of squares, just think distance: the total distance, the residual distance, and the distance of the regression. That will help you at least intuitively understand what's going on. In this case, the total sum of squares involves y_t, your actual data point, and y_bar, that average line across, so it's the vertical distance between your data point and the average of the ys, and that's what we're trying to explain. Then we have the sum of squares regression, which goes from the fitted line, known as y_hat, our best guess, to the average line. And the residual sum of squares is the distance from the data point to the fitted line, y_t minus y_hat_t: how far away is the actual data point from our best guess? Then you can see that RSS plus SSR is equal to the total sum of squares. The distance from the average line to my blue line is SSR, and the distance from the blue line, my prediction line, to my actual data point is my residual sum of squares; add those two distances together and that's my total sum of squares. To summarize:

RSS = sum over t of (y_t - y_hat_t)^2, from the data points to the regression line
SSR = sum over t of (y_hat_t - y_bar)^2, from the regression line to the average line
TSS = sum over t of (y_t - y_bar)^2, from the data points to the average line

That brings us to degrees of freedom and R-squared. The degrees of freedom are the number of observations minus the number of parameters you estimate in the model. So that's how many degrees of freedom you have.
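That decomposition is easy to check numerically. Here's a small sketch, again with made-up data, that fits the least-squares line and confirms that TSS equals SSR plus RSS:

```python
import numpy as np

# Hypothetical data and a least-squares fit
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])
beta_1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta_0 = y.mean() - beta_1 * x.mean()
y_hat = beta_0 + beta_1 * x  # the blue prediction line

tss = np.sum((y - y.mean()) ** 2)     # data points to the average line
ssr = np.sum((y_hat - y.mean()) ** 2)  # regression line to the average line
rss = np.sum((y - y_hat) ** 2)         # data points to the regression line

print(tss, ssr + rss)  # the two pieces add up to the total
```

Note that the decomposition holds exactly like this only for a least-squares fit with an intercept; if you fit the line some other way, the cross term doesn't vanish.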
Then finally, the R-squared, otherwise known as the coefficient of determination, can be interpreted as the percentage of variation explained by the regression model. So recall that your total sum of squares is equal to your sum of squares regression plus your residual sum of squares: TSS = SSR + RSS. If the regression piece, SSR, explains most of that total sum of squares, then that's a pretty good explanatory model. But if most of it is left unexplained, sitting in the residual sum of squares, then it's not a very good model. Starting from TSS = SSR + RSS, divide every term through by TSS. The left-hand side becomes TSS over TSS, which is 1, since anything over itself is 1. So we have 1 = SSR/TSS + RSS/TSS, and rearranging gives

R^2 = 1 - RSS/TSS = SSR/TSS

That RSS over TSS term is the fraction of the total variation the model fails to explain, so one minus it, the R-squared, can be interpreted as the percentage of variation explained by the model.
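Continuing the same made-up example, here's a sketch computing R-squared both ways, as 1 - RSS/TSS and as SSR/TSS, to show they agree:

```python
import numpy as np

# Hypothetical data and a least-squares fit
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])
beta_1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta_0 = y.mean() - beta_1 * x.mean()
y_hat = beta_0 + beta_1 * x

rss = np.sum((y - y_hat) ** 2)         # unexplained variation
ssr = np.sum((y_hat - y.mean()) ** 2)  # explained variation
tss = np.sum((y - y.mean()) ** 2)      # total variation

r_squared = 1 - rss / tss  # fraction of the variation explained by the model
print(r_squared)           # same value as ssr / tss
```

For this particular toy data the fit is nearly perfect, so R-squared comes out close to 1; with noisier data more of TSS ends up in RSS and R-squared drops toward 0.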