To evaluate the overall fit of the predicted values of the response variable to the observed values, and to look for outliers, we can examine a plot of the standardized residuals for each of the observations. The standardized residuals are simply the residual values transformed to have a mean of 0 and a standard deviation of 1. This transformation is called normalizing or standardizing the values so that they fit a standard normal distribution. In a standard normal distribution, 68% of observations are expected to fall within one standard deviation of the mean, that is, between negative one and one standard deviation, and 95% of the observations are expected to fall within two standard deviations of the mean. The plot of the standardized residuals for each of the observations is not one of the plots that is printed out by proc GLM, so we need to run some additional code. We use the gplot procedure to do this. First, we ask for labels for the standardized residuals, to which we gave the name stdres in the procedure above, and for the observations, which in this data set are the countries. Then we use the plot command to plot the standardized residuals by country. The slash vref equals zero option draws a horizontal line at the mean, which for the standardized residuals is zero. If we take a look at this plot, we see that most of the residuals fall within one standard deviation of the mean, that is, between -1 and 1. A few countries have residuals that are more than two standard deviations above or below the mean of zero. For the standard normal distribution, we would expect 95% of the values of the residuals to fall within two standard deviations of the mean. There are no observations that are three or more standard deviations from the mean, so we do not appear to have any extreme outliers. 
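The gplot step described above might look like the following sketch. The output data set name (results) is an assumption; stdres and country are the variable names given in the description, with stdres presumed to have been saved by an output statement in the earlier proc GLM step.

```sas
/* Sketch of the gplot step described above; the data set name (results)
   is assumed, and stdres would have been created by the earlier proc GLM
   output statement */
proc gplot data=results;
  label stdres = 'Standardized Residual'
        country = 'Country';
  plot stdres * country / vref = 0;  /* vref=0 draws a horizontal line at the mean */
run;
```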
In terms of evaluating the overall fit of the model, there are some rules of thumb you can use to determine how well your model fits the data based on the distribution of the residuals. If more than 1% of our observations have standardized residuals with an absolute value greater than 2.5, or more than 5% have an absolute value greater than or equal to 2, then there is evidence that the level of error within our model is unacceptable; that is, the model is a fairly poor fit to the observed data. None of the residuals from our model exceeded an absolute value of 2.5, but 5.4% were greater than or equal to an absolute value of 2.0. This suggests that the fit of this model is relatively poor. The biggest contributor to poor model fit is leaving out important explanatory variables. To improve the fit of this model, we should include more explanatory variables to better explain the variability in our female employment rate response variable. Going back to the output from the regression analysis, the plots=all option in the procedure also prints out residual plots for the individual explanatory variables. Here's the residual for each observation at different values of Internet use rate. There's clearly a funnel-shaped pattern to the residuals: the absolute values of the residuals are significantly larger at lower values of Internet use rate, get smaller, that is, closer to zero, as Internet use rate increases, but then start to get larger again at higher levels. This is consistent with the other regression diagnostic plots, which indicate that this model does not predict female employment rate as well for countries that have either high or low levels of Internet use rate, and is particularly poor at predicting female employment rate for countries with low Internet use rates. 
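These rules of thumb can be checked directly from the standardized residuals. A minimal sketch, assuming the standardized residuals were saved as stdres in a data set named results: since the mean of a 0/1 indicator is the proportion of observations flagged, proc means gives the percentages to compare against the 1% and 5% cutoffs.

```sas
/* Flag observations whose standardized residuals exceed the rule-of-thumb
   cutoffs; the data set name (results) and variable name (stdres) are
   assumed from the earlier steps */
data check;
  set results;
  over25 = (abs(stdres) > 2.5);  /* 1 if |stdres| > 2.5, else 0 */
  over20 = (abs(stdres) >= 2);   /* 1 if |stdres| >= 2, else 0 */
run;

/* The mean of each 0/1 indicator is the proportion of flagged observations */
proc means data=check mean;
  var over25 over20;
run;
```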
Similar to our urban rate variable, there also appears to be a somewhat curvilinear pattern to these observations, where the residuals get larger again for countries for which Internet use rate exceeds about 80 per 100 residents. This suggests that the association between Internet use rate and female employment rate may also be curvilinear. So maybe we also want to add a second order polynomial, or quadratic, term for Internet use rate to the model as well. Because we have multiple explanatory variables, we might want to take a look at the contribution of each individual explanatory variable to the model fit, controlling for the other explanatory variables. One type of plot that does this is the partial regression residual plot. The GLM procedure does not print partial regression plots, so we need to run some additional code. There's another SAS procedure called reg which does provide partial regression plots. The reg procedure specifically estimates linear regression models. The regression models we've tested so far could also have been tested using the reg procedure, but we prefer to teach the GLM procedure because it is much more flexible for specifying your models and accommodates many different kinds of linear models. However, because the reg procedure was designed to run linear regression models specifically, it can provide some additional diagnostic plots that are not available with the GLM procedure. So, to make a long story short, we can use the reg procedure to test the same regression model, and it will provide the same results, along with options for a few different plots. Here's the SAS code. First, because the reg procedure does not allow you to specify multiplicative variables in the model statement, we have to square our urban rate variable ahead of time. Because this is an additional data management step, I will create a new temporary data set called partial from my previously managed temporary data set, which I had named new. 
Then I create a new variable called urbanrate2 that is equal to the centered urban rate variable times itself, that is, squared, which I will use in my proc reg regression model. I use the reg procedure and add plots=partial to request a partial regression plot. Then I specify my regression model after the model command. Note that I am using my new urbanrate2 variable to fit the quadratic curve. I add a slash and then partial to ask SAS to also estimate the partial regression results, followed by a semicolon, so that we can plot the partial residuals, and then run to run the code. The output is the same as we saw in [INAUDIBLE] regression. If we go to the partial regression residual plot for the Internet use explanatory variable, we see that it's a scatter plot. This scatter plot shows the effect of adding Internet use as an additional explanatory variable to a model that includes only the two urban rate explanatory variables. The residuals from a model predicting the female employment rate response variable from the other explanatory variables, excluding Internet use rate, are plotted on the vertical axis, and the residuals from a model predicting Internet use rate from all the other explanatory variables are plotted on the horizontal axis. What this means is that the partial regression plot shows the relationship between the response variable and a specific explanatory variable after controlling for the other explanatory variables. We can examine the plot to see if the Internet use rate residuals show a linear or nonlinear pattern. If the Internet use variable shows a linear relationship to the dependent variable after adjusting for the variables already in the model, it meets the linearity assumption in multiple regression. If there's an obvious nonlinear pattern, this would be additional support for adding a polynomial term for Internet use rate to the model. 
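The data step and reg procedure described above might look like the following sketch. The response and explanatory variable names (fememployrate, urbanrate_c, internetuserate) are assumptions based on the description of the earlier data management; the data set names (new, partial) and urbanrate2 are as named in the lesson.

```sas
/* Square the centered urban rate ahead of time, since proc reg does not
   allow multiplicative terms in the model statement; variable names
   (urbanrate_c, internetuserate, fememployrate) are assumed */
data partial;
  set new;
  urbanrate2 = urbanrate_c * urbanrate_c;
run;

/* plots=partial requests the partial regression residual plots, and the
   / partial model option requests the partial regression results */
proc reg data=partial plots=partial;
  model fememployrate = urbanrate_c urbanrate2 internetuserate / partial;
run;
```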
When we take a look at the plot for Internet use rate here, we see that, in contrast to the plot we just looked at of the residuals at different values of Internet use rate without adjusting for the urban rate variables, the partial regression residual plot for Internet use does not clearly indicate a nonlinear association. Rather, the residuals are spread out in a random pattern around the partial regression line. In addition, many of the residuals are pretty far from this line, indicating a great deal of female employment rate prediction error. This suggests that although Internet use rate has a statistically significant association with female employment rate, this association is pretty weak after controlling for urbanization rate. We can also look at the partial regression residual plots for each of the other explanatory variables as well. Finally, we can examine a leverage plot to identify observations that have an unusually large influence on the estimation of the predicted value of the response variable, female employment rate, or that are outliers, or both. The leverage of an observation can be thought of in terms of how much the predicted scores for the other observations would differ if the observation in question were not included in the analysis. Leverage always takes on values between zero and one. A point with zero leverage has no effect on the regression model. Outliers are observations with residuals greater than two or less than negative two. If we go back to the GLM results, we can find the leverage plot in the output. SAS kindly shows outliers as red symbols, observations with high leverage values as green symbols, and observations that are both outliers and high leverage as brown symbols. One of the first things we see in the leverage plot is that we have a few outliers, that is, countries that have a residual that is greater than 2 or less than negative 2. 
We've already identified these outliers in some of the other plots we've looked at. But this plot also tells us that these outliers have small, that is, close to zero, leverage values, meaning that although they are outlying observations, they do not appear to have a strong influence on the estimation of the regression parameters. On the other hand, we see that there are a few cases with higher than average leverage, but one in particular is more obvious in terms of having an influence on the estimation of the predicted value of female employment rate. This observation has high leverage but is not an outlier. We don't have any observations that are both high leverage and outliers.
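If you want the leverage and studentized residual values for each country, rather than only the plot, one option not shown in the lesson is to add the influence and r options to the proc reg model statement. This is a sketch only: it reuses the temporary data set partial and the urbanrate2 variable created above, and the other variable names (fememployrate, urbanrate_c, internetuserate) are assumptions.

```sas
/* Print influence diagnostics per observation: the influence option gives
   the hat diagonal (leverage), and r gives the studentized residuals;
   variable names other than urbanrate2 are assumed */
proc reg data=partial;
  model fememployrate = urbanrate_c urbanrate2 internetuserate / influence r;
run;
```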