Welcome to our notebook here on dimensionality reduction. In this notebook, we're going to be using the Portuguese wholesale distributor data set. That data set contains the annual spending on fresh products, milk products, grocery products, and so on. The last two columns, which we're actually going to end up dropping, are Channel and Region. The reason we drop those is that we want to focus on the numeric values here, and these are both technically categorical values. It would be easy enough to one-hot encode them if we wanted to, but for this, we're just going to drop those two columns. We're then going to import our necessary libraries, as we do at the start of each one of our notebooks.

Here for part one, we're going to import our data and check each of the data types. Then, as mentioned, we're going to drop the Channel and Region columns, as we won't be focusing on these throughout our examples here using PCA. We're then going to convert the remaining columns to floats, if that's necessary. And then we're going to make a copy of the data that we just created using the .copy method to preserve it; we'll be using that later on, and we'll see how in a bit.

So, first things first, we import our data using pandas.read_csv. We look at the shape and we see that we have 440 rows and 8 columns. Recall that the number of columns is going to be important, as our goal here with PCA is to reduce the number of columns that we're working with when we create our models, or whatever it is we want to do with our data. Maybe we want to visualize, and we want to reduce to two columns. So we see our first five rows, and we see here that we still have that Channel and Region, which we said we don't want to include. So we're just going to call data.drop, dropping Channel and Region with axis equals one. And when we look at the data types, we see that they're each integers.
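The steps just described might look like the following sketch. The file name isn't shown in the transcript, so a tiny stand-in DataFrame is used here in place of the real CSV:

```python
import pandas as pd

# Stand-in for the wholesale customers data; in the notebook this would be
# something like pd.read_csv("wholesale_customers_data.csv") (file name assumed)
data = pd.DataFrame({
    "Channel": [1, 2, 1],
    "Region": [3, 3, 1],
    "Fresh": [12669, 7057, 6353],
    "Milk": [9656, 9810, 8808],
    "Grocery": [7561, 9568, 7684],
})

print(data.shape)  # (rows, columns) -- 440 x 8 for the real data set

# Drop the two categorical columns so only the numeric spending remains
data = data.drop(["Channel", "Region"], axis=1)

print(data.dtypes)  # each remaining column starts out as an integer type
```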
And we're just going to convert each of those to float by calling .astype(float) on each one of the different columns. Now we have them all as floats. Then, as mentioned, we're going to want to save this original data for later. So recall here we have data, which is the data frame that we've just created, and then a copy of that, which we're not going to touch for a bit.

Here in part two, we need to again ensure that our data is scaled and relatively normally distributed; it will be easier to work with normally distributed data. As mentioned in the lecture, we saw how important it is to scale our data to ensure that no feature has extra weight when trying to come up with the different principal components. So we're going to examine the correlation between each one of our different features. Recall this will be important when we are doing PCA: what we will be looking for is that if two features are very highly correlated, they're not adding any extra information, and we want to remove or reduce those, or combine a few to end up with fewer features overall. So if they're highly correlated, we can probably remove some without losing much variance from the overall data set. We're then going to perform any transformations and scale our data using whatever scaling method you prefer, whether it's the min-max scaler or the standard scaler. We're then going to view the pairwise correlation plots using a pair plot, just to visualize all the relationships, as well as seeing whether we have normally distributed data by looking across the diagonal of the pair plot.

So the first thing that we want to do is call data.corr so we can see the correlation between each one of the different features. This will give us, for each feature, the correlation with all the other features, in a square matrix, in a square data frame.
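The float conversion and the copy might look like this; the name of the copy variable (`data_orig`) is an assumption, since the transcript doesn't spell it out:

```python
import pandas as pd

data = pd.DataFrame({
    "Fresh": [12669, 7057, 6353],
    "Milk": [9656, 9810, 8808],
})

# Convert every remaining column from integer to float
for col in data.columns:
    data[col] = data[col].astype(float)

# Keep an untouched copy for the pipeline comparison later on
# (variable name is assumed)
data_orig = data.copy()

print(data.dtypes)  # all columns are now float64
```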
And just to ensure that we can get the highest correlation, which feature has the highest correlation with each, and because one feature with itself will always have a correlation of one, we're going to replace the diagonal values, which start off as all ones, with zeros. So we're saying for x in the range of the correlation matrix's shape[0]; it's a square matrix, so we could have used shape[0] or shape[1]. So for every index in the range of our matrix, we're going to take the diagonal value, so (0, 0), (1, 1), (2, 2), and replace that one with a zero. And we can see now our correlation matrix has the correlation between fresh and milk and grocery, and then for fresh and fresh it's just a zero, across each one of the different diagonal entries. Now we're going to take the absolute value of that full correlation matrix, as we don't care whether it's positive or negative, just the strength of that correlation, and we're going to call idxmax to see which of the other features each feature is most highly correlated with. So we're saying, what's the index of the max value? So for fresh it's frozen, for milk it's grocery, so on and so forth.

We're then going to examine the skew for each one of our columns, and then take the log transformation, if necessary, for those that have higher skew. Recall that skew is a value with zero being no skew, a positive value being a right skew, and a negative value being a left skew. The higher that value is, the stronger the skew. So we call data.skew to see the skew of each one of our different columns. We sort them from largest to smallest, and those are going to be our log columns; that will now be a pandas Series. And then we're just going to take those log columns that are greater than 0.75, those that have a higher skew, and we see here the values that tend to have a higher skew.
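Zeroing the diagonal, pulling out the most correlated feature with idxmax, and checking skew could be sketched like this, again with a small stand-in DataFrame rather than the full data set:

```python
import pandas as pd

data = pd.DataFrame({
    "Fresh": [12669.0, 7057.0, 6353.0, 13265.0],
    "Milk": [9656.0, 9810.0, 8808.0, 1196.0],
    "Grocery": [7561.0, 9568.0, 7684.0, 4221.0],
})

corr_mat = data.corr()

# A feature's correlation with itself is always 1, so zero out the diagonal
for x in range(corr_mat.shape[0]):
    corr_mat.iloc[x, x] = 0.0

# For each feature, which OTHER feature is it most strongly correlated with?
# abs() because we only care about the strength, not the sign
most_correlated = corr_mat.abs().idxmax()
print(most_correlated)

# Skew per column, sorted largest to smallest; values above 0.75
# are treated as highly skewed and marked for a log transform
skew_vals = data.skew().sort_values(ascending=False)
log_columns = skew_vals[skew_vals > 0.75]
print(log_columns)
```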
And for those, we're going to take the log transformation of each, hopefully creating more normally distributed data. So we're calling log_columns.index; this is the log columns pandas Series that we just defined, and from the index we get each one of these, Delicatessen, Frozen, Milk, and so on, which is going to match up with each one of our different data columns. So we're going to replace those columns in place with the log transformation of those columns.

We can then also apply the min-max scaler. So we import the MinMaxScaler from sklearn.preprocessing; we want to ensure that all our values are on the same scale. We call MinMaxScaler, we initiate the object, and then we say for each column in our columns, we fit and transform on that column. So we're going to replace it again in place, to standardize that data so all values are between zero and one, using the min-max scaler, which recall is just subtracting the minimum value and then dividing by the max minus the min. So that'll ensure all our values are between zero and one.

The next thing that we want to do is visualize everything that we've just done. So we're going to see each of the relationships, and hopefully see those high correlations in each of the different scatter plots that we'll see with the pair plot, as well as, hopefully, more normally distributed data, which we see for the most part throughout each one of our different columns. And we see, for example, that milk and grocery have a pretty high correlation; if you look just three columns in and two columns down, you see that high correlation.

Now in part three, we want to introduce how we can do this all in one step. This is especially useful if we want to incorporate this into some supervised learning model later on, and be able to pass in different parameters throughout.
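The log transform and the column-by-column min-max scaling might look like the following; the stand-in columns and the use of np.log are assumptions based on the description above (the pair plot line is left as a comment, since it only produces a figure):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

data = pd.DataFrame({
    "Fresh": [12669.0, 7057.0, 6353.0, 13265.0],
    "Milk": [9656.0, 9810.0, 8808.0, 1196.0],
})

# Stand-in for the index of the high-skew columns found via data.skew()
log_columns = pd.Index(["Fresh", "Milk"])

# Replace each high-skew column in place with its log transformation
for col in log_columns:
    data[col] = np.log(data[col])

# Scale each column to [0, 1]: (x - min) / (max - min)
mms = MinMaxScaler()
for col in data.columns:
    data[col] = mms.fit_transform(data[[col]]).ravel()

# sns.pairplot(data) would then show the pairwise scatter plots and the
# (hopefully more normal) distributions along the diagonal
print(data.describe())
```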
So we're going to use the Pipeline function, and we saw that during our course on supervised learning. But what's important when using the Pipeline function is that each of the functions that are passed in, each one of the different pieces of that pipeline, has to have a fit and a transform method. So, we want to take the log, and then apply the MinMaxScaler. But the log doesn't have the fit and transform methods that are built in to each one of the different sklearn objects we've been working with: MinMaxScaler has fit and transform, but the log transformation does not. So, in order to ensure that we have a version of the log transformation with the fit and transform methods that we can pass into our pipeline, we're going to call FunctionTransformer. This will take whatever function it is that you want to pass in and convert it so that it has fit and transform methods available. So now we have a log transformer object, which is going to be the log transformation with fit and transform methods, and once we do that, we can pass it into our pipeline.

So first, within our Pipeline, we need to pass in a list of tuples, where the first value of each tuple is just a name, in case we want to pull that step out later, and the second value is the actual function that we want to call. So here we pass the log transformer that we just created, and then MinMaxScaler. We pass this list of tuples into our Pipeline, and then we can just call pipeline.fit_transform on our original data. If you recall, we made a copy, and we didn't change that copy of the data at all. We can call fit_transform and get the output down the line of both taking that log transformation and that min-max scaling. And we run this, and then that data_pipe should equal the data that we just transformed.
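The whole pipeline, including the allclose check described next, might be sketched like this; the variable names and the use of np.log are assumptions:

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, MinMaxScaler

# Stand-in for the untouched copy of the original float data
data_orig = pd.DataFrame({
    "Fresh": [12669.0, 7057.0, 6353.0, 13265.0],
    "Milk": [9656.0, 9810.0, 8808.0, 1196.0],
})

# Wrap np.log so it gains fit/transform methods and can live in a Pipeline
log_transformer = FunctionTransformer(np.log)

# A list of (name, transformer) tuples; the names let us pull steps out later
pipeline = Pipeline([
    ("log", log_transformer),
    ("minmax", MinMaxScaler()),
])

data_pipe = pipeline.fit_transform(data_orig)

# The same two steps done manually, for comparison
manual = MinMaxScaler().fit_transform(np.log(data_orig))

# allclose checks the arrays are equal up to tiny rounding differences
print(np.allclose(data_pipe, manual))  # True
```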
So we're going to check that using numpy.allclose, which is just going to check that each value within each of our arrays is the same, allowing for a bit of possible rounding error many decimal points down the line. So we run this, and we see that it's true that all of our values are the same, and we see that our pipeline worked just as well as taking each one of these different steps separately. Now that closes out part three. In part four, we're going to start working with PCA on this transformed data that we've been working with, and see how much of the variance we can explain with different numbers of principal components. All right, I'll see you there.