Now in this question, we're going to explore the idea of using clustering as a form of feature engineering. The first thing that we need to do is create a variable that we're going to try to predict, since the feature engineering here is in service of supervised learning. We are going to create a binary target variable y, which will simply denote whether or not the quality is greater than 7: greater than 7 will be equal to 1, and 7 or less will be equal to 0. We're then going to create a variable called X_with_kmeans from our original data, so it's going to be a Pandas DataFrame containing everything that we've worked with so far. If you recall, we added on agglom as a column as well as kmeans as a column. We'll drop quality, color, and agglom, which leaves kmeans, so we have all of our float columns plus that kmeans column. Then we're going to create another Pandas DataFrame, X_without_kmeans, which simply takes the X_with_kmeans we just created and drops the kmeans column. Then for both datasets, we will use StratifiedShuffleSplit with 10 different splits. We will fit 10 different Random Forest Classifiers, compute the ROC AUC score of each of these 10 classifiers, take the average for each dataset, and see which performed better: the one with kmeans or the one without kmeans. In order to do so, we first have to import our RandomForestClassifier. We will also import roc_auc_score as well as StratifiedShuffleSplit. Hopefully you'll recall all of these from the course when we did supervised learning. We're then going to create our target variable: when the quality is greater than 7, we set it equal to 1. If we run just the first part, data quality greater than 7, that will return either True or False; appending .astype(int) converts True to 1 and False to 0.
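Sketched in code, the target and the two feature sets described above might look like the following. Since the wine dataset built up earlier in the course isn't reproduced here, a small synthetic stand-in DataFrame (with assumed column names like `kmeans` and `agglom`) takes its place:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the wine dataset built up in earlier exercises:
# a few float feature columns plus 'color', 'quality', and the cluster-label
# columns 'kmeans' and 'agglom' that were added previously.
rng = np.random.default_rng(0)
data = pd.DataFrame({
    'fixed_acidity': rng.normal(7, 1, 200),
    'alcohol': rng.normal(10, 1, 200),
    'color': rng.choice(['red', 'white'], 200),
    'quality': rng.integers(3, 10, 200),
    'kmeans': rng.integers(0, 2, 200),
    'agglom': rng.integers(0, 2, 200),
})

# Binary target: 1 when quality is greater than 7, otherwise 0.
y = (data['quality'] > 7).astype(int)

# All float columns plus the k-means cluster label.
X_with_kmeans = data.drop(['agglom', 'color', 'quality'], axis=1)

# The same feature set, but without the k-means labels.
X_without_kmeans = X_with_kmeans.drop('kmeans', axis=1)
```

The comparison (data > 7) returns a boolean Series, and `.astype(int)` maps True to 1 and False to 0, exactly as described above.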
We then initiate our objects: X_with_kmeans, which is just the dataset we've built up so far but dropping agglom, color, and quality, so we still have kmeans alongside our float columns; and X_without_kmeans, which takes the X_with_kmeans we just defined and drops the kmeans column. Now we have two different Pandas DataFrames: one is just the float columns, which is X_without_kmeans, and one is the float columns with that kmeans column as well, which is X_with_kmeans. We're then going to initiate our StratifiedShuffleSplit object, and then define a function which will allow us to pass in an estimator (spoiler alert: that estimator will be RandomForestClassifier, but we'll see how we'll reuse this for logistic regression as well), then an X and a y, that is, our features and our outcome variable. First we initiate an empty list for the ROC AUC values; if you recall, we're going to compute 10 different values and then take the mean of those values, so we'll append each of them to this empty list. We take a train index and a test index for each split in sss.split(X, y), using whichever X and y we passed into the function. Because sss is defined to have 10 different splits, when we run this for-loop we run through 10 different iterations of StratifiedShuffleSplit: different splits of our data where the stratification ensures that the same proportion of quality greater than 7 shows up in each of our different train and test sets. We then set X_train and X_test using those train and test indices, and we set y_train and y_test with them as well. We can then take the estimator that we passed into our function and call .fit on the training set that we just defined.
Then we can come up with our actual prediction, which is going to be estimator.predict on our test set, our holdout set. We can do the same for our predicted probabilities. If you recall, if we want the ROC AUC score, then we need the predicted probabilities to actually compute it. predict_proba outputs the probabilities for both of the classes; we only want the positive class, so we take all rows but only column 1, not the zeroth column. Those are our scores, and then for each iteration we can call roc_auc_score with our actual values, y_test, and the scores we just computed. We continuously append to that empty list so that we collect all 10 different ROC AUC values. We then take the mean of the list, and we have the average of the ROC AUC scores across those 10 different splits. Now that we have that function defined, which outputs the average across the 10 splits, we can set our estimator to RandomForestClassifier. With estimator equal to this object, we pass it into the function we just defined, along with X_with_kmeans, the dataset with the kmeans column, as well as our target column y. Then we do the same thing, running that function with the same estimator except on X_without_kmeans, our dataset without that extra column. We run this, and we can see that without the kmeans cluster we actually did worse than with the kmeans cluster: we performed better when we had our kmeans cluster as an input into our random forest. Now, what I'd like to do is explore the idea of changing the number of labels that we incorporate when we create this new feature, or this new set of features if we think about it in terms of one-hot encoding. We're going to say, for n equals 1 through 20, fit a KMeans algorithm with n clusters: first one cluster, then two clusters, three clusters, and so on.
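The averaging function and the random-forest comparison described above can be sketched as follows. This is a minimal, self-contained version: the DataFrames are synthetic stand-ins for the ones built in earlier cells, and the function name `get_avg_roc_10splits` matches the name the walkthrough uses:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedShuffleSplit

# --- Synthetic stand-ins for the DataFrames built in earlier cells ---
rng = np.random.default_rng(0)
X_with_kmeans = pd.DataFrame({
    'alcohol': rng.normal(10, 1, 300),
    'sulphates': rng.normal(0.5, 0.1, 300),
    'kmeans': rng.integers(0, 2, 300),
})
X_without_kmeans = X_with_kmeans.drop('kmeans', axis=1)
y = pd.Series((X_with_kmeans['alcohol']
               + rng.normal(0, 0.5, 300) > 10).astype(int))

# 10 stratified train/test splits, as described above.
sss = StratifiedShuffleSplit(n_splits=10, random_state=42)

def get_avg_roc_10splits(estimator, X, y):
    """Average ROC AUC over the 10 stratified splits defined by sss."""
    roc_auc_list = []
    for train_index, test_index in sss.split(X, y):
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]
        estimator.fit(X_train, y_train)
        # Probability of the positive class: column 1, not column 0.
        y_scored = estimator.predict_proba(X_test)[:, 1]
        roc_auc_list.append(roc_auc_score(y_test, y_scored))
    return np.mean(roc_auc_list)

# Fit the same classifier on both feature sets and compare average ROC AUC.
estimator = RandomForestClassifier(random_state=42)
score_with = get_avg_roc_10splits(estimator, X_with_kmeans, y)
score_without = get_avg_roc_10splits(estimator, X_without_kmeans, y)
```

On the actual wine data, the lecture reports a higher average score with the kmeans column; on this synthetic stand-in the direction of the difference isn't guaranteed.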
We then have to one-hot encode the labels, because otherwise label number 19 would be treated as greater than label number 5 or label number 10. Instead we want each label one-hot encoded, so that there's no ordinal value attached to the different labels. Once we have our one-hot encoded version of that column, we're then going to fit a logistic regression model and compute the average ROC AUC score. Then we're going to plot that average ROC AUC score for each of our different numbers of clusters. I'm going to run this while explaining because it may take a little bit of time. The way that we start off is by setting X_basis equal to just those float columns. We then initiate our StratifiedShuffleSplit with 10 splits, as we did before. We then define a new function, create_kmeans_columns. As I mentioned, we can't just create one column with multiple labels; we have to one-hot encode those labels. We set km equal to KMeans with the number of clusters equal to whatever n we pass in, then fit on just our float columns. When we call km.predict on our X_basis, we output each of the different labels: if n were equal to 20, we'd have values starting from 0 up through 19, giving our 20 different clusters. We then take the column we just created and call pd.get_dummies on it, which creates, for 20 clusters, 20 different columns, each holding a one or a zero according to whether the label happened to be a 0, a 1, a 2, and so on. We then concatenate just those float columns with the new kmeans columns that we defined, which may be up to 20 columns that we're adding on. Once we have this DataFrame, the idea is that we can pass it in as our features and fit our models. We initiate our estimator as LogisticRegression.
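A minimal sketch of create_kmeans_columns, again using a small synthetic X_basis in place of the float columns from the wine data (the `kmeans_` column prefix is an assumption added here for readability):

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic stand-in for X_basis, the float feature columns of the wine data.
rng = np.random.default_rng(0)
X_basis = pd.DataFrame({
    'alcohol': rng.normal(10, 1, 200),
    'sulphates': rng.normal(0.5, 0.1, 200),
})

def create_kmeans_columns(n):
    """Fit KMeans with n clusters and append one-hot-encoded cluster labels."""
    km = KMeans(n_clusters=n, n_init=10, random_state=42)
    km.fit(X_basis)
    # Labels run from 0 to n-1; one-hot encode them so they carry no
    # ordinal meaning (label 19 is not "greater than" label 5).
    km_col = pd.Series(km.predict(X_basis), name='kmeans')
    km_cols = pd.get_dummies(km_col, prefix='kmeans')
    return pd.concat([X_basis, km_cols], axis=1)
```

Calling `create_kmeans_columns(3)` would return the two float columns plus three indicator columns, one per cluster label.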
We say the ns, the numbers of clusters that we want to run through, are 1 through 20. We're then going to get our list of ROC AUC values by calling the get_avg_roc_10splits that we defined just above, in the cell above. We pass in the estimator, and our X value is the output of create_kmeans_columns; remember, this outputs a Pandas DataFrame that concatenates our new one-hot encoded labels onto the original float columns. We use that same target variable y for each n in the ns we defined up here. We're then going to plot that out: we initialize our plot and then plot the ns versus the different ROC AUC values output by the function that we're running here. We've already run this, so let's look down at the results. We see it jumps around quite a bit as we add on and reduce some of those clusters; keep in mind each point is an average over just 10 splits. That closes out our section here on the different clustering methods, and gives you an introduction to how you can also use these different clustering methods to actually do some feature engineering. With that, we close out our section on clustering, and in the next lecture we will move on to dimensionality reduction. All right, I'll see you there.
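Putting the whole sweep together, a self-contained sketch of this last experiment might look like the following. The data is again a synthetic stand-in, and the helper functions repeat the definitions from the earlier cells so this runs on its own:

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend; omit this line in a notebook
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedShuffleSplit

# --- Synthetic stand-ins for the objects built in earlier cells ---
rng = np.random.default_rng(0)
X_basis = pd.DataFrame({
    'alcohol': rng.normal(10, 1, 300),
    'sulphates': rng.normal(0.5, 0.1, 300),
})
y = pd.Series((X_basis['alcohol'] + rng.normal(0, 0.5, 300) > 10).astype(int))

sss = StratifiedShuffleSplit(n_splits=10, random_state=42)

def get_avg_roc_10splits(estimator, X, y):
    """Average ROC AUC over the 10 stratified splits defined by sss."""
    scores = []
    for train_index, test_index in sss.split(X, y):
        estimator.fit(X.iloc[train_index], y.iloc[train_index])
        y_scored = estimator.predict_proba(X.iloc[test_index])[:, 1]
        scores.append(roc_auc_score(y.iloc[test_index], y_scored))
    return np.mean(scores)

def create_kmeans_columns(n):
    """Append n one-hot-encoded KMeans cluster labels to the float columns."""
    km = KMeans(n_clusters=n, n_init=10, random_state=42).fit(X_basis)
    km_cols = pd.get_dummies(pd.Series(km.predict(X_basis)), prefix='kmeans')
    return pd.concat([X_basis, km_cols], axis=1)

# Sweep the number of clusters and record the average ROC AUC for each n.
estimator = LogisticRegression(max_iter=1000)
ns = range(1, 21)
roc_auc_list = [get_avg_roc_10splits(estimator, create_kmeans_columns(n), y)
                for n in ns]

# Plot number of clusters vs. average ROC AUC.
plt.plot(list(ns), roc_auc_list)
plt.xlabel('Number of clusters as features')
plt.ylabel('Average ROC AUC over 10 splits')
# plt.show()  # in a notebook
```

Because each point averages only 10 splits, the resulting curve tends to jump around as n changes, which matches what the lecture observes.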