Welcome to our lab on K-means clustering, our first lab for Course 4. In this course we're going to learn how to use K-means with sklearn. Throughout this lab we'll run the K-means algorithm, understand which parameters are customizable within the algorithm, and then learn how to use the inertia curve that we discussed in lecture to determine the optimal number of clusters. Now, a quick overview: K-means is one of the most basic clustering algorithms that we'll be working with. It relies on finding cluster centers to group data points, minimizing the sum of squared distances between each data point and its cluster center. So first things first, we're going to import all the necessary libraries. We bring in numpy, pandas, seaborn, and matplotlib. Then we bring in scale and KMeans, and we bring in make_blobs, which we'll see come into play and which will be very useful for playing around with K-means. We'll use shuffle as well, and we'll see that later on. Then we're going to set a number of parameters for our visualizations, and then we'll get started with creating our first simple data set. In order to do this, we first create our plotting function, and I'll break it down step by step. We have our colors, and it's easiest to think of this as a list, where color = 'brgcnyk'. We can think of looping through it: b for blue, r for red, and so on. We set alpha equal to 0.5, and that's just how opaque each of our data points will be. Hopefully you realize at this point that we're going to be creating some type of scatterplot, and the size of each of our points is s = 20. We then call plt.gca, which stands for get current axes, and set the aspect equal to 'equal'. You see it here in quotes as the string 'equal', and that's because we're going to be plotting a circle, and it's each unit going either in the x-direction or the y-direction.
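The imports the lab describes might look like the sketch below. The exact import lines are assumptions based on the narration (seaborn is used for styling only, so it's treated as optional here):

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so this runs without a display
import matplotlib.pyplot as plt

try:
    import seaborn as sns
    sns.set()          # seaborn's default plot styling, as in the lab
except ImportError:
    sns = None         # styling only; everything below works without it

from sklearn.preprocessing import scale   # feature scaling, used later in the lab
from sklearn.cluster import KMeans        # the clustering algorithm itself
from sklearn.datasets import make_blobs   # synthetic clustered data to play with
from sklearn.utils import shuffle         # random re-ordering of samples
```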
We want them to be equal to one another, so that we see a clean circle. You can try removing this line to see what it looks like otherwise. We then say: if we have no clusters, so we're not clustering at all, then we just create a single scatterplot, passing in our X: all rows for the first column, all rows for the second column. The color is going to be that first color, so b, our alpha is set to 0.5, and our size to 20. Now, if we do have a number of clusters, we want to plot each of the different clusters, and the way we do this is to call plt.scatter inside a loop. We say we want the x values for which our K-means model came up with the label equal to i, where i loops over the number of clusters, so the first one, the second one, and so on. Then we say we want that first column; and for all the rows equal to i again, we want that second column. So we get each of the two columns, but only the rows whose labels match the cluster we're on. Then we set different colors, looping through each of the colors we defined above. We also plot the actual cluster centers so we can see where those lie. So we take cluster center i, the one for the cluster we're currently on, and give its x-coordinate as well as its y-coordinate, the first column and the second column, again using that same color. We mark it with an x so that we can differentiate it from our actual data points, and we make it larger, setting its size equal to 100. To see what this looks like, we're going to create our X. In order to do that, we create an angle array, which is just a numpy array of values between 0 and 2 times pi, 20 equally spaced points.
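The helper function the narration just walked through might look like the sketch below. The name display_cluster and its defaults follow the narration, but the minor details (argument order, the headless backend) are assumptions:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans  # the helper's `km` argument is a fitted KMeans model

def display_cluster(X, km=None, num_clusters=0):
    color = 'brgcnyk'               # one color per cluster: b for blue, r for red, ...
    alpha = 0.5                     # opacity of each data point
    s = 20                          # size of each data point
    plt.gca().set_aspect('equal')   # equal x/y units, so the circle looks round
    if num_clusters == 0:
        # no clustering yet: plot every point in the first color
        plt.scatter(X[:, 0], X[:, 1], c=color[0], alpha=alpha, s=s)
    else:
        for i in range(num_clusters):
            # rows whose K-means label equals i; first column vs. second column
            plt.scatter(X[km.labels_ == i, 0], X[km.labels_ == i, 1],
                        c=color[i], alpha=alpha, s=s)
            # the cluster center itself, marked with a larger 'x'
            plt.scatter(km.cluster_centers_[i][0], km.cluster_centers_[i][1],
                        c=color[i], marker='x', s=100)
```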
And we're saying we don't want the endpoint, so it goes up to but not including 2 times pi. We then append two arrays together to create our X, our first feature and our second feature, one for each of our two axes, where the first is the cosine of our angle and the second is the sine of our angle. The 0 just says that we want to append these along axis 0, so that we have them one alongside the other, and then we transpose so that we end up with two columns. So I'll quickly run this first. We display the cluster, and we see a perfect circle here. Just to take a quick look at what X looks like: X is just these two columns, one of them the cosine of the angle and the other the sine of that angle. And here we have 0 clusters, because our default above set the number of clusters equal to zero; there is no km yet, but we'll introduce K-means models shortly. So all we're doing is plotting the X. Now we're going to group this data into two clusters to see what it looks like, and we'll use two different random states to initialize the algorithm, to see how we come up with different results depending on how we initialize it. So we set the number of clusters equal to 2. We call KMeans, setting the number of clusters equal to that value, setting random state equal to 10, and saying we only want to initialize once. Generally speaking, K-means will initialize a number of times and then just choose the run with the best inertia. Here we're initializing only once, just to see the differences between two different random states. Even then, the default, if we look here, will be to use the k-means++ initialization, which ensures it's more likely to choose far-apart starting points, but it will still choose different points on each run.
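Put together, the circle data and the single-initialization K-means described above can be sketched as follows (a sketch following the narration; variable names are assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans

# 20 equally spaced angles in [0, 2*pi), with the endpoint excluded
angle = np.linspace(0, 2 * np.pi, 20, endpoint=False)

# Append cosine and sine along axis 0, then transpose to get two columns:
# column 0 is cos(angle), column 1 is sin(angle) -- points on the unit circle.
X = np.append([np.cos(angle)], [np.sin(angle)], 0).transpose()

# Two clusters, a fixed random state, and a single initialization (n_init=1),
# so we can see how the starting points affect the final result.
num_clusters = 2
km = KMeans(n_clusters=num_clusters, random_state=10, n_init=1)
```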
So it's going to be important to either let K-means initialize a number of times, or, if you initialize once per run, to run it several times yourself, check those inertias, and choose whichever result is best on your own. So we create KMeans with those hyperparameters we've passed, and we call km.fit(X). Now we have our K-means model fit, and using that fitted model we can display the clusters with the function we defined earlier. You can see that the km we came up with has attributes giving us the different labels, as well as the different cluster centers, so those are now available and we can create this scatterplot. We run this, and we see our two different groupings. Again, because of the way we created this data, these groupings could really fall anywhere: there are no natural groupings, which is why the algorithm won't necessarily converge to the same spot each time. We see that here, with an x marking where each centroid actually lies. One x should be the average of all the red dots, and the other x the average of all the blue dots, and we see how it classifies each of those two classes. Now, setting the random state equal to 20 here, we can see that it comes up with a very different clustering. If we think about it, coming back to lecture, why are these clusters different when we run K-means twice? This should be familiar, as we talked through it quite thoroughly with each of those graphs: it's because the starting points of the cluster centers have an impact on where the final clusters actually lie. And again, these are clusters that probably don't actually exist, given how equally spaced the points are, so it's highly likely that the clusters will come up in a different place on each run.
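The two-seed comparison above might look like the sketch below. Whether the two partitions actually come out different can vary with the scikit-learn version, so the printout is illustrative; the centroid check at the end reflects the point made in the narration, that each final center is the average of the points assigned to it:

```python
import numpy as np
from sklearn.cluster import KMeans

# The same unit-circle data as before
angle = np.linspace(0, 2 * np.pi, 20, endpoint=False)
X = np.append([np.cos(angle)], [np.sin(angle)], 0).transpose()

# One initialization each, two different seeds: the final clusters can differ
# because the starting centers differ.
km10 = KMeans(n_clusters=2, random_state=10, n_init=1).fit(X)
km20 = KMeans(n_clusters=2, random_state=20, n_init=1).fit(X)

print("seed 10 labels:", km10.labels_, "inertia:", km10.inertia_)
print("seed 20 labels:", km20.labels_, "inertia:", km20.inertia_)

# After convergence, each cluster center is the mean of its assigned points.
for km in (km10, km20):
    for i in range(2):
        assert np.allclose(km.cluster_centers_[i],
                           X[km.labels_ == i].mean(axis=0))
```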
So I'm going to pause here, and we'll continue by figuring out the optimal number of clusters, and how we'd actually do that in Python code. All right, I'll see you in a bit.