Welcome to our lab on K-Nearest Neighbors. Here, we're going to be working with a customer churn data set. We've been talking about customer churn; now we'll look at an actual data set with customer data such as long-distance usage, data usage, monthly revenue, and so on as the available features. We want to predict whether a customer will churn or not. We're going to be using a subset of customers who have phone accounts. Since the data include a mix of numeric, categorical, and ordinal variables, we'll have to load the data and do some pre-processing, and then use K-Nearest Neighbors to predict whether or not the person will churn. After completing this lab, you should have a working understanding of how to pre-process a variety of variables in order to apply K-Nearest Neighbors, how to choose K, and how to evaluate model performance.
So the first thing we're going to do is import the data. This time, it's actually going to be in this Python file, and we're going to import churn data, which is the data that we want. For Question 1, we just want to import the data and examine the columns and data. We want to remove certain columns that we have listed here, and take an initial look at the data for both numeric and non-numeric features. So we have our churn data, which is going to be a pandas DataFrame itself. We're going to set df equal to that, just with these columns dropped. So we just call .drop, and we're dropping the columns we don't need for what we're trying to do here. The next thing we want to do is look at df.describe. Here, we're just rounding everything to two decimal places, so we don't get too much scientific notation or long decimals.
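As a rough sketch of the drop-and-describe step just described: the toy frame below stands in for the churn data, and its column names and values are made up for illustration rather than taken from the lab's file.

```python
import pandas as pd

# Toy stand-in for the churn data; column names and values are hypothetical.
churn_df = pd.DataFrame({
    "id": ["a1", "b2", "c3", "d4"],
    "months": [1, 12, 40, 72],
    "gb_mon": [5.0, 20.0, 0.0, 80.0],
    "monthly": [25.5, 70.25, 19.9, 110.0],
    "churn_value": [1, 0, 0, 1],
})

# Drop columns we do not need as predictors (here, just the identifier).
df = churn_df.drop(columns=["id"])

# Summary statistics for the numeric columns, rounded to two decimals
# to avoid scientific notation and long decimals.
summary = df.describe().round(2)
print(summary)
```

Note that .drop returns a new DataFrame by default, so the original churn_df still has all of its columns.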
We see our columns: months, which is how many months we've had them as a customer; gigabytes per month; their monthly payments; how satisfied they are; and the churn value, which should be either 0 or 1. When we look at this, we see that the mean number of months is around 32, and the minimum and maximum run from 1 to 72. There's a pretty large standard deviation compared to the mean. So we start to get an idea of these numerical values; same with gigabytes per month. The monthly amount that's paid also has a large standard deviation. It seems that each of these has a fairly large standard deviation. Satisfaction is a score between 1 and 5. And churn just takes the values 0 and 1, where about 27% actually churn.
Then we do df.describe and we include objects. Here, we're looking at all of our categorical columns, or those initially brought in as strings. For each one we see the number of unique values, what the top value is, and how often that top value shows up. We see that we have a lot of binary features here, with a lot of these being twos. Offer has six unique values; the top one being None is a little bit problematic, but we'll just assume that it's some type of encoding for there not being any type of offer. Then fiber optic is the top value for internet type, and month-to-month is the top value for contract. And we can dive into the different types of objects and categories that are available to us.
For Question 2, we want to identify which variables are binary, which ones are categorical and not ordinal, which ones are categorical and ordinal, and which ones are numeric. Each one of the non-numeric features will then need to be encoded in some form in order to feed it into our machine learning model.
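To make the object summary above concrete, here is a minimal sketch on made-up data. The column names and values are hypothetical, and the string columns are cast to object dtype explicitly so that include=["object"] picks them up regardless of how the installed pandas version stores strings by default.

```python
import pandas as pd

# Hypothetical toy data; only the shape of the output matters here.
toy = pd.DataFrame({
    "offer": ["None", "Offer A", "None", "Offer B", "None"],
    "internet_type": ["Fiber Optic", "DSL", "Fiber Optic", "Cable", "None"],
    "paperless": ["Yes", "No", "Yes", "Yes", "No"],
    "monthly": [25.5, 70.25, 19.9, 110.0, 55.0],
})

# Force object dtype on the string columns (an assumption made for portability).
string_cols = ["offer", "internet_type", "paperless"]
toy = toy.astype({col: object for col in string_cols})

# For object columns, describe reports count, unique, top, and freq.
obj_summary = toy.describe(include=["object"])
print(obj_summary)
```

Here the string "None" is an ordinary category level, not a missing value, which matches the assumption made above that it encodes "no offer".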
So we're going to start by identifying the number of unique values in each variable, so that we can separate binary from categorical and not binary. Then we're going to list out the categorical, numerical, binary, and ordinal variables. Note that the variable months could be treated as numeric, since it's the number of months that someone has been a customer, but we're going to treat it as ordinal. We'll see how we do that towards the end as we go through the solution. For the other categorical variables, we want to make sure that we encode them, either ordinally or as 1's and 0's.
So the first thing we want to do is see the number of unique values; all you have to do is run df.nunique. Something like months should have quite a lot, and internet_type has only 4, as we saw before. We see a bunch of 2s, and then monthly, which is their monthly payments, can have a large range of unique values. We're then going to see which ones have only two. So we're just filtering df_uniques down to where there are only two values, and then we pull out the index. When we run that, we get the columns for which we only had two unique values.
Then for our categorical variables, and here we mean the ones besides these binaries, which are of course still categorical, we're assuming those are the ones with more than 2 and up to 6 unique values, so greater than 2 and less than or equal to 6. And we see the columns that meet that condition. With that, we run this list comprehension over each of our categorical variables, that list we just defined. For each i we want both i and the unique values for that column. So when we run this, we see for offer each one of the six unique values: Offer A, B, C, D, E, and None. For internet type, each of its unique values, and so on and so forth.
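The filtering just described can be sketched as follows. The toy frame and its column names are made up for illustration; the variable names df_uniques, binary_variables, and categorical_variables follow the narration but are otherwise assumptions.

```python
import pandas as pd

# Hypothetical toy data with binary, multi-level, and numeric columns.
toy = pd.DataFrame({
    "offer": ["None", "Offer A", "Offer B", "None",
              "Offer C", "Offer A", "None", "Offer B"],
    "paperless": ["Yes", "No", "Yes", "Yes", "No", "Yes", "No", "No"],
    "contract": ["Month-to-Month", "One Year", "Two Year", "Month-to-Month",
                 "One Year", "Two Year", "Month-to-Month", "One Year"],
    "churn_value": [1, 0, 0, 1, 0, 1, 0, 0],
    "monthly": [25.5, 70.25, 19.9, 110.0, 55.0, 64.3, 80.1, 42.0],
})

# Number of unique values per column.
df_uniques = toy.nunique()

# Binary variables: exactly two unique values.
binary_variables = list(df_uniques[df_uniques == 2].index)

# Categorical (non-binary) variables: more than 2 and at most 6 unique values.
categorical_variables = list(
    df_uniques[(df_uniques > 2) & (df_uniques <= 6)].index
)

# Pair each categorical column with its unique levels.
unique_levels = [[col, list(toy[col].unique())] for col in categorical_variables]

print(binary_variables)
print(unique_levels)
```

The 2-to-6 cutoff is a heuristic: on a small sample, a numeric column could have six or fewer distinct values and be misclassified, so the resulting lists are worth checking by eye.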
So, we're going to say off the top that our ordinal variables here are going to be contract and satisfaction. Contract is ordinal in that one year is more than month to month, and two years is more than one year, but the differences between levels aren't necessarily equal to one another. So there's an ordering, but the spacing within that ordering may not mean much. We're going to end up encoding those as 0, 1, 2. And satisfaction is just going to be the values 1 through 5. We then look at months and see its unique values. We're going to include that among our ordinal variables, so we just append it to our list. What we're doing here is creating lists of our ordinal variables, our binary variables, and our overall categorical variables. So now we have our ordinal variables: contract, satisfaction, and months.
We're then going to pull out our numeric variables. That's going to be all of our columns, df.columns, with our ordinal variables, our categorical variables, and our binary variables subtracted out. Some of those may have been stored as numeric before, so we wouldn't have been able to just pull the numeric variables out by dtype. Instead, we take the categories that we just defined, subtract them all out, and end up with just our numeric variables. If we look now, we can see that our numeric variables are just the monthly payments and the gigabytes per month.
We can then look at a histogram of each of our numeric variables. We see that gigabytes per month is pretty heavily skewed to the lower end, with some outliers out towards 80. Monthly looks fairly normal, maybe a bit left-skewed, except for that large spike of low values all the way to the left. And then what we're going to do with our months variable is call cut, and what that's going to do is create bins.
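One way to do the "subtract out" step described above is with set differences. This is a minimal sketch: the column names and list contents below are assumptions chosen to mirror the narration, not the lab's actual lists.

```python
import pandas as pd

# Empty frame with hypothetical column names; only the names matter here.
df = pd.DataFrame(columns=[
    "months", "offer", "paperless", "contract",
    "satisfaction", "gb_mon", "monthly", "churn_value",
])

# Lists built by hand for this sketch (assumed, not from the lab).
binary_variables = ["paperless", "churn_value"]
categorical_variables = ["offer", "contract"]
ordinal_variables = ["contract", "satisfaction"]
ordinal_variables.append("months")  # months is treated as ordinal

# Numeric variables are whatever remains after removing the other groups.
# Selecting by dtype would not work, since satisfaction, months, and
# churn_value are stored as numbers but are not treated as numeric here.
numeric_variables = sorted(
    set(df.columns)
    - set(ordinal_variables)
    - set(categorical_variables)
    - set(binary_variables)
)
print(numeric_variables)
```

Sets do not preserve order, so the result is sorted to keep it deterministic; overlap between the lists (contract appears in two of them) is harmless with this approach.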
So we're going to replace months with bins, five different bins, and I'll show you what this looks like in just a second. I do want you to note something, but let's look at the result first before I make that note. Now if we look at df.months, we have five different bins of equal width, going from about 0.9 to 15.2, 15.2 to 29.4, and so on and so forth up to the largest value, in five equal cuts. Now, these are going to be named by their intervals as they are. Generally speaking, if you want to create bins, you can also pass in your own labels. You can do something like labels equals low, low-medium, and so on; just make sure there are five to match up with the five bins. We can't do that now because we've already replaced months in place. But if you do use pandas.cut, note that you can pass in the actual labels, as well as where you want the cuts to happen. Here, by default, it creates these names and equal-width cuts.
All right, that's our first portion: grouping each of our variables into categorical, binary, ordinal, and numerical. In the next section, we'll go about actually encoding all of those into numerical values and scaling them, to ensure that we can feed them into our K-Nearest Neighbors model.
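The binning options mentioned above can be sketched on a small made-up series. The month values, label names, and custom bin edges here are all hypothetical, chosen only to show the three ways of calling pandas.cut.

```python
import pandas as pd

# Hypothetical tenure values spanning the 1 to 72 month range.
months = pd.Series([1, 10, 20, 35, 50, 72])

# Default: five equal-width bins, named by their intervals.
default_bins = pd.cut(months, bins=5)

# Same five bins, but with our own labels (one label per bin).
labeled_bins = pd.cut(
    months,
    bins=5,
    labels=["low", "low-medium", "medium", "medium-high", "high"],
)

# Explicit cut points: four intervals, so four labels.
custom_bins = pd.cut(
    months,
    bins=[0, 12, 24, 48, 72],
    labels=["first year", "second year", "years 3-4", "years 5-6"],
)

print(default_bins.cat.categories)
print(list(labeled_bins))
print(list(custom_bins))
```

With bins given as an integer, pandas extends the lowest edge slightly below the minimum so that the smallest value falls inside the first bin, which is why the first interval starts just under 1 rather than at 1.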