So here's the algorithm. Find the entropy of your data without any splits. Then take each of the attributes: we can split on A, we can split on B, we can split on C, the obvious choices. Calculate the gain in purity you get by splitting on each of the attributes. Choose the attribute which has the highest purity gain, split the data, and repeat the procedure. Typically, we can do this until all the data gets classified, so that every data point gets classified, and later on we will see how we can reduce this. Because if all the data gets completely classified, you are probably overfitting; your model is too complex. So you may have to reduce the depth of the tree to get the right size of tree. Just an aside: this is not the only purity measure possible. There are other measures you can use, like the Gini index. But as far as we're concerned, all we need to know is that there is a purity measure, and the one that Rattle uses is called the entropy measure. There are other measures which you can read about in the references given at the end of this lecture. So let's take this particular example of flowers. You've got 110 observations, out of which 89 are green. So the probability of green is 80.91 percent, or 0.8091: 89 divided by 110. Now take the probability of green times the log, base two, of the probability of green, with a negative sign attached to it, because the log of a probability will always be negative and you want positive values: that is 0.8091 times 0.3057, which is 0.2473. The number of observations which are red is 21, so the fraction of observations which are red is 0.1909. You multiply the probability of red, 0.1909, by its log base two, which is minus 2.389, again with the negative sign attached, and you get 0.4561. You add up the two and you get the entropy of the data set, which is 0.7034. This is high: it's closer to one than to zero.
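The entropy calculation just described can be sketched in a few lines of Python; the function name and structure here are my own, not from the lecture:

```python
import math

def entropy(counts):
    """Shannon entropy (base 2) of a list of class counts, e.g. [89, 21]."""
    total = sum(counts)
    h = 0.0
    for c in counts:
        if c > 0:                  # a class with zero members contributes nothing
            p = c / total
            h -= p * math.log2(p)  # each term -p*log2(p) is non-negative
    return h

# The lecture's example: 110 flowers, 89 green and 21 red
print(round(entropy([89, 21]), 4))  # 0.7034
```

A perfectly pure node, say all 89 flowers green, gives entropy 0, and a 50-50 split gives the maximum value of 1.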
So this is a measure of the purity of the data before you split. Step two: we try to split this data using each of these attributes. For the first categorical attribute A, we could split the data into the A1's on one side and the A2's on the other side. We write this as Entropy(A, COLOR). What it means is: I'm computing the entropy if we split using the feature A, and the basis for calculating the entropy is the color. That's the meaning of the function Entropy(A, COLOR). So we break the data into two pieces, A1 and A2: all the flowers which have the value A1 for attribute A, and all the flowers which have the value A2 for attribute A. This gives us the measure of the entropy due to the split. In the formula, pA1 and pA2 refer to the fractions of flowers which went to the left and to the right, and the other two terms refer to the entropies of the flowers on each side. So pA1 and pA2 say how many flowers went to the left and how many went to the right, and for each of these groups of flowers we recompute the purity measure. Take this example. If a flower has A1, it goes to the left; that data set is pure, and because it is pure, its entropy is zero. So the entropy on that side is zero. If you go to the right: 54.55 percent of the flowers went to the right, and 45.45 percent went to the left. So 54.55 percent is the fraction of flowers which went to the right. The entropy of the colors on the right turns out to be 0.934068, which means the flowers there are fairly evenly split between reds and greens; I have the exact values, you can check it out. How do you get 0.934068, the entropy when you branch to the right? It is the probability of green times the log of the probability of green, plus the probability of red times the log of the probability of red, with the negative signs attached, computed over all those flowers which branched to the right.
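The weighted entropy of a split can be sketched as follows. The per-branch counts below are an assumption on my part (the lecture states only the fractions), chosen so that the A1 branch is pure and the A2 branch holds 54.55 percent of the flowers with entropy about 0.934:

```python
import math

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def split_entropy(groups):
    """Weighted entropy after a split: sum over branches of
    (fraction of data in branch) * (entropy of branch).
    groups is one class-count list per branch, e.g. [[50, 0], [39, 21]]."""
    n = sum(sum(g) for g in groups)
    return sum(sum(g) / n * entropy(g) for g in groups)

# Hypothetical branch counts consistent with the lecture's numbers:
# left branch (A1) is pure; right branch (A2) has 60 of 110 flowers (54.55%)
left, right = [50, 0], [39, 21]
print(round(entropy(right), 4))                # 0.9341, the right-branch entropy
print(round(split_entropy([left, right]), 4))  # 0.5095, the entropy after the split
```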
That gives you the calculation for the flowers which go to the right. If you add up the weighted terms, this is your new entropy. So you purify the data, in some sense, by splitting on this attribute. What is the purity gain? The purity gain is the original value, which is 0.7034, minus the new value if you split on the variable A. This is said to be the gain due to splitting on A. We can repeat this for each of the features; here's the table if you do. You can see you could split on A, or B, or C, or D, or E. These gain values are the deltas you get. So 0.25 is the gain you get from splitting on the variable B. The highest gain comes from splitting on B, so you split on B. So here's the root node: B1 on the left and B2 on the right. We can now apply this algorithm to B1 and B2 separately, and see what to split on next. Of course, everything on the left is B1 and everything on the right is B2, so you cannot split on B any more in this particular example, but sometimes you may repeat a split; I'll show an example of that. This is the final tree. You first split on B1 and B2. The moment you split on B1 you got all the greens; so if it is B1, everybody loves those flowers. On B2, we split again, and the best split was on feature A. If you split on A1, everybody loved it. Go to A2, and you still have an impure node, so you can split further. The best split there is between C1 and C2, and on C2 you get many of the flowers that people hate, or don't like. Then on C1 you still have a mixture of reds and greens, so to refine it further we split on E1 and E2. In this particular example, we get a perfect classification. So this node is called the root, and these nodes are called the leaves. Note that the leaf nodes are rectangular in shape, and the leaf nodes here are all pure.
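Computing the gain for each attribute and picking the best can be sketched like this. The per-branch counts for A and B are invented for illustration (the lecture gives only the resulting gain table), but they are constructed so that B wins, as in the lecture:

```python
import math

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def gain(parent_counts, groups):
    """Purity gain = parent entropy minus the weighted entropy after the split."""
    n = sum(parent_counts)
    weighted = sum(sum(g) / n * entropy(g) for g in groups)
    return entropy(parent_counts) - weighted

# Hypothetical candidate splits of the 89-green / 21-red node
splits = {
    "A": [[50, 0], [39, 21]],
    "B": [[60, 0], [29, 21]],
}
best = max(splits, key=lambda a: gain([89, 21], splits[a]))
print(best)  # B, the attribute with the highest purity gain
```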
So we have successfully managed to completely classify all the objects into whether they're red or green in this simple example. Sometimes you cannot do that. So coming back to where we were going: what are the rules? The rules are simple. There are five rules, and you can almost read them off the tree. If it is B1, everybody loves it. If it is B2 and A1, everybody loves it. If it is B2, A2, and C2, nobody likes it. So you actually have rules from the data which now allow you to classify any new flower into the different categories. The question is: is this model too complex? Perhaps I should have stopped splitting somewhere earlier, because this is overfitting the data; we managed to perfectly classify everything. So how do you control the complexity? We control the complexity by many methods, but basically by limiting the depth of the tree, by pruning it. We may grow the whole tree and then keep pruning it till we are satisfied. Here are some rules you can use for pruning. One, you grow a node, meaning you split on a node, only if there are sufficient objects in that node; call this a SizeThreshold. So, for example, split a node only if there are at least 15 objects in that node. Second, split only when you're not happy with the purity. If a node already has a purity of 90 percent, you don't split it. So here are certain rules you can use to cut the size of the tree. In this example, we have pruned the previous tree to get a smaller tree. One last thing. All the examples given so far are on categorical variables. What happens if you have numerical values? Take this example: initially we split into A1, A2, A3, and when we came to A2, there was a numerical attribute called X. So what we could do is split, but here the splitting criterion is: you pick a number. If X is smaller than t1, you split to the left.
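The two pruning rules can be sketched as a simple pre-pruning check. The thresholds of 15 objects and 90 percent purity are the example numbers from the lecture; the function itself is my own sketch:

```python
def should_split(class_counts, size_threshold=15, purity_threshold=0.90):
    """Decide whether a node is worth splitting, following the two rules above."""
    total = sum(class_counts)
    if total < size_threshold:          # rule 1: too few objects in the node
        return False
    purity = max(class_counts) / total  # fraction in the majority class
    if purity >= purity_threshold:      # rule 2: node is already pure enough
        return False
    return True

print(should_split([39, 21]))  # True: 60 objects, only 65 percent pure
print(should_split([10, 2]))   # False: fewer than 15 objects
print(should_split([95, 5]))   # False: already 95 percent pure
```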
If X is greater than t1, you split to the right. So there you go; it's as simple as that. We can use numerical variables or categorical variables. For numerical variables, you find a point to split at, and split into two. One thing that may be running through your minds is: should we allow only Two-Way Splits or Multi-Way Splits? A Multi-Way Split can always be captured with Two-Way Splits, so if that's confusing to you, always work with splitting into two; it doesn't matter. If two branches come from a node, that's called a Two-Way Split. If many branches come from a node, that's called a Multi-Way Split. Now, there are other methods we have already seen, right? K-means, hierarchical clustering: algorithms for joining similar items together. How does that relate to what we're doing with a decision tree? Well, the leaf nodes of your tree may have many observations, and we can treat those as a cluster. So how is that different from the clustering techniques we have already seen? The answer is very simple. The difference between a decision tree and those clustering techniques is that the decision tree needs labeled data; it needs to know what category each observation belongs to. So whenever target values or labels are available, we should use decision trees rather than the other methods we saw, which belong to what we call unsupervised learning. This is a very simple point, but please do remember it. You could probably use both, it doesn't matter, but decision trees are preferable, and there's one more reason which we will see at the end of today's module. So, just to complete this: we used a decision tree to classify things into red and green. You wouldn't believe it the first time you learn this, but decision trees can also be used for regression.
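Finding the split point t1 for a numerical attribute can be sketched by trying the midpoints between consecutive sorted values and keeping the one with the lowest weighted entropy. This is a brute-force sketch with made-up data, not the lecture's exact procedure:

```python
import math

def entropy(labels):
    n = len(labels)
    return -sum((labels.count(v) / n) * math.log2(labels.count(v) / n)
                for v in set(labels))

def best_threshold(xs, labels):
    """Try midpoints between consecutive sorted x values; return the threshold
    that minimises the weighted entropy of the two resulting sides."""
    pairs = sorted(zip(xs, labels))
    best_t, best_h = None, float("inf")
    for i in range(1, len(pairs)):
        t = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [lab for x, lab in pairs if x < t]
        right = [lab for x, lab in pairs if x >= t]
        h = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if h < best_h:
            best_t, best_h = t, h
    return best_t

xs = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]  # hypothetical values of attribute X
labels = ["green"] * 3 + ["red"] * 3
print(best_threshold(xs, labels))  # 6.5: cleanly separates greens from reds
```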
That means you can actually predict the value when the target value is numeric, for example pressure, or temperature, or weight, or things like that. How do you do that? Basically, you create a tree and you have leaf nodes. The leaf nodes could have one or many values, and we take the average value in the leaf node as the prediction. So you drop an observation down the tree; it will hit a leaf node, and in that leaf node there will be lots of data points. We take the average of the target values of those data points and say your predicted value is equal to that mean. We will see an example of that going forward. Okay.
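The leaf prediction for regression described here is just the mean of the training targets that reached that leaf. A minimal sketch, with hypothetical temperature values:

```python
def leaf_predict(target_values):
    """Regression-tree leaf: predict the mean of the targets in this leaf."""
    return sum(target_values) / len(target_values)

# Hypothetical temperatures of the training points that landed in one leaf
print(leaf_predict([20.0, 22.0, 24.0]))  # 22.0
```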