So we're going to talk about rule-based classifiers. You're probably familiar with a bunch of rules that people use. For example, take body temperature: if your temperature is less than 98.6, you're normal; between 98.6 and 100, maybe a light fever; 100 to 102, a medium fever; more than 104, a high fever, go and talk to your doctor. Or take diabetic or not: in the first column you have the HbA1c score, and it has ranges. Below six is okay, between seven and eight is good, and above that you are in the diabetic region where some action is needed. The rules are expressed in terms of measurements. We will not go into the details, but there it is. How did we get this? Through a lot of data: people came up with these rules, and people accept them as rules to be followed, at least for adults.

Let's take obesity classifiers. You might have seen charts like this. Notice again that you have clean boundaries. In the previous example, the boundaries were temperatures; in this case, you have weight on one axis and height on the other, and you get boundaries. Depending on where you fall, they say you're obese or not, and that comes from data which they have collected. They have transformed these numbers into what they call the BMI, the body mass index chart for adults. You take the height and weight, put them into the chart, and from the chart you read off the BMI, and the BMI gives you the obesity level. You can see the use of a discriminative classifier here. We want to develop things like that as we go on.

Similarly, if you took the periodic table of elements, somebody said: depending on how the atoms are structured, you can say whether you have a metal, whether you have a gas which is neutral, whether you have a radioactive material. I'm not an expert at this, but as far as I'm concerned, you can see the classification boundaries are very clean, depending on the structure of the atoms and how many electrons are in the outer shell.
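A rule-based classifier like the temperature example is just a chain of threshold tests. Here is a minimal sketch using the thresholds quoted above; note the quoted rules jump from "100 to 102" to "more than 104", so the 102 to 104 band is mapped to "medium fever" here as an assumption:

```python
def classify_temperature(temp_f):
    """Rule-based classifier using the fever thresholds quoted in the lecture.

    Assumption: the lecture leaves the 102-104 band unnamed, so it is
    folded into "medium fever" here for completeness."""
    if temp_f < 98.6:
        return "normal"
    elif temp_f <= 100:
        return "light fever"
    elif temp_f <= 104:
        return "medium fever"
    else:
        return "high fever"  # go and talk to your doctor
```

The point is that the rules here were handed to us; the rest of the section asks how to infer such rules from data.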
That was the theory on which they based this classification: again, a way of discriminating one group of elements from another. So let's reverse the question. Instead of saying, here are the rules, suppose I gave you data and no rules, and asked: what rules can you infer from the data? You also need to know which class each object belongs to. So there is one requirement here: you have data, and for each of these data points we know which class it belongs to, determined by some method outside your system. Now you are saying: using this data, create rules which will allow me to classify objects into these different classes.

I'll first illustrate this with a small example, as we have done in the past, and then take you to the software. Here's a toy example, a subset of a much bigger one. There is a company called Ponting which uses five features to describe flowers. Its aim is to classify flowers into the ones that are popular and those which are not, so it color-coded them: green means popular, red means not. It defines a popularity index: if at least 35 percent of people like a flower, it is classified as popular. Each type of flower it carries has five features: A, B, C, D, E. Feature A could be fragrance, delicate fragrance versus intense fragrance, and we call the values A1 and A2: A1 corresponds to delicate, A2 to intense. Similarly, B has two categories, B1 and B2; C has two categories, C1 and C2; D has D1 and D2; and the fifth feature is labeled E1 and E2. So we have five features, each with two possible values (in the fuller example, one feature has three).

So how do we build our tree? The idea is very simple. We break the data into maybe two or three parts, then each of these parts we break again.
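The break-and-break-again idea can be sketched in a few lines of Python. The records below are made up for illustration (the lecture does not give the actual flower table); each flower is a dict of categorical feature values plus a popular/not label, and `partition` groups the records by one feature's values:

```python
from collections import defaultdict

# Hypothetical records in the style of the toy example: five categorical
# features A..E and a class label ("popular" / "not"). These values are
# invented for illustration only.
records = [
    {"A": "A1", "B": "B2", "C": "C1", "D": "D1", "E": "E2", "label": "popular"},
    {"A": "A2", "B": "B1", "C": "C1", "D": "D2", "E": "E1", "label": "not"},
    {"A": "A1", "B": "B1", "C": "C2", "D": "D1", "E": "E2", "label": "popular"},
]

def partition(records, feature):
    """Split records into groups, one group per value of the given feature."""
    groups = defaultdict(list)
    for r in records:
        groups[r[feature]].append(r)
    return dict(groups)

# Tree building then recurses: partition on a feature, and repartition
# each resulting group until the groups are (nearly) pure in their labels.
```

Which feature to partition on at each step is exactly what the purity measure, introduced next, decides.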
So we partition, then we take each partition and recursively repartition it, and what we're checking is whether the classification is getting purer or not. The basis for partitioning is what we call a purity measure. So how do you compute purity? Which feature to pick depends on whether it improves purity, but how do you measure purity? I'll give you one measure; there are various others. This one is the entropy measure. Entropy is an information measure; it comes from the theory of communication, and it is given by the formula

E = -Σ_{c ∈ C} p(c) log₂ p(c)

where p(c) is the fraction of the objects that belong to class c, and the sum is taken over the set of classes C (it could be red and green, or whatever it is). The logarithm is taken to base 2 in this example. If you don't like the formula, that's fine, but you should understand what the entropy means.

What does this formula mean? If everything belongs to one class, one of these probabilities will be one and all the others will be zero; it's a pure classification, and the entropy will be zero. If one of the probabilities is zero, that term also contributes zero (we take 0 · log 0 to be 0). So if everything is in one category, the value of your entropy is zero. What you really want is as small an entropy as possible. Where will it be highest? When the number of objects in each class is the same, that is, when the distribution of objects is uniform. At that point the entropy is largest; that is the state in which you can say nothing at all about whether an object belongs to class 1, class 2, or class 3. So there are two extremes.
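The entropy formula above translates directly into code. A minimal sketch, computing E = -Σ p(c) log₂ p(c) over the class labels of a region:

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy -sum(p(c) * log2 p(c)) over the classes present in labels.

    Classes with zero count never appear in the Counter, which matches
    the convention that a term with p(c) == 0 contributes nothing."""
    n = len(labels)
    counts = Counter(labels)
    return -sum((k / n) * math.log2(k / n) for k in counts.values())
```

As described above: a region where every object is one class has entropy 0 (pure), and a 50/50 split between two classes has entropy 1 (maximally uncertain).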
One, where every class has equal probability and the entropy is highest, and one where everything belongs to just one category and the entropy is zero. So you sum the terms up. Generally, you also have to do one more thing. If it is binary classification, you can leave it alone: the objects are classified into red or green, and that's fine. But if you have more than two categories, you also divide by log₂ of the total number of categories, call it C. What that does is keep the value of the entropy in the range 0 to 1. That's the idea. Purity is then the opposite of entropy: since the entropy is between 0 and 1, one minus the entropy is called the purity of a region. So what you're really trying to do is improve the purity; the best you can get is one, the least you can get is zero.
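The normalization and the purity definition can be sketched as follows. The function names are mine, and as an assumption the sketch divides by log₂ of the number of distinct classes present in the region (the lecture says the total number of categories, which may differ if a class is absent from a region):

```python
import math
from collections import Counter

def normalized_entropy(labels):
    """Entropy divided by log2(number of distinct classes), so the value
    stays in [0, 1] even when there are more than two categories.

    Assumption: we normalize by the classes present in `labels`, not by
    a fixed global class count."""
    n = len(labels)
    counts = Counter(labels)
    if len(counts) <= 1:
        return 0.0  # everything in one class: pure, entropy 0
    h = -sum((k / n) * math.log2(k / n) for k in counts.values())
    return h / math.log2(len(counts))

def purity(labels):
    """Purity as defined in the lecture: one minus the (normalized) entropy."""
    return 1.0 - normalized_entropy(labels)
```

A pure region gives purity 1, a uniform spread over the classes gives purity 0, and tree building keeps picking the split that pushes purity toward 1.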