In this lecture, we'll talk about Combinatorial Testing, which is one way of systematically testing the input space of values for a program. So let's look at different ways that we can get test cases from specifications. We start out with functional specifications, and we can decompose those into a set of independently testable features. Then we can choose representative values for those features, and those values represent behaviors of the model. From that, we can generate test case specifications, and from those, we can get test cases. So this is a way of automating our testing process. If we can choose representative values for each of the variables, or each of the inputs into the program, and combine them in some sensible and systematic way, then we can get a lot of tests out of this information and test our program systematically. So what's Combinatorial Testing? What we're going to do is identify distinct attributes that can be varied for each testable feature. I talked about input parameters, so input variables, but we can also talk about different environmental conditions: is the file system available or not? Is the database full or not? These kinds of conditions. We can also talk about different configurations of the program, if the program is part of a family of programs. And then we can generate combinations of these attribute values systematically. There are various techniques that allow us to identify which things we should test and how we should combine them, so that we can test as much as possible with as few test cases as possible. So in order to test our input space systematically, if we have an input parameter that has a large number of values, or maybe even an infinite number of values, we need to break up that set of values in some way. What we do is called category-partition testing. 
So what we're going to do is take something that has a large number of values and put it into a number of categories, and those categories should partition all the values of the original input. Not only that, but we expect the program to behave largely the same for any value that we sample out of each category. In this way, we can systematically test the input space by checking one value from each category in the partition. Now, usually we have more than one variable, so we have to think about the categories together: we need to take one value from this category for this variable and one value from that category for that variable and combine them in some way. One way to do that is called pairwise testing. What we do here is try to get a small number of test cases that systematically test a large number of inputs, each of which may have a large number of values. And what we're going to do is try to test each pair of values from different categories. So if I had one variable that I categorized into high and low, and another variable that I categorized into positive values and negative values, I'd test negative 1 with high, negative 1 with low, positive 1 with high, and positive 1 with low. That would test all the pairs. It turns out that if you have lots of variables, it's possible to do pairwise testing with a much smaller number of tests than you might think. And once we go beyond pairwise testing, we can think of three-way testing or four-way testing, where we test all three-way or four-way combinations of values of different variables. That still leaves the question of how we choose to make the categories. A lot of times, what we do is catalog-based testing: we use the experience of test designers to determine how to slice up the input space. 
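The idea of covering every pair of category values can be sketched in code. Here's a minimal greedy pairwise generator in Python; the parameter names and the greedy strategy are our own illustration, not the algorithm real tools use (those typically rely on covering-array constructions):

```python
from itertools import combinations, product

def all_pairs(params):
    """Enumerate every (parameter, value) pair of pairs that a pairwise
    suite must cover, across each two different parameters."""
    names = list(params)
    pairs = set()
    for a, b in combinations(names, 2):
        for va, vb in product(params[a], params[b]):
            pairs.add(((a, va), (b, vb)))
    return pairs

def pairwise_suite(params):
    """Greedy sketch: repeatedly add the candidate test that covers the
    most not-yet-covered pairs, until every pair is covered."""
    names = list(params)
    uncovered = all_pairs(params)
    candidates = [dict(zip(names, vals)) for vals in product(*params.values())]
    suite = []
    while uncovered:
        def gain(test):
            return sum(1 for a, b in combinations(names, 2)
                       if ((a, test[a]), (b, test[b])) in uncovered)
        best = max(candidates, key=gain)
        suite.append(best)
        for a, b in combinations(names, 2):
            uncovered.discard(((a, best[a]), (b, best[b])))
    return suite

# The lecture's two-variable example: with only two variables, covering
# all pairs means covering all four combinations.
tests = pairwise_suite({"sign": ["positive", "negative"],
                        "level": ["high", "low"]})
print(len(tests))  # 4
```

With more variables, each test covers one pair per pair of parameters at once, which is why the suite grows far more slowly than the full cross product.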
Maybe you know that this input leads to different behavior. Maybe it's altitude, and the plane behaves differently when altitude is greater than 10,000 feet or below 10,000 feet. So the experience of the airplane tester says we're going to create the categorization right at that 10,000-foot mark: we want a value greater than 10,000 feet, exactly 10,000 feet, and below 10,000 feet. That's how we choose to break up our variables into categories of values for each attribute. You can think about wanting to sample normal values, boundary values at the edges of what we expect, special values like 0, and some error values to see how robust our system is when we provide it bad data. Then we can generate the combinations of different category values for test cases. What we want is to check all the, say, pairwise combinations, but we don't want to check those that are impossible. It turns out that often there are constraints between different variables in our space, and some combinations of values would be impossible, so we want the automation to tell us which those are. Let me give you a concrete example. We have font configurations in Microsoft Word, and for one font, there are lots of things we can do that we might want to test. We can check the effects, so strikethrough, double strikethrough, and so on, and there are 128 possible combinations of just turning those on and off. If we add to that the font styles that we have, regular, italic, bold, and bold italic, and the underline styles, of which there are six, I believe, then we end up with 8,192 possible tests that we could run. If we want to bring in the advanced features, which are on a different tab, then we have more than a million possible combinations. If we want to add color, well, bringing in color alone adds 4 billion test cases. If you multiply that by the millions that you get the other way, we're talking about a lot of possible tests. 
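As a small sketch of that altitude example: the 10,000-foot boundary is from the lecture, while the category names and the specific representative values below are hypothetical choices of ours.

```python
# Categories partition the altitude input at the behavior boundary the
# tester knows about: below, exactly at, and above 10,000 feet, plus an
# error category for impossible input.
def altitude_category(altitude_ft):
    if altitude_ft < 0:
        return "error"          # negative altitude is erroneous input
    if altitude_ft < 10_000:
        return "below_10000"
    if altitude_ft == 10_000:
        return "at_10000"
    return "above_10000"

# One representative value sampled from each category, including the
# boundary values right at the 10,000-foot edge and the special value 0.
representatives = [-1, 0, 9_999, 10_000, 10_001]
for alt in representatives:
    print(alt, "->", altitude_category(alt))
```

The point is that five concrete values stand in for an effectively unbounded input, one per region where we expect uniform behavior.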
So what we'd like to be able to do is identify representative values. For each category, especially when it ranges over a large space like color, we want to identify classes of values, and we want to try to find boundary values and erroneous conditions. If we were to apply this to font configurations, we might restrict ourselves to looking at when the characters are black, when they're white, or when they're some other color. And if the color ranges are such that you only get to use 0 to 255 for red, green, and blue, we might try 256 as an error value. Similarly for font sizes: we don't have as big a space, but usually you can write characters from font size 1 all the way to font size 512. So maybe we'll restrict ourselves to small, medium, and large fonts, where we define what small means, maybe 6-point type. These are ways of categorizing the large-domain inputs that we have. Another thing we can do is introduce constraints. So what is a test case spec? It's a selection of representative values for each category, but not every combination is meaningful, or even feasible under certain conditions. If we look at MS Word, two of the checkboxes we had were superscript and subscript. It turns out that you can't check both of those boxes at the same time and get a meaningful test; you can only do one or the other. So we want to rule out the impossible combinations, and in so doing, we reduce the number of combinations to consider from the set of possible ones. Here are some different things we can do to annotate the values we've defined so that we don't test them too much. First, we can say that specific values are unlikely, and so we only want to see them in a single test case. One way to do that is to annotate that value with single. 
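To make the reduction concrete, here's a rough sketch of representative values for color channels and font sizes. The 0-to-255 channel range, the 256 error value, and the small/medium/large size classes come from the lecture; the specific representative numbers (like 6-point for "small") are our assumptions.

```python
# Each color channel legally runs 0..255; 256 is deliberately one past
# the boundary, an error value to probe robustness.
color_channel = {"min": 0, "mid": 128, "max": 255, "error": 256}

# Font sizes run 1..512; we pick one representative per size class.
font_size = {"small": 6, "medium": 12, "large": 512}

# Exhaustive color space vs. the sampled category space.
exhaustive_colors = 256 ** 3                    # every RGB triple
sampled_colors = len(color_channel) ** 3        # one value per category, per channel
print(exhaustive_colors, "->", sampled_colors)  # 16777216 -> 64
```

Four representative values per channel turn nearly 17 million color inputs into 64 category combinations, before we even apply pairwise reduction on top.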
Another thing we can do is annotate a value with error, which indicates that this value we've chosen is actually an erroneous value, and when we're doing generation it also means try it just once. The other thing we can do is define constraints between multiple variables that say, basically, if this is true, then we can't use this test case, or this test case is only possible if this constraint is true. Let me show you some examples. In our Word font choice, at most one of strikethrough or double strikethrough is allowed; if you turn both of those on, it'll actually deselect the other one. Same thing with superscript and subscript, and same thing with small caps and all caps. And certain other combinations are extremely unlikely. Using the hidden font attribute is something that I've never done in Microsoft Word, so I'm going to assume it's rare, and I'm going to have a single test case that uses this value. Also double strikethrough: I don't know anyone who uses this, so we're going to have a single test case that uses that particular value. Now, let's have you try it out. What this is, is the set of options that you can use with the Unix program "cat", which concatenates files. One place where we do a lot of combinatorial testing, a lot of pairwise testing, is when we look at command-line flags, because you can combine these in lots of ways, and it turns out that some combinations lead to erroneous behavior in the program. I want you to take a look at this program and see if you can figure out how to pairwise test the different command-line options. Okay, hopefully that was an interesting exercise. Let's look at how we can get a program to do this for us. What we want to do is combine values by systematically selecting a few combinations. And what's the justification? Why might you think this works? 
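Those mutual-exclusion constraints can be expressed as a feasibility filter over candidate tests. This is a sketch assuming the three exclusive pairs the lecture lists, with each Word attribute modeled as a simple on/off flag:

```python
from itertools import product

# Hypothetical on/off attributes drawn from the lecture's Word example.
attrs = ["strikethrough", "double_strikethrough", "superscript",
         "subscript", "small_caps", "all_caps"]

# The mutually exclusive pairs: Word deselects one when you pick the other.
EXCLUSIVE = [("strikethrough", "double_strikethrough"),
             ("superscript", "subscript"),
             ("small_caps", "all_caps")]

def feasible(test):
    """A test is feasible only if no exclusive pair is both switched on."""
    return all(not (test[a] and test[b]) for a, b in EXCLUSIVE)

all_tests = [dict(zip(attrs, bits)) for bits in product([False, True], repeat=6)]
valid = [t for t in all_tests if feasible(t)]
print(len(all_tests), "->", len(valid))  # 64 -> 27
```

Each exclusive pair rules out one of its four on/off combinations, so the three independent pairs cut the 64 raw combinations down to 3 × 3 × 3 = 27 feasible ones before any test generation even starts.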
Well, there's something called the "Shallow Error Hypothesis," and what that says is that most errors in a program can be triggered by relatively small test cases, and are caused by interactions among a relatively small number of features in the program. So if we check all pairwise combinations or three-way combinations, we're likely to find all or most of the errors in the program. So why do we combine values? Well, it turns out that when we look at a value in isolation, it usually works correctly; the programmer thought of that case. What they didn't think of is the interactions with other values. So if we look at the pairwise combinations, we're likely to find these unintended interactions, and most of these interactions only involve a few parameters. The surprising thing, when you use automated tools to generate these tests, is that you can examine a very large space of parameters and have the tool generate a relatively small number of tests that check all pairwise interactions or three-way interactions. Let me give you an example. Suppose we had a system with one 6-valued parameter, one 5-valued parameter, four 4-valued parameters, one 3-valued parameter, and three 2-valued parameters. If we just multiplied those together, this is a system that would require 184,320 tests to cover exhaustively. But if we generate a minimal test suite that checks all pairwise interactions, it requires only 32 tests; covering all three-way interactions requires 150 tests, and so on. So we can get all the way up to six-way testing with a reasonable number of test cases, and it's very unlikely that errors are caused by more than six-way interactions. This is what we want to be able to do. 
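You can check the lecture's exhaustive count yourself, and also see why 32 pairwise tests is close to the best possible. The lower-bound argument here is our addition: any pairwise suite must at least pair every value of the two largest parameters with each other.

```python
from math import prod

# Parameter domain sizes from the lecture's example system:
# one 6-valued, one 5-valued, four 4-valued, one 3-valued, three 2-valued.
sizes = [6, 5, 4, 4, 4, 4, 3, 2, 2, 2]

exhaustive = prod(sizes)
print(exhaustive)   # 184320 tests to cover every combination

# Any suite covering all pairs needs at least 6 * 5 = 30 tests, because
# each single test pairs only one value of the 6-valued parameter with
# one value of the 5-valued parameter.
lower_bound = 6 * 5
print(lower_bound)  # 30 -- so a 32-test pairwise suite is nearly optimal
```

That gap between 184,320 and 32 is the whole payoff of pairwise testing: the suite size is driven by the largest parameter domains, not by the product of all of them.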
We want to be able to systematically test our input space. The way we're going to do that is to partition each input that has lots of possible values into a small number of categories within which the program hopefully behaves the same, and then combine those categories in a systematic way using N-way combinations.