Whenever we design a study, it is almost impossible to completely eliminate errors. There's always the possibility that we'll make a false positive or a false negative. The goal is to control these error rates at a desired level. We'll see in this lecture that people very often use defaults when they set these error rates, even though this is not really desirable. Let's take a moment to think about the type one error rate that's most widely used when experiments are designed. Why do we actually use an alpha level of 0.05? Do you know why this is the case? If you don't, don't you think that such an essential part of the experiment that you design is something that you should be able to explain? On the title slide of this lecture, I make a joke as if the default alpha level of 0.05 was one of the 10 Commandments, and I'm actually not the first to make a joke like this. Jerry Ravetz writes, "Those who have any craft skill in the use of such tools will appreciate that the significance level to be adopted is not assigned by God, but must be decided by the user." He has a very nice quote by Berkeley from 1735 that gives a hint as to why people might be so tempted to rely on defaults even in science: "Men learn the elements of science from others, and every learner has a deference more or less to authority, especially the young learners, few of that kind caring to dwell long upon principles, but inclining rather to take them upon trust; and things early admitted by repetition become familiar, and this familiarity at length passes for evidence." The argument here is that we don't really think carefully about certain things that we do. We rely on authority even in science, and if we've done so a lot, then we stop questioning things and assume that they have some sort of reason.
Now I think that one of the most important things that you have to learn as you become a scientist is that the idea you might have had when you entered science, that most people do things for some reason and that they know what they're doing, is fundamentally wrong. There are many examples where people just do something, and the moment that you ask, "Why? Why do we actually use an alpha level of 0.05?" they don't really know. It might make sense to go back to the original writings of Neyman and Pearson. In a previous lecture we already talked about Fisher, who called the default alpha level a convenient convention. Nothing more. Just a convenient convention. Not something that was set in stone. We see a similar idea in Neyman and Pearson in 1933: "But whatever conclusion is reached, the following position must be recognized. If we reject the null hypothesis, we may reject it when it is true. If we accept the null hypothesis, we may be accepting it when it is false, that is to say, when really some alternative hypothesis is true. These two sources of error can rarely be eliminated completely. In some cases, it will be more important to avoid the first. In others, the second. We are reminded of the old problem considered by Laplace of the number of votes in a court of judges that should be needed to convict the prisoner. Is it more serious to convict an innocent man or to acquit a guilty? That will depend upon the consequences of the error. Is the punishment death or fine? What is the danger to the community of released criminals? What are the current ethical views on punishment?" Now for the most important part of this quote: "From the point of view of mathematical theory, all that we can do is to show how the risk of errors may be controlled and minimized. The use of these statistical tools in any given case, in determining just how the balance should be struck, must be left to the investigator." So that is you.
You are supposed to think carefully about how the balance between these error rates should be struck. You should not rely on any default, because every situation requires you to consider what the optimal error rates would be. Let's take one example where an alpha level of one turns out to be the optimal choice. This is research from ecology, where people have done a cost-benefit analysis to see what the risks and the costs are of not noticing a decline in the koala population. Now it turns out that of the many creatures that exist in Australia, koalas are actually extremely valuable. They have a great economic value. So not noticing a decline would be very serious. If you make an error in this case, the financial consequences are so great, actually much greater than the cost of just assuming that the koala population is declining, that a cost-benefit analysis shows that the optimal alpha level in this case is an alpha of one. You should always act as if the koala population is declining. So we've talked about the alpha level, which is one side of the coin, but of course there's also the type two error rate to consider, or its complement, the statistical power of your test. You might have heard that a default recommendation is to design a study that has a statistical power of at least 80 percent. So let's ask the same question as we did before. Why? Why is 80 percent a default recommendation? Well, let's again take a look at the original writing in which this was proposed. This recommendation comes from an excellent book on power analysis by Jacob Cohen. He writes, "It is proposed here as a convention that when the investigator has no other basis for setting the desired power value, the value of 0.80 be used. This means that the type two error rate beta is set at 20 percent. This arbitrary but reasonable value is offered for several reasons." Here, he cites Cohen, 1965, but if you read this paper, there are really no very good reasons mentioned.
Cohen continues, "The chief among them takes into consideration the implicit convention for an alpha level of five percent. The type two error rate beta of 20 percent is chosen with the idea that the general relative seriousness of these two kinds of errors is of the order of 20 to five. In other words, the type one errors are of the order of four times as serious as type two errors. This 80 percent desired power convention is offered with the hope that it will be ignored whenever an investigator can find a basis in his substantive concerns, in his specific research investigation, to choose a value ad hoc." In other words, we are building a convention on top of another convention. We already had the convention of using a five percent type one error rate, and now we are building a second convention on top of it, namely the convention that we use an 80 percent power level. So why do researchers set error rates that do not actually reflect the rate at which they want to make errors? They're choosing a type one error rate and a type two error rate based on conventions, or conventions built on conventions. There's no real rational reason for this. The main driver of these choices is just normative behavior. There are alternative approaches to deciding how you should set these error rates, in line with what Neyman and Pearson already wrote in 1933. Think about the costs and the benefits. One approach that's not used as widely as it should be is statistical decision theory. If you perform a hypothesis test, you can think about this as a choice between different actions. We accept or we reject the null hypothesis. Given specific states of the world (the null hypothesis is true or the alternative hypothesis is true), these actions have specific outcomes. If you're able to quantify the costs and the benefits of these outcomes, you can actually quantify the loss function of a certain procedure.
So if we can quantify a loss function based on the cost associated with accepting the null hypothesis when the alternative hypothesis is true, or the cost associated with accepting the alternative hypothesis when the null hypothesis is true, we can make better informed judgments about where we want to place these error rates. Decisions can generally be made more efficiently if we try to minimize the combined cost of type one and type two errors. So this is an alternative approach to specifying these two error rates, instead of relying on norms. In this graph we see the plotted loss functions associated with a study in which we expect an effect of half a standard deviation. We collect either 10, 50 or 100 participants. We can minimize the combined cost of the type one and the type two error rate, and calculate the alpha level at which these costs are minimized. We see that in this specific case, given the expected effect size of half a standard deviation, the optimal alpha level actually greatly depends on the sample size that we end up collecting. The larger the sample size, the smaller the alpha level that minimizes the combined costs. Alternatively, you can choose not to minimize these combined costs, but to balance the error rates. If you have no good reason to prefer a type one error over a type two error, and you think both are equally serious, then you might choose the error rates such that the type one error rate and the type two error rate are the same. There's actually an option in the widely used power analysis software G*Power where you can do this. It's called a compromise power analysis. In this case, you specify the effect size and the sample size that you are planning to collect, and you specify the ratio of the type two to the type one error rate. In this example, we've specified that the ratio is one.
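The two strategies described here, minimizing the combined cost and balancing the error rates, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure behind the graph or the G*Power calculation: it uses a normal approximation to a two-sided two-sample test, treats the sample sizes as participants per group, and by default weights both errors equally. All function names and parameters (`beta_error`, `minimize_alpha`, `balance_alpha`, `cost_t1`, `cost_t2`) are made up for this sketch.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution


def beta_error(alpha, d, n_per_group):
    """Type two error rate of a two-sided, two-sample test.

    Assumption: normal approximation, so values differ slightly from
    software that uses the noncentral t distribution.
    """
    ncp = d * (n_per_group / 2) ** 0.5          # expected z under H1
    z_crit = Z.inv_cdf(1 - alpha / 2)           # two-sided critical value
    power = (1 - Z.cdf(z_crit - ncp)) + Z.cdf(-z_crit - ncp)
    return 1 - power


def combined_cost(alpha, d, n_per_group, cost_t1=1.0, cost_t2=1.0):
    """Weighted sum of the two error rates (hypothetical cost weights)."""
    return cost_t1 * alpha + cost_t2 * beta_error(alpha, d, n_per_group)


def minimize_alpha(d, n_per_group, cost_t1=1.0, cost_t2=1.0, steps=2000):
    """Grid search for the alpha level with the lowest combined cost."""
    grid = [i / steps for i in range(1, steps)]
    return min(grid, key=lambda a: combined_cost(a, d, n_per_group,
                                                 cost_t1, cost_t2))


def balance_alpha(d, n_per_group, ratio=1.0):
    """Bisection for the alpha level at which beta / alpha equals `ratio`."""
    lo, hi = 1e-9, 1 - 1e-9
    for _ in range(100):
        mid = (lo + hi) / 2
        # beta shrinks as alpha grows, so a too-large beta means alpha is too small
        if beta_error(mid, d, n_per_group) > ratio * mid:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


if __name__ == "__main__":
    for n in (10, 50, 100):
        print(n, "per group -> cost-minimizing alpha:", minimize_alpha(0.5, n))
    print("balanced alpha, n = 50 per group:", round(balance_alpha(0.5, 50), 3))
```

Under these assumptions the cost-minimizing alpha level shrinks as the sample size grows, which is the pattern described for the graph. Because of the normal approximation, the balanced error rate will be close to, but not exactly, what G*Power reports.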
If we are planning to collect 50 participants in each group, and we expect an effect size of half a standard deviation, then the balanced type one and type two error probabilities end up being around 14.9 percent. Now, if you want to use this kind of cost-benefit analysis or statistical decision theory, you need to think about the smallest effect size that you actually care about, the relative weights of alpha and beta, that is, of the type one and the type two errors that you can make, and maybe also incorporate the prior probabilities that the null hypothesis is true and that the alternative hypothesis is true. These are quite difficult things to keep in mind. It's very useful to think about them, because if you do, you can actually design studies that have error rates based on minimizing the costs associated with whatever errors you could make, but this will require careful thought, and maybe discussing it with your collaborators. A second reason to lower the type one error rate as a function of the sample size is known as Lindley's paradox. So here we are no longer thinking about costs and benefits, but there's actually another good reason to lower the alpha level, just in general, as your sample size increases. We've discussed Lindley's paradox in the first lecture of my previous MOOC. So if you want to know more about it, by all means, go and do the exercise about p-value distributions. For now, I'll quickly explain what the general idea is. This notion that if the sample size increases we want to lower the alpha level, and not keep it stable, is actually quite old. It's most eloquently discussed in a book by Leamer. He writes, "The rule of thumb quite popular now, that is, setting the significance level arbitrarily to 0.05, is shown to be deficient in the sense that from every reasonable viewpoint the significance level should be a decreasing function of the sample size." Let's explain what he means by this.
If we take a look at how p-values are distributed, either when the null hypothesis is true or when the alternative hypothesis is true, we know that under the null hypothesis p-values are uniformly distributed. If you don't, by all means go back to my previous lectures on p-values in the first MOOC. In this graph, we see p-values that range from zero to 0.1, and we have a dotted horizontal line that represents the uniform distribution of p-values when the null hypothesis is true. So this is the distribution we would expect to see if there is no true effect. The black curve represents the p-value distribution if the alternative hypothesis is true and we have 99 percent power. This is an extremely high level of statistical power, and it means that most of the p-values that we will observe if there is a true effect are smaller than our chosen alpha level, which in this figure is of course set at 0.05. Now, there is something a little bit peculiar about this graph, which is known as Lindley's paradox. If we take a look at the p-values between 0.025 and 0.05, we see that the curve if there is a true effect is actually lower than the p-value distribution if there is no true effect. There is a certain range of p-values that are much less likely to be observed when there is a true effect than when there is no true effect, even though these p-values fall below the chosen alpha level. So this is exactly the situation that's referred to by Leamer. This is undesirable. If we have such high power, which is typically associated with a very large sample size, then we are in a situation where Lindley's paradox might play a role. What we actually want to do in this graph is lower the alpha level, for example to 0.02, so that we don't end up in this weird situation where we might observe a statistically significant p-value that is nevertheless more likely if the null hypothesis is true.
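The comparison between the two curves can be reproduced with a short calculation. The sketch below is not the code behind the graph. It assumes a two-sided z-test, for which the density of the p-value under the alternative has a simple closed form, and it picks the effect so that power is about 99 percent at an alpha level of 0.05. The function names are hypothetical.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution


def ncp_for_power(power, alpha=0.05):
    """Expected z under H1 that gives roughly this power in a two-sided z-test.

    Assumption: the negligible contribution of the opposite tail is ignored.
    """
    return Z.inv_cdf(1 - alpha / 2) + Z.inv_cdf(power)


def p_value_density(p, ncp):
    """Density of the two-sided p-value at `p` when the true expected z is `ncp`.

    Under the null hypothesis (ncp = 0) this equals 1 for every p,
    which is the flat dotted line in the graph.
    """
    z = Z.inv_cdf(1 - p / 2)  # |z| that corresponds to this p-value
    return (Z.pdf(z - ncp) + Z.pdf(z + ncp)) / (2 * Z.pdf(z))


if __name__ == "__main__":
    ncp = ncp_for_power(0.99)
    for p in (0.001, 0.01, 0.02, 0.03, 0.04, 0.05):
        print(p, round(p_value_density(p, ncp), 3))
```

With these assumptions the density falls below one, the height of the uniform distribution, for p-values in the upper part of the range below 0.05, while it is still above one at 0.02. That is the pattern described for the graph.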
Now, there's really no single approach to choosing an alpha level as a function of this increasing sample size. I'm just highlighting one, proposed by Good in 1982, because it's so very simple. In this case, we use an alpha level that's basically the default alpha level divided by the square root of the sample size divided by 100. I have to admit this is also a quite arbitrary rule, but it will probably make things a little bit better than the use of a default alpha level regardless of the sample size. The idea of this approach is that significant results would maintain a similar level of evidence, based on a correspondence between a p-value and a Bayes factor, as when an alpha level of 0.05 was used in a sample of 100 participants. I would say that if nothing else, if you can't think about costs and benefits and apply something like statistical decision theory, it makes sense, if you have a really huge sample and you think that you might have extremely high power, to lower the alpha level as a function of the sample size. In general, the recommendation here is to think carefully about the alpha level that you use in your research. Again, this is not a new idea. We've seen calls for this from 50 years ago. Here, Skipper and colleagues say, "If, in contrast with present policy," and I should note that nothing really has changed in the last 50 years, "it were conventional that editorial readers for professional journals routinely asked: 'What justification is there for this level of significance?' authors might be less likely to indiscriminately select an alpha level from the field of popular eligibles." Now, there have been attempts to do something about this almost mindless practice of setting the alpha level to 0.05. A well-known case is Johnson, who in 2013 suggested lowering the alpha level to 0.005 instead. This reduction in the alpha level has the goal of lowering the number of type one errors or false positives in the literature.
It is a response to problems in reproducing published findings. One of the reasons we have this problem in reproducing these findings might be that there are too many false positives in the literature, and lowering the alpha level might be one way to deal with this. Now, in general, I don't really like replacing one rule of thumb with a new rule of thumb. The whole point of a course like this is to make sure that people think about the kind of questions that they're asking. There might be cases where you do want to use a very low alpha level, but there might be cases, like the study with the koalas that I mentioned earlier, where you don't want to do this. So I believe we should try to move beyond rules of thumb and make an honest attempt to justify every aspect of the studies that we design. An alpha level of five percent and 80 percent power were really never meant as defaults. In general, error rates are such an important part of a study that you design that you should be able to justify them.
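As a small worked illustration of the sample-size rule attributed to Good earlier in this lecture, the calculation is a one-liner. The sketch assumes that the sample size in the rule is the total number of participants and that 100 is the reference sample size at which the default alpha level applies. The function name and parameters are made up for this example.

```python
def sample_size_adjusted_alpha(alpha, n, reference_n=100):
    """Alpha divided by the square root of (n / reference_n).

    Assumption: `n` is the total sample size, and the unadjusted alpha
    level applies at `reference_n` participants.
    """
    return alpha / (n / reference_n) ** 0.5


if __name__ == "__main__":
    for n in (100, 400, 10_000):
        print(n, sample_size_adjusted_alpha(0.05, n))
```

Under these assumptions, a default of 0.05 stays at 0.05 for 100 participants, drops to 0.025 for 400 participants, and to 0.005 for 10,000 participants.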