[MUSIC] Welcome to the fourth lecture in this week on judgement scoring. Today's lecture is a little more technically difficult, and it will require you to do some reading beyond the lecture itself. This week, we want to focus on how we can judge how similar our scores are between two or more teachers. We're looking again at interactive assessment practices that require human judgments. There are two things that we are going to look at in terms of moderation. We want to compare not just, "Did we give the same score, or level, or grade?" but also, "What were our reasons for giving such a grade?" Because it's in discussing those reasons that we as teachers come to a deeper understanding of the subject we're teaching, so that we'll be better teachers. And we're going to talk about how we can use some basic statistical information to evaluate whether we are close enough that we can use the scores we've created to give feedback to the students, and to report to school leaders and to parents about students' progress.

On screen now you see a picture of Tiger Woods and his caddy evaluating the distance to the golf hole. Tiger says, "It's 163 yards." And his caddy says, "It's 163.375 yards." Clearly, the caddy is much more concerned about accuracy than Tiger. But frankly, if it had been me, I would have been happy with, "Oh, it's 150 to 200 yards"; that would have been near enough for me. So the question, in terms of how close we need to be, is: how important is the decision that we're going to make? If there's room to fix and respond to any errors in our judgment, then close enough is sufficient. But if this is a high-stakes examination at the end of the year, in which students get entry to university, for example, then there should be no tolerance of sloppiness or inaccuracy. In a classroom situation, though, where there's time to respond to a performance, evaluate it, and then act on that evaluation in our teaching, we can tolerate more disagreement between judges, as long as they're close enough. So, the question is, how close is close enough? And how do we tell if we're close enough?

To do this, we need to have at least two qualified judges. When I was a high school classroom teacher in New Zealand, I taught English, and we would swap classes: I would mark my own class, swap with another teacher and mark her class as well, and then we would get together and compare the two sets of marks for the two classes. What we're looking for, basically, is how many times we give the same mark to the same piece of work. This is the simple idea of consensus, or agreement. If you're judging on identical scores - say a score out of 20, I gave it 15; did she give it 15? - then if you can get 70% identical or better, your marking is pretty good for classroom uses. If you want to allow, say, plus or minus one - I gave it 15, she gave it 14 or 16 - then we're close enough, and if we're in that range we would be happy. In that case, you want 90% of the essays, or pieces of work, to be scored within that range. The same applies if you're using an A+ to F scale. If you're only counting identical grades - A, B, C, D, and E - then maybe you want 70% the same. But if you allow grades within one step - A, A-, and B+ are all within one of A- - then you want 90% the same. And the point of this is that when there's a discrepancy, we have to stop and discuss the reasons for it, because it's in that discussion that we get a sense of what I missed that the other marker caught.
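If you want to check these rules of thumb on your own marking, the arithmetic is simple enough to do in a spreadsheet or a few lines of code. Here is a minimal sketch in Python (my choice of language; the two markers' scores below are invented for illustration, not data from the lecture) that counts exact agreement and agreement within plus or minus one mark, and compares them with the 70% and 90% rules of thumb.

```python
# Minimal sketch: exact and approximate (plus-or-minus one) agreement
# between two markers scoring the same ten pieces of work out of 20.
# The scores are invented for illustration only.

marker_a = [15, 12, 18, 9, 14, 16, 11, 20, 13, 17]
marker_b = [15, 13, 16, 9, 14, 15, 11, 19, 10, 17]

n = len(marker_a)
exact = sum(1 for a, b in zip(marker_a, marker_b) if a == b)
approx = sum(1 for a, b in zip(marker_a, marker_b) if abs(a - b) <= 1)

print(f"Exact agreement:        {100 * exact / n:.0f}%  (rule of thumb: at least 70%)")
print(f"Within one mark (+/-1): {100 * approx / n:.0f}%  (rule of thumb: at least 90%)")

# Letter grades or curriculum levels can be handled the same way by first
# mapping the ordered scale to numbers (for example E=0, D=1, C=2, B=3, A=4),
# so that "within one grade" becomes abs(a - b) <= 1.
```

With the invented scores above, the two markers agree exactly on half of the pieces and are within one mark on 80% of them, so under these rules of thumb they would need to sit down and discuss the discrepancies before using the marks.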
That discussion is really important because, remember, we human judges are frail and error-prone; we don't see everything that's important, and two pairs of eyes are better than one. So in our rules of thumb, if we were more than one letter grade apart, or more than three marks out of 20 apart, or more than ten marks out of 100 apart, we had to stop and discuss the reasons for it. It might be that your colleague persuades you to give a different mark, or a mark closer to theirs. Or you might persuade each other to compromise on a grade somewhere in between. However, if you're adamant that you can't be persuaded, or your colleague can't be persuaded to compromise, because they see something as really important, then you're going to need to call in a third judge. And that third judge, hopefully, is somebody more expert than both of you, who can help adjudicate and guide your evaluation. If you can reach 70% identical or 90% approximately equal scores, then you can defend your scores to your school leaders and your department heads, you can defend them to the students, and you can defend them to the parents. And that's going to be really important if you want to persuade people that you're a trustworthy judge, and that you're giving valid and valuable feedback to learners.

How can you do this? On screen now, you can see a simple spreadsheet technique that we've used here in New Zealand, where in the first place we calculated an agreed score on a piece of work that we gave to different markers, and we asked, how close are you? When you look at the scores, you'll see grades like 3A (level 3, advanced), 4B (level 4, basic), or 3P (level 3, proficient). Now, 3P is close to 3A, and 3A is close to 4B. What we're looking at in this scheme of things is, as each marker provides their scores, how many times we give exactly the same score, and how many times we give approximately the same score, within a plus or minus one difference. And where we're sufficiently close - more than 90% approximate, or more than 70% identical - we can say we trust these grades. In the examples on screen, you can see that only for some of these do we reach consensus. And that's where we need more training and more discussion before we use these marks to give feedback to learners or school leaders or parents.

Another way to look at this is to work out a correlation between two markers. Now, it's possible that you and I don't give the same marks to the same students, but that our highest marks always go to the same students and our lowest marks go to the same students - it's just that I'm harsher than you are and you're more generous, so your high marks are higher than my high marks, but we're giving them to the same students. That's what a correlation establishes: whether we're giving high and low marks to the same people. And if we can get a correlation of more than .70 between us - which isn't very high - then we have a pretty good sense that we're at least identifying the same students as high, middle, and low. On screen now, you can see what three different correlations look like. The first screen shows a correlation of zero, and you can see there's no systematic pattern between the two sources of scores. The second screen shows a correlation of negative 0.7.
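Before we turn to the third screen, here is a minimal sketch of how such a correlation could be computed. Again, this is only an illustration under my own assumptions: the scores are invented to show a harsher and a more generous marker who nonetheless rank the students the same way, and in practice you could simply use a spreadsheet's CORREL function or any statistics package.

```python
# Minimal sketch: Pearson correlation between two markers' scores.
# The scores are invented; they show a harsher marker (A) and a more
# generous marker (B) who still rank the students in the same order.

from math import sqrt

marker_a = [15, 12, 18, 9, 14, 16, 11, 19, 13, 17]   # harsher marker
marker_b = [17, 14, 19, 12, 16, 17, 13, 20, 15, 19]   # more generous marker

n = len(marker_a)
mean_a = sum(marker_a) / n
mean_b = sum(marker_b) / n

cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(marker_a, marker_b))
var_a = sum((a - mean_a) ** 2 for a in marker_a)
var_b = sum((b - mean_b) ** 2 for b in marker_b)

r = cov / sqrt(var_a * var_b)
print(f"Correlation between the two markers: {r:.2f}")

if r >= 0.70:
    print("The markers broadly agree on who is high, middle, and low.")
else:
    print("The markers disagree about the ranking; more moderation is needed.")
```

In this made-up example the two markers never give identical marks, yet the correlation comes out well above .70, which is exactly the situation described above: different severity, same ranking.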
A negative correlation is definitely something we want to avoid: every kid that I gave a high score to, you gave a low score to, so we're clearly out of sync with each other. And in the third screen, you see a correlation of 0.9, where the scores line up almost perfectly along a straight line. That's the gold standard. In a classroom situation, we'd be happy with a positive 0.7, which says that at least we agree to some meaningful extent as to who the best, middle, and weak students are.

This process of moderation is an important professional task. You'll probably need somebody to come and help you learn how to do this, and you'll need to read more about it. But if we can't agree with each other on how we score student work, then we have a serious problem in our education system. We need to know how close we are, and if we're not close enough, that should trigger us to do more professional development and to discuss more with each other about what makes for quality in this content area. And the best part of this whole process of clarifying the extent to which we agree is the professional learning that comes from talking with other teachers about scoring student work, and I hope that you'll engage in that. Next week, we're going to talk about involving students in assessment. [MUSIC]