Human factors, bias. Our goal in this lesson is first to define common sources of bias in data collection and in how we get out of our own heads, and then to identify potential bias errors in creating model attributes, all working toward the goal of more fair models. No surprise there.

So let's think big picture. It's possible to take an algorithm, give it access to data, and then what happens next? Do we get any useful results just from that act? Well, it turns out, thanks to the no free lunch theorem, we do not. The no free lunch theorem states that all models have the same error rate when averaged over all possible data-generating distributions. This means that we need to make some decisions to make a model useful. We can't just have one model with access to data that can make decisions about lots of different things. We may be able to get there in the future, but that's not happening today. Therefore, a given model must have a certain bias toward certain distributions, and certain decisions, to be better at modelling its goal, which, not surprisingly, makes it worse at all other goals. Think about taking a loan model and giving it an image recognition task. Not only would it be bad at the image recognition task, it would be worse than a blank-slate model, because that loan model has been purposely biased toward certain distributions with certain attributes. So the main takeaway here is that we need to make decisions to make our models useful, and that's where bias can enter the picture.

First, there are our own individual cognitive biases: how we view the world and what we consider important all come down to what's inside our heads. Then, in the actual collection of data, there are lots of different biases as well. We can't just point the model at infinite data, so what do we collect? What do we decide to point the model at, how many years of data, and so on. And then there's also bias in assigning attributes: once we have that data, what do we prioritize in the model? There is bias in that selection process, too.

So let's start out talking about individual cognitive biases, also known as heuristics: mental shortcuts and errors that influence how we think and can lead to bad judgments. It's not important to memorize each one of these by name, but to recognize the general effects that cognitive biases have on us. There are over 170 cognitive biases listed on Wikipedia, and we'll share that link so you can check each one out. It's actually very fascinating reading.

But what do these cognitive biases actually look like? Well, here's a humorous example in practice, and maybe you even recognize this behavior in yourself, because after all, we are all human. This person says, okay, you know what, I've heard both sides of the argument; it's time to do my own research to reveal the truth. They plug the hotly debated topic into a Google search, and even the choice of search terms might influence the results. Then the first link that agrees with them: jackpot, perfect. I'll send that to the person who disagrees with me; that should change their mind. We all know that in practice, that does not work. Why is that?

Well, here are the key points to take away about how cognitive biases influence the way our brains work. First, we have an enormous amount of information to process in any given day, and that information overload leads to excessive filtering.
Meaning, because we have so much information, we have to figure out which signals to process and which signals to treat as noise. Second, lack of context or meaning means we have to fill in gaps in knowledge ourselves. Lack of meaning is confusing to us, and we do not like confusing things, so we fill in the gaps ourselves with the stories we tell. Third, acting quickly to make a decision means we have to rely on previous conclusions. If we feel there's a deadline or pressure, we do not want to lose our chance, so we jump to conclusions, and stories become decisions. We see this play out on social media all the time; the first reaction is often not the best decision. So when building machine learning models, we need to make sure, as researchers, that we do not let these cognitive biases affect us. We need to process information carefully. We need to assign context and meaning to everything we do through domain expertise. And we need to act slowly and make rational decisions, or as rational as they can get, anyway.

Bias can creep in in other ways as well, especially when we're collecting data. There are three collection biases we're going to focus on. The first is sample bias: the collected data does not represent the full environment. It turns out there's a science to choosing a subset of data that's both large enough and representative enough to mitigate sample bias, without, obviously, training the model on the entirety of the data. So what is an example of sample bias? Let's say we have a photo identification model. We give it photos, but we forget to include photos taken at night. That means the collected data does not represent the full environment, and therefore the model will struggle to identify photos taken at night. How do we fix sample bias? We need domain expertise to confirm that the data we collect actually represents the environment the model will operate in.

Second is exclusion bias: cleaning data removes important attributes. We delete some feature of our data thinking it's irrelevant to our labels and outputs, but that judgment is only based on our pre-existing beliefs. An example here is deleting the zip code column in a loan model because it seems unimportant; we don't need to know where the person lives. But maybe in reality, that location information could actually have helped us address bias. So the fix is to analyze every single attribute of our dataset before removing it. It could come in useful, maybe in the next version of the model, if not this one.

And then, finally, there's automation bias. This is our tendency to favor data sources that were automatically generated. We trust automated collection because we think it doesn't contain any human mistakes, since a computer did the work. But remember, humans programmed it, so it can contain mistakes just as often. An example of automation bias would be picking, say, Twitter or Facebook trends data over phone survey data, because we think, okay, people's thoughts are right there in print, machines scraped them, no mistakes there. The fix, obviously, is to approach automated data collection with caution. Remember that there are humans behind it and mistakes can be made. We need to investigate how that data was collected and look at the programming behind it.
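To make the sample-bias and exclusion-bias fixes a little more concrete, here is a minimal code sketch of what such checks could look like. It assumes pandas DataFrames, and the column names (a `time_of_day` column for the photo example, a `zip_code` column and a `default` label for the loan example) are hypothetical placeholders, not from any real dataset; treat it as an illustration of the idea rather than a prescribed implementation.

```python
import pandas as pd

def coverage_gaps(df: pd.DataFrame, column: str, expected: dict, tolerance: float = 0.05) -> dict:
    """Sample-bias check: compare each category's share in the collected data
    against the share we expect to see in the real environment."""
    observed = df[column].value_counts(normalize=True)
    gaps = {}
    for category, expected_share in expected.items():
        observed_share = float(observed.get(category, 0.0))
        if abs(observed_share - expected_share) > tolerance:
            gaps[category] = {"observed": observed_share, "expected": expected_share}
    # A non-empty result means part of the environment is under- or over-represented.
    return gaps

def audit_before_dropping(df: pd.DataFrame, column: str, label: str) -> pd.Series:
    """Exclusion-bias check: before deleting an attribute as 'irrelevant',
    look at how the label actually varies across that attribute's values."""
    return df.groupby(column)[label].mean().sort_values()

# Usage sketch (hypothetical files and columns):
# photos = pd.read_csv("photos.csv")
# print(coverage_gaps(photos, "time_of_day", expected={"day": 0.6, "night": 0.4}))
#
# loans = pd.read_csv("loans.csv")
# print(audit_before_dropping(loans, "zip_code", label="default"))  # default rate per zip code
```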
So these are the biases to be aware of on the data collection side. Now let's look at what happens once we've collected our data and we're ready to assign attributes to tell our model what to focus on.

The first bias is observer bias, also called experimenter bias. This is us, as experimenters, seeing what we want to see. We look at the data, certain attributes pop out, and perfect, it looks good, we go with those attributes, because we feel we already have insight into this problem. Let's say, as an example, we had just finished a project in loan predictions, so we feel super confident going into our next loan prediction model, instead of taking a step back and realizing that the situation may have changed. The fix for observer bias is to separate our domain expertise from our unconscious bias and allow ourselves to be as objective as possible when selecting attributes.

Second is availability bias. This is the tendency to seek out only the attributes already in our existing data set; it's essentially a failure to look for good alternatives. Let's say the collection team comes back and says, great, here's your data set, and we see there's limited data in a column we need. So we either switch to another attribute or just stick with what we have. The fix is to recognize when your collection phase was incomplete, to be able to move back and forth between collection and attribute selection, and to collect more data if you're not certain that what you have will be appropriate for your model.

Finally, there's prejudice bias. This is our tendency to pick attributes that are shaped by cultural norms and prejudices. This is really, really hard, because, obviously, all data in society is prejudiced in one way or another against certain groups; it reflects us as humans, and we are not perfect. This bias relates most directly and explicitly to fairness, because an example here would be picking race or gender as an attribute instead of treating it as a protected class. So we need to be very sensitive when selecting attributes to make sure they are not prejudiced in any way. A short code sketch at the end of this section shows one simple way to screen candidate attributes for that kind of proxy relationship.

And that will do it for now. We're going to talk more about what to do with these biases, but we've now gone through individual cognitive biases, data collection biases, and attribute-assignment biases. Next, we'll talk about how to build a training set with all these different tendencies in mind, to make more fair data.
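Here is the proxy-screening sketch mentioned above. It's a minimal illustration assuming pandas and SciPy; the column names (`zip_code`, `race`, `gender`, `approved`) are hypothetical, and Cramér's V is just one simple way to measure how strongly one categorical attribute encodes another, not a definitive fairness test.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(df: pd.DataFrame, candidate: str, protected: str) -> float:
    """Rough proxy check: association between a candidate attribute and a
    protected attribute. 0 means no association; values near 1 mean the
    candidate attribute largely encodes the protected one."""
    table = pd.crosstab(df[candidate], df[protected])
    chi2, _, _, _ = chi2_contingency(table)
    n = table.to_numpy().sum()
    min_dim = min(table.shape) - 1
    return float(np.sqrt(chi2 / (n * min_dim)))

def selection_rates(df: pd.DataFrame, protected: str, outcome: str) -> pd.Series:
    """Outcome rate per protected group; large gaps are a warning sign worth investigating."""
    return df.groupby(protected)[outcome].mean()

# Usage sketch (hypothetical loan data):
# loans = pd.read_csv("loans.csv")
# print(cramers_v(loans, "zip_code", "race"))          # is zip code acting as a proxy for race?
# print(selection_rates(loans, "gender", "approved"))  # approval rate by gender
```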