So the first thing is when we talk about resilience and say that we want to build resilience in the system, naturally, resilience in place at application, that is one of the ways in which you're going to get that. So you have redundant replicas of whatever you want to do. If it's a service, you have a redundant replica of the server. If it is a file, you have a redundant copy the files. So that if you have a failure, you can recall from that. Now, a given failure is inevitable. How do you know which replicas are up-to-date at any point of time? So naturally, management of replica is fundamental to cloud computing. I think through the lectures that we've been hearing from our guest lecturers, you know that this replicated service exists in Resource Manager, Azure Service Fabric, Google's Borg, Apache Mesos, all of them have examples of using replication in the resource manager. Similarly, when it comes to storage servers, we've talked about things like Google table, BigTable, Oracle's NoSQL storage server, all of them use replication as a way of combating the fact that there could be failure and you want to provide a resilient vignette to the outside world. Now, the types of failures that can occur, they can be grouped into two major categories. One is Byzantine, the other is fail-stop. Byzantine, this is a term that was coined by Leslie Lamport. Leslie Lamport, we've seen several times in our distributed systems and operating systems lectures. He came up with this term, drawing it from the battlefield that if you have generals and some generals are compromised, working for the other guy, then when they fail, they're not just failing quietly, they actually start giving misinformation. So that is Byzantine failure. So in a computer sense, if a system stops functioning correctly, but it starts spewing out spurious information, messages out on the network, then that'll be an example of a Byzantine failure. So that's the key property of a Byzantine failure, that it starts sending spurious information to healthy nodes, confusing them. The other possibility is fail-stop, meaning that upon failure, they just shut up. The other way of knowing that a particular system has failed, and again, you can quietly remove it out from the networks. So these are two types of failures that can occur. We talked about the fact that in order to get resilience, you have to have redundancy. Now, how much redundancy do we need depends on the type of failure that you're trying to tolerate. If it is Byzantine failure, we want to tolerate t failures, then at least a majority of the remaining system should be good. Which means that you need 2t plus 1 replicas, so that we know that t plus 1 are the good ones. Even if there are Byzantine failures, Byzantine fashion, I can live with that. On the other hand, if I know that the system is fail-stop and If I want to tolerate t failures, then all I need is one replica that is good. So either a t plus one copies, that is good enough. So these are the ways in which you can think about how many replicas that you need in order to overcome failures. Now, how do you maintain the state with these replicas? So one thing is whenever you do an update, you don't want the update to be blocking. In the sense that you want to update progress in the presence of failures. So that's a key property that we want to maintain in terms of liveness of the system. Or in other words, if there are n servers and I'm updating all n servers, I don't want to have to wait for all copies to acknowledge that the update is done, because that'll be very synchronous and it'll impede progress. So that's the reason that quorum consensus protocols have been invented. The idea is very simple. I mean, suppose we make a pact that George, and Gustav, and Vishnu are going to go to every ballgame at Georgia Tech, but they say that, "Well, we know we cannot go to all the ballgames, we had to do some project. So at least two of us will go for every ballgame." If they make that pact among themselves, then if we want to know the score of a game, if I contact at least two of them, I'll know. Because even if they made a pact that at least two out of three will go for the game. So if I call George and Gustav, maybe George didn't go that time, but Gustav will know, or maybe Gustav didn't go, and Georgia and Vishnu went. So long as they make sure that there are sufficient number of copies, that's where the idea of quorum comes from. So we do normally say there's a meeting, do we have a quorum? The reason why we say that, it's because we can actually pull some members, some subset of the members of the committee to know what may have happened in the meeting. So that's the idea behind quorum. That's what is being brought into computer science as well. So when you want to read a particular item, you read all copies of the item. If you want to write, you write w copies. So let's say that we have totally N servers, and what we're going to do is, we're going to write to some subset of the service and read from a subset of the service. We want to make sure that the reads and writes will result in the causality that we expect from the system. So with N is the total number of servers for correctness, the condition that you want to make sure is that the quorum for reading plus the quorum for writing is greater than N. This will ensure that read quorum and write quorums overlap. I mean, the idea is very simple, so suppose I have three servers and I do a write to these three servers, and I write to at least two of these servers. If I'm going to read and I randomly pick these two guys as the service to read from, I'm guaranteed that there's this overlap that's happening between the read set and the write set. That is ensuring that I will get the up-to-date copy. Even though I'm going to get S3s copy and S2s copy, I can look at the timestamp and know that S2s copy is the most recent one. That way, I can get and ensure that we can get correctness. So that is from the point of view of reading. Similarly, you want to make sure that when there are multiple rights and they could be for the same data item, you want it to be ordered in some fashion. If you want to order it in some fashion, you want to make sure that the right quorum that you choose is greater than half the number of servers. That way, you're ensuring that if a projector is going to write she's in the overlapping set, she's going to write in the order in which the writes were issued. Therefore, the timestamp from her will indicate which is the most recent rate. That way, you get this idea of one copy serializability, which is important in maintaining the state when you have lots of replicas.