The last pillar of DevOps philosophy we will look at is measuring everything. Measurement helps you clearly see what's happening with your services. At Google, we believe there are three main goals of measuring everything. First, the IT team and the business can understand the current status of the service objectively. You've already learned how you can measure reliability with SLIs and SLOs. Second, the team can analyze the data and identify the actions needed to improve that status. And third, the IT team can collaborate with the business to make better decisions and have an impact across the broader organization. Measuring everything with these goals in mind makes your business data-driven and ultimately helps you make better decisions based on the operational data you're collecting. You can't improve what you don't measure. In Site Reliability Engineering, there are three core practices that align to this pillar: measuring reliability, measuring toil, and monitoring.

Let's start with reliability. In an earlier module, you learned about quantifying reliability with error budgets, SLIs, and SLOs. Choosing good SLIs is key to measuring reliability. You are making a connection between your SLIs and your users' experience, so you'll want to decide what to measure based on their perspective. For example, it doesn't matter to a user whether your database is down or whether your load balancers are sending requests to bad backends. They experience a slowly loading web page, and that makes them unhappy. If you can quantify slowness, you can then tell how unhappy your users are in aggregate, which lets you define your SLO.

So what should you measure? CPU utilization? Memory usage? Load average? As we just mentioned, these metrics don't necessarily reflect whether your users are happy. To illustrate the difference, let's look at the signal from these two metrics for the same outage in your service. If you assume that you have some way of knowing that your users were unhappy with your service, and that this time period is represented by the red area on these two graphs, then you can see that the metric on the right is a far more useful representation of user happiness. The bad metric does show an obvious downward slope during the outage period, but there is a large amount of variance. The expected range of values seen during normal operation, shown in blue on the left, has a lot of overlap with the expected range of values seen during an outage, shown in red on the right. This makes the metric a poor indicator of user happiness. On the other hand, the good metric has a noticeable dip that closely matches the outage period. During normal operation, it has a narrow range of values that are quite different from the narrow range of values observed during an outage. The stability of the signal makes overall trends more visible and meaningful, which makes the metric a much better indicator of user happiness.

SLIs need to provide a clear definition of good and bad events, and a metric with a lot of variance and poor correlation with user experience is much harder to set a meaningful threshold for. Because the bad metric has a large overlap in its range of values, your choices are to set a tight threshold and run the risk of false positives, or to set a loose threshold and risk false negatives. Choosing the middle ground also means accepting both risks. The good metric is much easier to set a threshold for because there is no overlap at all.
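To make the idea of counting good and bad events a little more concrete, here is a minimal sketch in Python. It is not from the course: the Request structure, the 300-millisecond threshold, and the 99.9 percent target are hypothetical placeholders, and in practice an SLI like this would be computed by your monitoring system rather than by a script.

```python
# Minimal sketch: a latency SLI defined as the fraction of "good" requests.
# The 300 ms threshold and the 99.9% target are placeholder values; real
# thresholds should reflect what your users actually perceive as "slow".

from dataclasses import dataclass

@dataclass
class Request:
    latency_ms: float   # how long the user waited
    status_code: int    # HTTP status returned

def latency_sli(requests, threshold_ms=300.0):
    """Return the fraction of requests that were 'good' (fast and successful)."""
    if not requests:
        return 1.0  # no traffic, nothing for users to be unhappy about
    good = sum(1 for r in requests
               if r.latency_ms <= threshold_ms and r.status_code < 500)
    return good / len(requests)

# Example: 1,000 requests in a window, 4 of them slow or failing.
window = [Request(120, 200)] * 996 + [Request(2500, 200)] * 3 + [Request(90, 503)]
sli = latency_sli(window)
slo_target = 0.999

print(f"SLI: {sli:.4f}, SLO target: {slo_target}, meeting SLO: {sli >= slo_target}")
```

The point of the sketch is only that each request is classified unambiguously as good or bad, which is exactly what a low-variance, user-centric metric makes easy and a noisy infrastructure metric makes hard.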
With a good SLI chosen, the biggest risk you have to contend with is that perhaps the SLI doesn't recover after the outage as quickly as you might have hoped.

The second aspect to measure is toil. As you've learned, toil is work that is directly tied to running a service and that is manual, repetitive, automatable, tactical, and without enduring value. You can measure toil in three steps. First, identify it. Who is best positioned to identify toil depends on your organization; ideally, these people are stakeholders and those who perform the actual work. Next, select an appropriate unit of measure. This unit needs to express the amount of human effort applied to the toil. Minutes and hours are a good choice because they are objective and universally understood. And third, track the measurements continuously. Do this before, during, and after toil reduction efforts. Streamline the measurement process using tools or scripts so that collecting these measurements doesn't create additional toil. Start simple: count the number of tickets you receive, count the number of alerts, and collect alert stats on cause and action. These can prove useful in identifying the sources of toil. You can measure actual human time spent on toil by collecting data either in the ticketing system directly or by asking your team to estimate the time spent on toil every day or week.

There are clear benefits to measuring the reliability of your services, but you may be wondering what the benefits of measuring toil are. First, it triggers a reduction effort. Identifying and quantifying toil can lead to eliminating it at its source. And second, it empowers your teams to think about toil. A toil-laden team should make data-driven decisions about how best to spend its time and engineering effort. Additional benefits include growth in engineering project work over time, some of which will further reduce toil; increased team morale and decreased team attrition and burnout; less context switching due to interruptions, which raises team productivity; increased process clarity and standardization; enhanced technical skills and career growth for team members; reduced training time; fewer outages attributable to human error; improved security; and shorter response times for user requests.

Finally, measuring everything involves monitoring. Monitoring allows you to gain visibility into a system, which is a core requirement for judging service health and diagnosing your service when things go wrong. Let's first look at what you should monitor. It's best practice to alert on symptoms rather than causes. Users don't care whether they can't get to your website because your router is rebooting or because the database is overloaded. Similarly, they don't care if CPU utilization is very high if they can still access the system and it feels fast. Creating a separate alert for each cause typically results in a lot of spam alerts. It's better to have fewer symptom-based alerts combined with good debugging tools, like dashboards, that allow responders to more easily identify the specific cause.

Ideally, Google recommends alerting based on error budget burn rate. You might page someone when you are consuming your error budget very quickly, for example when you spend ten hours' worth of error budget within one hour. For a lower burn rate, for example if you spend three days' worth of budget over three days, you might just create a ticket. Basically, only escalate to a human if you risk dropping below the SLO for the month.
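Here is a rough sketch of that burn-rate logic, again not from the course. The function names and thresholds are illustrative only and mirror the examples just mentioned; a real implementation would be expressed as alerting rules in your monitoring system rather than as application code.

```python
# Minimal sketch of burn-rate based alerting. A burn rate of 1.0 means the
# whole error budget is consumed exactly at the end of the SLO window;
# 10.0 means ten hours' worth of budget is being spent every hour.

def burn_rate(observed_error_rate: float, slo: float) -> float:
    """How many times faster than 'sustainable' the error budget is burning."""
    error_budget = 1.0 - slo          # allowed fraction of bad events
    return observed_error_rate / error_budget

def alert_decision(observed_error_rate: float, slo: float) -> str:
    rate = burn_rate(observed_error_rate, slo)
    if rate >= 10.0:      # fast burn: ten hours of budget per hour
        return "page"     # wake someone up
    if rate >= 1.0:       # slow burn: budget gone by the end of the window
        return "ticket"   # fix it during business hours
    return "none"

# Example: a 99.9% SLO allows a 0.1% error rate. A sustained 2% error rate
# is a 20x burn, so it pages; 0.15% is a 1.5x burn, so it files a ticket.
print(alert_decision(0.02, 0.999))    # -> "page"
print(alert_decision(0.0015, 0.999))  # -> "ticket"
print(alert_decision(0.0005, 0.999))  # -> "none"
```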
If your SLO is set correctly, you won't be paging people for problems that users don't care about. There are exceptions to this rule, of course, because there might be a problem that does not yet produce user-visible symptoms. Capacity alerts are an example of this: if you know you will exhaust a particular limit soon, that warrants an alert even if there is no user-visible symptom yet. To simplify things, Google recommends monitoring the four golden signals: latency, traffic, errors, and saturation.

You can see how measurement and monitoring are important SRE practices that allow your development teams to focus on business-critical work. In the next video, we'll discuss the cultural concepts of goal setting, transparency, and data-based decision-making, and how supporting these practices is important for measurement.
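To close out the monitoring discussion, here is one last sketch showing what a snapshot of the four golden signals for a single service might look like. The field names and thresholds are hypothetical, and in practice these values come from your monitoring system's instrumentation rather than from hand-rolled code.

```python
# Minimal sketch: a snapshot of the four golden signals for one service.
# Traffic is included as context; it is usually watched rather than alerted on.

from dataclasses import dataclass

@dataclass
class GoldenSignals:
    latency_p99_ms: float       # latency: how long requests take (99th percentile)
    requests_per_second: float  # traffic: how much demand the service is handling
    error_rate: float           # errors: fraction of requests that fail
    cpu_utilization: float      # saturation: how "full" the service is (0.0 - 1.0)

def health_summary(s: GoldenSignals) -> list[str]:
    """Return a list of symptoms worth investigating (empty means healthy)."""
    symptoms = []
    if s.latency_p99_ms > 500:
        symptoms.append("p99 latency above 500 ms")
    if s.error_rate > 0.001:
        symptoms.append("error rate above 0.1%")
    if s.cpu_utilization > 0.85:
        symptoms.append("approaching saturation")
    return symptoms

snapshot = GoldenSignals(latency_p99_ms=620, requests_per_second=1200,
                         error_rate=0.0004, cpu_utilization=0.55)
print(health_summary(snapshot))  # -> ['p99 latency above 500 ms']
```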