Hello, and welcome to the first module in the course, Building Resilient Streaming Analytics Systems on GCP. My name is Raiyaan Serang, and I'm a machine learning consultant here at Google Cloud. This module discusses what stream processing is, how it fits into a big data architecture, when stream processing makes sense, and the challenges associated with streaming data processing. This module is all about streaming, and we will be discussing this part of the reference architecture. Data typically comes in through Cloud Pub/Sub, then that data goes through aggregation and transformation in Dataflow. Then you'll want to use BigQuery or Cloud Bigtable, depending on whether you are writing aggregates or individual records coming in from streaming sources.

Let's look at streaming ideas first. Why do we stream? Streaming enables us to get real-time information, in a dashboard or through some other means, about the state of your business. One of the organizations that leverages GCP to make sense of real-time data is the New York City Cyber Command. By using Pub/Sub and Dataflow, they were able to construct a data pipeline that minimizes latency at each step of the ingestion process. This is critically important for a cybersecurity organization, since data that is delayed on arrival loses its value. The amount of data flowing through the command varies each day. On weekdays during peak times, it could be five or six terabytes. On weekends, that can drop to two or three terabytes. As the New York City Cyber Command increases visibility across agencies, it will deal with petabytes of data. Security analysts can access this data in several ways. They can run queries in BigQuery or use other tools that provide visualizations of the data, such as Data Studio, a GCP reporting solution.

Streaming is data processing on unbounded data, which is data not at rest. Stream processing is how you deal with this unbounded data. A stream processing engine provides low latency, speculative or partial results, the ability to reason flexibly about time, controls for correctness, and the power to perform complex analysis. There are many applications for streaming systems. From a data integration perspective, stream analytics helps you access your data in real time and take the load off source databases with change data capture. This in turn allows you to make better online decisions, which can cover everything from real-time recommendations to finance back-office applications. From these applications, it is clear that such systems need to handle huge amounts of data, process it quickly, and convey decisions and results almost immediately. These requirements pose certain challenges during the design phase.

Let's talk a bit about those challenges, which can be summarized by the three Vs: volume, velocity, and variety of data. In order to work with streaming data, a data engineer must think about a few things. First, how to ingest this data into the system. Second, how to store and organize this data so that it can be processed quickly. And third, how will the storage layer be integrated with other processing layers? Volume is a challenge because the data never stops coming and grows quickly. Now, let's consider the next dimension, velocity. Depending on what you're doing, whether it's trading stocks, tracking financial information, or opening subway gates, you can have tens of thousands of records per second being transferred. Velocity can change as well, as the retail example after the following sketch shows.
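To make the ingestion step concrete, here is a minimal sketch, assuming a hypothetical project, topic, and event shape, of how a point-of-sale terminal might publish its events to Cloud Pub/Sub with the Python client library. Pub/Sub buffers these messages, which is what lets the rest of the pipeline absorb swings in velocity.

```python
# Minimal sketch: publishing one event to a hypothetical Pub/Sub topic.
import json
import time

from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
# "my-project" and "pos-events" are hypothetical names used only for illustration.
topic_path = publisher.topic_path("my-project", "pos-events")

def publish_event(event: dict) -> None:
    """Serialize a point-of-sale event and publish it to the topic."""
    data = json.dumps(event).encode("utf-8")
    # Attributes must be strings; here we attach a coarse publish timestamp.
    future = publisher.publish(topic_path, data, event_time=str(int(time.time())))
    future.result()  # wait for the server-assigned message ID

publish_event({"store_id": "nyc-001", "amount": 19.99})
```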
For example, if you are a retailer designing your point-of-sale system nationwide, you will probably see a reasonably steady volume all year until you get to Black Friday. Then sales, and the data being transferred, go through the roof, so it is important to design systems that can handle that extra load. Along with velocity, the type and format of data also pose constraints on processing. Let's talk about the third challenge, variety. If we are just using structured data, like data coming from a mobile app, that's easy enough to handle. But what if we have unstructured data, like voice data or images? These streaming records need extra handling to deal with that type of unstructured data.

So what services and techniques does GCP provide to deal with these challenges? On the volume side, we will look at a tool to assist in autoscaling processing and analysis so that the system can handle the volume. On the velocity side, we will look at a tool that can handle the variability of the streaming process. And on the variety side, we will look at how artificial intelligence can help us with unstructured data. The three big products we are going to consider here are Cloud Pub/Sub, which will allow us to handle changing volumes of data; Cloud Dataflow, which can assist in processing data without undue delays; and BigQuery, which we will use for ad hoc reporting, even on streaming data.

Now that we are familiar with the associated challenges, let's discuss the approach for system design. The steps involved vary from problem to problem, but a few key steps are common. Let's take a look at some of them. First, some sort of data is coming in, possibly from an app, a database, or an IoT device. These are generating events. Then an action takes place: we ingest and distribute those events with Cloud Pub/Sub. Cloud Pub/Sub provides an asynchronous messaging bus that can hold events until they are consumed by the respective services for further processing. In a typical scenario, Dataflow consumes the messages and applies aggregation and filtering. These actions enrich the data so that meaningful insights can be generated. Next, we write into a warehouse of some kind, such as BigQuery or Bigtable, or maybe run things through a machine learning model. For example, we might use the streaming data to train a model in AI Platform. Then finally, Dataflow may again be used in batch mode for things like backfilling, say when you need to reconstruct some of the historical events that have happened, or when you need to reprocess the data differently to look at other dimensions. This is a pretty common way to put things together in GCP.
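To tie the ingest, process, and write steps together, here is a minimal Apache Beam sketch of that typical scenario. The project, subscription, table, and field names are hypothetical, and the BigQuery table is assumed to already exist; a real Dataflow job would also need a runner configuration, error handling, and an output schema.

```python
# Minimal sketch of the Pub/Sub -> Dataflow -> BigQuery flow described above,
# written with the Apache Beam Python SDK. All resource names are hypothetical.
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms.window import FixedWindows


def run():
    options = PipelineOptions(streaming=True)

    with beam.Pipeline(options=options) as p:
        (
            p
            # Ingest: read raw events from a Pub/Sub subscription.
            | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
                subscription="projects/my-project/subscriptions/pos-events-sub")
            # Transform: parse each message and key it by store ID.
            | "Parse" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
            | "KeyByStore" >> beam.Map(lambda e: (e["store_id"], e["amount"]))
            # Aggregate: sum sales per store over one-minute fixed windows.
            | "Window" >> beam.WindowInto(FixedWindows(60))
            | "SumPerStore" >> beam.CombinePerKey(sum)
            | "ToRow" >> beam.Map(lambda kv: {"store_id": kv[0], "total_sales": kv[1]})
            # Write: append the aggregates to an existing BigQuery table.
            | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
                "my-project:retail.sales_per_minute",
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
                create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER)
        )


if __name__ == "__main__":
    run()
```

Run on Dataflow by passing the usual pipeline options (for example, --runner=DataflowRunner with a project and region), and the same pipeline structure could later be reused in batch mode for backfills by swapping the Pub/Sub source for a bounded one.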