The next section of the exam guide is designing data pipelines. You already know how the data is represented in each service: in Cloud Dataproc and Spark, it's in an RDD; in Cloud Dataflow, it's in a PCollection; and in BigQuery, the data is in tables within datasets. And you know that a pipeline is some kind of sequence of actions or operations to be performed on that data representation. But each service handles a pipeline differently.

Cloud Dataproc is a managed Hadoop service. There are a number of things you should know, including the standard software in the Hadoop ecosystem and the components of Hadoop. However, the main thing you should know about Cloud Dataproc is how to use it differently from standard Hadoop. If you store your data external to the cluster, keeping HDFS-type data in Cloud Storage and HBase-type data in Cloud Bigtable, then you can shut your cluster down when you're not actually processing a job. That's very important.

What are the two problems with Hadoop? First, trying to tweak all of its settings so it can run efficiently with multiple different kinds of jobs, and second, trying to cost-justify utilization. So you search for more users to increase your utilization, and that means tuning the cluster. And then, if you succeed in making it efficient, it's probably time to grow the cluster. You can break out of that cycle with Cloud Dataproc by storing the data externally, starting up a cluster, running it for one type of work, and then shutting it down when you're done. When you have a stateless Cloud Dataproc cluster, it typically takes only about 90 seconds for the cluster to start up and become active.

Cloud Dataproc supports Hadoop, Pig, Hive, and Spark. One exam tip: Spark is important because it does part of its pipeline processing in memory rather than copying from disk. For some applications, this makes Spark extremely fast. With a Spark pipeline, you have two different kinds of operations, transforms and actions. Spark builds its pipeline using an abstraction called a directed acyclic graph. Each transform adds nodes to the graph, but Spark doesn't execute the pipeline until it sees an action. Very simply, Spark waits until it has the whole story, all the information. This allows Spark to choose the best way to distribute the work and run the pipeline. The process of waiting on transforms and executing on actions is called lazy execution. For a transformation, the input is an RDD and the output is an RDD. When Spark sees a transformation, it registers it in the directed graph and then waits. An action triggers Spark to process the pipeline; the output is usually a result format, such as a text file, rather than an RDD.

Transformations and actions are API calls that reference the functions you want them to perform. Anonymous functions in Python, lambda functions, are commonly used to make the API calls. They're self-contained ways to make a request to Spark, each one limited to a single specific purpose. They're defined inline, making the sequence of the code easier to read and understand. And because the code is used in only one place, the function doesn't need a name and it doesn't clutter the namespace. An interesting and opposite approach, where the system tries to process the data as soon as it's received, is called eager execution. TensorFlow, for example, can use both lazy and eager approaches.
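To make lazy execution concrete, here's a minimal PySpark sketch. The Cloud Storage path and the filter logic are hypothetical; the point is that the filter and map transformations only add nodes to the graph, and nothing runs until the count action at the end.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lazy-execution-demo").getOrCreate()
sc = spark.sparkContext

# Hypothetical input file kept in Cloud Storage, external to the cluster.
lines = sc.textFile("gs://my-bucket/logs/app.log")

# Transformations: each lambda is an anonymous, single-purpose function.
# These only register nodes in the directed acyclic graph; nothing executes yet.
errors = lines.filter(lambda line: "ERROR" in line)
messages = errors.map(lambda line: line.split("\t")[-1])

# Action: only now does Spark plan the whole pipeline and run it.
print(messages.count())
```

If you removed the final count, Spark would never read the file at all, which is lazy execution in action.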
You can use Cloud Dataproc and BigQuery together in several ways. BigQuery is great at running SQL queries, but what it isn't built for is modifying data, real data-processing work. So if you need to do some kind of analysis that's really hard to accomplish in SQL, sometimes the answer is to extract the data from BigQuery into Cloud Dataproc and let Spark run the analysis. Also, if you need to alter or process the data, you might read from BigQuery into Cloud Dataproc, process the data, and write it back out to another dataset in BigQuery. Here's another tip: if the situation you're analyzing has data in BigQuery, and perhaps the business logic is better expressed in functional code rather than SQL, you may want to run a Spark job on the data.

Cloud Dataproc has connectors to all kinds of GCP resources. You can read from GCP sources and write to GCP sinks, using Cloud Dataproc as the interconnecting glue. You can also run open source software from the Hadoop ecosystem on a cluster. It would be wise to be at least familiar with the most popular Hadoop software and to know whether alternative services exist in the cloud. For example, Kafka is a messaging service, and the alternative on GCP would be Cloud Pub/Sub. Do you know what the alternative on GCP is to the open-source HBase? That's right, it's Cloud Bigtable. And the alternative to HDFS? Cloud Storage.

Installing and running Hadoop open source software on a Cloud Dataproc cluster is also an option. Use initialization actions, which are init scripts, to load, install, and customize software. The cluster itself has limited properties that you can modify. But if you use Cloud Dataproc as suggested, starting a cluster for each kind of work, you won't need to tweak the properties the way you would with data center Hadoop. Here is a tip about modifying the Cloud Dataproc cluster: if you need to modify the cluster, consider whether you have the right data-processing solution. There are so many services available on Google Cloud that you might be able to use a service rather than hosting your own software on the cluster. If you're migrating data center Hadoop to Cloud Dataproc, you may already have customized Hadoop settings that you would like to apply to the cluster. You may want to customize some cluster configurations so that it works similarly. That's supported in a limited way by cluster properties. Security in Cloud Dataproc is controlled by access to the cluster as a resource.
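As a sketch of the read-process-write pattern described above, here's roughly what a PySpark job on Cloud Dataproc might look like, assuming the spark-bigquery connector is available on the cluster. The project, dataset, table, and bucket names are hypothetical, and the aggregation is just a stand-in for whatever logic was awkward to express in SQL.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("bq-dataproc-example").getOrCreate()

# Read a BigQuery table into a Spark DataFrame (hypothetical table name).
orders = (spark.read.format("bigquery")
          .option("table", "my-project.sales.orders")
          .load())

# Process the data in Spark instead of SQL.
summary = (orders
           .filter(F.col("status") == "SHIPPED")
           .groupBy("region")
           .agg(F.sum("amount").alias("total_amount")))

# Write the result back out to another BigQuery dataset. The connector
# stages data through a Cloud Storage bucket (hypothetical bucket name).
(summary.write.format("bigquery")
 .option("table", "my-project.reporting.order_totals")
 .option("temporaryGcsBucket", "my-staging-bucket")
 .mode("overwrite")
 .save())
```

Because the data lives in BigQuery and Cloud Storage rather than on the cluster, a job like this fits the stateless pattern described earlier: start a cluster, run the work, and shut the cluster down when it finishes.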