So, what we want to do now is add a new recipe, and as you can see, you can have multiple different data sources as part of your flow, but right now we want to explore the data a little bit. A recipe allows us to chain together a series of transformations on the raw data. So, let's go ahead and edit the recipe, and this will load, as of the time of this recording, about 10 megabytes of sample data. We had 56,000 or so rows in the source, and in the lower left-hand corner you'll see that we've loaded about 13,000 of those, just a small sample, into the web UI of Cloud Dataprep called the transformer view. This gives you a quick look at some of the common key data values and lets you scan across frequent values in histograms, how many different categories exist for each type of data, what all the different field values are, and some useful mathematical operations like averages, medians, standard deviations, that kind of thing.

Now, it is just a sample. It's not part of this lab, but you can actually explore the different sample types you can bring in. So if you didn't want a random sample, if you wanted more of a stratified random sample, or otherwise wanted to provide guidance on what sample data gets brought in, that's a setting within Cloud Dataprep. But right now we're fine with whatever it brings in as a random sample.

So, the first task in this lab is to understand the e-commerce data as much as we can, even before we start adding in any transformations, or even before we think about putting that table back into BigQuery. There are a couple of questions we want to answer, such as how many columns are in the dataset. Keep in mind we're doing this for the e-commerce dataset now, and you'll get a really good understanding of how to work within it, but if somebody gives you a random dataset you're not familiar with, load it into Cloud Dataprep and see the columns with their frequent values. The same exploration process we're applying here is going to be beneficial for your own datasets later on.

So, columns in the dataset: we have 32 columns. How many rows are loaded in our sample? You see here, all the way at the bottom, about 13,000. Now I'm going to ask a few more of the questions that would normally take you a SELECT with a GROUP BY on a particular column in SQL, and doing that 32 times sounds really onerous. That is why we get this nice visual view of the frequent values for the sample.

So, first question: what are the most common channel groupings for the dataset? You have seven different category values here, again within the sample. It looks like we're driving a significant amount of traffic through referrals and organic search; those are the top two. Big asterisk here: this is just our sample of the data, and your sample could actually be different from mine, because you could have loaded the sample a different way or been given other random records, but by and large it's referral traffic and organic search for this particular date that we've loaded in. What are the top countries that provide us visitors? Over here, by and large, for this day, 80 percent are from the United States, and then we have India, the United Kingdom, and a longer tail; same deal for cities.
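All of these frequent-value questions are what the transformer view answers at a glance. For reference, here is roughly what that per-column exploration would look like as a query; the table name `ecommerce.all_sessions` and the column name `channelGrouping` are illustrative assumptions, not something this lab has created yet:

```sql
-- Hypothetical table/column names, assumed for illustration.
-- Frequent-value histogram for one column: what Dataprep renders
-- for all 32 columns at once, you'd repeat per column in SQL.
SELECT
  channelGrouping,
  COUNT(*) AS row_count
FROM `ecommerce.all_sessions`
GROUP BY channelGrouping
ORDER BY row_count DESC;  -- referrals and organic search on top
```

Dataprep's histograms are effectively this query run across every column at once, which is exactly why the visual view beats writing it 32 times.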
Now, let's scroll over. What does the gray bar under total transaction revenue represent? Here you've got the data type; in this particular case it's an integer, and Cloud Dataprep will often try to assume good intent. So if you load something in as a string that it thinks should be an integer, it will try to auto-convert it for you. You have to be a little bit careful with the intelligence the tool brings in, especially when it comes to data types where you have numbers and strings mixed together. For total transaction revenue here, the gray bar represents missing values. It's saying: I see a lot of integer values here, so I'm going to call this an integer, but there are a lot of null or missing values, more than half. What does that actually represent in terms of our dataset? It means a lot of visits didn't have revenue associated with them; not everyone that visits your site is going to buy something, unfortunately. So you can almost back out a rough conversion rate here: less than half of the visits generated revenue.

Now, let's take a look at some metrics that would normally require an aggregation function, but here it's just a couple of simple clicks. You have time on site, you have page views, and you have the session quality dimension, which ranges from zero to 100; the closer to 100, the closer that session is to a transaction that's going to convert. You can just visually look at these histograms and get a general sense of the distribution. As you might expect, time on site is hovering way toward the left side, the lower end of the scale; same thing with page views, and same thing with the session quality dimension.

But let's get a little bit more specific. If you click on the drop-down for a column, you can look at the details of the column. This is time on site: under statistics in the lower left-hand corner, you can see the average time on site, about 900 seconds, and you can see the min, max, and standard deviation; again, very quick at-a-glance values. Now we'll switch back to the grid view and do the same thing for page views. How many pages are folks visiting on average? Let's go to the column details again, and according to this we have 20 pages. Back to the grid view, and on to the session quality dimension: what's the average quality of these sessions? In the column details, according to this data, we have an average session quality of 38.

One of the things I will bring up again here is that this assumes each different row represents a unique session, and we know from the previous labs that's not necessarily the case: this record right here, the 91 you see duplicated multiple times, could indicate that four different products were viewed within the same session, and the session quality dimension was simply repeated. So again, take the data relationships, the histograms, and the value graphs here with a grain of salt. They'll provide you general directional trending, but there's no replacement for understanding what a unique row represents, or for the exercise you did in the previous lab to really understand the data relationships. Let's see, what's the date range for this? We just loaded in one day's sample, which is August 1st, 2017.
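As a cross-check, here is a sketch of how those at-a-glance statistics and the revenue-based conversion estimate might look as a single query. The table and column names (`ecommerce.all_sessions`, `timeOnSite`, `pageviews`, `sessionQualityDim`, `totalTransactionRevenue`) are assumptions for illustration; adjust them to your own schema:

```sql
-- Hypothetical table/column names, assumed for illustration.
SELECT
  AVG(timeOnSite)        AS avg_time_on_site,     -- ~900 seconds in this sample
  MIN(timeOnSite)        AS min_time_on_site,
  MAX(timeOnSite)        AS max_time_on_site,
  STDDEV(timeOnSite)     AS stddev_time_on_site,
  AVG(pageviews)         AS avg_pageviews,        -- ~20 in this sample
  AVG(sessionQualityDim) AS avg_session_quality,  -- ~38 in this sample
  -- Share of rows with a non-null revenue value: the rough
  -- "conversion rate" you can back out of the gray bar.
  COUNTIF(totalTransactionRevenue IS NOT NULL) / COUNT(*)
    AS share_with_revenue
FROM `ecommerce.all_sessions`;
```

Keep in mind the same caveat as above: if rows aren't unique sessions, these averages describe rows, not sessions.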
You also see duplicative visit IDs, and we know that field isn't unique on its own, so we're going to have to concatenate it together with the full visitor ID to get a unique session ID. We have two different hit types, page and event: people are viewing pages, and they can also simultaneously click on something, like adding an item to the cart, and that is tracked as an event. There are no refunds in this particular demo dataset.

We have an interesting red bar here for product SKU. What does the red bar represent? We know the gray bar is missing or null values; the red bar indicates mismatched values. This is potentially a data quality problem, but again, as I said, take the intelligence built into the tool with a grain of salt. As you scroll down, it looks like the tool saw quite a few product SKUs that were plain integers, which are valid values, and then it said, "Hey, wait a minute, I'm assuming this is an invalid value because I'm expecting this to be a number." We're basically just going to say, hey, SKUs are absolutely allowed to be string values, and as you can see, the majority of them probably are string values. How do we change that data type? You can use the column drop-down, choose Change type, and change it back to a string, which I believe is what it was in the original dataset. So it's a good idea to scan through your data and make sure the tool didn't do anything funky with the data types.

Most popular products by row count here: the top three are Nest products, the outdoor security camera, the indoor security camera, and the thermostat. For the most popular product categories, we have Nest, bags, and (not set). Now, what is the most popular product variant? Is it a color? The answer is no: the majority of the products simply don't have variants. You can imagine going to a product details page, and if it doesn't have a drop-down to select a size, like large, medium, or small for your T-shirt, then the variant is just not set. What about currency code? The dominant transaction currency, as you can see, is United States dollars, USD. There are no valid values for item quantity or item revenue, suggesting these were maybe artifacts of a previous schema or just aren't populated, so we can simply drop those columns when we start cleansing our dataset.

What percentage of transaction IDs have a valid value? It looks like the majority of these are missing values; we have a couple of transaction IDs that are valid. Percentage-wise, going into the column details again, 95.4 percent are missing and 4.6 percent have valid values, and that'll give you a direction toward your conversion rate, again keeping in mind that you could have duplicative records or just a different relationship than you're expecting inside your data. Still, it gives you a start in the right direction. The common page titles are here, and the last thing we want to take a look at is the different types of e-commerce actions. Here we have values zero through six in our sample dataset, with the most dominant one of course being zero. Again, if you're already familiar with the dataset in BigQuery, you'll note that the tool changed this column from a string to an integer; that's why you might not see the null values you'd expect to be in the dominant position, because it probably interpreted those as zero once they became integers. Again, there's some intelligence built into the tool, but it's always good to double- and triple-check.
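To tie a few of these findings together, here is a hedged sketch of the session-ID fix, the SKU type fix, and the transaction ID validity check expressed in SQL; the table and column names (`ecommerce.all_sessions`, `fullVisitorId`, `visitId`, `productSKU`, `transactionId`) are again assumptions for illustration:

```sql
-- Hypothetical table/column names, assumed for illustration.
-- Build a unique session ID, since visitId alone is duplicated,
-- and keep SKUs as strings rather than letting them become integers.
SELECT
  CONCAT(fullVisitorId, CAST(visitId AS STRING)) AS unique_session_id,
  CAST(productSKU AS STRING) AS productSKU
FROM `ecommerce.all_sessions`;

-- Percentage of rows with a valid (non-null) transaction ID.
SELECT
  ROUND(100 * COUNTIF(transactionId IS NOT NULL) / COUNT(*), 1)
    AS pct_valid_transaction_ids  -- ~4.6 in this sample
FROM `ecommerce.all_sessions`;
```

The concatenation mirrors what we'll do in the Dataprep recipe itself; the percentage query is just the SQL equivalent of reading the valid-versus-missing split off the column details panel.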