So, here we find ourselves with one of the labs: we're going to explore our eCommerce dataset with SQL inside of BigQuery. First up, your environment should look something like this. Again, the key troubleshooting tip is to make sure you're not in the Qwiklabs resources project; that one often comes up selected by default, and it's tripped me up before. If you want the BigQuery public data and data-to-insights projects, these public projects, to show up here, you'll find a URL at the very start of the lab document that you can access, and as soon as you access it, boom, they should pop up here on the left. Once the environment looks like this, you're good to go.

So, the data analysts on your team have provided you with this eCommerce dataset, specifically this data table, all_sessions_raw. A session, in the web-analytics sense of the word, is when somebody visits your website, an eCommerce website in this particular example: this is the Google Merchandise Store, where we're buying those Google-branded T-shirts and sunglasses. All of the web analytics associated with a visitor's journey, interaction, or experience with that site is recorded in a single session. A session could include things like transactions, viewing different products, adding them to the cart, pretty much anything having to do with interacting with the website.

So, you've got all of that as raw data, and you've been asked to answer a ton of questions. Say you want to build these cool reports and get cool insights, but you're stuck with the raw data. First, how do you explore it and make sure it's all of good quality? That's step number one. Step number two is actually writing those insightful SQL queries. So, let's see. If you're given that dataset, the first thing you have to do is really understand what fields are available for you to query and whether or not that dataset is of good quality. Just because somebody gives you a dataset, you don't know where it came from, you don't know whether or not it has duplicate records or any kind of unclean fields. And especially if I'm giving you a dataset as part of this course, you can guarantee it's going to be unclean, because I like to try to trip you guys up so we can all learn something together.

A couple of brief UI things on the left-hand side, again, to review. You've got the project, and a project can have multiple datasets, so each of these different toggles is a dataset. In this first course, we're focusing on the ecommerce dataset and its different data tables. Here, tables have this icon, and views have this icon. This is the data table that the analysts have provided you. Clicking on that table, you get three different facets of it. You can see the schema here, which has all the fields; the descriptions are blank, but as you saw before, you can flip over to the schema web document. Again, since this is a public dataset and it's Google Analytics data, it's pretty well documented, so if you ever have a question about any of the fields, you can turn to that. Looking at the size, how much space is the dataset using on disk? We're using 5.5 gigabytes, and it's 21 million rows; we'll confirm that count with a quick query in the sketch after this paragraph. You're like, "Wow, that's a decently big dataset." But what on earth does an individual row represent? That should be the first question you always ask of any dataset: what does one row mean? Is it one row per session? One row per visitor? One row per product view?
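If you'd rather confirm the size with a query than with the Details tab, a minimal sketch looks like this; the table path comes straight from the lab, and the #standardSQL comment at the top is explained in a moment:

    #standardSQL
    -- confirm the table size; the Details tab reports roughly 21 million rows
    SELECT COUNT(*) AS total_rows
    FROM `data-to-insights.ecommerce.all_sessions_raw`;

A nice design detail: BigQuery can answer a bare COUNT(*) like this from table metadata, so it reports zero bytes processed.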
As for what one row represents? At this point, no idea. To get a feel for that, you'll see a lot of seasoned developers immediately start typing a SELECT *, all columns, LIMIT 10. I used to do that a lot back in the day. But inside of BigQuery, you can get a cached sample of the data just by clicking on Preview. It's faster, and a key caveat within BigQuery is that when you use a LIMIT clause, BigQuery will still scan all of the rows just to return those top 10, so you're still processing the full table.

Okay. So, previewing the raw data, the first thing that jumps out at me is: all right, my first assumption, that each row is a unique visitor, is out the window, because I see a lot of duplicate visitors here. But maybe they're across different times? No, this one has the same time. So, your first task is to explore the data and see what's in there. If you didn't know this was an eCommerce dataset, you'd start looking over to the right, and you'd see fields like product price and productSKU. As a tidbit, you can actually hover over a column and see its data type here if you don't want to look at the schema. We've got product names, which could be fun for later, product categories, things like page titles, transactions. So, everything to do with eCommerce, and you can again confirm that you have 21 million rows there.

Okay. So, how do we get set up to query this table? First thing: save your query. This is a little bit of a moot point since you're working in Qwiklabs, and all of your queries, including the ones you save here, will be cleared when your lab ends. But it's a really good practice when you're working in your own account: save your query and give it a descriptive name. In this particular case, I'm calling it "lab: exploring eCommerce with SQL." I'm going to make my queries public so all of you can see them, but for your projects, you might be sharing them with just other project team members, so I'd highly recommend you pick that option, or keep them visible only to you. In this particular case, I want to make this public, but since I'm also working in a Qwiklabs account, you're not going to be able to access this saved query, so I'll share the queries I've saved for you later.

Okay. So, we've got 21 million rows to query, and we need to find out what a unique row represents. The first thing we need to do is see whether or not there are purely duplicative rows. This here is a duplicate visitor ID, but I want to see if there are any rows that are completely duplicative across the entire dataset. That's because we're looking for a common identifier, a key, which is eventually going to tell us what each row represents.

So, the first thing you want to do is disable Legacy SQL. I do that through the comment; you can also do it by unchecking this box here. Even if you haven't unchecked it, if you have this comment, #standardSQL, all the way at the top, it's actually interpreted by the SQL engine and it will use standard SQL. But I'm going to untick the box just to be safe. If you're asking, "Hey, why do we need the comment if I'm unticking Legacy SQL there?", a great answer is: if you're sharing your SQL script with others, it also signals that you're using standard SQL, which is the latest and greatest dialect. So, I need to count duplicative rows. How do we do that? I'm going to count all the rows; here's the full query we're building toward.
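For reference, this is a sketch of the finished duplicate-row check we're about to assemble. The column list follows the all_sessions_raw schema as it appears in the lab; double-check it against your own schema tab, since the GROUP BY has to cover every column exactly:

    #standardSQL
    SELECT COUNT(*) AS num_duplicate_rows, *
    FROM `data-to-insights.ecommerce.all_sessions_raw`
    GROUP BY
      fullVisitorId, channelGrouping, time, country, city,
      totalTransactionRevenue, transactions, timeOnSite, pageviews,
      sessionQualityDim, date, visitId, type, productRefundAmount,
      productQuantity, productPrice, productRevenue, productSKU,
      v2ProductName, v2ProductCategory, productVariant, currencyCode,
      itemQuantity, itemRevenue, transactionRevenue, transactionId,
      pageTitle, searchKeyword, pagePathLevel1, eCommerceAction_type,
      eCommerceAction_step, eCommerceAction_option
    HAVING num_duplicate_rows > 1;  -- keep only groups that occur more than once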
COUNT(*) is the aggregation function there, and I'll give it a nice alias, num_duplicate_rows, and then select every single column FROM, in backticks, the table. If you type a backtick, it will pop up all of the different data tables available within your pinned datasets: within the data-to-insights project's ecommerce dataset, we're looking at all_sessions_raw. Again, if you don't get that pop-up, it's because your data-to-insights project is not in the side panel, so make sure you actually go to the URL that was provided in the lab.

Now, in order to identify whether there are rows that share the exact same values across every single column, we have to GROUP BY every single column. A fast way to do that without introducing a ton of typos, and this is also very useful for avoiding typos later on, is by going to the schema. You can quickly access the schema by holding down the Windows or Command key; you can see, as I press it on and off, that it highlights all the tables referenced in my SQL script, and I can click on one, go to Details, and it brings me right back to the schema. Very useful if you've got query results up and you want to return to the schema quickly.

So, we need to group by all the field values. The faster way to do that: let's just start clicking. I'm going to click on every single field here: city, transaction revenue, transactions, time on site, pageviews, session quality dimension, which sounds interesting. This also gives you a familiarity with what fields you can query on. It adds the column names for you, which is great, and you only need to do this exercise once, because you're only checking for duplicates on that raw dataset. Then, if you find any, you can go back to your data analysts and basically say, "Hey, I've got duplicative records; can you please create a table that doesn't have any duplicates?"

Now, this will identify those duplicate records, but it won't filter on them yet. That's where the HAVING clause comes in: HAVING filters on aggregations, whereas the WHERE clause filters rows that have not been aggregated yet. Since we are aggregating here, counting the number of duplicative rows, I only want to return the groups that occur more than once in the dataset, the ones that are truly duplicative out of 21 million. Let's go ahead and run this, and we'll see how many duplicate records are returned.

Now, doing this massive GROUP BY operation across however many different columns and 21 million rows is, performance-wise, an expensive operation, just because of the way BigQuery, as you're going to see in future courses, breaks apart your data and stores it on disk. So, you want to avoid doing massive GROUP BYs unless you have to, and in this particular case, we have to. And we've got 615 records that appear more than once; they're purely duplicative. So, what we're going to do is go back to our data analysts and basically say, "Hey, remove those records." If you're interested in how you can remove duplicative records yourself, we'll cover doing that through the web UI with Cloud Dataprep a little bit later in this course. And in the advanced SQL course later in the specialization, I'll show you how to do it with an advanced function, an array aggregation.
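A rough sketch of that array-based approach, using one common BigQuery idiom (the course's own version may differ, and the unique_row alias is just a name I've picked here): serialize each row so identical rows group together, then keep one row from each group.

    #standardSQL
    -- deduplicate: group rows that match on every column, keep one per group
    SELECT unique_row.*
    FROM (
      SELECT ARRAY_AGG(t LIMIT 1)[OFFSET(0)] AS unique_row
      FROM `data-to-insights.ecommerce.all_sessions_raw` AS t
      GROUP BY TO_JSON_STRING(t)  -- the whole serialized row is the grouping key
    );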
So, that's a little bit of a preview; don't worry if the code doesn't make total sense yet, we're going to go over arrays in detail when we talk about advanced SQL functions.

Okay. So, your analyst community came back to you and basically said, "All right, sorry, I didn't mean to give you all those duplicative records. I've created a new table called all_sessions, and I've given you a little bit more context on the key fields", since there isn't just one, "the key fields that will help your SQL queries when you begin to write them." So, they came back and gave you this query; you'll find a sketch of it at the end of this section. In it, they've said, "All right, these fields are really, really important, and the collection of all of them together should never be duplicative." Same deal here: we've added all those columns, we've counted the occurrences of each combination, and this time I'm grouping by the column index, which is just shorthand for literally copying and pasting the column names, and using HAVING to keep any combination with a row count of more than one. You can see a brief description of each of these field values here, but the complete descriptions are available at that BigQuery Google Analytics schema document, which I always keep open in a separate browser tab, because I myself forget what some of the field values mean.

So, while we run this query: it's basically saying that a unique visitor can have more than one visit, on a particular date, at a particular time, and can look at one product or more than one. Product name is not unique, because there could be different colors of the sunglasses, but the product SKU, the ID, is actually going to be unique. You can have different types of hits, a pageview or an event that's triggered, and these can happen at the exact same time, in case you were thinking, "Hey, I want to use time as my primary key." Not so fast. And a hit can be associated with an eCommerce action, like adding to cart, and so on and so forth.

And we've got great news to report: there are no duplicative records in this dataset. Great. So again, the reason we do this part one is to understand the shape of the dataset: what fields are available, how do we preview the data, what are some key columns we can begin to build our insights on, and, more importantly, is the dataset of good quality before we start querying it? Because if you query a dataset that's not clean, and you start reporting those insights, and people challenge your numbers, it can lead to disaster.
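And here, as promised, is a sketch of the analysts' key-field uniqueness check on the new all_sessions table, reconstructed from the lab; the comments paraphrase their field descriptions, and GROUP BY 1 through 12 is the positional shorthand mentioned above:

    #standardSQL
    SELECT
      fullVisitorId,          -- the unique visitor ID
      visitId,                -- a visitor can have multiple visits
      date,                   -- session date stored as string YYYYMMDD
      time,                   -- time of the individual site hit (0 to many per session)
      v2ProductName,          -- not unique: one product can have variants like color
      productSKU,             -- unique for each product
      type,                   -- a hit can be a PAGE view or an EVENT
      eCommerceAction_type,   -- maps to actions like 'add to cart', 'completed checkout'
      eCommerceAction_step,
      eCommerceAction_option,
      transactionRevenue,     -- revenue of the order
      transactionId,          -- unique ID for a revenue-bearing transaction
      COUNT(*) AS row_count
    FROM `data-to-insights.ecommerce.all_sessions`
    GROUP BY 1,2,3,4,5,6,7,8,9,10,11,12
    HAVING row_count > 1;  -- zero rows returned means no duplicates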