Welcome back. This is the third course in the data analyst specialization. Here, we're going to achieve advanced insights in BigQuery. For the first module, we're going to continue our journey in SQL by delving into a few of the more advanced concepts, like statistical approximation functions and user-defined functions. Then we'll explore how to break apart really complex data questions into step-by-step, modular pieces in SQL by using common table expressions and subqueries. Let's start by revisiting the SQL functions that we've covered so far.

>> All right, it's time to revisit some of those functions that we skipped over in previous courses as we move into more and more advanced SQL concepts. The three on the agenda are the statistical functions, which include some pretty fun approximations; the analytical functions, which you may have heard of as window functions or partitioning within a query; and last but not least, the UDFs, or user-defined functions, which can be written in SQL or JavaScript. Let's tackle them one at a time.

First up, some pretty generic statistical functions that you can run on your data, things like a standard deviation or a correlation analysis. This is a great example of applying some basic statistics. Here we're looking at the IRS dataset, and we're getting the standard deviation of the employee counts. For the second function, the correlation, we're asking how correlated the revenue that one of these charities brings in is with the functional expenses it's spending on its charitable programs and offerings. That's the ultimate question: how correlated do you think those two are? The closer the result is to one, the tighter the one-to-one correlation. Running that SQL, you'll see the values here. The standard deviation for employees is about 1,500 employees, and you can see that revenues and expenses for these non-profit charities are very tightly correlated. Again, this is for all of the records in the 2015 file and dataset. Look at that 0.97, extremely highly correlated.

All right, that's enough of the basic statistical functions. Let's move into some of the more otherworldly ones: the approximate aggregation functions. Say you want a distinct count over a dataset with an absurd number of rows. What you can use is APPROX_COUNT_DISTINCT. Much like you've seen before with the COUNT DISTINCT there on that second line, you can get the exact count, but you can also get the approximation, which can be good enough to get the job done.

Now you might be thinking in the back of your head, all right, the COUNT DISTINCT ran in under five seconds for the dataset we're using in our labs. Why on earth would I ever want the approximate count distinct when I can get the exact count, right? And me coming from a finance background, we wanted everything down to the penny, right? But imagine a scenario where you work at Google and you have to count all of the logs, or all of the user logins, over a multi-year period. You're talking about petabytes of data, and you don't have all the time in the world. You want to sacrifice a little bit of accuracy for the speed with which you get your query back, especially when you're counting all the logs that are processed or all the ads that are served.
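To make that concrete, here's a rough sketch of both ideas in BigQuery Standard SQL. The table and column names are my best recollection of the public IRS 990 dataset's schema, so treat them as illustrative and check the actual schema before running.

```sql
-- A sketch of the basic statistical functions on the IRS 990 2015 filings.
-- Column names (noemplyeesw3cnt, totrevenue, totfuncexpns) are assumptions
-- based on the public irs_990 dataset; verify against the schema.
SELECT
  STDDEV(noemplyeesw3cnt)        AS stdev_employee_count,
  CORR(totrevenue, totfuncexpns) AS revenue_expense_correlation
FROM
  `bigquery-public-data.irs_990.irs_990_2015`;
```

And here's the exact-versus-approximate comparison side by side. APPROX_COUNT_DISTINCT trades a small, statistically bounded amount of accuracy for much less time and memory on very large tables.

```sql
-- Exact and approximate distinct counts in one query (ein, the filer's
-- employer identification number, is again an assumed column name).
SELECT
  COUNT(DISTINCT ein)        AS exact_distinct_filers,
  APPROX_COUNT_DISTINCT(ein) AS approx_distinct_filers
FROM
  `bigquery-public-data.irs_990.irs_990_2015`;
```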
It's very popular to use that approximate count distinct, so let's look at an example. This one comes from one of our BigQuery developer advocates, Felipe, who is an amazing SQL writer. What we're going to invoke here is that APPROX_COUNT_DISTINCT. The task at hand uses the GitHub dataset, which is public on BigQuery and includes all of the user logins and all of the public code on GitHub, a very fun public dataset to play around with. Here, we're counting the approximate number of logins as an approximate value, and we're looking at it by year. You'll also see concepts that we covered in previous courses: the table wildcard, that asterisk in the FROM statement, and the filter on the table suffix, which is one of the concepts we covered in the merging datasets topic.

Okay, so we now have an approximate count of users. But what you might be asking in the back of your mind is, how accurate is that compared to the actual count? And say we wanted to get a realistic count of all of the unique users over the whole period. Because this data can include the same user logging in across multiple years, how do we parse that out? Doing that takes a more complex query, probably the most complex one you've seen to date as part of this specialization. So let's take a deep breath and walk through it step by step, line by line.

The first thing you'll notice is that we're invoking what's called a common table expression all the way at the top. We'll cover that WITH clause shortly, but in essence, you can think of it as a temporary table. It's basically saying: everything inside the parentheses of that WITH github_year_sketches, that whole SELECT statement, gets stored kind of like a temporary table that we can use in a subsequent query. And the reason you do that, again, is to break apart a very complex question into multiple steps. You can run a processing layer on the data and then query those results within the same query. Those WITH statements, or common table expressions as they're called, are extremely useful.

Okay, so let's get into the guts of the query. In that first part, we want to get the approximate number of logins, and we're going to invoke a function based on HyperLogLog for another kind of approximation. It's also going to estimate the number of logins, just using a different function that's available. There's a long white paper written on HyperLogLog; it's a publicly available algorithm, and it's used all the time for this kind of statistical estimation.

Once we've done that, the great thing is that we can estimate year over year over year, for however many years of GitHub data we have. I think on the previous slide it was something like 2000 to current. In HyperLogLog terminology, those yearly estimates are actually called sketches. So we're getting a collection of sketches, which are, in a sense, estimates of counts. And in that second block of code at the bottom, we can add those sketches together, think of it like summing them up, through HLL_COUNT.MERGE.
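Condensed down, the shape of that query looks something like this. This is a sketch of the two-step pattern rather than Felipe's exact query: the `githubarchive.year.*` wildcard and the actor.login field are my assumptions about the public GitHub dataset, though HLL_COUNT.INIT and HLL_COUNT.MERGE are the actual BigQuery functions involved.

```sql
-- Step 1 (the CTE): build one HyperLogLog sketch of user logins per year.
-- Step 2 (the outer query): merge the yearly sketches into a single
-- approximate distinct count. Table and field names are illustrative.
WITH github_year_sketches AS (
  SELECT
    _TABLE_SUFFIX AS year,
    HLL_COUNT.INIT(actor.login) AS login_sketch  -- one sketch per year
  FROM `githubarchive.year.*`
  GROUP BY year
)
SELECT
  HLL_COUNT.MERGE(login_sketch) AS approx_unique_users  -- merge + count
FROM github_year_sketches;
```

The nice design property here is that sketches are small and mergeable: you can build one per year, per day, or per shard, and combine them later without rescanning the raw data.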
That MERGE step combines all of those estimations into one approximate number of unique users. And that, as you see in the last line there, comes from the common table expression, the github_year_sketches, that we created earlier.

Okay, so that was a lot to cover. But at the end of the day, we ended up with approximately 11 million unique GitHub users. And in a different query, which we're not showing here, we ran the exact count and found that this approximation was 99.7% accurate. So again, here's the trade-off: whether you want 100% accuracy and have your queries run a little bit longer, or whether you're fine with executing the approximation functions instead.

So if you see these in code along the way, or you start writing these approximations and estimations on very large datasets, just know that they're out there and available for you to use. And feel free to read the blog article for a little more detail. Approximation functions are another tool in your toolkit for very, very large datasets, so feel free to get familiar with them. And we'll come back to those WITH clauses a little bit more.