Let's go ahead and walk through Lab 8, which is Unioning and Joining Datasets. If you haven't given it a try already, go ahead and pause this video and give it a try. Be sure to try to actually write out the SQL statements before looking at the solutions, but the solutions are there for your benefit and understanding. Let's practice a lot of those concepts that we've just learned. We've got BigQuery open already and we're going to compose the query which we've done. And it says, write a query that will count the number of tax filings by calendar year. For all of our IRS Form 990 filings, use the partially written query as a guide. Now, we'll be using wildcard and a table suffix. So, what I'd first like to do, I'm going to copy the question that we have in here. I'm going to use a multilevel block comment. So, I just don't forget what the question that we're going to be answering is and I'm going to copy the starter code that we have here. And what we need to do is count the number of tax filings. We've got our count here so we've given that for free which is great, by the calendar year for all IRS 990 filings. So there is a couple of pieces missing here, if we try to just execute this maybe the error will be a little bit more helpful. It says expected keyword from but got keyword as. That means there's a field that's missing here. So what year was it filed in? Now, before we get to the year that it was filed in, we need to actually see what what data table we're pulling from. So you see we've got the IRS 990 data set and then we have IRS 990 and then nothing else after it. I've pulled up the public data over here and I see that I have a lot of different data tables over here. And if I wanted to just include all tax filings by calendar year, I could just put an asterisk there. And if I wanted to actually return what that asterisk is actually matching against, we can use our magic table suffix. Valid query, again you can see how much bytes it's going to process. And the fun part, if you're just counting on metadata, you can actually process 0 bytes, because BigQuery will automatically store that as part of Part of it's optimization. So we've got a lot of records for each of these different tables. You can see we've got the year file, this is what it's matching against. It looks like it's missing that IRS 990, so that's one of the things that you can do to actually get the entire table name. You can do concat, Which will just run a quick concatenation. Rerun the query, and then you'll see the full thing there. Or if you just wanted to get the year again, you can do string manipulation functions and just pull out that year piece as well. So we've got all those tables. Let's move on and see what we need to do with it now. Yep, our solution looks relatively similar. Now we have the year filed, but there's a lot of different tables here. Now not only what we want to do is filter out for the ones that are of this format, irs_990, year year year year. So the great news is we could actually apply a where clause filter. And what I'm going to do here is, Where that table_suffix, So we could do a couple of different things. We want to filter out all of the ezs, and all of the pfs and this ein table here. There's a couple of things that you can do and I'll walk you through a few of them. You could use the, This is the hard coded way. So you could use the table suffix, then you could just specify those filings that you just wanted to return. You could actually use in and operate over a set of values. But what we want to do is actually make our prefix here for the wildcard union a little bit more specific. What we want to do is add in that underscore and it looks like all the tables that we want will immediately begin with 2000. And since 2000 and the data that comes after it is the only data that we have, we're actually going to go ahead and start with that, too. And let's see if that will filter for just the tables that we want. And look even without that where clause. And if we're going to be doing that, we actually need to update our concatnation here. And these are the actual tables that it's pulling from. This is even without a where clause on the table suffix. So let's compare that to the solution. That's exactly right we already had our concatnation in there from before. And that'll actually get out the table that it's pulling it from. And this is the year filed, perfect. And again, if you wanted to parse out to get just the year, What you could do is instead of prefixing the whole table name there, you can just prefix the number 2 and you'll get just the year. Modify your query to only include tax filings from tables on or after 2013. Also, include total revenue and the average of total functional expenses as well. So let's copy these new fields we want to add. We're going to do a few more aggregations, total revenue and total functional expenses. Let's bring those into there and it looks like we're going to be doing average functions on them as well. So we're going to add a few aggregations. Take the average of all those expenses. I'm going to call that. Don't forget, any time you're using a calculated field that it's going to need an alias. So we're going to call that average expenses. This is going to be average revenue. And I'll format my query to make sure everything starts looking pretty good. And we wanns filter for anything that is on or after 2013. So now if we want to do this kind of filtering, we're going to need our prefix for our Union Wildcard filter is as specific as it can possibly be. Now we need to do an actual filter on the table itself that it's returning. There's some interesting operators that you can actually perform here. You can use after, on or after again is greater than or equal to. You can do greater than or equal to 2013 which is probably what we want to do here. But you could use things like between 2013 and 2017, you could do that as well. Or you could do in and hard code all the values in that set. But I think what we'll actually want to do is on or after the year 2013, and that looks good. And let's go ahead and execute that and see what we get. Zero results. Now take a look and see what I did wrong. So I said on or after the year of 2013, but the table suffix that we're actually pulling is a hard coded 2 and then u,u u, year after that. So it's actually going to be this since we already have that 2 in there. And then there we have it. On or after 2013, 2014, 2015 we've got average expenses. You can do a round function on that if you wanted to and average revenue and the number of filings. One of the key things that might trip you up is just making sure to match whatever you're matching here as part of your prefix. Let's continue on. Similarly again, same approach in this solution you'll see it's 2, 0. Generally the more specific you are in your prefix, the better. Now just keep in mind that you have to keep that consistent throughout. If you're concatenating on that, make sure that's the same here, and then not including that 2, 0 in that table suffix. The results look good. Now on to practicing joining. We want to find all the organization names, the charity names for 2015 with some, meaning non-zero, revenue and expenses. We're going to need to join against the tax filing table data and the organizational details table. So now this where it starts to get really fun. Let's copy over our code block. I'm going to clear out all this code that we had before, standard SQL still at the top. So the first thing that we want to do is we need to find our two tables here. So we've got the tax EIN, we've got the organizational tables so we need to say find the tax. So that's going to be, as I manually type this in, bigquery. Let's see what this full name is. Bigquery-public-data, I'm always nervous when typing this stuff in our code. Bigquery-public-data.irs, look at that auto completes, irs_990. And I want to do .irs_990_2015, so those are the 2015 tax filings and if you get it right with the back tick, it will turn into that blue color there for you. And in there we have the tax table. And now we need the organizational details table. I'm just going to copy and paste this. And that is instead of the year for the annual filing. We're just going to use EIN as the organizational detail and look up. Now we need a join condition where the EIN number is equal to the organization's EIN number, that's going to match. And we need to have some revenue or some expenses. So let's pull up again the schema for the 2015 table. And here, let's look at, I think we want total revenue. That is a field. Great. Where we have total revenue is greater than zero or another condition, or. Let's see if there's a total. Expenses, expenses, fundraising expenses, gaming expenses, let's see what kind of expenses that we have, travel expenses, total functional expenses is great. Some revenue or expenses or total functional expenses is greater than zero. So you can do this. You can chain together these different statements or one of the other ways that you could do this. I think this is kind of very, very readable, but we could also do is just a little bit of math. You could basically say hey, I need one of these to be non-zero so what you might just get away with is just saying add them both up together and their sum total must be greater than 0 which I believe is this solution. This solution, that's what we're going to be using. And limit to 100. The 100 limit is here. Let's go ahead and run that query and see our results. Our process about 70 megabytes and again, taking a look here we have the matching EIN numbers, we have the name. And then we have revenue and expenses. 0 revenue here, but it has some expenses, and again, vice versa as you might see in future records. And let's go ahead and compare that with the solution. Same thing here. We have a matching key and we're using the addition here. And then it's just these particular table names you have, no big deal. Clear the query editor to practice working with nulls. You'll see a lot of nulls again when it comes to joins. If you remember our Venn diagrams, that means where the records are present in one table but not matching, or no corresponding records in the other. So you want to find where tax records exists for 2015, but no corresponding organizational names. This is a huge data quality problem in Insight. We have tax records that exist for 2015, but no corresponding for name. So what's the join we want to use for finding data across both tables? That's going to be your full outer join. Remembering that Venn diagram, so basically saying, give me all the records from the tax table and all the records from the organizational details table. And setup our joint condition as we have for tax.ein, that's our primary identifier year. And where, now here's the really interesting part. So everything looks good as far for our join condition. But we need to find where tax records exist for 2015, so what table is that? If the tax records exist, that's going to be this table, the annual tax filings table so no corresponding organizational name. What you could do is, in truth, what you could actually do is where the organizational name is missing. But if we want to get hyper-specific and say that the entire record doesn't exist in the other table, you could do this. The identifier for the record is not present in the other table. And I actually have messed up our join condition. We want to do org.ein. Everything looks good. Go ahead and run that query, and if the data quality is good, should return no records. Interesting, so there are 1351 records where there is a filing in 2015 table, but there is no organization id, so you can't get the name of these, it's null. That's super interesting. I trust SQL, but I really am interested and want to confirm. So, how do you do that? Copy over one of these EIN numbers. And then I'm just going to do a select star from where the EIN number is equal to this. I think it's stored as a string. We'll see. If it's an integer, we'll just remove those quotes from this particular table and this should, if logic holds, return zero results. It is an integer, that's fine. So this EIN doesn't even exist in the organizational details table. That's fascinating. Yep, zero results. There's no address, but there is a filing for it inside of the 2015 filings table. Again, this data quality discovering a lot of these issues ahead of time is going to really, really lead you to some good insights later on down the road. So let's see if we've also found those in the lab, hopefully so. Got the solution and we said that org.ein is null. You could also probably get away with the org.name is null, run the query. And yes, there are a 1,351 tax filings that don't belong to any specific charity name. I mean the EIN exists, but that charity lookup table is obviously missing those records. If you've covered the mastery of looking at those nulls and manipulating that Venn diagram in your head, then you've gotta strong mastery of joins. So let's review very briefly. We manipulated a lot of the use of the table suffix for filtering out union wildcards. We have worked on full outer joins. That's a very interesting to look at records that are in one table but not the other. Then again a full outer join is just a combination of that left join and that right join. And again to check for unique records you can use that count and count distinct or a count in a group by. Nice work.