So the next thing we need in the data is to know which year we're dealing with. In the data we have a variable for the date, and it comes in a format of eight characters. The first four are the year, 1999 or whatever it happens to be. The next two characters are the month, for example 07 for July. And the last two are the day of the month, for example 14 for the 14th day of the month. So we have that individual date, but what we're actually interested in for the data analysis is the year. So we need to create a measure of year from this date variable. It's a relatively simple idea: we've got this eight-character string, the first four characters are the year, and the last four characters are the month and the day. We just want those first four characters. So in effect, we want to cut this string in half and keep the first half as a measure of year. And that's what the next line of code does. We use .astype('str'), which tells Python that we want to treat this variable as a string. In Python you can have strings and you can have integers. A string is a sequence of characters, while integers are whole numbers. The date is stored as a number, and in order to cut it, Python needs to treat it as a string. So that's what this part of the command tells it to do. The final part of the command then tells it to take the first four characters. The 0 represents the first character, and the command tells it to stop before position 4, so it will take characters 0, 1, 2, 3. If we then run the code, we can see that we do indeed produce the variable we need. So here on the left-hand side you can see the date variable, which is this eight-digit number. And then right at the end, we've added the variable we've created, year, which is just the first four characters of date.
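As a minimal sketch of the step just described, the conversion and the slice might look like this. The data frame name games, the column name date, and the three sample dates are made up for illustration; only .astype('str') and the 0-to-4 slice come from the description above.

```python
import pandas as pd

# Hypothetical toy game log: each date is stored as an eight-digit number (yyyymmdd).
games = pd.DataFrame({"date": [19990714, 20000403, 20031001]})

# Treat the number as a string, then keep characters 0, 1, 2, 3.
# The stop index 4 is excluded from the slice.
games["year"] = games["date"].astype("str").str[0:4]

print(games)
```

Note that year is still a string after this step. If it needs to be a number later on, it can be converted back with .astype(int).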
So we now have the year, which is the same as the season, of course, in baseball. We have the season year for each game in our dataset. So now we need to create the variables for slugging percentage and on-base percentage from our data. Now in order to do that, we need to do two things. We're not just interested in the slugging percentage and on-base percentage of each team. We're also interested in the slugging percentage and on-base percentage of their opponents throughout the season. And what we have in our data is a list of games. For each game we have two teams. So in other words, we have two sides to every game, every row in our data. We have the performance of the home team and we have the performance of the visiting team. And we're going to want to find a way to extract those two sides from the data frame. Now the statistics, the actual variables we need, are contained here in the data. We have at bats, hits, second base, third base, home runs, sacrifice flies, base on balls, and hit by pitch, which we need to construct slugging percentage and on-base percentage. You might note that the one statistic that's not directly included is the number of singles. We have doubles, triples, and home runs, but we don't have singles. Well, a single is every hit that isn't a double, a triple, or a home run. So we can define singles simply as hits minus second base, third base, and home runs. So in order to construct these statistics, we are going to need to think of each game from two perspectives: one from the perspective of the home team and one from the perspective of the visiting team. Each team throughout the season is sometimes a home team and sometimes a visiting team. And what we're going to do is create two datasets. The first is the record of teams as home teams, and the second is the record of teams as visiting teams. And then we're going to combine those two datasets to get the aggregate statistics for each team.
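To make the singles definition concrete, here is a small sketch on one made-up season line. The column names (AB, H, 2B, and so on) and the numbers are hypothetical, and the slugging and on-base formulas are the standard baseball definitions rather than anything taken from the code in this lesson.

```python
import pandas as pd

# Hypothetical season totals for one team; all numbers are invented.
team = pd.DataFrame({
    "AB": [500], "H": [150], "2B": [30], "3B": [5], "HR": [20],
    "BB": [50], "HBP": [5], "SF": [5],
})

# Singles are the hits that are not doubles, triples, or home runs.
team["1B"] = team["H"] - team["2B"] - team["3B"] - team["HR"]

# Standard definition: total bases divided by at bats.
team["SLG"] = (team["1B"] + 2 * team["2B"] + 3 * team["3B"] + 4 * team["HR"]) / team["AB"]

# Standard definition: times on base divided by plate appearances counted for OBP.
team["OBP"] = (team["H"] + team["BB"] + team["HBP"]) / (
    team["AB"] + team["BB"] + team["HBP"] + team["SF"]
)

print(team[["1B", "SLG", "OBP"]])
```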
So we'll start by looking at the home team. We're going to create a new data frame called Teamshome. And we're going to measure the aggregate performance statistics of each team using this group by command, which is something we use a lot of the time in order to organize the data. Here we're going to group by home team and by year. And then we're going to get the totals for visitor team performance and home team performance. So in this sense, the home team performance will tell us about the statistics for the team itself: its positive output statistics, its statistics on offense. And the visitor team statistics will tell us about its performance on defense, the success of its opponents. Notice that in the data, the team is referred to as the home team. But when we actually use this data later on, we're going to rename the home team as just the team, because we're going to aggregate each team's statistics as home team and away team. And the name team will be the variable which is used to combine the two datasets. So if we run that, we will see that we now get the statistics for each of the teams in each of the seasons. You can see here, for example, we have Anaheim for 1999, 2000, 2001, 2002, 2003, so the five seasons we're interested in. We have the visitor at bats, the visitor hits, the visitor second base, that is doubles, the visitor triples, and so on. And then we have the home team performance here, home team at bats, home team hits, and so on. So that's one side of the data, looked at from the perspective of teams when they are the home team. Now we're going to do the same thing for teams as visiting teams, and call this Teamsaway. Again, we're going to use group by with exactly the same statistics, but we're now going to group by visitor. So this will be the performance of the visiting teams.
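The two group by steps described here could be sketched roughly as follows. The data frame names Teamshome and Teamsaway come from the lesson, but the toy games and the column names (home, visitor, home_ab, visitor_ab, hwin, awin as per-game columns) are assumptions made for this illustration, not the actual column names in the game log.

```python
import pandas as pd

# Hypothetical toy game log: one row per game, with both sides' at bats
# and an indicator for whether the home (hwin) or away (awin) team won.
games = pd.DataFrame({
    "year":       ["1999", "1999", "1999", "2000"],
    "home":       ["ANA",  "ANA",  "BOS",  "ANA"],
    "visitor":    ["BOS",  "NYA",  "ANA",  "BOS"],
    "home_ab":    [33, 35, 31, 34],
    "visitor_ab": [36, 32, 37, 30],
    "hwin":       [1, 0, 1, 1],
    "awin":       [0, 1, 0, 0],
})

stat_cols = ["visitor_ab", "home_ab", "hwin", "awin"]

# Teams as home teams: the home_* columns are the team's own statistics,
# the visitor_* columns are its opponents' statistics.
Teamshome = (
    games.groupby(["home", "year"])[stat_cols]
    .sum()
    .reset_index()
    .rename(columns={"home": "team"})
)

# Teams as visiting teams: now the visitor_* columns are the team's own.
Teamsaway = (
    games.groupby(["visitor", "year"])[stat_cols]
    .sum()
    .reset_index()
    .rename(columns={"visitor": "team"})
)

print(Teamshome)
print(Teamsaway)
```

Renaming both grouping columns to team is what gives the two data frames a common key to combine on afterwards.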
Now, the trick here is to recognize that when we talk about the visitor, we're talking about the team that we're actually focusing on. And when we're talking about the home team, we're actually talking about the opponent of the team we're interested in. So if we run this, we can see, for example, Anaheim in 1999 and its performance as a visiting team. These visitor statistics are Anaheim's statistics. And then as we move along, once we get to the home columns, these home team columns relate to the performance of teams at home playing against Anaheim as a visiting team. So now we want to merge these two datasets to get the combined performance of each team, both as home team and as visiting team, across the season. And this is where we take advantage of the fact that we renamed the home team and the visiting team as just team. We can use that as a criterion to merge. So when we merge, we merge on team and year. This will match up the statistics of each team as home team and as visiting team in each of the seasons. And we create this new data frame, Teams2. We can now see these columns here. Now, one thing to notice is that the names have changed. The x and y at the end of the variable names record which of the two data frames the data came from: x is the home team data frame and y is the away team data frame. So for example, in the first row, Anaheim in 1999, the home at bats x represents the at bats of the Anaheim team when it was at home. The equivalent statistic in the y data frame would be the visitor at bats y. These would be the at bats of the Anaheim team when it was a visiting team, using the away team data frame. Likewise, the opponent at bats against Anaheim in 1999 would be taken from the visitor at bats x here, so that's the opponents of Anaheim when Anaheim was at home, plus the home team at bats y. If I can just find that, that's the home team at bats y.
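Here is a rough sketch of the merge and of how the x and y columns pair up. Teamshome, Teamsaway, and Teams2 are the names used in the lesson. The column names home_ab and visitor_ab, the new columns AB and AB_against, and all the numbers are hypothetical. It also relies on the pandas default of adding _x to overlapping columns from the first data frame and _y to those from the second.

```python
import pandas as pd

# Hypothetical aggregated records: teams as home teams and as visiting teams.
Teamshome = pd.DataFrame({
    "team": ["ANA", "BOS"], "year": ["1999", "1999"],
    "visitor_ab": [68, 37], "home_ab": [68, 31],
})
Teamsaway = pd.DataFrame({
    "team": ["ANA", "BOS"], "year": ["1999", "1999"],
    "visitor_ab": [37, 36], "home_ab": [31, 33],
})

# Columns present in both frames get _x (from Teamshome) and _y (from Teamsaway).
Teams2 = pd.merge(Teamshome, Teamsaway, on=["team", "year"])

# The team's own at bats: at home (home_ab_x) plus on the road (visitor_ab_y).
Teams2["AB"] = Teams2["home_ab_x"] + Teams2["visitor_ab_y"]

# Opponents' at bats: visitors at the team's park (visitor_ab_x)
# plus hosts when the team was visiting (home_ab_y).
Teams2["AB_against"] = Teams2["visitor_ab_x"] + Teams2["home_ab_y"]

print(Teams2)
```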
That is the at bats of teams playing against Anaheim when Anaheim were the visitors. So each of these statistics can be matched up via the x's and y's. And that's what we're going to need to generate the aggregate statistics for on-base percentage and slugging percentage across the entire season. Okay, so just before we go on to do that, let's add one other thing. How many wins did each team have across the entire season? Well, the number of wins for each team must be the number of wins they had as a home team, which is the hwin_x variable, so home wins when they are the home team, plus the number of wins they had as a visiting team, which is the away wins in the visiting team statistics. That's the awin_y variable. So total wins for each team is hwin_x plus awin_y. And we now create that variable as the total number of wins for the team, which we will ultimately use to calculate the win percentage for each team.
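The wins calculation can be sketched in a couple of lines. The variable names hwin_x and awin_y are the ones named above. The new column name wins and the win counts are made up for illustration and are not real season totals.

```python
import pandas as pd

# Hypothetical merged data frame with invented win counts.
# hwin_x: the team's wins at home; awin_x: opponents' wins at the team's park.
# hwin_y: opponents' home wins against the team; awin_y: the team's wins on the road.
Teams2 = pd.DataFrame({
    "team": ["ANA", "BOS"], "year": ["1999", "1999"],
    "hwin_x": [2, 3], "awin_x": [1, 0],
    "hwin_y": [2, 1], "awin_y": [1, 2],
})

# Total wins = wins as the home team + wins as the visiting team.
Teams2["wins"] = Teams2["hwin_x"] + Teams2["awin_y"]

print(Teams2[["team", "year", "wins"]])
```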