Hi. In this video, I'm going to talk about the Walmart trip type classification challenge, which was held on Kaggle a couple of years ago. I won first place in that competition, and now I will tell you about the most interesting parts of the problem and about my solution. This presentation consists of four parts. First, we will state the problem. Second, we will understand the data format and the data preprocessing. Third, we will talk about the models, their relative quality, and their place in the general stacking scheme. And finally, we will overview some possibilities for generating new features here.

So, let's start. In our data, we had purchases people made at Walmart shops over two weeks, and we had to classify each visit into 38 trip types, or classes. Let's take a quick look at the features in the data. The trip type column represents the target. Visit number is an ID which unites purchases made by one customer in one shopping trip. For example, the customer who made visit number seven purchased two items, which are located in the second and the third rows of this data frame. Notice that all rows with the same visit number have the same trip type. An important point is that we have to predict a trip type for each visit number, not for each row in the train data. And as you can see, in the train set we have around 647,000 rows but only 95,000 visits.

Back to the features. The next feature is weekday, which obviously represents the weekday of the visit. Next is UPC; UPC is an exact ID of a purchased item. Then scan count, which is the exact number of items purchased; note that minus one here represents not a purchase but a return. The next feature, department description, with 68 unique values, is a broad category for an item. And finally, fineline number, with around 5,000 unique values, is a more refined category for an item.

So, now that we understand what these features represent, let's recall that we have to make one prediction for each visit number. Let's take a look at the data for visit number eight. We can see here that this particular visit has a lot of purchases in the category "paint and accessories", which means that trip type number 26 may represent a visit with most purchases in that category.

Now, how should we approach model training here? Let's take another look at the data and assess our options. Should we predict a trip type for each item on the list, or should we choose another way? Of course, both are possible, but with the first one, predicting a trip type for each row of the data set, we will miss important interactions between items which belong to the same visit. For example, a trip type may be number 26 if more than half of its items are from paint and accessories; but if we do not account for the interaction between these items, it can be quite hard to predict. So the second option, uniting all purchases of a visit and making a data set where each row represents a complete visit, seems more reasonable. And, as could be expected, this approach led to significant benefits in the competition.

I'm going to show you the easiest way to change the data format to the desired one. Let's take the department description feature as an example. First, let's group the data frame by visit number and calculate how many times each department description is present in a visit. Then, let's unstack the last groupby column, so we get a unique column for each department description value. Now, this is the format we wanted.
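To make this step concrete, here is a minimal pandas sketch of the reshaping just described. It assumes the raw purchase-level file is called train.csv and uses the competition's column names VisitNumber and DepartmentDescription; treat it as an illustration of the idea rather than the exact code from the winning solution.

```python
import pandas as pd

# Raw purchase-level data: one row per scanned item.
train = pd.read_csv('train.csv')

# Step 1: count how many times each department description occurs within a visit.
dept_counts = (train
               .groupby(['VisitNumber', 'DepartmentDescription'])
               .size())

# Step 2: unstack the last groupby level, so every department description value
# becomes its own column; departments absent from a visit get a count of zero.
visit_features = dept_counts.unstack(fill_value=0)

# Each row is now a complete visit, each column a department count.
print(visit_features.shape)
print(visit_features.head())
```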
Each row represents a visit, and each column is a feature describing that visit. We can use this groupby approach for other features besides department description. Also note that items in a visit are actually very similar to words in a text: after our transformation, each feature here represents counts, so we can apply ideas which usually work with text, for example a tf-idf transformation (a small sketch of this is shown below). As you can guess, a lot of possibilities emerge here.

Great. After this is done and we have the data in the desired format, let's move on to choosing a model. Based on what we have already discussed, can you guess whether we should expect a significant difference in scores between linear models and tree-based models here? Think about this a bit. For example, is there a reason why linear models would underperform compared to tree-based models? Yes, there is. Again, I'm talking about interactions here. Indeed, tree-based models and neural networks had a significant superiority in quality in this competition for this very reason. But still, one can use linear models and kNN to produce useful meta-features here: despite the fact that they did not capture interactions, they were a valuable asset in my general stacking scheme. I will not go into further details of stacking here, because we have already covered most of the ideas in other videos about competitions.

Instead, we'll talk a bit about feature generation. Besides interactions between items purchased in one visit, one could try to exploit interactions between features. An interesting and unexpected result here was that one fineline number can belong to multiple department descriptions, which means that fineline number is not simply a more detailed department description, as you might think. Using this interaction, one can further improve the model.

Another interesting feature generation idea was connected to the time structure of the data. Take a look at this plot: it shows how the weekday feature changes with the row number. It looks like the data is ordered by time, and it appears to consist of 31 days, but the train/test split wasn't time-based. So you could derive features like the day number in the data set, the number of a visit within its day, and the total amount of visits in a day (a rough sketch of this is also shown below).

So, this is it. We have just discussed the most interesting parts of this competition: changing the data format to a more suitable one, generating features while doing so, working with models and doing stacking, and finally, doing some additional feature engineering. The challenge itself proved useful and interesting, and I would recommend you check it out and try the approaches we have talked about.
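As promised, here is a small sketch of the tf-idf idea. It assumes the visit-by-department count matrix visit_features built in the earlier snippet and simply re-weights it with scikit-learn's TfidfTransformer; this is an illustration of the technique, not the exact pipeline from the winning solution.

```python
from sklearn.feature_extraction.text import TfidfTransformer

# visit_features: visits x departments matrix of raw counts (built earlier, assumption).
tfidf = TfidfTransformer()

# Departments that show up in almost every visit get down-weighted,
# rarer ones get emphasized -- just like common vs. rare words in documents.
visit_tfidf = tfidf.fit_transform(visit_features.values)

# A sparse matrix of the same shape, ready to be fed to a linear model
# or used as one feature block in a stacking scheme.
print(visit_tfidf.shape)
```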
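And here is a rough sketch of the time-based features. It assumes the rows of train.csv are already ordered by time and that a new day starts whenever the Weekday value changes between consecutive visits; in the competition one would build this over train and test together, since the split wasn't time-based, but the snippet uses only the train file for brevity.

```python
import pandas as pd

train = pd.read_csv('train.csv')

# One row per visit, keeping the original (assumed chronological) order.
visits = (train[['VisitNumber', 'Weekday']]
          .drop_duplicates('VisitNumber')
          .reset_index(drop=True))

# Assumption: a new day starts whenever Weekday changes between consecutive visits.
day_changes = visits['Weekday'].ne(visits['Weekday'].shift())
visits['day_number'] = day_changes.cumsum()

# Ordinal number of a visit within its day.
visits['visit_in_day'] = visits.groupby('day_number').cumcount() + 1

# Total number of visits in that day.
visits['visits_in_day'] = visits.groupby('day_number')['VisitNumber'].transform('count')

print(visits.head())
```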