[MUSIC] So just to remind ourselves, we're in a situation where we cannot use A/B testing or control groups. And we tried to restore information about the target event for the clients where we didn't know it. And here came one problem. We want to compare two models, but the way we restore the target event is model-specific. We do it independently for each model and calculate the expected benefit curve separately for each of them. In order to solve this, we will turn to metalearning, and we will discuss how to reconstruct the target event using all possible information coming from the predictions of all models. Metalearning: why should we bother about it, and how can it help us solve our problem? Generally speaking, metalearning is a special branch of machine learning that analyzes the performance of machine learning models: what it depends on, why it is better or worse. It collects metadata, data about model performance, in order to build and train its own models, or better say metamodels, that predict or explain why this or that model performs better and when it performs better. Some applications of metalearning are quite interesting. First of all, there is smart hyperparameter optimization. Everybody knows that one of the easiest ways to find optimal hyperparameters is just to make a large grid, iterate through it, calculate your model's performance metrics every time, and choose the combination of hyperparameters that is the best. For some models and some datasets, that can be resource-consuming and time-consuming. Now imagine that we have a database of different machine learning experiments, where we ran different models and different algorithms with different settings and different hyperparameters.
So by analyzing this set of data, with model performance as our target, we can build a metamodel that will advise a user to use this or that combination of hyperparameters for this kind of task. Hyperparameter optimization, in this case, will be faster, because we start looking for better hyperparameters in the more promising areas. The second application of metalearning is AutoML libraries. Usually, when you feed your dataset into an AutoML library, it goes through different algorithms, tries different hyperparameters and settings, tries different subsets of your initial dataset, and comes up with a final model that you can deploy. Metalearning, in this case, can help the AutoML library automatically choose the algorithm that is likely better for this ML task. It also saves your time and resources, and helps you reach your goal more efficiently. The final application (maybe there are some more) is predicting model performance. Basically, if your model has been implemented in a business process for a long time, you can systematically calculate its performance on historical data. And when you calculate its performance, you generate metadata for this model. Then you can apply ML methods to predict whether this model is going to keep performing at a high level in the next period, or whether there is a great risk that it is going to decay and its quality is going to fall. So it's useful for early warning signals within model performance monitoring. The intuition behind using metalearning in our problem is this: we have two models, and we're not sure which model works better for a particular client. So which model should we give more weight, which model should we trust more? It seems that this subfield of machine learning can help us. Let's see what we can do with that.
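To make the last application concrete, here is a minimal sketch of an early-warning metamodel, assuming (purely for illustration, with synthetic data and hypothetical feature names) that we track a deployed model's recent AUC values plus a drift score per monitoring period, and label periods where quality subsequently decayed:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical metadata: one row per monitoring period of a deployed model.
# Features: AUC over the last 3 periods plus a data-drift score; the target
# says whether the model's quality decayed in the following period.
rng = np.random.default_rng(0)
n = 200
auc_recent = rng.uniform(0.6, 0.8, size=(n, 3))   # AUC history (3 periods)
drift = rng.uniform(0.0, 1.0, size=(n, 1))        # e.g. a drift / PSI score
X_meta = np.hstack([auc_recent, drift])

# Toy label: decay is more likely when drift is high and AUC is already slipping
y_decay = ((drift[:, 0] > 0.6)
           & (auc_recent[:, -1] < auc_recent[:, 0])).astype(int)

# The metamodel turns this metadata into an early-warning score per period
meta_model = LogisticRegression().fit(X_meta, y_decay)
risk = meta_model.predict_proba(X_meta)[:, 1]
```

In practice the features and labels would come from a real monitoring log, not a random generator; the point is only the shape of the pipeline: performance metadata in, decay risk out.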
Again, we have two probabilistic estimates of the target event, which is not really good: it creates problems when constructing the benefit curve, and we want a single column of restored target events for the clients. You see, we have two estimates for this target event, one provided by the first model and one provided by the second. Naturally, we would like to use all the information, so we want to take into account both models' predictions and come up with one blended target event approximation. This gives us an incentive to assign a weight to each model's prediction. In order to calculate this weight, we will use a metalearning scheme. First of all, we have to enrich our dataset with metainformation about which model gives more precise predictions where the target event is known. In the upper part of the table, you can see two rows where XGBoost seems to work better, because its predictions are closer to the correct answer of the target event. This logic is quite similar to the logloss function: we can simply calculate the per-row logloss and mark those rows where the logloss for a particular model is lower, that is, better. Finally, we will have a flag, an additional column in our dataset, which tells us which client was better predicted by which model. It's binary, so in our case flag equals one means that XGBoost is closer to the correct answer than logistic regression, and flag equals zero otherwise. Again, I have to mention that these flags can be calculated only for clients with a known target event. For the rest of the clients, we just skip it. This flag is going to be our new target. This metainformation becomes the target for an ML method, or some simple statistical method, that will help us calculate the probability that XGBoost is better than logistic regression for a particular client. We expect this probability as the output of our metalearning algorithm.
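The flag construction above can be sketched in a few lines. This is a hedged illustration, not the course's exact code: the arrays `y_known`, `p_xgb`, and `p_logreg` are hypothetical placeholders for the known targets and the two base models' predicted probabilities.

```python
import numpy as np

def per_row_logloss(y, p, eps=1e-12):
    """Log loss of a single probabilistic prediction against a 0/1 label."""
    p = np.clip(p, eps, 1 - eps)  # avoid log(0)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

# Hypothetical values for clients whose target event IS known
y_known = np.array([1, 0, 1, 0])          # observed outcomes
p_xgb = np.array([0.9, 0.2, 0.6, 0.4])    # XGBoost probabilities
p_logreg = np.array([0.7, 0.1, 0.8, 0.3]) # logistic regression probabilities

loss_xgb = per_row_logloss(y_known, p_xgb)
loss_lr = per_row_logloss(y_known, p_logreg)

# flag = 1 where XGBoost is closer to the truth (lower per-row log loss)
flag = (loss_xgb < loss_lr).astype(int)   # -> [1, 0, 0, 0]
```

This `flag` column then serves as the target for the auxiliary classifier; it exists only for rows with a known outcome, exactly as described above.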
So the last column is called probability aux, because it comes from an auxiliary classifier, our metamodel that we have trained on the known data. Basically, this metamodel can take everything as its input: the clients' X's, some model parameters, or some dataset information, such as the number of rows, the number of columns, etc. As soon as we have this probability from the auxiliary classifier, we can produce one single probabilistic estimate, a Y approximation. It's done quite easily: we just calculate the mathematical expectation given the two predictions of our base models, and the weight is our tool for computing that expectation. Our weight equals the probability from the auxiliary classifier. The formula is on the right. To wrap up, metalearning uses models to predict model performance. This can be used when we don't know which model we should trust more. We can come up with a classifier that helps us understand, on our historical data, which model gave answers closer to the correct ones and which did not. A very important point is that this metamodel, this auxiliary classifier, can take the clients' X's as input. It means that we can figure out which client segments are likely to be well predicted by this or that model. As soon as we do that, we can get a single proxy for the target event within rejected clients by combining the two imputed target events, based on the first and the second model, with these particular weights. Anyway, no matter how sophisticated our metalearning, semi-supervised learning, or reject inference techniques are, we should understand that this is our best guess given that the target event is unknown. One of the best options is still to run a control group. [MUSIC]
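The blending step, the expectation over the two base predictions weighted by the auxiliary probability, can be written in one line. The values below are hypothetical, standing in for the base models' probabilities and the metamodel's output on clients with an unknown target event:

```python
import numpy as np

# Hypothetical values for two clients whose target event is UNKNOWN
p_xgb = np.array([0.80, 0.35])    # XGBoost's estimate of the target event
p_logreg = np.array([0.60, 0.55]) # logistic regression's estimate
p_aux = np.array([0.90, 0.25])    # P(XGBoost is the better model here)

# Blended proxy target: expectation over "which base model is right",
# with the auxiliary probability as the weight
y_hat = p_aux * p_xgb + (1 - p_aux) * p_logreg
# -> [0.78, 0.50]
```

When `p_aux` is near 1 the blend leans on XGBoost, near 0 it leans on logistic regression, and in between it interpolates; this `y_hat` is the single restored target column used to build the benefit curve.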