From the Lab to Production

How Alice tests their Machine Learning models
At Alice, machine learning models are essential tools for delivering a high-quality service. However, it is important to evaluate these models carefully to ensure that they are performing as expected. In this article, we will discuss the key steps involved in evaluating machine learning models, including creating an evaluation dataset, choosing the appropriate evaluation metrics, and identifying and addressing any problems with the model before they impact our service.
Why is it important to evaluate the models?
It is important to evaluate ML models to compare their performance and select the best one for a specific task. A model could make incorrect predictions, which could have negative consequences for our service. It could be biased, for example discriminating against certain groups of people. Or it could be computationally expensive to run, which could impact the response time of our service.
Creating a high-quality evaluation dataset is essential
In order to evaluate a machine learning model, we need to create a high-quality dataset. At Alice, we invest heavily in creating evaluation datasets that are representative of the data that our models will see in production. This helps us ensure that our models are performing as well as they can and that they are not biased toward any particular group of people or data points.
Three main steps to create our evaluation dataset
- We simulate what will happen later in production by adding real production samples to the dataset. This helps us to ensure that the model will perform well on the data that it will actually see in production.
- We also add State of the Art (SotA) datasets to get a wider point of view. This helps us to compare the performance of our model to other models and to identify any potential areas for improvement.
- Finally, we capture or generate Alice internal data to address already identified corner cases not covered by the previous data. This helps us to ensure that the model will be able to handle all the possible scenarios that it might encounter in production.
By following these steps, we can create an evaluation dataset that is representative of the data that the model will see in production. This helps us to ensure that the model is performing as well as it can.
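As a rough illustration of these three steps, the sketch below combines the sources into a single evaluation set. The function, the pandas-based approach and column names such as sample_id are assumptions for illustration, not Alice's actual pipeline.

```python
# Minimal sketch of combining the three data sources described above.
# Column names (e.g. "sample_id") are illustrative assumptions.
import pandas as pd

def build_evaluation_dataset(
    production_samples: pd.DataFrame,     # real samples captured in production
    sota_datasets: list[pd.DataFrame],    # public SotA benchmark datasets
    internal_corner_cases: pd.DataFrame,  # data captured/generated for known corner cases
) -> pd.DataFrame:
    """Concatenate the three sources and tag each row with its origin."""
    parts = [production_samples.assign(source="production")]
    parts += [d.assign(source=f"sota_{i}") for i, d in enumerate(sota_datasets)]
    parts.append(internal_corner_cases.assign(source="internal_corner_case"))

    dataset = pd.concat(parts, ignore_index=True)
    # Drop samples that appear in more than one source.
    return dataset.drop_duplicates(subset="sample_id", ignore_index=True)
```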
In addition to creating a representative dataset, we also try to balance the samples according to user demographics (e.g. age, gender, ethnicity). Some SotA datasets already include these attributes as labels. This helps us to measure any possible bias in the model and to ensure that it is fair and equitable.
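A simple way to check that balance is to look at the share of samples per demographic group. The snippet below is a sketch that assumes the dataset carries columns such as age_group, gender and ethnicity; the column names are illustrative.

```python
# Sketch of a demographic balance report; attribute column names are assumptions.
import pandas as pd

def demographic_report(dataset: pd.DataFrame,
                       attributes=("age_group", "gender", "ethnicity")) -> None:
    for attribute in attributes:
        if attribute not in dataset.columns:
            continue
        # Share of samples per group; strongly skewed shares flag under-represented groups.
        shares = dataset[attribute].value_counts(normalize=True).sort_index()
        print(f"\n{attribute}")
        print(shares.to_string(float_format=lambda x: f"{x:.1%}"))
```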
Metrics are essential for comparing the performance of machine learning models
They provide a quantitative measure of how well the model is performing, and they can be used to identify the best model for a particular task. Well-established metrics such as precision, recall, EER, MAP or AUC (depending on the task at hand) can be used to obtain a basic yet reliable overview of model performance.
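For a binary task (e.g. genuine vs. fraudulent samples), those metrics can be computed along the lines of the sketch below, using scikit-learn. The function name and the 0.5 decision threshold are illustrative assumptions, and EER is approximated from the ROC curve.

```python
# Illustrative computation of standard metrics for a binary task.
import numpy as np
from sklearn.metrics import (precision_score, recall_score, roc_auc_score,
                             roc_curve, average_precision_score)

def evaluate_binary(y_true: np.ndarray, y_score: np.ndarray,
                    threshold: float = 0.5) -> dict:
    y_pred = (y_score >= threshold).astype(int)

    fpr, tpr, _ = roc_curve(y_true, y_score)
    fnr = 1 - tpr
    # EER: operating point where false positive and false negative rates meet
    # (approximated by the closest point on the ROC curve).
    eer = fpr[np.nanargmin(np.abs(fpr - fnr))]

    return {
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "average_precision": average_precision_score(y_true, y_score),  # AP, related to MAP
        "auc": roc_auc_score(y_true, y_score),
        "eer": eer,
    }
```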
In addition to these standard metrics, we may also develop custom metrics, graphs and visualizations for specific tasks. For example, if we are building a model to detect fraud, we may develop a metric that measures the number of false positives and false negatives.
We also track metrics by different attributes to identify any bias as early as possible. For example, we may track precision and recall by user demographics to see if the model is performing differently for different groups of people.
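A hypothetical example of that slicing, reusing evaluate_binary from the previous sketch and assuming a results table with label, score and a demographic column:

```python
# Sketch of computing the same metrics per demographic group to surface bias.
# Column names and evaluate_binary() come from the previous (illustrative) snippet.
import pandas as pd

def metrics_by_group(results: pd.DataFrame, attribute: str = "gender") -> pd.DataFrame:
    rows = []
    for group, subset in results.groupby(attribute):
        metrics = evaluate_binary(subset["label"].to_numpy(), subset["score"].to_numpy())
        rows.append({attribute: group, **metrics})
    # Large gaps between groups are a signal to revisit the data or the model.
    return pd.DataFrame(rows)
```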
Finally, we use MLflow (among others) to store and track model metrics. MLflow is a powerful framework that allows us to easily compare the performance of different models and to visualize the results, taking advantage of its support for logging matplotlib figures and rendering HTML artifacts.
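As an illustration only (the experiment and run names are made up, and dummy data stands in for real results), logging an evaluation run and a ROC figure with MLflow could look like this:

```python
# Minimal sketch of tracking an evaluation run with MLflow.
import numpy as np
import mlflow
import matplotlib.pyplot as plt
from sklearn.metrics import RocCurveDisplay

# Dummy labels/scores stand in for the real evaluation results.
y_true = np.random.randint(0, 2, size=500)
y_score = np.random.rand(500)

mlflow.set_experiment("model-evaluation")
with mlflow.start_run(run_name="candidate-model"):
    mlflow.log_metrics(evaluate_binary(y_true, y_score))  # metrics dict from the earlier sketch
    # Log the ROC curve so it can be compared across runs in the MLflow UI.
    fig, ax = plt.subplots()
    RocCurveDisplay.from_predictions(y_true, y_score, ax=ax)
    mlflow.log_figure(fig, "roc_curve.png")
```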

Problem identification
After running the evaluation and selecting the best possible model, it is important to deploy the model in production with caution. At Alice, we perform a number of validations before the model is deployed to ensure that it is performing as expected and that it is not making any wrong predictions.
First, we use a few tools developed internally using Streamlit to assert that the model behaves as expected. These tools help us to identify any potential problems with the model, such as overfitting or underfitting. We also use these tools to test the model’s robustness to different types of input data.
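The snippet below gives a flavour of such a tool. It is a toy Streamlit app, not one of Alice's internal tools, and predict() is a stub standing in for the real inference code.

```python
# Toy Streamlit app for manually inspecting a candidate model's behaviour.
import streamlit as st
from PIL import Image

def predict(image: Image.Image) -> float:
    """Stub standing in for the candidate model's inference call."""
    return 0.42

st.title("Model sanity check")
uploaded = st.file_uploader("Upload an image", type=["jpg", "jpeg", "png"])
if uploaded is not None:
    image = Image.open(uploaded)
    st.image(image, caption="Input sample")
    score = predict(image)
    st.metric("Model score", f"{score:.3f}")
    if score < 0.5:
        st.warning("Low score: review this sample manually.")
```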
Second, we try to cheat and spoof the models in a live setting. This means that we try to trick the model into making wrong predictions, using fake images, handcrafted tools and other techniques. This helps us to identify any vulnerabilities in the model that could be exploited by attackers.

Once the performance of the model has been validated by the team, it is deployed in our Staging environment. This environment is a replica of our production service, so it allows us to test the model in a real-world setting. During the deployment process, the model has to pass a number of end-to-end tests to ensure that it is working properly.
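A hedged sketch of what one of those end-to-end tests might look like, using pytest conventions and requests; the staging URL, endpoint, fixture path, response fields and latency budget are all assumptions.

```python
# Illustrative end-to-end test against a staging deployment.
import requests

STAGING_URL = "https://staging.example.com/v1/predict"  # placeholder URL

def test_staging_model_responds_within_budget():
    with open("tests/fixtures/sample.jpg", "rb") as f:  # placeholder fixture
        response = requests.post(STAGING_URL, files={"image": f}, timeout=5)

    assert response.status_code == 200
    body = response.json()
    assert "score" in body                          # assumed response field
    assert response.elapsed.total_seconds() < 1.0   # assumed latency budget
```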
Finally, the team performs another validation step to make sure that the model is integrated properly and everything is running smoothly. If any malfunction is detected at any stage, the model is rolled back to the lab until it is fixed.
By following these steps, we help ensure that our ML models perform as expected in production and that problems are caught before they can affect our users. This helps to protect our users and our business from potential harm.
Julian.