Deep Learning Models: Monitoring and Re-training

Daniel Pérez Cabo


Deep learning has revolutionized numerous fields, from computer vision to natural language processing. However, training and maintaining these models can be challenging, and this is where monitoring and evaluation on representative datasets become very important.

Why monitor?

Before a model goes into production, it will have undergone a thorough evaluation on a dataset that is sufficiently representative of the scenario expected in production. Based on this evaluation, we assume the model will perform correctly once deployed. But is it safe to assume that the model will keep working correctly indefinitely?

This is where monitoring comes in. Once the model is in production, it is advisable, if not necessary, to check its behavior periodically to ensure that its performance remains within the expected parameters, and to act as soon as possible when it does not.

In the case of supervised models, i.e. models trained on data with some kind of labeling, real-time monitoring is not deterministic: we do not have the corresponding label for each production sample. So what can be done in these cases?

Knowledge of the production data

For these cases it is essential to know the scenario in which the model will operate: the distribution of the data, as well as any peculiarities that may exist within that distribution.

For example, let’s take a document fraud detection model that identifies documents that are black-and-white photocopies. We start from a baseline in which the model has been validated on a dataset representative of the images it will have to analyze in production. On this same dataset, an operating point has been set at which the model detects 90% of documents that are black-and-white photocopies (True Positive Rate, TPR) while wrongly rejecting 2% of documents that are not black-and-white photocopies (False Rejection Rate, FRR). Thus we have an estimated operating point:

  • TPR = 90%
  • FRR = 2%

If I have been able to correctly characterize this type of fraud in the production data, I will have an estimate of the total number of fraudulent samples to expect and of the total volume of documents reaching production. For example, on average 1,000 documents are processed per day, of which 100 are usually black-and-white photocopies.

With this it is possible to get an idea of how the model should be performing in production.

Of the 100 fraudulent documents (black-and-white photocopies), the model should detect 90% as fraud. That is 90 documents.

Of the 900 documents that are not black-and-white photocopies, the model should wrongly classify 2% as fraud. That is 18 documents.

Therefore, if approximately 108 documents per day are classified as black-and-white photocopies in production, the model is working as expected. If the number of documents flagged as black-and-white photocopies is far from that value, then one of two things may be happening: either we modeled the scenario incorrectly, or the distribution of genuine and fraudulent documents in our use case has changed; alternatively, the model may not be working as expected in this environment.
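The expected daily alert count above can be computed directly from the operating point, and modeling each decision as an independent Bernoulli trial gives a rough tolerance band around it. This is a minimal sketch using the article's illustrative figures; the function names and the 3-sigma threshold are assumptions, not part of any particular monitoring tool.

```python
import math

# Operating point measured offline (figures from the article's example)
TPR = 0.90   # fraction of black-and-white photocopies detected as fraud
FRR = 0.02   # fraction of genuine documents wrongly flagged as fraud

# Assumed daily production volumes (illustrative figures from the text)
n_fraud = 100    # expected black-and-white photocopies per day
n_genuine = 900  # remaining documents per day

def expected_alerts(n_fraud, n_genuine, tpr, frr):
    """Expected number of documents flagged as fraud per day."""
    return n_fraud * tpr + n_genuine * frr

def alert_std(n_fraud, n_genuine, tpr, frr):
    """Standard deviation of the daily alert count, treating each
    decision as an independent Bernoulli trial."""
    var = n_fraud * tpr * (1 - tpr) + n_genuine * frr * (1 - frr)
    return math.sqrt(var)

mean = expected_alerts(n_fraud, n_genuine, TPR, FRR)  # 90 + 18 = 108.0
std = alert_std(n_fraud, n_genuine, TPR, FRR)

def is_within_expectation(observed, k=3.0):
    """Flag a deviation if the observed count falls outside mean ± k·std."""
    return abs(observed - mean) <= k * std

print(mean)                         # 108.0
print(is_within_expectation(110))   # True: a normal day
print(is_within_expectation(150))   # False: worth investigating
```

A fixed band like this is only a first approximation; real traffic volumes fluctuate, so in practice the check is usually normalized by the day's total document count.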

Looking at the data

In addition to general monitoring of the system's behavior in production, another relevant and extremely useful task for continuous model improvement is the observation and analysis of the specific data on which the model is not working properly. Whether proactively, or because a deviation has been detected in the model's operating parameters (more samples flagged as fraud than expected, or fewer), it is worth analyzing why the model misclassifies production samples.

Here it is ideal to have tools that allow us to filter and visualize the samples the model processes, in order to understand what may be causing its fraud detection errors. If the visualized data show signs of some feature producing an undesired effect on the model's performance, this probably indicates that the feature was not properly represented in the training data distribution or in the evaluation prior to deployment.

But what can I do in these cases?

If possible, the ideal approach is to add these data to training and evaluation (correctly separated), but on many occasions (most of them) it is not possible to use these data directly. What then? The solution is to characterize these data as well as possible, and to capture or generate samples that are as representative as possible, so that training and evaluation can cover this new reality observed in production.
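For the "correctly separated" part, one common trick is to split newly captured samples deterministically by hashing their IDs, so the same sample never drifts between the training and evaluation sets across re-training runs. This is a minimal sketch under that assumption; the function name and the 20% evaluation fraction are illustrative choices.

```python
import hashlib

def split_new_samples(sample_ids, eval_fraction=0.2):
    """Deterministically assign each sample to train or eval by hashing
    its ID, so repeated runs always produce the same split."""
    train, evaluation = [], []
    for sid in sample_ids:
        # Map the ID to a stable bucket in [0, 100)
        bucket = int(hashlib.md5(sid.encode()).hexdigest(), 16) % 100
        (evaluation if bucket < eval_fraction * 100 else train).append(sid)
    return train, evaluation

ids = [f"sample-{i:03d}" for i in range(10)]
train, evaluation = split_new_samples(ids)
print(len(train), len(evaluation))
```

Because the split depends only on the ID, newly captured samples can be added incrementally without ever leaking a previously held-out sample into training.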

Improving the model based on the data

Once we have data on which the model's behavior is not as expected, it is time to re-train the model, validate it, and take it to production, starting the monitoring and improvement loop again.


Monitoring and re-training with observed data are essential practices in deep learning. They allow us to improve the performance of our models, detect problems early, and gain a better understanding of our models.

Diagram about the different stages of a machine learning process
