ETL for AI-based identity verification systems
State-of-the-art identity verification systems are based on artificial intelligence (AI) technology. These systems are fueled by data:
- Quantity: these systems need huge amounts of labeled (or unlabeled) data for training.
- Speed: these systems need to be retrained periodically with new data so they can detect the new kinds of spoofing that attackers constantly devise to fool them.
- Quality: the quality of the data is key to building reliable identity verification systems.
At Alice Biometrics, we apply DataOps principles to ensure the optimal performance of our verification systems. In this post, we explain the key role that ETL processes play in our operations and how they enhance our identity verification systems.
What is ETL?
Extract, Transform, Load (ETL) aims to ensure that data is accessible, meaningful, and ready for use by different teams or applications within an organization. These processes handle large volumes of data from various sources, refine that data, and make it suitable for analytics, machine learning (ML) models, reporting, etc. In this post, we’ll focus on the ETL processes that make data suitable for the ML models that power our identity verification systems.
Let’s break down each component of ETL and how it aligns with the operations at Alice Biometrics:
Extraction involves gathering data from different sources, which can include databases, files, external APIs, and more. These sources often have varying formats and structures.
In our case, we collect data from disparate sources: internal data from our database, captures performed by the Alice team to cover specific cases, and external data such as open datasets provided by companies or universities, among others.
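As a minimal sketch of this idea (the source names and fields below are hypothetical, not our actual schema), extraction can be thought of as funneling heterogeneously shaped records from several sources into one common shape:

```python
from dataclasses import dataclass

@dataclass
class RawRecord:
    """Minimal common shape for records gathered during extraction."""
    record_id: str
    source: str   # e.g. "internal_db", "team_capture", "open_dataset"
    payload: dict  # the original row, kept as-is for the Transform phase

def extract(sources: dict) -> list:
    """Gather records from heterogeneous sources into one common shape.

    `sources` maps a source name to an iterable of raw dicts; in practice
    each entry would come from a database query, a file read, or an API call.
    """
    records = []
    for source_name, rows in sources.items():
        for row in rows:
            records.append(RawRecord(record_id=str(row["id"]),
                                     source=source_name,
                                     payload=row))
    return records

# Example: two toy "sources" whose original rows have different structures.
batch = extract({
    "internal_db": [{"id": 1, "image_uri": "gs://bucket/a.png"}],
    "open_dataset": [{"id": "x-7", "file": "img/x7.jpg"}],
})
```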
Transformation is the process of cleaning, enriching, and structuring the raw data retrieved in the Extract phase, making the data consistent and suitable for analysis. This step is crucial for ensuring the accuracy and relevance of the data.
In our organization, this step involves attaching annotations or labels (previously produced by Alice’s team) to our data so that ML models can be trained in a supervised manner. Other kinds of annotations are added automatically by running inference with different ML models. In this step, the data is also structured to fit the schema of the tables defined in our data warehouse.
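A toy sketch of this step, under the assumption (purely illustrative, not our real schema) that labels live in a simple id-to-label mapping and the warehouse table expects three columns:

```python
def transform(records, labels):
    """Attach labels to raw records and shape rows to a warehouse schema.

    `records` are dicts with at least an "id" key; `labels` maps a record id
    to a label such as "genuine" or "spoof" (produced by an annotation team,
    or automatically by running inference with existing ML models).
    Records without a label are kept with label=None for unsupervised use.
    """
    rows = []
    for rec in records:
        rows.append({
            "record_id": str(rec["id"]),
            "label": labels.get(rec["id"]),  # None if unlabeled
            "capture_uri": rec.get("uri"),
        })
    return rows

# Example: one labeled and one unlabeled record.
rows = transform(
    [{"id": 1, "uri": "gs://bucket/a.png"}, {"id": 2, "uri": "gs://bucket/b.png"}],
    labels={1: "genuine"},
)
```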
Loading is the phase where the transformed data is stored in a destination system, typically a data warehouse or a data lake, making it readily available for data users or applications.
At Alice Biometrics, the transformed data is loaded into our data warehouse in BigQuery. This centralized repository serves as a key resource for ML users, allowing them to easily access the data that feeds the ML models.
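One common way to hand transformed rows to a warehouse batch-load job is newline-delimited JSON, a format BigQuery load jobs accept directly. A minimal sketch (the rows are illustrative):

```python
import json

def to_ndjson(rows) -> str:
    """Serialize transformed rows as newline-delimited JSON (NDJSON):
    one self-contained JSON object per line, ready for a batch load job."""
    return "\n".join(json.dumps(row, sort_keys=True) for row in rows)

payload = to_ndjson([
    {"record_id": "1", "label": "genuine"},
    {"record_id": "2", "label": "spoof"},
])
```

In practice this string would be written to a file or object storage and pointed at by the warehouse's load job, rather than inserted row by row.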
Various options exist for an analytical database, including data warehouses and data lakes, with different tools and services available for implementation. However, this will be discussed in a subsequent post.
Let’s now explore additional concepts associated with these ETL processes.
ETL is usually orchestrated through data pipelines: the processes are executed periodically in a controlled and automated manner (data orchestration).
There are multiple tools for orchestrating ETL processes through data pipelines, including Apache Airflow, AWS Glue, Azure Data Factory, etc. At Alice, we use Kubernetes CronJobs to orchestrate our ETL processes.
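For illustration, a Kubernetes CronJob manifest for a nightly ETL run could look like the following (the job name, image, and schedule are hypothetical, not our actual configuration):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: etl-nightly               # hypothetical job name
spec:
  schedule: "0 3 * * *"           # run every day at 03:00
  concurrencyPolicy: Forbid       # don't start a run while one is in flight
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: etl
              image: registry.example.com/etl:latest  # hypothetical image
              args: ["python", "-m", "etl.run"]       # hypothetical entrypoint
```

Kubernetes then takes care of scheduling, retries, and isolation, which is often enough orchestration for periodic batch jobs without adopting a dedicated workflow engine.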
Batch or streaming ETL
There are two different approaches to data processing in ETL:
- Batch ETL involves processing data in batches, at scheduled intervals or triggered by an event. Data is collected over a period, and then the entire batch is processed at once.
- Streaming ETL processes data in real-time, that is, as soon as it’s generated or ingested.
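A toy sketch of the difference, where "processing" is reduced to writing rows to a sink: streaming handles every event as it arrives (one write each), while batching accumulates events and writes them in chunks.

```python
def streaming_process(events, handle):
    """Streaming: each event is handled as soon as it arrives (one write each)."""
    writes = 0
    for event in events:
        handle([event])
        writes += 1
    return writes

def batch_process(events, handle, batch_size):
    """Batch: events accumulate and are handled in chunks (far fewer writes)."""
    writes = 0
    for start in range(0, len(events), batch_size):
        handle(events[start:start + batch_size])
        writes += 1
    return writes

# Same ten events, two approaches: 10 writes vs 2 writes to the sink.
sink = []
events = list(range(10))
stream_writes = streaming_process(events, sink.extend)
batch_writes = batch_process(events, sink.extend, batch_size=5)
```

The write-amplification difference shown here is exactly the kind of pressure a streaming approach puts on a data warehouse.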
The choice between batch and streaming ETL depends on the specific needs and requirements of the use cases, latency constraints, infrastructure capabilities, the volume of data to handle, and other relevant factors.
In this particular use case, we strongly recommend choosing batch processing whenever possible. At Alice, we initially embraced streaming processing but encountered various challenges, such as increased complexity and a high volume of write operations to the data warehouse. Given that our use case has relatively relaxed latency requirements (there’s no immediate need to access the data), along with other considerations, we made the strategic decision to transition to batch processing. This shift has allowed us to address these challenges effectively and optimize our overall data processing approach.
We also process other datasets in a bulk fashion due to their static nature.
Data infrastructure is the foundation upon which the entire DataOps ecosystem is built, including the ETL processes. It encompasses the technology and architecture that support the storage, movement, processing, scaling, and management of data. Mechanisms for monitoring and alerting are also essential to ensure the health, availability, and performance of the data infrastructure and the data pipelines it supports.
The data infrastructure at Alice Biometrics relies on both cloud and on-premises resources, including tools and services such as Elasticsearch, BigQuery, Kubernetes, RabbitMQ, and Kibana, among others.
By leveraging ETL processes, we collect data from various sources and make it suitable and easily accessible for fueling our ML models. This keeps our identity verification systems reliable: fed with quality data and able to adapt quickly to new data patterns.
Ultimately, by automating data preparation alongside other ML-specific tasks such as training, evaluation, and deployment, we achieve a fully automated pipeline for continuously retraining our ML models. This will be covered in a following post; in short, it allows our identity verification systems to be updated frequently, to capture emerging data patterns (e.g. new kinds of attacks), and to ensure that our systems are not negatively impacted.