Defining the core accuracy metrics of biometric systems
When evaluating biometric systems, one of the first steps is to define the core accuracy metrics. This step helps us assess whether the system behaves as we expect. A verification score measures how much two biometric samples resemble or differ from each other, and it normally comes in one of two forms: a) similarity: the higher the score, the more likely the samples belong to the same identity, and b) distance: the lower the score, the more likely the samples belong to the same identity. Starting from this basic concept, this article reviews the key metrics used to evaluate biometric systems.
Matching Score Distributions: A simple yet effective visualization of classifier performance that shows the distributions of genuine and impostor accesses with respect to the matching score. It normally includes the estimated verification threshold.
FMR – False Match Rate: An empirical estimate of the probability (percentage of times) at which the system incorrectly accepts that a biometric sample belongs to the claimed identity when the sample actually belongs to a different subject (impostor). This metric is an algorithm-level verification error. Given a vector of N_i impostor scores, v, the false match rate (FMR) is computed as the proportion at or below some threshold, T (scores are treated here as distances, so a comparison is accepted when its score does not exceed T):
$$\mathrm{FMR}(T) = \frac{1}{N_i} \sum_{j=1}^{N_i} H(T - v_j)$$
where H(x) is the unit step function, and H(0) is taken to be 1.
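The FMR definition above (the proportion of impostor scores at or below T, with ties counted as matches because H(0) = 1) can be sketched in a few lines of Python. The function name and the sample scores are illustrative assumptions, and the scores are treated as distances.

```python
def false_match_rate(impostor_scores, threshold):
    """Fraction of impostor distance scores at or below the threshold."""
    if not impostor_scores:
        raise ValueError("need at least one impostor score")
    # H(T - v) equals 1 when v <= T (H(0) is taken to be 1),
    # so a score exactly at the threshold counts as a false match.
    accepted = sum(1 for v in impostor_scores if v <= threshold)
    return accepted / len(impostor_scores)


# Made-up impostor distances, for illustration only.
impostor = [0.9, 0.7, 0.4, 0.8, 0.6]
print(false_match_rate(impostor, 0.5))
```

For similarity scores the comparison would be flipped to `v >= threshold`.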
FNMR – False Non-Match Rate: An empirical estimate of the probability (percentage of times) at which the system incorrectly rejects a claimed identity when the sample actually belongs to the trusted subject (genuine). This metric is an algorithm-level verification error.
Given a vector of N_g genuine scores, u, the false non-match rate (FNMR) is computed as the proportion above some threshold, T:
$$\mathrm{FNMR}(T) = 1 - \frac{1}{N_g} \sum_{j=1}^{N_g} H(T - u_j)$$
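The FNMR computation can be sketched the same way: count the genuine scores that fall strictly above T. The function name and sample scores are illustrative assumptions, and the scores are again treated as distances.

```python
def false_non_match_rate(genuine_scores, threshold):
    """Fraction of genuine distance scores strictly above the threshold."""
    if not genuine_scores:
        raise ValueError("need at least one genuine score")
    # A genuine comparison is wrongly rejected when its distance exceeds T.
    rejected = sum(1 for u in genuine_scores if u > threshold)
    return rejected / len(genuine_scores)


# Made-up genuine distances, for illustration only.
genuine = [0.1, 0.3, 0.2, 0.6, 0.25]
print(false_non_match_rate(genuine, 0.5))
```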
FAR/FRR – False Acceptance/Rejection Rate: FAR and FMR are often used interchangeably in the literature, as are FRR and FNMR. Their subtle difference is that FAR and FRR are system-level errors, which also account for samples that failed to be acquired or compared. Computing system-level verification errors requires observing:
a) FTC – Failure To Capture: occurs during sample generation, when the system is not able to capture the sample. The FTC rate is the percentage of times the biometric system fails to capture the biometric characteristic. It is only applicable when the system has an automatic capture function that can “count” such failures. Failure to capture can come from the device itself (e.g. hardware problems) or from the person “carrying” the biometric characteristic (e.g. extremely faint fingerprints),
b) FTE – Failure To Enrol: occurs when there is no reference for the subject. The FTE rate is the percentage of times users are not able to enrol in the system (the proportion of failed template generation attempts). Such errors typically occur when the system rejects poorly captured input during the enrolment phase, or when the software selectively refuses to process the input image (the software should throw an exception), as would typically happen if no face is detected. Ideally, the database contains only good-quality biometric samples, and
c) FTA – Failure To Acquire: occurs when there is no probe feature vector.
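One commonly cited way to fold acquisition failures into the algorithm-level rates, for a single-attempt verification transaction, is FAR = FMR × (1 − FTA) and FRR = FTA + FNMR × (1 − FTA). The sketch below assumes that relationship; the function name and the input values are illustrative, and enrolment failures (FTE) are deliberately left out.

```python
def system_level_rates(fmr, fnmr, fta):
    """Combine algorithm-level rates with the failure-to-acquire rate.

    Assumes a single-attempt transaction: a probe that cannot be acquired
    can never be falsely accepted, and always counts as a rejection.
    """
    for name, value in (("fmr", fmr), ("fnmr", fnmr), ("fta", fta)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1], got {value}")
    far = fmr * (1.0 - fta)
    frr = fta + fnmr * (1.0 - fta)
    return far, frr


# Illustrative values only: 1% FMR, 5% FNMR, 10% of probes not acquired.
far, frr = system_level_rates(fmr=0.01, fnmr=0.05, fta=0.10)
print(f"FAR={far:.4f}  FRR={frr:.4f}")
```

With no acquisition failures (FTA = 0) the two pairs of rates coincide, which is why the terms are so often used interchangeably.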
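Threshold for Specific Working Points: The threshold, T, can take on any value. We typically generate a set of thresholds from quantiles of the observed impostor scores, v, as follows. Given some false match rate range of interest, [FMR_L, FMR_U], we form a vector of K thresholds corresponding to FMR values evenly spaced on a logarithmic scale:
$$T_k = Q_v(\mathrm{FMR}_k), \qquad \log_{10}\mathrm{FMR}_k = \log_{10}\mathrm{FMR}_L + \frac{k-1}{K-1}\left[\log_{10}\mathrm{FMR}_U - \log_{10}\mathrm{FMR}_L\right], \quad k = 1, \dots, K$$
where Q_v is the quantile function of the impostor scores. With distance scores, the FMR_k quantile is the threshold at or below which a fraction FMR_k of the impostor scores falls; with similarity scores, the 1 − FMR_k quantile would be used instead.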
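A minimal sketch of this threshold selection, assuming distance scores (so the threshold for a target FMR is the corresponding lower quantile of the impostor scores) and a simple empirical quantile. All function names, the FMR range, and the synthetic scores are illustrative assumptions.

```python
import math


def log_spaced_fmr_targets(fmr_low, fmr_high, k):
    """K target FMR values evenly spaced on a log10 scale, endpoints included."""
    if not (0.0 < fmr_low < fmr_high <= 1.0) or k < 2:
        raise ValueError("need 0 < fmr_low < fmr_high <= 1 and k >= 2")
    lo, hi = math.log10(fmr_low), math.log10(fmr_high)
    return [10 ** (lo + (hi - lo) * i / (k - 1)) for i in range(k)]


def empirical_quantile(sorted_scores, p):
    """Smallest score such that at least a fraction p of scores is <= it."""
    n = len(sorted_scores)
    # The small epsilon guards against p * n landing just above an integer.
    index = max(math.ceil(p * n - 1e-9) - 1, 0)
    return sorted_scores[min(index, n - 1)]


def thresholds_for_fmr_targets(impostor_scores, targets):
    """One distance threshold per target FMR."""
    ordered = sorted(impostor_scores)
    return [empirical_quantile(ordered, p) for p in targets]


# Synthetic impostor distances 0.001, 0.002, ..., 1.000 (illustration only).
impostor = [i / 1000 for i in range(1, 1001)]
targets = log_spaced_fmr_targets(1e-3, 1e-1, 3)
for p, t in zip(targets, thresholds_for_fmr_targets(impostor, targets)):
    print(f"target FMR={p:.3f}  threshold={t:.3f}")
```

Very low FMR targets need many impostor comparisons: with N scores, no target below 1/N can be resolved.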
Note that, with distance scores, FAR values increase monotonically as the verification threshold increases, whilst FRR values do exactly the opposite (for similarity scores the directions are reversed). It is therefore not possible to minimise the two error rates simultaneously.
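EER – Equal Error Rate: The rate at which FMR is equal to FNMR. It is a single summary value that does not depend on a pre-selected threshold. With a finite set of scores the two rates rarely coincide exactly, so choosing
$$T_{EER} = \arg\min_{T} \left| \mathrm{FMR}(T) - \mathrm{FNMR}(T) \right|$$
ensures that the threshold found will satisfy the equality condition between FNMR and FMR as closely as possible.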
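One simple way to approximate the EER on a finite set of scores is to try every observed score as a threshold and keep the one where FMR and FNMR are closest. The sketch below does that, assuming distance scores; the function names and the sample scores are illustrative, and reporting the mean of the two rates at that threshold is one common convention rather than the only one.

```python
def error_rates(genuine, impostor, threshold):
    """Return (FMR, FNMR) at a distance threshold."""
    fmr = sum(1 for v in impostor if v <= threshold) / len(impostor)
    fnmr = sum(1 for u in genuine if u > threshold) / len(genuine)
    return fmr, fnmr


def equal_error_rate(genuine, impostor):
    """Threshold minimising |FMR - FNMR| and the mean of both rates there."""
    candidates = sorted(set(genuine) | set(impostor))

    def gap(t):
        fmr, fnmr = error_rates(genuine, impostor, t)
        return abs(fmr - fnmr)

    best = min(candidates, key=gap)
    fmr, fnmr = error_rates(genuine, impostor, best)
    return best, (fmr + fnmr) / 2


# Made-up distances, for illustration only.
genuine = [0.1, 0.2, 0.3, 0.6]
impostor = [0.4, 0.7, 0.8, 0.9]
threshold, eer = equal_error_rate(genuine, impostor)
print(f"threshold={threshold}  EER={eer}")
```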
HTER – Half Total Error Rate: HTER is defined as the average of FNMR and FMR at a given threshold (it is threshold-dependent), that is:
$$\mathrm{HTER}(T) = \frac{\mathrm{FMR}(T) + \mathrm{FNMR}(T)}{2}$$
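Because HTER depends on the threshold, the same scores give different values at different operating points. A short sketch, assuming distance scores and using illustrative names and made-up data:

```python
def half_total_error_rate(genuine, impostor, threshold):
    """Average of FMR and FNMR at a fixed distance threshold."""
    fmr = sum(1 for v in impostor if v <= threshold) / len(impostor)
    fnmr = sum(1 for u in genuine if u > threshold) / len(genuine)
    return (fmr + fnmr) / 2


# Made-up distances, for illustration only.
genuine = [0.1, 0.2, 0.3, 0.6]
impostor = [0.4, 0.7, 0.8, 0.9]
for t in (0.3, 0.5):
    print(f"T={t}  HTER={half_total_error_rate(genuine, impostor, t)}")
```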
ROC – Receiver Operating Characteristic curve: Plot of the rate of false positives (i.e. impostor attempts accepted) on the x-axis against the corresponding rate of true positives (i.e. genuine attempts accepted) on the y-axis, plotted parametrically as a function of the decision threshold. Other variants are also used (e.g. with the axes swapped); note that the shape of the curve is different in that case.
DET – Detection Error Trade-Off curve: It’s a modified ROC curve which plots error rates on both axes (false positives on the x-axis and false negatives on the y-axis). See an example in the following Figure.
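The points behind both curves come from the same threshold sweep: each threshold yields one (FMR, FNMR) pair for the DET curve and one (FMR, 1 − FNMR) pair for the ROC curve. The sketch below tabulates them, assuming distance scores; the function name and sample data are illustrative, and the actual plotting (DET curves are usually drawn on normal-deviate or logarithmic axes) is left to a plotting library.

```python
def error_curve(genuine, impostor):
    """(threshold, FMR, FNMR) triples, using every observed score as a threshold."""
    thresholds = sorted(set(genuine) | set(impostor))
    curve = []
    for t in thresholds:
        fmr = sum(1 for v in impostor if v <= t) / len(impostor)
        fnmr = sum(1 for u in genuine if u > t) / len(genuine)
        curve.append((t, fmr, fnmr))
    return curve


# Made-up distances, for illustration only.
genuine = [0.1, 0.2, 0.3, 0.6]
impostor = [0.4, 0.7, 0.8, 0.9]
for t, fmr, fnmr in error_curve(genuine, impostor):
    # DET point: (FMR, FNMR); ROC point: (FMR, true positive rate = 1 - FNMR).
    print(f"T={t:.1f}  DET=({fmr:.2f}, {fnmr:.2f})  ROC=({fmr:.2f}, {1 - fnmr:.2f})")
```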
APCER – Attack Presentation Classification Error Rate: Proportion of attack presentations using the same PAI (Presentation Attack Instrument) species incorrectly classified as bona fide presentations in a specific scenario. APCER is calculated for each evaluated PAI species (PAIS) using the equation below, where N_PAIS is the number of attack presentations for the given PAI species, and Res_i takes value 1 if the i-th presentation is classified as an attack presentation, and value 0 if it is classified as a bona fide presentation, given a threshold (θ).
$$\mathrm{APCER}_{PAIS} = 1 - \frac{1}{N_{PAIS}} \sum_{i=1}^{N_{PAIS}} Res_i$$
Then, to calculate the worst-case scenario, we use the following equation:
$$\mathrm{APCER}_{max} = \max_{PAIS \in P} \mathrm{APCER}_{PAIS}$$
We can observe that the number of PAIS in P, where P is the set of selected PAIS, strongly influences the APCER result. If we use fine-grained PAI species, there is a greater probability that a single PAI species will penalize the worst-case scenario.
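A small sketch of the per-species APCER and its worst case, assuming the classifier decisions have already been thresholded into Res values (1 = classified as attack, 0 = classified as bona fide). The species names, the decisions, and the function names are all hypothetical.

```python
def apcer_per_species(results_by_species):
    """APCER for each PAI species: share of attacks classified as bona fide."""
    rates = {}
    for species, results in results_by_species.items():
        if not results:
            raise ValueError(f"no attack presentations for species {species!r}")
        # Res = 1 means the attack was detected, so the error rate is 1 - mean(Res).
        rates[species] = 1 - sum(results) / len(results)
    return rates


def worst_case_apcer(results_by_species):
    """Species with the highest APCER and its rate."""
    rates = apcer_per_species(results_by_species)
    worst = max(rates, key=rates.get)
    return worst, rates[worst]


# Hypothetical decisions for three hypothetical PAI species.
attacks = {
    "print": [1, 1, 1, 0],
    "replay": [1, 0, 1, 0],
    "mask": [1, 1, 1, 1],
}
print(apcer_per_species(attacks))
print(worst_case_apcer(attacks))
```

Splitting one of these species into finer sub-species can only keep or raise the maximum, which is the effect described above.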
BPCER – Bona fide Presentation Classification Error Rate: Proportion of bona fide presentations incorrectly classified as attack presentations in a specific scenario.
$$\mathrm{BPCER} = \frac{1}{N_{BF}} \sum_{i=1}^{N_{BF}} Res_i$$
where N_BF is the number of bona fide presentations, and Res_i takes value 1 if the i-th presentation is classified as an attack presentation, and value 0 if it is classified as a bona fide presentation, given a threshold (θ).
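ACER – Average Classification Error Rate: This metric is defined as the average of the worst-case APCER over the PAI species, APCER_max, and the BPCER for a pre-defined decision threshold:
$$\mathrm{ACER} = \frac{\mathrm{APCER}_{max} + \mathrm{BPCER}}{2}$$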
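BPCER and ACER can be sketched together, again assuming decisions already thresholded into Res values (1 = classified as attack, 0 = classified as bona fide). The function names, the bona fide decisions, and the worst-case APCER value are illustrative assumptions.

```python
def bpcer(bona_fide_results):
    """Share of bona fide presentations classified as attacks."""
    if not bona_fide_results:
        raise ValueError("need at least one bona fide presentation")
    # Res = 1 on a bona fide presentation is an error, so BPCER is mean(Res).
    return sum(bona_fide_results) / len(bona_fide_results)


def acer(apcer_max, bpcer_value):
    """Average of the worst-case APCER and the BPCER at the same threshold."""
    return (apcer_max + bpcer_value) / 2


# Hypothetical decisions for five bona fide presentations.
bona_fide = [0, 0, 1, 0, 0]
# Hypothetical worst-case APCER measured at the same decision threshold.
worst_case = 0.5

rate = bpcer(bona_fide)
print(f"BPCER={rate:.2f}  ACER={acer(worst_case, rate):.2f}")
```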
All these metrics have been used for many years to compare face-PAD approaches. However, it is sometimes necessary to look for new metrics or to reformulate them in order to make fairer evaluations. In future articles we will review some new metrics that we use internally to improve the evaluation of our systems.