Why We Shouldn’t Trust Facial Recognition’s Glowing Test Scores

Teo Canmetin, Juliette Zaccour, Luc Rocher / Aug 18, 2025

In 2020, a Black man named Robert Williams was wrongfully arrested in Detroit after being misidentified by facial recognition software, a mistake police later admitted was due to a poor-quality surveillance image. In 2024, Shaun Thompson, a London-based knife crime-prevention activist, was wrongfully identified by live facial recognition technology as a criminal suspect and subjected to an “intimidating” and “aggressive” police stop. An independent review of the Live Facial Recognition trials by London’s Metropolitan Police found that out of 42 matches, only eight could be confirmed as absolutely accurate.

Failures in facial recognition technology are far from rare, and new examples continue to be reported in the press. Despite these repeated failures, the technology is rapidly being integrated into daily life, from airports and retail stores to policing.

Its deployment is often justified by impressive accuracy statistics. For the latest and best-performing models, standardized evaluations now report figures as high as 99.95% accuracy. Out of context, these numbers suggest that facial recognition has become extremely accurate. But there's a problem: these near-perfect numbers fail to reflect reality. Facial recognition appears to be significantly less accurate in real-world settings.

To understand why lab results differ from real-world outcomes, we must closely examine the gold-standard assessments conducted by organizations such as the US National Institute of Standards and Technology (NIST). These assessments play a significant role in establishing how algorithmic accuracy is measured and defined.

The NIST Face Recognition Technology Evaluation (FRTE) is a benchmark that has become central to legitimizing the global rollout of facial recognition systems, including their use by the UK’s Metropolitan Police Service. While invaluable for tracking advances in the field, these benchmarks are not necessarily a good guide to how systems cope with real-world challenges.

Lab evaluations appear objective, but they often ignore that the same technology could perform well in an airport yet fail on a rainy street or inside a crowded stadium. As a result, organizations can report impressive accuracy figures generalized from controlled settings, creating a misleading picture of how these systems truly perform in diverse, messy, and unpredictable real-world environments.

Benchmark datasets for evaluating facial recognition

To create a facial recognition benchmark, the first step is to build a dataset of facial images. These images serve as a reference pool within which algorithms attempt to correctly identify a single face. Several factors limit a benchmark’s ability to predict the accuracy of video surveillance-based identification in the real world. These include image quality, dataset size, and demographic diversity.
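For readers unfamiliar with how these benchmarks exercise an algorithm, the sketch below shows the basic one-to-many search they evaluate: a probe image's embedding is compared against every face in the reference pool, and the best match above a threshold is returned. It is a minimal illustration with made-up embeddings and an arbitrary threshold, not any vendor's or NIST's actual pipeline.

```python
# Minimal sketch of one-to-many identification against a gallery of face
# embeddings. The embeddings and threshold are illustrative assumptions.
import numpy as np

def identify(probe: np.ndarray, gallery: np.ndarray, threshold: float = 0.6):
    """Return (index, score) of the best-matching gallery face, or None."""
    # Cosine similarity between the probe embedding and every gallery embedding.
    probe = probe / np.linalg.norm(probe)
    gallery = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    scores = gallery @ probe
    best = int(np.argmax(scores))
    # Only report a match if the best score clears the decision threshold.
    return (best, float(scores[best])) if scores[best] >= threshold else None

# Toy usage: a gallery of 10,000 random 512-dimensional embeddings.
rng = np.random.default_rng(0)
gallery = rng.normal(size=(10_000, 512))
print(identify(rng.normal(size=512), gallery))
```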

Issue 1: Benchmark images are overly idealized compared to real-world conditions

Uniform static images make it easier for computer scientists to compare different models and track which algorithms appear better at distinguishing faces. But clear, static images differ significantly from the variable and often degraded conditions found in live surveillance footage.

Deployed facial recognition systems must contend with a range of visual challenges that significantly hinder their performance. These obstacles can lead to false identifications when individuals are incorrectly matched. They can also lead to missed identifications when the system fails to recognize someone it should.
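In evaluation terms, these two failure modes are usually reported as the false match rate (different people wrongly accepted as the same person) and the false non-match rate (the same person wrongly rejected). The snippet below is a minimal sketch of how such rates can be computed from comparison scores at a decision threshold; the scores and threshold are illustrative stand-ins, not benchmark data.

```python
# Illustrative only: how the two error types are typically scored.
# "Impostor" scores come from comparing different people; "genuine"
# scores come from comparing two images of the same person.
import numpy as np

def error_rates(genuine_scores, impostor_scores, threshold):
    genuine = np.asarray(genuine_scores)
    impostor = np.asarray(impostor_scores)
    # False match rate: share of different-person comparisons wrongly accepted.
    fmr = float(np.mean(impostor >= threshold))
    # False non-match rate: share of same-person comparisons wrongly rejected.
    fnmr = float(np.mean(genuine < threshold))
    return fmr, fnmr

# Made-up scores, purely for illustration.
fmr, fnmr = error_rates([0.82, 0.91, 0.55, 0.78], [0.30, 0.64, 0.12, 0.41], threshold=0.6)
print(f"FMR={fmr:.2f}, FNMR={fnmr:.2f}")
```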

In operational settings, faces may be partially obscured. Imagine you are walking down the street. Perhaps you are wearing sunglasses or a mask, items that can obscure anatomical traits. Or perhaps the light is a little more dappled than usual, and the camera doesn’t see you head-on. Plenty of other factors can degrade image quality, including weather, motion blur, and crowd density. These real-world variables are typically not captured in controlled benchmark evaluations.

Efforts have been made in the NIST FRTE to include images that better reflect real-world conditions. In addition to mugshot images, fourteen percent of the dataset now comprises images captured by US-based service desk webcams, intended to simulate less controlled environments. However, these images still fail to reflect the complexity of images captured in real-world deployments.

Issue 2: Benchmark datasets are too small

In facial recognition, the relationship between accuracy and population size is complex. As a general rule of thumb, when the pool of individuals to identify from grows larger, the task becomes significantly harder for the algorithm, and its accuracy tends to decline. Benchmark tests such as the NIST mugshot evaluation use large datasets of up to 12 million individual faces and report near-perfect accuracy even at these scales.

But in reality, algorithms are deployed at much larger scales, some scanning hundreds of millions of faces on the internet. When scaled to population-level use such as nationwide policing, our recent research shows that accuracy rates could fall much further, amplifying the rate of false matches. Despite the high-stakes implications of deploying this technology in policing, current benchmarks do little to reflect how algorithmic performance degrades at scale.
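To see why scale matters, consider a back-of-the-envelope calculation. If each individual comparison carries even a tiny false match rate, and a single search compares a probe against every face in the database, the chance of at least one false match per search climbs rapidly with database size. The numbers below are illustrative assumptions, not NIST results, and treating comparisons as independent is a simplification, but they show how near-perfect per-comparison accuracy can still produce routine false matches at population scale.

```python
# Back-of-the-envelope sketch, not a benchmark result: with a small
# per-comparison false match rate (FMR) and comparisons treated as
# independent, the chance that one search returns at least one false
# match grows quickly with gallery size.
per_comparison_fmr = 1e-6  # illustrative assumption

for gallery_size in (10_000, 1_000_000, 100_000_000):
    # P(at least one false match) = 1 - (1 - FMR)^N under independence.
    p_false_match = 1 - (1 - per_comparison_fmr) ** gallery_size
    print(f"{gallery_size:>11,} faces -> {p_false_match:.1%} chance of a false match per search")
```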

Issue 3: Benchmark datasets are not representative of real-world demographics

Facial recognition algorithms are built using training datasets which can lack real-world demographic diversity. This inherent bias leads to disparities in model performance across different groups. For example, a model trained on lighter skin tones may have lower accuracy with darker skin tones. The consequences are thus most likely felt by racial and ethnic minorities. Similarly, underrepresentation of various ages, genders, and other facial characteristics in training data can significantly reduce performance for those groups.

Demographics represented in evaluation datasets are therefore critical for assessing a benchmark’s reliability. Evaluation datasets must accurately reflect the diversity of the populations on which the technology will be used. One key benchmark, produced by the UK's National Physical Laboratory (NPL), serves as central evidence for the Metropolitan Police's deployment of live facial recognition across London. While this report goes beyond NIST’s benchmarks by testing algorithms in live settings, its evaluation data is often unrepresentative, especially for specific deployment contexts.

For instance, despite concerns that school children are increasingly subjected to humiliating and aggressive police stops as a result of live facial recognition, the evaluation data used to test algorithms remains far less representative of younger age ranges. In the UK NPL report, individuals between the ages of 12 and 18 are under-represented, and those under the age of 12 are omitted entirely, calling into question the deployment of this technology on youth.

Charting a research agenda for rigorous evaluations

In contrast with the widespread citation of benchmark figures showing near-perfect accuracy, actual performance of deployed facial recognition remains far more complex to evaluate.

To address this critical gap and ensure responsible deployment, a robust, independent research agenda is urgently needed to evaluate facial recognition technology in real-world contexts. This agenda must move beyond controlled lab environments and focus on understanding how these systems truly perform under operational conditions.

This could include:

  • developing realistic benchmarks and evaluation methods tailored to operational contexts and intended use;
  • assessing system performance under large-scale conditions and across diverse demographic groups;
  • defining clear and potentially legally-binding thresholds for ‘sufficient accuracy’ in high-stakes applications; and
  • facilitating independent research and oversight by providing secure access to real-world deployment data.

At the moment, we believe that research evaluating real-world impact is both essential and largely missing, because of how difficult it is to undertake this line of work independently.

Without dedicated and transparent research into these critical areas, decisions about the deployment of facial recognition systems will continue to be based on out-of-context lab results, rather than a clear understanding of their real-world impacts and inherent limitations.

Authors

Teo Canmetin
Teo Canmetin is a Shirley Scholar (MSc) and Research Assistant at the Oxford Internet Institute, University of Oxford.
Juliette Zaccour
Juliette Zaccour is a PhD candidate at the Oxford Internet Institute. Her work centers on how data sharing practices impact research integrity, AI evaluations, and data privacy. She has a background in cultural analysis and holds a Master in Information.
Luc Rocher
Luc Rocher is an Associate Professor, Senior Research Fellow, and UKRI Future Leaders Fellow at the Oxford Internet Institute, University of Oxford. They lead the Synthetic Society Lab, a research group that conducts human-centered computing research to understand how data, digital infrastructure, a...
