Comparative Analysis of DeepFace Attribute Classification on the FairFace Validation Set
Publication Date : Apr-21-2026
Abstract :
Automated facial analysis systems increasingly estimate demographic attributes such as age, gender, and race from images, yet performance disparities across demographic groups remain a substantial concern. This study evaluates the off-the-shelf DeepFace attribute classifier, accessed through the deepface Python library, on the FairFace validation set. FairFace labels are treated as ground truth, and DeepFace is applied without fine-tuning. To align the label spaces, FairFace’s East Asian and Southeast Asian categories are merged into a single Asian class to match DeepFace’s race output. Performance is assessed using age-range accuracy; mean signed age error (age bias); overall accuracy; balanced accuracy; per-class precision, recall, and F1 score; macro-averaged F1; Cohen’s kappa coefficient (κ); and chi-square tests of race-related error disparities. On 10,954 validation images, DeepFace achieved an age-range accuracy of 0.28437 with a positive age bias of 3.65926 years, a gender accuracy of 0.72074, and a race accuracy of 0.59540. Gender results were strongly asymmetric, with female recall of 0.44537 against male recall of 0.96616, and race results showed substantially lower F1 scores for the Indian, Latino/Hispanic, and Middle Eastern groups than for the Asian and Black groups. These findings suggest that strong face-verification performance does not necessarily translate into equitable demographic attribute prediction, and they underscore the need for subgroup-level evaluation and fairness-aware model development.
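As an illustration of the label alignment and several of the metrics named above, the following is a minimal pure-Python sketch rather than the study's actual evaluation code. The label spellings, the helper names, the toy data, and the sign convention for age bias (predicted minus true, so that a positive value means overestimation) are all assumptions made for the example.

```python
from collections import Counter

# Assumed label spellings; the abstract states only that the two
# FairFace categories are merged into a single Asian class.
MERGE = {"East Asian": "Asian", "Southeast Asian": "Asian"}


def merge_race(label):
    """Map a FairFace race label onto the merged label space."""
    return MERGE.get(label, label)


def mean_signed_error(y_true, y_pred):
    """Mean of (predicted - true); positive means overestimation (assumed convention)."""
    return sum(p - t for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_recall(y_true, y_pred):
    """Recall for each class that occurs in the ground truth."""
    support = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {c: hits[c] / n for c, n in support.items()}


def balanced_accuracy(y_true, y_pred):
    """Unweighted mean of the per-class recalls."""
    recalls = per_class_recall(y_true, y_pred)
    return sum(recalls.values()) / len(recalls)


def cohen_kappa(y_true, y_pred):
    """Agreement between labels and predictions, corrected for chance agreement."""
    n = len(y_true)
    p_observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    p_expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / n**2
    return (p_observed - p_expected) / (1 - p_expected)


# Toy data (hypothetical, not drawn from FairFace).
true_gender = ["Female"] * 4 + ["Male"] * 4
pred_gender = ["Female", "Female", "Male", "Male"] + ["Male"] * 4

recalls = per_class_recall(true_gender, pred_gender)
bal_acc = balanced_accuracy(true_gender, pred_gender)
kappa = cohen_kappa(true_gender, pred_gender)
age_bias = mean_signed_error([25, 35, 45], [30, 38, 46])
```

In this toy case, half of the female faces are predicted as male while every male face is classified correctly. Per-class recall exposes that asymmetry directly, and kappa shows how far the agreement sits above what the skewed predictions would reach by chance.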
