When Text and Image Meet Face Emotion Retrieval: Benchmark and a Variegated Dataset
Abstract
Despite advances in deep learning, automatic facial emotion recognition still depends on large manually annotated datasets and suffers from demographic imbalance. While vision-language models (VLMs) show promise in aligning visual and textual emotion data, content-based facial emotion retrieval remains underexplored, and larger models exhibit demographic sensitivity. This study introduces Facial INtensity of Emotions (FINE), the first large-scale dataset balanced across gender, race, and Ekman's six basic emotions at four emotion intensity levels. We propose a retrieval protocol and a baseline built on fine-tuned VLMs based on contrastive language-image pre-training (CLIP), demonstrating that fine-tuning on FINE consistently improves accuracy and reduces performance variance across demographic groups and emotion classes, thereby mitigating representation bias. However, misclassification rates still increase at the highest intensity levels even after fine-tuning, indicating that expression intensity remains an open challenge. This work confirms the value of fine-tuning for enhancing generalization and reducing variability in emotion retrieval, and establishes FINE as a robust benchmark.
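The retrieval protocol described in the abstract pairs an emotion text query with a gallery of face images and ranks the gallery by image-text similarity in a shared CLIP embedding space. The Python sketch below illustrates that text-to-image retrieval step under stated assumptions: the checkpoint name, prompt wording, and image file names are illustrative placeholders, not the authors' exact setup.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Assumed backbone: the paper fine-tunes CLIP-based VLMs, but the exact
# checkpoint is not specified here.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical gallery of face crops and one Ekman-emotion text query.
image_paths = ["face_001.jpg", "face_002.jpg", "face_003.jpg"]
images = [Image.open(p).convert("RGB") for p in image_paths]
query = "a photo of a face expressing intense anger"

with torch.no_grad():
    # Encode the gallery images and the text query into the shared space.
    img_emb = model.get_image_features(
        **processor(images=images, return_tensors="pt"))
    txt_emb = model.get_text_features(
        **processor(text=[query], return_tensors="pt", padding=True))

# L2-normalize and rank the gallery by cosine similarity to the query.
img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
scores = (txt_emb @ img_emb.T).squeeze(0)
for rank, idx in enumerate(scores.argsort(descending=True).tolist(), start=1):
    print(f"{rank}. {image_paths[idx]} (score={scores[idx]:.3f})")

Fine-tuning on FINE would update these encoders contrastively on emotion-labeled face-text pairs before the ranking step; the sketch shows only the retrieval side of the protocol.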
Keywords:
Visualization, Emotion recognition, Accuracy, Systematics, Sensitivity, Protocols, Training data, Benchmark testing, Data models, Faces
Published
September 30, 2025
How to Cite
ANDRADE, Fillipe; CARIGÉ, Rui; SCHINEIDER, Arthur; OLIVEIRA, Luciano. When Text and Image Meet Face Emotion Retrieval: Benchmark and a Variegated Dataset. In: CONFERENCE ON GRAPHICS, PATTERNS AND IMAGES (SIBGRAPI), 38., 2025, Salvador/BA. Proceedings [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 206-211.
