Fine-Tuning a Video Masked Autoencoder to Develop an Augmented Reality Application for Brazilian Sign Language Interpretation


Approximately 5% of the Brazilian population experiences some level of hearing loss. Over the years, various methods have been proposed to facilitate communication between individuals with and without hearing disabilities. Typically, these solutions focus on techniques or mechanisms that help people with hearing disabilities understand communication from those without them. Most methods rely on neural networks or similar architectures and often overlook the large vision models developed in recent years. In this paper, we propose a high-level design for a head-mounted device (HMD) that translates Brazilian Sign Language into spoken or written words. As an initial result, we fine-tuned VideoMAE, a Vision Transformer architecture, on the MINDS-Libras video dataset for Brazilian Sign Language. The results are promising: accuracy reached up to 84% with only 50 videos per word in the training subset.
Keywords: Vision Transformers, Augmented Reality, Brazilian Sign Language


