Fine-Tuning a Video Masked Autoencoder to Develop an Augmented Reality Application for Brazilian Sign Language Interpretation

Abstract


Approximately 5% of the Brazilian population experiences some level of hearing loss. Over the years, various methods have been proposed to facilitate communication between individuals with and without hearing disabilities. These solutions typically focus on techniques or mechanisms that help people with hearing disabilities understand communication from those without them. Most rely on conventional neural networks or similar architectures and overlook the large vision models developed in recent years. In this paper, we propose a high-level design for a head-mounted device (HMD) that translates Brazilian Sign Language (Libras) into spoken or written words. As an initial result, we fine-tuned VideoMAE, a Vision Transformer architecture, on the MINDS-Libras video dataset and obtained promising results: accuracy reached up to 84% with only 50 training videos per word.
Keywords: Vision Transformers, Augmented Reality, Brazilian Sign Language
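
For readers who want a concrete picture of the fine-tuning step described in the abstract, the sketch below shows one way to fine-tune a VideoMAE classifier with the Hugging Face transformers library. This is a minimal illustration under stated assumptions, not the authors' code: the MCG-NJU/videomae-base checkpoint, the 20-class label count (MINDS-Libras covers 20 signs per Rezende et al. 2021), and the dummy input clip are assumptions; real training would iterate over sampled MINDS-Libras clips.

```python
import numpy as np
import torch
from transformers import VideoMAEImageProcessor, VideoMAEForVideoClassification

NUM_CLASSES = 20  # assumption: one class per MINDS-Libras sign

processor = VideoMAEImageProcessor.from_pretrained("MCG-NJU/videomae-base")
# Loading the self-supervised checkpoint into a classification model
# attaches a randomly initialized classification head to be fine-tuned.
model = VideoMAEForVideoClassification.from_pretrained(
    "MCG-NJU/videomae-base", num_labels=NUM_CLASSES
)

# Stand-in for one MINDS-Libras clip: 16 RGB frames of 224x224 pixels.
# In practice, frames would be sampled uniformly from each video.
frames = list(np.random.randint(0, 256, (16, 224, 224, 3), dtype=np.uint8))
inputs = processor(frames, return_tensors="pt")  # pixel_values: (1, 16, 3, 224, 224)
label = torch.tensor([0])  # class index of the signed word

model.train()
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

outputs = model(pixel_values=inputs["pixel_values"], labels=label)
outputs.loss.backward()  # cross-entropy loss from the classification head
optimizer.step()
```

In this setup only the classification head starts from scratch; the pre-trained encoder weights are reused, which is why fine-tuning can work with as few as 50 videos per word.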

References

Rahaf Abdulaziz Alawwad, Ouiem Bchir, and Mohamed Maher Ben Ismail. 2021. Arabic sign language recognition using Faster R-CNN. International Journal of Advanced Computer Science and Applications 12, 3 (2021).

Nojood M Alharthi and Salha M Alzahrani. 2023. Vision Transformers and Transfer Learning Approaches for Arabic Sign Language Recognition. Applied Sciences 13, 21 (2023), 11625.

Aya F Alnabih and Ashraf Y Maghari. 2024. Arabic Sign Language letters recognition using vision transformer. Multimedia Tools and Applications (2024), 1–15.

Kshitij Bantupalli and Ying Xie. 2018. American sign language recognition using deep learning and computer vision. In 2018 IEEE International Conference on Big Data (Big Data). IEEE, 4896–4899.

IBGE Censo Demográfico. 2010. Available at: [link]. Accessed on 06-19-2024.

A. Elhagry and R. G. Elrayes. 2021. Egyptian Sign Language Recognition Using CNN and LSTM. arXiv preprint arXiv:2107.13647 (2021).

Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. 2022. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 16000–16009.

H. Hienz, B. Bauer, and K. F. Kraiss. 2000. Video-based continuous sign language recognition using statistical methods. In Proceedings of the International Conference on Automatic Face and Gesture Recognition (FG 2000). 440–445.

Shagun Katoch, Varsha Singh, and Uma Shanker Tiwary. 2022. Indian Sign Language recognition system using SURF with SVM and CNN. Array 14 (2022), 100141.

Deep R Kothadiya, Chintan M Bhatt, Tanzila Saba, Amjad Rehman, and Saeed Ali Bahaj. 2023. Signformer: Deep vision transformer for sign language recognition. IEEE Access 11 (2023), 4730–4739.

Nobuhiko Mukai, Shoya Yagi, and Youngha Chang. 2021. Japanese sign language recognition based on a video accompanied by the finger images. In 2021 Nicograph International (NicoInt). IEEE, 23–26.

Samara Fernandes Pimentel, Paulo Weskley de Almeida Ferreira, Luciano Teran, and Marcelle Pereira Mota. 2020. LocaLibras: an inclusive geolocation application. In Proceedings of the 19th Brazilian Symposium on Human Factors in Computing Systems. 1–6.

Tamires Martins Rezende, Sílvia Grasiella Moreira Almeida, and Frederico Gadelha Guimarães. 2021. Development and validation of a Brazilian sign language database for human gesture recognition. Neural Computing and Applications 33, 16 (2021), 10449–10467.

Marcelo Sandoval-Castaneda, Yanhong Li, Diane Brentari, Karen Livescu, and Gregory Shakhnarovich. 2023. Self-supervised video transformers for isolated sign language recognition. arXiv preprint arXiv:2309.02450 (2023).

Zhan Tong, Yibing Song, Jue Wang, and Limin Wang. 2022. VideoMAE: Masked autoencoders are data-efficient learners for self-supervised video pre-training. Advances in Neural Information Processing Systems 35 (2022), 10078–10093.

Tiago Trotta, Leonardo Rocha, Telma Rosa de Andrade, Marcelo de Paiva Guimarães, and Diego Roberto Colombo Dias. 2022. C-Libras: A Gesture Recognition App for the Brazilian Sign Language. In International Conference on Computational Science and Its Applications. Springer, 603–618.

Christian Vogler and Dimitris Metaxas. 1997. Adapting hidden Markov models for ASL recognition by using three-dimensional computer vision methods. In 1997 IEEE International Conference on Systems, Man, and Cybernetics. Computational Cybernetics and Simulation, Vol. 1. IEEE, 156–161.
Published
09/30/2024
FANUCCHI, Rodrigo Zempulski; GALVÃO JUNIOR, Arlindo Rodrigues; MARQUES, Gabriel da Mata; RODRIGUES, Lucas Brandão; SOARES, Anderson da Silva; SOARES, Telma Woerle Lima. Fine-Tuning a Video Masked Autoencoder to Develop an Augmented Reality Application for Brazilian Sign Language Interpretation. In: SIMPÓSIO DE REALIDADE VIRTUAL E AUMENTADA (SVR), 26., 2024, Manaus/AM. Proceedings [...]. Porto Alegre: Sociedade Brasileira de Computação, 2024. p. 275-278.