Adapting ASR Models to Technical Scenarios: A Case Study in the Brazilian Automotive Repair Domain
Abstract
This work proposes a pipeline for adapting automatic speech recognition (ASR) models to the domain of automotive repair shops in Brazilian Portuguese. The process includes real data collection, manual transcription, and dataset curation, with adjustments to the Conformer and Wav2Vec 2.0 models, while the Whisper model serves as a comparative baseline. The approach aims to improve speech recognition accuracy in noisy environments with specific technical vocabulary. The Conformer model achieves the best results, with a word error rate of 12.97 percent and a character error rate of 5.46 percent, surpassing Whisper Large-v3 in both transcription accuracy and inference speed.
References
Ahlawat, H., Aggarwal, N., and Gupta, D. (2025). Automatic speech recognition: A survey of deep learning techniques and approaches. International Journal of Cognitive Computing in Engineering, 6:201–237.
Alvarenga, J. P. R., Merschmann, L. H. d. C., and Luz, E. J. d. S. (2023). A data-centric approach for portuguese speech recognition: Language model and its implications. IEEE Latin America Transactions, 21(4):546–556.
Babu, A., Wang, C., Tjandra, A., Lakhotia, K., Xu, Q., Goyal, N., Singh, K., Von Platen, P., Saraf, Y., Pino, J., et al. (2021). Xls-r: Self-supervised cross-lingual speech representation learning at scale. arXiv preprint arXiv:2111.09296.
Baevski, A., Zhou, H., Mohamed, A., and Auli, M. (2020). wav2vec 2.0: a framework for self-supervised learning of speech representations. In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS ’20, Red Hook, NY, USA. Curran Associates Inc.
Candido Junior, A., Casanova, E., Soares, A., de Oliveira, F. S., Oliveira, L., Junior, R. C. F., da Silva, D. P. P., Fayet, F. G., Carlotto, B. B., Gris, L. R. S., and Aluísio, S. M. (2022). Coraa asr: a large corpus of spontaneous and prepared speech manually validated for speech recognition in brazilian portuguese. Lang. Resour. Eval., 57(3):1139–1171.
Chen, S. F., Beeferman, D., and Rosenfeld, R. (1998). Evaluation metrics for language models. Carnegie Mellon University.
de Azevedo, D. M., Rodrigues, G. S., and Ladeira, M. (2022). A probabilistically-oriented analysis of the performance of asr systems for brazilian radios and tvs. In Intelligent Systems: 11th Brazilian Conference, BRACIS 2022, Campinas, Brazil, November 28 – December 1, 2022, Proceedings, Part II, pages 169–180, Berlin, Heidelberg. Springer-Verlag.
Gonçalves, Y., Alves, J., Sá, B., Silva, L., Macedo, J., and da Silva, T. C. (2024). Speech recognition models in assisting medical history. In Anais do XXXIX Simpósio Brasileiro de Bancos de Dados, pages 485–497, Porto Alegre, RS, Brasil. SBC.
Gris, L. R. S., Casanova, E., de Oliveira, F. S., da Silva Soares, A., and Candido Junior, A. (2022). Brazilian portuguese speech recognition using wav2vec 2.0. In Computational Processing of the Portuguese Language: 15th International Conference, PROPOR 2022, Fortaleza, Brazil, March 21–23, 2022, Proceedings, pages 333–343, Berlin, Heidelberg. Springer-Verlag.
Gris, L. R. S., Casanova, E., Oliveira, F. S. d., Soares, A. d. S., and Candido Junior, A. (2021). Desenvolvimento de um modelo de reconhecimento de voz para o português brasileiro com poucos dados utilizando o wav2vec 2.0. In Congresso da Sociedade Brasileira de Computação CSBC. SBC.
Grover, M., Bamdev, P., Singla, Y., Hama, M., and Shah, R. (2020). Audino: A modern annotation tool for audio and speech.
Gulati, A., Qin, J., Chiu, C., Parmar, N., Zhang, Y., Yu, J., Han, W., Wang, S., Zhang, Z., Wu, Y., and Pang, R. (2020). Conformer: Convolution-augmented transformer for speech recognition. In Interspeech 2020, pages 5036–5040.
Karl, A., Fernandes, G., Pires, L., Serpa, Y., and Caminha, C. (2024). Synthetic ai data pipeline for domain-specific speech-to-text solutions. In Anais do XV Simpósio Brasileiro de Tecnologia da Informação e da Linguagem Humana, pages 37–47, Porto Alegre, RS, Brasil. SBC.
Maulik, U. B., Mitra, P., and Sarkar, S. (2025). Enhancing domain-specific asr performance using finetuning and zero-shot prompting: A study in the medical domain. In Proceedings of the 2024 Sixth Doctoral Symposium on Intelligence Enabled Research (DoSIER 2024), pages 1–6, Jalpaiguri, India.
Medeiros, E., Corado, L., Rato, L., Quaresma, P., and Salgueiro, P. (2023). Domain adaptation speech-to-text for low-resource european portuguese using deep learning. Future Internet, 15(5).
Neudecker, C., Baierer, K., Gerber, M., Clausner, C., Antonacopoulos, A., and Pletschacher, S. (2021). A survey of ocr evaluation tools and metrics. In Proceedings of the 6th International Workshop on Historical Document Imaging and Processing, HIP ’21, pages 13–18, New York, NY, USA. Association for Computing Machinery.
Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C., and Sutskever, I. (2023). Robust speech recognition via large-scale weak supervision. In Proceedings of the 40th International Conference on Machine Learning, ICML’23.
Published
2025-09-29
How to Cite
SILVA, Daniel Ribeiro da; BORBA, Maria Eduarda Silva; OLIVEIRA, Gustavo dos Reis; PIMENTA, Pedro Reis; SILVA, Állan Christoffer Pereira; DUTRA, Guilherme Correia; OLIVEIRA, Sávio Salvarino Teles de. Adapting ASR Models to Technical Scenarios: A Case Study in the Brazilian Automotive Repair Domain. In: BRAZILIAN SYMPOSIUM IN INFORMATION AND HUMAN LANGUAGE TECHNOLOGY (STIL), 16., 2025, Fortaleza/CE. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 385-396. DOI: https://doi.org/10.5753/stil.2025.37840.
