ModBERTBr: A ModernBERT-based Model for Brazilian Portuguese
Abstract
A key model in the Large Language Model (LLM) field is the Bidirectional Encoder Representations from Transformers (BERT), known for its effectiveness and versatility. The current state-of-the-art variant of BERT is ModernBERT, which, despite excelling in efficiency and performance, is limited to English. This paper addresses this gap by introducing ModBERTBr, a novel pre-trained model based on the ModernBERT architecture, explicitly tailored for Brazilian Portuguese and incorporating recent research and techniques. Through both intrinsic and extrinsic evaluations, ModBERTBr was assessed against multiple baseline models, showing consistent improvements and competitive performance compared to its predecessors.
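For illustration only (this sketch is not from the paper): one way a ModernBERT-style encoder such as ModBERTBr could be queried for masked-token prediction using the Hugging Face transformers library, assuming a published checkpoint and a transformers release that includes ModernBERT support. The model identifier below is a hypothetical placeholder.

```python
# Illustrative sketch only -- not code from the paper.
# Queries a ModernBERT-style encoder (e.g., ModBERTBr) for masked-token
# prediction via the Hugging Face fill-mask pipeline.
from transformers import pipeline

# Hypothetical placeholder: substitute the checkpoint actually released by the authors.
MODEL_ID = "path/or/hub-id/of/ModBERTBr"

fill_mask = pipeline("fill-mask", model=MODEL_ID)

# Build a Brazilian Portuguese sentence using the tokenizer's own mask token.
sentence = f"Brasília é a {fill_mask.tokenizer.mask_token} do Brasil."

# Print the top predicted tokens and their scores.
for prediction in fill_mask(sentence):
    print(f"{prediction['token_str']!r}  score={prediction['score']:.3f}")
```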
Published
2025-09-29
How to Cite
WU, Wallace Ben Teng Lin; GARCIA, Luis Paulo Faina. ModBERTBr: A ModernBERT-based Model for Brazilian Portuguese. In: NATIONAL MEETING ON ARTIFICIAL AND COMPUTATIONAL INTELLIGENCE (ENIAC), 22., 2025, Fortaleza/CE. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 2044-2055. ISSN 2763-9061. DOI: https://doi.org/10.5753/eniac.2025.14516.
