Fine-tuned model evaluation on Transformer Fragments for Identifying Idiomatic Expressions in Portuguese
Abstract
This work addresses the identification of Idiomatic Expressions (IEs) in Portuguese, a task made difficult by the semantic non-compositionality and ambiguity of these structures. The scarcity of annotated data and the limitations of existing models in capturing idiomaticity motivated the construction of an annotated corpus and the proposal of a method that uses Transformer fragments to identify IEs in sentences. The method uses attention weights from the BERTimbau model, focusing on a specific head that is sensitive to the syntactic relations relevant to IEs, and integrates linguistic heuristics to penalize literal uses. It achieves a precision of 1.0 (no false positives) and a recall of 66.7%, yielding an F1 score of 0.8. The work also compares these results with methods from the literature that rely on other fine-tuned BERT-architecture models.
References
Barreto, S. d. O. G., Marcilese, M., and de Oliveira, A. J. A. (2018). Idiomaticidade, familiaridade e informação prévia no processamento de expressões idiomáticas do PB. Letras de Hoje, 53(1):119–129.
Clark, K., Khandelwal, U., Levy, O., and Manning, C. D. (2019). What does BERT look at? An analysis of BERT’s attention. In Proceedings of ACL, pages 311–330.
Conneau, A., Khandelwal, K., Goyal, N., Chaudhary, V., Wenzek, G., Guzmán, F., Grave, E., Ott, M., Zettlemoyer, L., and Stoyanov, V. (2019). Unsupervised cross-lingual representation learning at scale. CoRR, abs/1911.02116.
Cook, P., Fazly, A., and Stevenson, S. (2007). Pulling their weight: Exploiting syntactic forms for the automatic identification of idiomatic expressions in context. In Gregoire, N., Evert, S., and Kim, S. N., editors, Proceedings of the Workshop on A Broader Perspective on Multiword Expressions, pages 41–48, Prague, Czech Republic. Association for Computational Linguistics.
Cordeiro, S., Ramisch, C., Idiart, M., and Villavicencio, A. (2016). Predicting the compositionality of nominal compounds: Giving word embeddings a hard time. In Erk, K. and Smith, N. A., editors, Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1986–1997, Berlin, Germany. Association for Computational Linguistics.
Crespo, M. C. R. M., de Souza Jeannine Rocha, M. L., Sturzeneker, M. L., Serras, F. R., de Mello, G. L., Costa, A. S., Palma, M. F., Mesquita, R. M., de Paula Guets, R., da Silva, M. M., Finger, M., de Sousa, M. C. P., Namiuti, C., and do Monte, V. M. (2023). Carolina: a general corpus of contemporary Brazilian Portuguese with provenance, typology and versioning information.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2019). BERT: Pretraining of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, pages 4171–4186.
Garcia, M., Kramer Vieira, T., Scarton, C., Idiart, M., and Villavicencio, A. (2021). Assessing the representations of idiomaticity in vector models with a noun compound dataset labeled at type and token levels. In Zong, C., Xia, F., Li, W., and Navigli, R., editors, Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 2730–2741, Online. Association for Computational Linguistics.
Hashempour, R. and Villavicencio, A. (2020). Leveraging contextual embeddings and idiom principle for detecting idiomaticity in potentially idiomatic expressions. In Zock, M., Chersoni, E., Lenci, A., and Santus, E., editors, Proceedings of the Workshop on the Cognitive Aspects of the Lexicon, pages 72–80, Online. Association for Computational Linguistics.
King, M. and Cook, P. (2018). Leveraging distributed representations and lexico-syntactic fixedness for token-level prediction of the idiomaticity of English verb-noun combinations. In Gurevych, I. and Miyao, Y., editors, Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 345–350, Melbourne, Australia. Association for Computational Linguistics.
Phelps, D., Fan, X.-R., Gow-Smith, E., Tayyar Madabushi, H., Scarton, C., and Villavicencio, A. (2022). Sample efficient approaches for idiomaticity detection. In Bhatia, A., Cook, P., Taslimipoor, S., Garcia, M., and Ramisch, C., editors, Proceedings of the 18th Workshop on Multiword Expressions @LREC2022, pages 105–111, Marseille, France. European Language Resources Association.
Pires, T., Schlinger, E., and Garrette, D. (2019). How multilingual is multilingual BERT? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4996–5001, Florence, Italy. Association for Computational Linguistics.
Rodrigues, J., Gomes, L., Silva, J., Branco, A., Santos, R., Cardoso, H. L., and Osório, T. (2023). Advancing neural encoding of Portuguese with transformer Albertina PT-*.
Rohanian, O., Rei, M., Taslimipoor, S., and Ha, L. A. (2020). Verbal multiword expressions for identification of metaphor. In Jurafsky, D., Chai, J., Schluter, N., and Tetreault, J., editors, Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online. Association for Computational Linguistics.
Souza, F., Nogueira, R., and Lotufo, R. (2020). BERTimbau: Pretrained BERT models for Brazilian Portuguese. Brazilian Conference on Intelligent Systems (BRACIS). arXiv preprint arXiv:2009.10683.
Tagnin, S. E. O. (2013). O jeito que a gente diz: combinações consagradas em inglês e português. Disal, Barueri.
Tayyar Madabushi, H., Gow-Smith, E., Garcia, M., Scarton, C., Idiart, M., and Villavicencio, A. (2022). SemEval-2022 task 2: Multilingual idiomaticity detection and sentence embedding. In Emerson, G., Schluter, N., Stanovsky, G., Kumar, R., Palmer, A., Schneider, N., Singh, S., and Ratan, S., editors, Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022), pages 107–121, Seattle, United States. Association for Computational Linguistics.
Tayyar Madabushi, H., Gow-Smith, E., Scarton, C., and Villavicencio, A. (2021). AStitchInLanguageModels: Dataset and methods for the exploration of idiomaticity in pre-trained language models. In Moens, M.-F., Huang, X., Specia, L., and Yih, S. W.-t., editors, Findings of the Association for Computational Linguistics: EMNLP 2021, pages 3464–3477, Punta Cana, Dominican Republic. Association for Computational Linguistics.
Tenney, I., Das, D., and Pavlick, E. (2019). BERT rediscovers the classical NLP pipeline. In Proceedings of ACL, pages 4593–4601.
Xatara, C. M. (2001). Tipologia das expressões idiomáticas. ALFA: Revista de Linguística, 42(1).
Zeng, Z. and Bhat, S. (2021). Idiomatic expression identification using semantic compatibility. Transactions of the Association for Computational Linguistics, 9:1546–1562.
Published
2025-09-29
How to Cite
OLIVEIRA, Ricardo Gomes de; SANTOS, Laila Pereira Mota; SOUSA, Lílian Teixeira de; SANTOS, Marcos Adriano Pereira dos; CLARO, Daniela Barreiro; ARAÚJO, Rerisson Cavalcante de. Fine-tuned model evaluation on Transformer Fragments for Identifying Idiomatic Expressions in Portuguese. In: BRAZILIAN SYMPOSIUM IN INFORMATION AND HUMAN LANGUAGE TECHNOLOGY (STIL), 16., 2025, Fortaleza/CE. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 283-294. DOI: https://doi.org/10.5753/stil.2025.37832.
