Fine-tuned model evaluation on Transformer Fragments for Identifying Idiomatic Expressions in Portuguese
Abstract
This work addresses the identification of Idiomatic Expressions (IEs) in Portuguese, a problem whose main difficulty lies in the semantic non-compositionality and ambiguity of these structures. The scarcity of annotated data and the limitations of existing models in capturing idiomaticity motivated the construction of an annotated corpus and the proposal of a method that uses Transformer fragments to identify IEs in sentences. The method uses attention weights from the BERTimbau model, focusing on a specific attention head that is sensitive to syntactic relations relevant to IEs, and integrates linguistic heuristics to penalize literal uses. The results show high precision (1.0), with no false positives, and a recall of 66.7%, resulting in an F1 score of 0.8. In addition, the work compares these results with methods already used in the literature that rely on other fine-tuned BERT-architecture models.
References
Barreto, S. d. O. G., Marcilese, M., and de Oliveira, A. J. A. (2018). Idiomaticidade, familiaridade e informação prévia no processamento de expressões idiomáticas do PB. Letras de Hoje, 53(1):119–129.
Clark, K., Khandelwal, U., Levy, O., and Manning, C. D. (2019). What does BERT look at? An analysis of BERT’s attention. In Proceedings of ACL, pages 311–330.
Conneau, A., Khandelwal, K., Goyal, N., Chaudhary, V., Wenzek, G., Guzmán, F., Grave, E., Ott, M., Zettlemoyer, L., and Stoyanov, V. (2019). Unsupervised cross-lingual representation learning at scale. CoRR, abs/1911.02116.
Cook, P., Fazly, A., and Stevenson, S. (2007). Pulling their weight: Exploiting syntactic forms for the automatic identification of idiomatic expressions in context. In Gregoire, N., Evert, S., and Kim, S. N., editors, Proceedings of the Workshop on A Broader Perspective on Multiword Expressions, pages 41–48, Prague, Czech Republic. Association for Computational Linguistics.
Cordeiro, S., Ramisch, C., Idiart, M., and Villavicencio, A. (2016). Predicting the compositionality of nominal compounds: Giving word embeddings a hard time. In Erk, K. and Smith, N. A., editors, Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1986–1997, Berlin, Germany. Association for Computational Linguistics.
Crespo, M. C. R. M., de Souza Jeannine Rocha, M. L., Sturzeneker, M. L., Serras, F. R., de Mello, G. L., Costa, A. S., Palma, M. F., Mesquita, R. M., de Paula Guets, R., da Silva, M. M., Finger, M., de Sousa, M. C. P., Namiuti, C., and do Monte, V. M. (2023). Carolina: a general corpus of contemporary Brazilian Portuguese with provenance, typology and versioning information.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, pages 4171–4186.
Garcia, M., Kramer Vieira, T., Scarton, C., Idiart, M., and Villavicencio, A. (2021). Assessing the representations of idiomaticity in vector models with a noun compound dataset labeled at type and token levels. In Zong, C., Xia, F., Li, W., and Navigli, R., editors, Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 2730–2741, Online. Association for Computational Linguistics.
Hashempour, R. and Villavicencio, A. (2020). Leveraging contextual embeddings and idiom principle for detecting idiomaticity in potentially idiomatic expressions. In Zock, M., Chersoni, E., Lenci, A., and Santus, E., editors, Proceedings of the Workshop on the Cognitive Aspects of the Lexicon, pages 72–80, Online. Association for Computational Linguistics.
King, M. and Cook, P. (2018). Leveraging distributed representations and lexico-syntactic fixedness for token-level prediction of the idiomaticity of English verb-noun combinations. In Gurevych, I. and Miyao, Y., editors, Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 345–350, Melbourne, Australia. Association for Computational Linguistics.
Phelps, D., Fan, X.-R., Gow-Smith, E., Tayyar Madabushi, H., Scarton, C., and Villavicencio, A. (2022). Sample efficient approaches for idiomaticity detection. In Bhatia, A., Cook, P., Taslimipoor, S., Garcia, M., and Ramisch, C., editors, Proceedings of the 18th Workshop on Multiword Expressions @LREC2022, pages 105–111, Marseille, France. European Language Resources Association.
Pires, T., Schlinger, E., and Garrette, D. (2019). How multilingual is multilingual BERT? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4996–5001, Florence, Italy. Association for Computational Linguistics.
Rodrigues, J., Gomes, L., Silva, J., Branco, A., Santos, R., Cardoso, H. L., and Osório, T. (2023). Advancing neural encoding of Portuguese with transformer Albertina PT-*.
Rohanian, O., Rei, M., Taslimipoor, S., and Ha, L. A. (2020). Verbal multiword expressions for identification of metaphor. In Jurafsky, D., Chai, J., Schluter, N., and Tetreault, J., editors, Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online. Association for Computational Linguistics.
Souza, F., Nogueira, R., and Lotufo, R. (2020). BERTimbau: Pretrained BERT models for Brazilian Portuguese. Brazilian Conference on Intelligent Systems (BRACIS). arXiv preprint arXiv:2009.10683.
Tagnin, S. E. O. (2013). O jeito que a gente diz: combinações consagradas em inglês e português. Disal, Barueri.
Tayyar Madabushi, H., Gow-Smith, E., Garcia, M., Scarton, C., Idiart, M., and Villavicencio, A. (2022). SemEval-2022 task 2: Multilingual idiomaticity detection and sentence embedding. In Emerson, G., Schluter, N., Stanovsky, G., Kumar, R., Palmer, A., Schneider, N., Singh, S., and Ratan, S., editors, Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022), pages 107–121, Seattle, United States. Association for Computational Linguistics.
Tayyar Madabushi, H., Gow-Smith, E., Scarton, C., and Villavicencio, A. (2021). AStitchInLanguageModels: Dataset and methods for the exploration of idiomaticity in pre-trained language models. In Moens, M.-F., Huang, X., Specia, L., and Yih, S. W.-t., editors, Findings of the Association for Computational Linguistics: EMNLP 2021, pages 3464–3477, Punta Cana, Dominican Republic. Association for Computational Linguistics.
Tenney, I., Das, D., and Pavlick, E. (2019). BERT rediscovers the classical NLP pipeline. In Proceedings of ACL, pages 4593–4601.
Xatara, C. M. (2001). Tipologia das expressões idiomáticas. ALFA: Revista de Linguística, 42(1).
Zeng, Z. and Bhat, S. (2021). Idiomatic expression identification using semantic compatibility. Transactions of the Association for Computational Linguistics, 9:1546–1562.
Published
29/09/2025
How to Cite
OLIVEIRA, Ricardo Gomes de; SANTOS, Laila Pereira Mota; SOUSA, Lílian Teixeira de; SANTOS, Marcos Adriano Pereira dos; CLARO, Daniela Barreiro; ARAÚJO, Rerisson Cavalcante de. Fine-tuned model evaluation on Transformer Fragments for Identifying Idiomatic Expressions in Portuguese. In: SIMPÓSIO BRASILEIRO DE TECNOLOGIA DA INFORMAÇÃO E DA LINGUAGEM HUMANA (STIL), 16., 2025, Fortaleza/CE. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 283-294. DOI: https://doi.org/10.5753/stil.2025.37832.
