LLM-SEMREL: Towards a Better Coreference Resolution for Portuguese

  • Evandro Fonseca (Blip)
  • Joaquim Neto (Blip)

Abstract

This paper describes LLM-SEMREL, a new Portuguese semantic database built automatically with currently available large language models. The project is motivated by the lack of rich semantic resources for the Coreference Resolution task in Portuguese. The resulting resource can be used to improve existing coreference models and to build new ones. LLM-SEMREL comprises 1,229,399 semantic relations, distributed among 261,731 words and their descriptions.
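
As a rough illustration of how a resource of this kind might be consulted during coreference resolution, the sketch below indexes relation triples by word pair so that two mention heads can be checked for a known semantic link. The abstract does not specify the storage format, so the `Relation` fields, the relation labels, and the helper names `build_index` and `relations_between` are all assumptions made for this example, and the toy entries are invented rather than taken from LLM-SEMREL.

```python
from collections import defaultdict
from typing import NamedTuple


class Relation(NamedTuple):
    """One semantic relation; the field names are assumptions for illustration."""
    source: str
    relation: str
    target: str


def build_index(relations):
    """Map each unordered word pair to the set of relation labels linking it."""
    index = defaultdict(set)
    for rel in relations:
        index[frozenset((rel.source, rel.target))].add(rel.relation)
    return index


def relations_between(index, word_a, word_b):
    """Return the relation labels linking two words, in either direction."""
    return index.get(frozenset((word_a, word_b)), set())


# Toy entries invented for this example; they are not taken from LLM-SEMREL.
toy_relations = [
    Relation("carro", "synonym", "automóvel"),
    Relation("carro", "hyponym_of", "veículo"),
]
index = build_index(toy_relations)
```

A coreference model could use a non-empty result from `relations_between` as one feature when deciding whether two mentions refer to the same entity. Keying the index on unordered pairs is a simplification: it discards the direction of asymmetric relations, which a real system would likely need to keep.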

Keywords: Coreference, Data Augmentation, Lexical Resources

Published
28 November 2024
FONSECA, Evandro; NETO, Joaquim. LLM-SEMREL: Towards a Better Coreference Resolution for Portuguese. In: SIMPÓSIO BRASILEIRO DE TECNOLOGIA DA INFORMAÇÃO E DA LINGUAGEM HUMANA (STIL), 15., 2024, Belém/PA. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2024. p. 511-519. DOI: https://doi.org/10.5753/stil.2024.31170.