The Obstacle of Structural Language Ambiguities for Linguistically Motivated Language Models

João Pedro Gonçalves Munhoz; Oto Araújo Vale

doi:10.5753/stil.2025.37867

João Pedro Gonçalves Munhoz (UFSCar)
Oto Araújo Vale (UFSCar)

Abstract

This paper presents UDCode, a convention for encoding morphosyntactic and dependency information from Universal Dependencies (UD). The goal is to assess how the granularity of the UD annotations for Brazilian Portuguese affects the recognition of temporal named entities. Exploratory tests showed that the underspecified categorization of adverbs compromises precision, producing a high rate of false positives. This result indicates that the effectiveness of a linguistically motivated model depends on the level of detail of the annotations. We conclude that future work should focus either on revising the annotation guidelines to include finer-grained adverbial categories or on methods that compensate for this lack of specificity.
