Getting Logic From LLMs: Annotating Natural Language Inference with Sabiá
Abstract
We discuss the difficulties of annotating Natural Language Inference in Portuguese, comparing human and Large Language Model annotations. We took 200 sentence pairs from the ASSIN2 dataset and re-annotated them for the inference task. A semanticist conducted the first annotation round, and a second round was conducted with Sabiá-3, a large language model trained on Brazilian Portuguese data. We found that Sabiá-3 matches the agreement score of human annotators, but the LLM and the humans disagree on cases involving different linguistic phenomena. While humans tend to disagree on pairs involving pragmatics or cultural knowledge, Sabiá-3 tends to mislabel pairs whose sentences share a context but bear no clear logical relation to each other. This shows that, although LLMs are now statistically as effective as humans, LLMs and humans exhibit different patterns of disagreement and annotation error in Natural Language Inference.