Diplomatrix-BR: A Parallel Corpus of Human- and LLM-Authored Essays from the Brazilian Diplomacy Entrance Exam

  • Rodrigo Cavalcanti (UFF)
  • Gabriela Casini (UFF)
  • Gabriel Assis (UFF)
  • Livy Real (JusBrasil / UFAM)
  • Daniela Vianna (JusBrasil)
  • Paulo Mann (UFRJ)
  • Aline Paes (UFF)

Abstract

Large Language Models (LLMs) have made significant progress in generating coherent, well-structured text, but evaluating their output remains a challenge, especially in open-ended, high-level generation. The problem is even more acute in under-represented languages such as Portuguese, where existing benchmarks tend to be narrow in scope and domain. We present Diplomatrix-BR, a new benchmark built from essays written for the admission exam to the Brazilian diplomatic career (CACD), paired with the official scores assigned by human graders and with LLM-generated texts on the same prompts. We apply a range of linguistic and automatic metrics to compare human and model productions, offering evidence on whether LLMs can write with genuine depth or merely simulate coherence through surface fluency. Diplomatrix-BR lays the groundwork for evaluating generation in low-resource, high-complexity settings, while also exposing the fragility of automatic metrics.
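To give a concrete sense of the kind of automatic metric used to contrast human and model writing, the sketch below computes two simple surface statistics (type-token ratio and hapax ratio) over a text. This is only an illustrative example under the assumption that lexical-diversity measures are among the metrics applied; the benchmark's actual evaluation battery is broader, and the `lexical_profile` helper and the sample sentences are hypothetical.

```python
import re
from collections import Counter

def lexical_profile(text: str) -> dict:
    """Simple surface metrics of the kind used to compare
    human- and LLM-authored essays (illustrative only)."""
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter(tokens)
    n_tokens = len(tokens)
    hapax = sum(1 for c in counts.values() if c == 1)  # words used exactly once
    return {
        "tokens": n_tokens,
        "type_token_ratio": len(counts) / n_tokens if n_tokens else 0.0,
        "hapax_ratio": hapax / n_tokens if n_tokens else 0.0,
    }

# Hypothetical snippets: repeated wording lowers lexical diversity.
human = "A diplomacia exige clareza, precisão e repertório amplo."
model = "A diplomacia exige clareza, clareza e mais clareza."

print(lexical_profile(human)["type_token_ratio"])
print(lexical_profile(model)["type_token_ratio"])
```

Metrics like these capture only surface variety; they say nothing about argumentative depth, which is precisely the gap the benchmark's human scores help probe.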

Published
2025-09-29
CAVALCANTI, Rodrigo; CASINI, Gabriela; ASSIS, Gabriel; REAL, Livy; VIANNA, Daniela; MANN, Paulo; PAES, Aline. Diplomatrix-BR: Um Corpus Paralelo de Redações de Autoria Humana e de LLMs no Concurso de Diplomacia Brasileira. In: SIMPÓSIO BRASILEIRO DE TECNOLOGIA DA INFORMAÇÃO E DA LINGUAGEM HUMANA (STIL), 16., 2025, Fortaleza/CE. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 192-205. DOI: https://doi.org/10.5753/stil.2025.37825.