From Corpus creation to large language model training: What can go wrong in a Research Internship?

  • Lucas B. Bulcão Mota UFBA
  • Aline Athaydes UFBA
  • Babacar Mane UFBA
  • Daniela Barreiro Claro UFBA
  • Marlo Souza UFBA
  • Fernando Humberto UFBA

Abstract


This paper reports the experience of an undergraduate research (scientific initiation) project focused on developing a chatbot specialized in Consumer Law. One of the main challenges was the creation of a synthetic dataset to enable the fine-tuning of a language model. Throughout the process, several technical and methodological difficulties were identified, ranging from data collection and structuring to model training. The objective of this work is to report these challenges, highlighting the role of error in the scientific learning process and reflecting on the lessons learned in developing AI-based legal systems.
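
The abstract describes the workflow only at a high level. The sketch below illustrates, under assumptions of our own, how a small synthetic instruction corpus can be paired with LoRA fine-tuning using the Hugging Face transformers, datasets, and peft libraries; the model name, hyperparameters, and the toy Portuguese Q&A pair are illustrative placeholders, not the project's actual data, model, or configuration.

```python
# Minimal sketch (illustrative only): fine-tune a small open model with LoRA on a
# toy "synthetic" Consumer Law Q&A corpus. All names and settings are assumptions.
from datasets import Dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "Qwen/Qwen2.5-0.5B"  # hypothetical small base model chosen for the sketch
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Toy example standing in for the generated synthetic Q&A corpus.
examples = [
    {"question": "O fornecedor pode recusar a troca de um produto com defeito?",
     "answer": "Se o defeito não for sanado em 30 dias, o consumidor pode exigir "
               "a substituição do produto, a restituição do valor ou abatimento do preço."},
]

def to_text(ex):
    # Flatten each Q&A pair into a single training string.
    return {"text": f"Pergunta: {ex['question']}\nResposta: {ex['answer']}"}

dataset = Dataset.from_list(examples).map(to_text)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

# LoRA: train low-rank adapter matrices instead of all model weights.
lora_config = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                         target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="lora-consumer-law",
                           per_device_train_batch_size=1,
                           num_train_epochs=1,
                           logging_steps=1),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

In practice, a real run would use thousands of generated examples, a validation split, and careful prompt formatting; the point here is only to make the corpus-to-training pipeline mentioned in the abstract concrete.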

Published
2025-09-29
MOTA, Lucas B. Bulcão; ATHAYDES, Aline; MANE, Babacar; CLARO, Daniela Barreiro; SOUZA, Marlo; HUMBERTO, Fernando. From Corpus creation to large language model training: What can go wrong in a Research Internship?. In: BRAZILIAN SYMPOSIUM IN INFORMATION AND HUMAN LANGUAGE TECHNOLOGY (STIL), 16., 2025, Fortaleza/CE. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 703-707. DOI: https://doi.org/10.5753/stil.2025.37875.