LattesRex: Building ChatBots for Semi-Structured Documents

  • Lucas Darcio (UFAM)
  • Karina Soares Santos (Serasa)
  • Amanda Spellen (UFAM)
  • Esther Soares (UFSCar)
  • Livy Real (UFAM / Jusbrasil)
  • Altigran Soares da Silva (UFAM)

Abstract

We present LattesRex, an LLM-based question-answering system to support the analysis of curricula vitae from the Lattes Platform. We propose a modular structured approach inspired by RAG, exploiting metadata to structure the inputs sent to the LLM. We conducted a detailed evaluation, validated by linguists, varying (i) model size, (ii) document length, and (iii) query complexity. The results indicate that structuring the data scales the solution without loss of quality. We contribute a replicable architecture, a systematic qualitative evaluation, and reflections relevant to the use of LLMs in real-world settings. All resources will be made publicly available.

Published
29/09/2025
DARCIO, Lucas; SANTOS, Karina Soares; SPELLEN, Amanda; SOARES, Esther; REAL, Livy; SILVA, Altigran Soares da. LattesRex: Building ChatBots for Semi-Structured Documents. In: SIMPÓSIO BRASILEIRO DE TECNOLOGIA DA INFORMAÇÃO E DA LINGUAGEM HUMANA (STIL), 16., 2025, Fortaleza/CE. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 125-136. DOI: https://doi.org/10.5753/stil.2025.37819.