Benchmarking LLMs in Geoscience: A Serverless Approach using GeoBench and AWS

Otávio Parraga; Arthur Fachel; Rodolfo S. Antunes; Luiz Gonzaga Jr; Maurício Roberto Veronez; Rodrigo C. Barros; Lucas S. Kupssinskü

doi:10.5753/pesquisanuvem.2026.22263

Otávio Parraga PUCRS
Arthur Fachel PUCRS
Rodolfo S. Antunes UNISINOS
Luiz Gonzaga Jr UNISINOS
Maurício Roberto Veronez UNISINOS
Rodrigo C. Barros PUCRS
Lucas S. Kupssinskü PUCRS

DOI: https://doi.org/10.5753/pesquisanuvem.2026.22263

Resumo

Este artigo apresenta uma avaliação sistemática de Large Language Models (LLMs) de pesos abertos para tarefas geocientíficas, utilizando o benchmark GeoBench. Para superar restrições de hardware local ao avaliar modelos massivos, implementamos uma infraestrutura em nuvem serverless na AWS, utilizando API Gateway, Lambda e Amazon Bedrock. Essa arquitetura permitiu inferência em larga escala e o aumento automatizado de dados.

Referências

DeepSeek-AI, Guo, D., Yang, D., Zhang, H., Song, J.-M., Zhang, R., Xu, R., Zhu, Q., Ma, S., Wang, P., Bi, X., Zhang, X., et al. (2025). Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.

Deng, C., Zhang, T., He, Z., Chen, Q., Shi, Y., Xu, Y., Fu, L., Zhang, W., Wang, X., Zhou, C., et al. (2024). K2: A foundation language model for geoscience knowledge understanding and utilization. In Proceedings of the 17th ACM International Conference on Web Search and Data Mining, pages 161–170.

Dramsch, J. S. (2020). 70 years of machine learning in geoscience in review. Advances in geophysics, 61:1–55.

Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Yang, A., Fan, A., et al. (2024). The llama 3 herd of models. arXiv preprint arXiv:2407.21783.

Garcez, V. H., Parraga, O., Marques, A., Spigolon, A. L. D., De Barros, G., Gonzaga, L., Veronez, M. R., Barros, R. C., and Kupssinskü, L. S. (2025). Which is the best llm for geosciences? In IGARSS 2025-2025 IEEE International Geoscience and Remote Sensing Symposium, pages 6374–6378. IEEE.

Lin, Z., Deng, C., Zhou, L., Zhang, T., Xu, Y., Xu, Y., He, Z., Shi, Y., Dai, B., Song, Y., et al. (2023). Geogalactica: A scientific large language model in geoscience. arXiv preprint arXiv:2401.00434.

Marques Jr, A., Horota, R. K., De Souza, E. M., Kupssinskü, L., Rossa, P., Aires, A. S., Bachi, L., Veronez, M. R., Gonzaga Jr, L., and Cazarin, C. L. (2020). Virtual and digital outcrops in the petroleum industry: A systematic review. Earth-Science Reviews, 208:103260.

Meta AI (2025). Llama 4: Multimodal intelligence. [link]. Accessed: 2026-01-08.

Mistral AI (2025). Introducing mistral 3. [link]. Accessed: 2026-01-29.

Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al. (2022). Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744.

Parraga, O., More, M. D., Oliveira, C. M., Gavenski, N. S., Kupssinskü, L. S., Medronha, A., Moura, L. V., Simões, G. S., and Barros, R. C. (2023). Fairness in deep learning: A survey on vision and language research. ACM Computing Surveys.

Whitmeyer, S. J., Nicoletti, J., and De Paor, D. G. (2010). The digital revolution in geologic mapping. Gsa Today, 20(4/5):4–10.

Zhao, W. X., Zhou, K., Li, J., Tang, T., Wang, X., Hou, Y., Min, Y., Zhang, B., Zhang, J., Dong, Z., et al. (2023). A survey of large language models. arXiv preprint arXiv:2303.18223.

Zhong, J., Shen, W., Li, Y., Gao, S., Lu, H., Chen, Y., Zhang, Y., Zhou, W., Gu, J., and Zou, L. (2025). A comprehensive survey of reward models: Taxonomy, applications, challenges, and future. arXiv preprint arXiv:2504.12328.