Large Language Models and Free-Software Principles: Pathways to Brazilian Digital Sovereignty
Abstract
This paper critically examines how contemporary Large Language Models (LLMs) align with free-software principles and what that alignment, or lack thereof, means for Brazil's pursuit of digital sovereignty. We introduce a three-tier openness taxonomy that distinguishes proprietary, semi-open, and fully open models, and we audit eighteen leading systems for license terms, data transparency, and reproducibility. Energy projections indicate that pre-training a model with one hundred billion parameters demands about six million GPU-hours and roughly one gigawatt-hour of electricity, far beyond what the ten PFLOPS currently available in Brazil's public high-performance computing facilities can support. Our analysis identifies four systemic bottlenecks: outdated infrastructure, limited Portuguese corpora, prohibitive economic costs, and persistent talent flight. Verified national initiatives such as SoberanIA, CPQD's GPT-BR 2.8B, SerproLLM, and the open-data communities Querido Diário and Brasil.io show that combining global checkpoints with local fine-tuning can reduce training costs by as much as ninety percent. Based on these findings, we recommend: first, multiyear investment in distributed state-owned GPU clusters; second, a Public Interest Data Act that prioritizes Portuguese text under Creative Commons licenses; third, procurement rules that recognize only licenses approved by the Open Source Initiative; fourth, expanded artificial-intelligence residency programs to retain talent; and fifth, a Latin American consortium for multilingual pre-training. We conclude that true sovereignty over large language models will emerge only through the synergy of transparent licensing, sovereign infrastructure, rich public corpora, and continuous human-capital development.
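The energy figure above is consistent with a simple back-of-the-envelope check; the 170 W average accelerator draw used below is an illustrative assumption, not a figure stated in the paper:

```latex
E \;\approx\; \underbrace{6\times 10^{6}\ \text{GPU-h}}_{\text{training time}} \times \underbrace{0.17\ \text{kW}}_{\text{avg.\ draw (assumed)}} \;\approx\; 1.0\times 10^{6}\ \text{kWh} \;=\; 1\ \text{GWh}
```

A higher sustained draw (e.g., 400 W per accelerator) would scale the estimate proportionally, to roughly 2.4 GWh.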
References
Ministério do Planejamento, Orçamento e Gestão, “Portal do software público brasileiro,” [link], 2012, accessed: Jun. 27, 2025.
Ministério da Ciência, Tecnologia e Inovações, “Estratégia brasileira de inteligência artificial (EBIA),” Available online, 2021, accessed: Jun. 27, 2025.
——, “Plano brasileiro de inteligência artificial (PBIA) 2024–2028,” Available online, 2024, accessed: Jun. 27, 2025.
CrazyStack. (2025) Tendências de IA no mercado brasileiro 2025. Accessed: Jul. 11, 2025. [Online]. Available: [link]
A. Masood, “Open source licensing modalities in large language models: Insights, risks and opportunities,” [link], 2024, accessed: Jun. 27, 2025.
SemiAnalysis. (2024) OpenAI is doomed: Et tu Microsoft? Accessed: Jul. 11, 2025. [Online]. Available: [link]
Comitê Gestor da Internet no Brasil (CGI.br). (2023) Regulação de plataformas digitais: relatório do CGI.br mapeia consensos e dissensos entre setores. [Online]. Available: [link]
J. Manchanda, L. Boettcher, M. Westphalen, and J. Jasser, “The open source advantage in large language models (LLMs),” arXiv preprint arXiv:2412.12004, 2024.
R. M. Stallman, Free Software, Free Society: Selected Essays of Richard M. Stallman, 1st ed. Boston, MA: GNU Press, 2002. [Online]. Available: [link]
Open Source Initiative, “The open source definition – version 1.9,” [link], 2015, accessed: Jun. 27, 2025.
E. S. Raymond, The Cathedral and the Bazaar: Musings on Linux and Open Source by an Accidental Revolutionary, 1st ed. Sebastopol, CA: O’Reilly Media, 1999.
J. Naskali, “A proposed addition to open-source licensing for securing user freedom to run software,” in Proceedings of the Conference on Technology Ethics (Tethics 2021). CEUR-WS, 2021. [Online]. Available: [link]
C. S. F. dos Santos Maciel, “Governança digital e transparência pública: avanços, desafios e oportunidades,” Liinc em Revista, vol. 16, no. 2, p. e5240, 2020.
J. Kaplan, S. McCandlish, T. Henighan et al., “Scaling laws for neural language models,” arXiv preprint arXiv:2001.08361, 2020. [Online]. Available: [link]
T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell et al., “Language models are few-shot learners,” Advances in neural information processing systems, vol. 33, pp. 1877–1901, 2020.
OpenAI, “GPT-4 technical report,” OpenAI, Tech. Rep. arXiv:2303.08774, 2023. [Online]. Available: [link]
Google DeepMind, “Gemini 1.5: A next-generation multimodal model,” [link], 2024, accessed: Jun. 27, 2025.
R. Sapkota, S. Raza, and M. Karkee, “Comprehensive analysis of transparency and accessibility of ChatGPT, DeepSeek, and other SOTA large language models,” arXiv preprint arXiv:2502.18505, 2025.
Anthropic, “Introducing the Claude 3 model family,” [link], 2024, accessed: Jun. 20, 2025.
R. Anil, Z. Zhou, W. Chan et al., “Gemini: A family of highly capable multimodal models,” arXiv preprint arXiv:2312.11805, 2023. [Online]. Available: [link]
Meta AI, “Llama 3 community license agreement,” [link], 2024, accessed: Jun. 27, 2025.
Mistral AI, “Mistral AI non-commercial license,” [link], 2024, accessed: Jun. 27, 2025.
H. Liao, “Performance optimization of DeepSeek MoE architecture in multi-scale prediction of stock returns,” World Journal of Information Technology, vol. 3, no. 2, pp. 1–9, 2025.
R. Pires, H. Abonizio, T. S. Almeida, and R. Nogueira, “Sabiá: Portuguese large language models,” in Brazilian Conference on Intelligent Systems. Springer, 2023, pp. 226–240.
T. L. Scao, A. Fan, C. Akiki et al., “BLOOM: A 176B-parameter open-access multilingual language model,” in Proceedings of ACL 2022, 2022, pp. 520–538. [Online]. Available: [link]
S. Black, B. Gyawali, D. Knight et al., “GPT-NeoX-20B: An open-source autoregressive language model,” arXiv preprint arXiv:2204.06745, 2022.
W. Knight. (2023) The myth of “open source” AI. WIRED. [Online]. Available: [link]
——. (2023, Apr.) OpenAI’s CEO says the age of giant AI models is already over. WIRED. Interview in which Sam Altman states that GPT-4’s training cost exceeded US$100 million. [Online]. Available: [link]
D. Patterson, J. Gonzalez, Q. Le, C. Liang, L.-M. Munguia, D. Rothchild, D. So, M. Texier, and J. Dean, “Carbon emissions and large neural network training,” arXiv preprint arXiv:2104.10350, 2021.
C. Osthoff, S. Herrera, T. Teixeira, M. Coelho, M. Melo, G. Costa, F. Cabral, M. E. Welter, B. Fagundes, A. Carneiro et al., “A arquitetura do supercomputador sdumont e os desafios da pesquisa brasileira na área de computação de alto desempenho,” in Escola Regional de Alto Desempenho de São Paulo (ERAD-SP). SBC, 2020, pp. 1–5.
Empresa de Pesquisa Energética, “Balanço energético nacional 2023,” 2023, available: [link]. Accessed: Jul. 2, 2025.
Boston Consulting Group and The Network, “Decoding digital talent: Where do people want to work?” 2019, accessed: Jul. 12, 2025. [Online]. Available: [link]
Comitê Gestor da Internet no Brasil (CGI.br), “Cartilha de boas práticas para a internet no brasil – 13a edição,” [link], 2021, accessed: Jun. 27, 2025.
Secretaria de Administração do Estado do Piauí, “Programa SoberanIA: Assistente virtual para serviços ao cidadão,” [link], 2024, accessed: Jun. 23, 2025.
Departamento de Ciência da Computação da UFMG, “Novo INCT em inteligência artificial responsável será sediado no DCC/UFMG,” 2025, accessed: Jul. 13, 2025. [Online]. Available: [link]
CPQD and Rede ANID, “GPT-BR-2.8B: Modelo de linguagem em português com 2,8 bilhões de parâmetros,” [link], 2023, accessed: Jun. 27, 2025.
Serviço Federal de Processamento de Dados (Serpro) and Secretaria de Governo Digital. (2025) Serpro e SGD lançam guia sobre IA e anunciam o SerproLLM. News item “IA em Ação,” Feb. 18, 2025. [Online]. Available: [link]
Open Knowledge Brasil. (2025) Querido Diário: Coletando e abrindo os diários oficiais do Brasil. [Online]. Available: [link]
Brasil.io. (2025) Brasil.io: Dados públicos para todos. Accessed: Jul. 13, 2025. [Online]. Available: [link]
