News dataset for investment assessment in São Paulo state regions

Abstract


This paper presents a structured dataset of news reports about productive investments in the state of São Paulo, Brazil (2016–2024). The reports were collected and classified by Fundação Seade prior to the curation performed by the Survey of Announced Investments in the State of São Paulo (PIESP). The dataset includes fields such as title, source, full text, and manual relevance labels. We describe how the data were collected and organized, then discuss potential applications such as semantic clustering, named entity recognition, and automatic summarization. We also address the class imbalance observed in recent data and possible mitigations through sampling. The dataset is intended to support research in regional economics, text mining, and machine learning.

Keywords: Productive Investments, News, Public Policies, Economy, São Paulo
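
The abstract mentions sampling as a possible mitigation for class imbalance without naming a specific method. As a minimal sketch only, the Python snippet below illustrates one simple option, random undersampling of the larger classes. The field names (`title`, `relevant`) and the toy records are hypothetical and are not taken from the dataset itself.

```python
import random
from collections import Counter, defaultdict


def undersample(records, label_key="relevant", seed=0):
    """Randomly downsample every class to the size of the smallest class."""
    if not records:
        return []
    # Group records by their label value.
    by_label = defaultdict(list)
    for rec in records:
        by_label[rec[label_key]].append(rec)
    # The smallest class sets the per-class sample size.
    target = min(len(group) for group in by_label.values())
    rng = random.Random(seed)  # fixed seed for reproducibility
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, target))
    rng.shuffle(balanced)
    return balanced


# Toy records (hypothetical): 10 relevant items out of 100.
news = [{"title": f"item {i}", "relevant": i % 10 == 0} for i in range(100)]
balanced = undersample(news)
counts = Counter(rec["relevant"] for rec in balanced)
print(counts)
```

Undersampling discards majority-class examples, so it suits cases where the majority class is large enough to spare them. Oversampling approaches are an alternative when data are scarce.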

Published
2025-09-29
MELO, Gabriel L.; NERES SOUSA, João V. C.; OLIVEIRA, Willian D.; MINGARDO, Lucas; FREIRE, Carlos; TRAINA, Agma J. M.; TRAINA JR., Caetano. News dataset for investment assessment in São Paulo state regions. In: DATASET SHOWCASE WORKSHOP (DSW), 7., 2025, Fortaleza/CE. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 1-11. DOI: https://doi.org/10.5753/dsw.2025.247813.