Semantic Clustering of Civic Proposals: A Case Study on Brazil’s National Participation Platform
Abstract
Promoting citizen participation through digital platforms such as Brasil Participativo has become a priority for governments worldwide. However, the sheer volume of contributions means that much of this engagement goes underused, because organizing it poses three challenges: (1) manual classification is infeasible at scale; (2) classification requires expert involvement; and (3) the resulting categories must align with official taxonomies. In this paper, we introduce an approach that combines BERTopic guided by seed words with automatic validation by large language models. Initial results indicate that the generated topics are coherent and institutionally aligned, while requiring minimal human effort. The methodology enables governments to turn large volumes of citizen input into actionable data for public policy.
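To make the seed-word idea concrete, the following is a minimal, standard-library-only sketch and not the pipeline used in the paper. It replaces BERTopic's sentence embeddings with plain bag-of-words vectors, and it compares each proposal with a hypothetical list of seed words per taxonomy category using cosine similarity. The category names, seed words, threshold, and function names are all illustrative assumptions.

```python
import math
from collections import Counter

# Hypothetical seed words per taxonomy category (illustrative only,
# not taken from any official taxonomy).
SEED_TOPICS = {
    "health": ["hospital", "clinic", "vaccine", "doctor"],
    "education": ["school", "teacher", "university", "student"],
    "transport": ["bus", "road", "metro", "traffic"],
}


def bag_of_words(text):
    """Lowercase, whitespace-tokenized term counts (a stand-in for embeddings)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def assign_seed_topic(text, seed_topics=SEED_TOPICS, threshold=0.1):
    """Return the best-matching seed category, or None if nothing is close enough.

    The threshold is an assumed value chosen for this toy example.
    """
    doc = bag_of_words(text)
    best_label, best_score = None, 0.0
    for label, words in seed_topics.items():
        score = cosine(doc, Counter(words))
        if score > best_score:
            best_label, best_score = label, score
    return best_label if best_score >= threshold else None
```

With these assumed seed lists, `assign_seed_topic("build a new hospital and hire more doctor staff")` returns `"health"`, while a proposal that shares no words with any seed list returns `None` and would be left for unguided clustering. In the BERTopic library itself, seed words are passed through the `seed_topic_list` argument of the `BERTopic` constructor, and the comparison is made on embeddings rather than raw word counts.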
