HALF: Human-Assisted Labeling Feedback Method for Subject Mining
Abstract
The growing volume of unstructured text in Official Gazettes highlights the need for robust semantic search engines. To address this, we propose a hybrid approach that combines machine learning, human supervision, and a Large Language Model (LLM). The HALF (Human-Assisted Labeling Feedback) method, which uses GPT-4o Mini, classified the subjects of publications from the Official Gazette of Ceará. It assigned subjects to 1,044 publications with an accuracy of 0.8889 against the ground truth. This approach enhances semantic search, improves retrieval and decision-making, and extends to other legal domains. It also offers a scalable solution that outperforms traditional unsupervised methods in accuracy and relevance.
References
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. (2020). Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901.
Cação, F. N., Costa, A. R., Unterstell, N., Yonaha, L., Stec, T., and Ishisaki, F. (2021). DeepPolicyTracker: Tracking changes in environmental policy in the Brazilian federal official gazette with deep learning. In ICML 2021 Workshop on Tackling Climate Change with Machine Learning.
Castano, S., Ferrara, A., Furiosi, E., Montanelli, S., Picascia, S., Riva, D., and Stefanetti, C. (2024). Enforcing legal information extraction through context-aware techniques: The ASKE approach. Computer Law & Security Review, 52:105903.
Manning, C. D., Raghavan, P., and Schütze, H. (2008). Introduction to information retrieval. Cambridge University Press.
Dobša, J. and Kiers, H. A. (2022). Improving classification of documents by semi-supervised clustering in a semantic space. In Conference of the International Federation of Classification Societies, pages 121–129. Springer International Publishing Cham.
Eisenstein, J. (2018). Natural language processing. Jacob Eisenstein, 507.
Guimarães, G. M., da Silva, F. X., Queiroz, A. L., Marcacini, R. M., Faleiros, T. P., Borges, V. R., and Garcia, L. P. (2024). DODFMiner: An automated tool for named entity recognition from official gazettes. Neurocomputing, 568:127064.
Jurafsky, D. and Martin, J. H. (2025). Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition with Language Models. 3rd edition. Online manuscript released January 12, 2025.
Pinto, F. A. D. G., de Barros Santos, J., Lifschitz, S., and Haeusler, E. H. (2023). A benchmarking for public information by machine learning and regular language. In Anais do XI Workshop de Computação Aplicada em Governo Eletrônico, pages 60–71. SBC.
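Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30.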
Zangari, A., Marcuzzo, M., Rizzo, M., Giudice, L., Albarelli, A., and Gasparetto, A. (2024). Hierarchical text classification and its foundations: A review of current research. Electronics, 13(7):1199.
Zhang, Y., Yang, R., Xu, X., Xiao, J., Shen, J., and Han, J. (2024). TELEClass: Taxonomy enrichment and LLM-enhanced hierarchical text classification with minimal supervision. arXiv preprint arXiv:2403.00165.
Published
2025-07-20
How to Cite
SANTOS, Bruno Rogério S. dos; ROCHA, Leonardo Sampaio; MENEZES, Vinícius de M.; COSTA JÚNIOR, Evilásio; LESSA, Pedro Henrique L. HALF: Human-Assisted Labeling Feedback Method for Subject Mining. In: LATIN AMERICAN SYMPOSIUM ON DIGITAL GOVERNMENT (LASDIGOV), 12., 2025, Maceió/AL. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 215-226. ISSN 2763-8723. DOI: https://doi.org/10.5753/lasdigov.2025.9138.
