PREAnoTe: A corpus annotation approach for pre-trained Large Language Model fine-tuning
Abstract
Fine-tuning a Language Model (LM) requires a large, categorized, and annotated corpus. However, such corpora are scarce, and manual annotation is costly. Distant Supervision has emerged as an alternative, and it can draw on Semantic Resources (SR). Nevertheless, gaps remain in how SR are used to reduce annotation cost. This article proposes PREAnoTe, an approach that supports annotation through regular expression rules guided by a metamodel and SR. The experiments showed promising results: 95% accuracy for entities and 76% for relations, culminating in a fine-tuned LM with 86% precision and coverage (recall).
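As a rough illustration of the idea summarized in the abstract, the sketch below annotates entities with regular expressions built from a small term list and then keeps only the relations that a simple metamodel allows. It is not the PREAnoTe implementation: the term list `SEMANTIC_RESOURCE`, the rule table `METAMODEL`, the entity and relation labels, and both function names are hypothetical and chosen only for this example.

```python
import re

# Hypothetical semantic resource: entity type -> known terms (illustrative only).
SEMANTIC_RESOURCE = {
    "UNIT": ["1st Infantry Battalion", "2nd Cavalry Brigade"],
    "LOCATION": ["Hill 402", "Blue River"],
}

# Hypothetical metamodel: allowed (head type, relation, tail type) combinations,
# each paired with a regular expression for the trigger expected between them.
METAMODEL = [
    ("UNIT", "occupies", "LOCATION", r"\b(?:occupies|occupied|holds)\b"),
    ("UNIT", "supports", "UNIT", r"\b(?:supports|supported)\b"),
]


def annotate_entities(text):
    """Return sorted (start, end, label, surface) tuples for every known term."""
    entities = []
    for label, terms in SEMANTIC_RESOURCE.items():
        # Longest terms first, so a longer term wins over a shorter prefix.
        ordered = sorted(terms, key=len, reverse=True)
        alternatives = "|".join(re.escape(term) for term in ordered)
        pattern = re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)
        for match in pattern.finditer(text):
            entities.append((match.start(), match.end(), label, match.group(0)))
    return sorted(entities)


def annotate_relations(text, entities):
    """Return (head, relation, tail) triples permitted by the metamodel."""
    relations = []
    for head in entities:
        for tail in entities:
            # Only consider pairs where the head ends before the tail starts.
            if head[1] > tail[0]:
                continue
            between = text[head[1]:tail[0]]
            for head_type, relation, tail_type, trigger in METAMODEL:
                types_match = head[2] == head_type and tail[2] == tail_type
                if types_match and re.search(trigger, between):
                    relations.append((head[3], relation, tail[3]))
    return relations
```

Calling `annotate_entities(sentence)` and passing its result to `annotate_relations(sentence, entities)` yields weak labels that could then be reviewed and exported as training data. In this sketch the metamodel acts purely as a filter: a trigger word only produces a relation when the entity types on both sides match an allowed combination.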
Keywords:
Large Language Model, Natural Language Processing, Named Entity Recognition, Relation Extraction
References
Avelino, J., Rosa, G., Danon, G., Cordeiro, K., and Cavalcanti, M. C. (2024). Knowledge Graph generation from text using Supervised Approach supported by a Relation Metamodel: An application in C2 domain. In Proceedings of the 26th International Conference on Enterprise Information Systems - Volume 1: ICEIS, pages 281–288. INSTICC, SciTePress.
BRASIL (2018). Glossário de termos e expressões para uso no Exército. Exército. Estado-Maior.
Caseli, H. M. and Nunes, M. G. V., editors (2023). Processamento de Linguagem Natural: Conceitos, Técnicas e Aplicações em Português. BPLN.
Collovini, S., Gonçalves, P. N., Cavalheiro, G., Santos, J., and Vieira, R. (2020). Relation Extraction for Competitive Intelligence. In International Conference on Computational Processing of the Portuguese Language, pages 249–258. Springer.
Fries, J. A., Steinberg, E., Khattar, S., Fleming, S. L., Posada, J., Callahan, A., and Shah, N. H. (2021). Ontology-driven weak supervision for clinical entity classification in electronic health records. Nature Communications, 12(1):2017.
Hogan, A., Blomqvist, E., Cochez, M., d’Amato, C., de Melo, G., Gutierrez, C., Kirrane, S., Gayo, J. E. L., Navigli, R., Neumaier, S., Ngomo, A.-C. N., Polleres, A., Rashid, S. M., Rula, A., Schmelzeisen, L., Sequeda, J., Staab, S., and Zimmermann, A. (2021). Knowledge Graphs. ACM Computing Surveys, 54(4).
Kent, W. (2012). Data and Reality: A Timeless Perspective on Perceiving and Managing Information. Technics Publications.
Liu, P., Qian, L., Zhao, X., and Tao, B. (2023). The construction of Knowledge Graphs in the Aviation Assembly Domain Based on a Joint Knowledge Extraction Model. IEEE Access, 11:26483–26495.
Mintz, M., Bills, S., Snow, R., and Jurafsky, D. (2009). Distant Supervision for Relation Extraction without Labeled Data. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 1003–1011, Suntec, Singapore. Association for Computational Linguistics.
Souza, F., Nogueira, R., and Lotufo, R. (2020). BERTimbau: Pretrained BERT Models for Brazilian Portuguese. In Cerri, R. and Prati, R. C., editors, Intelligent Systems, pages 403–417, Cham. Springer International Publishing.
Zhou, J., Li, X., Wang, S., and Song, X. (2022). NER-based Military Simulation Scenario development process. The Journal of Defense Modeling and Simulation, 20(4):563–575.
Published
2024-10-14
How to Cite
AVELINO, Jones O.; ROSA, Giselle F.; DANON, Gustavo R.; CORDEIRO, Kelli F.; CAVALCANTI, Maria Cláudia. PREAnoTe: A corpus annotation approach for pre-trained Large Language Model fine-tuning. In: BRAZILIAN SYMPOSIUM ON DATABASES (SBBD), 39., 2024, Florianópolis/SC. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2024. p. 806-812. ISSN 2763-8979. DOI: https://doi.org/10.5753/sbbd.2024.242494.
