Corpus Memórias Paroquiais: Advances in Named Entity Recognition
Abstract
This paper describes recent developments in named entity recognition (NER) on the Parish Memories (Memórias Paroquiais) historical corpus. The corpus has been extended with new annotation categories for fauna and flora. The paper also discusses a study of how well the model adapts to original data that has not been normalized.
Published
2025-09-29
How to Cite
VIEIRA, Renata; CAMERON, Helena; OLIVAL, Fernanda; SANTOS, Joaquim. Corpus Memórias Paroquiais: Advances in Named Entity Recognition. In: BRAZILIAN SYMPOSIUM IN INFORMATION AND HUMAN LANGUAGE TECHNOLOGY (STIL), 16., 2025, Fortaleza/CE. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 478-489. DOI: https://doi.org/10.5753/stil.2025.37848.
