MariNER: A Dataset for Historical Brazilian Portuguese Named Entity Recognition

  • João Lucas Luz Lima Sarcinelli USP
  • Marina Lages Gonçalves Teixeira USP
  • Jade Bortot de Paiva USP
  • Diego Furtado Silva USP

Abstract


Named Entity Recognition (NER) is a fundamental Natural Language Processing (NLP) task that aims to identify entity mentions in text and classify them into categories. While languages such as English have many high-quality resources for this task, Brazilian Portuguese still has few gold-standard NER datasets, especially for specific domains. In particular, this paper considers the importance of NER for analyzing historical texts in the context of digital humanities. To address this gap, this work outlines the construction of MariNER: Mapeamento e Anotações de Registros hIstóricos para NER (Mapping and Annotation of Historical Records for NER), the first gold-standard dataset for early 20th-century Brazilian Portuguese, with more than 9,000 manually annotated sentences. We also assess and compare the performance of state-of-the-art NER models on the dataset.
Published
29/09/2025
SARCINELLI, João Lucas Luz Lima; TEIXEIRA, Marina Lages Gonçalves; PAIVA, Jade Bortot de; SILVA, Diego Furtado. MariNER: A Dataset for Historical Brazilian Portuguese Named Entity Recognition. In: BRAZILIAN CONFERENCE ON INTELLIGENT SYSTEMS (BRACIS), 35., 2025, Fortaleza/CE. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 364-378. ISSN 2643-6264.