ACERPI: An approach for ordinances collection, information extraction and entity resolution

  • Christian Schmitz Universidade Federal do Rio Grande do Sul (UFRGS)
  • Serigne K. Mbaye Instituto Federal do Rio Grande do Sul (IFRS)
  • Edimar Manica Instituto Federal do Rio Grande do Sul (IFRS)
  • Renata Galante Universidade Federal do Rio Grande do Sul (UFRGS)

Abstract


Ordinances are documents issued by federal institutions that contain, among other things, information about their staff. These documents are available in public repositories that usually offer no filtering or advanced search over document contents. This paper presents ACERPI, an approach that identifies the people mentioned in ordinances to help users find the documents of interest. ACERPI combines techniques to discover, obtain, convert, and structure documents, extract information, and link employee entities. Experiments on two real datasets showed a recall of 72.7% for our named entity recognition model, trained with only 534 samples, and an F1 measure of 90% for the entity resolution technique.
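To illustrate what linking employee entities can involve, the following is a minimal sketch that groups person-name mentions referring to the same individual. It is not ACERPI's actual technique, which the abstract does not detail. The function names (`normalize`, `resolve`, `f1`), the greedy clustering strategy, and the 0.9 similarity threshold are all assumptions chosen for illustration, and the sample names are made up.

```python
from __future__ import annotations

import unicodedata
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase a name, strip accents, and collapse repeated whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(no_accents.lower().split())


def resolve(mentions: list[str], threshold: float = 0.9) -> list[list[str]]:
    """Greedily cluster mentions whose normalized forms are similar enough.

    Hypothetical strategy: each mention joins the first cluster whose
    representative reaches the similarity threshold, else starts a new one.
    """
    clusters: list[tuple[str, list[str]]] = []  # (representative, members)
    for mention in mentions:
        key = normalize(mention)
        for representative, members in clusters:
            if SequenceMatcher(None, representative, key).ratio() >= threshold:
                members.append(mention)
                break
        else:
            clusters.append((key, [mention]))
    return [members for _, members in clusters]


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Made-up mentions, as they might appear across different ordinances.
sample = ["João da Silva", "JOAO DA SILVA", "Maria Souza", "Joao  da Silva"]
groups = resolve(sample)
```

In this sketch the three spellings of the first name fall into one group because they normalize to the same string, while the second person forms a separate group. A real system would likely need blocking and additional attributes (such as an employee ID) to separate homonyms.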

Keywords: Information Extraction, Named Entity Recognition, Entity Resolution

Published
04/10/2021
SCHMITZ, Christian; MBAYE, Serigne K.; MANICA, Edimar; GALANTE, Renata. ACERPI: An approach for ordinances collection, information extraction and entity resolution. In: SIMPÓSIO BRASILEIRO DE BANCO DE DADOS (SBBD), 36., 2021, Rio de Janeiro. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2021. p. 97-108. ISSN 2763-8979. DOI: https://doi.org/10.5753/sbbd.2021.17869.