ACERPI: An approach for ordinances collection, information extraction and entity resolution
Abstract
Ordinances are documents issued by federal institutions that contain, among other things, information about their staff. These documents are available in public repositories that usually offer no filtering or advanced search over document contents. This paper presents ACERPI, an approach that identifies the people mentioned in ordinances to help users find the documents of interest. ACERPI combines techniques to discover, obtain, convert, and structure documents, extract information, and link employee entities. Experiments on two real datasets showed a recall of 72.7% for our named entity recognition model, trained with only 534 samples, and an F1 measure of 90% for the entity resolution technique.