Construction of the Semantic Dataset of Legal Entities
Abstract
The Federal Revenue of Brazil publishes registration data on companies, establishments, and corporate bodies through the National Register of Legal Entities (CNPJ), a reliable and accessible data source. However, obtaining and managing this data is not trivial. This work presents the first initiative to build a semantic dataset (DS) of Legal Entities, based on a Data Lakehouse and semantic architecture. The Data Lakehouse is a recent data architecture that combines the advantages of data lakes and data warehouses to provide a unified, efficient, and manageable storage layer. This article describes the dataset construction process, provides the resources, scripts, and artifacts used, explores the dataset through GraphDB, and presents possible use cases.
