Class Schema Discovery from Semi-Structured Data




Schema Discovery, Entity Classes, Semi-structured Data, Class Attributes


A wide range of applications use semi-structured data. A characteristic of this type of data is its flexible structure, i.e., it does not rely on schema-based constraints to define its entities. Entities of the same kind (i.e., class) usually do not present the same attribute set. However, some data processing and management applications rely on a data schema to perform their tasks. In this context, the lack of structure makes it challenging for these applications to use such data. In this paper, we propose CoFFee, an approach to class schema discovery. Given a set of heterogeneous entity schemata found within a class, CoFFee provides a summarized set of core attributes. To this end, CoFFee applies a strategy that combines attribute co-occurrence and frequency. It models a set of entity schemata as a graph and uses centrality metrics to capture the co-occurrence between attributes. We evaluated CoFFee using data from 12 classes extracted from DBpedia and e-Commerce datasets, and benchmarked it against two other state-of-the-art approaches. The results show that: i) CoFFee effectively provides a summarized schema, minimizing non-relevant attributes without compromising the data retrieval rate; and ii) CoFFee produces a summarized schema of good quality, outperforming the baselines by an average of 19% in F1 score.
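To make the idea of combining attribute frequency with co-occurrence centrality more concrete, the following is a minimal sketch and not the CoFFee algorithm itself. It assumes degree centrality as the centrality metric, an unweighted mean of relative frequency and centrality as the score, and a fixed cut-off; the function name `core_attributes`, the `threshold` parameter, and the sample attribute names are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def core_attributes(entity_schemata, threshold=0.5):
    """Select attributes whose combined score reaches `threshold`.

    Assumption (illustrative only): score = mean of the attribute's
    relative frequency and its degree centrality in the attribute
    co-occurrence graph. The paper's actual scoring may differ.
    """
    schemata = [set(s) for s in entity_schemata]
    n = len(schemata)
    if n == 0:
        return []

    # Frequency: in how many entity schemata each attribute appears.
    frequency = Counter(attr for s in schemata for attr in s)

    # Co-occurrence graph: attributes are nodes; an undirected edge links
    # two attributes that appear together in at least one entity schema.
    neighbors = {attr: set() for attr in frequency}
    for s in schemata:
        for a, b in combinations(sorted(s), 2):
            neighbors[a].add(b)
            neighbors[b].add(a)

    # Degree centrality: node degree divided by the maximum possible degree.
    max_degree = max(len(frequency) - 1, 1)

    scores = {}
    for attr, count in frequency.items():
        relative_frequency = count / n
        degree_centrality = len(neighbors[attr]) / max_degree
        scores[attr] = (relative_frequency + degree_centrality) / 2

    return sorted(attr for attr, score in scores.items() if score >= threshold)


# Hypothetical entity schemata from a single class.
example_schemata = [
    {"name", "birthDate", "team"},
    {"name", "birthDate"},
    {"name", "team", "height"},
    {"name", "nickname"},
]
summary = core_attributes(example_schemata)
```

In this toy input, rarely used and weakly connected attributes such as `height` and `nickname` fall below the cut-off, while `name`, `birthDate`, and `team` are retained. Raising `threshold` yields a more compact summary at the cost of dropping attributes that fewer entities share.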








How to Cite

Costa Neto, E., Moreira, J., Barbosa, L., & Salgado, A. C. (2023). Class Schema Discovery from Semi-Structured Data. Journal of Information and Data Management, 14(1).



SBBD 2022 Full papers - Extended Papers