CoFFee: A Co-occurrence and Frequency-Based Approach to Schema Mining
A wide range of applications use semi-structured data. Such data are heterogeneous and do not follow a predefined schema, i.e., they are schema-less. This lack of structure makes the data difficult to use, since many applications depend on a schema to perform their tasks. We therefore propose CoFFee, a schema mining approach that, given a set of heterogeneous schemas, produces a summarized schema containing a set of core attributes. To this end, CoFFee uses a strategy that combines the co-occurrence and the frequency of attributes: it models a set of entity schemas as a graph and uses centrality metrics to capture the co-occurrence between attributes. We evaluated CoFFee on data extracted from six DBpedia classes and compared it with two state-of-the-art approaches. The results show that CoFFee produces a summarized schema of good quality, outperforming the baselines by an average of 22% in F1 score.
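The idea of combining frequency with graph-based co-occurrence can be illustrated with a minimal sketch. It is not the CoFFee implementation: the abstract does not state which centrality metric, weighting, or cut-off is used, so degree centrality, the linear weight `alpha`, the cut-off `threshold`, and the function name `summarize_schema` are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def summarize_schema(entity_schemas, alpha=0.5, threshold=0.5):
    """Return the attributes whose combined score reaches `threshold`.

    `alpha` and `threshold` are hypothetical parameters chosen for this
    sketch; they are not taken from the paper.
    """
    schemas = [set(s) for s in entity_schemas]
    n_entities = len(schemas)

    # Frequency: fraction of entity schemas that contain each attribute.
    counts = Counter(attr for s in schemas for attr in s)
    frequency = {a: c / n_entities for a, c in counts.items()}

    # Co-occurrence graph: attributes are nodes, and an edge links two
    # attributes that appear together in at least one entity schema.
    neighbors = {a: set() for a in counts}
    for s in schemas:
        for a, b in combinations(sorted(s), 2):
            neighbors[a].add(b)
            neighbors[b].add(a)

    # Degree centrality (assumed metric): neighbors / (number of nodes - 1).
    denom = max(len(neighbors) - 1, 1)
    centrality = {a: len(nb) / denom for a, nb in neighbors.items()}

    # Assumed combination: weighted average of frequency and centrality.
    scores = {
        a: alpha * frequency[a] + (1 - alpha) * centrality[a] for a in counts
    }
    return sorted(a for a, score in scores.items() if score >= threshold)


# Toy entity schemas (made-up attribute names, not DBpedia data).
example_schemas = [
    {"name", "birthDate", "birthPlace"},
    {"name", "birthDate", "team"},
    {"name", "birthPlace"},
    {"name", "nickname"},
]
print(summarize_schema(example_schemas, threshold=0.6))
```

In this toy input, rare attributes that co-occur with few others score low and are dropped, while attributes that are both frequent and well connected are kept as core attributes.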