Discovery and Application of Data Dependencies
Abstract
This work contributes to central problems concerning data dependencies. The first problem is the discovery of dependencies with high expressive power. We introduce an efficient algorithm for the discovery of denial constraints, a type of dependency expressive enough to generalize other important types of dependencies and to capture complex business rules. The second problem is the application of dependencies to improve data consistency. We present a modification to traditional dependency discovery approaches that lets the discovery algorithms return reliable results even when they run on data containing some inconsistent records. We also present a system for detecting dependency violations efficiently. Our extensive experimental evaluation shows that this system is up to three orders of magnitude faster than state-of-the-art solutions, especially for larger datasets and massive numbers of dependency violations. The last contribution is the application of dependencies in query optimization. We present a system for the automatic discovery and selection of functional dependencies. Our experimental evaluation shows that the system selects relevant functional dependencies that help reduce the overall query response time for various types of query workloads.
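To make the notion of a denial constraint concrete, the following is a minimal Python sketch, not the algorithm or system from this work. It assumes a hypothetical toy relation and two hypothetical constraints: one business rule (within the same state, a lower salary must not carry a higher tax) and one functional dependency (name determines state) written as a denial constraint. The names `rows`, `dc_tax`, `fd_as_dc`, and `violations` are illustrative only, and the check is a naive quadratic scan over tuple pairs.

```python
from itertools import permutations
import operator

# Hypothetical toy relation; attribute names and values are illustrative only.
rows = [
    {"name": "Ann", "state": "NY", "salary": 5000, "tax": 500},
    {"name": "Bob", "state": "NY", "salary": 6000, "tax": 400},
    {"name": "Cid", "state": "CA", "salary": 7000, "tax": 300},
]

# A denial constraint lists predicates that must never all hold at once
# for an ordered pair of distinct tuples (t, u).
# Each predicate is (attribute of t, comparison, attribute of u).
dc_tax = [
    ("state", operator.eq, "state"),    # t.state = u.state
    ("salary", operator.lt, "salary"),  # t.salary < u.salary
    ("tax", operator.gt, "tax"),        # t.tax > u.tax
]

# The functional dependency name -> state, expressed as a denial constraint:
# no two tuples may agree on name while disagreeing on state.
fd_as_dc = [
    ("name", operator.eq, "name"),
    ("state", operator.ne, "state"),
]

def violations(rows, dc):
    """Return index pairs (i, j) for which every predicate of dc holds."""
    return [
        (i, j)
        for i, j in permutations(range(len(rows)), 2)
        if all(op(rows[i][a], rows[j][b]) for a, op, b in dc)
    ]

print(violations(rows, dc_tax))    # Ann and Bob violate the tax rule
print(violations(rows, fd_as_dc))  # no pair violates name -> state
```

The pairwise scan grows quadratically with the number of tuples, which illustrates why efficient violation detection is a problem in its own right on larger datasets.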
Keywords:
data quality, data consistency, data dependencies, integrity constraints, error detection
Published
2021-07-18
How to Cite
PENA, Eduardo Henrique Monteiro; DE ALMEIDA, Eduardo Cunha. Discovery and Application of Data Dependencies. In: THESIS AND DISSERTATION CONTEST (CTD), 34., 2021, Evento Online. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2021. p. 1-6. ISSN 2763-8820. DOI: https://doi.org/10.5753/ctd.2021.15749.
