Detection and Application of Data Dependencies

Abstract


Data dependencies play a critical role in core areas of data management such as data quality, integration, and analysis. This thesis contributes to three problems related to such dependencies. The first is dependency detection. We study the detection of denial constraints, which generalize other types of dependencies and can express complex data quality rules. We present an algorithm for discovering denial constraints and evaluate it in a variety of scenarios. Compared with state-of-the-art solutions, our algorithm significantly reduces detection runtime. The second problem concerns applying dependencies to improve data consistency. We show that evidence extracted from datasets can be used to discover constraints that hold approximately and that identify, with good precision and recall, inconsistencies in the input dataset. We also present an error detection system based on denial constraints that runs up to three orders of magnitude faster than state-of-the-art solutions, especially for larger datasets and complex constraints. The third problem is applying dependencies in query optimization. We present a system for the automatic detection and selection of functional dependencies based on representations extracted from workloads. Our experiments show that applying the selected dependencies can reduce the overall response time of multiple queries. These contributions were published in renowned national (SBBD) and international (PVLDB, CIKM, and DEXA) venues, and they enabled national cooperation with federal universities (UFPR and UTFPR) as well as international cooperation with research institutes (HPI-Germany and SnT-Luxembourg).
Keywords: data dependency, dependency detection, denial constraints
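
To make the notion of a denial constraint concrete, the following is a minimal sketch and not the algorithm or system presented in the thesis. It assumes a hypothetical toy relation with columns `state`, `salary`, and `tax`, and a hypothetical rule stating that no two tuples may share a state while the one with the lower salary pays the higher tax. The names `rows`, `violates`, and `find_violations` are illustrative only, and the naive pairwise check is exactly the quadratic work that efficient detection systems aim to avoid.

```python
from itertools import permutations

# Hypothetical toy relation; the values are made up for illustration.
rows = [
    {"name": "Ana", "state": "PR", "salary": 5000, "tax": 500},
    {"name": "Bruno", "state": "PR", "salary": 6000, "tax": 400},
    {"name": "Carla", "state": "SP", "salary": 7000, "tax": 900},
]


def violates(t, u):
    """Return True if the ordered pair (t, u) satisfies every predicate of the
    denial constraint: NOT (t.state = u.state AND t.salary < u.salary AND t.tax > u.tax).
    A pair that satisfies all predicates at once is a violation."""
    return (
        t["state"] == u["state"]
        and t["salary"] < u["salary"]
        and t["tax"] > u["tax"]
    )


def find_violations(relation):
    """Naive check over all ordered pairs of distinct tuples."""
    indexed = list(enumerate(relation))
    return [(i, j) for (i, t), (j, u) in permutations(indexed, 2) if violates(t, u)]


print(find_violations(rows))
```

A functional dependency such as `zip -> city` fits the same pattern, as the constraint NOT (t.zip = u.zip AND t.city != u.city), which is one way to see that denial constraints generalize other dependency types.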

Published
2021-10-04
PENA, Eduardo Henrique Monteiro; CUNHA DE ALMEIDA, Eduardo. Detection and Application of Data Dependencies. In: THESIS AND DISSERTATION CONTEST (CTDBD) - BRAZILIAN SYMPOSIUM ON DATABASES (SBBD), 36., 2021, Rio de Janeiro. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2021. p. 183-188. DOI: https://doi.org/10.5753/sbbd_estendido.2021.18183.