Statistical Validation of Column Matching in the Database Schema Evolution of the Brazilian Public School Census

Abstract

Publicly available datasets are released in successive versions, and each version may change the data: attributes are added or removed, data types change, and values or their semantics are modified. Integrating these datasets into a relational database poses a significant challenge: how can we keep track of the evolving database schema while incorporating different versions of the data sources? This paper presents a statistical methodology to validate the integration of 12 years of open-access datasets from Brazil’s School Census, which the Brazilian Ministry of Education (MEC) releases in a new version every year. We apply several statistical tests to match attributes in the datasets of a given year with their potential equivalents in the datasets of later years. The results show that the Kolmogorov–Smirnov test successfully matches columns from different dataset versions in about 90% of cases.

Keywords: Relational Database, Schema Evolution, Statistical Methods
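
The column-matching idea described in the abstract can be illustrated with a small sketch. It is not the paper's implementation: the column names, the sample values, and the `max_distance` threshold are all hypothetical, and the sketch compares only the two-sample Kolmogorov–Smirnov statistic (the largest gap between two empirical distribution functions) rather than a p-value.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    # The gap between two step functions can only peak at an observed value.
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in set(a) | set(b)
    )


def match_columns(old_table, new_table, max_distance=0.2):
    """Pair each old column with the new column whose distribution is closest.

    Tables are dicts mapping a column name to a list of numeric values.
    max_distance is an illustrative threshold (an assumption); old columns
    with no candidate within it map to None.
    """
    matches = {}
    for old_name, old_values in old_table.items():
        best_name, best_d = None, max_distance
        for new_name, new_values in new_table.items():
            d = ks_statistic(old_values, new_values)
            if d <= best_d:
                best_name, best_d = new_name, d
        matches[old_name] = best_name
    return matches


# Hypothetical columns from two yearly releases (made-up values).
old_release = {
    "num_students": [120, 340, 95, 410, 230, 180],
    "num_rooms": [6, 12, 4, 15, 9, 8],
    "year_built": [1950, 1972, 1988, 1995, 2003, 2010],
}
new_release = {
    "qt_enrolled": [125, 335, 100, 400, 240, 175],
    "qt_classrooms": [6, 13, 5, 14, 9, 7],
}

matches = match_columns(old_release, new_release)
print(matches)
```

In this toy data, the renamed columns are paired because their value distributions are close, while `year_built` has no counterpart in the later release and is left unmatched. On real data, a library routine such as `scipy.stats.ks_2samp`, which also returns a p-value, would likely be a better basis for the decision than a fixed threshold on the statistic.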

Published
14/10/2024
YAMANAKA, Muriki G.; DE ALMEIDA, Diogo H.; DE ALMEIDA, Paulo Ricardo Lisboa; DOMINICO, Simone; PERES, Leticia M.; SUNYE, Marcos S.; ALMEIDA, Eduardo C. de. Statistical Validation of Column Matching in the Database Schema Evolution of the Brazilian Public School Census. In: SIMPÓSIO BRASILEIRO DE BANCO DE DADOS (SBBD), 39., 2024, Florianópolis/SC. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2024. p. 498-509. ISSN 2763-8979. DOI: https://doi.org/10.5753/sbbd.2024.240840.