Integrating Machine Learning Model Ensembles to the SAVIME Database System

  • Anderson Silva, Laboratório Nacional de Computação Científica (LNCC)
  • Patrick Valduriez, INRIA / University of Montpellier / CNRS / LIRMM
  • Fabio Porto, Laboratório Nacional de Computação Científica (LNCC)

Abstract


The integration of machine learning algorithms into database systems has opened new opportunities in areas ranging from indexing to query optimization. In this paper, we describe the integration into a DBMS of an approach that automatically computes model ensembles to answer a predictive query. We have extended the SAVIME multi-dimensional array DBMS by adding a new function to its query language and by implementing the dataflow for ensemble model selection and allocation in its query processing component. We present initial experimental results comparing its performance against a pure Python implementation of the ensemble approach. Notably, the C++ implementation within SAVIME is up to 4 times faster than the Python implementation.

Keywords: Data Management Systems, Machine Learning, Ensemble Classifier

Published
19 September 2022
SILVA, Anderson; VALDURIEZ, Patrick; PORTO, Fabio. Integrating Machine Learning Model Ensembles to the SAVIME Database System. In: WORKSHOP BRASILEIRO DE INTEGRAÇÃO DE BANCOS DE DADOS E INTELIGÊNCIA ARTIFICIAL - SIMPÓSIO BRASILEIRO DE BANCO DE DADOS (SBBD), 37., 2022, Búzios. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2022. p. 231-238. DOI: https://doi.org/10.5753/sbbd_estendido.2022.21870.