A Clustering Visualization Query Language

  • Ana Sodré, Federal University of Paraná
  • Luis Floriano, Federal University of Paraná
  • Aurora Pozo, Federal University of Paraná
  • Carmem Hara, Federal University of Paraná

Abstract


In recent years, machine learning techniques have been used in several areas and applications. However, applying these techniques still depends on expert professionals who are able to execute the required sequence of tasks. In the database area, SQL has democratized the use of database management systems (DBMS) for end users through a simple declarative language that resembles natural language. In this paper, we propose the Clustering Visualization Query Language (CVQL), which extends SQL to execute clustering and visualization tasks. Clustering is a technique used to divide data into groups that share similar features. Visualizing these groups in different graphical forms is an essential functionality that lets end users analyze the results. CVQL was implemented using the MySQL DBMS and the scikit-learn library. To present the system, we consider an 8-line query example on a real database of Covid-19 cases in the USA. Executing the same sequence of tasks would require 140 lines of Python code, which shows the usefulness of CVQL.
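As a rough illustration of what "dividing data into groups that share similar features" involves when written by hand, the following minimal sketch implements a plain k-means loop in standard-library Python. It is not CVQL and not the authors' implementation (which relies on scikit-learn). The function name `kmeans`, the toy rows, and the deterministic initialization are assumptions made only for this example.

```python
import math


def kmeans(points, k, iterations=20):
    """Plain k-means over a list of numeric tuples; returns (labels, centroids)."""
    # Assumption for reproducibility: start from k evenly spaced input points
    # instead of a random initialization.
    step = len(points) // k
    centroids = [points[i * step] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its current members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels, centroids


# Hypothetical toy rows (two numeric features each), not data from the paper.
rows = [(1.0, 1.2), (0.8, 1.0), (1.1, 0.9), (8.0, 8.5), (7.9, 9.0), (8.3, 8.1)]
labels, centroids = kmeans(rows, k=2)

# A text summary stands in for the graphical visualization step.
for cluster_id in range(2):
    members = [r for r, label in zip(rows, labels) if label == cluster_id]
    print(f"cluster {cluster_id}: {len(members)} rows, centroid {centroids[cluster_id]}")
```

Even this toy version leaves out data loading, preprocessing, parameter selection, and plotting, which are the kinds of steps a declarative query is meant to hide from the end user.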

Keywords: query language, SQL, clustering, visualization, machine learning

Published
2020-10-20
SODRÉ, Ana; FLORIANO, Luis; POZO, Aurora; HARA, Carmem. A Clustering Visualization Query Language. In: NATIONAL MEETING ON ARTIFICIAL AND COMPUTATIONAL INTELLIGENCE (ENIAC), 17., 2020, Evento Online. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2020. p. 449-458. ISSN 2763-9061. DOI: https://doi.org/10.5753/eniac.2020.12150.