Efficient online tree, rule-based, and distance-based algorithms

  • Saulo Martiello Mastelini USP
  • André Carlos Ponce de Leon Ferreira de Carvalho USP

Resumo


The fast development of technology resulted in the constant production of data in different forms and from different sources. Contrary to what was observed in the first machine learning (ML) research works, there might be too much data to handle with traditional algorithms. Changes in the underlying data distributions might also render traditional ML solutions useless in real-world applications. Online ML (OML) aims to create solutions able to process data incrementally, with limited computation resource usage, and to deal with time-changing data distributions. Unfortunately, we have seen a recent growing trend in creating OML algorithms that solely focus on predictive performance and overlook computational costs. In regression tasks, the problem is even more pronounced when considering some of the most popular OML solutions: decision trees, decision rules, and ensembles of these models. In this thesis, we created improved and efficient OML algorithms from the mentioned algorithmic families by focusing on decreasing time and memory costs while keeping competitive predictive performance. Our proposals are either novel standalone OML algorithms or additions that can be paired with any existing tree or decision rule regressors.

Referências

Bahri, M., Bifet, A., Gama, J., Gomes, H. M., and Maniu, S. (2021). Data stream analysis: Foundations, major tasks and tools. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 11(3):e1405.

Bifet, A. and Gavaldà, R. (2009). Adaptive learning from evolving data streams. In International Symposium on Intelligent Data Analysis, pages 249–260. Springer.

Bifet, A., Gavaldà, R., Holmes, G., and Pfahringer, B. (2018). Machine Learning for Data Streams with Practical Examples in MOA. MIT Press. [link].

Cano, A. and Krawczyk, B. (2022). Rose: robust online self-adjusting ensemble for continual learning on imbalanced drifting data streams. Machine Learning, pages 1–39.

Domingos, P. and Hulten, G. (2000). Mining high-speed data streams. In Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 71–80, Boston, MA, USA. ACM.

Dong, W., Moses, C., and Li, K. (2011). Efficient k-nearest neighbor graph construction for generic similarity measures. In Proceedings of the 20th international conference on World wide web, pages 577–586.

Duarte, J. and Gama, J. (2015). Multi-target regression from high-speed data streams with adaptive model rules. In Data Science and Advanced Analytics (DSAA), 2015. 36678 2015. IEEE International Conference on, pages 1–10, Campus des Cordeliers, Paris, France. IEEE.

Duarte, J., Gama, J., and Bifet, A. (2016). Adaptive model rules from high-speed data streams. ACM Transactions on Knowledge Discovery from Data (TKDD), 10(3):1–22.

Gama, J. (2010). Knowledge discovery from data streams. Chapman and Hall/CRC.

García Martín, E. (2020). Energy Efficiency in Machine Learning: Approaches to Sustainable Data Stream Mining. PhD thesis, Blekinge Tekniska Högskola.

García-Martín, E., Rodrigues, C. F., Riley, G., and Grahn, H. (2019). Estimation of energy consumption in machine learning. Journal of Parallel and Distributed Computing, 134:75–88.

Gomes, H. M., Barddal, J. P., Ferreira, L. E. B., and Bifet, A. (2018). Adaptive random forests for data stream regression. In 26th European Symposium on Artificial Neural Networks, ESANN 2018, Bruges, Belgium, April 25-27, 2018.

Hoeffding, W. (1963). Probability inequalities for sums of bounded random variables. Journal of the American statistical association.

Ikonomovska, E., Gama, J., and Džeroski, S. (2011a). Incremental multi-target model trees for data streams. In Proceedings of the 2011 ACM symposium on applied computing, pages 988–993. ACM.

Ikonomovska, E., Gama, J., and Džeroski, S. (2011b). Learning model trees from evolving data streams. Data mining and knowledge discovery, 23(1):128–168.

Ikonomovska, E., Gama, J., and Džeroski, S. (2015). Online tree-based ensembles and option trees for regression on evolving data streams. Neurocomputing, 150:458–470.

Knuth, D. E. (2014). Art of computer programming, volume 2: Seminumerical algorithms. Addison-Wesley Professional.

Korycki, Ł. and Krawczyk, B. (2020). Adaptive deep forest for online learning from drifting data streams. arXiv preprint arXiv:2010.07340.

Mastelini, S. M. and de Carvalho, A. C. P. d. L. F. (2021). Using dynamical quantization to perform split attempts in online tree regressors. Pattern Recognition Letters, 145:37–42.

Mastelini, S. M., Montiel, J., Gomes, H. M., Bifet, A., Pfahringer, B., and de Carvalho, A. C. (2021). Fast and lightweight binary and multi-branch Hoeffding Tree Regressors. In 2021 International Conference on Data Mining Workshops (ICDMW), pages 380–388. IEEE.

Mastelini, S. M., Nakano, F. K., Vens, C., de Leon Ferreira, A. C. P., et al. (2022). Online extra trees regressor. IEEE Transactions on Neural Networks and Learning Systems.

Mastelini, S. M. and Ponce de Leon Ferreira de Carvalho, A. C. (2020). 2cs: correlation-guided split candidate selection in hoeffding tree regressors. In Brazilian Conference on Intelligent Systems, pages 337–351. Springer.

Mastelini, S. M., Veloso, B., Halford, M., de Leon Ferreira, A. C. P., Gama, J., et al. (2024). Swinn: Efficient nearest neighbor search in sliding windows using graphs. Information Fusion, 101:101979.

Montiel, J., Halford, M., Mastelini, S. M., Bolmier, G., Sourty, R., Vaysse, R., Zouitine, A., Gomes, H. M., Read, J., Abdessalem, T., and Bifet, A. (2021). River: machine learning for streaming data in python. Journal of Machine Learning Research, 22(110):1–8.

Osojnik, A. (2017). Structured output prediction on data streams. PhD thesis, Ph. D. thesis, Joºef Stefan International Postgraduate School.

Osojnik, A., Panov, P., and Džeroski, S. (2018). Tree-based methods for online multi-target regression. Journal of Intelligent Information Systems, 50(2):315–339.

Palli, A. S., Jaafar, J., Gilal, A. R., Alsughayyir, A., Gomes, H. M., Alshanqiti, A., and Omar, M. (2024). Online machine learning from non-stationary data streams in the presence of concept drift and class imbalance: A systematic review. Journal of Information and Communication Technology, 23(1):105–139.

Pfahringer, B., Holmes, G., and Kirkby, R. (2008). Handling numeric attributes in hoeffding trees. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 296–307. Springer.

Russell, S. J. and Norvig, P. (2016). Artificial intelligence: a modern approach. Malaysia; Pearson Education Limited,.

Shimomura, L. C., Oyamada, R. S., Vieira, M. R., and Kaster, D. S. (2021). A survey on graph-based methods for similarity searches in metric spaces. Information Systems, 95:101507.
Publicado
21/07/2024
MASTELINI, Saulo Martiello; CARVALHO, André Carlos Ponce de Leon Ferreira de. Efficient online tree, rule-based, and distance-based algorithms. In: CONCURSO DE TESES E DISSERTAÇÕES (CTD), 37. , 2024, Brasília/DF. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2024 . p. 21-27. ISSN 2763-8820. DOI: https://doi.org/10.5753/ctd.2024.1859.