Skip to main content

Classifying Potentially Unbounded Hierarchical Data Streams with Incremental Gaussian Naive Bayes

  • Conference paper
  • First Online:

Part of the book series: Lecture Notes in Computer Science ((LNAI,volume 13073))

Abstract

Hierarchical data stream classification inherits the properties and constraints of hierarchical classification and data stream classification concomitantly. Therefore, it requires novel approaches that (i) can handle class hierarchies, (ii) can be updated over time, and (iii) are computationally light-weighted regarding processing time and memory usage. In this study, we propose the Gaussian Naive Bayes for Hierarchical Data Streams (GNB-hDS) method: an incremental Gaussian Naive Bayes for classifying potentially unbounded hierarchical data streams. GNB-hDS uses statistical summaries of the data stream instead of storing actual instances. These statistical summaries allow more efficient data storage, keep constant computational time and memory, and calculate the probability of an instance belonging to a specific class via the Bayes’ Theorem. We compare our method against a technique that stores raw instances, and results show that our method obtains equivalent prediction rates while being significantly faster.

Supported by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES) - Finance Code 001.

This is a preview of subscription content, log in via an institution.

Buying options

Chapter
USD   29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD   79.99
Price excludes VAT (USA)
  • Available as EPUB and PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD   99.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Learn about institutional subscriptions

Notes

  1. 1.

    http://www.ppgia.pucpr.br/~jean.barddal/datasets/GNB-hDS.zip.

References

  1. Aha, D.W., Kibler, D., Albert, M.K.: Instance-based learning algorithms. Mach. Learn. 6(1), 37–66 (1991)

    Google Scholar 

  2. Alcobé, J.: Incremental learning of tree augmented Naive Bayes classifiers. In: Garijo, F.J., Riquelme, J.C., Toro, M. (eds.) IBERAMIA 2002. LNCS (LNAI), vol. 2527, pp. 32–41. Springer, Heidelberg (2002). https://doi.org/10.1007/3-540-36131-6_4

    Chapter  Google Scholar 

  3. Anderson, J.R., Matessa, M.: Explorations of an incremental, Bayesian algorithm for categorization. Mach. Learn. 9(4), 275–308 (1992)

    Google Scholar 

  4. Bahri, M., Maniu, S., Bifet, A.: A sketch-based Naive Bayes algorithms for evolving data streams. In: 2018 IEEE International Conference on Big Data (Big Data), pp. 604–613. IEEE (2018)

    Google Scholar 

  5. Barddal, J.P., Gomes, H.M., Enembreck, F., Pfahringer, B., Bifet, A.: On dynamic feature weighting for feature drifting data streams. In: Frasconi, P., Landwehr, N., Manco, G., Vreeken, J. (eds.) ECML PKDD 2016. LNCS (LNAI), vol. 9852, pp. 129–144. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46227-1_9

    Chapter  Google Scholar 

  6. Barros, R.S., Cabral, D.R., Gonçalves Jr., P.M., Santos, S.G.: RDDM: reactive drift detection method. Expert Syst. Appl. 90, 344–355 (2017)

    Google Scholar 

  7. Bi, W., Kwok, J.T.: Bayes-optimal hierarchical multilabel classification. IEEE Trans. Knowl. Data Eng. 27(11), 2907–2918 (2015)

    Article  Google Scholar 

  8. Bifet, A., Gavalda, R.: Learning from time-changing data with adaptive windowing. In: Proceedings of the 2007 SIAM International Conference on Data Mining, pp. 443–448. SIAM (2007)

    Google Scholar 

  9. Bifet, A., Holmes, G., Pfahringer, B., Kirkby, R., Gavalda, R.: New ensemble methods for evolving data streams. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 139–148 (2009)

    Google Scholar 

  10. Bifet, A., Kirkby, R.: Data stream mining a practical approach (2009)

    Google Scholar 

  11. Bishop, C.M.: Pattern Recognition and Machine Learning. springer, Heidelberg (2006)

    Google Scholar 

  12. Burred, J.J., Lerch, A.: A hierarchical approach to automatic musical genre classification. In: Proceedings of the 6th International Conference on Digital Audio Effects, pp. 8–11. Citeseer (2003)

    Google Scholar 

  13. de Campos Merschmann, L.H., Freitas, A.A.: An extended local hierarchical classifier for prediction of protein and gene functions. In: Bellatreche, L., Mohania, M.K. (eds.) DaWaK 2013. LNCS, vol. 8057, pp. 159–171. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-40131-2_14

    Chapter  Google Scholar 

  14. Cesa-Bianchi, N., Gentile, C., Zaniboni, L.: Incremental algorithms for hierarchical classification. J. Mach. Learn. Res. 7, 31–54 (2006)

    MathSciNet  MATH  Google Scholar 

  15. Chan, T.F., Golub, G.H., LeVeque, R.J.: Algorithms for computing the sample variance: analysis and recommendations. Am. Stat. 37(3), 242–247 (1983)

    MathSciNet  MATH  Google Scholar 

  16. Demšar, J.: Statistical comparisons of classifiers over multiple data sets. J. Mach. Learn. Res. 7, 1–30 (2006)

    Google Scholar 

  17. Frías-Blanco, I., del Campo-Ávila, J., Ramos-Jimenez, G., Morales-Bueno, R., Ortiz-Díaz, A., Caballero-Mota, Y.: Online and non-parametric drift detection methods based on Hoeffding’s bounds. IEEE Trans. Knowl. Data Eng. 27(3), 810–823 (2014)

    Article  Google Scholar 

  18. Friedman, N., Geiger, D., Goldszmidt, M.: Bayesian network classifiers. Mach. Learn. 29(2), 131–163 (1997)

    Article  Google Scholar 

  19. Gama, J.: Knowledge Discovery from Data Streams. Chapman and Hall/CRC (2010)

    Google Scholar 

  20. Gama, J., Sebastião, R., Rodrigues, P.P.: On evaluating stream learning algorithms. Mach. Learn. 90(3), 317–346 (2013)

    Article  MathSciNet  Google Scholar 

  21. Gama, J., Žliobaitė, I., Bifet, A., Pechenizkiy, M., Bouchachia, A.: A survey on concept drift adaptation. ACM Comput. Surv. (CSUR) 46(4), 44 (2014)

    Article  Google Scholar 

  22. Han, J., Pei, J., Kamber, M.: Data Mining: Concepts and Techniques. Elsevier, Amsterdam (2011)

    Google Scholar 

  23. Hesabi, Z.R., Tari, Z., Goscinski, A., Fahad, A., Khalil, I., Queiroz, C.: Data summarization techniques for big data—a survey. In: Khan, S.U., Zomaya, A.Y. (eds.) Handbook on Data Centers, pp. 1109–1152. Springer, New York (2015). https://doi.org/10.1007/978-1-4939-2092-1_38

    Chapter  Google Scholar 

  24. Kiritchenko, S., Famili, F.: Functional annotation of genes using hierarchical text categorization. In: Proceedings of BioLink SIG, ISMB, January 2005

    Google Scholar 

  25. Klawonn, F., Angelov, P.: Evolving extended Naive Bayes classifiers. In: Sixth IEEE International Conference on Data Mining-Workshops (ICDMW 2006), pp. 643–647. IEEE (2006)

    Google Scholar 

  26. Kotsiantis, S.B., Zaharakis, I., Pintelas, P.: Supervised machine learning: a review of classification techniques. Emerg. Artif. Intell. Appl. Comput. Eng. 160, 3–24 (2007)

    Google Scholar 

  27. Nguyen, H.L., Woon, Y.K., Ng, W.K.: A survey on data stream clustering and classification. Knowl. Inf. Syst. 45(3), 535–569 (2015)

    Article  Google Scholar 

  28. Parmezan, A.R.S., Souza, V.M.A., Batista, G.E.A.P.A.: Towards hierarchical classification of data streams. In: Vera-Rodriguez, R., Fierrez, J., Morales, A. (eds.) CIARP 2018. LNCS, vol. 11401, pp. 314–322. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-13469-3_37

    Chapter  Google Scholar 

  29. Pereira, R.M., Bertolini, D., Teixeira, L.O., Silla Jr., C.N., Costa, Y.M.: COVID-19 identification in chest x-ray images on flat and hierarchical classification scenarios. Comput. Methods Programs Biomed. 194, 105532 (2020)

    Google Scholar 

  30. Pontes, E.A.S.: A brief historical overview of the Gaussian curve: from Abraham de Moivre to Johann Carl Friedrich Gauss. Int. J. Eng. Sci. Invent. (IJESI), 28–34 (2018)

    Google Scholar 

  31. Prasad, B.R., Agarwal, S.: Stream data mining: platforms, algorithms, performance evaluators and research trends. Int. J. Database Theory Appl. 9(9), 201–218 (2016)

    Article  Google Scholar 

  32. Quiñonero-Candela, J., Sugiyama, M., Schwaighofer, A., Lawrence, N.D.: Dataset Shift in Machine Learning. The MIT Press, Cambridge (2009)

    Google Scholar 

  33. Seidl, T., Assent, I., Kranen, P., Krieger, R., Herrmann, J.: Indexing density models for incremental learning and anytime classification on data streams. In: Proceedings of the 12th International Conference on Extending Database Technology: Advances in Database Technology, pp. 311–322 (2009)

    Google Scholar 

  34. Shapiro, S.S., Wilk, M.B.: An analysis of variance test for normality (complete samples). Biometrika 52(3/4), 591–611 (1965)

    Article  MathSciNet  Google Scholar 

  35. Silla, C.N., Freitas, A.A.: A survey of hierarchical classification across different application domains. Data Min. Knowl. Discov. 22(1–2), 31–72 (2011)

    Article  MathSciNet  Google Scholar 

  36. Silla Jr., C.N., Freitas, A.A.: A global-model Naive Bayes approach to the hierarchical prediction of protein functions. In: 2009 Ninth IEEE International Conference on Data Mining, pp. 992–997. IEEE (2009)

    Google Scholar 

  37. Souza, V.M.A., Reis, D.M., Maletzke, A.G., Batista, G.E.A.P.A.: Challenges in benchmarking stream learning algorithms with real-world data. Data Min. Knowl. Discov., 1–54 (2020). https://doi.org/10.1007/s10618-020-00698-5

  38. Steinbach, M., Ertöz, L., Kumar, V.: The Challenges of clustering high dimensional data. In: Wille, L.T. (ed.) New Directions in Statistical Physics, pp. 273–309. Springer, Heidelberg (2004). https://doi.org/10.1007/978-3-662-08968-2_16

  39. Tsymbal, A.: The problem of concept drift: definitions and related work. Comput. Sci. Dep. Trinity Coll. Dublin 106(2), 58 (2004)

    Google Scholar 

  40. West, D.: Updating mean and variance estimates: an improved method. Commun. ACM 22(9), 532–535 (1979)

    Article  Google Scholar 

  41. Wilcoxon, F.: Individual comparisons by ranking methods. In: Kotz, S., Johnson, N.L. (eds.) Breakthroughs in Statistics. Springer Series in Statistics (Perspectives in Statistics), pp. 196–202. Springer, New York (1992). https://doi.org/10.1007/978-1-4612-4380-9_16

  42. Wu, F., Zhang, J., Honavar, V.: Learning classifiers using hierarchically structured class taxonomies. In: Zucker, J.-D., Saitta, L. (eds.) SARA 2005. LNCS (LNAI), vol. 3607, pp. 313–320. Springer, Heidelberg (2005). https://doi.org/10.1007/11527862_24

    Chapter  Google Scholar 

  43. Yassin, N.I., Omran, S., El Houby, E.M., Allam, H.: Machine learning techniques for breast cancer computer aided diagnosis using different image modalities: a systematic review. Comput. Methods Progr. Biomed. 156, 25–45 (2018)

    Article  Google Scholar 

  44. Yeo, I.K., Johnson, R.A.: A new family of power transformations to improve normality or symmetry. Biometrika 87(4), 954–959 (2000)

    Article  MathSciNet  Google Scholar 

  45. Zaragoza, J.C., Sucar, E., Morales, E., Bielza, C., Larranaga, P.: Bayesian chain classifiers for multidimensional classification. In: Twenty-Second International Joint Conference on Artificial Intelligence. Citeseer (2011)

    Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Eduardo Tieppo .

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Check for updates. Verify currency and authenticity via CrossMark

Cite this paper

Tieppo, E., Barddal, J.P., Nievola, J.C. (2021). Classifying Potentially Unbounded Hierarchical Data Streams with Incremental Gaussian Naive Bayes. In: Britto, A., Valdivia Delgado, K. (eds) Intelligent Systems. BRACIS 2021. Lecture Notes in Computer Science(), vol 13073. Springer, Cham. https://doi.org/10.1007/978-3-030-91702-9_28

Download citation

  • DOI: https://doi.org/10.1007/978-3-030-91702-9_28

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-91701-2

  • Online ISBN: 978-3-030-91702-9

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics