Fuzzy Clustering for Continuous Data Streams: A Study of Chunk-Based Algorithms

R. K. Asbahr; P. A. Lopes; H. A. Camargo

doi:10.5753/kdmile.2018.27396

R. K. Asbahr UFSCar
P. A. Lopes Itera
H. A. Camargo UFSCar


Abstract

Data Stream Mining (DSM) has become an important topic owing to the increasing availability of large collections of data. Data streams are potentially infinite in size, which prevents them from being stored in their entirety, and the statistical distribution of the examples they generate can change over time. These characteristics call for algorithms designed specifically for streams. Clustering algorithms are well suited to DSM settings in which labeling the examples is costly and time-consuming. Fuzzy clustering algorithms offer an additional benefit in these settings by allowing decision surfaces to be defined flexibly. The objective of this work was to implement chunk-based fuzzy clustering algorithms for DSM and to analyze their behavior. The experiments, which use two synthetic data sets and one real data set, reveal trends in how well the algorithms handle two critical problems for this type of algorithm: changes in the data distribution and the choice of the number of clusters.

Keywords: data stream mining, fuzzy clustering, concept drift, machine learning
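
To make the idea of chunk-based fuzzy clustering concrete, the following is a minimal sketch and not the implementation evaluated in this work. It assumes one common design: each chunk is clustered with a weighted fuzzy c-means pass, and only the resulting centroids, weighted by the membership mass they absorbed, are carried into the next chunk as a summary of the past. All function names (`sq_dist`, `memberships`, `farthest_first`, `weighted_fcm`, `cluster_stream`) and the defaults `m=2.0` and `iters=30` are hypothetical choices made for illustration.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def memberships(points, centers, m):
    """Fuzzy membership of every point in every cluster (each row sums to 1)."""
    power = -1.0 / (m - 1.0)
    rows = []
    for p in points:
        # Clamp the distance so a point lying on a center does not divide by zero.
        inv = [max(sq_dist(p, ctr), 1e-12) ** power for ctr in centers]
        total = sum(inv)
        rows.append([v / total for v in inv])
    return rows


def farthest_first(points, c):
    """Deterministic initialization: start at the first point, then repeatedly
    add the point farthest from the centers chosen so far."""
    centers = [list(points[0])]
    while len(centers) < c:
        nxt = max(points, key=lambda p: min(sq_dist(p, ctr) for ctr in centers))
        centers.append(list(nxt))
    return centers


def weighted_fcm(points, weights, centers, m=2.0, iters=30):
    """Weighted fuzzy c-means with a fixed number of iterations.
    Returns the final centers and the weighted membership mass of each cluster."""
    dim = len(points[0])
    for _ in range(iters):
        u = memberships(points, centers, m)
        new_centers = []
        for k in range(len(centers)):
            coef = [w * row[k] ** m for w, row in zip(weights, u)]
            total = sum(coef)
            new_centers.append(
                [sum(cf * p[j] for cf, p in zip(coef, points)) / total
                 for j in range(dim)]
            )
        centers = new_centers
    u = memberships(points, centers, m)
    mass = [sum(w * row[k] for w, row in zip(weights, u))
            for k in range(len(centers))]
    return centers, mass


def cluster_stream(chunks, c, m=2.0, iters=30):
    """Process the stream one chunk at a time, keeping only c weighted centroids.
    Assumes the first chunk holds at least c points."""
    centers, weights = [], []
    for chunk in chunks:
        # Previous centroids re-enter as weighted points; new examples weigh 1.
        points = centers + [list(p) for p in chunk]
        w = weights + [1.0] * len(chunk)
        init = centers if centers else farthest_first(points, c)
        centers, weights = weighted_fcm(points, w, init, m, iters)
    return centers, weights
```

Calling `cluster_stream(chunks, c=2)` on an iterable of chunks returns the final centroids together with their accumulated weights, and memory use stays bounded by the chunk size plus `c` centroids. This sketch deliberately leaves open the two problems named in the abstract: it never forgets old data, so it adapts slowly when the distribution changes, and the number of clusters `c` must be fixed in advance.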
