A Framework for Online Clustering Based on Evolving Semi-Supervision

Guilherme Alves; Maria Camila N. Barioni; Elaine R. Faria

doi:10.5753/sbbd.2017.171369

Guilherme Alves, Universidade Federal de Uberlândia
Maria Camila N. Barioni, Universidade Federal de Uberlândia
Elaine R. Faria, Universidade Federal de Uberlândia

Abstract

The sheer volume of data now available makes information retrieval considerably harder. Automatic methods for organizing data, such as clustering, can help by enabling timely access. Semi-supervised clustering approaches use additional information to steer attribute-based clustering toward a more suitable data partition. However, this extra information may change over time, which in turn changes how the data should be organized. To help cope with this issue, we propose CABESS (Cluster Adaptation Based on Evolving Semi-Supervision), a framework for online clustering. The framework handles evolving semi-supervision obtained through binary user feedback. To validate our approach, we ran experiments on hierarchically labeled data, considering cluster splits over time. The experimental results show the potential of the proposed framework for dealing with evolving semi-supervision. They also show that our framework is faster than traditional semi-supervised clustering algorithms while using less standard semi-supervision.

Keywords: Online Clustering, Adaptation, Semi-Supervision, Framework
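
To make the idea of evolving semi-supervision concrete, the following minimal sketch shows one way binary feedback could be stored so that a newer answer replaces an older one. It is not the CABESS implementation. It assumes, for illustration only, that each piece of feedback is a yes/no answer about whether two objects belong to the same group; the abstract does not specify this form. The class `EvolvingFeedback` and its methods are hypothetical names.

```python
from typing import Dict, FrozenSet, Hashable, Mapping


class EvolvingFeedback:
    """Hypothetical store keeping only the latest binary answer per object pair."""

    def __init__(self) -> None:
        # pair -> True ("same group") or False ("different groups")
        self._latest: Dict[FrozenSet[Hashable], bool] = {}

    def record(self, a: Hashable, b: Hashable, same_group: bool) -> None:
        if a == b:
            raise ValueError("feedback must relate two distinct objects")
        # A newer answer for the same pair replaces the older one.
        self._latest[frozenset((a, b))] = same_group

    def violations(self, labels: Mapping[Hashable, int]) -> int:
        """Count stored answers that the given partition contradicts."""
        count = 0
        for pair, same_group in self._latest.items():
            a, b = tuple(pair)
            if a not in labels or b not in labels:
                continue  # object has not been clustered yet
            if (labels[a] == labels[b]) != same_group:
                count += 1
        return count


# Usage: the user first groups x3 with x4, then later asks for a split.
feedback = EvolvingFeedback()
feedback.record("x1", "x2", same_group=True)
feedback.record("x3", "x4", same_group=True)
labels = {"x1": 0, "x2": 0, "x3": 0, "x4": 0}
before = feedback.violations(labels)

feedback.record("x3", "x4", same_group=False)  # the supervision evolves
after = feedback.violations(labels)
```

Under these assumptions, a nonzero violation count after new feedback arrives signals that the current partition no longer matches what the user wants, which is the situation an online method would need to adapt to, for example by splitting a cluster.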
