Combining semi-supervision and hubness to improve clustering of high-dimensional data
Abstract
The curse of dimensionality makes the analysis of high-dimensional data a challenging task for data clustering techniques. To deal with high-dimensional data, this paper presents a clustering approach that combines two strategies: semi-supervision and density estimation based on hubness scores. Initial experimental results show good performance on real data sets with different characteristics.
Keywords:
Semi-Supervision, Hubness, Data Clustering, High Dimension
References
Basu, S., Davidson, I., and Wagstaff, K. (2008). Constrained Clustering: Advances in Algorithms, Theory, and Applications. Chapman & Hall/CRC, 1st edition.
Demsar, J. (2006). Statistical comparisons of classifiers over multiple data sets. J. Mach. Learn. Res., 7:1–30.
Dhillon, I. S., Guan, Y., and Kulis, B. (2004). Kernel k-means: Spectral clustering and normalized cuts. KDD ’04, pages 551–556. ACM.
Faceli, K., Lorena, A. C., Gama, J., and Carvalho, A. (2011). Inteligência Artificial: Uma Abordagem de Aprendizado de Máquina. LTC, 1st edition.
Samet, H. (2005). Foundations of Multidimensional and Metric Data Structures. Morgan Kaufmann.
Sander, J., Ester, M., Kriegel, H.-P., and Xu, X. (1998). Density-based clustering in spatial databases: The algorithm GDBSCAN and its applications. Data Min. Knowl. Discov., 2(2):169–194.
Silvestre, A. L. (2007). Análise de Dados e Estatística Descritiva. Escolar Editora.
Tomasev, N. and Mladenic, D. (2013). Hub co-occurrence modeling for robust high-dimensional knn classification. In ECML PKDD, pages 643–659. Springer.
Tomasev, N., Radovanovic, M., Mladenic, D., and Ivanovic, M. (2011). The role of hubness in clustering high-dimensional data. PAKDD, pages 183–195. Springer.
Tomasev, N., Radovanovic, M., Mladenic, D., and Ivanovic, M. (2014). The role of hubness in clustering high-dimensional data. IEEE TKDE, 26(3):739–751.
Zar, J. H. (2007). Biostatistical Analysis. Prentice-Hall, Inc., 5th edition.
Published
2016-10-04
How to Cite
DE LIMA, Mateus C.; BARIONI, Maria Camila N.; RAZENTE, Humberto L. Combining semi-supervision and hubness to improve clustering of high-dimensional data. In: BRAZILIAN SYMPOSIUM ON DATABASES (SBBD), 31., 2016, Salvador/BA. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2016. p. 139-144. ISSN 2763-8979. DOI: https://doi.org/10.5753/sbbd.2016.24318.
