Local Feature Selection in Hierarchical Document Clustering

  • Marcelo N. Ribeiro (UFPE)
  • Manoel J. R. Neto (UFAL)
  • Ricardo B. C. Prudêncio (UFPE)

Abstract


Feature selection has improved the performance of text clustering. Global feature selection tries to identify a single subset of features that is relevant to all clusters. However, the clustering process might be improved by considering a different subset of features to describe each cluster locally. In this work, we introduce ZOOM-IN, a method that performs local feature selection for partitional hierarchical clustering of text collections. The proposed method exploits the diversity of the clusters generated by the hierarchical algorithm, selecting a variable number of features according to the size of each cluster. Experiments were conducted on the Reuters collection, evaluating the bisecting K-means algorithm with both global and local approaches to feature selection. The results showed an improvement in clustering performance when the proposed local method was used.
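The general idea of combining bisecting K-means with per-cluster feature selection can be illustrated with a minimal sketch. This is not the ZOOM-IN method itself, whose details are not given in the abstract. The variance-based feature ranking, the rule that keeps a number of features proportional to cluster size (`fraction_per_doc`, `min_features`), the Euclidean distance, the choice to always split the largest cluster, and all function names are assumptions made only for illustration.

```python
import numpy as np


def select_local_features(X, fraction_per_doc=0.5, min_features=2):
    """Return indices of the highest-variance features inside one cluster.

    Assumed rule (hypothetical, not from the paper): the number of kept
    features grows linearly with the number of documents in the cluster.
    """
    n_docs, n_features = X.shape
    k = min(n_features, max(min_features, round(fraction_per_doc * n_docs)))
    variances = X.var(axis=0)
    return np.argsort(variances)[::-1][:k]


def two_means(X, n_iter=20):
    """Plain 2-means with a deterministic farthest-point initialization."""
    center_a = X[((X - X.mean(axis=0)) ** 2).sum(axis=1).argmax()]
    center_b = X[((X - center_a) ** 2).sum(axis=1).argmax()]
    centers = np.stack([center_a, center_b]).astype(float)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # Squared Euclidean distance of every document to both centers.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for c in (0, 1):
            if np.any(labels == c):
                centers[c] = X[labels == c].mean(axis=0)
    return labels


def bisecting_kmeans_local(X, n_clusters):
    """Bisecting K-means where each split sees only locally selected features."""
    clusters = [np.arange(X.shape[0])]
    while len(clusters) < n_clusters:
        # Assumption: always split the currently largest cluster.
        largest = max(range(len(clusters)), key=lambda j: len(clusters[j]))
        idx = clusters.pop(largest)
        feats = select_local_features(X[idx])
        labels = two_means(X[idx][:, feats])
        left, right = idx[labels == 0], idx[labels == 1]
        if len(left) == 0 or len(right) == 0:
            # The cluster could not be split further; stop early.
            clusters.append(idx)
            break
        clusters.extend([left, right])
    return clusters


# Toy document-term matrix: rows are documents, columns are term weights.
X_toy = np.array([
    [5, 0, 1, 0],
    [4, 0, 0, 1],
    [5, 0, 1, 1],
    [0, 5, 0, 1],
    [0, 4, 1, 0],
    [0, 5, 1, 1],
], dtype=float)

clusters = bisecting_kmeans_local(X_toy, n_clusters=2)
```

On this toy matrix the first three documents and the last three end up in separate clusters. A global approach would instead call a selection function once on the full matrix and reuse the same feature subset for every split.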

Published
2009-07-20
RIBEIRO, Marcelo N.; R. NETO, Manoel J.; PRUDÊNCIO, Ricardo B. C. Local Feature Selection in Hierarchical Document Clustering. In: NATIONAL MEETING ON ARTIFICIAL AND COMPUTATIONAL INTELLIGENCE (ENIAC), 7., 2009, Bento Gonçalves/RS. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2009. p. 292-301. ISSN 2763-9061.
