An Approach for Selecting Meaningful Groups in Hierarchical Document Clustering

  • Ricardo M. Marcacini USP
  • Maria F. Moura Embrapa
  • Solange O. Rezende USP

Abstract


Hierarchical document clustering usually generates many clusters and subclusters, which makes the results difficult to analyze and interpret. This paper presents an approach for obtaining a reduced hierarchy of documents from the original hierarchies by selecting only the significant clusters. The selection is supported by cluster quality measures adapted to the high dimensionality of textual data, and it takes into account the hierarchical relations among the clusters. An experimental evaluation on 10 text collections with three different hierarchical clustering algorithms showed good results.

Published
2009-07-20
MARCACINI, Ricardo M.; MOURA, Maria F.; REZENDE, Solange O. An Approach for Selecting Meaningful Groups in Hierarchical Document Clustering. In: NATIONAL MEETING ON ARTIFICIAL AND COMPUTATIONAL INTELLIGENCE (ENIAC), 7., 2009, Bento Gonçalves/RS. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2009. p. 302-311. ISSN 2763-9061.