An Approach for Selecting Significant Groups in Hierarchical Document Clustering

Ricardo M. Marcacini (USP); Maria F. Moura (Embrapa); Solange O. Rezende (USP)

Abstract

Hierarchical document clustering usually produces many groups and subgroups, which makes the results hard to analyze and interpret. This paper presents an approach for deriving reduced document hierarchies from the original ones by selecting only the most significant groups. The selection is supported by cluster quality measures adapted to the high dimensionality of textual data and to the hierarchical relationships among groups. An experimental evaluation was carried out on 10 document collections with three different hierarchical clustering algorithms, with good results.
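The idea of reducing a hierarchy by keeping only significant groups can be illustrated with a minimal sketch. The abstract does not name the quality measures or the selection rule, so everything here is an assumption made for illustration: the `Node` class, the `cohesion` measure (mean cosine similarity to the cluster centroid), the `min_gain` threshold, and the rule that a group is kept only if it is more cohesive than its nearest kept ancestor. It is not the authors' method.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    """A group in the hierarchy: its own documents plus child groups."""
    name: str
    docs: list = field(default_factory=list)       # term-weight vectors
    children: list = field(default_factory=list)

    def all_docs(self):
        # A group covers its own documents and those of every descendant.
        out = list(self.docs)
        for child in self.children:
            out.extend(child.all_docs())
        return out


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def cohesion(node):
    """Assumed quality measure: mean cosine similarity to the centroid."""
    docs = node.all_docs()
    if not docs:
        return 0.0
    centroid = [sum(col) / len(docs) for col in zip(*docs)]
    return sum(cosine(d, centroid) for d in docs) / len(docs)


def prune(node, reference=None, min_gain=0.05):
    """Return (kept_subtrees, loose_docs) that replace `node`.

    A group is kept when its cohesion exceeds that of its nearest kept
    ancestor by at least `min_gain` (hypothetical rule). A dropped group
    hands its documents and its kept descendants to that ancestor.
    """
    quality = cohesion(node)
    keep = reference is None or quality - reference >= min_gain
    child_reference = quality if keep else reference
    subtrees, loose = [], list(node.docs)
    for child in node.children:
        kept, docs = prune(child, child_reference, min_gain)
        subtrees.extend(kept)
        loose.extend(docs)
    if keep:
        return [Node(node.name, loose, subtrees)], []
    return subtrees, loose


# Toy hierarchy: two topics, one of them split into near-identical subgroups.
sports = Node("sports", children=[
    Node("sports_a", docs=[[1.0, 0.0], [0.9, 0.1]]),
    Node("sports_b", docs=[[0.8, 0.2], [1.0, 0.1]]),
])
finance = Node("finance", docs=[[0.0, 1.0], [0.1, 0.9]])
root = Node("root", children=[sports, finance])

reduced, leftover = prune(root)
for child in reduced[0].children:
    print(child.name, len(child.docs), [c.name for c in child.children])
```

In this toy input the two subgroups of `sports` add almost no cohesion over their parent, so the sketch folds their documents into `sports` and leaves a two-group hierarchy. A real selection rule would need a measure suited to sparse, high-dimensional text vectors, as the abstract points out.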
