Local Feature Selection in Hierarchical Document Clustering
Abstract
Feature selection can improve both the precision and the running time of document clustering algorithms. Global feature selection tries to identify a single subset of features that is relevant to all clusters. However, clustering can be improved by considering different feature subsets that describe each cluster locally. This work introduces the ZOOM-IN method, which performs local feature selection for divisive hierarchical clustering of documents. The proposed method exploits the diversity of the clusters returned by a hierarchical algorithm, selecting a variable number of features according to cluster size. Experiments were conducted on the Reuters collection, evaluating the bisecting K-means algorithm with both the global and the local approach to feature selection. The results showed improved clustering performance with the proposed local method.
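The idea of re-selecting features inside each cluster before splitting it can be illustrated with a small, self-contained sketch. This is not the ZOOM-IN implementation, whose details are not given in this abstract. The ranking criterion (per-feature variance), the rule tying the number of kept features to cluster size (`ratio`, `min_keep`), the farthest-point initialisation, and all function names are assumptions made only for illustration.

```python
def term_variance(docs, j):
    """Variance of feature j over the given document vectors."""
    vals = [d[j] for d in docs]
    mean = sum(vals) / len(vals)
    return sum((v - mean) ** 2 for v in vals) / len(vals)


def select_local_features(docs, n_features, ratio=0.5, min_keep=2):
    """Pick features for ONE cluster; the count grows with cluster size.

    Assumed rule: keep int(ratio * cluster size) features, clamped to
    the range [min_keep, n_features], ranked by variance in the cluster.
    """
    k = max(min_keep, min(n_features, int(ratio * len(docs))))
    ranked = sorted(range(n_features),
                    key=lambda j: term_variance(docs, j), reverse=True)
    return sorted(ranked[:k])


def sq_dist(a, b, feats):
    """Squared Euclidean distance restricted to the selected features."""
    return sum((a[j] - b[j]) ** 2 for j in feats)


def two_means(docs, feats, iters=20):
    """2-means on the selected features, with deterministic initialisation:
    the first document and the document farthest from it."""
    centers = [docs[0], max(docs, key=lambda d: sq_dist(d, docs[0], feats))]
    parts = ([], [])
    for _ in range(iters):
        parts = ([], [])
        for d in docs:
            near_first = (sq_dist(d, centers[0], feats)
                          <= sq_dist(d, centers[1], feats))
            parts[0 if near_first else 1].append(d)
        if not parts[0] or not parts[1]:
            break  # degenerate split, e.g. all documents identical
        centers = [[sum(d[j] for d in p) / len(p)
                    for j in range(len(docs[0]))] for p in parts]
    return parts


def bisecting_local(docs, n_clusters):
    """Divisive clustering: repeatedly split the largest cluster,
    re-selecting features locally before every split."""
    n_features = len(docs[0])
    clusters = [list(docs)]
    while len(clusters) < n_clusters:
        clusters.sort(key=len)
        target = clusters.pop()  # largest cluster
        if len(target) < 2:
            clusters.append(target)
            break
        feats = select_local_features(target, n_features)
        left, right = two_means(target, feats)
        if not left or not right:
            clusters.append(target)
            break
        clusters += [left, right]
    return clusters


# Toy term-frequency vectors: two topics over four terms.
docs = [[5, 0, 0, 1], [6, 0, 1, 0], [5, 1, 0, 0],
        [0, 0, 6, 5], [0, 1, 5, 6], [1, 0, 5, 5]]
clusters = bisecting_local(docs, 2)
```

A global approach would call the selection step once on the whole collection and reuse the same feature subset at every split; here it is called again on each cluster, so smaller clusters deeper in the hierarchy are described by fewer, more specific features.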
Published
20 July 2009
How to Cite
RIBEIRO, Marcelo N.; R. NETO, Manoel J.; PRUDÊNCIO, Ricardo B. C. Seleção Local de Características em Agrupamento Hierárquico de Documentos. In: ENCONTRO NACIONAL DE INTELIGÊNCIA ARTIFICIAL E COMPUTACIONAL (ENIAC), 7., 2009, Bento Gonçalves/RS. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2009. p. 292-301. ISSN 2763-9061.
