Automatic Construction of Web Directories Using Incremental Term Clustering

  • Ricardo M. Marcacini USP
  • Solange O. Rezende USP

Abstract


Hierarchical clustering methods are useful for supporting the unsupervised construction of web directories. However, traditional methods are ineffective in dynamic scenarios, where knowledge is constantly updated. Moreover, they produce hierarchical structures that are difficult for users to interpret. In this paper, we propose an incremental term clustering approach that (1) organizes document collections in dynamic scenarios and (2) obtains cluster descriptors to support the interpretation of results. An experimental evaluation on real data from a web directory showed good results.
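To make the idea of incremental term clustering concrete, the following is a minimal sketch and not the algorithm proposed in the paper. It assumes a simple leader-style scheme: each new term joins the existing cluster whose first ("leader") term it co-occurs with most, measured by Jaccard similarity over document sets, or starts a new cluster if no similarity reaches a threshold. The class name, method names, and the threshold value are all hypothetical, and the result is a flat clustering rather than a hierarchy.

```python
from collections import defaultdict


class IncrementalTermClusterer:
    """Illustrative leader-style term clustering (an assumption, not the paper's method)."""

    def __init__(self, threshold=0.5):
        self.threshold = threshold          # hypothetical similarity cut-off
        self.postings = defaultdict(set)    # term -> ids of documents containing it
        self.clusters = []                  # each cluster is a list of terms
        self.cluster_of = {}                # term -> index into self.clusters
        self.n_docs = 0

    def _jaccard(self, a, b):
        """Jaccard similarity between the document sets of two terms."""
        docs_a, docs_b = self.postings[a], self.postings[b]
        union = len(docs_a | docs_b)
        return len(docs_a & docs_b) / union if union else 0.0

    def add_document(self, terms):
        """Process one incoming document without re-clustering earlier terms."""
        doc_id = self.n_docs
        self.n_docs += 1
        terms = set(terms)
        for term in terms:
            self.postings[term].add(doc_id)
        for term in sorted(terms):          # sorted for deterministic behaviour
            if term in self.cluster_of:
                continue                    # already clustered; assignment is kept
            best, best_sim = None, 0.0
            for idx, members in enumerate(self.clusters):
                sim = self._jaccard(term, members[0])  # compare with the leader term
                if sim > best_sim:
                    best, best_sim = idx, sim
            if best is not None and best_sim >= self.threshold:
                self.clusters[best].append(term)
                self.cluster_of[term] = best
            else:
                self.clusters.append([term])
                self.cluster_of[term] = len(self.clusters) - 1
        return doc_id

    def assign(self, terms):
        """Return the cluster sharing the most terms with a document, or None."""
        votes = defaultdict(int)
        for term in set(terms):
            if term in self.cluster_of:
                votes[self.cluster_of[term]] += 1
        if not votes:
            return None
        # Ties are broken in favour of the older (lower-index) cluster.
        return max(votes, key=lambda idx: (votes[idx], -idx))

    def descriptors(self, idx):
        """The terms of a cluster double as its human-readable descriptors."""
        return list(self.clusters[idx])


clusterer = IncrementalTermClusterer(threshold=0.5)
clusterer.add_document(["soccer", "goal", "league"])
clusterer.add_document(["python", "code", "compiler"])
clusterer.add_document(["goal", "match"])
for idx in range(len(clusterer.clusters)):
    print(idx, clusterer.descriptors(idx))
```

Because the clustered objects are terms, each cluster carries its own descriptors, which is the interpretability benefit the abstract points to. In this sketch a term's assignment is never revised once made, so early documents have a lasting influence on the clusters.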

Published
2011-07-19
MARCACINI, Ricardo M.; REZENDE, Solange O. Automatic Construction of Web Directories Using Incremental Term Clustering. In: NATIONAL MEETING ON ARTIFICIAL AND COMPUTATIONAL INTELLIGENCE (ENIAC), 8., 2011, Natal/RN. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2011. p. 323-334. ISSN 2763-9061.
