Generating Features from Textual Documents Through Association Rules

  • Rafael Geraldeli Rossi USP
  • Solange Oliveira Rezende USP

Abstract

Text mining techniques are used to organize, manage, and extract knowledge from the huge amount of textual data available in digital format. To apply these techniques, the textual documents must be represented in an appropriate format. The most common way to represent a text collection is the bag-of-words approach, in which each document is represented by a vector and each word in the collection corresponds to one dimension of that vector. This approach has well-known problems, such as high dimensionality and data sparsity. Moreover, most concepts are described by a set of words rather than a single word, such as “text mining”, “association rules”, and “machine learning”. Approaches that address this by generating features composed of sets of words suffer from other problems, such as generating meaningless features and needing to analyze the high-dimensional bag-of-words in order to generate the features. An approach named bag-of-related-words is proposed to generate features composed of sets of related words while avoiding the problems mentioned above. The features are generated from each textual document of a collection through association rules. To evaluate the generated features, experiments were carried out using classification algorithms from different paradigms. The results showed that the proposed approach yields results similar to those of the bag-of-words, with much lower dimensionality and with features that are easy to understand.
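
The idea of generating features from a single document through association rules can be illustrated with a minimal sketch. The abstract does not specify how a document is mapped to transactions, which thresholds are used, or how features are named, so everything below is an assumption made for illustration: each sentence is treated as one transaction, only rules between two words are considered, and the function names, the stopword list, `min_support`, and `min_confidence` are hypothetical. It is not the authors' implementation.

```python
import re
from collections import Counter
from itertools import combinations

# Tiny illustrative stopword list (an assumption, not taken from the paper).
STOPWORDS = {"the", "a", "an", "of", "and", "is", "are", "to", "in", "from", "by"}


def transactions(document):
    """Split a document into sentences; each sentence becomes a set of words."""
    result = []
    for sentence in re.split(r"[.!?]+", document.lower()):
        words = set(re.findall(r"[a-z]+", sentence)) - STOPWORDS
        if words:
            result.append(words)
    return result


def related_word_features(document, min_support=0.5, min_confidence=0.8):
    """Return word-pair features backed by at least one association rule.

    A pair (a, b) becomes a feature when it appears in at least
    `min_support` of the document's transactions and the rule a -> b
    or b -> a reaches `min_confidence`.
    """
    trans = transactions(document)
    if not trans:
        return set()
    n = len(trans)
    item_count = Counter(word for t in trans for word in t)
    pair_count = Counter(
        pair for t in trans for pair in combinations(sorted(t), 2)
    )
    features = set()
    for (a, b), count in pair_count.items():
        if count / n < min_support:
            continue
        confidence_ab = count / item_count[a]  # confidence of rule a -> b
        confidence_ba = count / item_count[b]  # confidence of rule b -> a
        if max(confidence_ab, confidence_ba) >= min_confidence:
            features.add(f"{a}_{b}")
    return features


doc = (
    "Text mining extracts knowledge. "
    "Text mining uses association rules. "
    "Association rules relate words."
)
print(sorted(related_word_features(doc)))
# prints ['association_rules', 'mining_text']
```

Because the rules are mined inside each document, the sketch never builds the collection-wide bag-of-words, and only word sets that repeatedly co-occur survive as features. In this toy document that leaves two readable features instead of nine single-word dimensions.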

Published
19/07/2011
ROSSI, Rafael Geraldeli; REZENDE, Solange Oliveira. Generating Features from Textual Documents Through Association Rules. In: ENCONTRO NACIONAL DE INTELIGÊNCIA ARTIFICIAL E COMPUTACIONAL (ENIAC), 8., 2011, Natal/RN. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2011. p. 311-322. ISSN 2763-9061.