Scalability and Efficiency in Data Mining of Internet Applications

  • Wagner Meira Jr. UFMG
  • Renato Ferreira UFMG
  • Dorgival Guedes UFMG

Abstract


The Internet has gone far beyond a technological artifact and has become an increasingly important instrument of social interaction. These interactions are usually complex and hard to analyze automatically, which calls for new data mining techniques that adapt to the peculiarities of each application scenario. Like other data mining techniques, these new techniques are compute-intensive and I/O-intensive, which motivates research and development of new paradigms, programming environments, and parallel algorithms that perform these tasks scalably and efficiently. The results described in the paper show the relevance not only of developing these new techniques but also of parallelizing them. They also make it possible to identify three groups of research challenges in Computer Science, along with their most relevant demands.

Published
June 30, 2007
MEIRA JR., Wagner; FERREIRA, Renato; GUEDES, Dorgival. Escalabilidade e Eficiência em Mineração de Dados de Aplicações Internet. In: SEMINÁRIO INTEGRADO DE SOFTWARE E HARDWARE (SEMISH), 34., 2007, Rio de Janeiro/RJ. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2007. p. 2055-2069. ISSN 2595-6205.