Smart Crawler - Using Committee Machines for Web Pages Continuous Classification

  • Luiz Henrique Zambom Santana UFSC
  • Ronaldo dos Santos Mello UFSC
  • Mauro Roisenberg UFSC

Abstract


Information is published on the WWW at an unprecedented speed. Individuals and organizations struggle to stay up to date and to find relevant knowledge in a tsunami of news, videos, posts, and comments. On the other hand, this content (mostly bound to HTML pages) is unstructured and not explicitly classified. In this context, machine-learning techniques can be very useful for automatically separating useful information from irrelevant noise. This paper describes a novel approach to Web page crawling. The Smart Crawler employs two techniques to improve information classification: massive Web page crawling and continuous classification through committee machines. These ideas are implemented using Big Data and cloud-ready technologies, whose cornerstone is a framework that enables memory-intensive processing, high scalability, and stream processing. The results indicate a significant classification capability and that the classification rate can scale linearly with the size of the dataset.
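
To make the idea of continuous classification through a committee machine concrete, the following is a minimal sketch in Python, not the authors' implementation. It assumes the simplest committee rule, a majority vote, and uses hypothetical keyword rules (`keyword_member`) as stand-ins for trained classifiers. The names `committee_vote` and `classify_stream` and the labels `relevant` and `irrelevant` are illustrative assumptions; the abstract does not specify the committee's members or its combination rule.

```python
from collections import Counter
from typing import Callable, Iterable, Iterator, List, Tuple

# A committee member maps page text to a label.
Classifier = Callable[[str], str]


def keyword_member(keyword: str) -> Classifier:
    """Hypothetical member: a keyword rule standing in for a trained model."""
    def classify(text: str) -> str:
        return "relevant" if keyword in text.lower() else "irrelevant"
    return classify


def committee_vote(members: List[Classifier], text: str) -> str:
    """Combine member decisions by majority vote (assumed combination rule)."""
    votes = Counter(member(text) for member in members)
    return votes.most_common(1)[0][0]


def classify_stream(
    members: List[Classifier], pages: Iterable[str]
) -> Iterator[Tuple[str, str]]:
    """Label pages one at a time as they arrive, as a crawler would emit them."""
    for text in pages:
        yield text, committee_vote(members, text)


# An odd number of members avoids ties between the two labels.
members = [
    keyword_member("crawler"),
    keyword_member("classification"),
    keyword_member("web"),
]
pages = ["Web crawler for page classification", "Recipe for chocolate cake"]
results = list(classify_stream(members, pages))
for text, label in results:
    print(f"{label}: {text}")
```

Because `classify_stream` is a generator, it consumes pages lazily, which mirrors the streaming setting described above: each page is classified as soon as it is crawled rather than after the whole collection is available.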
Published
27/10/2015
SANTANA, Luiz Henrique Zambom; MELLO, Ronaldo dos Santos; ROISENBERG, Mauro. Smart Crawler - Using Committee Machines for Web Pages Continuous Classification. In: BRAZILIAN SYMPOSIUM ON MULTIMEDIA AND THE WEB (WEBMEDIA), 21., 2015, Manaus. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2015. p. 125-132.