Smart Crawler - Using Committee Machines for Web Pages Continuous Classification

Luiz Henrique Zambom Santana; Ronaldo dos Santos Mello; Mauro Roisenberg

Luiz Henrique Zambom Santana, UFSC
Ronaldo dos Santos Mello, UFSC
Mauro Roisenberg, UFSC

Abstract

The speed of information publishing on the WWW is unprecedented. Individuals and organizations struggle to stay up to date and to find relevant knowledge in a tsunami of news, videos, posts, and comments. On the other hand, this content (mostly bound to HTML pages) is unstructured and not explicitly classified. In this context, machine-learning techniques can be very handy for automatically separating useful information from irrelevant noise. This paper describes a novel approach to Web page crawling. The Smart Crawler employs two techniques to improve information classification: massive Web page crawling and continuous classification through committee machines. These ideas are implemented using Big Data and cloud-ready technologies, whose cornerstone is a framework that enables memory-intensive processing, high scalability, and stream processing. The results indicate a significant classification capability and show that the classification rate can scale linearly with the size of the dataset.
