Smart Crawler - Using Committee Machines for Web Pages Continuous Classification
Abstract
Information is published on the Web at an unprecedented speed. Individuals and organizations struggle to stay up to date and to find relevant knowledge in a tsunami of news, videos, posts, and comments. On the other hand, this content (mostly bound to HTML pages) is unstructured and not explicitly classified. In this context, machine-learning techniques can be very useful for automatically separating useful information from irrelevant noise. This paper describes a novel approach to Web page crawling. The Smart Crawler employs two techniques to improve information classification: massive Web page crawling and continuous classification through committee machines. These ideas are implemented using Big Data and cloud-ready technologies, whose cornerstone is a framework that enables memory-intensive processing, high scalability, and stream processing. The results indicate a significant classification capability and that the classification rate can scale linearly with the size of the dataset.
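As a rough illustration of continuous classification through a committee machine, the sketch below combines several independent classifiers by majority vote and applies them to pages as they arrive. It is a minimal sketch and not the paper's implementation: the keyword-based members, the `relevant`/`irrelevant` labels, the function names, and the example URLs are all hypothetical, and the Big Data framework mentioned in the abstract is replaced here by a plain Python generator.

```python
from collections import Counter
from typing import Callable, Iterable, Iterator, List, Tuple

# A committee member maps page text to a label (assumed labels: relevant/irrelevant).
Classifier = Callable[[str], str]


def keyword_member(keyword: str) -> Classifier:
    """Hypothetical member: votes 'relevant' when the keyword occurs in the text."""
    def classify(text: str) -> str:
        return "relevant" if keyword in text.lower() else "irrelevant"
    return classify


def committee_vote(members: List[Classifier], text: str) -> str:
    """Combine the members' votes by simple majority."""
    votes = Counter(member(text) for member in members)
    return votes.most_common(1)[0][0]


def classify_stream(
    members: List[Classifier], pages: Iterable[Tuple[str, str]]
) -> Iterator[Tuple[str, str]]:
    """Classify (url, text) pairs one at a time, as a crawler would emit them."""
    for url, text in pages:
        yield url, committee_vote(members, text)


# An odd number of members avoids ties between the two labels.
committee = [
    keyword_member("crawler"),
    keyword_member("classification"),
    keyword_member("web"),
]

pages = [
    ("http://example.com/a", "A web crawler with continuous classification"),
    ("http://example.com/b", "Recipe for chocolate cake"),
]

results = list(classify_stream(committee, pages))
for url, label in results:
    print(url, label)
```

In a real committee the members would be trained models of different kinds rather than keyword rules, and the votes could be weighted by each member's measured accuracy instead of counted equally.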
Published
27/10/2015
How to Cite
SANTANA, Luiz Henrique Zambom; MELLO, Ronaldo dos Santos; ROISENBERG, Mauro. Smart Crawler - Using Committee Machines for Web Pages Continuous Classification. In: BRAZILIAN SYMPOSIUM ON MULTIMEDIA AND THE WEB (WEBMEDIA), 21., 2015, Manaus. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2015. p. 125-132.