Detection and Extraction of Templates in Web Pages

  • Karane Vieira UFAM
  • Altigran Soares da Silva UFAM

Abstract


The widespread use of templates on the Web is considered harmful for two main reasons. Not only do templates compromise the relevance judgments of many web IR and web mining methods, but they also negatively impact the performance and resource usage of tools that process web pages. In this paper we present two new algorithms based on tree mappings that efficiently and accurately remove templates found in collections of web pages by inspecting just a few sample pages. We show that our algorithms are effective at identifying terms that occur in templates, obtaining F-measure values around 0.9, and that they also boost the accuracy of web page clustering and classification methods.

Published
2009-07-20
VIEIRA, Karane; SILVA, Altigran Soares da. Detection and Extraction of Templates in Web Pages. In: THESIS AND DISSERTATION CONTEST (CTD), 22., 2009, Bento Gonçalves/RS. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2009. p. 57-64. ISSN 2763-8820.