Detection and Extraction of Templates in Web Pages
Abstract
The widespread use of templates on the Web is considered harmful for two main reasons. Not only do they compromise the relevance judgment of many web IR and web mining methods, but they also negatively impact the performance and resource usage of tools that process web pages. In this paper we present two new algorithms based on tree mappings that efficiently and accurately remove templates found in collections of web pages by inspecting just a few sample pages. We show that our algorithms are effective at identifying terms occurring in templates, obtaining F-measure values around 0.9, and that they also boost the accuracy of web page clustering and classification methods.
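For readers unfamiliar with the metric, the F-measure reported above is commonly computed as the harmonic mean of precision and recall over the set of terms labeled as template terms. The following is a minimal sketch under that assumption; the function name `f_measure` and the example term sets are hypothetical illustrations and do not reproduce the paper's evaluation setup or data.

```python
def f_measure(predicted_terms, true_template_terms):
    """Balanced F-measure (F1) between predicted and true template terms.

    Both arguments are iterables of terms; duplicates are ignored.
    Assumption: terms are compared as exact strings.
    """
    predicted = set(predicted_terms)
    actual = set(true_template_terms)
    true_positives = len(predicted & actual)
    # With no overlap (or an empty set), precision and recall are zero or
    # undefined, so report 0.0 by convention.
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(actual)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: terms a detector flagged vs. terms truly in the template.
flagged = {"home", "contact", "login", "news"}
in_template = {"home", "contact", "login", "about"}
score = f_measure(flagged, in_template)
```

In this made-up example three of the four flagged terms are correct and three of the four true template terms are found, so precision and recall are both 0.75.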
