A Similarity Function for HTML Lists

Filipe Guédes Venâncio; Ronaldo dos Santos Mello

Filipe Guédes Venâncio UFSC
Ronaldo dos Santos Mello UFSC

Abstract

The Web continues to grow day after day and is the largest source of data in several domains of knowledge. Much of this data is available in billions of HTML pages. As a result, Web scraping is a growing activity, with a strong focus on structured data such as Web tables and Web lists. While the literature on the extraction and matching of Web tables is extensive, the same is not true for Web lists. The main reason is that a Web list lacks a heading with fixed and explicit attributes, which makes it difficult to find similar data for matching purposes. Related work on Web list similarity does not deal with comparing the content of two different lists. To fill this gap, this paper proposes a similarity function for HTML-coded lists called Simlist. It innovates by comparing not only the content of the two input lists but also the data on the HTML pages that surrounds the lists and may help infer their contexts. In addition, Simlist can be customized by the user in terms of the weights for the internal and surrounding data, as well as the auxiliary text similarity measures. Preliminary experimental evaluations show that our similarity function obtained good precision, recall and F-measure scores, even considering the inherent heterogeneity of HTML lists on the Web.

Keywords: data on the Web, similarity, HTML lists
