EDREW - Enhanced Data Representation for Extraction in Web

  • Marcelo C. Nunes UFSC
  • Carina F. Dorneles UFSC


Extracting data from Web sites remains a challenge because pages have a complex and changeable structure. The reason is simple: Web pages are designed to be visually friendly to users, not to make data extraction easy. In addition, each page has its own varied structure built on the HTML DOM, and since Web page designers can follow their own standards, page structures diverge widely. Identifying and extracting information therefore still represents a significant barrier. To overcome this challenge, we propose a new approach called EDREW, which uses the information from the HTML DOM structure and the information generated through the HTML elements to represent the context of the elements on the page without the need for rendering. We use the ELMo model to extract information and classify it as noise or useful content. The experiments were performed on the public Structured Web Data Extraction (SWDE) dataset and on a new dataset created for this work, based on the most current versions of the pages in SWDE. Using EDREW, it was possible to outperform the baselines on the original SWDE dataset and to extract twice as much page content on the new version of SWDE that we built with updated pages.
Keywords: semi-structured web extraction, web information extraction
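
As a rough illustration of representing the context of page elements from the HTML DOM structure without rendering, the sketch below walks the markup with Python's standard `html.parser` and pairs each text node with its ancestor tag path and the `class` values of its ancestors. This is not the EDREW implementation: the names `TextWithContext` and `extract`, the record fields, and the choice of tag path plus class attributes as context are all assumptions made for the example, and the ELMo-based classification step is not shown.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be kept open.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}
# Elements whose text is never visible page content.
SKIP_TAGS = {"script", "style"}


class TextWithContext(HTMLParser):
    """Hypothetical helper: collects text nodes with their DOM context."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.stack = []    # currently open elements as (tag, class value)
        self.records = []  # one dict per non-empty text node

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        # Attribute values can be None (valueless attributes), hence "or".
        self.stack.append((tag, dict(attrs).get("class") or ""))

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; ignore stray end tags.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = " ".join(data.split())
        if not text or any(tag in SKIP_TAGS for tag, _ in self.stack):
            return
        self.records.append({
            "text": text,
            "path": "/".join(tag for tag, _ in self.stack),
            "classes": [cls for _, cls in self.stack if cls],
        })


def extract(html):
    """Return text nodes with structural context, parsed without rendering."""
    parser = TextWithContext()
    parser.feed(html)
    parser.close()
    return parser.records


if __name__ == "__main__":
    page = ('<html><body><div class="nav"><a href="#">Home</a></div>'
            '<div class="item"><h1>Title</h1><br>'
            '<span class="price">$10</span></div></body></html>')
    for record in extract(page):
        print(record)
```

Records like these could then be passed to a downstream classifier that separates noise (for example, navigation links) from useful content; which features actually feed the model is a design decision this sketch does not settle.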


