Selecting keywords to represent web pages using Wikipedia information

  • Maisa Vidal UFAM
  • Guilherme V. Menezes UFMG
  • Klessius Berlt UFAM
  • Edleno S. de Moura UFAM
  • Karla Okada INDT
  • Nivio Ziviani UFMG
  • David Fernandes UFAM
  • Marco Cristo UFAM


In this paper we present three new methods to extract keywords from web pages using Wikipedia as an external source of information. The information used from Wikipedia includes the titles of articles, the co-occurrence of keywords, and the categories associated with each Wikipedia definition. We compare our methods with three keyword extraction methods used as baselines: (i) all the terms of a web page, (ii) a TF-IDF implementation that extracts single weighted words from a web page, and (iii) a previously proposed Wikipedia-based keyword extraction method from the literature. The comparison covers three distinct scenarios, all related to our target application, which is the selection of ads in a context-based advertising system. In the first scenario, the target pages on which to place ads were extracted from Wikipedia articles, whereas the target pages in the other two scenarios were extracted from a news web site. Experimental results show that our methods are competitive solutions for the task of selecting good keywords to represent target web pages, while being simple, effective and time efficient. For instance, in the first scenario, our best method for extracting keywords from Wikipedia articles achieved an improvement of 33% when compared to the second best baseline, and a gain of 26% over the baseline that uses all the terms.
VIDAL, Maisa; MENEZES, Guilherme V.; BERLT, Klessius; MOURA, Edleno S. de; OKADA, Karla; ZIVIANI, Nivio; FERNANDES, David; CRISTO, Marco. Selecting keywords to represent web pages using Wikipedia information. In: SIMPÓSIO BRASILEIRO DE SISTEMAS MULTIMÍDIA E WEB (WEBMEDIA), 18., 2012, São Paulo. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2012. p. 375-382.

