Wiki Evolution dataset: English Wikipedia revision articles represented by quality attributes

  • Ana Luiza Sanches, Federal Center for Technological Education of Minas Gerais (CEFET-MG)
  • Sinval de Deus Vieira Júnior, Federal Center for Technological Education of Minas Gerais (CEFET-MG)
  • Daniel Hasan Dalip, Federal Center for Technological Education of Minas Gerais (CEFET-MG)
  • Bárbara Gabrielle C. O. Lopes, Federal University of Minas Gerais (UFMG)

Abstract


This paper presents the creation and publication of the Wiki Evolution dataset, a set of English Wikipedia article revisions in which each revision is represented by quality attributes and a quality classification. The dataset can support studies of automatic quality classification that take an article's revision history into account, as well as analyses of how the content and quality of articles evolve over time on this collaborative platform.
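
To make the described structure concrete, the following is a minimal sketch of how one revision record and a per-article quality history might be represented. The field names (`article_id`, `revision_id`, `timestamp`, `quality_class`, `attributes`), the `length` attribute, and the sample values are illustrative assumptions, not the dataset's actual schema or contents; the class labels follow the English Wikipedia content assessment scale.

```python
from collections import defaultdict
from dataclasses import dataclass, field


# Hypothetical record layout: these field names are assumptions made for
# illustration and are not taken from the dataset itself.
@dataclass
class Revision:
    article_id: int
    revision_id: int
    timestamp: str       # ISO 8601 in UTC, so string order matches time order
    quality_class: str   # e.g. "Stub", "Start", "C", "B", "GA", "FA"
    attributes: dict = field(default_factory=dict)  # quality attributes


def quality_history(revisions):
    """Group revisions by article and list quality classes in time order."""
    by_article = defaultdict(list)
    for rev in revisions:
        by_article[rev.article_id].append(rev)
    return {
        article_id: [
            r.quality_class for r in sorted(revs, key=lambda r: r.timestamp)
        ]
        for article_id, revs in by_article.items()
    }


# Made-up sample rows, deliberately out of chronological order.
sample = [
    Revision(1, 11, "2010-05-01T00:00:00Z", "Start", {"length": 1200}),
    Revision(1, 10, "2009-01-15T00:00:00Z", "Stub", {"length": 300}),
    Revision(2, 20, "2012-03-09T00:00:00Z", "B", {"length": 5400}),
]
history = quality_history(sample)
print(history)
```

Grouping by article before sorting keeps each article's trajectory separate, which is the shape a study of quality evolution over the revision history would likely need.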

Keywords: dataset, Wikipedia, quality attributes

Published
2022-09-19
SANCHES, Ana Luiza; VIEIRA JÚNIOR, Sinval de Deus; DALIP, Daniel Hasan; LOPES, Bárbara Gabrielle C. O. Wiki Evolution dataset: English Wikipedia revision articles represented by quality attributes. In: DATASET SHOWCASE WORKSHOP (DSW), 4., 2022, Búzios. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2022. p. 46-56. DOI: https://doi.org/10.5753/dsw.2022.225573.