A Heterogeneous Network-Based Positive and Unlabeled Learning Approach to Detect Fake News

Mariana C. de Souza; Bruno M. Nogueira; Rafael G. Rossi; Ricardo M. Marcacini; Solange O. Rezende

Mariana C. de Souza USP https://orcid.org/0000-0002-1746-8414
Bruno M. Nogueira UFMS https://orcid.org/0000-0002-2401-2423
Rafael G. Rossi UFMS https://orcid.org/0000-0001-8513-3213
Ricardo M. Marcacini USP https://orcid.org/0000-0002-2309-3487
Solange O. Rezende USP https://orcid.org/0000-0002-5233-7639

Abstract

The dynamism of fake news evolution and dissemination plays a crucial role in influencing and confirming personal beliefs. To minimize the spread of disinformation, approaches proposed in the literature for automatic fake news detection generally learn models through binary supervised algorithms that consider textual and contextual information. However, labeling significant amounts of real news to build accurate classifiers is difficult and time-consuming due to its broad spectrum. Positive and unlabeled learning (PUL) can be a good alternative in this scenario. PUL algorithms learn models from a small amount of labeled data of the class of interest and use unlabeled data to increase classification performance. This paper proposes a heterogeneous network variant of the PU-LP algorithm, a PUL algorithm based on similarity networks. Our network incorporates different linguistic features to characterize fake news, such as representative terms, emotiveness, pausality, and average sentence size. We also consider two representations of the news to compute similarity: term frequency-inverse document frequency (TF-IDF) and Doc2Vec, which creates fixed-size document representations regardless of document length. We evaluated our approach on six datasets written in Portuguese or English, comparing its performance with a binary semi-supervised baseline algorithm, using two well-established label propagation algorithms: LPHN and GNetMine. The results indicate that PU-LP with heterogeneous networks can be competitive with binary semi-supervised learning. In addition, linguistic features such as representative terms and pausality improved classification performance, especially when only a small amount of labeled news is available.

Keywords: Fake news, One-class learning, Positive and unlabeled learning, Transductive semi-supervised learning, Graph-based learning