Intelligent News Filtering: A Clustering-Based Approach

  • Luíza Diapp UFPR
  • Lisiane Reips UFPR
  • Aurora T. R. Pozo UFPR
  • Carmem S. Hara UFPR

Abstract


With the growing volume of news available online, efficient tools are needed to help users quickly find relevant information. The ENoW (Web News Extractor) tool was designed to automatically collect news articles based on user-defined keywords, store the collected data, and apply an intelligent filtering system that highlights relevant content. However, the initial filtering step requires users to manually select the most relevant news from a randomly selected sample of the collected articles. As a result, users often need to request several new samples before finding relevant content, which makes the process both time-consuming and exhausting. To address this issue, this paper proposes applying the K-Means clustering algorithm to refine the filtering process, so that the initial sample better represents the different topics among the extracted articles. The results showed a significant reduction in the number of articles users needed to browse in order to identify relevant content. This improvement was subsequently integrated into the ENoW tool, enhancing the overall user experience in news filtering.
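
The idea of drawing the initial sample from clusters rather than at random can be illustrated with a minimal sketch. The abstract does not specify the feature representation or how articles are picked from each cluster, so the sketch assumes TF-IDF vectors and selects the article closest to each centroid. All function names are hypothetical and are not part of ENoW.

```python
import math
import random
from collections import Counter


def tfidf_vectors(docs):
    """Turn whitespace-tokenized documents into L2-normalized TF-IDF vectors."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({t for toks in tokenized for t in toks})
    n = len(docs)
    df = Counter(t for toks in tokenized for t in set(toks))
    # Smoothed IDF (an assumption of this sketch) keeps every weight positive.
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1.0 for t in vocab}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vec = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        vectors.append([x / norm for x in vec])
    return vectors


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iterations=50, seed=0):
    """Plain K-Means (Lloyd's algorithm); requires k <= len(vectors)."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    labels = None
    for _ in range(iterations):
        new_labels = [
            min(range(k), key=lambda c: sq_dist(v, centroids[c])) for v in vectors
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


def representative_sample(docs, k, seed=0):
    """Return at most one article per cluster: the one nearest its centroid."""
    vectors = tfidf_vectors(docs)
    labels, centroids = kmeans(vectors, k, seed=seed)
    sample = []
    for c in range(k):
        members = [i for i, lab in enumerate(labels) if lab == c]
        if members:
            best = min(members, key=lambda i: sq_dist(vectors[i], centroids[c]))
            sample.append(docs[best])
    return sample


if __name__ == "__main__":
    # Made-up headlines, used only to exercise the functions.
    headlines = [
        "jellyfish sighting closes beach",
        "beach closed after jellyfish sting",
        "football final ends in penalty shootout",
        "penalty decides football final",
    ]
    for headline in representative_sample(headlines, k=2):
        print(headline)
```

Compared with a uniformly random sample, a per-cluster sample is less likely to consist entirely of articles on one dominant topic, which is the effect the abstract attributes to the proposed refinement.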

Published
2025-04-23
DIAPP, Luíza; REIPS, Lisiane; POZO, Aurora T. R.; HARA, Carmem S. Intelligent News Filtering: A Clustering-Based Approach. In: REGIONAL DATABASE SCHOOL (ERBD), 20., 2025, Florianópolis/SC. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 80-89. ISSN 2595-413X. DOI: https://doi.org/10.5753/erbd.2025.6842.