On the evaluation of graph construction methods for semi-supervised transductive classification

  • Leonardo Macedo Freire Universidade Federal de Ouro Preto (UFOP)
  • Jefferson Tales Oliva Universidade Tecnológica Federal do Paraná (UTFPR) http://orcid.org/0000-0003-1574-1293
  • Valéria de Carvalho Santos Universidade Federal de Ouro Preto (UFOP)
  • Vander Luis de Souza Freitas Universidade Federal de Ouro Preto (UFOP)
  • Jadson Castro Gertrudes Universidade Federal de Ouro Preto (UFOP) https://orcid.org/0000-0002-0861-6681

Resumo


Semi-supervised learning addresses critical challenges in machine learning when labeled data is scarce but unlabeled data is abundant. However, ensuring effective label propagation in transductive semi-supervised classification algorithms is challenging due to the significant influence of the underlying graph's topological structure. This article systematically investigates this problem by evaluating various graph construction methods alongside traditional approaches, including the novel application of the HDBSCAN*-derived Mutual Reachability Minimum Spanning Tree (MST_R) and the Disparity Filter (DF). We employed the Local and Global Consistency (LGC) algorithm for label propagation and assessed performance using macro-averaged F-measure and statistical tests (Friedman and Nemenyi). Our experiments show that the DF demonstrates compelling performance, showing no statistical difference from top-ranked methods like SYM_FKNN. On the other hand, MST_R exhibited limitations that could be verified in future studies. This comprehensive analysis, further supported by Spearman's rank correlation with graph and data properties, provides important insights into optimal graph selection for robust semi-supervised classification workflows.
Palavras-chave: density-based clustering, disparity filter, graph construction, semi-supervised classification, transductive classification

Referências

Berton, L., de Andrade Lopes, A., and Vega-Oliveros, D. A. A comparison of graph construction methods for semi-supervised learning. In 2018 International Joint Conference on Neural Networks, IJCNN 2018, Rio de Janeiro, Brazil, July 8-13, 2018. IEEE, pp. 1–8, 2018.

Campello, R. J. G. B., Moulavi, D., Zimek, A., and Sander, J. Hierarchical density estimates for data clustering, visualization, and outlier detection. ACM TKDD 10 (1): 1–51, 2015.

Chapelle, O., Scholkopf, B., and Zien, A. Semi-Supervised Learning. MIT Press, Cambridge, MA, 2006.

de Sousa, C. A. R., Rezende, S. O., and Batista, G. E. A. P. A. Influence of graph construction on semi-supervised learning. In Machine Learning and Knowledge Discovery in Databases - European Conference, ECML PKDD 2013, Prague, Czech Republic, September 23-27, 2013, Proceedings, Part III, H. Blockeel, K. Kersting, S. Nijssen, and F. Zelezný (Eds.). Lecture Notes in Computer Science, vol. 8190. Springer, pp. 160–175, 2013.

de Souto, M. C. P., Costa, I. G., de Araujo, D. S. A., Ludermir, T. B., and Schliep, A. Clustering cancer gene expression data: a comparative study. BMC Bioinformatics vol. 9, pp. 1–14, 2008.

Demšar, J. Statistical comparisons of classifiers over multiple data sets. JMLR vol. 7, pp. 1–30, 2006.

Fontaine, F., Pastor, M., Zamora, I., and Sanz, F. Anchor-grind: Filling the gap between standard 3d qsar and the grid-independent descriptors. Journal of Medicinal Chemistry 48 (7): 2687-2694, 2005. PMID: 15801859.

Friedman, M. The use of ranks to avoid the assumption of normality implicit in the analysis of variance. J. Amer. Statist. Assoc. 32 (200): 675–701, 1937.

Gaulton, A., Hersey, A., Nowotka, M., Bento, A. P., Chambers, J., Mendez, D., Mutowo-Meullenet, P.,Atkinson, F., Bellis, L. J., Cibrián-Uhalte, E., Davies, M., Dedman, N., Karlsson, A., Magariños, M. P., Overington, J. P., Papadatos, G., Smit, I., and Leach, A. R. The ChEMBL database in 2017. Nucleic Acids Res. 45 (Database-Issue): D945–D954, 2017.

Gertrudes, J. C., Zimek, A., Sander, J., and Campello, R. J. G. B. A unified framework of density-based clustering for semi-supervised classification. In Proceedings of the 30th International Conference on Scientific and Statistical Database Management, SSDBM 2018, Bozen-Bolzano, Italy, July 09-11, 2018, D.Sacharidis, J. Gamper, and M. H. Böhlen (Eds.). ACM, pp. 11:1–11:12, 2018.

Lichman, M. UCI machine learning repository, 2013.

Liu, W. and Chang, S. Robust multi-class transductive learning with graphs. In Proc. IEEE CVPR. pp. 381–388, 2009.

Menczer, F., Fortunato, S., and Davis, C. A. A first course in network science. Cambridge University Press, 2020.

Moulavi, D., Jaskowiak, P. A., Campello, R. J. G. B., Zimek, A., and Sander, J. Density-based clustering validation. In Proceedings of the 2014 SIAM International Conference on Data Mining, Philadelphia, Pennsylvania, USA, April 24-26, 2014, M. J. Zaki, Z. Obradovic, P. Tan, A. Banerjee, C. Kamath, and S. Parthasarathy (Eds.). SIAM, pp. 839–847, 2014.

Naldi, M. C., Campello, R. J. G. B., Hruschka, E. R., and Carvalho, A. C. P. L. F. Efficiency issues of evolutionary k-means. Appl.iedSoft Computing 11 (2): 1938–1952, 2011.

Nemenyi, P. B. Distribution-free multiple comparisons. Ph.D. thesis, Princeton University, 1963.

Ozaki, K., Shimbo, M., Komachi, M., and Matsumoto, Y. Using the mutual k-nearest neighbor graphs for semi-supervised classification on natural language data. In Proceedings of the Fifteenth Conference on Computational Natural Language Learning, CoNLL 2011, Portland, Oregon, USA, June 23-24, 2011, S. Goldwater and C. D. Manning (Eds.). ACL, pp. 154–162, 2011.

Serrano, M. Á., Boguñá, M., and Vespignani, A. Extracting the multiscale backbone of complex weighted networks. Proceedings of the national academy of sciences 106 (16): 6483–6488, 2009.

Sokolova, M. and Lapalme, G. A systematic analysis of performance measures for classification tasks. Inf Process Manag 45 (4): 427–437, 2009.

Sutherland, J. J., O’Brien, L. A., and Weaver, D. F. A comparison of methods for modeling quantitative structure-activity relationships. J. Med.Chem. 47 (22): 5541–5554, 2004.

Vanschoren, J., Van Rijn, J. N., Bischl, B., and Torgo, L. Openml: networked science in machine learning. ACM SIGKDD Expl. Newsletter 15 (2): 49–60, 2014.

Vega-Oliveros, D. A., Berton, L., Eberle, A. M., de Andrade Lopes, A., and Zhao, L. Regular graph construction for semi-supervised learning. In Journal of physics: Conference series. Vol. 490. IOP Publishing, pp. 012022, 2014.

Yeung, K. Y., Medvedovic, M., and Bumgarner, R. E. Clustering gene-expression data with repeated measurements. Genome Biol 4 (5): R34, 2003.

Zhou, D., Bousquet, O., Lal, T. N., Weston, J., and Schölkopf, B. Learning with local and global consistency. In Advances in Neural Information Processing Systems 16 [Neural Information Processing Systems, NIPS 2003,December 8-13, 2003, Vancouver and Whistler, British Columbia, Canada, S. Thrun, L. K. Saul, and B. Schölkopf (Eds.). MIT Press, pp. 321–328, 2003.

Zhu, X. Semi-supervised learning literature survey — tr1530. Tech. rep., University of Wisconsin, Madison, 2005.
Publicado
29/09/2025
FREIRE, Leonardo Macedo; OLIVA, Jefferson Tales; SANTOS, Valéria de Carvalho; FREITAS, Vander Luis de Souza; CASTRO GERTRUDES, Jadson. On the evaluation of graph construction methods for semi-supervised transductive classification. In: SYMPOSIUM ON KNOWLEDGE DISCOVERY, MINING AND LEARNING (KDMILE), 13. , 2025, Fortaleza/CE. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025 . p. 33-40. ISSN 2763-8944. DOI: https://doi.org/10.5753/kdmile.2025.247713.