Positive-Unlabeled Learning for Addressing Hidden Positives in Survey-Based Health Screening Information Systems

  • Rafael F. Pinheiro USP
  • Nataly L. Patti da Silva USP

Abstract


Survey-based health datasets carry label bias from underdiagnosis and underreporting, which undermines their use for training predictive models in screening information systems. This paper explores Positive-Unlabeled (PU) Learning as a data-quality correction mechanism for self-reported health data. Using BRFSS-2015 and diabetes-related conditions, it shows how PU Learning can redistribute hidden positives within the unlabeled majority, improving detection of at-risk individuals (especially those with pre-diabetes) while shifting predictive signals toward healthcare access and response-quality factors. The results suggest that PU Learning can improve survey-based screening systems under incomplete labeling, challenging the assumption that the absence of a label implies the absence of the condition.
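To make the idea of redistributing hidden positives concrete, the following is a minimal sketch of one well-known PU correction, the Elkan-Noto calibration. It is not necessarily the method used in this paper. It assumes that labeled positives are selected completely at random from all true positives (a strong assumption for survey data), and that a classifier has already been trained to output g(x), the probability that a respondent is labeled. All function names and the example scores are hypothetical.

```python
from statistics import mean


def estimate_label_frequency(scores_on_labeled_positives):
    """Estimate c = P(labeled | truly positive) as the mean of g(x)
    over held-out labeled positives (assumes random labeling of positives)."""
    c = mean(scores_on_labeled_positives)
    if not 0.0 < c <= 1.0:
        raise ValueError("label frequency must lie in (0, 1]")
    return c


def corrected_probability(g, c):
    """Turn g(x) = P(labeled | x) into an estimate of P(positive | x) = g(x) / c,
    capped at 1 because the ratio can exceed 1 for noisy scores."""
    return min(1.0, g / c)


def hidden_positive_weight(g, c):
    """Estimated probability that an UNLABELED respondent is actually positive:
    ((1 - c) / c) * (g / (1 - g)), capped at 1."""
    if g >= 1.0:
        return 1.0
    return min(1.0, (1.0 - c) / c * g / (1.0 - g))


if __name__ == "__main__":
    # Hypothetical classifier scores on held-out respondents who reported a diagnosis.
    c = estimate_label_frequency([0.5, 0.75, 0.25])
    # Hypothetical score for a respondent who reported no diagnosis.
    g = 0.2
    print("corrected P(positive | x):", corrected_probability(g, c))
    print("hidden-positive weight:", hidden_positive_weight(g, c))
```

Under this sketch, an unlabeled respondent is not treated as a definite negative. Instead, the respondent contributes to training as a positive with the computed weight and as a negative with the remaining weight, which is one way to drop the "no label means no condition" assumption.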

Published
25/05/2026
PINHEIRO, Rafael F.; SILVA, Nataly L. Patti da. Positive-Unlabeled Learning for Addressing Hidden Positives in Survey-Based Health Screening Information Systems. In: TRILHA DE NOVAS IDEIAS E RESULTADOS EMERGENTES EM SI - POSICIONAMENTO DE IDEIAS - SIMPÓSIO BRASILEIRO DE SISTEMAS DE INFORMAÇÃO (SBSI), 22., 2026, Vitória/ES. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2026. p. 238-249. DOI: https://doi.org/10.5753/sbsi_estendido.2026.249046.