An algorithm for deduplication and insurance of quality in a database for ASD diagnosis: proposal and qualitative evaluation

  • Sarah Klock Mauricio USP
  • Helena Brentani USP
  • Joana Portolese USP
  • Luciana Madanelo USP
  • Ariane Machado-Lima USP
  • Lima Fátima L. S. Nunes USP

Abstract


Autism Spectrum Disorder (ASD) diagnosis depends on well-trained health professionals, which limits access to diagnosis. Computer-aided diagnosis (CAD) using biomarkers can make diagnosis more accessible. However, obtaining public databases to support the development of CAD systems is still a challenge. This study presents an algorithm that identifies data quality flaws, such as duplicate records and missing data, in a database for ASD diagnosis hosted on the Research Electronic Data Capture (REDCap) platform. The tool automates error detection and generates structured reports to help health professionals correct the data. A qualitative evaluation confirmed its usefulness and indicated that the time needed to identify errors can be reduced by a factor of approximately 15, which helps minimize the effort required to maintain a consistent database.
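
The kind of check described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the field names (`record_id`, `name`, `birth_date`), the similarity threshold of 0.85, and the report layout are all assumptions made for illustration, and the records are plain dictionaries rather than data exported from REDCap. Fuzzy name matching uses `difflib.SequenceMatcher` from the Python standard library.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical field names; the abstract does not list the real database fields.
REQUIRED_FIELDS = ("record_id", "name", "birth_date")


def find_missing(records):
    """Return (record_id, field) pairs for required fields that are empty."""
    issues = []
    for rec in records:
        for field in REQUIRED_FIELDS:
            value = rec.get(field)
            if value is None or str(value).strip() == "":
                issues.append((rec.get("record_id"), field))
    return issues


def name_similarity(a, b):
    """Case-insensitive similarity ratio in [0, 1] between two names."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def find_duplicates(records, threshold=0.85):
    """Return (id_a, id_b, score) for pairs that look like the same person.

    Assumption: two records are candidate duplicates when their birth dates
    are equal and their names are at least `threshold` similar.
    """
    pairs = []
    for left, right in combinations(records, 2):
        name_l = left.get("name") or ""
        name_r = right.get("name") or ""
        if not name_l or not name_r:
            continue  # a missing name is reported by find_missing instead
        if left.get("birth_date") != right.get("birth_date"):
            continue
        score = name_similarity(name_l, name_r)
        if score >= threshold:
            pairs.append((left.get("record_id"), right.get("record_id"), round(score, 2)))
    return pairs


def build_report(records, threshold=0.85):
    """Collect both checks into one structured report for manual review."""
    return {
        "missing": find_missing(records),
        "possible_duplicates": find_duplicates(records, threshold),
    }
```

A report built this way only flags candidates. Deciding whether two flagged records really describe the same participant, and how to fill a missing field, remains a task for the health professionals who maintain the database.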

Published
2025-06-09
MAURICIO, Sarah Klock; BRENTANI, Helena; PORTOLESE, Joana; MADANELO, Luciana; MACHADO-LIMA, Ariane; NUNES, Lima Fátima L. S. An algorithm for deduplication and insurance of quality in a database for ASD diagnosis: proposal and qualitative evaluation. In: UNDERGRADUATE RESEARCH WORKS CONTEST - BRAZILIAN SYMPOSIUM ON COMPUTING APPLIED TO HEALTHCARE (SBCAS), 25., 2025, Porto Alegre/RS. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 7-12. ISSN 2763-8987. DOI: https://doi.org/10.5753/sbcas_estendido.2025.7265.