An algorithm for deduplication and insurance of quality in a database for ASD diagnosis: proposal and qualitative evaluation
Abstract
Autism Spectrum Disorder (ASD) diagnosis requires well-trained health professionals, which limits access to diagnosis. Computer-aided diagnosis (CAD) using biomarkers can be an alternative that makes diagnosis more accessible. However, building public databases to support the development of CAD systems remains a challenge. This study presents an algorithm that identifies data quality flaws, such as duplicate records and missing data, in a database for ASD diagnosis in the Research Electronic Data Capture (REDCap) platform. The tool automates error detection and generates structured reports to help health professionals correct the data. A qualitative evaluation confirmed its usefulness and indicated that the time needed to identify errors can decrease by a factor of approximately 15, reducing the effort required to maintain a consistent database.
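The kind of check described above can be illustrated with a minimal sketch. It is not the study's implementation: the field names (`record_id`, `name`, `birth_date`, `guardian`), the similarity threshold, and the rule of comparing names only between records with the same birth date are all assumptions made for illustration. The sketch uses Python's standard `difflib.SequenceMatcher` to score name similarity and works on plain dictionaries, such as those that might be obtained from a REDCap export.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical field names; a real REDCap project defines its own.
REQUIRED_FIELDS = ("name", "birth_date", "guardian")
SIMILARITY_THRESHOLD = 0.85  # assumed cutoff, to be tuned on real data


def normalize(text):
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(str(text).lower().split())


def find_missing(records):
    """Return (record_id, field) pairs where a required field is empty."""
    return [
        (rec["record_id"], field)
        for rec in records
        for field in REQUIRED_FIELDS
        if not str(rec.get(field) or "").strip()
    ]


def find_duplicates(records):
    """Return (id_a, id_b, score) for pairs sharing a birth date with similar names."""
    pairs = []
    for a, b in combinations(records, 2):
        # Assumed blocking rule: only compare records with the same birth date.
        if not a.get("birth_date") or a.get("birth_date") != b.get("birth_date"):
            continue
        score = SequenceMatcher(
            None, normalize(a.get("name", "")), normalize(b.get("name", ""))
        ).ratio()
        if score >= SIMILARITY_THRESHOLD:
            pairs.append((a["record_id"], b["record_id"], round(score, 2)))
    return pairs


def build_report(records):
    """Collect both checks into one structured report."""
    return {"missing": find_missing(records), "duplicates": find_duplicates(records)}


# Invented example records, for illustration only.
records = [
    {"record_id": "1", "name": "Maria Silva", "birth_date": "2015-03-02", "guardian": "Ana Silva"},
    {"record_id": "2", "name": "maria  silva", "birth_date": "2015-03-02", "guardian": ""},
    {"record_id": "3", "name": "Joao Souza", "birth_date": "2016-07-19", "guardian": "Carlos Souza"},
]
report = build_report(records)
print(report)
```

In this sketch the report only flags candidates. Deciding whether two flagged records are truly the same person, or how to fill a missing field, is left to the health professional reviewing it.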
