Advances in the treatment of textual data in health with Artificial Intelligence techniques: An algorithm for data clustering
Abstract
The advance of Information Technology (IT) in healthcare has generated a large volume of data, much of which is never adequately processed. Artificial Intelligence (AI) can help put this data to use, but free-text, heterogeneous clinical records remain challenging to handle. This study developed a Python algorithm that pre-processes 217,000 clinical diagnoses and clusters them by structural similarity, focusing on terms related to Dengue and COVID-19. Preliminary results show that the approach organizes the data effectively, which facilitates further analysis. Despite this initial success, challenges such as the configuration of terms and the heterogeneity of the texts indicate that the accuracy of the process still needs to improve.
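The abstract does not specify the normalization steps, the similarity measure, or the term configuration, so the following is only a minimal sketch of what pre-processing and similarity-based clustering of diagnosis strings could look like. Everything in it is an assumption made for illustration: the `TERMS` dictionary, the function names, the 0.8 threshold, and the use of the standard library's `difflib.SequenceMatcher` as a stand-in for "structural similarity". It is not the study's algorithm.

```python
import re
import unicodedata
from difflib import SequenceMatcher

# Hypothetical term configuration; the study's actual term list is not given.
TERMS = {
    "dengue": ["dengue"],
    "covid-19": ["covid", "sars cov 2", "coronavirus"],
}


def normalize(text):
    """Lowercase, strip accents, and collapse punctuation into single spaces."""
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9]+", " ", text.lower())
    return text.strip()


def tag_terms(normalized_text):
    """Return the sorted labels whose keywords occur in the normalized text."""
    return sorted(
        label
        for label, keywords in TERMS.items()
        if any(keyword in normalized_text for keyword in keywords)
    )


def cluster(diagnoses, threshold=0.8):
    """Greedy clustering: each diagnosis joins the first cluster whose
    representative string is at least `threshold` similar, else starts a new one."""
    clusters = []  # list of (normalized representative, original members)
    for raw in diagnoses:
        norm = normalize(raw)
        for representative, members in clusters:
            if SequenceMatcher(None, norm, representative).ratio() >= threshold:
                members.append(raw)
                break
        else:
            clusters.append((norm, [raw]))
    return clusters


if __name__ == "__main__":
    sample = ["Dengue clássica", "dengue classica.", "COVID-19 suspeito"]
    for representative, members in cluster(sample):
        print(representative, tag_terms(representative), members)
```

In this sketch the two spellings of the Dengue diagnosis normalize to the same string and fall into one cluster, while the COVID-19 entry starts its own. A greedy pass like this compares each record against every existing cluster representative, so at the scale of 217,000 diagnoses a real implementation would likely need deduplication after normalization or a vectorized similarity method.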
