Natural Language Processing for Clinical Data Classification




named-entity recognition, clinical dataset, term frequency-inverse document frequency


The widespread adoption of systems for managing and recording medical documents (MD) has generated a large volume of unstructured data. These data consist of free text in which conditions and procedures are often described with ambiguous expressions, which makes manual categorization of MD error-prone. This work aims to label and classify MD written in Portuguese under two schemes: binary (Recipes and Others) and multi-class (Recipes, Exams, Certificates, and Others). N-grams and term frequency-inverse document frequency (TF–IDF) were used in the text vectorization step. The results are promising: Kappa reached 0.99 for the binary classification and 0.97 for the multi-class classification. Classifying MD in this way makes it possible to segment the information needed to manage prescription drugs.
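The vectorization step and the Kappa metric mentioned in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the authors' pipeline: the function names, the whitespace tokenizer, the choice of unigrams plus bigrams, and the smoothed IDF variant are all assumptions made for illustration, since the abstract does not specify them.

```python
import math
from collections import Counter


def word_ngrams(text, n_max=2):
    """Return all word n-grams of length 1..n_max (assumed: whitespace tokens)."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(tokens) - n + 1)]


def tfidf(docs, n_max=2):
    """Return one {ngram: weight} dict per document.

    Assumption: raw term frequency times the smoothed IDF
    ln((1 + N) / (1 + df)) + 1, which is one common variant.
    """
    counts_per_doc = [Counter(word_ngrams(d, n_max)) for d in docs]
    n_docs = len(docs)
    # Document frequency: number of documents containing each n-gram.
    df = Counter(g for counts in counts_per_doc for g in counts)
    idf = {g: math.log((1 + n_docs) / (1 + c)) + 1 for g, c in df.items()}
    return [{g: tf * idf[g] for g, tf in counts.items()}
            for counts in counts_per_doc]


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e)."""
    n = len(y_true)
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n  # observed agreement
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    # Agreement expected by chance from the two label distributions.
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if p_e == 1:
        return 1.0  # convention chosen here for the degenerate single-class case
    return (p_o - p_e) / (1 - p_e)


# Hypothetical toy documents and labels, not taken from the paper's dataset.
docs = ["tomar um comprimido", "exame de sangue"]
vectors = tfidf(docs)
kappa = cohen_kappa(["Recipes", "Recipes", "Others", "Others"],
                    ["Recipes", "Others", "Others", "Others"])
```

In this toy case three of the four predictions agree with the reference labels, and Kappa discounts the share of that agreement expected by chance, which is why it is a stricter summary than plain accuracy.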







How to Cite

L. V. de Sousa, O., M. V. Magalhães, D., E. S. Campelo, V., & R. V. e Silva, R. (2022). Natural Language Processing for Clinical Data Classification. ISys - Brazilian Journal of Information Systems, 15(1), 13:1–13:17.


