Named Entity Recognition for Clinical Portuguese Corpus with Conditional Random Fields and Semantic Groups

  • João Vitor Andrioli de Souza PUCPR
  • Yohan Bonescki Gumiel PUCPR
  • Lucas Emanuel Silva e Oliveira PUCPR
  • Claudia Maria Cabral Moro PUCPR

Abstract

Given the difficulty of extracting entities from Electronic Health Record (EHR) texts in Portuguese, we explore the Conditional Random Fields (CRF) algorithm to build a Named Entity Recognition (NER) system based on a corpus of clinical Portuguese data annotated by experts. We present the challenges and methods involved in classifying Abbreviations, Disorders, Procedures and Chemicals within the texts. With a meaningful set of features and the best-performing parameters, the results show that the method is promising and may support other biomedical tasks. Nonetheless, further experiments with more features, different architectures and more sophisticated preprocessing steps are needed.
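
As a rough illustration of what a feature set for a CRF tagger can look like, the sketch below builds one feature dictionary per token, with a one-token context window. The feature names, the example sentence and its BIO tags are assumptions made for illustration; they are not the features or data used in the paper.

```python
# Illustrative sketch only: feature names, sentence and tags are hypothetical,
# not the feature set or corpus described in the paper.

# The four semantic groups named in the abstract.
LABELS = ["Abbreviation", "Disorder", "Procedure", "Chemical"]


def token_features(tokens, i):
    """Build a feature dict for tokens[i] using a one-token context window."""
    word = tokens[i]
    feats = {
        "lower": word.lower(),
        "suffix3": word[-3:].lower(),
        "is_upper": word.isupper(),  # all-caps tokens are often abbreviations
        "is_title": word.istitle(),
        "has_digit": any(ch.isdigit() for ch in word),
    }
    if i > 0:
        feats["prev_lower"] = tokens[i - 1].lower()
    else:
        feats["BOS"] = True  # beginning of sentence
    if i < len(tokens) - 1:
        feats["next_lower"] = tokens[i + 1].lower()
    else:
        feats["EOS"] = True  # end of sentence
    return feats


def sentence_features(tokens):
    """Return one feature dict per token in the sentence."""
    return [token_features(tokens, i) for i in range(len(tokens))]


# Hypothetical clinical sentence with hypothetical BIO tags.
tokens = ["Paciente", "com", "HAS", "em", "uso", "de", "losartana"]
tags = ["O", "O", "B-Abbreviation", "O", "O", "O", "B-Chemical"]
X = sentence_features(tokens)
```

Lists of per-token dictionaries like `X`, paired with tag sequences like `tags`, are the kind of input that CRF toolkits such as sklearn-crfsuite typically accept. The source does not name the toolkit used, so that choice is also an assumption.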

Published
11/06/2019
DE SOUZA, João Vitor Andrioli; GUMIEL, Yohan Bonescki; OLIVEIRA, Lucas Emanuel Silva e; MORO, Claudia Maria Cabral. Named Entity Recognition for Clinical Portuguese Corpus with Conditional Random Fields and Semantic Groups. In: SIMPÓSIO BRASILEIRO DE COMPUTAÇÃO APLICADA À SAÚDE (SBCAS), 19., 2019, Niterói. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2019. p. 318-323. ISSN 2763-8952. DOI: https://doi.org/10.5753/sbcas.2019.6269.