Boosting the discovery of treatments in medicine through distributed word representations

Matheus V. V. Berto; Tiago A. Almeida

doi:10.5753/webmedia_estendido.2024.243672

Matheus V. V. Berto UFSCar
Tiago A. Almeida UFSCar

Abstract

Word embeddings are high-dimensional vectors that encode the meaning of terms or sentences in a text. This well-established approach has improved many Natural Language Processing applications, since the vectors can be generated easily from large textual datasets by a variety of algorithms. In this study, we extend a recently discovered use of word embeddings: the ability to uncover implicit information in a corpus (also known as latent knowledge) that may not be reachable through human analysis alone. More specifically, our work combines word embeddings computed through diverse unsupervised methods in order to extract latent knowledge that could anticipate clinical discoveries in medicine. Using a massive collection of scientific papers related to a highly lethal cancer, Acute Myeloid Leukemia, our study shows that currently approved therapies could have been investigated earlier had the drug-testing notifications issued by our framework been available. Our strategy therefore contributes to faster drug analysis and biomedical discovery. Details about our proposal and an in-depth analysis of the results can be found in Berto et al. [2].
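As a rough illustration of the idea of combining embeddings from several unsupervised methods, the sketch below ranks candidate drug terms by their cosine similarity to a disease term, averaged over multiple embedding models. It is a minimal sketch under stated assumptions, not the framework described in Berto et al. [2]: the averaging rule, the function names (`cosine`, `rank_candidates`), the terms (`aml`, `drug_x`, `drug_y`), and the two-dimensional toy vectors are all hypothetical and chosen only to keep the example self-contained.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_candidates(models, target, candidates):
    """Rank candidate terms by cosine similarity to `target`, averaged over models.

    `models` is a list of dicts mapping a term to its vector; each dict stands in
    for one embedding model. Averaging the per-model scores is an assumption made
    for this sketch, not necessarily how the original framework combines models.
    """
    scores = {}
    for term in candidates:
        per_model = [
            cosine(model[target], model[term])
            for model in models
            if target in model and term in model
        ]
        if per_model:  # skip terms that no model knows about
            scores[term] = sum(per_model) / len(per_model)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy vectors; real embeddings would have hundreds of dimensions
# and would be trained on the scientific corpus.
model_a = {"aml": [1.0, 0.0], "drug_x": [1.0, 0.0], "drug_y": [0.0, 1.0]}
model_b = {"aml": [0.0, 2.0], "drug_x": [0.0, 1.0], "drug_y": [1.0, 0.0]}

ranking = rank_candidates([model_a, model_b], "aml", ["drug_x", "drug_y"])
for term, score in ranking:
    print(f"{term}: {score:.2f}")
```

In a setting like the one the abstract describes, such a ranking would presumably be recomputed on successive snapshots of the literature, and a term whose score rises over time would be a candidate for a notification. That temporal step is not shown here.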

Keywords: distributed vector representation, word embeddings, knowledge discovery in databases, natural language processing, AI in medicine

References

Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin. 2003. A Neural Probabilistic Language Model. JMLR 3 (2003), 1137–1155. DOI: 10.1162/153244303322533223

Matheus V. V. Berto, Breno L. Freitas, Carolina Scarton, João A. Machado-Neto, and Tiago A. Almeida. 2024. Accelerating discoveries in medicine using distributed vector representations of words. Expert Systems with Applications 250 (2024), 123566. DOI: 10.1016/j.eswa.2024.123566

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2016. Enriching Word Vectors with Subword Information. Trans. of the ACL 5 (07 2016), 12. DOI: 10.1162/tacl_a_00051

Bob Löwenberg, Gert J. Ossenkoppele, Wim van Putten, Harry C. Schouten, Carlos Graux, Augustin Ferrant, Pieter Sonneveld, Johan Maertens, Mojca Jongen-Lavrencic, Marie von Lilienfeld-Toal, Bart J. Biemond, Edo Vellenga, Marinus van Marwijk Kooy, Leo F. Verdonck, Joachim Beck, Hartmut Döhner, Alois Gratwohl, Thomas Pabst, and Gregor Verhoef. 2009. High-Dose Daunorubicin in Older Patients with Acute Myeloid Leukemia. New England J. of Medicine 361, 13 (Sept. 2009), 1235–1248. DOI: 10.1056/nejmoa0901409

Tomás Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient Estimation of Word Representations in Vector Space. arXiv:1301.3781

Magnus Sahlgren. 2008. The distributional hypothesis. Italian J. of Linguistics 20 (01 2008), 33–54.

Pranav Shetty and Rampi Ramprasad. 2021. Automated knowledge extraction from polymer literature using natural language processing. iScience 24, 1 (Jan. 2021), 101922. DOI: 10.1016/j.isci.2020.101922

Vahe Tshitoyan, John Dagdelen, Leigh Weston, Alexander R Dunn, Ziqin Rong, Olga Kononova, Kristin A. Persson, Gerbrand Ceder, and Anubhav Jain. 2019. Unsupervised word embeddings capture latent knowledge from materials science literature. Nature 571 (July 2019), 95–98. DOI: 10.1038/s41586-019-1335-8

Feifan Yang. 2022. Natural Language Processing Applied on Large Scale Data Extraction from Scientific Papers in Fuel Cells. In Proc. of the 5th NLPIR (Sanya, China) (NLPIR 2021). ACM, New York, NY, USA, 168–175. DOI: 10.1145/3508230.3508256