A framework for automatic information extraction from pharmaceutical patents

Pablo Cecilio; Antônio Pereira; Felipe Viegas; Juliana Rosa; Washington Cunha; Fabiana Testa; Elisa Tuler; Leonardo Rocha

doi:10.5753/webmedia_estendido.2023.233297

Pablo Cecilio, UFSJ
Antônio Pereira, UFSJ
Felipe Viegas, UFMG
Juliana Rosa, Universidade do Porto
Washington Cunha, UFMG
Fabiana Testa, USP
Elisa Tuler, UFSJ
Leonardo Rocha, UFSJ

Abstract

Managing pharmaceutical patents often involves laborious manual searches, because the documents describe the invention's claims, methodology, and results in extensive detail. To address this challenge, we propose PATopics, a framework designed to extract pertinent information from the textual data in patents. PATopics uses this information to construct relevant topics, establish correlations with useful patent characteristics, and present the resulting insights through a user-friendly web interface. To evaluate the framework, we conducted a study of 4,832 pharmaceutical patents associated with 809 molecules patented by 478 companies. We analyzed the framework's performance against the requirements of three user profiles: researchers, chemists, and companies. The results highlight the practicality and usefulness of PATopics in the pharmaceutical domain, showing that it can help users from different backgrounds navigate patent information and extract valuable insights from it.

Keywords: Natural Language Processing, Topic Modeling
