A framework for automatic information extraction from pharmaceutical patents

  • Pablo Cecilio UFSJ
  • Antônio Pereira UFSJ
  • Felipe Viegas UFMG
  • Juliana Rosa Universidade do Porto
  • Washington Cunha UFMG
  • Fabiana Testa USP
  • Elisa Tuler UFSJ
  • Leonardo Rocha UFSJ


Managing pharmaceutical patents often requires laborious manual searches, because each document describes the invention's claims, methodology, and results in extensive detail. To address this challenge, we propose PATopics, a framework that extracts pertinent information from the text of patents. PATopics uses this information to build relevant topics, correlates those topics with useful patent characteristics, and presents the resulting insights through a user-friendly web interface. To evaluate the framework, we conducted a study of 4,832 pharmaceutical patents associated with 809 molecules patented by 478 companies. We analyzed its performance against the requirements of three user profiles: researchers, chemists, and companies. The results highlight the practicality and usefulness of PATopics in the pharmaceutical domain, showing that it can help users from different backgrounds navigate patent information and extract valuable insights from it.
Keywords: Natural Language Processing, Topic Modeling
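
As a rough illustration of the first stage described in the abstract (turning patent text into weighted terms from which topics can later be built), the sketch below computes TF-IDF weights in plain Python. It is not the PATopics implementation: the abstract does not specify the text representation, so the use of TF-IDF, the tiny stopword list, the helper names (`tokenize`, `tfidf`, `top_terms`), and the toy documents are all assumptions made for illustration.

```python
import math
import re
from collections import Counter

# Assumed, deliberately tiny stopword list; a real system would use a fuller one.
STOPWORDS = {"of", "and", "using"}

def tokenize(text):
    """Lowercase, keep alphabetic runs, and drop stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]

def tfidf(docs):
    """Return one {term: weight} dict per document, with weight = tf * log(N / df)."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero on empty documents
        weights.append(
            {t: (c / total) * math.log(n_docs / df[t]) for t, c in tf.items()}
        )
    return weights

def top_terms(doc_weights, k=3):
    """Return the k highest-weighted terms, breaking ties alphabetically."""
    ranked = sorted(doc_weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]

# Hypothetical toy "patent" snippets, invented for illustration only.
docs = [
    "tablet composition comprising compound and polymer coating",
    "method of synthesis of compound using catalyst",
    "tablet coating process using polymer",
]
weights = tfidf(docs)
for i, w in enumerate(weights):
    print(i, top_terms(w))
```

Terms that occur in fewer documents receive higher weights, which is why such a representation is a common starting point for topic modeling; a full pipeline would feed these weights into a topic-extraction step, which is omitted here.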


CECILIO, Pablo; PEREIRA, Antônio; VIEGAS, Felipe; ROSA, Juliana; CUNHA, Washington; TESTA, Fabiana; TULER, Elisa; ROCHA, Leonardo. Um framework para extração automática de informações em patentes farmacêuticas. In: WORKSHOP DE FERRAMENTAS E APLICAÇÕES - SIMPÓSIO BRASILEIRO DE SISTEMAS MULTIMÍDIA E WEB (WEBMEDIA), 29., 2023, Ribeirão Preto/SP. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2023. p. 97-100. ISSN 2596-1683. DOI: