An Intelligent System Based on LLMs for Automated Healthcare Data Collection
Abstract
The Internet contains a large volume of information relevant to the health sector, spread across many websites. The sheer amount of data and the variety of sources make this information difficult to extract efficiently. This work proposes automating the collection using Web Scraping, Natural Language Processing, and Language Models. BeautifulSoup (for static websites) and Selenium (for dynamic websites) were applied, with support from the GPT-4o-mini model. The tests showed a significant reduction in collection and structuring time. The methodology achieved 90% accuracy on static websites and 73% on dynamic websites. Despite limitations with long texts and Document Object Model (DOM) structure, the results show that the approach is viable.
References
Abdullah, S. S., Rahaman, M. S., and Rahman, M. S. (2013). Analysis of stock market using text mining and natural language processing. In 2013 International Conference on Informatics, Electronics and Vision (ICIEV), pages 1–6.
Abdurakhmonova, N., Alisher, I., and Toirova, G. (2022). Applying web crawler technologies for compiling parallel corpora as one stage of natural language processing. In 2022 7th International Conference on Computer Science and Engineering (UBMK), pages 73–75.
Bitterman, D. S., Goldner, E., Finan, S., Harris, D., Durbin, E. B., Hochheiser, H., Warner, J. L., Mak, R. H., Miller, T., and Savova, G. K. (2023). An end-to-end natural language processing system for automatically extracting radiation therapy events from clinical texts. International Journal of Radiation Oncology*Biology*Physics, 117(1):262–273.
Guo, D., Yue, A., Ning, F., Huang, D., Chang, B., Duan, Q., Zhang, L., Chen, Z., Zhang, Z., Zhan, E., Zhang, Q., Jiang, K., Li, R., Zhao, S., and Wei, Z. (2023). A study case of automatic archival research and compilation using large language models. In 2023 IEEE International Conference on Knowledge Graph (ICKG), pages 52–59.
Kim, S., Choi, S., and Seok, J. (2021). Keyword extraction in economics literatures using natural language processing. In 2021 Twelfth International Conference on Ubiquitous and Future Networks (ICUFN), pages 75–77.
Li, H., Li, Z., and Rao, Z. (2019). Text mining strategy of power customer service work order based on natural language processing technology. In 2019 International Conference on Intelligent Computing, Automation and Systems (ICICAS), pages 335–338.
Liu, X., Zhou, Y., and Wang, Z. (2019). Recognition and extraction of named entities in online medical diagnosis data based on a deep neural network. Journal of Visual Communication and Image Representation, 60:1–15.
Lunn, S., Zhu, J., and Ross, M. (2020). Utilizing web scraping and natural language processing to better inform pedagogical practice. In 2020 IEEE Frontiers in Education Conference (FIE), pages 1–9.
OpenAI (2024). Hello gpt-4o. Available from: [link]. Accessed: 2025-06-01.
Pichiyan, V., Muthulingam, S., G, S., Nalajala, S., Ch, A., and Das, M. N. (2023). Web scraping using natural language processing: Exploiting unstructured text for data extraction and analysis. Procedia Computer Science, 230:193–202. 3rd International Conference on Evolutionary Computing and Mobile Sustainable Networks (ICECMSN 2023).
Single, J. I., Schmidt, J., and Denecke, J. (2020). Knowledge acquisition from chemical accident databases using an ontology-based method and natural language processing. Safety Science, 129:104747.
Son, M., Won, Y.-J., and Lee, S. (2025). Optimizing large language models: A deep dive into effective prompt engineering techniques. Applied Sciences by MDPI.
Wu, J. T., Dernoncourt, F., Gehrmann, S., Tyler, P. D., Moseley, E. T., Carlson, E. T., Grant, D. W., Li, Y., Welt, J., and Celi, L. A. (2018). Behind the scenes: A medical natural language processing project. International Journal of Medical Informatics, 112:68–73.
Yuan, F., Yuan, S., Wu, Z., and Li, L. (2024). How vocabulary sharing facilitates multilingualism in llama?
Published
2025-09-29
How to Cite
GUIMARÃES, Nathália C. O. C.; MELLO, Felipe C. B.; FERNANDES, Talita J.; CAMPELO, Luís F. H.; TEODORO, João G. M. G.; SOARES, Felipe A. L. An Intelligent System Based on LLMs for Automated Healthcare Data Collection. In: NATIONAL MEETING ON ARTIFICIAL AND COMPUTATIONAL INTELLIGENCE (ENIAC), 22., 2025, Fortaleza/CE. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 1984-1995. ISSN 2763-9061. DOI: https://doi.org/10.5753/eniac.2025.14369.
