LLM-Guided Autonomous Agent for News Extraction

Abstract


Maintaining traditional rule-based scrapers is costly, as minor layout changes can break their selectors. This paper presents an autonomous agent that integrates a modular scraping pipeline with a Large Language Model (LLM) and dynamic prompt engineering to extract news without prior knowledge of the HTML structure. The agent runs observe–plan–act loops, using GPT-4 to decide actions (scroll, click, extract) until a stopping criterion is met. Experiments with 4,898 URLs from SEADE and 12 unseen portals showed an average recall of 91% and precision of 95%, statistically matching the legacy scraper. The system sustained over 90% extraction quality even on dynamic DOMs, enabling scalable media monitoring with minimal human input.
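To make the loop concrete, the sketch below outlines an observe–plan–act agent of the kind the abstract describes. It is purely illustrative: the Browser stub, the ask_llm helper, and the Action schema are assumptions rather than the authors' implementation, and the GPT-4 call and real browser automation (e.g., Selenium or Playwright) are stubbed out.

# Illustrative observe-plan-act loop; all names here are hypothetical.
import json
from dataclasses import dataclass

@dataclass
class Action:
    name: str          # "scroll", "click", or "extract"
    target: str = ""   # e.g., a CSS selector; empty when not needed

def ask_llm(observation: str) -> Action:
    """Placeholder for a GPT-4 call that returns the next action as JSON.
    A real implementation would send the observation plus the task prompt
    to the model and parse its JSON reply."""
    reply = '{"name": "extract", "target": "article"}'  # stubbed model output
    return Action(**json.loads(reply))

class Browser:
    """Stub browser; a real agent would wrap Selenium or Playwright."""
    def observe(self) -> str:
        return "<html>...truncated DOM snapshot...</html>"
    def scroll(self) -> None: ...
    def click(self, selector: str) -> None: ...
    def extract(self, selector: str) -> str:
        return "headline, date, body text"

def run_agent(url: str, max_steps: int = 10) -> str:
    browser = Browser()
    for _ in range(max_steps):       # stopping criterion: extraction done or step budget spent
        obs = browser.observe()      # observe: snapshot of the rendered DOM
        action = ask_llm(obs)        # plan: the LLM chooses the next action
        if action.name == "scroll":  # act: execute the chosen action
            browser.scroll()
        elif action.name == "click":
            browser.click(action.target)
        elif action.name == "extract":
            return browser.extract(action.target)
    return ""  # budget exhausted without a successful extraction

if __name__ == "__main__":
    print(run_agent("https://example.com/news"))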

Keywords: web scraping, autonomous agents, large language models (LLMs), prompt engineering, news extraction.

Published: 2025-09-29

NERES DE SOUSA, João V. C.; MINGARDO, Lucas M.; FREIRE, Carlos E. T.; TRAINA, Agma J. M.; TRAINA JUNIOR, Caetano. LLM-Guided Autonomous Agent for News Extraction. In: BRAZILIAN SYMPOSIUM ON DATABASES (SBBD), 40., 2025, Fortaleza/CE. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 278-288. ISSN 2763-8979. DOI: https://doi.org/10.5753/sbbd.2025.247075.