Do SLM agents screen papers as well as LLMs?
Abstract
Summarizing scientific knowledge is crucial, and the Systematic Literature Review is one of the main methods used for this purpose, especially in evidence-based research; it is, however, a complex and time-consuming process. This work focuses on the paper screening stage, one of the most critical steps, since the quality of all subsequent stages depends on it. The literature reports good results with commercial LLMs such as ChatGPT and Gemini for this task, but studies on the use of smaller models (SLMs) run locally are still lacking. We analyze three SLM-based approaches against a commercial LLM used as a baseline. The results show good SLM performance when the task is stratified and simplified into subtasks. Qwen 3 - 8B achieved accuracy of up to 94.35%. With a multi-agent approach, Phi 4 - 14B reached 79.5% and Qwen 3 - 4B 78.8% agreement relative to the commercial LLM.
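The multi-agent approach mentioned above implies combining per-agent screening decisions into a single verdict. One plausible way to do this is a majority vote over each agent's include/exclude answer; the sketch below illustrates that aggregation step only (the `screen_paper` function and the tie-breaking rule toward inclusion are assumptions for illustration, not the paper's actual method):

```python
from collections import Counter

def screen_paper(votes):
    """Aggregate include/exclude votes from multiple SLM agents.

    `votes` is a list of "include"/"exclude" strings, one per agent.
    Ties default to "include" so borderline papers are kept for
    human review rather than silently discarded.
    """
    counts = Counter(votes)
    if counts["include"] >= counts["exclude"]:
        return "include"
    return "exclude"
```

Defaulting ties to "include" biases the pipeline toward recall, which is usually the safer failure mode in systematic review screening, where a missed relevant paper is costlier than an extra one passed to human reviewers.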
