Do SLM agents screen papers as well as LLMs?
Abstract
Summarizing scientific knowledge is crucial, and the Systematic Literature Review is one of the main methods used for it, especially in evidence-based research; it is, however, a complex and time-consuming process. This work focuses on the article screening stage, one of the most critical steps, since the quality of all subsequent phases depends on it. The literature reports promising results using commercial LLMs such as ChatGPT and Gemini for this task, but studies on smaller language models (SLMs) running locally are lacking. We analyzed three approaches using SLMs against a commercial LLM baseline. The results show that SLMs perform well when the task is stratified and simplified into subtasks: Qwen 3 - 8B achieved an accuracy of up to 94.35%. Using a multi-agent approach, Phi 4 - 14B and Qwen 3 - 4B reached 79.5% and 78.8%, respectively, relative to the commercial LLM baseline.
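The stratification and aggregation ideas above can be illustrated with a minimal sketch: each inclusion criterion becomes its own simplified yes/no subtask, and a multi-agent decision is taken by majority vote over independent model verdicts. The criteria list, prompt wording, and function names here are hypothetical illustrations, not the prompts or pipeline actually used in the study.

```python
from collections import Counter

# Hypothetical inclusion criteria, one simplified subtask each.
CRITERIA = [
    "The study evaluates a language model",
    "The study addresses literature screening",
]

def subtask_prompt(criterion, title, abstract):
    """Build one yes/no subtask prompt per inclusion criterion."""
    return (
        f"Criterion: {criterion}\n"
        f"Title: {title}\n"
        f"Abstract: {abstract}\n"
        "Does the paper satisfy the criterion? Answer YES or NO."
    )

def decide_from_subtasks(votes):
    """Include a paper only if every per-criterion subtask answered YES."""
    return all(v.strip().upper() == "YES" for v in votes)

def majority_vote(decisions):
    """Multi-agent aggregation: majority vote over independent agent decisions."""
    return Counter(decisions).most_common(1)[0][0]
```

In this sketch, each subtask prompt would be sent to a locally served SLM, and `majority_vote` would combine the include/exclude decisions of several such agents.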