Small vs. Large Language Models: A Comparative Study on Multiple-Choice Question Answering in Portuguese
Abstract
Generative models are widely used for Multiple-Choice Question Answering (MCQA). While performance often improves with model size, prior work reports inconsistencies depending on task, prompting strategy, and language. We evaluate eleven open models, monolingual and multilingual, ranging from millions to billions of parameters, on a Portuguese MCQA benchmark built from college entrance exams, under six prompting strategies: zero-shot, one-shot, few-shot, shuffled-order (to probe positional effects), and two per-option label-only settings. We also quantify positional bias with a normalized positional-bias coefficient (BPC). Overall, accuracy increases with parameter count, but the size of the gain varies across strategies. LLaMA-3.1-Storm-8B achieves the best average accuracy, and Sabiá-7B, a model trained with a strong focus on Portuguese, is competitive among models of similar size. Smaller models (e.g., Tucano-2B, Qwen2-0.5B) attain solid results in specific settings, particularly with per-option scoring. These findings suggest that, although larger models are generally more robust, carefully chosen prompting can make smaller models viable under resource constraints. In summary, performance scales with size but depends on prompting: per-option configurations narrow the SLM–LLM gap, and positional bias is measurable via BPC. Future work includes multi-shuffle BPC estimation, calibration, log-likelihood baselines for per-option scoring, and extensions to additional domains and languages.
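The abstract names two technical components, per-option label-only scoring and the BPC metric, without giving their definitions. The sketch below illustrates one plausible reading in Python: a per-option answerer that picks the argmax over independently scored answer labels, and a BPC defined as the normalized total-variation distance between the distribution of chosen answer positions and the uniform distribution. The `score` callable, the five-option default, and this particular BPC normalization are assumptions made for illustration, not the paper's definitions.

```python
from collections import Counter
from typing import Callable, Sequence


def answer_per_option(question: str,
                      options: Sequence[str],
                      score: Callable[[str, str], float]) -> int:
    """Score each option independently and return the argmax index.

    `score(question, option)` stands in for any per-option scorer,
    e.g. a model's log-likelihood of the option label given the
    question (an assumed interface, not the paper's exact setup).
    """
    scores = [score(question, opt) for opt in options]
    return max(range(len(options)), key=scores.__getitem__)


def positional_bias_coefficient(chosen_positions: Sequence[int],
                                n_options: int = 5) -> float:
    """Assumed BPC: distance of the chosen-position distribution from
    uniform, rescaled to [0, 1]. 0 means no positional preference;
    1 means the model always picks the same position."""
    total = len(chosen_positions)
    counts = Counter(chosen_positions)
    uniform = 1.0 / n_options
    # Total-variation distance between observed choice frequencies
    # and the uniform distribution over answer positions.
    tvd = 0.5 * sum(abs(counts.get(p, 0) / total - uniform)
                    for p in range(n_options))
    # Maximum possible TVD is 1 - 1/n_options, so divide to normalize.
    return tvd / (1.0 - uniform)
```

Under this reading, comparing BPC computed before and after shuffling the option order (the shuffled-order strategy above) would expose exactly the positional effects the coefficient is meant to capture.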