A Systematic Review on Language Model Compression: Perspectives for Efficiency and Sustainability in Information Systems

  • Lair Anderson de P. Mesquita (IFCE)
  • Saulo Anderson F. de Oliveira (IFCE)

Abstract


Research Context: The rapid adoption of Large Language Models (LLMs) offers unprecedented opportunities for Intelligent Information Systems. However, their high computational cost creates significant deployment barriers in resource-constrained environments, limiting access for smaller organizations and developing regions and reinforcing social and economic disparities in the adoption of these systems.

Scientific and/or Practical Problem: LLMs demand excessive resources, which hinders their integration into sustainable Information Systems. Furthermore, there is a lack of systematic evidence on how compression affects robustness, factuality, and applicability in different contexts.

Proposed Solution and/or Analysis: This work presents a systematic literature review of 30 peer-reviewed studies on LLM compression methods. The review identifies trade-offs, evaluates empirical results, and highlights gaps for future research.

Related IS Theory: We interpret computational efficiency and accessibility as strategic resources (Resource-Based View) that enable the development of sustainable and equitable digital infrastructures.

Research Method: The study follows the PICO structure, with inclusion/exclusion criteria applied to works published between 2018 and January 2025. It is organized around eight research questions addressing efficiency, robustness, hardware compatibility, knowledge preservation, hybridization, architectural adaptation, evaluation metrics, and industrial viability.

Summary of Results: Our study shows that efficiency- and system-focused methods are the most developed and investigated approaches to LLM compression for training and inference. Among these techniques, quantization consistently outperforms pruning and error-based methods, balancing performance with resource savings. Research on robustness, generalization, and practical deployment, however, remains limited. Recent advances in hardware-aware co-design show potential for scalability and lower power consumption, yet studies on the long-term sustainability, ethical issues, and social impacts of LLM compression remain rare in the current scientific literature.

Contributions and Impact to the IS area: This work contributes by mapping key challenges in LLM compression research, offering insights for designing efficient, sustainable, and socially responsible intelligent systems that align AI’s technical progress with goals of inclusion and environmental responsibility.
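
To make the compared techniques concrete, the short Python sketch below contrasts two generic compression primitives that recur throughout the surveyed literature: round-to-nearest 8-bit weight quantization and unstructured magnitude pruning, compared by the weight-reconstruction error each introduces. It is an illustrative toy example under assumed settings (a random 256x256 matrix, a 50% sparsity level, and the helper names quantize_int8, prune_magnitude, and relative_error), not the procedure of any specific surveyed method.

import numpy as np

# Stand-in dense weight matrix; real LLM layers are far larger.
rng = np.random.default_rng(0)
W = rng.normal(size=(256, 256)).astype(np.float32)

def quantize_int8(w):
    # Symmetric per-output-channel round-to-nearest quantization to int8,
    # immediately dequantized to expose the error introduced by rounding.
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    q = np.clip(np.round(w / scale), -127, 127)
    return q * scale

def prune_magnitude(w, sparsity=0.5):
    # Unstructured magnitude pruning: zero the smallest-magnitude weights
    # until the requested fraction of entries is removed.
    threshold = np.quantile(np.abs(w), sparsity)
    return np.where(np.abs(w) >= threshold, w, 0.0)

def relative_error(w, w_hat):
    # Frobenius-norm reconstruction error relative to the dense weights.
    return float(np.linalg.norm(w - w_hat) / np.linalg.norm(w))

print("int8 quantization error:", relative_error(W, quantize_int8(W)))
print("50% magnitude pruning error:", relative_error(W, prune_magnitude(W)))

Per-channel scaling is used here because a single per-tensor scale is more sensitive to outlier weights, one of the recurring issues that the outlier-aware quantization methods in the reference list address.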

References

Ashkboos, S., Mohtashami, A., Croci, M. L., Li, B., Cameron, P., Jaggi, M., Alistarh, D., Hoefler, T., and Hensman, J. (2024). Quarot: Outlier-free 4-bit inference in rotated llms.

Bender, E. M., Gebru, T., McMillan-Major, A., and Shmitchell, S. (2021). On the dangers of stochastic parrots: Can language models be too big? Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency (FAccT), pages 610–623.

Brown, T., Mann, B., Ryder, N., Subbiah, M., et al. (2020). Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901.

Chen, M., Shao, W., Xu, P., Wang, J., Gao, P., Zhang, K., and Luo, P. (2024). Efficientqat: Efficient quantization-aware training for large language models.

Dettmers, T., Lewis, M., Belkada, Y., and Zettlemoyer, L. (2022). Llm.int8(): 8-bit matrix multiplication for transformers at scale.

Dettmers, T., Pagnoni, A., Holtzman, A., and Zettlemoyer, L. (2023a). Qlora: Efficient finetuning of quantized llms.

Dettmers, T., Svirschevski, R., Egiazarian, V., Kuznedelev, D., Frantar, E., Ashkboos, S., Borzunov, A., Hoefler, T., and Alistarh, D. (2023b). Spqr: A sparse-quantized representation for near-lossless llm weight compression.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Egiazarian, V., Panferov, A., Kuznedelev, D., Frantar, E., Babenko, A., and Alistarh, D. (2024). Extreme compression of large language models via additive quantization.

Frantar, E. and Alistarh, D. (2023). Sparsegpt: Massive language models can be accurately pruned in one-shot.

Frantar, E., Ashkboos, S., Hoefler, T., and Alistarh, D. (2023). Gptq: Accurate post-training quantization for generative pre-trained transformers.

Ganesh, P., Chen, Y., Lou, X., et al. (2020). Compressing large-scale transformer-based models: A survey. arXiv preprint arXiv:2006.09282.

Gholami, A., Kim, S., Dong, Z., et al. (2021). A survey of quantization methods for efficient neural network inference. arXiv preprint arXiv:2103.13630.

Guan, Z., Huang, H., Su, Y., Huang, H., Wong, N., and Yu, H. (2024). Aptq: Attention-aware post-training mixed-precision quantization for large language models.

Han, S., Mao, H., and Dally, W. J. (2015). Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149.

Hooker, S., Courville, A., Clark, G., Dauphin, Y., and Frome, A. (2020). Compressed to impress: Understanding the effects of model compression on fairness, robustness, and accuracy. In NeurIPS Workshop on Machine Learning for the Developing World.

Hooper, C., Kim, S., Mohammadzadeh, H., Mahoney, M. W., Shao, Y. S., Keutzer, K., and Gholami, A. (2024). Kvquant: Towards 10 million context length llm inference with kv cache quantization.

Kim, S., Hooper, C., Gholami, A., Dong, Z., Li, X., Shen, S., Mahoney, M. W., and Keutzer, K. (2024). Squeezellm: Dense-and-sparse quantization.

Kurtic, E., Frantar, E., and Alistarh, D. (2023). Ziplm: Inference-aware structured pruning of language models.

Lee, C., Jin, J., Kim, T., Kim, H., and Park, E. (2024). Owq: Outlier-aware weight quantization for efficient fine-tuning and inference of large language models.

Li, P., Yang, J., Wierman, A., and Ren, S. (2024). Towards environmentally equitable ai via geographical load balancing.

Li, Y., Yu, Y., Liang, C., He, P., Karampatziakis, N., Chen, W., and Zhao, T. (2023). Loftq: Lora-fine-tuning-aware quantization for large language models.

Lin, J., Tang, J., Tang, H., Yang, S., Chen, W.-M., Wang, W.-C., Xiao, G., Dang, X., Gan, C., and Han, S. (2024a). Awq: Activation-aware weight quantization for llm compression and acceleration.

Lin, Y., Tang, H., Yang, S., Zhang, Z., Xiao, G., Gan, C., and Han, S. (2024b). Qserve: W4a8kv4 quantization and system co-design for efficient llm serving.

Liu, A., Liu, J., Pan, Z., He, Y., Haffari, G., and Zhuang, B. (2024a). Minicache: Kv cache compression in depth dimension for large language models.

Liu, Z., Oguz, B., Zhao, C., Chang, E., Stock, P., Mehdad, Y., Shi, Y., Krishnamoorthi, R., and Chandra, V. (2023). Llm-qat: Data-free quantization aware training for large language models.

Liu, Z., Zhao, C., Fedorov, I., Soran, B., Choudhary, D., Krishnamoorthi, R., Chandra, V., Tian, Y., and Blankevoort, T. (2024b). Spinquant: Llm quantization with learned rotations.

Ma, X., Fang, G., and Wang, X. (2023). Llm-pruner: On the structural pruning of large language models.

Methley, A. M., Campbell, S., Chew-Graham, C., McNally, R., and Cheraghi-Sohi, S. (2014). Pico, picos and spider: A comparison study of specificity and sensitivity in three search tools for qualitative systematic reviews. BMC Health Services Research, 14(1):579.

Rasheed, F., Karim, A., et al. (2023). A survey on large language models: Applications, challenges, limitations, and future directions. arXiv preprint arXiv:2307.10169.

Sanderson, C., Schleiger, E., Douglas, D., Kuhnert, P., and Lu, Q. (2024). Resolving ethics trade-offs in implementing responsible ai. In 2024 IEEE Conference on Artificial Intelligence (CAI), pages 1208–1213.

Schwartz, R., Dodge, J., Smith, N. A., and Etzioni, O. (2020). Green ai. Communications of the ACM, 63(12):54–63.

Shao, W., Chen, M., Zhang, Z., Xu, P., Zhao, L., Li, Z., Zhang, K., Gao, P., Qiao, Y., and Luo, P. (2024). Omniquant: Omnidirectionally calibrated quantization for large language models.

Shen, H., Mellempudi, N., He, X., Gao, Q., Wang, C., and Wang, M. (2024). Efficient post-training quantization with fp8 formats.

Siqueira de Cerqueira, J. A., Acco Tives, H., and Dias Canedo, E. (2021). Ethical guidelines and principles in the context of artificial intelligence. In Proceedings of the XVII Brazilian Symposium on Information Systems, SBSI ’21, New York, NY, USA. Association for Computing Machinery.

Strubell, E., Ganesh, A., and McCallum, A. (2019). Energy and policy considerations for deep learning in nlp. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL), pages 3645–3650.

Sun, M., Liu, Z., Bair, A., and Kolter, J. Z. (2024a). A simple and effective pruning approach for large language models.

Sun, Y., Liu, R., Bai, H., Bao, H., Zhao, K., Li, Y., Hu, J., Yu, X., Hou, L., Yuan, C., Jiang, X., Liu, W., and Yao, J. (2024b). Flatquant: Flatness matters for llm quantization.

Touvron, H., Lavril, T., Izacard, G., et al. (2023). Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.

Tseng, A., Chee, J., Sun, Q., Kuleshov, V., and Sa, C. D. (2024). Quip#: Even better llm quantization with hadamard incoherence and lattice codebooks.

van Wynsberghe, A. (2021). Ai for sustainability and the sustainability of ai. AI and Ethics, 1(3):213–218.

Vaswani, A., Shazeer, N., Parmar, N., et al. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30.

Xia, M., Gao, T., Zeng, Z., and Chen, D. (2024). Sheared llama: Accelerating language model pre-training via structured pruning.

Xiao, G., Lin, J., Seznec, M., Wu, H., Demouth, J., and Han, S. (2024). Smoothquant: Accurate and efficient post-training quantization for large language models.

Xu, Z., Wang, H., Zhang, L., et al. (2024). Exploring the robustness of compressed large language models. arXiv preprint arXiv:2402.07895.

Yao, Z., Aminabadi, R. Y., Zhang, M., Wu, X., Li, C., and He, Y. (2022). Zeroquant: Efficient and affordable post-training quantization for large-scale transformers.

Yuan, Z., Niu, L., Liu, J., Liu, W., Wang, X., Shang, Y., Sun, G., Wu, Q., Wu, J., and Wu, B. (2023). Rptq: Reorder-based post-training quantization for large language models.

Liu, Z., Yuan, J., Jin, H., Zhong, S., Xu, Z., Braverman, V., Chen, B., and Hu, X. (2023). Kivi: Plug-and-play 2bit kv cache quantization with streaming asymmetric quantization.

Published
25/05/2026

MESQUITA, Lair Anderson de P.; OLIVEIRA, Saulo Anderson F. de. A Systematic Review on Language Model Compression: Perspectives for Efficiency and Sustainability in Information Systems. In: SIMPÓSIO BRASILEIRO DE SISTEMAS DE INFORMAÇÃO (SBSI), 22., 2026, Vitória/ES. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2026. p. 1083-1102. DOI: https://doi.org/10.5753/sbsi.2026.248723.