Deploying Language Models on Android-Based Edge Devices: A Practical Evaluation Pipeline

Suayder Costa; Igor Lima; William Harada; Mateus Lucena; Arthur Alves; Myke Valadão; Cassio Alves; Ruan Belem; Agemilson Pimentel; Romulo Fabricio; Alexandre Miranda; Daniel Lins; Frederico Gonçalves; Sidney Leal

doi:10.5753/sbcup.2026.22352

Suayder Costa, Venturus
Igor Lima, Venturus
William Harada, Venturus
Mateus Lucena, Venturus
Arthur Alves, Venturus
Myke Valadão, Venturus
Cassio Alves, Venturus
Ruan Belem, TPV Technology
Agemilson Pimentel, TPV Technology
Romulo Fabricio, TPV Technology
Alexandre Miranda, Paulo Feitoza Foundation
Daniel Lins, Venturus
Frederico Gonçalves, Venturus
Sidney Leal, Venturus


Abstract

This paper presents a practical evaluation pipeline for deploying small language models, compact versions of large language models designed to run with lower memory and compute requirements, on Android-based edge devices, using an Android TV as a representative case. The study investigates both deployment feasibility and software-level acceleration strategies, such as the choice of the inference engine that runs the model on the device, under severe memory and processing constraints. The results show that most models above 500 million parameters were not suitable for the evaluated environment, whereas a subset of 4-bit quantized models showed stable execution and acceptable response quality. In addition, the experiments demonstrate that the choice of inference engine has a strong impact on performance, with MNN significantly outperforming llama.cpp on ARM devices (the processor family that predominates in mobile and embedded hardware). These findings offer practical guidance for integrating generative AI into resource-constrained consumer hardware.
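To make the memory constraint concrete, the following is a minimal back-of-the-envelope sketch, not the evaluation pipeline from the paper. It assumes that weight storage is roughly the parameter count times the bits per weight, and it ignores the KV cache, activations, runtime overhead, and the per-block scale factors that real 4-bit formats add. The function name `weight_footprint_mib` and the chosen bit widths are illustrative assumptions.

```python
def weight_footprint_mib(num_params: int, bits_per_weight: float) -> float:
    """Approximate storage for the model weights only, in MiB.

    Assumption: every parameter is stored at the same bit width and no
    quantization metadata (scales, zero points) is counted.
    """
    total_bytes = num_params * bits_per_weight / 8
    return total_bytes / (1024 ** 2)


if __name__ == "__main__":
    # 500 million parameters is the threshold mentioned in the abstract.
    params = 500_000_000
    for label, bits in (("16-bit", 16), ("8-bit", 8), ("4-bit", 4)):
        size = weight_footprint_mib(params, bits)
        print(f"{label}: about {size:.0f} MiB of weights")
```

Under these assumptions, a 500-million-parameter model needs roughly 954 MiB for 16-bit weights and roughly 238 MiB at 4 bits, which illustrates why 4-bit quantization can matter on a device with tight memory limits. Actual footprints depend on the engine and the quantization format.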
