Text Message Routing System for Chat-Based Applications
Abstract
Research Context: The integration of Large Language Models (LLMs) into banking customer service promises enhanced engagement but faces a trilemma of prohibitive operational costs, stochastic inaccuracy (hallucinations), and strict safety requirements. In the Brazilian financial sector, where inaccuracies can lead to direct financial harm, relying solely on monolithic LLMs is economically unsustainable and operationally risky.

Practical Problem: Financial institutions struggle with the "inference cost trap," in which token-based pricing scales linearly with usage, and with the risk of hallucinations, where models generate plausible but incorrect responses to unfamiliar queries.

Proposed Solution: This study validates a semantic routing architecture that functions as an intelligent decision layer, classifying user intent into four categories (Relevant, Unrelated, Chitchat, Spam) to route queries to cost-effective handlers. Five classification paradigms were benchmarked: Zero-shot GPT, Few-shot GPT, QLoRA fine-tuned GPT, Embedding Similarity Search, and BERT-based neural models.

Related IS Theory: The research builds upon Task-Technology Fit (TTF) theory and contributes to the Green IS agenda by validating energy-efficient architectures.

Research Method: A balanced dataset of 6,760 messages, curated from SMS spam data, SQuAD questions, private chats, and banking FAQs, was used to evaluate accuracy, resource usage, and cost.

Summary of Results: Embedding Similarity (98.52%) and BERT-based models (98.96%) achieved substantially higher accuracy than Zero-shot GPT (53.78%) and matched the performance of Fine-tuned GPT (97.93%). Crucially, the embedding approach reduced operational costs by 96% (from US$18.10 to US$0.32 per million requests) and cut memory consumption from 6.52 GB to 1.04 GB.

Contributions and Impact to IS Area: The study contributes a validated pattern for Frugal AI in Portuguese, demonstrating that open-source embedding models can effectively govern LLMs. This approach mitigates hallucination risks, reduces dependency on foreign APIs, and aligns with sustainable computing principles.
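As a concrete illustration of the routing idea summarized above, the sketch below shows how an embedding-similarity router over the four intent categories could be assembled. The embedding model name, the example utterances, the similarity threshold, and the nearest-centroid decision rule are illustrative assumptions, not the configuration evaluated in the paper.

    # Minimal sketch of an embedding-similarity intent router (illustrative, not the authors' implementation).
    import numpy as np
    from sentence_transformers import SentenceTransformer

    # Hypothetical reference utterances per route; a real deployment would use curated banking data.
    ROUTES = {
        "relevant":  ["How do I block my credit card?", "What is my account balance?"],
        "unrelated": ["Who won the football match yesterday?", "What is the capital of France?"],
        "chitchat":  ["Hi, how are you today?", "Thanks, you were very helpful!"],
        "spam":      ["Click here to claim your free prize now!!!", "You have been selected, reply to win."],
    }

    # Assumed multilingual encoder; the paper's actual embedding model is not specified in the abstract.
    model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

    # Pre-compute one L2-normalized centroid per route from its example utterances.
    centroids = {}
    for route, examples in ROUTES.items():
        emb = model.encode(examples, convert_to_numpy=True, normalize_embeddings=True)
        c = emb.mean(axis=0)
        centroids[route] = c / np.linalg.norm(c)

    def route_message(text: str, threshold: float = 0.45) -> str:
        """Return the route whose centroid is most similar to the message embedding."""
        q = model.encode([text], convert_to_numpy=True, normalize_embeddings=True)[0]
        scores = {route: float(q @ c) for route, c in centroids.items()}
        best_route, best_score = max(scores.items(), key=lambda kv: kv[1])
        # Below the (illustrative) threshold, fall back to the safe out-of-scope route.
        return best_route if best_score >= threshold else "unrelated"

    print(route_message("Quero saber o limite do meu cartão"))  # expected: relevant

In a production setting, only messages routed as relevant would reach the costlier LLM or retrieval pipeline, while the other routes receive cheap canned handling; when the set of reference utterances grows, the per-route centroids can be replaced by an approximate nearest-neighbor index such as HNSW.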
References

Agrawal, A., Kedia, N., Panwar, A., Mohan, J., Kwatra, N., Gulavani, B. S., Tumanov, A., and Ramjee, R. (2024). Taming throughput-latency tradeoff in LLM inference with Sarathi-Serve. arXiv preprint arXiv:2403.02310.
Azharudeen, M. (2024). Beyond basic chatbots: How semantic router is changing the game. [link]. Accessed: June 26, 2024.
Casanueva, I., Temčinas, T., Gerz, D., Henderson, M., and Vulić, I. (2020). Efficient intent detection with dual sentence encoders. In Proceedings of the 2nd Workshop on Natural Language Processing for Conversational AI, pages 38–45.
Chen, L., Zaharia, M., and Zou, J. (2023). FrugalGPT: How to use large language models while reducing cost and improving performance. In Advances in Neural Information Processing Systems (NeurIPS).
Cunningham, P. and Delany, S. J. (2021). k-nearest neighbour classifiers: A tutorial. ACM Computing Surveys, 54(6):1–25.
Dettmers, T., Pagnoni, A., Holtzman, A., and Zettlemoyer, L. (2023). QLoRA: Efficient finetuning of quantized LLMs. Advances in Neural Information Processing Systems (NeurIPS), 36.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL).
Frantar, E., Ashkboos, S., Hoefler, T., and Alistarh, D. (2023). GPTQ: Accurate post-training quantization for generative pre-trained transformers. In International Conference on Learning Representations (ICLR).
Fuster, A., Goldsmith-Pinkham, P., Ramadorai, T., and Walther, A. (2022). Predictably unequal? The effects of machine learning on credit markets. The Journal of Finance, 77(1):5–47. Discusses how algorithmic maximization can lead to distributional shifts that disadvantage vulnerable groups.
Gasparetto, A., Marcuzzo, M., Zangari, A., and Albarelli, A. (2022). A survey on text classification algorithms: From text to predictions. Information, 13(2):83.
Gonçalves, A. et al. (2025). Accessibility in banking chatbots: An analysis of portuguese as a second language. In Brazilian Symposium on Information Systems (SBSI).
Han, Z., Gao, C., Liu, J., Zhang, J., and Zhang, S. Q. (2024). Parameter-efficient fine-tuning for large models: A comprehensive survey. arXiv preprint arXiv:2403.14608.
He, J. and Zhai, J. (2024). FastDecode: High-throughput GPU-efficient LLM serving using heterogeneous pipelines. arXiv preprint arXiv:2403.11421.
Horsey, J. (2024). Semantic router superfast decision layer for LLMs and AI agents. [link]. Accessed: June 26, 2024.
Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., Ishii, E., Bang, Y. J., Madotto, A., and Fung, P. (2023). Survey of hallucination in natural language generation. ACM Computing Surveys, 55(12):1–38.
Kalai, A. T., Nachum, O., Vempala, S. S., and Zhang, E. (2025). Why language models hallucinate. arXiv preprint arXiv:2509.04664.
Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C. H., Gonzalez, J. E., Zhang, H., and Stoica, I. (2023). Efficient memory management for large language model serving with PagedAttention. arXiv preprint arXiv:2309.06180.
Lin, J., Tang, J., Tang, H., Yang, S., Chen, W.-M., Wang, W.-C., Xiao, G., Dang, X., Gan, C., and Han, S. (2023). AWQ: Activation-aware weight quantization for LLM compression and acceleration. arXiv preprint arXiv:2306.00978.
Lindsey, J., Gurnee, W., Ameisen, E., Chen, B., Pearce, A., Turner, N. L., Citro, C., Abrahams, D., Carter, S., Hosmer, B., et al. (2025). On the biology of a large language model. Transformer Circuits Thread.
Malkov, Y. A. and Yashunin, D. A. (2018). Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(4):824–836.
Manias, D. M., Chouman, A., and Shami, A. (2024a). Semantic routing for enhanced performance of LLM-assisted intent-based 5G core network management and orchestration. arXiv preprint arXiv:2404.15869.
Manias, D. M., Chouman, A., and Shami, A. (2024b). Towards intent-based network management: Large language models for intent extraction in 5G core networks. arXiv preprint arXiv:2403.02238.
Manik, L. P., Akbar, Z., Mustika, H. F., Indrawati, A., Rini, D. S., Fefirenta, A. D., and Djarwaningsih, T. (2021). Out-of-scope intent detection on a knowledge-based chatbot. International Journal of Intelligent Engineering and Systems, 14(5).
Morris, J. X., Kuleshov, V., Shmatikov, V., and Rush, A. M. (2023). Text embeddings reveal (almost) as much as text. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL).
Mussmann, S. and Ermon, S. (2016). Learning and inference via maximum inner product search. In Proceedings of The 33rd International Conference on Machine Learning (ICML), volume 48, pages 2587–2596.
Ong, I., Almahairi, A., Wu, V., Chen, W.-L., et al. (2024). RouteLLM: Learning to route LLMs with preference data. arXiv preprint arXiv:2406.18665.
Pires, R., Abonizio, H., Almeida, T. S., and Nogueira, R. (2023). Sabiá-65B: A large language model for Portuguese. arXiv preprint arXiv:2312.11991.
Souza, F., Nogueira, R., and Lotufo, R. (2020). BERTimbau: Pre-trained BERT models for Brazilian Portuguese. In Brazilian Conference on Intelligent Systems (BRACIS), pages 403–417. Springer.
van der Heijden, N., van der Linde, J., Vossen, P., and Shutova, E. (2025). How much do LLMs hallucinate across languages? On multilingual estimation of LLM hallucination in the wild. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP). Finds that smaller LLMs exhibit significantly larger hallucination rates than larger models.
Verdecchia, R., Sallou, J., and Cruz, L. (2023). Green AI: A systematic literature review. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery.
Wang, Z., Pang, Y., and Lin, Y. (2023). Large language models are zero-shot text classifiers. arXiv preprint arXiv:2312.01044.
Wei, J., Yang, C., Song, X., Lu, Y., Hu, N., Tran, D., Peng, D., Liu, R., Huang, D., Du, C., et al. (2024). Long-form factuality in large language models. arXiv preprint arXiv:2403.18802.
Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., Davison, J., Shleifer, S., von Platen, P., Ma, C., Jernite, Y., Plu, J., Xu, C., Scao, T. L., Gugger, S., Drame, M., Lhoest, Q., and Rush, A. M. (2020). Huggingface’s transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45.
Women’s World Banking (2021). Algorithmic bias, financial inclusion, and gender. [link]. Accessed: January 2, 2025.
Published
May 25, 2026
How to Cite
ANGELO, Breno U. de; ZANETTI, Guilherme G.; SOUZA, Alberto F. De; BADUE, Claudine; JACOBSEN, Abner G.; OLIVEIRA-SANTOS, Thiago. Text Message Routing System for Chat-Based Applications. In: SIMPÓSIO BRASILEIRO DE SISTEMAS DE INFORMAÇÃO (SBSI), 22., 2026, Vitória/ES. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2026. p. 674-692. DOI: https://doi.org/10.5753/sbsi.2026.248585.
