Leveraging Large Language Models for Anomaly Detection in Microservices Architectures

Diego Frazatto Pedroso; Luís Almeida; William Akihiro Alves Aisawa; Inês Dutra; Sarita Mazzini Bruschi

Diego Frazatto Pedroso USP
Luís Almeida University of Porto
William Akihiro Alves Aisawa USP
Inês Dutra University of Porto
Sarita Mazzini Bruschi USP

Abstract

Cloud computing has become a key enabler of scalable and high-performance applications, allowing systems to be deployed rapidly. At the same time, the increasing sophistication of cloud-native environments brings new challenges related to system dependability. Ensuring resilience under such conditions is a fundamental responsibility of IT providers, who must safeguard service continuity and operational stability. The widespread use of microservice-based designs has created an ecosystem with a growing number of interacting components, including frameworks, application layers, hypervisors, and orchestration platforms. This distributed and layered environment produces a massive volume of log data originating from heterogeneous sources. Without automated support, extracting useful insights from these logs becomes a highly complex task. One promising direction to mitigate this challenge is the use of Machine Learning, particularly methods grounded in Large Language Models (LLMs), which can dynamically detect recurring structures and anomalies in event streams. Building on this idea, our work introduces an anomaly detection framework deployed within a microservices environment running on Kubernetes with Istio. The framework integrates an LLM trained on a diverse set of fault scenarios. To create these scenarios, we relied on Chaos Mesh for fault injection and Locust for workload stress testing. The evaluation confirmed that the model achieved high accuracy in identifying anomalies. It consistently detected all injected faults, although a small number of false positives were observed. Importantly, these false alarms remained at acceptable levels, highlighting the approach’s practical applicability.

Keywords: Cloud computing, Virtual machine monitors, Large language models, Microservice architectures, Computer architecture, Stability analysis, Anomaly detection, Stress, Testing, Resilience, LLM, Microservices, AWS
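
The evaluation summary above (every injected fault detected, with a small number of false positives) corresponds to a recall of 1.0 and a non-zero false-positive rate. The sketch below shows how these figures could be computed from a confusion matrix. The class name, function names, and all counts are hypothetical placeholders chosen for illustration; they are not results or code from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Outcome counts for a binary anomaly detector (hypothetical helper)."""
    true_positives: int   # injected faults flagged as anomalies
    false_positives: int  # normal windows flagged as anomalies
    true_negatives: int   # normal windows left unflagged
    false_negatives: int  # injected faults that were missed


def recall(c: ConfusionCounts) -> float:
    """Fraction of injected faults that were detected."""
    faults = c.true_positives + c.false_negatives
    return c.true_positives / faults if faults else 0.0


def precision(c: ConfusionCounts) -> float:
    """Fraction of raised alarms that were real faults."""
    alarms = c.true_positives + c.false_positives
    return c.true_positives / alarms if alarms else 0.0


def false_positive_rate(c: ConfusionCounts) -> float:
    """Fraction of normal windows that triggered a false alarm."""
    normal = c.false_positives + c.true_negatives
    return c.false_positives / normal if normal else 0.0


# Illustrative counts only (assumed, not taken from the paper):
# no fault is missed, and a few normal windows are flagged.
example = ConfusionCounts(
    true_positives=40,
    false_positives=5,
    true_negatives=155,
    false_negatives=0,
)

if __name__ == "__main__":
    print(f"recall={recall(example):.3f}")
    print(f"precision={precision(example):.3f}")
    print(f"false_positive_rate={false_positive_rate(example):.3f}")
```

Reporting recall alongside the false-positive rate, rather than accuracy alone, separates "no fault was missed" from "how often operators would see a false alarm", which is the distinction the abstract draws.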