Exploiting Latent Space Discontinuities for Building Universal LLM Jailbreaks and Data Extraction Attacks
Abstract
The rapid proliferation of Large Language Models (LLMs) has raised significant concerns about their security against adversarial attacks. In this work, we propose a novel approach to crafting universal jailbreaks and data extraction attacks by exploiting latent space discontinuities, an architectural vulnerability related to the sparsity of training data. Unlike previous methods, our technique generalizes across models and interfaces, proving highly effective against seven state-of-the-art LLMs and one image generation model. Initial results indicate that exploiting these discontinuities can consistently and profoundly compromise model behavior, even in the presence of layered defenses. These findings suggest that the strategy has substantial potential as a systemic attack vector. Disclaimer: This paper contains examples of harmful and offensive language. Reader discretion is advised. Additional supporting materials may be provided upon formal request and are subject to the signing of a liability and ethical use agreement.
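The abstract does not explain how such sparse regions of the latent space are located or exploited. Purely as an illustrative sketch (not the authors' method), the Python snippet below scores candidate prompts by how far their mean-pooled hidden-state embeddings fall from a small reference corpus, using an open model (gpt2) and a nearest-neighbor distance as a crude proxy for training-data sparsity; the model choice, pooling scheme, reference texts, and density heuristic are all assumptions made for illustration.

# Illustrative sketch only: flags prompts whose latent representations fall in
# sparsely covered regions, as a loose proxy for the "latent space discontinuities"
# named in the abstract. Model, pooling, and the k-NN heuristic are assumptions.
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.neighbors import NearestNeighbors

MODEL_NAME = "gpt2"  # stand-in open model, not one of the paper's targets

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
model.eval()

def embed(texts):
    # Mean-pool the final hidden states into a single vector per text.
    vecs = []
    for text in texts:
        inputs = tokenizer(text, return_tensors="pt", truncation=True)
        with torch.no_grad():
            hidden = model(**inputs).last_hidden_state  # (1, seq_len, dim)
        vecs.append(hidden.mean(dim=1).squeeze(0))
    return torch.stack(vecs).numpy()

# Reference texts standing in for well-covered ("dense") regions of training data.
reference = [
    "Write a short poem about the ocean.",
    "Summarize the plot of a famous novel.",
    "Explain how photosynthesis works in simple terms.",
]
# Candidate prompts to score; the second is an intentionally unnatural string.
candidates = [
    "Explain how photosynthesis works in simple terms.",
    "zxq// ~~ reverse-token payload :: {system-override}",
]

ref_vecs = embed(reference)
cand_vecs = embed(candidates)

# Distance to the nearest reference embedding as a crude density proxy:
# larger distance = sparser neighborhood = closer to a potential discontinuity.
knn = NearestNeighbors(n_neighbors=1).fit(ref_vecs)
dists, _ = knn.kneighbors(cand_vecs)
for text, dist in zip(candidates, dists[:, 0]):
    print(f"{dist:8.2f}  {text}")

Under this heuristic, prompts with unusually large distances would merely be candidates for further probing; reproducing the attack itself would require the construction described in the full paper.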
Published
2025-09-01
How to Cite
PAIM, Kayuã Oleques; MANSILHA, Rodrigo Brandão; KREUTZ, Diego; FRANCO, Muriel Figueredo; CORDEIRO, Weverton. Exploiting Latent Space Discontinuities for Building Universal LLM Jailbreaks and Data Extraction Attacks. In: BRAZILIAN SYMPOSIUM ON CYBERSECURITY (SBSEG), 25., 2025, Foz do Iguaçu/PR. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 417-431. DOI: https://doi.org/10.5753/sbseg.2025.11448.
