Comparative study of the use of LLMs for Red Team activities on LLMs
Abstract
With the rising use of LLMs, driven by models with ever more parameters and greater context understanding, this work presents a comparison of the tools available for vulnerability detection in the scope of Red Teaming for LLMs. A systematic review of the topic was carried out, comparing state-of-the-art algorithms in the area against new proposals that use LLMs adversarially against other LLMs; jailbreaks, an emerging research sub-area, are also reviewed in this article.