COLD-Attack: Jailbreaking LLMs with Stealthiness and Controllability

AI-generated keywords: Jailbreaks; Large Language Models (LLMs); Controllable Attacks; COLD-Attack Framework; Adversarial LLM Attacks

AI-generated Key Points

The license of the paper does not allow us to build upon its content and the key points are generated using the paper metadata rather than the full article.

  • Growing focus on jailbreaks on Large Language Models (LLMs)
  • Importance of considering jailbreaks with various attributes such as contextual coherence and sentiment/stylistic variations
  • Study of controllable jailbreaking is highly beneficial
  • Formal formulation of the problem of generating controllable attacks and its connection to controllable text generation in natural language processing
  • Adaptation of Energy-based Constrained Decoding with Langevin Dynamics (COLD) algorithm for controllable text generation
  • Introduction of COLD-Attack framework that automates and unifies the search for adversarial LLM attacks while considering control requirements such as fluency, stealthiness, sentiment, and left-right coherence
  • Extensive experiments conducted on different LLMs demonstrating broad applicability of the framework
  • Strong controllability, high success rates, and transferability across different models observed in generating attacks using the framework
  • Contribution to understanding and mitigating risks associated with jailbreaks on LLMs
  • Improved control over attack generation while considering diverse attributes of contextual coherence and sentiment/stylistic variations
  • Availability of code for easy adoption and implementation

Authors: Xingang Guo, Fangxu Yu, Huan Zhang, Lianhui Qin, Bin Hu

Abstract: Jailbreaks on Large language models (LLMs) have recently received increasing attention. For a comprehensive assessment of LLM safety, it is essential to consider jailbreaks with diverse attributes, such as contextual coherence and sentiment/stylistic variations, and hence it is beneficial to study controllable jailbreaking, i.e. how to enforce control on LLM attacks. In this paper, we formally formulate the controllable attack generation problem, and build a novel connection between this problem and controllable text generation, a well-explored topic of natural language processing. Based on this connection, we adapt the Energy-based Constrained Decoding with Langevin Dynamics (COLD), a state-of-the-art, highly efficient algorithm in controllable text generation, and introduce the COLD-Attack framework which unifies and automates the search of adversarial LLM attacks under a variety of control requirements such as fluency, stealthiness, sentiment, and left-right-coherence. The controllability enabled by COLD-Attack leads to diverse new jailbreak scenarios which not only cover the standard setting of generating fluent suffix attacks, but also allow us to address new controllable attack settings such as revising a user query adversarially with minimal paraphrasing, and inserting stealthy attacks in context with left-right-coherence. Our extensive experiments on various LLMs (Llama-2, Mistral, Vicuna, Guanaco, GPT-3.5) show COLD-Attack's broad applicability, strong controllability, high success rate, and attack transferability. Our code is available at https://github.com/Yu-Fangxu/COLD-Attack.
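The abstract names Langevin dynamics over an energy function as the core of COLD but does not spell out the update. As a generic illustration only, the sketch below runs noisy gradient descent on a toy energy that is a weighted sum of quadratic terms, one term per "constraint". It is not the authors' implementation: the function names, the quadratic energies, the step size, and the noise scale are all assumptions made for this example, and nothing here operates on text or on a language model.

```python
import random


def make_weighted_quadratic_grad(targets, weights):
    """Gradient of the toy energy E(y) = sum_k w_k * 0.5 * ||y - t_k||^2.

    Each (target, weight) pair stands in for one soft constraint
    (hypothetical; real constraint energies would be far more complex).
    """
    def grad(y):
        return [
            sum(w * (yi - t[i]) for t, w in zip(targets, weights))
            for i, yi in enumerate(y)
        ]
    return grad


def langevin_sample(grad_energy, y0, step_size=0.05, noise_scale=0.0,
                    n_steps=500, seed=0):
    """Iterate y <- y - step_size * grad E(y) + noise_scale * N(0, 1)."""
    rng = random.Random(seed)  # seeded so runs are reproducible
    y = list(y0)
    for _ in range(n_steps):
        g = grad_energy(y)
        y = [
            yi - step_size * gi + noise_scale * rng.gauss(0.0, 1.0)
            for yi, gi in zip(y, g)
        ]
    return y


# Two toy constraints pulling toward different points, the second weighted 3x.
grad = make_weighted_quadratic_grad(targets=[[0.0, 0.0], [2.0, 4.0]],
                                    weights=[1.0, 3.0])

# Without noise the iterate settles at the weighted mean of the targets.
noiseless = langevin_sample(grad, y0=[10.0, -10.0])
print([round(v, 3) for v in noiseless])  # [1.5, 3.0]

# With noise the iterate fluctuates around that point instead.
noisy = langevin_sample(grad, y0=[10.0, -10.0], noise_scale=0.05)
print([round(v, 3) for v in noisy])
```

Changing the weights shifts where the samples concentrate, which is the sense in which a weighted energy lets several requirements be traded off against each other in a single search.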

Submitted to arXiv on 13 Feb. 2024

Results of the summarizing process for the arXiv paper: 2402.08679v1

This paper's license does not allow us to build upon its content, so the summary below was generated from the paper's metadata rather than the full article.

Jailbreaks on Large Language Models (LLMs) have received growing attention. A comprehensive evaluation of LLM safety needs to consider jailbreaks with varied attributes, such as contextual coherence and sentiment or stylistic variations, which makes controllable jailbreaking worth studying.

The authors formally formulate the problem of generating controllable attacks and establish a novel connection between this problem and controllable text generation in natural language processing. They adapt Energy-based Constrained Decoding with Langevin Dynamics (COLD), an advanced and efficient algorithm for controllable text generation. The resulting COLD-Attack framework, introduced in the paper "COLD-Attack: Jailbreaking LLMs with Stealthiness and Controllability," automates and unifies the search for adversarial LLM attacks under control requirements such as fluency, stealthiness, sentiment, and left-right coherence.

The controllability enabled by COLD-Attack leads to new jailbreak scenarios beyond generating fluent suffix attacks. It allows for new controllable attack settings, such as adversarially revising a user query with minimal paraphrasing or inserting stealthy attacks in context with left-right coherence.

Extensive experiments on different LLMs, including Llama-2, Mistral, Vicuna, Guanaco, and GPT-3.5, demonstrate the broad applicability of COLD-Attack. The framework exhibits strong controllability, high success rates in generating attacks, and transferability across different models. The proposed framework offers improved control over attack generation across attributes such as contextual coherence and sentiment or stylistic variations. Overall, the paper contributes to understanding and mitigating the risks associated with jailbreaks on LLMs. The availability of code further facilitates the adoption and implementation of the framework.
Created on 14 Feb. 2024
