On the Exploitability of Instruction Tuning

AI-generated keywords: Instruction Tuning, Exploitability, Data Poisoning, LLMs, AutoPoison

AI-generated Key Points

The license of the paper does not allow us to build upon its content and the key points are generated using the paper metadata rather than the full article.

  • The paper explores the concept of instruction tuning in language models.
  • Adversaries can exploit instruction tuning by injecting specific examples into the training data to change the model's behavior.
  • One example is content injection: the adversary adds training examples that mention target content, so that downstream models reproduce that behavior.
  • To carry out such attacks, the authors propose an automated data poisoning pipeline called AutoPoison.
  • AutoPoison incorporates versatile attack goals into poisoned data using an oracle LLM, allowing adversaries to achieve their desired exploitable behavior.
  • Two example attacks showcased are content injection and over-refusal attacks, aiming to induce specific exploitable behaviors in the model.
  • AutoPoison enables adversaries to change a model's behavior by poisoning only a small fraction of data while maintaining stealthiness in the poisoned examples.
  • Data quality plays a crucial role in the behavior of instruction-tuned models and responsible deployments of LLMs.
  • The research sheds light on the exploitability of instruction tuning and provides insights into how adversaries can manipulate LLMs through injected examples.
  • The proposed AutoPoison pipeline offers a powerful tool for achieving these manipulations while maintaining stealthiness.
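
The data flow described in the key points above can be illustrated with a toy sketch. It is not the authors' implementation: this page is generated from the paper metadata only, so the step of prepending an adversarial context to the instruction before querying the oracle is an assumption, and `stub_oracle`, `poison_dataset`, `adversarial_context` and `poison_fraction` are hypothetical names. The oracle here is a canned stub rather than a real LLM.

```python
import random
from typing import Callable, Dict, List

# One instruction-tuning example: {"instruction": ..., "response": ...}
Example = Dict[str, str]


def stub_oracle(prompt: str) -> str:
    """Hypothetical stand-in for an oracle LLM; returns a canned string."""
    return f"[oracle response to: {prompt}]"


def poison_dataset(
    clean: List[Example],
    adversarial_context: str,
    oracle: Callable[[str], str],
    poison_fraction: float,
    seed: int = 0,
) -> List[Example]:
    """Return a copy of `clean` in which a small fraction of responses
    has been replaced by oracle outputs (assumed data flow, not the paper's code)."""
    rng = random.Random(seed)
    n_poison = int(len(clean) * poison_fraction)
    chosen = set(rng.sample(range(len(clean)), n_poison))

    result: List[Example] = []
    for i, example in enumerate(clean):
        if i in chosen:
            # Assumption: the oracle sees the attack goal plus the instruction,
            # while the stored example keeps the original, clean instruction.
            response = oracle(f"{adversarial_context} {example['instruction']}")
            result.append({"instruction": example["instruction"], "response": response})
        else:
            result.append(dict(example))
    return result
```

In this sketch the instructions are left untouched and only the selected responses change, which is one way the poisoned examples could stay inconspicuous. Auditing an instruction-tuning dataset for this kind of tampering would mean inspecting responses, not only prompts.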

Authors: Manli Shu, Jiongxiao Wang, Chen Zhu, Jonas Geiping, Chaowei Xiao, Tom Goldstein

19 pages, 9 figures

Abstract: Instruction tuning is an effective technique to align large language models (LLMs) with human intents. In this work, we investigate how an adversary can exploit instruction tuning by injecting specific instruction-following examples into the training data that intentionally changes the model's behavior. For example, an adversary can achieve content injection by injecting training examples that mention target content and eliciting such behavior from downstream models. To achieve this goal, we propose AutoPoison, an automated data poisoning pipeline. It naturally and coherently incorporates versatile attack goals into poisoned data with the help of an oracle LLM. We showcase two example attacks: content injection and over-refusal attacks, each aiming to induce a specific exploitable behavior. We quantify and benchmark the strength and the stealthiness of our data poisoning scheme. Our results show that AutoPoison allows an adversary to change a model's behavior by poisoning only a small fraction of data while maintaining a high level of stealthiness in the poisoned examples. We hope our work sheds light on how data quality affects the behavior of instruction-tuned models and raises awareness of the importance of data quality for responsible deployments of LLMs. Code is available at https://github.com/azshue/AutoPoison.
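
The abstract says the attack's strength is quantified but, since only metadata is available here, not how. As a purely illustrative assumption, one simple way to gauge a content injection attack would be the fraction of model responses that mention the target phrase. The function name `keyphrase_rate` and the example phrase are hypothetical, and this is not claimed to be the paper's metric.

```python
from typing import Sequence


def keyphrase_rate(responses: Sequence[str], phrase: str) -> float:
    """Fraction of responses containing `phrase`, ignoring case.

    Hypothetical strength measure for a content injection attack:
    a clean model should score near zero, a poisoned one noticeably higher.
    """
    if not responses:
        return 0.0
    needle = phrase.lower()
    hits = sum(1 for text in responses if needle in text.lower())
    return hits / len(responses)
```

Comparing this rate between a model tuned on clean data and one tuned on a suspect dataset, using the same held-out prompts, gives a rough signal of whether target content has been injected.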

Submitted to arXiv on 28 Jun. 2023


Results of the summarizing process for the arXiv paper: 2306.17194v1

The paper "On the Exploitability of Instruction Tuning" examines instruction tuning, a technique used to align large language models (LLMs) with human intents. The authors investigate how an adversary can exploit this technique by injecting specific instruction-following examples into the training data to intentionally change the model's behavior. One example is content injection, in which the adversary injects training examples that mention target content so that downstream models exhibit the same behavior.

To carry out such attacks, the authors propose AutoPoison, an automated data poisoning pipeline. It uses an oracle LLM to incorporate versatile attack goals into the poisoned data naturally and coherently. The paper showcases two example attacks, content injection and over-refusal, each aiming to induce a specific exploitable behavior in the model.

The authors quantify and benchmark the strength and stealthiness of their data poisoning scheme. The results show that AutoPoison lets an adversary change a model's behavior by poisoning only a small fraction of the data while keeping the poisoned examples highly stealthy.

The authors hope the work sheds light on how data quality affects the behavior of instruction-tuned models and raises awareness of the importance of data quality for responsible deployments of LLMs. Code for AutoPoison is available at https://github.com/azshue/AutoPoison.
Created on 05 Jul. 2023
