On the Exploitability of Instruction Tuning
AI-generated Key Points
⚠The license of the paper does not allow us to build upon its content and the key points are generated using the paper metadata rather than the full article.
- The paper explores the concept of instruction tuning in language models.
- Adversaries can exploit instruction tuning by injecting specific examples into the training data to change the model's behavior.
- Content injection is one way adversaries achieve this, by injecting training examples that mention target content and elicit desired behavior from downstream models.
- The authors propose an automated data poisoning pipeline called AutoPoison to address this issue.
- AutoPoison incorporates versatile attack goals into poisoned data using an oracle LLM, allowing adversaries to achieve their desired exploitable behavior.
- Two example attacks showcased are content injection and over-refusal attacks, aiming to induce specific exploitable behaviors in the model.
- AutoPoison enables adversaries to change a model's behavior by poisoning only a small fraction of data while maintaining stealthiness in the poisoned examples.
- Data quality plays a crucial role in the behavior of instruction-tuned models and responsible deployments of LLMs.
- The research sheds light on the exploitability of instruction tuning and provides insights into how adversaries can manipulate LLMs through injected examples.
- The proposed AutoPoison pipeline offers a powerful tool for achieving these manipulations while maintaining stealthiness.
Authors: Manli Shu, Jiongxiao Wang, Chen Zhu, Jonas Geiping, Chaowei Xiao, Tom Goldstein
Abstract: Instruction tuning is an effective technique to align large language models (LLMs) with human intents. In this work, we investigate how an adversary can exploit instruction tuning by injecting specific instruction-following examples into the training data that intentionally changes the model's behavior. For example, an adversary can achieve content injection by injecting training examples that mention target content and eliciting such behavior from downstream models. To achieve this goal, we propose \textit{AutoPoison}, an automated data poisoning pipeline. It naturally and coherently incorporates versatile attack goals into poisoned data with the help of an oracle LLM. We showcase two example attacks: content injection and over-refusal attacks, each aiming to induce a specific exploitable behavior. We quantify and benchmark the strength and the stealthiness of our data poisoning scheme. Our results show that AutoPoison allows an adversary to change a model's behavior by poisoning only a small fraction of data while maintaining a high level of stealthiness in the poisoned examples. We hope our work sheds light on how data quality affects the behavior of instruction-tuned models and raises awareness of the importance of data quality for responsible deployments of LLMs. Code is available at \url{https://github.com/azshue/AutoPoison}.
Ask questions about this paper to our AI assistant
You can also chat with multiple papers at once here.
⚠The license of the paper does not allow us to build upon its content and the AI assistant only knows about the paper metadata rather than the full article.
Assess the quality of the AI-generated content by voting
Score: 0
Why do we need votes?
Votes are used to determine whether we need to re-run our summarizing tools. If the count reaches -10, our tools can be restarted.
The previous summary was created more than a year ago and can be re-run (if necessary) by clicking on the Run button below.
⚠The license of this specific paper does not allow us to build upon its content and the summarizing tools will be run using the paper metadata rather than the full article. However, it still does a good job, and you can also try our tools on papers with more open licenses.
Similar papers summarized with our AI tools
Navigate through even more similar papers through a
tree representationLook for similar papers (in beta version)
By clicking on the button above, our algorithm will scan all papers in our database to find the closest based on the contents of the full papers and not just on metadata. Please note that it only works for papers that we have generated summaries for and you can rerun it from time to time to get a more accurate result while our database grows.
Disclaimer: The AI-based summarization tool and virtual assistant provided on this website may not always provide accurate and complete summaries or responses. We encourage you to carefully review and evaluate the generated content to ensure its quality and relevance to your needs.