Language Models are Homer Simpson! Safety Re-Alignment of Fine-tuned Language Models through Task Arithmetic
AI-generated Key Points
- Authors address the challenge of compromised safety in fine-tuned language models
- Introduce RESTA method to restore safety through task arithmetic by adding a safety vector to model weights
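- Effectiveness of RESTA demonstrated in both parameter-efficient and full fine-tuning, across downstream tasks including instruction following in Chinese, English, and Hindi, and problem-solving in Code and Math
- Generalizability of RESTA shown on three existing safety evaluation benchmarks and on a new multilingual benchmark of 550 harmful questions covering 11 categories, each with 5 sub-categories of harm
- RESTA reduces the harmfulness of the compromised model from 18.6% to 5.1% in parameter-efficient fine-tuning and from 9.2% to 1.5% in full fine-tuning, while maintaining most of the model's task performance
- Source code for RESTA is available on GitHub: https://github.com/declare-lab/resta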
- RESTA also shows promising results in Chinese and Vietnamese
- Performance comparisons report task-specific scores for the base model Llama-2, SFT, and RESTA, with and without DARE (the combination of RESTA and DARE is denoted RESTAd)
- Analysis on the CATQA dataset reveals significant increases in unsafety scores for the safety-aligned Llama-2 model across different evaluations
- The work offers insight into restoring language model safety after fine-tuning using task-arithmetic methods such as RESTA
Authors: Rishabh Bhardwaj, Do Duc Anh, Soujanya Poria
Abstract: Aligned language models face a significant limitation as their fine-tuning often results in compromised safety. To tackle this, we propose a simple method RESTA that performs LLM safety realignment. RESTA stands for REstoring Safety through Task Arithmetic. At its core, it involves a simple arithmetic addition of a safety vector to the weights of the compromised model. We demonstrate the effectiveness of RESTA in both parameter-efficient and full fine-tuning, covering a wide range of downstream tasks, including instruction following in Chinese, English, and Hindi, as well as problem-solving capabilities in Code and Math. We also showcase the generalizability of RESTA on three existing safety evaluation benchmarks and a multilingual benchmark dataset proposed as a part of this work, consisting of 550 harmful questions covering 11 categories, each with 5 sub-categories of harm. Overall, RESTA decreases the harmfulness of the compromised model from 18.6% to 5.1% and from 9.2% to 1.5% in parameter-efficient and full fine-tuning, respectively, while maintaining most of the model's performance on the task. We release the source codes at: https://github.com/declare-lab/resta.
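The core step described in the abstract, adding a safety vector to the weights of the compromised model, can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that). The helper names `safety_vector` and `add_safety_vector`, the `scale` coefficient, and the toy weight values are all hypothetical. In particular, computing the safety vector as the difference between an aligned model and a safety-unaligned counterpart is an assumption made here for illustration; the abstract does not say how the vector is obtained.

```python
from typing import Dict, List

# Toy stand-in for a model checkpoint: parameter name -> flat list of weights.
Weights = Dict[str, List[float]]


def safety_vector(aligned: Weights, unaligned: Weights) -> Weights:
    """Hypothetical helper: element-wise difference (aligned - unaligned).

    ASSUMPTION: the abstract does not state how the safety vector is built;
    a weight difference is the usual task-arithmetic construction.
    """
    return {
        name: [a - u for a, u in zip(aligned[name], unaligned[name])]
        for name in aligned
    }


def add_safety_vector(compromised: Weights, vector: Weights,
                      scale: float = 1.0) -> Weights:
    """Add the (optionally scaled) safety vector to the compromised weights.

    `scale` is an illustrative knob, not a parameter taken from the abstract.
    """
    return {
        name: [w + scale * v for w, v in zip(compromised[name], vector[name])]
        for name in compromised
    }


# Made-up numbers, chosen only so the arithmetic is easy to follow.
aligned = {"layer.weight": [0.5, -1.0, 2.0]}
unaligned = {"layer.weight": [0.25, -1.5, 2.0]}
compromised = {"layer.weight": [1.0, 0.0, 3.0]}  # fine-tuned, safety degraded

vector = safety_vector(aligned, unaligned)
realigned = add_safety_vector(compromised, vector)

print(vector)     # {'layer.weight': [0.25, 0.5, 0.0]}
print(realigned)  # {'layer.weight': [1.25, 0.5, 3.0]}
```

With real checkpoints the same element-wise arithmetic would be applied to every tensor in the model's parameter dictionary rather than to Python lists.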