Steering Llama 2 via Contrastive Activation Addition

AI-generated keywords: Contrastive Activation Addition, Language Models, Steering Vectors, Behavioral Question Datasets, Large Language Models

AI-generated Key Points

  • Researchers Nina Panickssery, Nick Gabrieli, Julian Schulz, Meg Tong, Evan Hubinger, and Alexander Matt Turner introduce Contrastive Activation Addition (CAA) for guiding language models.
  • CAA modifies activations during forward passes: it computes "steering vectors" by averaging the difference in residual stream activations between pairs of positive and negative examples of a specific behavior.
  • During inference, steering vectors are added at all token positions after the user's prompt, with a positive or negative coefficient that controls the degree of the targeted behavior.
  • CAA significantly influences model behavior while minimally impacting overall capabilities when evaluated on Llama 2 Chat with various behavioral question datasets and open-ended generation tasks.
  • Insights into CAA's mechanisms are gained through different activation space interpretation methods.
  • The research demonstrates how CAA accurately guides model outputs and provides insights into the representation of high-level concepts in Large Language Models (LLMs).
  • An example illustrating the impact of sycophancy steering using CAA on open-ended generation tasks with Llama 2 7B at layer 13 using specific multipliers is provided in Table 1.
  • Diagrams illustrate the process of generating steering vectors for CAA and applying them during inference to effectively control model behavior.
  • Overall, this research highlights Contrastive Activation Addition as an innovative approach to steer language models towards desired behaviors and its significance in understanding Large Language Models.

Authors: Nina Panickssery, Nick Gabrieli, Julian Schulz, Meg Tong, Evan Hubinger, Alexander Matt Turner

License: CC BY 4.0

Abstract: We introduce Contrastive Activation Addition (CAA), an innovative method for steering language models by modifying their activations during forward passes. CAA computes "steering vectors" by averaging the difference in residual stream activations between pairs of positive and negative examples of a particular behavior, such as factual versus hallucinatory responses. During inference, these steering vectors are added at all token positions after the user's prompt with either a positive or negative coefficient, allowing precise control over the degree of the targeted behavior. We evaluate CAA's effectiveness on Llama 2 Chat using multiple-choice behavioral question datasets and open-ended generation tasks. We demonstrate that CAA significantly alters model behavior, is effective over and on top of traditional methods like finetuning and system prompt design, and minimally reduces capabilities. Moreover, we gain deeper insights into CAA's mechanisms by employing various activation space interpretation methods. CAA accurately steers model outputs and sheds light on how high-level concepts are represented in Large Language Models (LLMs).
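The averaging step described in the abstract can be illustrated with a minimal sketch. It assumes the residual stream activations for each contrast pair have already been extracted as plain lists of floats; the function name `mean_difference_vector` and the toy numbers are hypothetical and are not taken from the paper or its code.

```python
from typing import List, Sequence

Vector = List[float]


def mean_difference_vector(positive: Sequence[Vector],
                           negative: Sequence[Vector]) -> Vector:
    """Average (positive - negative) over paired activation vectors."""
    if not positive or len(positive) != len(negative):
        raise ValueError("need a non-empty, equal number of paired examples")
    dim = len(positive[0])
    total = [0.0] * dim
    for pos, neg in zip(positive, negative):
        for i in range(dim):
            total[i] += pos[i] - neg[i]
    return [t / len(positive) for t in total]


# Toy 3-dimensional "activations" for two contrast pairs (made-up numbers).
pos_acts = [[1.0, 2.0, 0.0], [3.0, 2.0, 1.0]]
neg_acts = [[0.0, 2.0, 1.0], [1.0, 0.0, 1.0]]

steering_vector = mean_difference_vector(pos_acts, neg_acts)
print(steering_vector)  # [1.5, 1.0, -0.5]
```

In a real setting the vectors would have the model's hidden dimension and would come from one chosen layer; the toy example only shows the arithmetic.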

Submitted to arXiv on 09 Dec. 2023


Results of the summarizing process for the arXiv paper: 2312.06681v4

In "Steering Llama 2 via Contrastive Activation Addition," Nina Panickssery, Nick Gabrieli, Julian Schulz, Meg Tong, Evan Hubinger, and Alexander Matt Turner introduce Contrastive Activation Addition (CAA), a method for steering language models by modifying their activations during forward passes. CAA computes "steering vectors" by averaging the difference in residual stream activations between pairs of positive and negative examples of a specific behavior. During inference, these vectors are added at all token positions after the user's prompt, with a positive or negative coefficient that sets the direction and degree of the targeted behavior. The authors evaluate CAA on Llama 2 Chat using multiple-choice behavioral question datasets and open-ended generation tasks. They report that CAA significantly alters model behavior, remains effective on top of finetuning and system prompt design, and minimally reduces capabilities. Activation space interpretation methods give further insight into how CAA works and into how high-level concepts are represented in Large Language Models (LLMs). Table 1 gives an example of the effect of sycophancy steering on open-ended generation with Llama 2 7B at layer 13 using specific multipliers, and the paper's diagrams illustrate how steering vectors are generated and then applied during inference.
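The inference-time step (adding the vector only at positions after the user's prompt, scaled by a signed coefficient) can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: activations are represented as one list of floats per token position, and `add_steering`, `prompt_length`, and `multiplier` are hypothetical names chosen for this example.

```python
from typing import List

Vector = List[float]


def add_steering(activations: List[Vector],
                 steering_vector: Vector,
                 prompt_length: int,
                 multiplier: float) -> List[Vector]:
    """Return a copy of the activations with multiplier * steering_vector
    added at every token position at or beyond prompt_length."""
    steered = []
    for position, act in enumerate(activations):
        if position >= prompt_length:
            # Position lies after the user's prompt: apply the shift.
            steered.append([a + multiplier * s
                            for a, s in zip(act, steering_vector)])
        else:
            # Prompt positions are left unchanged.
            steered.append(list(act))
    return steered


# Three token positions, 2-dimensional toy activations (made-up numbers).
acts = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
vec = [1.0, -1.0]

more = add_steering(acts, vec, prompt_length=1, multiplier=2.0)
less = add_steering(acts, vec, prompt_length=1, multiplier=-1.0)
print(more)  # [[0.0, 0.0], [3.0, -1.0], [4.0, 0.0]]
print(less)  # [[0.0, 0.0], [0.0, 2.0], [1.0, 3.0]]
```

A positive multiplier pushes activations along the behavior direction and a negative one pushes against it; the magnitude sets how strong the shift is.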
Created on 08 Apr. 2026

