LLMs Encode Harmfulness and Refusal Separately

AI-generated keywords: Harmfulness

AI-generated Key Points

  • Researchers identify harmfulness as a new dimension for analyzing safety in Large Language Models (LLMs), encoded internally as a concept separate from refusal
  • A harmfulness direction exists that is distinct from the refusal direction; steering along it leads LLMs to interpret harmless instructions as harmful
  • Certain jailbreak methods work by reducing refusal signals without reversing the model's internal belief of harmfulness
  • Adversarially fine-tuning models to accept harmful instructions has minimal impact on the model's internal belief of harmfulness
  • The model's latent representation of harmfulness can act as an intrinsic safeguard (Latent Guard) that detects unsafe inputs, reduces over-refusals, and is robust to fine-tuning attacks

Authors: Jiachen Zhao, Jing Huang, Zhengxuan Wu, David Bau, Weiyan Shi

License: CC BY 4.0

Abstract: LLMs are trained to refuse harmful instructions, but do they truly understand harmfulness beyond just refusing? Prior work has shown that LLMs' refusal behaviors can be mediated by a one-dimensional subspace, i.e., a refusal direction. In this work, we identify a new dimension to analyze safety mechanisms in LLMs, i.e., harmfulness, which is encoded internally as a separate concept from refusal. There exists a harmfulness direction that is distinct from the refusal direction. As causal evidence, steering along the harmfulness direction can lead LLMs to interpret harmless instructions as harmful, but steering along the refusal direction tends to elicit refusal responses directly without reversing the model's judgment on harmfulness. Furthermore, using our identified harmfulness concept, we find that certain jailbreak methods work by reducing the refusal signals without reversing the model's internal belief of harmfulness. We also find that adversarially finetuning models to accept harmful instructions has minimal impact on the model's internal belief of harmfulness. These insights lead to a practical safety application: The model's latent harmfulness representation can serve as an intrinsic safeguard (Latent Guard) for detecting unsafe inputs and reducing over-refusals that is robust to finetuning attacks. For instance, our Latent Guard achieves performance comparable to or better than Llama Guard 3 8B, a dedicated finetuned safeguard model, across different jailbreak methods. Our findings suggest that LLMs' internal understanding of harmfulness is more robust than their refusal decision to diverse input instructions, offering a new perspective to study AI safety.
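
To make the idea of a "direction" and of "steering along" it concrete, here is a minimal sketch in plain Python. It assumes a difference-of-means construction, which is a common way to obtain such directions in interpretability work; the abstract does not say this is how the authors compute theirs. All function names and the 3-dimensional toy vectors are hypothetical, and the toy data is built so that the two directions come out orthogonal. It illustrates the mechanics only and says nothing about real model activations.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def difference_of_means_direction(positive, negative):
    """Unit vector pointing from the mean of `negative` to the mean of `positive`."""
    mu_pos = mean_vector(positive)
    mu_neg = mean_vector(negative)
    return unit([p - q for p, q in zip(mu_pos, mu_neg)])

def steer(hidden, direction, alpha):
    """Shift a hidden state by alpha along a direction (activation addition)."""
    return [h + alpha * d for h, d in zip(hidden, direction)]

# Made-up 3-d "hidden states"; real ones would be read from a model layer.
harmful_states = [[2.0, 0.1, 0.0], [1.8, -0.1, 0.2]]
harmless_states = [[0.0, 0.1, 0.0], [0.2, -0.1, 0.2]]
refused_states = [[0.1, 2.0, 0.0], [0.0, 1.8, 0.1]]
complied_states = [[0.1, 0.0, 0.0], [0.0, 0.2, 0.1]]

harm_dir = difference_of_means_direction(harmful_states, harmless_states)
refusal_dir = difference_of_means_direction(refused_states, complied_states)

# Cosine similarity between the two unit directions; near 0 means they are distinct.
cosine = dot(harm_dir, refusal_dir)

# Steering a harmless state along the harmfulness direction raises its projection on it.
steered = steer(harmless_states[0], harm_dir, alpha=2.0)
```

In a real experiment the shift would be applied to activations inside the model during the forward pass, and the effect would be judged from the model's outputs rather than from a projection value.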

Submitted to arXiv on 16 Jul. 2025

Results of the summarizing process for the arXiv paper: 2507.11878v1

In this study, the researchers examine whether Large Language Models (LLMs) understand harmfulness beyond simply refusing harmful instructions. Previous research has shown that LLMs' refusal behaviors can be mediated by a one-dimensional subspace known as the refusal direction. Building on this, the researchers identify a new dimension, harmfulness, which is encoded internally as a concept separate from refusal, with a harmfulness direction that is distinct from the refusal direction.

As causal evidence, steering along the harmfulness direction can lead LLMs to interpret harmless instructions as harmful. Steering along the refusal direction, by contrast, tends to elicit refusal responses directly without reversing the model's judgment of harmfulness. The researchers also find that certain jailbreak methods work by reducing refusal signals without changing the model's internal belief of harmfulness, and that adversarially fine-tuning models to accept harmful instructions has minimal impact on that belief.

These findings lead to a practical application: the model's latent representation of harmfulness can act as an intrinsic safeguard (Latent Guard) that detects unsafe inputs, reduces over-refusals, and is robust to fine-tuning attacks. They also suggest that LLMs' internal understanding of harmfulness is more robust to diverse input instructions than their refusal decisions.

Additionally, through steering experiments with reply inversion tasks focused on different risk categories, the researchers analyze how harmfulness representations vary across categories. These interventions aim to reduce LLMs' perception of harm by steering test instructions in reverse directions based on specific risk categories. Overall, the study sheds light on how LLMs internally encode and differentiate the concepts of harmfulness and refusal, offering a new perspective for studying AI safety.
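
As a rough illustration of how a latent harmfulness representation could serve as a safeguard, the sketch below labels an input by which reference cluster its hidden state is closer to. This is an assumed nearest-centroid rule chosen for simplicity, not the paper's Latent Guard implementation, which the summary does not describe in detail. The class name `LatentGuardSketch` and the 2-dimensional toy vectors are hypothetical.

```python
import math

def centroid(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

class LatentGuardSketch:
    """Toy detector: flags a hidden state that lies closer to the harmful centroid."""

    def __init__(self, harmful_states, harmless_states):
        self.harmful_centroid = centroid(harmful_states)
        self.harmless_centroid = centroid(harmless_states)

    def is_harmful(self, hidden):
        # Compare Euclidean distances to the two reference centroids.
        d_harmful = math.dist(hidden, self.harmful_centroid)
        d_harmless = math.dist(hidden, self.harmless_centroid)
        return d_harmful < d_harmless

# Made-up reference states; real ones would be hidden states of labeled prompts.
guard = LatentGuardSketch(
    harmful_states=[[2.0, 0.1], [1.8, -0.1]],
    harmless_states=[[0.0, 0.1], [0.2, -0.1]],
)

flag_a = guard.is_harmful([1.7, 0.0])  # lies near the harmful cluster
flag_b = guard.is_harmful([0.3, 0.0])  # lies near the harmless cluster
```

Because the decision is read from the hidden state rather than from the generated reply, a check of this kind does not depend on whether the model ends up refusing, which is the property the summary attributes to Latent Guard.
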
Created on 23 Aug. 2025
