LLMs Encode Harmfulness and Refusal Separately

AI-generated keywords: Harmfulness

AI-generated Key Points

- Researchers identify harmfulness as a new dimension of LLM safety mechanisms, encoded internally as a concept separate from refusal
- A harmfulness direction distinct from the refusal direction exists; steering along it can lead LLMs to interpret harmless instructions as harmful
- Certain jailbreak methods work by reducing refusal signals without reversing the model's internal belief of harmfulness
- Adversarially fine-tuning models to accept harmful instructions has minimal impact on the model's internal belief of harmfulness
- The model's latent harmfulness representation can serve as an intrinsic safeguard (Latent Guard) that detects unsafe inputs, reduces over-refusals, and is robust to fine-tuning attacks


Authors: Jiachen Zhao, Jing Huang, Zhengxuan Wu, David Bau, Weiyan Shi

arXiv: 2507.11878v1 (cs.CL)

License: CC BY 4.0

Abstract: LLMs are trained to refuse harmful instructions, but do they truly understand harmfulness beyond just refusing? Prior work has shown that LLMs' refusal behaviors can be mediated by a one-dimensional subspace, i.e., a refusal direction. In this work, we identify a new dimension to analyze safety mechanisms in LLMs, i.e., harmfulness, which is encoded internally as a separate concept from refusal. There exists a harmfulness direction that is distinct from the refusal direction. As causal evidence, steering along the harmfulness direction can lead LLMs to interpret harmless instructions as harmful, but steering along the refusal direction tends to elicit refusal responses directly without reversing the model's judgment on harmfulness. Furthermore, using our identified harmfulness concept, we find that certain jailbreak methods work by reducing the refusal signals without reversing the model's internal belief of harmfulness. We also find that adversarially finetuning models to accept harmful instructions has minimal impact on the model's internal belief of harmfulness. These insights lead to a practical safety application: The model's latent harmfulness representation can serve as an intrinsic safeguard (Latent Guard) for detecting unsafe inputs and reducing over-refusals that is robust to finetuning attacks. For instance, our Latent Guard achieves performance comparable to or better than Llama Guard 3 8B, a dedicated finetuned safeguard model, across different jailbreak methods. Our findings suggest that LLMs' internal understanding of harmfulness is more robust than their refusal decision to diverse input instructions, offering a new perspective to study AI safety.

Submitted to arXiv on 16 Jul. 2025


Results of the summarizing process for the arXiv paper: 2507.11878v1

Comprehensive Summary

In this study, the researchers ask whether Large Language Models (LLMs) understand harmfulness beyond simply refusing harmful instructions. Prior work has shown that LLMs' refusal behavior can be mediated by a one-dimensional subspace known as the refusal direction. Building on this, the researchers identify a new dimension, harmfulness, which is encoded internally as a concept separate from refusal, with a harmfulness direction that is distinct from the refusal direction. As causal evidence, steering along the harmfulness direction can lead LLMs to interpret harmless instructions as harmful, whereas steering along the refusal direction tends to elicit refusal responses directly without reversing the model's judgment of harmfulness. The researchers also find that certain jailbreak methods work by reducing refusal signals without reversing the model's internal belief of harmfulness, and that adversarially fine-tuning models to accept harmful instructions has minimal impact on that belief. These findings lead to a practical application: the model's latent representation of harmfulness can act as an intrinsic safeguard (Latent Guard) that detects unsafe inputs and reduces over-refusals while remaining robust to fine-tuning attacks. Taken together, the results suggest that LLMs' internal understanding of harmfulness is more robust than their refusal decisions across diverse input instructions, which offers a fresh perspective on studying AI safety.

Additionally, through steering experiments with reply inversion tasks focused on different risk categories, the researchers analyze how harmfulness representations vary across categories. These interventions aim to reduce the model's perception of harm by steering test instructions in the reverse direction for a specific risk category. Overall, the study sheds light on how LLMs internally encode and differentiate between harmfulness and refusal, paving the way for improved AI safety measures.


Summary

- Researchers found a new way to measure how bad something is in big language models.
- They discovered that these models can get confused and think harmless things are actually harmful.
- Some tricks can make the models ignore warnings without changing their overall understanding of what's bad.
- Trying to teach the models to accept harmful things doesn't really change how they see what's bad inside.
- The model's hidden idea of what's bad can help protect against dangerous inputs and not refuse too much even when under attack.

Definitions

- Researchers: People who study and learn new things through experiments and observations.
- Harmfulness: How much something can hurt or cause damage.
- Large Language Models (LLMs): Big computer programs that understand and generate human language.
- Misinterpret: To understand something incorrectly or in the wrong way.
- Latent representation: A hidden way that information is stored or shown in a model.

Introduction

Large Language Models (LLMs) have been making headlines in recent years for their impressive capabilities in natural language processing tasks. As these models become more prevalent and integrated into various applications, concerns about their potential for harm have also emerged, and researchers have been exploring ways to mitigate the associated risks. In a recent study titled "LLMs Encode Harmfulness and Refusal Separately," Jiachen Zhao, Jing Huang, Zhengxuan Wu, David Bau, and Weiyan Shi identify and study a new dimension, harmfulness, that goes beyond the ability of LLMs to refuse harmful instructions. The research offers insights into how LLMs internally encode and differentiate between harmfulness and refusal, providing a fresh perspective on AI safety measures.

The Refusal Direction

Previous studies have shown that LLMs' refusal behavior can be mediated by a one-dimensional subspace known as the refusal direction. In other words, whether a model refuses an instruction that could cause harm or violate ethical principles is largely governed by a single direction in its internal representations. This behavior can, however, be manipulated through fine-tuning or other techniques. The researchers build on this understanding by identifying a separate concept, harmfulness, within LLMs' internal representations, and they find that its direction is distinct from the refusal direction.
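To make the idea of a "direction" concrete, here is a minimal sketch that assumes a direction is estimated as the normalized difference between the mean hidden states of two groups of prompts. This summary does not say how the paper extracts its directions, so the difference-of-means recipe, the function names, and the toy three-dimensional vectors are all illustrative assumptions rather than the authors' method.

```python
from math import sqrt


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def unit(vector):
    """Scale a vector to unit length."""
    norm = sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def direction(positive, negative):
    """Unit vector pointing from the mean of `negative` to the mean of `positive`."""
    mu_pos = mean_vector(positive)
    mu_neg = mean_vector(negative)
    return unit([p - q for p, q in zip(mu_pos, mu_neg)])


# Toy "hidden states" (made up for illustration, not real model activations).
refused = [[2.0, 0.0, 1.0], [4.0, 0.0, 1.0]]   # prompts the model refused
complied = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]  # prompts the model answered

refusal_dir = direction(refused, complied)
print(refusal_dir)
```

The same helper could in principle be applied to harmful versus harmless prompts to obtain a candidate harmfulness direction; whether two such directions coincide is exactly the question the study examines.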

Causal Evidence

To test whether the two concepts are genuinely separate, the researchers run causal experiments in which they steer the model along different directions in its latent space, using reply inversion tasks focused on different risk categories. Steering along the harmfulness direction can cause LLMs to interpret harmless instructions as harmful. Steering along the refusal direction, by contrast, tends to elicit refusal responses directly without reversing the model's judgment of harmfulness. Steering can also be applied in the reverse direction for specific risk categories, with the aim of reducing the model's perception of harm.
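Steering is commonly implemented by adding a scaled direction vector to a hidden state. The sketch below assumes that additive form; the function names, the coefficient `alpha`, and the toy vectors are hypothetical. The two directions are made exactly orthogonal here purely to keep the arithmetic readable, whereas the source only states that they are distinct.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def steer(hidden, direction, alpha):
    """Return hidden + alpha * direction (additive steering)."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Toy unit directions (illustrative assumption, not extracted from a model).
harm_dir = [1.0, 0.0, 0.0]
refusal_dir = [0.0, 1.0, 0.0]

hidden = [0.5, 0.2, 1.0]  # toy hidden state for a harmless instruction

# Push the representation along the harmfulness direction.
steered = steer(hidden, harm_dir, alpha=2.0)

harm_signal = dot(steered, harm_dir)        # grows from 0.5 to 2.5
refusal_signal = dot(steered, refusal_dir)  # stays at 0.2 in this toy setup
print(steered, harm_signal, refusal_signal)
```

A negative `alpha` gives the reverse-direction intervention described above, which lowers the projection onto the chosen direction instead of raising it.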

Jailbreak Methods and Adversarial Fine-Tuning

The researchers also use the harmfulness concept to examine how jailbreaks work. They find that certain jailbreak methods bypass refusal by reducing the refusal signal without reversing the model's internal belief of harmfulness, so the model can produce potentially harmful outputs for instructions it still represents internally as harmful. Similarly, adversarially fine-tuning models to accept harmful instructions has minimal impact on the model's internal belief of harmfulness. Both results point to the need for safety mechanisms that do not rely solely on LLMs' refusal behavior.
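One way to picture this finding is as a diagnostic that compares projections onto the two directions before and after a jailbreak is applied. The sketch below is a toy illustration under that assumption: `signal_shift`, the two-dimensional vectors, and the orthogonal directions are all hypothetical, and the numbers are invented to show the pattern rather than taken from the paper.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def signal_shift(before, after, direction):
    """Change in projection onto a unit `direction` between two hidden states."""
    return dot(after, direction) - dot(before, direction)


# Toy unit directions (illustrative assumption).
harm_dir = [1.0, 0.0]
refusal_dir = [0.0, 1.0]

# Toy hidden states for the same harmful instruction, without and with a jailbreak wrapper.
plain_prompt = [3.0, 2.0]
jailbroken_prompt = [3.0, 0.5]

refusal_shift = signal_shift(plain_prompt, jailbroken_prompt, refusal_dir)
harm_shift = signal_shift(plain_prompt, jailbroken_prompt, harm_dir)
print(refusal_shift, harm_shift)
```

In this toy case the refusal projection drops while the harmfulness projection is unchanged, which is the pattern the study reports for certain jailbreak methods.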

Latent Guard: Leveraging Harmfulness Representations for AI Safety

Based on these findings, the researchers propose a practical application called Latent Guard, in which an LLM's latent representation of harmfulness acts as an intrinsic safeguard that detects unsafe inputs and reduces over-refusals while remaining robust to fine-tuning attacks. According to the abstract, Latent Guard achieves performance comparable to or better than Llama Guard 3 8B, a dedicated fine-tuned safeguard model, across different jailbreak methods. The approach relies on the model's internal representation of harm rather than on its refusal decision.
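The summary does not specify how Latent Guard turns hidden states into a decision, so the following is only a sketch of one plausible design: a nearest-centroid rule over hidden states of known harmful and harmless prompts. The class name `LatentGuardSketch`, its methods, and the toy two-dimensional data are hypothetical and should not be read as the authors' implementation.

```python
from math import dist


def centroid(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


class LatentGuardSketch:
    """Flags an input as unsafe if its hidden state is closer to the harmful centroid."""

    def __init__(self, harmful_states, harmless_states):
        self.harmful_centroid = centroid(harmful_states)
        self.harmless_centroid = centroid(harmless_states)

    def is_unsafe(self, hidden):
        to_harmful = dist(hidden, self.harmful_centroid)
        to_harmless = dist(hidden, self.harmless_centroid)
        return to_harmful < to_harmless


# Toy reference hidden states (made up for illustration).
guard = LatentGuardSketch(
    harmful_states=[[1.0, 1.0], [3.0, 1.0]],
    harmless_states=[[-1.0, 0.0], [-3.0, 0.0]],
)

print(guard.is_unsafe([1.5, 0.5]))   # close to the harmful cluster
print(guard.is_unsafe([-1.0, 0.2]))  # close to the harmless cluster
```

Because the decision depends only on where the hidden state sits, not on whether the model goes on to refuse, a rule of this kind would be unaffected by attacks that suppress refusal while leaving the harmfulness representation intact.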

Variations in Harmfulness Representations Across Categories

To understand how harmfulness representations differ across risk categories, the researchers conduct steering experiments with reply inversion tasks focused on specific categories such as violence or hate speech. These interventions aim to reduce the model's perception of harm by steering test instructions in the reverse direction for a given category. The analysis reveals variations in how LLMs represent harm across risk categories, which should be taken into account when developing safety measures for LLMs.
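A simple way to quantify such variation, assuming one harmfulness direction has been estimated per risk category, is to compare the directions pairwise with cosine similarity. The category names and vectors below are placeholders invented for illustration; the summary does not report which metric the paper uses.

```python
from itertools import combinations
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    inner = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return inner / (norm_a * norm_b)


# Hypothetical per-category harmfulness directions (toy values).
category_dirs = {
    "category_a": [1.0, 0.0],
    "category_b": [1.0, 1.0],
    "category_c": [0.0, 2.0],
}

# Pairwise similarity for every unordered pair of categories.
similarities = {
    (first, second): cosine(category_dirs[first], category_dirs[second])
    for first, second in combinations(category_dirs, 2)
}

for pair, value in similarities.items():
    print(pair, round(value, 3))
```

Values near 1 would indicate that two categories share nearly the same direction, while values near 0 would indicate largely unrelated representations.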

Conclusion

In conclusion, this study offers valuable insights into how LLMs internally encode and differentiate between harmfulness and refusal. By identifying harmfulness as a separate dimension, the researchers open up new possibilities for more robust AI safety measures. Their experiments highlight the limitations of relying solely on LLMs' refusal behavior and motivate a practical application, Latent Guard, that leverages the model's internal understanding of harm. The work paves the way for further research into handling diverse input instructions safely.

Created on 23 Aug. 2025


