Emoji Attack: Enhancing Jailbreak Attacks Against Judge LLM Detection

AI-generated keywords: Jailbreaking

AI-generated Key Points

Jailbreaking techniques threaten Large Language Models (LLMs) by manipulating them into producing restricted or harmful output.
One defense is to use another LLM as a Judge that evaluates text for harmfulness, but Judge LLMs are susceptible to token segmentation bias.
Emoji Attack strategically inserts emojis into text before a Judge LLM evaluates it, reducing detection accuracy and allowing unsafe content to slip past moderation filters.
Emojis also introduce semantic ambiguity, which makes them more effective at evading detection than traditional delimiters.
The research question is whether seemingly harmless constructs like emojis can alter the decision boundaries of Judge LLMs and enable harmful content to bypass moderation filters.


Authors: Zhipeng Wei, Yuqi Liu, N. Benjamin Erichson

arXiv: 2411.01077v5 (cs.CL)

License: CC BY 4.0

Abstract: Jailbreaking techniques trick Large Language Models (LLMs) into producing restricted output, posing a potential threat. One line of defense is to use another LLM as a Judge to evaluate the harmfulness of generated text. However, we reveal that these Judge LLMs are vulnerable to token segmentation bias, an issue that arises when delimiters alter the tokenization process, splitting words into smaller sub-tokens. This alters the embeddings of the entire sequence, reducing detection accuracy and allowing harmful content to be misclassified as safe. In this paper, we introduce Emoji Attack, a novel strategy that amplifies existing jailbreak prompts by exploiting token segmentation bias. Our method leverages in-context learning to systematically insert emojis into text before it is evaluated by a Judge LLM, inducing embedding distortions that significantly lower the likelihood of detecting unsafe content. Unlike traditional delimiters, emojis also introduce semantic ambiguity, making them particularly effective in this attack. Through experiments on state-of-the-art Judge LLMs, we demonstrate that Emoji Attack substantially reduces the unsafe prediction rate, bypassing existing safeguards.
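The token segmentation effect described in the abstract can be pictured with a small sketch. Everything here is an assumption made for illustration: the vocabulary `TOY_VOCAB`, the greedy longest-match tokenizer, and the helper `insert_delimiter` are hypothetical and are not the tokenizer of any real Judge LLM or the authors' code. The sketch only shows how a delimiter placed inside a word forces a tokenizer to emit smaller sub-tokens.

```python
from __future__ import annotations

# Hypothetical toy vocabulary; real tokenizers use learned vocabularies.
TOY_VOCAB = frozenset({"moderation", "moder", "ation", "mod", "er", "at", "ion"})


def greedy_tokenize(text: str, vocab: frozenset[str] = TOY_VOCAB) -> list[str]:
    """Greedy longest-match tokenization over a fixed vocabulary.

    Characters that match no vocabulary entry are emitted as single tokens.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate starting at position i first.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # Nothing in the vocabulary matches here: emit one character.
            tokens.append(text[i])
            i += 1
    return tokens


def insert_delimiter(word: str, position: int, delimiter: str) -> str:
    """Return `word` with `delimiter` inserted before index `position`."""
    return word[:position] + delimiter + word[position:]


plain_tokens = greedy_tokenize("moderation")
split_tokens = greedy_tokenize(insert_delimiter("moderation", 5, "🙂"))

print(plain_tokens)  # ['moderation']
print(split_tokens)  # ['moder', '🙂', 'ation']
```

In this toy setting the intact word is a single token, while the same word with an emoji inside it becomes three tokens. A real subword tokenizer would split differently, but the general effect is the one the abstract names.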

Submitted to arXiv on 01 Nov. 2024


Results of the summarizing process for the arXiv paper: 2411.01077v5


Comprehensive Summary

Jailbreaking techniques manipulate Large Language Models (LLMs) into producing restricted or harmful output. One defense is to use another LLM as a Judge that evaluates generated text for harmfulness. The paper shows, however, that these Judge LLMs are susceptible to token segmentation bias: delimiters in the text alter the tokenization process and split words into smaller sub-tokens. This changes the embeddings of the entire sequence and reduces detection accuracy, so harmful content can be misclassified as safe.

To exploit this vulnerability, the authors introduce Emoji Attack. The technique uses in-context learning to insert emojis into text before a Judge LLM evaluates it. The resulting embedding distortions significantly lower the likelihood that unsafe content is detected. Unlike traditional delimiters, emojis also introduce semantic ambiguity, which makes them particularly effective at evading detection. In experiments on state-of-the-art Judge LLMs, Emoji Attack substantially reduces the unsafe prediction rate and bypasses existing safeguards.

The research question is whether seemingly harmless linguistic constructs like emojis can alter the decision boundaries of Judge LLMs and enable harmful content to slip past moderation filters. Many Judge LLMs assign numerical scores that indicate how severely content violates ethical, legal, or safety standards. If a response surpasses a predefined threshold score, it is flagged as unsafe. Automated moderation of this kind is promising, but it remains susceptible to specific exploits such as token segmentation bias.

The study highlights this weakness in Judge LLMs and underscores the importance of understanding how minor input modifications affect contextual understanding and, ultimately, how well moderation mechanisms guard against harmful LLM-generated content.
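The threshold rule mentioned in the summary can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the threshold value, and both score lists are hypothetical and invented for the example; they are not results from the paper. The sketch only shows why a modest drop in judge scores near the threshold changes the unsafe prediction rate.

```python
from __future__ import annotations


def is_flagged(score: float, threshold: float) -> bool:
    """Flag a response as unsafe when its score surpasses the threshold."""
    return score > threshold


def unsafe_prediction_rate(scores: list[float], threshold: float) -> float:
    """Fraction of responses that the threshold rule flags as unsafe."""
    if not scores:
        raise ValueError("scores must not be empty")
    flagged = sum(is_flagged(s, threshold) for s in scores)
    return flagged / len(scores)


THRESHOLD = 0.5  # hypothetical cut-off, not taken from the paper

# Made-up judge scores for the same four responses, before and after
# an input modification that lowers some of the scores.
scores_before = [0.9, 0.7, 0.55, 0.2]
scores_after = [0.8, 0.45, 0.4, 0.2]

rate_before = unsafe_prediction_rate(scores_before, THRESHOLD)
rate_after = unsafe_prediction_rate(scores_after, THRESHOLD)

print(rate_before)  # 0.75
print(rate_after)   # 0.25
```

Because the decision is a hard comparison against one threshold, responses whose scores sit just above it are the ones most easily pushed into the "safe" class.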


Layman's Summary

Summary
- Some people trick AI chat programs into saying things they are not supposed to say.
- To protect against this, a second AI program checks whether the first one's answers are safe.
- Putting emojis into an answer can trick the checking program and let bad things through.
- Emojis make it harder for the checking program to understand what is being said.
- The researchers want to know whether emojis can help bad things get past safety checks.

Definitions
- Jailbreaking: Tricking an AI program into doing things its rules are meant to prevent.
- Large Language Models (LLMs): Computer programs that understand and generate human language.
- Manipulating: Controlling or influencing something in a clever way.
- Emojis: Small pictures used in text messages to express emotions or ideas.
- Semantic ambiguity: Words or symbols having more than one possible meaning.

Blog article

Title: "Emoji Attack: A Novel Strategy to Evade Detection by Judge LLMs"

Introduction

The rise of Large Language Models (LLMs) has brought numerous advances in natural language processing, but it has also raised concerns that these models can produce harmful or restricted output. To address this, researchers have proposed using another LLM as a Judge to evaluate generated text for safety and ethical considerations. The paper reveals a vulnerability in this approach known as token segmentation bias, where delimiters in the text alter the tokenization process and reduce detection accuracy. It introduces a new technique, Emoji Attack, that exploits this weakness.

What is Emoji Attack?

Emoji Attack is a strategy that uses emojis to evade detection by Judge LLMs. By inserting emojis into text before evaluation, it induces embedding distortions that significantly lower the likelihood of detecting unsafe content. Unlike traditional delimiters such as punctuation marks or spaces, emojis introduce semantic ambiguity and make it difficult for moderation mechanisms to accurately assess the context of the text.

How Does It Work?

The key concept behind Emoji Attack is in-context learning, which is used to place emojis strategically within sentences or phrases. The inserted emojis split words into smaller sub-tokens and change the embeddings of the text, which shifts the decision boundaries of Judge LLMs. As a result, detection accuracy drops and potentially dangerous content can be misclassified as safe.

Why Are Emojis Effective?

Emojis are particularly effective at evading detection because they are inherently ambiguous and can convey multiple meanings depending on context. For example, an emoji like 🚫 could be interpreted as "stop" or "no," but when used in conjunction with other words or symbols, its meaning can change entirely. This makes it challenging for Judge LLMs to accurately assess whether a piece of text contains harmful content.

Implications for Automated Moderation Mechanisms

Many online platforms use automated moderation mechanisms to detect and remove harmful content. These mechanisms often assign numerical scores to indicate the severity of content violations, and if a response surpasses a predefined threshold score, it is flagged as unsafe. This study highlights the critical weakness that token segmentation bias causes in Judge LLMs and underscores the importance of understanding how minor input modifications affect contextual understanding.

Conclusion

Emoji Attack has been shown to substantially reduce the unsafe prediction rate and bypass existing safeguards on state-of-the-art Judge LLMs. The research exposes the vulnerability of automated moderation mechanisms to specific exploits like token segmentation bias and emphasizes the need for further investigation into how seemingly harmless linguistic constructs like emojis can be used to manipulate text and evade detection. As LLMs continue to advance, researchers and platform moderators alike need to stay vigilant in identifying potential vulnerabilities and developing effective countermeasures.
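The "embedding distortions" mentioned in the blog article can be pictured with a toy calculation. All of it is assumed for illustration: `toy_embedding` derives a fixed pseudo-vector from a hash of the token, and mean pooling stands in for whatever sequence representation a real Judge LLM computes. Neither is taken from the paper. The sketch only shows that a different token sequence yields a different pooled vector, which can be measured with cosine similarity.

```python
from __future__ import annotations

import hashlib
import math


def toy_embedding(token: str, dim: int = 8) -> list[float]:
    """Deterministic pseudo-embedding built from a hash (illustration only)."""
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    # Map the first `dim` bytes (0..255) into the range [-1, 1].
    return [(b - 127.5) / 127.5 for b in digest[:dim]]


def mean_pool(tokens: list[str]) -> list[float]:
    """Average the token vectors into one sequence-level vector."""
    vectors = [toy_embedding(t) for t in tokens]
    dim = len(vectors[0])
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical token sequences: one intact word versus the same word
# split into sub-tokens around an inserted emoji.
intact = mean_pool(["moderation"])
segmented = mean_pool(["moder", "🙂", "ation"])

print(cosine_similarity(intact, intact))
print(cosine_similarity(intact, segmented))
```

The first value is 1 up to floating-point rounding, since a vector is compared with itself. The second is lower, because the pooled vector of the segmented sequence points in a different direction. How large that shift is in a real model, and how it affects the judgment, is what the paper's experiments measure.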

Created on 11 Sep. 2025


