Emoji Attack: Enhancing Jailbreak Attacks Against Judge LLM Detection

AI-generated keywords: Jailbreaking

AI-generated Key Points

  • Jailbreaking techniques threaten Large Language Models (LLMs) by manipulating them into producing restricted or harmful output.
  • One defense is to use another LLM as a Judge that evaluates text for harmfulness, but Judge LLMs are susceptible to token segmentation bias.
  • The Emoji Attack strategically inserts emojis into text before a Judge LLM evaluates it, reducing detection accuracy and allowing unsafe content to slip past moderation filters.
  • Unlike traditional delimiters, emojis also introduce semantic ambiguity, which makes them particularly effective at evading detection.
  • The research question is whether seemingly harmless constructs like emojis can alter the decision boundaries of Judge LLMs and enable harmful content to bypass moderation filters.

Authors: Zhipeng Wei, Yuqi Liu, N. Benjamin Erichson

License: CC BY 4.0

Abstract: Jailbreaking techniques trick Large Language Models (LLMs) into producing restricted output, posing a potential threat. One line of defense is to use another LLM as a Judge to evaluate the harmfulness of generated text. However, we reveal that these Judge LLMs are vulnerable to token segmentation bias, an issue that arises when delimiters alter the tokenization process, splitting words into smaller sub-tokens. This alters the embeddings of the entire sequence, reducing detection accuracy and allowing harmful content to be misclassified as safe. In this paper, we introduce Emoji Attack, a novel strategy that amplifies existing jailbreak prompts by exploiting token segmentation bias. Our method leverages in-context learning to systematically insert emojis into text before it is evaluated by a Judge LLM, inducing embedding distortions that significantly lower the likelihood of detecting unsafe content. Unlike traditional delimiters, emojis also introduce semantic ambiguity, making them particularly effective in this attack. Through experiments on state-of-the-art Judge LLMs, we demonstrate that Emoji Attack substantially reduces the unsafe prediction rate, bypassing existing safeguards.
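
The token segmentation effect described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the greedy longest-match tokenizer, the toy vocabulary, and the helper names `tokenize` and `insert_delimiter` are hypothetical and are not the tokenizer or the method used in the paper. The sketch only shows how a delimiter placed inside a word forces it to be split into smaller sub-tokens.

```python
def tokenize(text, vocab):
    """Greedy longest-match subword tokenizer over a toy vocabulary (illustrative only)."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest vocabulary entry that starts at position i.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # No vocabulary entry matches: emit the single character as its own token.
            tokens.append(text[i])
            i += 1
    return tokens


def insert_delimiter(word, position, delimiter):
    """Insert a delimiter (for example an emoji) at a character offset inside a word."""
    return word[:position] + delimiter + word[position:]


# Hypothetical vocabulary: the whole word plus two of its fragments.
TOY_VOCAB = {"education", "edu", "cation"}

plain_tokens = tokenize("education", TOY_VOCAB)
split_tokens = tokenize(insert_delimiter("education", 3, "😊"), TOY_VOCAB)

print(plain_tokens)  # one token for the intact word
print(split_tokens)  # the word is split around the emoji
```

In this toy setting the intact word maps to a single token, while the version with an emoji maps to three. A real Judge LLM would therefore see a different token sequence, and hence different embeddings, for the same underlying word.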

Submitted to arXiv on 01 Nov. 2024


Results of the summarizing process for the arXiv paper: 2411.01077v5

In recent years, jailbreaking techniques have emerged as a threat to Large Language Models (LLMs), manipulating them into producing restricted or harmful output. One defense is to use another LLM as a Judge that evaluates the generated text for harmfulness. However, the authors show that these Judge LLMs are susceptible to token segmentation bias: delimiters in the text alter the tokenization process and split words into smaller sub-tokens. This changes the embeddings of the entire sequence, which reduces detection accuracy and allows harmful content to be misclassified as safe.
To exploit this vulnerability, the authors introduce a strategy called Emoji Attack. The technique leverages in-context learning to insert emojis into text before a Judge LLM evaluates it. The resulting embedding distortions significantly lower the likelihood that unsafe content is detected. Unlike traditional delimiters, emojis also introduce semantic ambiguity, which makes them particularly effective at evading detection. Experiments on state-of-the-art Judge LLMs show that Emoji Attack substantially reduces the unsafe prediction rate and bypasses existing safeguards.
The research question is whether seemingly harmless linguistic constructs like emojis can alter the decision boundaries of Judge LLMs and enable harmful content to slip past moderation filters. Many Judge LLMs assign numerical scores that indicate the severity of content violations on ethical, legal, or safety grounds. If a response surpasses a predefined threshold score, it is flagged as unsafe. While automated moderation mechanisms show promise, they remain susceptible to specific exploits such as token segmentation bias.
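
The threshold rule described above can be sketched as follows. This is a minimal illustration under stated assumptions: the score range, the default threshold of 0.5, and the names `Verdict`, `flag_response`, and `unsafe_prediction_rate` are hypothetical and do not come from the paper or from any particular Judge LLM.

```python
from dataclasses import dataclass


@dataclass
class Verdict:
    score: float   # severity score assigned by a judge (assumed to lie in [0, 1])
    unsafe: bool   # True when the score surpasses the threshold


def flag_response(score, threshold=0.5):
    """Flag a response as unsafe when its score strictly exceeds the threshold."""
    return Verdict(score=score, unsafe=score > threshold)


def unsafe_prediction_rate(scores, threshold=0.5):
    """Fraction of responses flagged as unsafe under the given threshold."""
    if not scores:
        return 0.0
    flagged = sum(1 for s in scores if flag_response(s, threshold).unsafe)
    return flagged / len(scores)


# Made-up scores, used only to exercise the functions.
example_scores = [0.9, 0.2, 0.7, 0.4]
print(unsafe_prediction_rate(example_scores))
```

Under such a rule, any input change that pushes a score from just above the threshold to just below it flips the verdict, which is why small distortions in the judge's input matter.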
This study sheds light on the critical weakness in Judge LLMs caused by token segmentation bias and underscores the importance of understanding how minor input modifications can impact contextual understanding and ultimately influence the effectiveness of moderation mechanisms in safeguarding against harmful content generated by LLMs.
Created on 11 Sep. 2025

