SneakyPrompt: Evaluating Robustness of Text-to-image Generative Models' Safety Filters

AI-generated keywords: Text-to-image generative models, Not-Safe-for-Work (NSFW) content, Safety filters, SneakyPrompt, Adversarial attacks

AI-generated Key Points

The license of the paper does not allow us to build upon its content and the key points are generated using the paper metadata rather than the full article.

  • Authors Yuchen Yang, Bo Hui, Haolin Yuan, Neil Gong, and Yinzhi Cao focus on text-to-image generative models and the challenges of filtering out Not-Safe-for-Work (NSFW) content.
  • Models like Stable Diffusion and DALL·E 2 have practical applications but struggle with NSFW content filtering.
  • Researchers commonly use safety filters based on textual or visual cues to block NSFW content.
  • SneakyPrompt is introduced as an automated attack framework to assess the robustness of safety filters in cutting-edge text-to-image generative models.
  • SneakyPrompt utilizes reinforcement learning to craft prompts that generate NSFW images capable of evading existing safety filters.
  • Evaluation shows SneakyPrompt's effectiveness in generating NSFW content using DALL·E 2 with its default closed-box safety filter enabled.
  • SneakyPrompt surpasses existing adversarial attacks in query efficiency and image quality when tested against open-source state-of-the-art safety filters on a Stable Diffusion model.
  • The study highlights the importance of robust security measures in AI systems to address vulnerabilities in safety filters.

Authors: Yuchen Yang, Bo Hui, Haolin Yuan, Neil Gong, Yinzhi Cao

Abstract: Text-to-image generative models such as Stable Diffusion and DALL·E 2 have attracted much attention since their publication due to their wide application in the real world. One challenging problem of text-to-image generative models is the generation of Not-Safe-for-Work (NSFW) content, e.g., those related to violence and adult. Therefore, a common practice is to deploy a so-called safety filter, which blocks NSFW content based on either text or image features. Prior works have studied the possible bypass of such safety filters. However, existing works are largely manual and specific to Stable Diffusion's official safety filter. Moreover, the bypass ratio of Stable Diffusion's safety filter is as low as 23.51% based on our evaluation. In this paper, we propose the first automated attack framework, called SneakyPrompt, to evaluate the robustness of real-world safety filters in state-of-the-art text-to-image generative models. Our key insight is to search for alternative tokens in a prompt that generates NSFW images so that the generated prompt (called an adversarial prompt) bypasses existing safety filters. Specifically, SneakyPrompt utilizes reinforcement learning (RL) to guide an agent with positive rewards on semantic similarity and bypass success. Our evaluation shows that SneakyPrompt successfully generated NSFW content using an online model DALL·E 2 with its default, closed-box safety filter enabled. At the same time, we also deploy several open-source state-of-the-art safety filters on a Stable Diffusion model and show that SneakyPrompt not only successfully generates NSFW content, but also outperforms existing adversarial attacks in terms of the number of queries and image qualities.

Submitted to arXiv on 20 May 2023

Results of the summarizing process for the arXiv paper: 2305.12082v1

In "SneakyPrompt: Evaluating Robustness of Text-to-image Generative Models' Safety Filters," Yuchen Yang, Bo Hui, Haolin Yuan, Neil Gong, and Yinzhi Cao study text-to-image generative models and the problem of Not-Safe-for-Work (NSFW) content generation. Models such as Stable Diffusion and DALL·E 2 have attracted much attention because of their wide real-world application, but they can generate NSFW content, for example content related to violence or adult themes. A common practice is therefore to deploy a safety filter that blocks NSFW content based on either text or image features.

Prior work has studied ways to bypass such filters, but those approaches are largely manual and specific to Stable Diffusion's official safety filter. The authors also report that the bypass ratio of that filter is as low as 23.51% in their evaluation. In response, they propose SneakyPrompt, which they describe as the first automated attack framework for evaluating the robustness of real-world safety filters in state-of-the-art text-to-image generative models. Its key insight is to search for alternative tokens in a prompt that generates NSFW images, so that the resulting prompt, called an adversarial prompt, bypasses existing safety filters. SneakyPrompt uses reinforcement learning (RL) to guide an agent with positive rewards for semantic similarity and bypass success.

The evaluation shows that SneakyPrompt successfully generated NSFW content using the online model DALL·E 2 with its default, closed-box safety filter enabled. The authors also deployed several open-source state-of-the-art safety filters on a Stable Diffusion model, where SneakyPrompt not only generated NSFW content but also outperformed existing adversarial attacks in number of queries and image quality. Overall, the study exposes vulnerabilities in the safety filters of text-to-image generative models and highlights the need for more robust safety measures in AI systems.
Created on 07 Jan. 2025