FigStep: Jailbreaking Large Vision-Language Models via Typographic Visual Prompts

AI-generated keywords: FigStep

AI-generated Key Points

FigStep: A novel black-box jailbreak algorithm for LVLM vulnerabilities
LVLMs: Advancements in AI through multimodal incorporation of images and text
Safety Alignment: Addressing risks of overreliance on LLM safety assurances
SafeBench: Comprehensive benchmark for LVLM safety assessment and improvement


Authors: Yichen Gong, Delong Ran, Jinyuan Liu, Conglei Wang, Tianshuo Cong, Anyu Wang, Sisi Duan, Xiaoyun Wang

arXiv: 2311.05608v3 (cs.CR)

AAAI 2025 (Oral)

License: CC BY 4.0

Abstract: Large Vision-Language Models (LVLMs) signify a groundbreaking paradigm shift within the Artificial Intelligence (AI) community, extending beyond the capabilities of Large Language Models (LLMs) by assimilating additional modalities (e.g., images). Despite this advancement, the safety of LVLMs remains inadequately explored, with a potential overreliance on the safety assurances purported by their underlying LLMs. In this paper, we propose FigStep, a straightforward yet effective black-box jailbreak algorithm against LVLMs. Instead of feeding textual harmful instructions directly, FigStep converts the prohibited content into images through typography to bypass the safety alignment. The experimental results indicate that FigStep can achieve an average attack success rate of 82.50% on six promising open-source LVLMs. Not merely to demonstrate the efficacy of FigStep, we conduct comprehensive ablation studies and analyze the distribution of the semantic embeddings to uncover that the reason behind the success of FigStep is the deficiency of safety alignment for visual embeddings. Moreover, we compare FigStep with five text-only jailbreaks and four image-based jailbreaks to demonstrate the superiority of FigStep, i.e., negligible attack costs and better attack performance. Above all, our work reveals that current LVLMs are vulnerable to jailbreak attacks, which highlights the necessity of novel cross-modality safety alignment techniques. Our code and datasets are available at https://github.com/ThuCCSLab/FigStep.

Submitted to arXiv on 09 Nov. 2023


Results of the summarizing process for the arXiv paper: 2311.05608v3

Comprehensive Summary

In this paper, the authors introduce FigStep, a black-box jailbreak algorithm that exploits vulnerabilities in Large Vision-Language Models (LVLMs). These models extend Large Language Models (LLMs) with additional modalities such as images. Their safety, however, has not been adequately explored, and developers may be relying too heavily on the safety assurances of the underlying LLMs.

FigStep converts prohibited textual content into images using typography, which lets the instruction bypass the model's safety alignment. The experimental results show an average attack success rate of 82.50% across six open-source LVLMs. Through ablation studies and an analysis of the distribution of semantic embeddings, the authors attribute FigStep's success to a deficiency of safety alignment for visual embeddings in current LVLMs. The study also compares FigStep against five text-only and four image-based jailbreak techniques and reports negligible attack cost together with better attack performance. The findings underscore the vulnerability of LVLMs to jailbreak attacks and the need for new cross-modality safety alignment methods.

To facilitate further research, the authors introduce SafeBench, a safety benchmark of 500 harmful questions generated with the help of an LLM. The models evaluated include LLaVA-v1.5 and MiniGPT4. The code and datasets associated with FigStep are available on GitHub at https://github.com/ThuCCSLab/FigStep.


Layman's Summary

Summary
1. FigStep is a way to trick AI models that read both pictures and text into answering harmful requests, by writing the request inside a picture instead of typing it.
2. LVLMs are AI models that work with pictures and words together.
3. Safety alignment is the training that teaches an AI model to refuse harmful requests. The paper warns that this training may cover typed text but not text shown in a picture.
4. SafeBench is a set of 500 harmful questions used to test how safe these models are.

Definitions
- FigStep: A method for getting around an AI model's safety rules without needing to know how the model works inside.
- LVLMs: AI models that combine images and text.
- Safety Alignment: Training that makes a model refuse harmful requests. Relying only on the text model's safety training is the risk the paper points out.
- SafeBench: A test set for measuring, and helping to improve, the safety of these models.

Blog article

Introduction

Artificial Intelligence (AI) has made significant strides in recent years, with the incorporation of multiple modalities such as images and text leading to even more advanced models. Large Vision-Language Models (LVLMs) are a prime example of this, combining visual and textual information to achieve impressive performance on various tasks. However, as these models become increasingly prevalent in real-world applications, their safety and security have come under scrutiny. In this paper, the authors introduce FigStep, a novel black-box jailbreak algorithm designed to exploit vulnerabilities in LVLMs. The study highlights the potential risks associated with overreliance on safety assurances from underlying Large Language Models (LLMs), which may not adequately address cross-modality concerns. To facilitate further research and experimentation, the authors also introduce SafeBench - a comprehensive benchmark for LVLM safety assessment and improvement.

The Need for Safety Alignment in LVLMs

The use of multiple modalities in AI has led to significant advancements in natural language processing and computer vision tasks. However, it also introduces new challenges when it comes to ensuring model safety and security. While LLMs have been extensively studied for their vulnerabilities and defenses against adversarial attacks, there is still much work to be done regarding cross-modality alignment. LVLMs rely heavily on LLMs for their language understanding capabilities but also incorporate visual embeddings that may not receive the same level of scrutiny when it comes to safety measures. This lack of alignment can leave LVLMs vulnerable to attacks that exploit weaknesses in visual embeddings.

The FigStep Algorithm

FigStep converts prohibited textual content into images through typography, so the instruction reaches the model through the visual channel rather than as text. Because the attack is black-box, it needs only query access to the model, not its weights or gradients. The experimental results show an average attack success rate of 82.50% across six open-source LVLMs. Through ablation studies and an analysis of the distribution of semantic embeddings, the authors attribute this success to a deficiency of safety alignment for visual embeddings: the safety behavior inherited from the underlying LLM does not carry over to instructions that arrive as images.
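The 82.50% figure is an average over per-model attack success rates. A minimal sketch of that arithmetic follows, assuming one success/failure judgment per query and an unweighted mean across models; the summary does not state the paper's exact evaluation protocol, and the function names, model names, and judgments below are hypothetical toy values, not results from the paper.

```python
from statistics import mean


def attack_success_rate(judgments):
    """Percentage of queries judged as successful jailbreaks.

    `judgments` is a list of booleans, one per query (assumed protocol).
    """
    if not judgments:
        raise ValueError("need at least one judgment")
    return 100.0 * sum(judgments) / len(judgments)


def average_asr(per_model):
    """Unweighted mean of the per-model attack success rates."""
    return mean(attack_success_rate(j) for j in per_model.values())


# Hypothetical toy data: two made-up models, four queries each.
toy = {
    "model_a": [True, True, False, True],   # 3 of 4 succeed
    "model_b": [True, False, False, True],  # 2 of 4 succeed
}

for name, judgments in toy.items():
    print(f"{name}: {attack_success_rate(judgments):.2f}%")
print(f"average: {average_asr(toy):.2f}%")
```

An unweighted mean treats every model equally regardless of how many queries it received; if models were tested on different numbers of queries, pooling all judgments would give a different number.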

Comparison with Existing Jailbreak Techniques

To put FigStep's effectiveness in context, the study compares it against five text-only and four image-based jailbreak techniques. The authors report that FigStep has negligible attack cost and better attack performance than these baselines.

The Importance of SafeBench

One significant contribution of this work is the introduction of SafeBench - a comprehensive benchmark comprising 500 harmful questions generated through LLM-based dataset generation techniques. This benchmark provides researchers with a standardized set of tasks to evaluate their models' safety and security against various attacks, including those using cross-modality inputs.

Case Studies: LLaVA-v1.5 and MiniGPT4

To show how FigStep exploits weaknesses in LVLMs, this section highlights two of the open-source models the paper evaluates, LLaVA-v1.5 and MiniGPT4. The results show that FigStep bypasses the safety alignment of these models, which demonstrates their vulnerability to jailbreak attacks.

Conclusion

This paper introduces FigStep - a novel black-box jailbreak algorithm designed to exploit vulnerabilities in Large Vision-Language Models (LVLMs). Through extensive experiments and comparisons with existing techniques, the authors demonstrate its effectiveness in attacking multiple open-source LVLMs. The study also highlights the need for cross-modality safety alignment methods to enhance model security and introduces SafeBench - a comprehensive benchmark for evaluating LVLM safety and security. Overall, this work sheds light on the critical importance of addressing security concerns in LVLMs and provides valuable insights into developing robust defenses against potential adversarial attacks. The code and datasets associated with FigStep are available for further exploration on GitHub at https://github.com/ThuCCSLab/FigStep. This will enable researchers to build upon this work and continue to advance the safety and security of LVLMs.

Created on 07 Sep. 2025
