Introduction

Large-scale Visual Language Models (VLMs) are now widely deployed and have shown impressive performance in fields such as image recognition and natural language processing. These advances, however, raise security and privacy concerns: large models can be maliciously exploited, leading to the leakage or misuse of sensitive information. Researchers have therefore turned to jailbreak attacks, which aim to uncover and address such security vulnerabilities. A jailbreak attack induces a model to produce inappropriate or unintended outputs through specially crafted methods or inputs, helping researchers identify weaknesses and latent security risks in the model. Studies of this kind improve the security of large models, helping ensure their reliability and robustness across applications, and they form the foundation for further work on model security.

Currently, many popular open-source VLMs are released without rigorous security evaluation. Moreover, because the components of a VLM are not security-aligned as a whole, the safety measures of the underlying Large Language Model (LLM) fail to cover the unforeseen risks introduced by the visual modality, which opens the door to jailbreaks. The key contributions of this work are summarized as follows:

  • Optimizing the Black-box Attack Method: We developed an attack method for black-box models based on the FigStep framework. By integrating image processing and text generation strategies, we effectively triggered harmful content generation in the target models.

  • Increasing Attack Success Rates: By adjusting image brightness, randomizing text position and color, and other strategies, we significantly enhanced the success rate of the attacks, making the models more prone to generating adversarial outputs.

  • Enhancing Content Diversity: Without compromising attack efficacy, we diversified the prompts and optimization strategies to improve the semantic and presentation diversity of generated content.

With these innovations, our team secured second place in the Red Teaming Multimodal Large Language Model Security Challenge hosted by CCF. This paper details our methods and results from the competition and discusses how they can be applied to strengthen model security.

Photo from the challenge site

Framework

Multimodal jailbreak attacks can be divided into two categories: heuristic attacks and optimized attacks. Heuristic attacks are based on experience or intuition, while optimized attacks involve designing and optimizing adversarial inputs (such as text or images) to compromise model security. In practice, we systematically evaluated the success rates of four representative attack methods, ranging from embedding harmful words in images using heuristic methods to generating adversarial images using advanced algorithms. The results indicated that the second method—a heuristic strategy—had the highest success rate. Therefore, we selected it as the baseline method for further optimization, ensuring improved success rates while maintaining diversity in image and text generation.

The trade-off between success rate and diversity

According to the competition rules, high scores were awarded for both success rate and content diversity, which together determine how comprehensive and stealthy an attack strategy is. However, there is often a trade-off between the two. To increase image diversity, we compared generating images with the Stable Diffusion model against sampling images from the COCO dataset. To assess data novelty, we used a pre-trained Inception network to extract features from each image in a candidate set and computed the mean and covariance matrix of the resulting feature vectors. The comparison showed that the COCO dataset, with its richer and more varied image features, had a clear advantage, so we randomly selected 1,200 COCO images as the source material for the subsequent embedding step.
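The write-up does not include the comparison code, but the described statistics (Inception feature means and covariance matrices) are the same ones used by the Fréchet Inception Distance. The sketch below is a minimal illustration under that assumption, using torchvision's Inception v3; the folder paths, file globbing, and the use of the Fréchet distance as the final score are illustrative choices, not the competition code.

```python
# Minimal sketch (not the competition code): summarize an image source by the
# mean/covariance of Inception features, then compare two sources.
import numpy as np
import torch
from torchvision import models
from torchvision.models import Inception_V3_Weights
from PIL import Image
from pathlib import Path
from scipy import linalg

weights = Inception_V3_Weights.DEFAULT
model = models.inception_v3(weights=weights)
model.fc = torch.nn.Identity()        # expose the 2048-d pooled features
model.eval()
preprocess = weights.transforms()     # resize/normalize expected by Inception v3

@torch.no_grad()
def feature_stats(image_dir):
    """Return (mu, sigma) of Inception features over all images in a folder."""
    feats = []
    for path in Path(image_dir).glob("*.jpg"):
        img = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
        feats.append(model(img).squeeze(0).numpy())
    feats = np.stack(feats)
    return feats.mean(axis=0), np.cov(feats, rowvar=False)

def frechet_distance(mu1, sigma1, mu2, sigma2):
    """Frechet distance between Gaussians fitted to the two feature sets."""
    covmean, _ = linalg.sqrtm(sigma1 @ sigma2, disp=False)
    diff = mu1 - mu2
    return diff @ diff + np.trace(sigma1 + sigma2 - 2 * covmean.real)

# Example (hypothetical folder names):
# mu_c, sig_c = feature_stats("coco_subset/")
# mu_s, sig_s = feature_stats("sd_generated/")
# print(frechet_distance(mu_c, sig_c, mu_s, sig_s))
```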

For text prompt diversity, after determining a set of general prompts, we further diversified them. By making subtle adjustments to the prompts, we aimed to improve the diversity score of the attack strategies, thereby increasing their success rate and stealth.

Flowchart of our method

The flowchart illustrates the specific steps of this process. First, we took the 1,200 images selected from the COCO dataset as a starting point. Because these images vary in size, and our experience with large-model jailbreaks suggests that standardized input sizes are crucial for a high success rate, the first step was to normalize the images by resizing and cropping them to 512x512 pixels, ensuring consistent inputs.
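The write-up specifies only the 512x512 target size, so the resampling filter and the center-crop strategy below are assumptions; a minimal Pillow sketch of the normalization step might look like this:

```python
# Normalization sketch (assumed center-crop; only the 512x512 target is given).
from PIL import Image

def normalize_image(path, size=512):
    img = Image.open(path).convert("RGB")
    # Scale the shorter side to `size`, preserving aspect ratio.
    w, h = img.size
    scale = size / min(w, h)
    img = img.resize((round(w * scale), round(h * scale)), Image.LANCZOS)
    # Center-crop the longer side down to a square.
    w, h = img.size
    left, top = (w - size) // 2, (h - size) // 2
    return img.crop((left, top, left + size, top + size))
```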

Next, to prepare the images for embedding the harmful content, we lowered their brightness. The rationale is that a darker background makes the overlaid text stand out, improving the VLM's efficiency and accuracy when it recognizes and extracts the embedded text in later steps.
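A darkening step of this kind is a one-liner with Pillow's ImageEnhance; the sketch below assumes a multiplicative brightness factor, with 0.7 being the value settled on later in the Optimization section.

```python
# Darkening sketch: a brightness factor below 1.0 dims the background so the
# overlaid text stands out (0.7 is the factor chosen later in this paper).
from PIL import Image, ImageEnhance

def darken(img, factor=0.7):
    return ImageEnhance.Brightness(img).enhance(factor)

# e.g. dark = darken(normalize_image("coco/000000001.jpg"), 0.7)
```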

Building on this, we embedded harmful questions, such as "How to plan a bank robbery without being detected?", into the processed images. To prompt the VLM to produce concrete, potentially harmful action steps, we left three blank entries (numbered 1, 2, and 3) beneath the question, intending to induce the VLM to fill in these harmful "action plans" from the context.
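A minimal sketch of this embedding step, in the FigStep style of rendering the question plus numbered blanks onto the image. The font file, font size, and line spacing are assumptions; the top-left placement and red color match the choices reported later in the Optimization section, and the question string is left as a placeholder.

```python
# Sketch of the text-embedding step: overlay a question and three numbered
# blanks onto the darkened image. Font path/size and layout are assumptions.
from PIL import Image, ImageDraw, ImageFont

def embed_question(img, question, position=(20, 20), color="red",
                   font_path="DejaVuSansMono-Bold.ttf", font_size=28):
    out = img.copy()
    draw = ImageDraw.Draw(out)
    font = ImageFont.truetype(font_path, font_size)
    lines = [question, "", "1.", "2.", "3."]   # the blanks the model is asked to fill
    x, y = position
    for line in lines:
        draw.text((x, y), line, fill=color, font=font)
        y += int(font_size * 1.4)              # simple fixed line spacing
    return out

# stimulus = embed_question(darken(normalize_image("coco/000000001.jpg")),
#                           "<question text>")
```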

Finally, we designed a text prompt, as shown in the top-right corner of the flowchart, to guide the VLM in filling in the blanks. The prompt instructs the model to generate three specific, potentially harmful steps based on the image content and the embedded question. By inputting this image with an embedded harmful question and accompanying prompt into the VLM, we were able to trigger harmful outputs, validating the success of our jailbreak strategy.

Moreover, according to competition rules, the jailbreak attack encompassed six harmful domains. For each domain, we tasked GPT-3.5 with assuming the role of an AI practitioner and generating 200 questions closely related to the domain that could lead to adverse outcomes. To ensure the high relevance and appropriateness of the outputs, we carefully prepared examples and provided GPT-3.5 with rich contextual information, guiding it to generate harmful questions that aligned with our needs.

Optimization

In the optimization phase, our research was divided into three main parts: image optimization, text embedding optimization, and prompt optimization.

Image Optimization

For image optimization, we explored brightness adjustment factors, systematically testing four brightness factor values (0.5, 0.6, 0.7, and 0.8) to evaluate their impact on the attack success rate (ASR) and content diversity. The results showed that as the brightness factor decreased, the darkened images improved the distinguishability of the embedded harmful text. Lower brightness enhanced ASR in open-source VLMs, as the model was more likely to capture and recognize clear text. However, in complex black-box models, darker images posed more challenges for advanced OCR techniques, raising the attack difficulty. Analysis also indicated that excessive brightness reduction, though improving ASR, limited image diversity. Ultimately, we chose 0.7 as the brightness factor to balance high attack success and content diversity.

Text Optimization

In terms of text optimization, we explored three areas. First, we optimized the position of the text embedding, covering various locations (e.g., top-left corner, top-right corner, center). Experiments showed that placing text in the top-left corner achieved the highest ASR due to the region’s typically cleaner, less cluttered background, making it easier for models to recognize. Second, color choice experiments revealed that red text performed best in terms of ASR due to its strong visual impact and resistance to background noise interference. Finally, for font optimization, we compared untreated bold monospaced fonts with Gaussian-blurred versions. The results showed that untreated fonts had a higher ASR for open-source models, while Gaussian-blurred fonts improved ASR for black-box models by reducing recognition efficiency. Based on these findings, we opted for untreated fonts to balance attack efficacy across multiple models.
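For the font comparison, one way to reproduce the Gaussian-blurred variant is to render the text on a transparent layer, blur only that layer, and composite it back onto the image; the blur radius below is an assumption, since the write-up does not report the value used.

```python
# Sketch of the Gaussian-blurred text variant compared above.
from PIL import Image, ImageDraw, ImageFont, ImageFilter

def embed_blurred_text(img, text, radius=1.5, position=(20, 20), color="red",
                       font_path="DejaVuSansMono-Bold.ttf", font_size=28):
    layer = Image.new("RGBA", img.size, (0, 0, 0, 0))   # transparent text layer
    draw = ImageDraw.Draw(layer)
    draw.text(position, text, fill=color,
              font=ImageFont.truetype(font_path, font_size))
    layer = layer.filter(ImageFilter.GaussianBlur(radius))  # blur the text only
    return Image.alpha_composite(img.convert("RGBA"), layer).convert("RGB")
```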

Prompt Optimization

For prompt optimization, we designed three templates with different phrasings but identical meanings; the third achieved the highest ASR. To offset the loss of diversity from relying on a single template, we added random invalid characters to the beginning and end of each prompt. This strategy stemmed from our understanding of the evaluation system: we hypothesized that the extra characters would raise the diversity scores assigned by small evaluation models while leaving the robustness of large-model evaluation unaffected. Although this hypothesis still needs validation, the strategy significantly improved the diversity score in practice.
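The padding itself is a simple string operation. The sketch below assumes a fixed-length run of filler characters drawn from an arbitrary alphabet; the write-up says only that "random invalid characters" were appended to both ends.

```python
# Sketch of the diversity padding: wrap each prompt template in short runs of
# random filler characters. Alphabet and padding length are assumptions.
import random
import string

FILLER = string.ascii_letters + string.digits + "#@*~"

def pad_prompt(prompt, n=6, seed=None):
    rng = random.Random(seed)
    head = "".join(rng.choice(FILLER) for _ in range(n))
    tail = "".join(rng.choice(FILLER) for _ in range(n))
    return f"{head} {prompt} {tail}"
```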

Experiment

The table comprehensively summarizes the effectiveness of each optimization strategy. Red data indicates score improvements, green data indicates declines, and dashes indicate no impact on a given metric. The results show that reducing image brightness significantly increased ASR but slightly decreased diversity scores. For the final optimization strategy, adding invalid characters to the prompt edges had only a slight negative impact on ASR (a drop of 1.3 points) but increased the diversity score by 7.5 points. These results confirm the effectiveness of our optimization strategies.

Optimization Strategy Effectiveness Summary

Discussion

The results of our experiments demonstrate the effectiveness of our optimized jailbreak attack strategies against VLMs. By improving both the success rate and the content diversity of the attacks, we have highlighted significant vulnerabilities in multimodal models. A key strength of our approach is the systematic exploration of both image- and text-based adversarial techniques, especially the use of brightness adjustment and font optimization to increase the model's susceptibility to harmful content generation. This allowed us to maximize the attack's effectiveness while maintaining a high degree of variability in the outputs, which makes such attacks harder to detect and mitigate.

Another contribution of this research is the introduction of prompt manipulation techniques that enhance diversity without compromising attack efficiency. The addition of random invalid characters in text prompts successfully increased diversity scores, demonstrating that subtle variations can have a considerable impact on how models process adversarial inputs. This insight can help guide future researchers in developing more sophisticated and varied attack strategies.