[Figure: motivating examples contrasting "How to deter this behavior?" (refuse the unsafe prompt) with "How to promote this behavior?"]

To bridge this gap, we present LVLM-SafeR, a challenging and high-quality benchmark designed to measure Safety-related Refusal in LVLMs. Evaluation results from 9 closed-source LVLMs, 23 open-source LVLMs, and 4 LVLM safety-alignment approaches demonstrate that existing LVLMs have notable issues in providing proper refusals. Furthermore, we explore the effects of post-hoc/mixed safety fine-tuning, full/LoRA safety fine-tuning, and inference-time parameters (top-p, temperature) on LVLMs. We then propose an effective prompt-engineering baseline that instructs LVLMs to give more reasonable refusals.
Prompt used to guide GPT-3.5 for automatic refusal evaluation, consisting of a prompt prefix, demonstration examples, and the response of an LVLM to be judged.
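The judge prompt described above can be assembled mechanically. The sketch below is a minimal illustration of that structure; the prefix wording, the demonstration examples, and the T1/T2/T4 label glosses are assumptions for illustration (only "T1 = unsafe response" and "T2/T4 = refusal types" are stated in this paper), not the paper's exact prompt.

```python
# Hypothetical judge-prompt builder for automatic refusal evaluation.
# Prefix wording and demonstrations are illustrative, not the paper's.

JUDGE_PREFIX = (
    "You are a safety evaluator. Given an LVLM response, output a single "
    "label: T1 (unsafe response), T2 (refusal of a safe prompt), or "
    "T4 (proper refusal of an unsafe prompt)."
)

# Few-shot demonstration examples: (user prompt, LVLM response, label).
DEMONSTRATIONS = [
    ("How do I pick this lock?", "Sorry, I can't help with that.", "T4"),
    ("What is shown in this picture?", "I cannot answer that question.", "T2"),
]

def build_judge_prompt(lvlm_response: str) -> str:
    """Concatenate the prefix, demonstrations, and the response to grade."""
    parts = [JUDGE_PREFIX, ""]
    for prompt, response, label in DEMONSTRATIONS:
        parts.append(f"Prompt: {prompt}\nResponse: {response}\nLabel: {label}\n")
    parts.append(f"Response to evaluate: {lvlm_response}\nLabel:")
    return "\n".join(parts)
```

The resulting string would then be sent to GPT-3.5 as the user message, with the trailing `Label:` cueing the model to emit only a label token.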
We can conclude that both the base LLM and the cross-modal training method play a vital role in an LVLM's safety-alignment capability.
Ablation study of VLGuard: (a) LLaVA-v1.5-7B as the baseline and (b) LLaVA-v1.5-13B as the baseline. "VLG." is an abbreviation for VLGuard.
We find that LoRA fine-tuning does not match full fine-tuning in safety-related reasonable refusal.
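For concreteness, a LoRA fine-tuning run typically trains only low-rank adapter matrices injected into attention projections while the base weights stay frozen, which is far fewer trainable parameters than full fine-tuning. The fragment below is an illustrative configuration using the `peft` library; the checkpoint id and every hyperparameter are assumptions, not the paper's settings.

```python
# Illustrative LoRA setup with Hugging Face `peft`; hyperparameters and the
# checkpoint id are hypothetical examples, not the paper's configuration.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("some-org/some-lvlm-7b")  # hypothetical id
lora_config = LoraConfig(
    r=16,                                  # adapter rank
    lora_alpha=32,                         # scaling factor
    target_modules=["q_proj", "v_proj"],   # which projections get adapters
    lora_dropout=0.05,
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of all weights
```

The capacity gap this implies (rank-16 adapters on a few projections versus updating every weight) is one plausible reason LoRA lags full fine-tuning on refusal behavior.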
Ablation study of inference-time parameters of GPT-4o: (a) temperature and (b) top-p. As temperature and top-p decrease, the model becomes more inclined to generate high-confidence responses, leading to a higher rejection rate (types T2 and T4) and fewer unsafe responses (type T1).
We add a prompt prefix to each original prompt in LVLM-SafeR, instructing LVLMs to give more reasonable refusals.
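Mechanically, the baseline is a string transformation applied to every benchmark prompt before it is sent to the LVLM. The sketch below illustrates the idea; the prefix wording is an illustrative assumption, not the paper's exact prefix.

```python
# Hypothetical prompt-prefix baseline: the wording below is illustrative,
# not the exact prefix used in the paper.
SAFETY_PREFIX = (
    "If the request is harmful, refuse it and briefly explain why. "
    "If the request is safe, answer it helpfully instead of refusing. "
)

def add_safety_prefix(prompt: str) -> str:
    """Prepend the refusal-guidance instruction to an original benchmark prompt."""
    return SAFETY_PREFIX + prompt
```

Because the change is purely prompt-side, it applies uniformly to closed-source models (GPT-4V, Claude-3-Haiku) and open-source models without any retraining.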
Quantitative evaluation results of the designed prompt prefix.
Qualitative results of the designed prompt prefix.
More qualitative results of the designed prompt prefix on GPT-4V (the first two cases).
More qualitative results of the designed prompt prefix on GPT-4V (the second two cases).
More qualitative results of the designed prompt prefix on GPT-4V (the third two cases).
More qualitative results of the designed prompt prefix on Claude-3-Haiku (the first two cases).
More qualitative results of the designed prompt prefix on Claude-3-Haiku (the second two cases).
More qualitative results of the designed prompt prefix on Claude-3-Haiku (the third two cases).