Don't Always Say No to Me: Benchmarking Safety-Related Refusal in Large VLM

Xin Liu1,2,*, Zhichen Dong2,*, Zhanhui Zhou2, Yichen Zhu3, Yunshi Lan1,†, Chao Yang2,†,
1East China Normal University, 2Shanghai AI Laboratory, 3University of Toronto
*Contribute equally, Corresponding author
Two demo samples

Four unsafe/safe prompt-image pairs from two control groups, and the corresponding responses from GPT-4V.

Abstract

Warning: this paper contains example data that may be offensive or harmful. Although many existing evaluation datasets have been proposed to assess the safety of Large Vision-Language Models (LVLMs) on malicious prompt-image pairs, the research community lacks a systematic investigation into LVLMs' reasonable refusal toward both safe and unsafe pairs. We define a control group consisting of an unsafe prompt-image pair and a safe pair, in which these two pairs share the same prompt or image. In a control group, an LVLM shows reasonable refusal if it refuses the former pair and responds to the latter. Otherwise, the model displays false refusal, such as refusing both pairs or none. For example, a control group contains an image depicting violent behavior and two prompts based on the same visual information. An LVLM should respond to the safe prompt How to deter this behavior? and refuse the unsafe prompt How to promote this behavior?. To bridge this gap, we present LVLM-SafeR, a challenging and high-quality benchmark designed to measure Safety-related Refusal in LVLMs. The evaluation results from 9 closed-source LVLMs, 23 open-source LVLMs and 4 LVLM safety alignment approaches demonstrate that existing LVLMs have notable issues in providing proper refusals. Furthermore, we explore the effects of post-hoc/mixed safety fine-tuning, full/LoRA safety fine-tuning, and inference-time parameters (top-p, temperature) on LVLMs. Then we propose an effective prompt-engineering baseline to instruct LVLMs to give more reasonable refusals.

Taxonomy of LVLM-SafeR with Concrete Samples

Taxonomy of LVLM-SafeR with concrete samples

LVLM-SafeR Analysis

Benchmark Analysis

Main Experimental Results

exp_eval_prompt

Prompt to guide GPT-3.5 for automatic refusal evaluation, which contains a prompt prefix, demonstration examples and the response of an LVLM.

Main Results on Closed-source LVLMs and Alignment Techniques

Main_results_1

Main Results on Open-source LVLMs

Main_results_open_source_lvlms

We can conclude that both base LLMs and cross-modal training methods play a vital role in LVLMs' safety alignment capability.

More Experimental Analysis

exp_vlguard_ablation

Ablation study of VLGuard: (a) LLaVA-v1.5-7B as baseline and (b) LLaVA-v1.5-13B as baseline. The word VLG.is the abbreviation for VLGuard. We can find that LoRA fine-tuning does not reach comparable capability of safety-related reasonable refusal as full fine-tuning.

exp_temp_topp_ablation

Ablation study of inference-time parameters of GPT-4o: (a) temperature and (b) top-p. It can be observed that as temperature and top-p decrease, the model becomes more inclined to generate responses with higher confidence levels, leading to a higher rejection rate (type T2 and T4) and fewer unsafe responses (type T1).

A Baseline for More Reasonable Refusal

exp_prompt_prefix

We add a prompt prefix to each original prompt in LVLM-SafeR, instructing LVLMs to give more reasonable refusals.

exp_prompt_prefix_result

Quantitative evaluation results of the designed prompt prefix.

exp_prompt_prefix_two_demo_samples

Qualitative results of the designed prompt prefix.

appendix_prompt_prefix_openai_1

More qualitative results of the designed prompt prefix on GPT-4V (the first two cases).

appendix_prompt_prefix_openai_2

More qualitative results of the designed prompt prefix on GPT-4V (the second two cases).

appendix_prompt_prefix_openai_3

More qualitative results of the designed prompt prefix on GPT-4V (the third two cases).

appendix_prompt_prefix_claude_1

More qualitative results of the designed prompt prefix on Claude-3-Haiku (the first two cases).

appendix_prompt_prefix_claude_2

More qualitative results of the designed prompt prefix on Claude-3-Haiku (the second two cases).

appendix_prompt_prefix_claude_3

More qualitative results of the designed prompt prefix on Claude-3-Haiku (the third two cases).