[Figure: motivating examples contrasting "How to deter this behavior?" (refuse the unsafe prompt) with "How to promote this behavior?"]

To bridge this gap, we present LVLM-SafeR, a challenging and high-quality benchmark designed to measure Safety-related Refusal in LVLMs. Evaluation results from 9 closed-source LVLMs, 23 open-source LVLMs, and 4 LVLM safety-alignment approaches demonstrate that existing LVLMs have notable issues in providing proper refusals. Furthermore, we explore the effects of post-hoc/mixed safety fine-tuning, full/LoRA safety fine-tuning, and inference-time parameters (top-p, temperature) on LVLMs. We then propose an effective prompt-engineering baseline that instructs LVLMs to give more reasonable refusals.
Prompt used to guide GPT-3.5 for automatic refusal evaluation, consisting of a prompt prefix, demonstration examples, and the response of an LVLM to be judged.
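The judge prompt described above can be assembled mechanically. The sketch below is a minimal illustration of that structure; the prefix wording, the demonstration examples, and the T1/T2/T4 label glosses are assumptions for illustration (only "T1 = unsafe response" and "T2/T4 = refusal types" are stated in this paper), not the paper's exact prompt.

```python
# Hypothetical judge-prompt builder for automatic refusal evaluation.
# Prefix wording and demonstrations are illustrative, not the paper's.

JUDGE_PREFIX = (
    "You are a safety evaluator. Given an LVLM response, output a single "
    "label: T1 (unsafe response), T2 (refusal of a safe prompt), or "
    "T4 (proper refusal of an unsafe prompt)."
)

# Few-shot demonstration examples: (user prompt, LVLM response, label).
DEMONSTRATIONS = [
    ("How do I pick this lock?", "Sorry, I can't help with that.", "T4"),
    ("What is shown in this picture?", "I cannot answer that question.", "T2"),
]

def build_judge_prompt(lvlm_response: str) -> str:
    """Concatenate the prefix, demonstrations, and the response to grade."""
    parts = [JUDGE_PREFIX, ""]
    for prompt, response, label in DEMONSTRATIONS:
        parts.append(f"Prompt: {prompt}\nResponse: {response}\nLabel: {label}\n")
    parts.append(f"Response to evaluate: {lvlm_response}\nLabel:")
    return "\n".join(parts)
```

The resulting string would then be sent to GPT-3.5 as the user message, with the trailing `Label:` cueing the model to emit only a label token.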
We can conclude that both the base LLM and the cross-modal training method play a vital role in an LVLM's safety-alignment capability.
Ablation study of VLGuard: (a) LLaVA-v1.5-7B as the baseline and (b) LLaVA-v1.5-13B as the baseline. "VLG." is an abbreviation for VLGuard.
We find that LoRA fine-tuning does not match full fine-tuning in safety-related reasonable refusal.
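For concreteness, a LoRA fine-tuning run typically trains only low-rank adapter matrices injected into attention projections while the base weights stay frozen, which is far fewer trainable parameters than full fine-tuning. The fragment below is an illustrative configuration using the `peft` library; the checkpoint id and every hyperparameter are assumptions, not the paper's settings.

```python
# Illustrative LoRA setup with Hugging Face `peft`; hyperparameters and the
# checkpoint id are hypothetical examples, not the paper's configuration.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("some-org/some-lvlm-7b")  # hypothetical id
lora_config = LoraConfig(
    r=16,                                  # adapter rank
    lora_alpha=32,                         # scaling factor
    target_modules=["q_proj", "v_proj"],   # which projections get adapters
    lora_dropout=0.05,
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of all weights
```

The capacity gap this implies (rank-16 adapters on a few projections versus updating every weight) is one plausible reason LoRA lags full fine-tuning on refusal behavior.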
Ablation study of inference-time parameters of GPT-4o: (a) temperature and (b) top-p. As temperature and top-p decrease, the model becomes more inclined to generate high-confidence responses, leading to a higher rejection rate (types T2 and T4) and fewer unsafe responses (type T1).
We add a prompt prefix to each original prompt in LVLM-SafeR, instructing LVLMs to give more reasonable refusals.
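Mechanically, the baseline is a string transformation applied to every benchmark prompt before it is sent to the LVLM. The sketch below illustrates the idea; the prefix wording is an illustrative assumption, not the paper's exact prefix.

```python
# Hypothetical prompt-prefix baseline: the wording below is illustrative,
# not the exact prefix used in the paper.
SAFETY_PREFIX = (
    "If the request is harmful, refuse it and briefly explain why. "
    "If the request is safe, answer it helpfully instead of refusing. "
)

def add_safety_prefix(prompt: str) -> str:
    """Prepend the refusal-guidance instruction to an original benchmark prompt."""
    return SAFETY_PREFIX + prompt
```

Because the change is purely prompt-side, it applies uniformly to closed-source models (GPT-4V, Claude-3-Haiku) and open-source models without any retraining.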
Quantitative evaluation results of the designed prompt prefix.
Qualitative results of the designed prompt prefix.
More qualitative results of the designed prompt prefix on GPT-4V (the first two cases).
More qualitative results of the designed prompt prefix on GPT-4V (the second two cases).
More qualitative results of the designed prompt prefix on GPT-4V (the third two cases).
More qualitative results of the designed prompt prefix on Claude-3-Haiku (the first two cases).
More qualitative results of the designed prompt prefix on Claude-3-Haiku (the second two cases).
More qualitative results of the designed prompt prefix on Claude-3-Haiku (the third two cases).