Boosting Jailbreak Attack with Momentum

Zhang, Yihao; Wei, Zeming

(Submitted on 2 May 2024)

Abstract: Large Language Models (LLMs) have achieved remarkable success across diverse tasks, yet they remain vulnerable to adversarial attacks, notably the well-documented jailbreak attack. Recently, the Greedy Coordinate Gradient (GCG) attack has demonstrated efficacy in exploiting this vulnerability by optimizing adversarial prompts through a combination of gradient heuristics and greedy search. However, the efficiency of this attack has become a bottleneck in the attacking process. To mitigate this limitation, in this paper we rethink the generation of adversarial prompts through an optimization lens, aiming to stabilize the optimization process and harness more heuristic insights from previous iterations. Specifically, we introduce the Momentum Accelerated GCG (MAC) attack, which incorporates a momentum term into the gradient heuristic. Experimental results showcase the notable enhancement achieved by MAC in gradient-based attacks on aligned language models. Our code is available at this https URL

Comments:	ICLR 2024 Workshop on Reliable and Responsible Foundation Models
Subjects:	Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Optimization and Control (math.OC)
Cite as:	arXiv:2405.01229 [cs.LG]
	(or arXiv:2405.01229v1 [cs.LG] for this version)

Submission history

From: Yihao Zhang
[v1] Thu, 2 May 2024 12:18:14 GMT (63kb)
