DPP-Based Adversarial Prompt Searching for Language Models

Zhang, Xu; Wan, Xiaojun

Title: DPP-Based Adversarial Prompt Searching for Language Models

Authors: Xu Zhang, Xiaojun Wan

(Submitted on 1 Mar 2024)

Abstract: Language models risk generating mindless and offensive content, which hinders their safe deployment. Therefore, it is crucial to discover and modify potential toxic outputs of pre-trained language models before deployment. In this work, we elicit toxic content by automatically searching for a prompt that directs pre-trained language models towards the generation of a specific target output. The problem is challenging due to the discrete nature of textual data and the considerable computational resources required for a single forward pass of the language model. To combat these challenges, we introduce Auto-regressive Selective Replacement Ascent (ASRA), a discrete optimization algorithm that selects prompts based on both quality and similarity using a determinantal point process (DPP). Experimental results on six different pre-trained language models demonstrate the efficacy of ASRA for eliciting toxic content. Furthermore, our analysis reveals a strong correlation between the success rate of ASRA attacks and the perplexity of target outputs, while indicating limited association with the number of model parameters.
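
The abstract says candidate prompts are selected by quality and similarity with a DPP, but gives no details. The sketch below is not the authors' ASRA implementation. It only illustrates the general idea of DPP-based subset selection under stated assumptions: a kernel of the common quality-diversity form L = diag(q) S diag(q), cosine similarity for S, and a naive greedy search for a high-determinant subset. The function names, the toy quality scores, and the toy feature vectors are all hypothetical.

```python
import numpy as np


def dpp_kernel(quality, features):
    """Build an assumed quality-diversity kernel L = diag(q) S diag(q).

    quality:  shape (n,), a positive score per candidate (hypothetical).
    features: shape (n, d), an embedding per candidate (hypothetical).
    S is the cosine similarity between candidate embeddings.
    """
    feats = features / np.linalg.norm(features, axis=1, keepdims=True)
    similarity = feats @ feats.T
    return quality[:, None] * similarity * quality[None, :]


def greedy_dpp_select(kernel, k):
    """Greedily pick up to k items, each time adding the item that
    maximizes log det of the kernel restricted to the chosen set."""
    selected = []
    remaining = list(range(kernel.shape[0]))
    for _ in range(k):
        best_item, best_logdet = None, -np.inf
        for i in remaining:
            idx = selected + [i]
            sign, logdet = np.linalg.slogdet(kernel[np.ix_(idx, idx)])
            if sign > 0 and logdet > best_logdet:
                best_item, best_logdet = i, logdet
        if best_item is None:  # no candidate keeps the determinant positive
            break
        selected.append(best_item)
        remaining.remove(best_item)
    return selected


# Toy data: candidates 0 and 1 are near-duplicates with high quality,
# candidate 2 has lower quality but points in a different direction.
quality = np.array([1.0, 0.9, 0.5])
features = np.array([[1.0, 0.0], [1.0, 0.01], [0.0, 1.0]])

picked = greedy_dpp_select(dpp_kernel(quality, features), k=2)
print(picked)
```

On this toy input the selection keeps candidate 0 and then prefers the dissimilar candidate 2 over the higher-quality near-duplicate 1, which is the trade-off between quality and similarity that the abstract refers to. The naive loop recomputes a determinant per candidate and is meant for readability, not efficiency.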

Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2403.00292 [cs.CL]
	(or arXiv:2403.00292v1 [cs.CL] for this version)

Submission history

From: Xu Zhang [view email]
[v1] Fri, 1 Mar 2024 05:28:06 GMT (7483kb,D)
