A Causal Explainable Guardrails for Large Language Models

Chu, Zhixuan; Wang, Yan; Li, Longfei; Wang, Zhibo; Qin, Zhan; Ren, Kui

(Submitted on 7 May 2024)

Abstract: Large Language Models (LLMs) have shown impressive performance in natural language tasks, but their outputs can exhibit undesirable attributes or biases. Existing methods for steering LLMs towards desired attributes often assume unbiased representations and rely solely on steering prompts. However, the representations learned from pre-training can introduce semantic biases that influence the steering process, leading to suboptimal results. We propose LLMGuardaril, a novel framework that incorporates causal analysis and adversarial learning to obtain unbiased steering representations in LLMs. LLMGuardaril systematically identifies and blocks the confounding effects of biases, enabling the extraction of unbiased steering representations. Additionally, it includes an explainable component that provides insights into the alignment between the generated output and the desired direction. Experiments demonstrate LLMGuardaril's effectiveness in steering LLMs towards desired attributes while mitigating biases. Our work contributes to the development of safe and reliable LLMs that align with desired attributes. We discuss the limitations and future research directions, highlighting the need for ongoing research to address the ethical implications of large language models.

Comments:	23 pages
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2405.04160 [cs.CL]
	(or arXiv:2405.04160v1 [cs.CL] for this version)

Submission history

From: Yan Wang
[v1] Tue, 7 May 2024 09:55:05 GMT (2268kb,D)
