The Wisdom of Hindsight Makes Language Models Better Instruction Followers

Zhang, Tianjun; Liu, Fangchen; Wong, Justin; Abbeel, Pieter; Gonzalez, Joseph E.

(Submitted on 10 Feb 2023)

Abstract: Reinforcement learning has seen wide success in finetuning large language models to better align with instructions via human feedback. This algorithm, Reinforcement Learning with Human Feedback (RLHF), demonstrates impressive performance on the GPT series models. However, the underlying Reinforcement Learning (RL) algorithm is complex and requires an additional training pipeline for reward and value networks. In this paper, we consider an alternative approach: converting feedback into an instruction by relabeling the original instruction and training the model for better alignment in a supervised manner. Such an algorithm doesn't require any additional parameters except for the original language model and maximally reuses the pretraining pipeline. To achieve this, we formulate the instruction alignment problem for language models as a goal-reaching problem in decision making. We propose Hindsight Instruction Relabeling (HIR), a novel algorithm for aligning language models with instructions. The resulting two-stage algorithm sheds light on a family of reward-free approaches that utilize instructions relabeled in hindsight based on feedback. We evaluate the performance of HIR extensively on 12 challenging BigBench reasoning tasks and show that HIR outperforms the baseline algorithms and is comparable to or even surpasses supervised finetuning.
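
The relabeling idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Sample` class, the two instruction templates, and the binary `correct` feedback flag are all assumptions made for illustration. The sketch only shows how feedback on a sampled output might be turned into a relabeled instruction so that the (instruction, query, output) triple becomes a valid supervised training example; sampling from the model and the finetuning step itself are omitted.

```python
from dataclasses import dataclass

# Hypothetical instruction templates (assumed for this sketch, not from the paper).
CORRECT_INSTRUCTION = "Give a correct answer to the question."
WRONG_INSTRUCTION = "Give a wrong answer to the question."


@dataclass(frozen=True)
class Sample:
    """One sampled model output plus feedback (hypothetical container)."""
    instruction: str  # instruction the model was originally given
    query: str        # task input
    output: str       # what the model actually produced
    correct: bool     # feedback: did the output satisfy the instruction?


def relabel(sample: Sample) -> Sample:
    """Pick, in hindsight, an instruction that the output does satisfy."""
    instruction = CORRECT_INSTRUCTION if sample.correct else WRONG_INSTRUCTION
    # After relabeling, the output is consistent with the instruction by construction.
    return Sample(instruction, sample.query, sample.output, True)


def build_supervised_dataset(samples):
    """Turn relabeled samples into (prompt, target) pairs for supervised training."""
    return [
        (f"{s.instruction}\n{s.query}", s.output)
        for s in map(relabel, samples)
    ]


if __name__ == "__main__":
    sampled = [
        Sample(CORRECT_INSTRUCTION, "What is 2 + 2?", "5", correct=False),
        Sample(CORRECT_INSTRUCTION, "What is 3 + 3?", "6", correct=True),
    ]
    for prompt, target in build_supervised_dataset(sampled):
        print(repr(prompt), "->", repr(target))
```

Under these assumptions no reward or value network is involved: the feedback only selects which instruction is paired with the output, and the resulting pairs can be fed to an ordinary supervised training loop.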

Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2302.05206 [cs.CL]
	(or arXiv:2302.05206v1 [cs.CL] for this version)

Submission history

From: Fangchen Liu
[v1] Fri, 10 Feb 2023 12:16:38 GMT (3177kb,D)
