Leveraging Domain Knowledge for Efficient Reward Modelling in RLHF: A Case-Study in E-Commerce Opinion Summarization

Nath, Swaroop; Siledar, Tejpalsingh; Muddu, Sankara Sri Raghava Ravindra; Rangaraju, Rupasai; Khadilkar, Harshad; Bhattacharyya, Pushpak; Banerjee, Suman; Patil, Amey; Singh, Sudhanshu Shekhar; Chelliah, Muthusamy; Garera, Nikesh

Full-text links:

Download:

Current browse context:

cs.CL

< prev | next >

new | recent | 2402

Computer Science > Computation and Language

Title: Leveraging Domain Knowledge for Efficient Reward Modelling in RLHF: A Case-Study in E-Commerce Opinion Summarization

Authors: Swaroop Nath, Tejpalsingh Siledar, Sankara Sri Raghava Ravindra Muddu, Rupasai Rangaraju, Harshad Khadilkar, Pushpak Bhattacharyya, Suman Banerjee, Amey Patil, Sudhanshu Shekhar Singh, Muthusamy Chelliah, Nikesh Garera

(Submitted on 23 Feb 2024 (v1), last revised 18 Apr 2024 (this version, v2))

Abstract: Reinforcement Learning from Human Feedback (RLHF) has become a dominating strategy in aligning Language Models (LMs) with human values/goals. The key to the strategy is learning a reward model ($\varphi$), which can reflect the latent reward model of humans. While this strategy has proven effective, the training methodology requires a lot of human preference annotation (usually in the order of tens of thousands) to train $\varphi$. Such a large-scale annotation is justifiable when it's a one-time effort, and the reward model is universally applicable. However, human goals are subjective and depend on the task, requiring task-specific preference annotations, which can be impractical to fulfill. To address this challenge, we propose a novel approach to infuse domain knowledge into $\varphi$, which reduces the amount of preference annotation required ($21\times$), omits Alignment Tax, and provides some interpretability. We validate our approach in E-Commerce Opinion Summarization, with a significant reduction in dataset size (to just $940$ samples) while advancing the SOTA ($\sim4$ point ROUGE-L improvement, $68\%$ of times preferred by humans over SOTA). Our contributions include a novel Reward Modeling technique and two new datasets: PromptOpinSumm (supervised data for Opinion Summarization) and OpinPref (a gold-standard human preference dataset). The proposed methodology opens up avenues for efficient RLHF, making it more adaptable to applications with varying human values. We release the artifacts (Code: github.com/efficient-rlhf. PromptOpinSumm: hf.co/prompt-opin-summ. OpinPref: hf.co/opin-pref) for usage under MIT License.

Comments:	19 pages, 6 figures, 21 tables
Subjects:	Computation and Language (cs.CL); Machine Learning (cs.LG)
Cite as:	arXiv:2402.15473 [cs.CL]
	(or arXiv:2402.15473v2 [cs.CL] for this version)

Submission history

From: Swaroop Nath [view email]
[v1] Fri, 23 Feb 2024 18:05:06 GMT (7890kb,D)
[v2] Thu, 18 Apr 2024 06:38:22 GMT (8085kb,D)

Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?)

Link back to: arXiv, form interface, contact.

> cs > arXiv:2402.15473

Download:

Current browse context:

Change to browse by:

References & Citations

DBLP - CS Bibliography

Bookmark

Computer Science > Computation and Language

Title: Leveraging Domain Knowledge for Efficient Reward Modelling in RLHF: A Case-Study in E-Commerce Opinion Summarization

Submission history