Improved Regret Bound and Experience Replay in Regularized Policy Iteration

Lazic, Nevena; Yin, Dong; Abbasi-Yadkori, Yasin; Szepesvari, Csaba

(Submitted on 25 Feb 2021)

Abstract: In this work, we study algorithms for learning in infinite-horizon undiscounted Markov decision processes (MDPs) with function approximation. We first show that the regret analysis of the Politex algorithm (a version of regularized policy iteration) can be sharpened from $O(T^{3/4})$ to $O(\sqrt{T})$ under nearly identical assumptions, and instantiate the bound with linear function approximation. Our result provides the first high-probability $O(\sqrt{T})$ regret bound for a computationally efficient algorithm in this setting. The exact implementation of Politex with neural network function approximation is inefficient in terms of memory and computation. Since our analysis suggests that we need to approximate the average of the action-value functions of past policies well, we propose a simple efficient implementation where we train a single Q-function on a replay buffer with past data. We show that this often leads to superior performance over other implementation choices, especially in terms of wall-clock time. Our work also provides a novel theoretical justification for using experience replay within policy iteration algorithms.
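As an illustration of why the abstract emphasizes the average of past action-value functions, here is a minimal sketch, not the authors' implementation. It assumes the Politex-style update in which the next policy is a softmax over the sum of all past Q-estimates, written here with a reward convention (higher Q is better). The learning rate `eta`, the function names, and the toy Q-values are all hypothetical. Since the sum equals the number of past policies times their average, one estimate of the average Q-function (for example, a single network trained on a replay buffer of past data) would be enough to form the same policy.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def policy_from_sum(q_history, eta):
    """Policy at one state from the SUM of past Q-estimates.

    q_history: one list of per-action Q-values for each past policy
    (hypothetical toy data; the exact version stores every past Q-function).
    """
    num_actions = len(q_history[0])
    q_sum = [sum(q[a] for q in q_history) for a in range(num_actions)]
    return softmax([eta * v for v in q_sum])


def policy_from_average(q_avg, num_policies, eta):
    """Same policy, built from a single estimate of the AVERAGE Q-function."""
    return softmax([eta * num_policies * v for v in q_avg])


# Toy Q-values for three past policies at a single state with three actions.
q_history = [[1.0, 0.0, 0.5], [0.8, 0.2, 0.4], [0.9, 0.1, 0.6]]
eta = 0.5  # assumed learning rate, chosen only for illustration

exact = policy_from_sum(q_history, eta)
q_avg = [sum(q[a] for q in q_history) / len(q_history) for a in range(3)]
from_avg = policy_from_average(q_avg, len(q_history), eta)
```

In this toy setting `exact` and `from_avg` agree up to floating-point error. In practice a learned average is only an approximation, so the two policies would differ by whatever error that approximation introduces.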

Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Cite as: arXiv:2102.12611 [cs.LG]
(or arXiv:2102.12611v1 [cs.LG] for this version)

Submission history

From: Dong Yin
[v1] Thu, 25 Feb 2021 00:55:07 GMT (2563kb,D)
