Policy Evaluation Networks

Harb, Jean; Schaul, Tom; Precup, Doina; Bacon, Pierre-Luc

Full-text links:

Download:

Current browse context:

cs.LG

< prev | next >

new | recent | 2002

Computer Science > Machine Learning

Title: Policy Evaluation Networks

Authors: Jean Harb, Tom Schaul, Doina Precup, Pierre-Luc Bacon

(Submitted on 26 Feb 2020)

Abstract: Many reinforcement learning algorithms use value functions to guide the search for better policies. These methods estimate the value of a single policy while generalizing across many states. The core idea of this paper is to flip this convention and estimate the value of many policies, for a single set of states. This approach opens up the possibility of performing direct gradient ascent in policy space without seeing any new data. The main challenge for this approach is finding a way to represent complex policies that facilitates learning and generalization. To address this problem, we introduce a scalable, differentiable fingerprinting mechanism that retains essential policy information in a concise embedding. Our empirical results demonstrate that combining these three elements (learned Policy Evaluation Network, policy fingerprints, gradient ascent) can produce policies that outperform those that generated the training data, in zero-shot manner.

Comments:	12 pages, 11 figures
Subjects:	Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Cite as:	arXiv:2002.11833 [cs.LG]
	(or arXiv:2002.11833v1 [cs.LG] for this version)

Submission history

From: Jean Harb [view email]
[v1] Wed, 26 Feb 2020 23:00:27 GMT (299kb,D)

Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?)

Link back to: arXiv, form interface, contact.

> cs > arXiv:2002.11833

Download:

Current browse context:

Change to browse by:

References & Citations

DBLP - CS Bibliography

Bookmark

Computer Science > Machine Learning

Title: Policy Evaluation Networks

Submission history