A short variational proof of equivalence between policy gradients and soft Q learning

Richemond, Pierre H.; Maginnis, Brendan

(Submitted on 22 Dec 2017)

Abstract: Two main families of reinforcement learning algorithms, Q-learning and policy gradients, have recently been proven to be equivalent when using a softmax relaxation for the former and an entropic regularization for the latter. We relate this result to the well-known convex duality of Shannon entropy and the softmax function, a duality also known as the Donsker-Varadhan formula. This provides a short proof of the equivalence. We then interpret this duality further, and use ideas from convex analysis to prove a new policy inequality relative to soft Q-learning.
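The duality the abstract refers to can be checked numerically on a single state. The sketch below is not taken from the paper: the temperature `tau`, the toy vector `q_values`, and all function names are illustrative assumptions. It uses the standard identity that tau * log sum exp(q / tau) equals the maximum over distributions p of <p, q> + tau * H(p), with the maximum attained at p = softmax(q / tau).

```python
import math

def softmax(q, tau=1.0):
    """Softmax of q / tau, shifted by max(q) for numerical stability."""
    m = max(q)
    weights = [math.exp((x - m) / tau) for x in q]
    total = sum(weights)
    return [w / total for w in weights]

def log_sum_exp(q, tau=1.0):
    """Soft maximum tau * log(sum(exp(q / tau))), computed stably."""
    m = max(q)
    return m + tau * math.log(sum(math.exp((x - m) / tau) for x in q))

def entropy(p):
    """Shannon entropy of a discrete distribution, in nats."""
    return -sum(x * math.log(x) for x in p if x > 0.0)

def regularized_value(p, q, tau=1.0):
    """Entropy-regularized objective <p, q> + tau * H(p)."""
    return sum(pi * qi for pi, qi in zip(p, q)) + tau * entropy(p)

# Hypothetical action values for one state (illustrative numbers only).
q_values = [1.0, 2.0, 0.5]
tau = 0.7

pi_star = softmax(q_values, tau)          # candidate maximizer
soft_value = log_sum_exp(q_values, tau)   # soft maximum of q_values

# At the softmax policy the regularized objective matches the soft maximum.
gap_at_optimum = abs(regularized_value(pi_star, q_values, tau) - soft_value)

# Any other policy, here the uniform one, falls short by tau * KL(p || pi_star).
uniform = [1.0 / len(q_values)] * len(q_values)
gap_uniform = soft_value - regularized_value(uniform, q_values, tau)
```

Here `gap_at_optimum` should vanish up to floating-point error, while `gap_uniform` should be strictly positive, since the shortfall of a policy equals `tau` times its KL divergence from the softmax policy.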

Subjects:	Machine Learning (cs.LG)
Cite as:	arXiv:1712.08650 [cs.LG]
	(or arXiv:1712.08650v1 [cs.LG] for this version)

Submission history

From: Pierre Richemond
[v1] Fri, 22 Dec 2017 20:37:16 GMT (36kb)
