Title: Refined Policy Improvement Bounds for MDPs

Abstract: The policy improvement bound on the difference of discounted returns plays a crucial role in the theoretical justification of the trust-region policy optimization (TRPO) algorithm. The existing bound becomes degenerate as the discount factor approaches one, which calls into question the applicability of TRPO and related algorithms in that regime. We refine the results of Schulman et al. (2015) and Achiam et al. (2017) and propose a novel bound that is "continuous" in the discount factor. In particular, our bound also applies to MDPs with long-run average rewards.
Comments: Workshop on Reinforcement Learning Theory, ICML 2021
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
Cite as: arXiv:2107.08068 [cs.LG]
  (or arXiv:2107.08068v1 [cs.LG] for this version)
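
To illustrate the degeneracy the abstract refers to, the sketch below evaluates the penalty coefficient 4*eps*gamma/(1-gamma)^2 that multiplies the squared total-variation term in the policy improvement bound of Schulman et al. (2015), Theorem 1. This form is recalled from that paper, not stated on this page, and should be checked against the original. The function names, the default eps = 1 (the maximum absolute advantage), and the rescaling by (1 - gamma) are assumptions made only for illustration. The refined bound proposed in this paper is not reproduced here.

```python
import math


def trpo_penalty_coefficient(gamma: float, eps: float = 1.0) -> float:
    """Penalty coefficient 4*eps*gamma/(1-gamma)^2 (assumed form, see lead-in).

    gamma is the discount factor in [0, 1); eps stands for the maximum
    absolute advantage and is set to 1.0 purely for illustration.
    """
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must lie in [0, 1)")
    return 4.0 * eps * gamma / (1.0 - gamma) ** 2


def normalized_penalty_coefficient(gamma: float, eps: float = 1.0) -> float:
    """Same coefficient rescaled by (1 - gamma).

    The rescaling is an illustrative choice that puts discounted returns on
    the scale of average rewards; the coefficient still grows without bound.
    """
    return (1.0 - gamma) * trpo_penalty_coefficient(gamma, eps)


if __name__ == "__main__":
    for gamma in (0.9, 0.99, 0.999):
        raw = trpo_penalty_coefficient(gamma)
        scaled = normalized_penalty_coefficient(gamma)
        print(f"gamma={gamma}: raw={raw:.1f}, rescaled={scaled:.1f}")
```

Under these assumptions both coefficients grow without bound as gamma approaches one, so the lower bound they appear in becomes vacuous for any fixed policy change, which is the behavior a bound "continuous" in the discount factor is meant to avoid.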

Submission history

From: Mark Gluzman
[v1] Fri, 16 Jul 2021 18:22:30 GMT (29kb)
