Near Sample-Optimal Reduction-based Policy Learning for Average Reward MDP

Wang, Jinghan; Wang, Mengdi; Yang, Lin F.

(Submitted on 1 Dec 2022)

Abstract: This work considers the sample complexity of obtaining an $\varepsilon$-optimal policy in an average reward Markov Decision Process (AMDP), given access to a generative model (simulator). When the ground-truth MDP is weakly communicating, we prove an upper bound of $\widetilde O(H \varepsilon^{-3} \ln \frac{1}{\delta})$ samples per state-action pair, where $H := sp(h^*)$ is the span of the bias of any optimal policy, $\varepsilon$ is the accuracy, and $\delta$ is the failure probability. This bound improves on the best-known mixing-time-based approaches in [Jin & Sidford 2021], which assume that the mixing time of every deterministic policy is bounded. The core of our analysis is a proper reduction bound from AMDP problems to discounted MDP (DMDP) problems, which may be of independent interest since it allows the application of DMDP algorithms to AMDPs in other settings. We complement our upper bound by proving a minimax lower bound of $\Omega(|\mathcal S| |\mathcal A| H \varepsilon^{-2} \ln \frac{1}{\delta})$ total samples, showing that a linear dependence on $H$ is necessary and that our upper bound matches the lower bound in all parameters of $(|\mathcal S|, |\mathcal A|, H, \ln \frac{1}{\delta})$ up to logarithmic factors.
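
The reduction idea in the abstract can be illustrated with a minimal Python sketch: draw samples from a generative model, solve the resulting discounted MDP, and evaluate the greedy policy under the average-reward criterion. Everything below is an assumption made for illustration and is not taken from the abstract: the two-state toy MDP, the discount factor $\gamma = 1 - \varepsilon/H$, the plug-in value-iteration solver, the sample count with constants and logarithmic factors dropped, and all function names. It is not the authors' algorithm.

```python
import math
import numpy as np


def estimate_transitions(P, n, rng):
    """Estimate P[s, a, :] from n generative-model draws per state-action pair."""
    S, A, _ = P.shape
    P_hat = np.zeros_like(P)
    for s in range(S):
        for a in range(A):
            draws = rng.choice(S, size=n, p=P[s, a])
            P_hat[s, a] = np.bincount(draws, minlength=S) / n
    return P_hat


def discounted_greedy_policy(P, R, gamma, tol=1e-10, max_iters=100_000):
    """Value iteration on the discounted MDP; returns the greedy policy."""
    V = np.zeros(P.shape[0])
    for _ in range(max_iters):
        Q = R + gamma * (P @ V)  # (S, A, S) @ (S,) -> (S, A)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    return Q.argmax(axis=1)


def average_reward(P, R, policy):
    """Long-run average reward of a stationary policy (assumes a unichain policy)."""
    S = P.shape[0]
    P_pi = P[np.arange(S), policy]  # (S, S) transition matrix under the policy
    r_pi = R[np.arange(S), policy]  # (S,) reward vector under the policy
    # Solve mu P_pi = mu together with sum(mu) = 1 for the stationary distribution.
    lhs = np.vstack([P_pi.T - np.eye(S), np.ones(S)])
    rhs = np.zeros(S + 1)
    rhs[-1] = 1.0
    mu, *_ = np.linalg.lstsq(lhs, rhs, rcond=None)
    return float(mu @ r_pi)


def run_demo(eps=0.1, delta=0.05, seed=0):
    # Hypothetical toy MDP: action 0 tends to stay, action 1 tends to switch state.
    P = np.array([[[0.9, 0.1], [0.1, 0.9]],
                  [[0.1, 0.9], [0.9, 0.1]]])
    R = np.array([[0.1, 0.0],
                  [1.0, 0.0]])
    H = 1.0                  # span of the optimal bias in this toy MDP
    gamma = 1.0 - eps / H    # assumed form of the discount factor
    # Illustrative sample count per (s, a): constants and log factors are dropped.
    n = math.ceil(H * eps ** -3 * math.log(1.0 / delta))
    rng = np.random.default_rng(seed)
    P_hat = estimate_transitions(P, n, rng)
    policy = discounted_greedy_policy(P_hat, R, gamma)
    return policy, average_reward(P, R, policy), n


if __name__ == "__main__":
    policy, gain, n = run_demo()
    print("samples per pair:", n, "policy:", policy, "average reward:", gain)
```

In this toy MDP the best stationary policy moves to state 1 and stays there, which gives an average reward of 0.9, so the printed value can be compared against that figure. The sketch only shows the shape of the pipeline; it says nothing about the constants or the analysis behind the bounds stated above.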

Subjects:	Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2212.00603 [cs.LG]
	(or arXiv:2212.00603v1 [cs.LG] for this version)

Submission history

From: Lin Yang
[v1] Thu, 1 Dec 2022 15:57:58 GMT (21kb)