
Title: Myopic Quantal Response Policy: Thompson Sampling Meets Behavioral Economics

Abstract: We study a new family of multi-armed bandit (MAB) algorithms called Myopic Quantal Response (MQR). It prescribes a simple way to randomize over arms according to historical data and a "coefficient of exploitation," which explicitly controls the exploration-exploitation trade-off. We show that MQR partially extends the Thompson Sampling (TS) algorithm. It is also a dynamic version of quantal response models in which the expected utilities are estimated directly from historical rewards. Based on theoretical analysis and numerical experiments, we believe the significance of MQR is threefold. First, it provides a conceptual framework for further understanding the exploration-exploitation trade-off. Second, it can be used as a structural estimation tool to learn, from realized actions and rewards, how much a given policy (generated either by human beings or by algorithms) is "under" or "over" exploring. Finally, it can inspire new MAB algorithms that improve on well-studied algorithms (e.g., TS) in a nonasymptotic setting.
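
As a rough illustration of the idea in the abstract, the sketch below randomizes over arms with a standard logit quantal response rule applied to empirical mean rewards. It is not the paper's MQR definition, which the abstract does not spell out. The logit form, the names `quantal_response_probs`, `run_bandit` and `coeff`, the Bernoulli rewards, and the default estimate of 0.5 for untried arms are all assumptions made for this example. Here `coeff` plays the role of a "coefficient of exploitation": 0 gives uniform exploration, and larger values concentrate play on the arm with the best estimate.

```python
import math
import random


def quantal_response_probs(means, coeff):
    """Logit choice rule: P(i) is proportional to exp(coeff * means[i])."""
    top = max(means)  # subtract the max for numerical stability
    weights = [math.exp(coeff * (m - top)) for m in means]
    total = sum(weights)
    return [w / total for w in weights]


def run_bandit(true_means, coeff, horizon, seed=0):
    """Play a Bernoulli bandit, choosing arms by quantal response
    to the empirical mean rewards observed so far (illustrative only)."""
    rng = random.Random(seed)
    n_arms = len(true_means)
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    for _ in range(horizon):
        # Assumption: an arm with no data gets a neutral estimate of 0.5.
        estimates = [
            sums[i] / counts[i] if counts[i] > 0 else 0.5
            for i in range(n_arms)
        ]
        probs = quantal_response_probs(estimates, coeff)
        arm = rng.choices(range(n_arms), weights=probs)[0]
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
    return counts, sums


if __name__ == "__main__":
    counts, sums = run_bandit([0.3, 0.5, 0.7], coeff=10.0, horizon=2000)
    print("pulls per arm:", counts)
    print("total reward:", sum(sums))
```

Varying `coeff` in this toy setup shows the trade-off directly: small values spread pulls almost evenly across arms, while large values commit early to whichever arm currently looks best.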
Subjects: Optimization and Control (math.OC)
Cite as: arXiv:2207.01028 [math.OC]
  (or arXiv:2207.01028v1 [math.OC] for this version)

Submission history

From: Jingying Ding
[v1] Sun, 3 Jul 2022 12:57:13 GMT (319kb,D)
