Title: QoS-Aware Machine Learning-based Multiple Resources Scheduling for Microservices in Cloud Environment

Authors: Lei Liu
Abstract: Microservices dominate modern cloud environments. To improve cost efficiency, multiple microservices are normally co-located on a single server, so run-time resource scheduling becomes pivotal for QoS control. However, the scheduling exploration space grows rapidly with the number of server resources (cores, cache, bandwidth, etc.) and the diversity of microservices. Consequently, existing schedulers might not keep up with rapid changes in service demands. In addition, we observe that resource cliffs exist in the scheduling space. These cliffs not only reduce exploration efficiency, making it difficult to converge to the optimal scheduling solution, but also cause severe QoS fluctuation. To overcome these problems, we propose a novel machine learning-based scheduling mechanism called OSML. It takes resources and runtime states as input and employs two MLP models and a reinforcement learning model to explore the scheduling space. As a result, OSML can reach an optimal solution much faster than traditional approaches. More importantly, it can automatically detect resource cliffs and avoid them during exploration. To verify the effectiveness of OSML and obtain a well-generalized model, we collected a dataset of over 2 billion samples from 11 typical microservices running on real servers over 9 months. Under the same QoS constraint, experimental results show that OSML outperforms state-of-the-art work and achieves around 5 times the scheduling speed.
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Operating Systems (cs.OS)
Cite as: arXiv:1911.13208 [cs.DC]
  (or arXiv:1911.13208v2 [cs.DC] for this version)

Submission history

From: Lei Liu
[v1] Tue, 26 Nov 2019 21:05:00 GMT (2487kb)
[v2] Mon, 2 Dec 2019 08:29:41 GMT (2483kb)
[v3] Tue, 6 Sep 2022 11:10:37 GMT (3534kb,D)
