Computer Science > Distributed, Parallel, and Cluster Computing
Title: FaaSwap: SLO-Aware, GPU-Efficient Serverless Inference via Model Swapping
(Submitted on 6 Jun 2023 (v1), last revised 8 Feb 2024 (this version, v2))
Abstract: Serverless computing has become increasingly popular for machine learning inference. However, current serverless platforms lack efficient support for GPUs, limiting their ability to deliver low-latency inference. In this paper, we propose FaaSwap, a GPU-efficient serverless inference platform. FaaSwap employs a holistic approach to system and algorithm design. It maintains models in main memory and dynamically swaps them onto GPUs upon request arrivals (i.e., late binding), thereby enabling a large number of inference functions to efficiently share a node's GPUs. FaaSwap uses various techniques, including asynchronous API redirection, GPU runtime sharing, pipelined model execution, and efficient GPU memory management, to achieve optimal performance. We also develop an interference-aware request scheduling algorithm that allows FaaSwap to meet the latency SLOs for individual inference functions. We have implemented FaaSwap as a prototype on a leading commercial serverless platform. Experimental evaluations demonstrate that, with model swapping, FaaSwap can concurrently serve hundreds of functions on a single worker node with 4 V100 GPUs, while achieving inference performance comparable to native execution (where each function runs on a dedicated GPU). When deployed on a 6-node production testbed, FaaSwap meets the latency SLOs for over 1k functions, the maximum that the testbed can handle concurrently.
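The late-binding idea in the abstract (models stay in host memory and are copied to a GPU only when a request arrives) can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not FaaSwap's implementation: the class `GpuModelCache`, the model names and sizes, and the least-recently-used eviction policy are all hypothetical, and the abstract does not say which eviction policy or scheduling logic the system actually uses.

```python
from collections import OrderedDict


class GpuModelCache:
    """Toy model of one GPU's memory; eviction is LRU (an assumption)."""

    def __init__(self, capacity_mb):
        self.capacity_mb = capacity_mb
        self.resident = OrderedDict()  # model name -> size in MB, oldest first
        self.swap_ins = 0

    def used_mb(self):
        return sum(self.resident.values())

    def serve(self, name, host_models):
        """Bind `name` to this GPU at request time, swapping it in if needed."""
        size = host_models[name]
        if size > self.capacity_mb:
            raise ValueError(f"{name} does not fit on this GPU")
        if name in self.resident:
            self.resident.move_to_end(name)  # mark as most recently used
            return "hit"
        # Evict least recently used models until the new one fits.
        while self.used_mb() + size > self.capacity_mb:
            self.resident.popitem(last=False)
        self.resident[name] = size  # stands in for the host-to-GPU copy
        self.swap_ins += 1
        return "swapped in"


# Hypothetical models kept in host memory (name -> size in MB).
host_models = {"resnet": 100, "bert": 400, "vgg": 550}
gpu = GpuModelCache(capacity_mb=1000)
requests = ["resnet", "bert", "resnet", "vgg", "bert"]
trace = [gpu.serve(name, host_models) for name in requests]
print(trace, gpu.swap_ins)
```

In this toy trace, the second request for `resnet` is served without a copy, while `vgg` and the second `bert` each force an eviction. A real system would also have to account for copy latency and interference between co-located functions, which this sketch ignores.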
Submission history
From: Minchen Yu
[v1] Tue, 6 Jun 2023 12:19:05 GMT (500kb,D)
[v2] Thu, 8 Feb 2024 12:34:16 GMT (357kb,D)