Title: nuQmm: Quantized MatMul for Efficient Inference of Large-Scale Generative Language Models

Abstract: Recent advances in self-supervised learning with the Transformer architecture have enabled natural language processing (NLP) models to achieve extremely low perplexity. Such powerful models demand ever-increasing model sizes and, consequently, large amounts of computation and large memory footprints. In this paper, we propose an efficient inference framework for large-scale generative language models. As the key to reducing model size, we quantize weights with a non-uniform quantization method. Quantized matrix multiplications are then accelerated by our proposed kernel, called nuQmm, which allows a wide trade-off between compression ratio and accuracy. nuQmm reduces not only the latency of each GPU but also the latency of the entire inference of large LMs, because a high compression ratio (achieved by low-bit quantization) reduces the minimum number of GPUs required. We demonstrate that nuQmm can accelerate inference of the GPT-3 (175B) model by about 14.4 times and reduce energy consumption by 93%.
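The abstract's link between compression ratio and GPU count can be illustrated with rough arithmetic. The sketch below is not taken from the paper: the helper name `min_gpus` is hypothetical, the 80 GB per-GPU memory figure is an assumption, and the estimate counts weight storage only (it ignores activations, caches, and any per-group quantization scale factors).

```python
import math

def min_gpus(num_params: float, bits_per_weight: float, gpu_mem_gb: float) -> int:
    """Smallest number of GPUs whose combined memory holds the weights alone.

    Hypothetical helper for illustration; ignores activations, caches,
    and quantization metadata such as scale factors.
    """
    weight_bytes = num_params * bits_per_weight / 8
    return math.ceil(weight_bytes / (gpu_mem_gb * 1e9))

PARAMS = 175e9    # GPT-3 parameter count quoted in the abstract
GPU_MEM_GB = 80   # assumption: 80 GB per GPU, not stated in the abstract

for bits in (16, 8, 4, 2):
    weight_gb = PARAMS * bits / 8 / 1e9
    print(f"{bits:>2}-bit weights: {weight_gb:6.1f} GB, "
          f"at least {min_gpus(PARAMS, bits, GPU_MEM_GB)} GPU(s)")
```

Under these assumptions, 16-bit weights occupy 350 GB and need at least 5 GPUs, while 4-bit weights occupy 87.5 GB and need at least 2. Fewer GPUs also means less inter-GPU communication, which is consistent with the abstract's claim about end-to-end latency.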
Comments: 13 pages (including 2 pages of References), 13 figures, 5 tables
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Computation and Language (cs.CL)
Cite as: arXiv:2206.09557 [cs.DC]
  (or arXiv:2206.09557v1 [cs.DC] for this version)

Submission history

From: Gunho Park
[v1] Mon, 20 Jun 2022 03:48:17 GMT (1454kb)