Efficient Memory Management for Large Language Model Serving with PagedAttention

High-throughput serving of large language models (LLMs) requires batching sufficiently many requests at a time. However, existing systems struggle because the key-value cache (KV cache) memory for each request is huge and grows and shrinks dynamically. When managed inefficiently, this memory can be significantly wasted by fragmentation and redundant duplication, limiting the batch size. To address this problem, we propose PagedAttention, an attention algorithm inspired by the classical virtual memory and paging techniques in operating systems. On top of it, we build vLLM, an LLM serving system that achieves (1) near-zero waste in KV cache memory and (2) flexible sharing of KV cache within and across requests to further reduce memory usage. Our evaluations show that vLLM improves the throughput of popular LLMs by 2-4× with the same level of latency compared to state-of-the-art systems such as FasterTransformer and Orca. The improvement is more pronounced with longer sequences, larger models, and ... (SOSP 2023)
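The paging idea described in the abstract can be pictured with a short sketch: each request's growing KV cache is mapped, through a per-request block table, onto fixed-size physical blocks drawn from a shared pool, and reference counting lets blocks be shared within and across requests (for example, a common prompt prefix). The Python sketch below is illustrative only; the block size, the class names KVBlockAllocator and Sequence, and the reference-counting scheme are assumptions for exposition, not vLLM's actual interfaces.

from dataclasses import dataclass, field
from typing import Dict, List

BLOCK_SIZE = 16  # tokens per KV block (assumed value; block size is configurable in practice)

@dataclass
class KVBlockAllocator:
    """Hands out fixed-size physical KV blocks from a shared pool."""
    num_blocks: int
    free_blocks: List[int] = field(default_factory=list)
    ref_counts: Dict[int, int] = field(default_factory=dict)

    def __post_init__(self) -> None:
        self.free_blocks = list(range(self.num_blocks))

    def allocate(self) -> int:
        block = self.free_blocks.pop()
        self.ref_counts[block] = 1
        return block

    def share(self, block: int) -> int:
        # Reference-count a block so two sequences (e.g. parallel samples of the
        # same prompt) can map it without copying the underlying KV data.
        self.ref_counts[block] += 1
        return block

    def free(self, block: int) -> None:
        self.ref_counts[block] -= 1
        if self.ref_counts[block] == 0:
            del self.ref_counts[block]
            self.free_blocks.append(block)

@dataclass
class Sequence:
    """Per-request block table mapping logical KV positions to physical blocks."""
    allocator: KVBlockAllocator
    block_table: List[int] = field(default_factory=list)
    num_tokens: int = 0

    def append_token(self) -> None:
        # A new physical block is allocated only when the last one fills up,
        # so at most one partially used block is ever wasted per request.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

if __name__ == "__main__":
    allocator = KVBlockAllocator(num_blocks=1024)
    seq = Sequence(allocator)
    for _ in range(40):           # 40 tokens -> ceil(40 / 16) = 3 physical blocks
        seq.append_token()
    print(seq.block_table)        # three block ids drawn from the shared pool

In this sketch, finishing a request would simply walk its block table and call free on each block, returning the memory to the pool for other batched requests; that is the sense in which fixed-size blocks avoid the fragmentation the abstract describes.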

Bibliographic Details
Main Authors: Kwon, Woosuk; Li, Zhuohan; Zhuang, Siyuan; Sheng, Ying; Zheng, Lianmin; Yu, Cody Hao; Gonzalez, Joseph E.; Zhang, Hao; Stoica, Ion
Format: Article in Journal/Newspaper
Language: unknown
Published: arXiv 2023
Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); FOS: Computer and information sciences
Online Access: https://dx.doi.org/10.48550/arxiv.2309.06180
https://arxiv.org/abs/2309.06180
Institution: Open Polar
Collection: DataCite Metadata Store (German National Library of Science and Technology)
Rights: Creative Commons Attribution 4.0 International (CC-BY-4.0), https://creativecommons.org/licenses/by/4.0/legalcode
DOI: https://doi.org/10.48550/arxiv.2309.06180