In this post, we describe the new KV cache offloading feature introduced in vLLM 0.11.0. We focus on offloading to CPU memory (DRAM) and how it improves overall inference throughput. In the second part of the blog, we take a deep dive into our efforts to optimize host-to-device and device-to-host transfer throughput for KV offloading.
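For readers who want to try the feature before reading on, here is a minimal sketch of enabling CPU offloading through vLLM's KV-connector interface. It assumes the `OffloadingConnector` name and `num_cpu_blocks` config key from the 0.11.0 connector interface; the model name and block count are illustrative placeholders, and the block count should be sized to the DRAM you can spare.

```python
from vllm import LLM
from vllm.config import KVTransferConfig

# Sketch: enable KV cache offloading to CPU DRAM via the KV-connector
# interface (vLLM >= 0.11.0). Connector name and extra-config key follow
# our reading of the 0.11.0 interface; num_cpu_blocks is illustrative.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    kv_transfer_config=KVTransferConfig(
        kv_connector="OffloadingConnector",
        kv_role="kv_both",
        kv_connector_extra_config={"num_cpu_blocks": 100_000},
    ),
)

# Offloaded KV blocks can later be reused for prefix-cache hits that no
# longer fit in GPU memory.
outputs = llm.generate(["Explain KV cache offloading in one sentence."])
print(outputs[0].outputs[0].text)
```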
