
vLLM Blog


vLLM is a fast and easy-to-use library for LLM inference and serving.


Recent Posts

In this post, we describe the new KV cache offloading feature introduced in vLLM 0.11.0. We focus on offloading to CPU memory (DRAM) and how it improves overall inference throughput. In the second part of the blog, we take a deep dive into our efforts to optimize host-to-device and device-to-host transfer throughput for KV offloading.
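
To make the mechanism concrete, the sketch below is a minimal, hypothetical illustration in plain PyTorch, not vLLM's internal API, of the pattern the post discusses: staging KV blocks in pinned DRAM buffers and moving them device-to-host and host-to-device on a dedicated CUDA stream so copies can overlap with compute. The block shape and function names are assumptions made up for the example.

```python
import torch

# Hypothetical sketch, not vLLM's internal API: a KV "block" here is a
# [2, block_size, num_heads, head_dim] tensor holding keys and values for a
# fixed number of tokens. All shapes and names are made up for illustration.
BLOCK_SHAPE = (2, 16, 8, 128)
DTYPE = torch.float16

def make_host_pool(num_blocks: int) -> list:
    """Preallocate pinned (page-locked) DRAM buffers so copies can run asynchronously."""
    return [torch.empty(BLOCK_SHAPE, dtype=DTYPE, pin_memory=True)
            for _ in range(num_blocks)]

def offload_block(gpu_block: torch.Tensor, host_block: torch.Tensor,
                  stream: torch.cuda.Stream) -> None:
    """Device-to-host copy of one KV block on a side stream, off the compute stream."""
    with torch.cuda.stream(stream):
        host_block.copy_(gpu_block, non_blocking=True)

def reload_block(host_block: torch.Tensor, gpu_block: torch.Tensor,
                 stream: torch.cuda.Stream) -> None:
    """Host-to-device copy when an offloaded block is needed again (e.g. prefix reuse)."""
    with torch.cuda.stream(stream):
        gpu_block.copy_(host_block, non_blocking=True)

if torch.cuda.is_available():
    copy_stream = torch.cuda.Stream()
    host_pool = make_host_pool(num_blocks=4)
    gpu_block = torch.randn(BLOCK_SHAPE, dtype=DTYPE, device="cuda")

    offload_block(gpu_block, host_pool[0], copy_stream)
    copy_stream.synchronize()  # ensure the D2H copy finished before reusing the GPU block
    reload_block(host_pool[0], gpu_block, copy_stream)
    copy_stream.synchronize()
```

The pinned (page-locked) host buffers are what allow `copy_` with `non_blocking=True` to run as a true asynchronous DMA transfer rather than a staged, blocking copy; the full post covers how the actual transfer path is tuned and what it means for throughput.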



vLLM Semantic Router is the System Level Intelligence for Mixture-of-Models (MoM), bringing Collective Intelligence into LLM systems. It lives between users and models, capturing signals from requests, responses, and context to make intelligent routing decisions—including model selection, safety filtering (jailbreak, PII), semantic caching, and hallucination detection. For more ...
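
As a rough, hypothetical illustration of routing by semantic similarity (not the Semantic Router's actual interface), the sketch below embeds an incoming request with a toy encoder and picks the target model whose category references it most resembles. The categories, model names, and the `embed()` helper are all assumptions made up for the example.

```python
import hashlib
import numpy as np

# Hypothetical sketch of similarity-based routing; the categories, models, and
# helper names are made up and this is not the vLLM Semantic Router's API.

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Toy deterministic stand-in for a real sentence encoder (returns a unit vector)."""
    seed = int(hashlib.sha256(text.encode()).hexdigest(), 16) % (2**32)
    v = np.random.default_rng(seed).standard_normal(dim)
    return v / np.linalg.norm(v)

# category -> (reference prompts, target model)
ROUTES = {
    "code": (["Write a Python function", "Fix this bug"], "hypothetical-coder-model"),
    "math": (["Solve this equation", "Prove the following"], "hypothetical-reasoning-model"),
    "chat": (["Tell me about", "What is"], "hypothetical-chat-model"),
}

def route(request: str) -> str:
    """Send the request to the model whose category references it most resembles."""
    q = embed(request)
    best_model, best_score = "hypothetical-chat-model", -1.0  # default fallback
    for refs, model in ROUTES.values():
        score = max(float(q @ embed(r)) for r in refs)  # cosine similarity of unit vectors
        if score > best_score:
            best_model, best_score = model, score
    return best_model

# Arbitrary with the toy encoder; meaningful once a real sentence encoder is used.
print(route("Write a Python function that reverses a list"))
```

A real router would swap in a trained sentence encoder and layer on the semantic cache and safety checks the post mentions; the nearest-category lookup shown here is only the skeleton of that decision.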

