Complete Guide to llm-d CNCF Sandbox — Kubernetes-Native Distributed LLM Inference
At KubeCon Europe 2026 in Amsterdam, IBM Research, Red Hat, and Google Cloud jointly donated llm-d to the CNCF as a Sandbox project. Backed by founding partners including NVIDIA, CoreWeave, AMD, Cisco, Hugging Face, Intel, Lambda, and Mistral AI, llm-d is a distributed inference framework designed to run large language model (LLM) inference at production scale on Kubernetes.
If you've served models with vLLM or managed inference endpoints with KServe, you've likely felt the gap: vLLM is powerful but hits scaling walls as a single Pod, while KServe provides high-level abstractions but lacks inference-aware routing. llm-d fills exactly this gap as a middleware layer, delivering Disaggregated Serving, hierarchical KV Cache offloading, and prefix-cache-aware routing — all Kubernetes-native.
The Three Bottlenecks llm-d Solves
Running LLM inference in production consistently hits three core bottlenecks:
| Bottleneck | Problem | llm-d Solution |
|---|---|---|
| Resource Imbalance | Prefill (prompt processing) is GPU compute-intensive; Decode (token generation) is memory bandwidth-intensive. Running both in the same Pod caps GPU utilization at 40–60% | Disaggregated Serving: separate Prefill/Decode into independent Pod Pools with independent scaling |
| KV Cache Waste | Repeated computation of identical system prompts; cache stored only in expensive GPU HBM; cache hit rates plummet in multi-tenant environments | Hierarchical KV Cache Offloading: GPU HBM → CPU DRAM → NVMe tiering + Prefix Caching |
| Routing Inefficiency | Standard Kubernetes Service uses round-robin/random routing, ignoring cache state, model loading status, and GPU topology | Endpoint Picker (EPP): prefix-cache-aware routing maximizes cache hit rates |
Architecture Deep Dive
The core design philosophy of llm-d is "middleware between the inference engine (vLLM) and the orchestration layer (KServe)". It leverages vLLM's high-performance inference kernels while adding distributed scaling and intelligent routing as Kubernetes-native capabilities.
1. Gateway API Inference Extension (GAIE) & Endpoint Picker
llm-d implements the Kubernetes Gateway API Inference Extension (GAIE). Instead of default round-robin Service routing, the Endpoint Picker (EPP) computes a prefix hash for each request's prompt and routes it to the Pod that already has that prefix cached. In multi-tenant SaaS environments sharing the same system prompt, this maximizes KV Cache hit rates.
```yaml
# InferencePool CRD: llm-d routing configuration
# Field layout differs between GAIE API versions; confirm it against the
# installed CRD with `kubectl explain inferencepool.spec` before applying.
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: llm-pool
  namespace: llm-serving
spec:
  targetPortNumber: 8000
  selector:
    app: vllm-llm-d
  endpointPickerConfig:
    extensionRef:
      name: llm-d-epp
      group: ""
      kind: Service
```
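The prefix-hash routing idea described above can be sketched in a few lines of Python. This is a toy model and not llm-d's EPP implementation: the block size, the `PrefixAwarePicker` class, and the tie-breaking rule are all assumptions made for illustration. It chain-hashes fixed-size token blocks and sends each request to the pod that already holds the longest matching prefix.

```python
import hashlib

BLOCK_SIZE = 16  # tokens per cache block; an illustrative value, not an llm-d default


def prefix_block_hashes(tokens, block_size=BLOCK_SIZE):
    """Chain-hash each full block so every hash identifies the whole prefix before it."""
    hashes, parent = [], b""
    full = len(tokens) - len(tokens) % block_size  # ignore the trailing partial block
    for start in range(0, full, block_size):
        block = tokens[start:start + block_size]
        parent = hashlib.sha256(parent + repr(block).encode()).digest()
        hashes.append(parent)
    return hashes


class PrefixAwarePicker:
    """Hypothetical picker: longest cached prefix wins, ties go to the least-used pod."""

    def __init__(self, pods):
        self.pods = list(pods)
        self.cached = {pod: set() for pod in self.pods}   # block hashes each pod holds
        self.assigned = {pod: 0 for pod in self.pods}     # requests routed so far

    def _matched_blocks(self, pod, hashes):
        count = 0
        for block_hash in hashes:
            if block_hash not in self.cached[pod]:
                break
            count += 1
        return count

    def pick(self, tokens):
        hashes = prefix_block_hashes(tokens)
        best = max(
            self.pods,
            key=lambda pod: (self._matched_blocks(pod, hashes), -self.assigned[pod]),
        )
        self.cached[best].update(hashes)  # the chosen pod will now hold these blocks
        self.assigned[best] += 1
        return best


picker = PrefixAwarePicker(["pod-a", "pod-b"])
system_prompt = list(range(64))  # stand-in token IDs for a shared system prompt
first = picker.pick(system_prompt + [1000, 1001])
second = picker.pick(system_prompt + [2000, 2001, 2002])
print(first, second)  # both requests share the prefix, so they land on the same pod
```

Chaining the hashes means a match on block N implies a match on every earlier block, so the picker only needs set lookups rather than token-by-token comparison.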
2. Disaggregated Serving: Prefill/Decode Separation
LLM inference consists of two distinct phases. Prefill processes the entire input prompt at once to build the KV Cache — GPU compute-intensive. Decode reads the KV Cache to generate tokens one by one — memory bandwidth-bound. These phases have completely different hardware requirements, so running them in the same Pod wastes resources.
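A rough roofline estimate shows why the two phases land on opposite sides of the hardware balance point. The sketch below is a simplification under stated assumptions: it counts only weight reads and matrix-multiply FLOPs (ignoring attention and KV Cache traffic), assumes 2-byte weights, and uses approximate H100 SXM datasheet figures. The function names are hypothetical.

```python
# Approximate H100 SXM datasheet figures (assumptions for this estimate)
H100_PEAK_BF16_FLOPS = 989e12   # dense BF16 FLOP/s
H100_HBM_BANDWIDTH = 3.35e12    # bytes/s


def arithmetic_intensity(tokens_per_pass, bytes_per_param=2):
    """FLOPs per byte of weights read in one forward pass.

    FLOPs ~ 2 * params * tokens and bytes ~ params * bytes_per_param,
    so the parameter count cancels out.
    """
    return 2 * tokens_per_pass / bytes_per_param


def limiting_resource(tokens_per_pass):
    """Compare intensity with the roofline ridge point (peak FLOP/s over bandwidth)."""
    ridge = H100_PEAK_BF16_FLOPS / H100_HBM_BANDWIDTH  # roughly 295 FLOPs/byte
    if arithmetic_intensity(tokens_per_pass) >= ridge:
        return "compute"
    return "memory bandwidth"


print(limiting_resource(2048))  # prefill over a 2048-token prompt
print(limiting_resource(1))     # decode, one token for one sequence
print(limiting_resource(64))    # decode, batch of 64 sequences
```

Under these assumptions a prefill pass over a long prompt sits well above the ridge point, while decode stays below it even with a batch of 64, which is the imbalance that motivates separate pools.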
llm-d separates these phases into independent Pod Pools. The Prefill Pool runs on nodes with high GPU compute performance, while the Decode Pool runs on nodes with wide memory bandwidth — each auto-scaling independently.
```yaml
# Prefill Pool: GPU compute-optimized nodes
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-prefill
  namespace: llm-serving
spec:
  replicas: 4
  selector:                     # required by apps/v1; must match the template labels
    matchLabels:
      app: vllm-llm-d
      llm-d/role: prefill
  template:
    metadata:
      labels:                   # label values are assumptions for this example
        app: vllm-llm-d         # matches the InferencePool selector
        llm-d/role: prefill
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:v0.8.0
          args:
            - --model=Qwen/Qwen3-32B
            - --tensor-parallel-size=2
            - --enable-prefix-caching
            # Quote the whole argument: there is no shell here to strip inner quotes.
            # A full disaggregated setup also sets kv_role in this JSON.
            - '--kv-transfer-config={"kv_connector":"PyNcclConnector"}'
          resources:
            limits:
              nvidia.com/gpu: "2"
      nodeSelector:
        llm-d/role: prefill
---
# Decode Pool: memory bandwidth-optimized nodes
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-decode
  namespace: llm-serving
spec:
  replicas: 8
  selector:
    matchLabels:
      app: vllm-llm-d
      llm-d/role: decode
  template:
    metadata:
      labels:
        app: vllm-llm-d
        llm-d/role: decode
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:v0.8.0
          args:
            - --model=Qwen/Qwen3-32B
            - --tensor-parallel-size=2
            - --enable-prefix-caching
            - '--kv-transfer-config={"kv_connector":"PyNcclConnector"}'
          resources:
            limits:
              nvidia.com/gpu: "2"
      nodeSelector:
        llm-d/role: decode
```
3. Hierarchical KV Cache Offloading
In LLM inference, the KV Cache lives in GPU HBM — the most expensive and capacity-limited memory. For models with long context windows (128K+ tokens), the KV Cache can consume most of the GPU memory.
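To put a number on that claim, the KV Cache footprint can be estimated from the model shape. The sketch below assumes a 32B-class model with grouped-query attention (64 layers, 8 KV heads, head dimension 128) and 2-byte cache values; these shape parameters are illustrative assumptions, so substitute the values from your model's config.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV Cache for one sequence: a key and a value vector per layer per token."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


# Assumed shape of a 32B-class GQA model (illustrative, not read from a real config)
SHAPE = dict(layers=64, kv_heads=8, head_dim=128)

per_token = kv_cache_bytes(1, **SHAPE)
full_context = kv_cache_bytes(128 * 1024, **SHAPE)

print(f"{per_token / 1024:.0f} KiB per token")
print(f"{full_context / 1024**3:.0f} GiB for one 128K-token sequence")
```

Under these assumptions each token costs 256 KiB, so a single 128K-token sequence needs 32 GiB of cache before weights and activations are counted. A handful of such sessions exceeds what HBM alone can hold.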
llm-d's Hierarchical KV Cache Offloading uses a 3-tier memory hierarchy:
| Cache Tier | Storage | Access Latency | Use Case |
|---|---|---|---|
| L1 Hot | GPU HBM (H100: 80GB) | < 1μs | Currently active inference sessions |
| L2 Warm | CPU DRAM (512GB–2TB) | 10–50μs | Recently used cache, Prefix Cache |
| L3 Cold | NVMe SSD (multi-TB) | 100–500μs | Cold cache, long context history |
Combined with Prefix Caching, this dramatically reduces redundant computation in multi-tenant environments sharing system prompts.
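The tiering behavior can be illustrated with a small LRU model. This is a conceptual sketch only, not llm-d's offloading code: the `TieredKVCache` class and its capacities (counted in blocks) are hypothetical. Blocks evicted from a faster tier are demoted to the next one, and a hit in a slower tier promotes the block back to the fastest tier.

```python
from collections import OrderedDict


class TieredKVCache:
    """Toy multi-tier LRU cache: eviction demotes a block, a hit promotes it."""

    def __init__(self, capacities):
        # capacities: (tier_name, max_blocks) pairs, fastest tier first
        self.tiers = [(name, cap, OrderedDict()) for name, cap in capacities]

    def _insert(self, key, value, level):
        if level >= len(self.tiers):
            return  # evicted from the coldest tier: the block is dropped
        _, cap, store = self.tiers[level]
        store[key] = value
        while len(store) > cap:
            old_key, old_value = store.popitem(last=False)  # least recently used
            self._insert(old_key, old_value, level + 1)     # demote one tier down

    def put(self, key, value):
        for _, _, store in self.tiers:
            store.pop(key, None)  # avoid stale copies in slower tiers
        self._insert(key, value, 0)

    def get(self, key):
        """Return (tier_name, value) for a hit, or (None, None) for a miss."""
        for name, _, store in self.tiers:
            if key in store:
                value = store.pop(key)
                self._insert(key, value, 0)  # promote to the fastest tier
                return name, value
        return None, None


cache = TieredKVCache([("hbm", 2), ("dram", 2), ("nvme", 2)])
for block_id in "abcde":
    cache.put(block_id, block_id.upper())

print(cache.get("a"))  # oldest block, found in the coldest tier
print(cache.get("a"))  # second lookup is served from the fastest tier
```

A hit in a slower tier costs a transfer back into HBM, which the source's latency table puts at tens to hundreds of microseconds. That is the cost being weighed against recomputing the prefix from scratch.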
4. LeaderWorkerSet (LWS) and Multi-Node Expert Parallelism
For Mixture of Experts (MoE) models or models with hundreds of billions of parameters that don't fit in a single node's GPU memory, llm-d uses Kubernetes LeaderWorkerSet (LWS) primitives to orchestrate tensor parallelism and Expert Parallelism across multiple nodes.
Performance Benchmarks: v0.5 Results with Qwen3-32B
Official benchmarks from llm-d v0.5 testing Qwen3-32B with 8 vLLM Pods on 16 NVIDIA H100 GPUs:
| Metric | Baseline K8s Service | llm-d v0.5 | Improvement |
|---|---|---|---|
| TTFT (Time to First Token) | P99 hundreds of ms | Near-zero latency | Significant |
| Throughput | Baseline | ~120,000 tokens/sec | Linear scaling |
| GPU Utilization | 40–60% | 80%+ | Up to ~2x |
| KV Cache Hit Rate | Low (random routing) | EPP Prefix-Aware | Major improvement |
The Prefix Caching + EPP combination particularly shines in multi-tenant SaaS scenarios. When thousands of concurrent users share the same system prompt, TTFT drops to near zero because requests are routed to Pods that already hold the prefix cache.
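A small simulation makes the routing effect on hit rate concrete. It is a sketch under simplifying assumptions: each tenant has one system prompt, each pod keeps a fixed number of prefixes in an LRU cache, and "prefix affinity" is modeled as `tenant % n_pods` as a stand-in for a real prefix hash. None of the numbers it prints are llm-d benchmark results.

```python
import random
from collections import OrderedDict


def simulate(route, n_pods=4, n_tenants=8, cache_slots=2, n_requests=2000, seed=0):
    """Return the prefix-cache hit rate for a routing function route(i, tenant)."""
    rng = random.Random(seed)
    caches = [OrderedDict() for _ in range(n_pods)]  # one LRU prefix cache per pod
    hits = 0
    for i in range(n_requests):
        tenant = rng.randrange(n_tenants)
        cache = caches[route(i, tenant) % n_pods]
        if tenant in cache:
            hits += 1
            cache.move_to_end(tenant)          # refresh LRU position
        else:
            cache[tenant] = True
            if len(cache) > cache_slots:
                cache.popitem(last=False)      # evict least recently used prefix
    return hits / n_requests


def round_robin(i, tenant):
    return i        # ignores which prefix the request carries


def prefix_affinity(i, tenant):
    return tenant   # same prefix always maps to the same pod


print(f"round-robin hit rate:     {simulate(round_robin):.1%}")
print(f"prefix-affinity hit rate: {simulate(prefix_affinity):.1%}")
```

With round-robin, every pod sees every tenant and its small cache thrashes. With affinity, each pod only ever sees the prefixes assigned to it, so after the first request per tenant nearly everything is a hit.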
v0.5 Key Features Summary
| Feature | Description | Use Case |
|---|---|---|
| Hierarchical KV Offloading | GPU → CPU → NVMe 3-tier cache | 128K+ long context, multi-session |
| Cache-Aware LoRA Routing | Routes requests to Pods with the correct LoRA adapter | Per-customer fine-tuned model serving |
| Resilient Networking (UCCL) | UCCL-based high-speed GPU interconnect | Multi-node tensor parallelism |
| Scale-to-Zero Autoscaling | Scales Pod count to 0 when there is no traffic | Cost optimization (nights/weekends) |
| Wide Expert Parallelism | Distributes MoE Experts across multiple nodes | Mixtral, DeepSeek MoE models |
| Open Benchmarking | Standardized, reproducible benchmark framework | Hardware/config comparison |
Hardware Agnosticism
A core design principle of llm-d is vendor neutrality. It supports NVIDIA H100/A100, AMD MI300X, Intel Gaudi, and Google TPU v5. You can even run heterogeneous accelerator configurations, for example a Prefill Pool on NVIDIA H100 for compute and a Decode Pool on AMD MI300X for memory bandwidth, managed declaratively with Kubernetes nodeSelector and Dynamic Resource Allocation (DRA).
Quick Start Guide
```bash
# 1. Prerequisites
kubectl version --client   # v1.30+ recommended
helm version               # v3.12+
nvidia-smi                 # GPU driver check

# 2. Install Gateway API CRDs
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.2.1/standard-install.yaml

# 3. Install Gateway API Inference Extension
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/releases/download/v0.3.0/manifests.yaml

# 4. Deploy llm-d via Helm
helm repo add llm-d https://llm-d.github.io/llm-d-deployer
helm repo update
helm install llm-d llm-d/llm-d \
  --namespace llm-serving \
  --create-namespace \
  --set model.name=Qwen/Qwen3-32B \
  --set prefill.replicas=2 \
  --set decode.replicas=4 \
  --set gpu.type=nvidia-h100 \
  --set autoscaling.enabled=true \
  --set autoscaling.scaleToZero=true

# 5. Verify deployment
kubectl get pods -n llm-serving -w
kubectl get inferencepool -n llm-serving
```
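Once the Pods are ready, a smoke test against the gateway confirms that routing works end to end. The sketch below assumes the gateway exposes vLLM's OpenAI-compatible `/v1/completions` endpoint. The base URL is a placeholder you would replace with your own gateway address, and the helper names are hypothetical.

```python
import json
import urllib.request


def build_completion_request(base_url, model, prompt, max_tokens=32):
    """Build a POST request for an OpenAI-compatible completions endpoint."""
    body = json.dumps(
        {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    ).encode()
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/v1/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=60):
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Placeholder address: substitute the external address of your own gateway.
request = build_completion_request(
    "http://gateway.example", "Qwen/Qwen3-32B", "Kubernetes is"
)
print(request.full_url)
# To actually call the endpoint: result = send(request)
```

Sending the same long prompt twice and comparing response times is a quick, informal way to see whether the second request benefits from the prefix cache.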
llm-d vs Existing Solutions
| Feature | vLLM Standalone | KServe + vLLM | llm-d + vLLM |
|---|---|---|---|
| Disaggregated Serving | Not supported | Not supported | Prefill/Decode independent Pools |
| KV Cache Tiering | GPU HBM only | GPU HBM only | GPU → CPU → NVMe |
| Routing | Single Pod | Round-robin | Prefix-Cache-Aware EPP |
| Multi-Node Parallel | Manual setup | Limited | LWS + NCCL/UCCL native |
| LoRA Routing | Single Pod only | Not supported | Cache-Aware LoRA routing |
| Scale-to-Zero | Not supported | Requires Knative | Native support |
| Hardware | NVIDIA-centric | NVIDIA-centric | NVIDIA, AMD, Intel, TPU |
Practical Considerations
Be aware this is a CNCF Sandbox project. Sandbox is CNCF's early-stage designation, meaning production stability is not yet fully validated. Features are evolving rapidly, and breaking API changes may occur. Test thoroughly in staging before production deployment.
Coexistence with KServe. llm-d complements rather than replaces KServe. KServe handles model lifecycle (deployment, rollout, canary), while llm-d handles inference-specific routing and cache optimization as a layered architecture.
Monitoring and benchmarking. Use llm-d's Open Benchmarking framework to quantitatively compare TTFT, TPOT, throughput, and KV Cache utilization before and after adoption.
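If you collect per-request timestamps yourself, the metrics named above reduce to simple arithmetic. The sketch below is a generic calculation, not part of llm-d's Open Benchmarking framework. The `RequestTrace` fields are hypothetical names, and the percentile uses the nearest-rank method.

```python
import math
from dataclasses import dataclass


@dataclass
class RequestTrace:
    sent_at: float         # seconds, when the request was sent
    first_token_at: float  # seconds, when the first token arrived
    finished_at: float     # seconds, when the last token arrived
    output_tokens: int


def ttft(trace):
    """Time to First Token."""
    return trace.first_token_at - trace.sent_at


def tpot(trace):
    """Time Per Output Token, averaged over the tokens after the first."""
    if trace.output_tokens < 2:
        return 0.0
    return (trace.finished_at - trace.first_token_at) / (trace.output_tokens - 1)


def throughput(traces):
    """Output tokens per second across the whole measurement window."""
    window = max(t.finished_at for t in traces) - min(t.sent_at for t in traces)
    return sum(t.output_tokens for t in traces) / window


def percentile(values, p):
    """Nearest-rank percentile for an integer p between 1 and 100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


traces = [
    RequestTrace(sent_at=0.0, first_token_at=0.25, finished_at=2.25, output_tokens=101),
    RequestTrace(sent_at=1.0, first_token_at=1.5, finished_at=3.0, output_tokens=49),
]
print("P99 TTFT:", percentile([ttft(t) for t in traces], 99))
print("mean TPOT:", sum(tpot(t) for t in traces) / len(traces))
print("throughput:", throughput(traces), "tokens/sec")
```

Running the same calculation on traces captured before and after adopting llm-d, with identical prompts and concurrency, gives a like-for-like comparison.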
Conclusion
llm-d joining the CNCF Sandbox marks a significant milestone: Kubernetes is becoming the de facto standard for AI inference infrastructure. With IBM, Red Hat, and Google donating the framework to the CNCF, and NVIDIA, AMD, and Intel all joining as partners, major vendors are converging on Kubernetes as that common platform.
If you're already using vLLM or KServe, llm-d is a natural extension of your stack — boosting GPU utilization with Disaggregated Serving, reducing TTFT with prefix-cache-aware routing, and optimizing memory costs with Hierarchical KV Cache. Given its Sandbox status, we recommend validating quantitative gains through Open Benchmarking in staging before production rollout.
This article was written with AI assistance (Claude) and reviewed by the ManoIT editorial team. Technical facts were cross-verified against official documentation.
Originally published at ManoIT Tech Blog.