
Complete Guide to llm-d CNCF Sandbox — Kubernetes-Native Distributed LLM Inference

Dev.to AI · by daniel jeong · April 1, 2026 · 8 min read


At KubeCon Europe 2026 in Amsterdam, IBM Research, Red Hat, and Google Cloud jointly donated llm-d to the CNCF as a Sandbox project. Backed by founding partners including NVIDIA, CoreWeave, AMD, Cisco, Hugging Face, Intel, Lambda, and Mistral AI, llm-d is a distributed inference framework designed to run large language model (LLM) inference at production scale on Kubernetes.

If you've served models with vLLM or managed inference endpoints with KServe, you've likely felt the gap: vLLM is powerful but hits scaling walls as a single Pod, while KServe provides high-level abstractions but lacks inference-aware routing. llm-d fills exactly this gap as a middleware layer, delivering Disaggregated Serving, hierarchical KV Cache offloading, and prefix-cache-aware routing — all Kubernetes-native.

The Three Bottlenecks llm-d Solves

Running LLM inference in production consistently hits three core bottlenecks:

| Bottleneck | Problem | llm-d Solution |
| --- | --- | --- |
| Resource Imbalance | Prefill (prompt processing) is GPU compute-intensive; Decode (token generation) is memory bandwidth-intensive. Running both in the same Pod caps GPU utilization at 40–60% | Disaggregated Serving: separate Prefill/Decode into independent Pod Pools with independent scaling |
| KV Cache Waste | Repeated computation of identical system prompts; cache stored only in expensive GPU HBM; cache hit rates plummet in multi-tenant environments | Hierarchical KV Cache Offloading: GPU HBM → CPU DRAM → NVMe tiering + Prefix Caching |
| Routing Inefficiency | Standard Kubernetes Service uses round-robin/random routing, ignoring cache state, model loading status, and GPU topology | Endpoint Picker (EPP): prefix-cache-aware routing maximizes cache hit rates |

Architecture Deep Dive

The core design philosophy of llm-d is "middleware between the inference engine (vLLM) and the orchestration layer (KServe)". It leverages vLLM's high-performance inference kernels while adding distributed scaling and intelligent routing as Kubernetes-native capabilities.

1. Gateway API Inference Extension (GAIE) & Endpoint Picker

llm-d implements the Kubernetes Gateway API Inference Extension (GAIE). Instead of default round-robin Service routing, the Endpoint Picker (EPP) computes a prefix hash for each request's prompt and routes it to the Pod that already has that prefix cached. In multi-tenant SaaS environments sharing the same system prompt, this maximizes KV Cache hit rates.

```yaml
# InferencePool CRD: llm-d routing configuration
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: llm-pool
  namespace: llm-serving
spec:
  targetPortNumber: 8000
  selector:
    app: vllm-llm-d
  extensionRef:        # Endpoint Picker reference; sits directly under spec in v1alpha2
    name: llm-d-epp
    group: ""
    kind: Service
```
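To make the routing idea concrete, here is a minimal Python sketch of prefix-aware endpoint selection. It is a toy model, not the actual EPP implementation: the class name `ToyEndpointPicker`, the whitespace tokenization, the 4-token block size, and the tie-breaking by load are all assumptions made for illustration.

```python
import hashlib
from collections import defaultdict

BLOCK_TOKENS = 4  # hypothetical block size, chosen small for readability


def prefix_block_hashes(tokens, block=BLOCK_TOKENS):
    """Hash each full block of the prompt, chaining in the previous block's hash."""
    hashes, prev = [], b""
    for i in range(0, len(tokens) - len(tokens) % block, block):
        chunk = " ".join(tokens[i:i + block]).encode()
        prev = hashlib.sha256(prev + chunk).digest()
        hashes.append(prev)
    return hashes


class ToyEndpointPicker:
    """Illustrative picker: prefer the Pod holding the longest cached prefix."""

    def __init__(self, pods):
        self.pods = list(pods)
        self.cached = defaultdict(set)       # pod -> block hashes assumed cached
        self.load = {p: 0 for p in self.pods}

    def matched_blocks(self, pod, hashes):
        """Count leading blocks of the prompt that the Pod already holds."""
        count = 0
        for h in hashes:
            if h not in self.cached[pod]:
                break
            count += 1
        return count

    def pick(self, prompt):
        hashes = prefix_block_hashes(prompt.split())
        # Longest cached prefix wins; ties go to the least-loaded Pod.
        best = max(
            self.pods,
            key=lambda p: (self.matched_blocks(p, hashes), -self.load[p]),
        )
        self.cached[best].update(hashes)  # assume the Pod now caches this prompt
        self.load[best] += 1
        return best


picker = ToyEndpointPicker(["pod-a", "pod-b"])
system = "You are a helpful assistant for Acme support tickets"
first = picker.pick(system + " reset my password please")
second = picker.pick("completely different prompt about something else entirely ok")
third = picker.pick(system + " where is my order now")
```

Because each block hash chains in the previous one, two prompts share a hash only when everything before that block is identical. That is why the third request lands on the same Pod as the first, while the unrelated second request goes to the idle Pod.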

2. Disaggregated Serving: Prefill/Decode Separation

LLM inference consists of two distinct phases. Prefill processes the entire input prompt at once to build the KV Cache — GPU compute-intensive. Decode reads the KV Cache to generate tokens one by one — memory bandwidth-bound. These phases have completely different hardware requirements, so running them in the same Pod wastes resources.

llm-d separates these phases into independent Pod Pools. The Prefill Pool runs on nodes with high GPU compute performance, while the Decode Pool runs on nodes with wide memory bandwidth — each auto-scaling independently.

```yaml
# Prefill Pool: GPU compute-optimized nodes
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-prefill
  namespace: llm-serving
spec:
  replicas: 4
  selector:                    # required by apps/v1; labels here are assumed, not from the original listing
    matchLabels:
      app: vllm-llm-d
      llm-d/role: prefill
  template:
    metadata:
      labels:
        app: vllm-llm-d        # assumed to match the InferencePool selector
        llm-d/role: prefill
    spec:
      nodeSelector:
        llm-d/role: prefill
      containers:
        - name: vllm
          image: vllm/vllm-openai:v0.8.0
          args:
            - --model=Qwen/Qwen3-32B
            - --tensor-parallel-size=2
            - --enable-prefix-caching
            # vLLM's KV transfer config typically also sets a kv_role per pool; omitted in the original
            - '--kv-transfer-config={"kv_connector":"PyNcclConnector"}'
          resources:
            limits:
              nvidia.com/gpu: "2"
---
# Decode Pool: memory bandwidth-optimized nodes
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-decode
  namespace: llm-serving
spec:
  replicas: 8
  selector:
    matchLabels:
      app: vllm-llm-d
      llm-d/role: decode
  template:
    metadata:
      labels:
        app: vllm-llm-d
        llm-d/role: decode
    spec:
      nodeSelector:
        llm-d/role: decode
      containers:
        - name: vllm
          image: vllm/vllm-openai:v0.8.0
          args:
            - --model=Qwen/Qwen3-32B
            - --tensor-parallel-size=2
            - --enable-prefix-caching
            - '--kv-transfer-config={"kv_connector":"PyNcclConnector"}'
          resources:
            limits:
              nvidia.com/gpu: "2"
```
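The benefit of independent scaling can be seen with a back-of-the-envelope sizing calculation. The sketch below is a rough model under stated assumptions: the function `plan_pools` and the per-Pod throughput figures are hypothetical, and real capacity planning should use measured numbers from your own hardware.

```python
import math
from dataclasses import dataclass


@dataclass
class PoolPlan:
    prefill_replicas: int
    decode_replicas: int


def plan_pools(requests_per_s, prompt_tokens, output_tokens,
               prefill_tokens_per_s_per_pod, decode_tokens_per_s_per_pod):
    """Size each pool from its own token demand (hypothetical, steady-state model)."""
    prefill_demand = requests_per_s * prompt_tokens   # prompt tokens/sec to process
    decode_demand = requests_per_s * output_tokens    # output tokens/sec to generate
    return PoolPlan(
        prefill_replicas=max(1, math.ceil(prefill_demand / prefill_tokens_per_s_per_pod)),
        decode_replicas=max(1, math.ceil(decode_demand / decode_tokens_per_s_per_pod)),
    )


# Hypothetical workload: 20 req/s, 2000-token prompts, 300-token answers.
long_prompts = plan_pools(20, 2000, 300, 20000, 1500)
# Same traffic with shorter prompts: only the prefill pool shrinks.
short_prompts = plan_pools(20, 500, 300, 20000, 1500)
```

With these made-up numbers, shortening the prompts changes the prefill replica count but leaves the decode pool untouched. A single combined Pod type cannot express that difference.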

3. Hierarchical KV Cache Offloading

In LLM inference, the KV Cache lives in GPU HBM — the most expensive and capacity-limited memory. For models with long context windows (128K+ tokens), the KV Cache can consume most of the GPU memory.

llm-d's Hierarchical KV Cache Offloading uses a 3-tier memory hierarchy:

| Cache Tier | Storage | Access Latency | Use Case |
| --- | --- | --- | --- |
| L1 Hot | GPU HBM (H100: 80GB) | < 1μs | Currently active inference sessions |
| L2 Warm | CPU DRAM (512GB–2TB) | 10–50μs | Recently used cache, Prefix Cache |
| L3 Cold | NVMe SSD (multi-TB) | 100–500μs | Cold cache, long context history |

Combined with Prefix Caching, this dramatically reduces redundant computation in multi-tenant environments sharing system prompts.
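The tiering behavior can be sketched as a chain of LRU caches in which an entry evicted from one tier is demoted to the next rather than discarded. This is a simplified illustration, not llm-d's offloading code: the class `TieredKVCache`, the entry-count capacities, and the promote-on-hit policy are assumptions.

```python
from collections import OrderedDict


class TieredKVCache:
    """Toy multi-tier LRU: evictions cascade from hot tiers to colder ones."""

    def __init__(self, capacities):
        # capacities: mapping of tier name -> max entries, hottest tier first
        self.tiers = [(name, cap, OrderedDict()) for name, cap in capacities.items()]

    def _insert(self, level, key, value):
        if level >= len(self.tiers):
            return  # fell off the coldest tier: dropped, must be recomputed later
        _, cap, store = self.tiers[level]
        store[key] = value
        store.move_to_end(key)
        if len(store) > cap:
            old_key, old_value = store.popitem(last=False)  # least recently used
            self._insert(level + 1, old_key, old_value)     # demote one tier down

    def _pop(self, key):
        for name, _, store in self.tiers:
            if key in store:
                return name, store.pop(key)
        return None, None

    def put(self, key, value):
        self._pop(key)  # avoid duplicates across tiers
        self._insert(0, key, value)

    def get(self, key):
        """Return (tier_where_found, value), promoting hits to the hottest tier."""
        name, value = self._pop(key)
        if name is None:
            return None, None
        self._insert(0, key, value)
        return name, value


cache = TieredKVCache({"hbm": 1, "dram": 1, "nvme": 1})
cache.put("a", 1)
cache.put("b", 2)
cache.put("c", 3)            # "a" has been pushed down to the NVMe tier
hit_a = cache.get("a")       # found cold, promoted back to HBM
cache.put("d", 4)            # cascade pushes "b" off the coldest tier
miss_b = cache.get("b")
hit_d = cache.get("d")
```

A hit in a colder tier costs more latency than an HBM hit but still avoids recomputing the prefill. Only an entry that falls off the last tier has to be rebuilt from scratch.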

4. LeaderWorkerSet (LWS) and Multi-Node Expert Parallelism

For Mixture of Experts (MoE) models or models with hundreds of billions of parameters that don't fit in a single node's GPU memory, llm-d uses Kubernetes LeaderWorkerSet (LWS) primitives to orchestrate tensor parallelism and Expert Parallelism across multiple nodes.

Performance Benchmarks: v0.5 Results with Qwen3-32B

Official benchmarks from llm-d v0.5 testing Qwen3-32B with 8 vLLM Pods on 16 NVIDIA H100 GPUs:

| Metric | Baseline K8s Service | llm-d v0.5 | Improvement |
| --- | --- | --- | --- |
| TTFT (Time to First Token) | P99 hundreds of ms | Near-Zero Latency | Significant |
| Throughput | Baseline | ~120,000 tokens/sec | Linear scaling |
| GPU Utilization | 40–60% | 80%+ | ~2x |
| KV Cache Hit Rate | Low (random routing) | EPP Prefix-Aware | Major improvement |

The Prefix Caching + EPP combination particularly shines in multi-tenant SaaS scenarios. When serving thousands of concurrent users sharing the same system prompt, TTFT approaches near-zero by routing requests to Pods that already hold the prefix cache.
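A rough way to see why a prefix hit shrinks TTFT: prefill time scales with the number of prompt tokens that still need to be computed. The estimate below is a deliberately simple model with hypothetical numbers. It ignores queueing, scheduling, and network overhead, and `estimated_prefill_ms` is an illustrative helper rather than anything from llm-d.

```python
def estimated_prefill_ms(prompt_tokens, cached_prefix_tokens, prefill_tokens_per_s):
    """Approximate prefill time when only the uncached suffix must be computed."""
    uncached = max(0, prompt_tokens - cached_prefix_tokens)
    return 1000.0 * uncached / prefill_tokens_per_s


# Hypothetical request: 2048-token prompt, of which 2000 tokens are a shared system prompt.
cold_ms = estimated_prefill_ms(2048, 0, 20000)      # no cache hit
warm_ms = estimated_prefill_ms(2048, 2000, 20000)   # routed to a Pod holding the prefix
```

Under these assumed figures, the prefill portion of TTFT drops from roughly 100 ms to a few milliseconds. The size of the gain depends entirely on how much of the prompt is shared and on whether the request reaches a Pod that holds it.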

v0.5 Key Features Summary

| Feature | Description | Use Case |
| --- | --- | --- |
| Hierarchical KV Offloading | GPU → CPU → NVMe 3-tier cache | 128K+ long context, multi-session |
| Cache-Aware LoRA Routing | Routes requests to Pods with the correct LoRA adapter | Per-customer fine-tuned model serving |
| Resilient Networking (UCCL) | UCCL-based high-speed GPU interconnect | Multi-node tensor parallelism |
| Scale-to-Zero Autoscaling | Scales Pod count to 0 when no traffic | Cost optimization (nights/weekends) |
| Wide Expert Parallelism | Distributes MoE Experts across multiple nodes | Mixtral, DeepSeek MoE models |
| Open Benchmarking | Standardized, reproducible benchmark framework | Hardware/config comparison |

Hardware Agnosticism

A core design principle of llm-d is vendor neutrality. It supports NVIDIA H100/A100, AMD MI300X, Intel Gaudi, and Google TPU v5. You can even run heterogeneous accelerator configurations — Prefill Pool on NVIDIA H100 for compute, Decode Pool on AMD MI300X for memory bandwidth — managed declaratively with Kubernetes nodeSelector and DRA.

Quick Start Guide

```shell
# 1. Prerequisites
kubectl version --client   # v1.30+ recommended
helm version               # v3.12+
nvidia-smi                 # GPU driver check

# 2. Install Gateway API CRDs
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.2.1/standard-install.yaml

# 3. Install Gateway API Inference Extension
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/releases/download/v0.3.0/manifests.yaml

# 4. Deploy llm-d via Helm
helm repo add llm-d https://llm-d.github.io/llm-d-deployer
helm repo update

helm install llm-d llm-d/llm-d \
  --namespace llm-serving \
  --create-namespace \
  --set model.name=Qwen/Qwen3-32B \
  --set prefill.replicas=2 \
  --set decode.replicas=4 \
  --set gpu.type=nvidia-h100 \
  --set autoscaling.enabled=true \
  --set autoscaling.scaleToZero=true

# 5. Verify deployment
kubectl get pods -n llm-serving -w   # watch until Pods are Running, then Ctrl+C
kubectl get inferencepool -n llm-serving
```
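Once the Pods are running, a quick end-to-end check is to send one request through the gateway. The sketch below assumes the gateway is reachable at a base URL of your choosing (for example via a port-forward) and that it forwards to vLLM's OpenAI-compatible `/v1/completions` endpoint. The helper names and the URL are illustrative, not part of llm-d.

```python
import json
import urllib.request


def build_completion_request(base_url, model, prompt, max_tokens=32):
    """Build a POST request for an OpenAI-compatible completions endpoint."""
    body = json.dumps(
        {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    ).encode()
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def smoke_test(base_url, model, prompt="Hello", timeout=30):
    """Send one request and return the generated text (needs a live endpoint)."""
    req = build_completion_request(base_url, model, prompt)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        payload = json.load(resp)
    return payload["choices"][0]["text"]


# Assumed local address; replace with wherever your gateway is exposed.
request = build_completion_request("http://localhost:8080/", "Qwen/Qwen3-32B", "ping")
```

Calling `smoke_test("http://localhost:8080", "Qwen/Qwen3-32B")` against a live deployment should return a short completion. A connection error or a non-200 response usually points at the gateway or InferencePool wiring rather than at the model itself.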

llm-d vs Existing Solutions

| Feature | vLLM Standalone | KServe + vLLM | llm-d + vLLM |
| --- | --- | --- | --- |
| Disaggregated Serving | Not supported | Not supported | Prefill/Decode independent Pools |
| KV Cache Tiering | GPU HBM only | GPU HBM only | GPU → CPU → NVMe |
| Routing | Single Pod | Round-robin | Prefix-Cache-Aware EPP |
| Multi-Node Parallel | Manual setup | Limited | LWS + NCCL/UCCL native |
| LoRA Routing | Single Pod only | Not supported | Cache-Aware LoRA routing |
| Scale-to-Zero | Not supported | Requires Knative | Native support |
| Hardware | NVIDIA-centric | NVIDIA-centric | NVIDIA, AMD, Intel, TPU |

Practical Considerations

Be aware this is a CNCF Sandbox project. Sandbox is CNCF's early-stage designation, meaning production stability is not yet fully validated. Features are evolving rapidly, and breaking API changes may occur. Test thoroughly in staging before production deployment.

Coexistence with KServe. llm-d complements rather than replaces KServe. KServe handles model lifecycle (deployment, rollout, canary), while llm-d handles inference-specific routing and cache optimization as a layered architecture.

Monitoring and benchmarking. Use llm-d's Open Benchmarking framework to quantitatively compare TTFT, TPOT, throughput, and KV Cache utilization before and after adoption.
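If you want to sanity-check these metrics yourself, TTFT and TPOT can be derived from per-token arrival timestamps on a streaming response. The helpers below are a minimal sketch and not part of the Open Benchmarking framework. The function names are hypothetical, and the percentile uses the simple nearest-rank method.

```python
import math


def ttft_and_tpot(request_sent_s, token_times_s):
    """Return (TTFT, TPOT) in seconds from a send time and token arrival times."""
    if not token_times_s:
        raise ValueError("no tokens received")
    ttft = token_times_s[0] - request_sent_s
    if len(token_times_s) == 1:
        return ttft, 0.0
    # TPOT: average gap between consecutive output tokens after the first one.
    tpot = (token_times_s[-1] - token_times_s[0]) / (len(token_times_s) - 1)
    return ttft, tpot


def percentile(values, pct):
    """Nearest-rank percentile, e.g. pct=99 for P99."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Made-up timestamps: request sent at t=10.0s, four tokens arrive 250 ms apart.
ttft, tpot = ttft_and_tpot(10.0, [10.25, 10.5, 10.75, 11.0])
p99 = percentile([5, 1, 3, 2, 4], 99)
```

Collecting these per request before and after adoption, then comparing P50 and P99, gives a like-for-like view of whether routing and cache changes are paying off.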

Conclusion

llm-d joining the CNCF Sandbox marks a significant milestone: Kubernetes is evolving into the de facto standard for AI inference infrastructure. When IBM, Red Hat, and Google donate their framework to CNCF, and NVIDIA, AMD, and Intel all join as partners, the industry consensus is clear.

If you're already using vLLM or KServe, llm-d is a natural extension of your stack — boosting GPU utilization with Disaggregated Serving, reducing TTFT with prefix-cache-aware routing, and optimizing memory costs with Hierarchical KV Cache. Given its Sandbox status, we recommend validating quantitative gains through Open Benchmarking in staging before production rollout.


This article was written with AI assistance (Claude) and reviewed by the ManoIT editorial team. Technical facts were cross-verified against official documentation.

Originally published at ManoIT Tech Blog.

Original source

Dev.to AI

https://dev.to/x4nent/complete-guide-to-llm-d-cncf-sandbox-kubernetes-native-distributed-llm-inference-1imj
