Gemma 4 31B at 256K Full Context on a Single RTX 5090 — TurboQuant KV Cache Benchmark
Just got Gemma 4 31B running at full 256K context on a single RTX 5090 using TurboQuant KV cache compression.

System Specs

Component   Spec
GPU         NVIDIA GeForce RTX 5090 (32GB VRAM)
CPU         AMD Ryzen 9 9950X3D (16-core)
RAM         64GB DDR5
OS          Windows 11

Setup

Model: gemma-4-31B-it-UD-Q4_K_XL from Unsloth (17.46 GiB)
Build: TheTom/llama-cpp-turboquant, branch feature/turboquant-kv-cache, merged with latest upstream master for Gemma 4 support
KV Cache: turbo3 (3-bit PolarQuant + Hadamard rotation, ~4.5x compression vs f16)
Config: --n-gpu-layers 99 --no-mmap --flash-attn on --cache-type-k turbo3 --cache-type-v turbo3

Benchmark Results

Test        Speed (t/s)
pp4096      3,362.71
pp16384     3,047.00
pp65536     2,077.96
pp131072    1,428.80
pp262144      899.55
tg128          61.51

VRAM usage at 262K: 27.7 GB / 32 GB (4.3 GB headroom)
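To put the throughput numbers in wall-clock terms, here is a rough sketch that turns the table above into prefill times and re-derives the headroom and effective bit-width figures. It assumes each pp result is the average prompt-processing speed over the whole prompt of that length, which the post does not state explicitly; the function and constant names are made up for this illustration.

```python
# Figures copied from the benchmark table in the post.
PROMPT_SPEEDS_TPS = {
    4096: 3362.71,
    16384: 3047.00,
    65536: 2077.96,
    131072: 1428.80,
    262144: 899.55,
}
GEN_SPEED_TPS = 61.51        # tg128 result
VRAM_TOTAL_GB = 32.0
VRAM_USED_GB = 27.7          # reported usage at 262K context
COMPRESSION_VS_F16 = 4.5     # approximate ratio quoted for turbo3


def prefill_seconds(n_tokens: int, tokens_per_second: float) -> float:
    """Estimated time to process a prompt of n_tokens.

    Assumption: tokens_per_second is the average over the full prompt.
    """
    return n_tokens / tokens_per_second


def vram_headroom_gb(total_gb: float, used_gb: float) -> float:
    """Free VRAM left after loading the model and the KV cache."""
    return round(total_gb - used_gb, 1)


def effective_bits_per_value(compression_vs_f16: float) -> float:
    """Bits per cached value implied by a compression ratio against 16-bit floats."""
    return 16.0 / compression_vs_f16


if __name__ == "__main__":
    for n_tokens, tps in PROMPT_SPEEDS_TPS.items():
        secs = prefill_seconds(n_tokens, tps)
        print(f"pp{n_tokens:<7} {tps:>9.2f} t/s  ~{secs:7.1f} s to prefill")
    print(f"128 generated tokens: ~{128 / GEN_SPEED_TPS:.1f} s")
    print(f"VRAM headroom: {vram_headroom_gb(VRAM_TOTAL_GB, VRAM_USED_GB)} GB")
    print(f"Effective KV bits per value: ~{effective_bits_per_value(COMPRESSION_VS_F16):.2f}")
```

Under that assumption, filling the full 262,144-token context takes a little under five minutes, and the quoted ~4.5x ratio works out to roughly 3.6 bits per cached value, slightly more than the nominal 3 bits. The post does not say what accounts for the difference.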
Could not retrieve the full article text.
Read on Reddit r/LocalLLaMA:
https://www.reddit.com/r/LocalLLaMA/comments/1sbdihw/gemma_4_31b_at_256k_full_context_on_a_single_rtx/
