
Gemma 4 31B at 256K Full Context on a Single RTX 5090 — TurboQuant KV Cache Benchmark

Reddit r/LocalLLaMA, by /u/PerceptionGrouchy187 (https://www.reddit.com/user/PerceptionGrouchy187), April 3, 2026

Just got Gemma 4 31B running at full 256K context on a single RTX 5090 using TurboQuant KV cache compression.

System Specs

Component   Spec
GPU         NVIDIA GeForce RTX 5090 (32GB VRAM)
CPU         AMD Ryzen 9 9950X3D (16-core)
RAM         64GB DDR5
OS          Windows 11

Setup

Model: gemma-4-31B-it-UD-Q4_K_XL from Unsloth (17.46 GiB)
Build: TheTom/llama-cpp-turboquant, branch feature/turboquant-kv-cache, merged with latest upstream master for Gemma 4 support
KV Cache: turbo3 (3-bit PolarQuant + Hadamard rotation, ~4.5x compression vs f16)
Config: --n-gpu-layers 99 --no-mmap --flash-attn on --cache-type-k turbo3 --cache-type-v turbo3

Benchmark Results

Test        Speed (t/s)
pp4096      3,362.71
pp16384     3,047.00
pp65536     2,077.96
pp131072    1,428.80
pp262144    899.55
tg128       61.51

VRAM usage at 262K: 27.7 GB / 32 GB (4.3 GB headroom)

G…
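To put the reported numbers in perspective, here is a minimal sketch that derives a few figures from them: the approximate prefill time at each prompt length, the effective bits per cached value implied by "~4.5x compression vs f16", and the VRAM headroom. All constants are copied from the post. The function names are hypothetical, and the prefill estimate assumes each ppN speed is an average over the whole N-token prompt, which the post does not state explicitly.

```python
# Figures copied from the post; everything computed below is derived arithmetic.
PROMPT_SPEEDS = {  # prompt length (tokens) -> reported prompt-processing speed (t/s)
    4096: 3362.71,
    16384: 3047.00,
    65536: 2077.96,
    131072: 1428.80,
    262144: 899.55,
}
F16_BITS = 16              # bits per value in an uncompressed f16 KV cache
COMPRESSION_VS_F16 = 4.5   # "~4.5x" per the post, so treat results as approximate
VRAM_TOTAL_GB = 32.0
VRAM_USED_GB = 27.7


def prefill_seconds(tokens: int, tokens_per_second: float) -> float:
    """Estimated time to ingest a prompt.

    Assumption: the reported speed is an average over the whole prompt.
    """
    return tokens / tokens_per_second


def effective_bits_per_value(compression: float) -> float:
    """Average bits per cached value implied by a compression ratio vs f16."""
    return F16_BITS / compression


if __name__ == "__main__":
    for tokens, speed in PROMPT_SPEEDS.items():
        print(f"pp{tokens}: {prefill_seconds(tokens, speed):.1f} s")
    bits = effective_bits_per_value(COMPRESSION_VS_F16)
    print(f"effective bits per cached value: {bits:.2f}")
    print(f"VRAM headroom: {VRAM_TOTAL_GB - VRAM_USED_GB:.1f} GB")
```

Under that assumption, ingesting a full 262,144-token prompt takes roughly 291 seconds, against about 1.2 seconds at 4,096 tokens. The implied average of about 3.56 bits per value is a little above the nominal 3 bits of turbo3, which would be consistent with some per-block overhead, though the post does not break this down.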

Could not retrieve the full article text.

Read on Reddit r/LocalLLaMA →