VRAM optimization for Gemma 4
Hey little friend! Imagine your computer is like a big toy box, and inside, there's a super-smart robot brain named Gemma! 🤖
Sometimes, Gemma needs a special crayon box called "VRAM" to draw its thoughts. But Gemma was being a bit greedy and taking up too many crayons, even before it started drawing! This made the crayon box full too fast. 🖍️💨
But guess what? Smart grown-ups found a magic trick! They added a secret word, -np 1, to Gemma's instructions. It's like telling Gemma, "Hey, only you are drawing right now, so don't grab ALL the crayons!" ✨
Now, Gemma uses far fewer crayons, and there's more space in the crayon box for all its amazing drawings! Hooray! 🎉
TLDR: add -np 1 to your llama.cpp launch command if you are the only user; it cuts SWA cache VRAM by 3x instantly.

So I was messing around with Gemma 4 and noticed the dense model hogs a massive chunk of VRAM before you even start generating anything. If you are on 16GB you might be hitting OOM and wondering why.

The culprit is the SWA (Sliding Window Attention) KV cache. It allocates in F16 and does not get quantized like the rest of the KV cache. A couple of days ago ggerganov merged a PR that accidentally made this worse by keeping the SWA portion unquantized even when you have KV cache quantization enabled. It got reverted about 2 hours later (https://github.com/ggml-org/llama.cpp/pull/21332), so make sure you are on a recent build.

A few things that actually help with VRAM: The SWA cache s
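Where that flag actually goes, as a rough sketch of a single-user launch (the model filename, context size, and GPU layer count below are placeholders, not taken from the post):

# Single-user launch sketch: -np 1 limits the server to one parallel slot so the
# SWA cache is sized for a single sequence; -ctk/-ctv q8_0 quantize the regular
# (non-SWA) KV cache. Note: on most builds, quantizing the V cache also requires
# flash attention to be enabled.
$ llama-server -m gemma-4-27b-Q4_K_M.gguf -c 16384 -ngl 99 -np 1 -ctk q8_0 -ctv q8_0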
Read on Reddit r/LocalLLaMA → https://www.reddit.com/r/LocalLLaMA/comments/1sb80yv/vram_optimization_for_gemma_4/

More in Open Source AI
v4.3.3 - Gemma 4 support!
Changes

- Gemma 4 support with tool-calling in the API and UI. 🆕

v4.3.1

- ik_llama.cpp support: Add ik_llama.cpp as a new backend through new textgen-portable-ik portable builds and a new --ik flag for full installs. ik_llama.cpp is a fork by the author of the imatrix quants, including support for new quant types, significantly more accurate KV cache quantization (via Hadamard KV cache rotation, enabled by default), and optimizations for MoE models and CPU inference.
- API: Add echo + logprobs for /v1/completions. The completions endpoint now supports the echo and logprobs parameters, returning token-level log probabilities for both prompt and generated tokens. Token IDs are also included in the output via a new top_logprobs_ids field.
- Further optimize my custom gradio fork, saving up to 5
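As a rough illustration of those new API parameters (the host, port, and prompt below are placeholders; your server address may differ):

# Sketch of a /v1/completions request using echo + logprobs. The response should
# contain token-level log probabilities for both the prompt and the completion,
# plus token IDs in the new top_logprobs_ids field.
$ curl http://127.0.0.1:5000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"prompt": "The capital of France is", "max_tokens": 8, "echo": true, "logprobs": 5}'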

B70: Quick and Early Benchmarks & Backend Comparison
llama.cpp: f1f793ad0 (8657)

This is a quick attempt to just get it up and running. Lots of oneapi runtime still using "stable" from Intel's repo. Kernel 6.19.8+deb13-amd64 with an updated xe firmware built. Vulkan is Debian but using latest Mesa compiled from source. Openvino is 2026.0. Feels like everything is "barely on the brink of working" (which is to be expected).

sycl:

$ build/bin/llama-bench -hf unsloth/Qwen3.5-27B-GGUF:UD-Q4_K_XL -p 512,16384 -n 128,512

| model                    |      size |  params | backend | ngl |    test |           t/s |
| ------------------------ | --------: | ------: | ------- | --: | ------: | ------------: |
| qwen35 27B Q4_K - Medium | 16.40 GiB | 26.90 B | SYCL    |  99 |   pp512 | 798.07 ± 2.72 |
| qwen35 27B Q4_K - Medium | 16.40 GiB | 26.90 B | SYCL    |  99 | pp16384
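The backend comparison presumably boils down to repeating the same llama-bench invocation against each separately built binary; a sketch (the build directory names here are assumptions, not from the post):

# Run the identical benchmark against each backend build and compare the t/s columns.
$ build-sycl/bin/llama-bench   -hf unsloth/Qwen3.5-27B-GGUF:UD-Q4_K_XL -p 512,16384 -n 128,512
$ build-vulkan/bin/llama-bench -hf unsloth/Qwen3.5-27B-GGUF:UD-Q4_K_XL -p 512,16384 -n 128,512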



