Models llama model version llama.cpp unsloth

Gemma 4 is seriously broken when using Unsloth and llama.cpp

Reddit r/LocalLLaMAby /u/Tastetrykker https://www.reddit.com/user/TastetrykkerApril 3, 20261 min read1 views

Hi! Just checking, am I the only one who has serious issues with Gemma 4 locally? I've played around with Gemma 4 using Unsloth quants on llama.cpp, and it's seriously broken. I'm using the latest changes from llama.cpp, along with the reccomended temperature, top-p and top-k. Giving it an article and asking it to list all typos along with the corrected version gives total nonsense. Here is a random news article I tested it with: https://www.bbc.com/news/articles/ce843ge47z4o I've tried the 26B MoE, I've tried the 31B, and I've tried UD-Q8_K_XL, Q8_0, and UD-Q4_K_XL. They all have the same issue. As a control, I tested the same thing in Google AI Studio, and there the models work great, finding actual typos instead of the nonsense I get locally. submitted by /u/Tastetrykker [link] [comment

Could not retrieve the full article text.

Read on Reddit r/LocalLLaMA →

Original source

Reddit r/LocalLLaMA

https://www.reddit.com/r/LocalLLaMA/comments/1sb4gzj/gemma_4_is_seriously_broken_when_using_unsloth/

Was this article helpful?

Ask AI about this article

Ready

Conversation starters

Ask anything about this article…

Daily AI Digest

Get the top 5 AI stories delivered to your inbox every morning.

More about

llamamodelversion

Open Source AIFresh

Speculative decoding works great for Gemma 4 31B in llama.cpp

I get a ~11% speed up with Gemma 3 270B as the draft model. Try it by adding: --no-mmproj -hfd unsloth/gemma-3-270m-it-GGUF:Q8_0 Testing with (on a 3090): ./build/bin/llama-cli -hf unsloth/gemma-4-31B-it-GGUF:Q4_1 --jinja --temp 1.0 --top-p 0.95 --top-k 64 -ngl 1000 -st -f prompt.txt --no-mmproj -hfd unsloth/gemma-3-270m-it-GGUF:Q8_0 Gave me: [ Prompt: 607.3 t/s | Generation: 36.6 t/s ] draft acceptance rate = 0.44015 ( 820 accepted / 1863 generated) vs. [ Prompt: 613.8 t/s | Generation: 32.9 t/s ] submitted by /u/Leopold_Boom [link] [comments]

Reddit r/LocalLLaMA

1mabout 5 hours ago

ModelsFresh

Gemma 4 - 4B vs Qwen 3.5 - 9B ?

Hello! anyone tried the 4B Gemma 4 model and the Qwen 3.5 9B model and can tell us their feedback? On the benchmark Qwen seems to be doing better, but I would appreciate any personal experience on the matter Thanks! submitted by /u/No-Mud-1902 [link] [comments]

Reddit r/LocalLLaMA

1mabout 3 hours ago

ModelsFresh

Speed difference on Gemma 4 26B-A4B between Bartowski Q4_K_M and Unsloth Q4_K_XL

I've noticed this on Qwen3.5 35B before as well, there is a noticeable speed difference between Unsloth's Q4_K_XL and Bartowski's Q4_K_M on the same model, but Gemma 4 seems particularly harsh in this regard: Bartowski gets 38 tk/s, Unsloth gets 28 tk/s... everything else is the same, settings wise. This is with the latest Unsloth quant update and latest llama.cpp version. Their size is only ~100 MB apart. Anyone have any idea why this speed difference is there? Btw, on Qwen3.5 35B I noticed that Unsloth's own Q4_K_M was also a bit faster than the Q4_K_XL, but there it was more like 39 vs 42 tk/s. submitted by /u/BelgianDramaLlama86 [link] [comments]

Reddit r/LocalLLaMA

1mabout 2 hours ago

Knowledge Map

TopicsEntitiesSource

Connected Articles — Knowledge Graph

This article is connected to other articles through shared AI topics and tags.

Knowledge Graph100 articles · 210 connections

Scroll to zoom · drag to pan · click to open

Discussion

No comments yet — be the first to share your thoughts!

More in Models

ModelsFresh

Speed difference on Gemma 4 26B-A4B between Bartowski Q4_K_M and Unsloth Q4_K_XL

Reddit r/LocalLLaMA

1mabout 2 hours ago

ModelsFresh

Gemma 4 - 4B vs Qwen 3.5 - 9B ?

Reddit r/LocalLLaMA

1mabout 3 hours ago

ModelsLive

Capability-Based Organizations (CBO): A New Operating Model for the AI Era

Abstract Continue reading on Medium »

Medium AI

1m44 minutes ago

ModelsFresh

Kokoro TTS running on-device, CPU-only, 20x realtime!!!

I wanted a reading app where you could read, read and listen or just listen to books with word-by-word highlighting synced to TTS and i wanted the voice to actually sound good. This turned out to be a really hard challenge with Kokoro on iOS, here's what I ran into: Using MLX Swift is great but uses Metal. iOS kills Metal access the moment you background the app. If your use case needs background audio, this is a dead end. ONNX Runtime on CPU fixes the background problem, but the monolithic Kokoro model only runs at 2-3x realtime. After 30 minutes of sustained generation my phone was scorching hot. What actually worked: I split the monolithic model into a multi-stage pipeline and replaced part of the synthesis with native code on Apple's Accelerate framework. That got it to 20x realtime on

Reddit r/LocalLLaMA

2mabout 4 hours ago