Google - Gemma 4 now in Unsloth!
- Google releases Gemma 4 with four new models: E2B, E4B, 26B-A4B, and 31B.
- You can now run and train the Gemma 4 models in Unsloth (a minimal loading sketch follows this list). Guide / Blog: https://unsloth.ai/docs/models/gemma-4
- GGUFs: https://huggingface.co/collections/unsloth/gemma-4
- The multimodal reasoning models are licensed under Apache 2.0.
- Run E2B and E4B on 6GB of RAM, and on phones. Run 26B-A4B and 31B on ~18GB.
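As a reference point, loading and preparing one of these checkpoints typically follows Unsloth's standard `FastLanguageModel` pattern. This is a minimal sketch: the repo name `unsloth/gemma-4-E4B` and the settings below are assumptions, so check the guide linked above for the exact identifiers.

```python
from unsloth import FastLanguageModel

# Assumed repo name for illustration; see the guide for the real identifiers.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/gemma-4-E4B",
    max_seq_length=4096,   # raise if you need longer context
    load_in_4bit=True,     # 4-bit loading helps the E-series fit in small RAM budgets
)

# For inference:
FastLanguageModel.for_inference(model)

# For training, attach LoRA adapters before handing the model to your trainer:
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
```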
Updates
- Tool calls for smaller models are now more stable and no longer get cut off.
- Context length is now properly applied.
- Tool calls for all models are now 30% to 80% more accurate.
- Web search now retrieves actual page content, not just summaries.
- The maximum number of tool calls is increased from 10 to 25.
- Tool calls now terminate more reliably, reducing looping and repetition.
- More tool-call healing and de-duplication logic to stop tool calls from leaking XML into responses (a sketch of this kind of cleanup follows this list).
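To make the XML-leak fix concrete: the cleanup described above amounts to extracting tool-call payloads, dropping exact duplicates, capping the call count, and stripping any leftover (possibly truncated) XML from the user-visible answer. The sketch below is illustrative only; the `<tool_call>` tag name and the parsing are assumptions, not Unsloth's actual implementation.

```python
import re

# Hypothetical tag name; many chat templates wrap tool calls in <tool_call> ... </tool_call>.
TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)

def heal_response(text: str, max_calls: int = 25) -> tuple[str, list[str]]:
    """Extract tool-call payloads, de-duplicate exact repeats, cap the count,
    and strip leftover XML so it never leaks into the visible answer."""
    calls, seen = [], set()
    for payload in TOOL_CALL_RE.findall(text):
        payload = payload.strip()
        if payload and payload not in seen and len(calls) < max_calls:
            seen.add(payload)
            calls.append(payload)
    # Remove complete tag pairs, then any unterminated trailing fragment
    # (the "cut off" case) from the prose.
    cleaned = TOOL_CALL_RE.sub("", text)
    cleaned = re.sub(r"<tool_call>.*\Z", "", cleaned, flags=re.DOTALL).strip()
    return cleaned, calls
```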
-
Tested with unsloth/Qwen3.5-4B-GGUF (UD-Q4_K_XL), with web search, code execution, and thinking enabled:

| Metric | Before | After |
| --- | --- | --- |
| XML leaks in response | 10/10 | 0/10 |
| URL fetches used | 0 | 4/10 runs |
| Runs with correct song names | 0/10 | 2/10 |
| Avg tool calls | 5.5 | 3.8 |
| Avg response time | 12.3s | 9.8s |
Run Gemma 4 in Unsloth Studio:
What's Changed
- studio: Polish Windows installer/setup logs by @Imagineer99 in #4736
- feat: move folder management into model selector dropdown by @Shine1i in #4731
- fix: clear tool status badge immediately after tool execution by @Shine1i in #4733
- refactor flex attn to prefer flash if possible by @Datta0 in #4734
- Fix Windows local GGUF model loading crash by @danielhanchen in #4730
- Fix OOM model styling in Studio model selectors by @LeoBorcherding in #4738
- feat(studio): strip org prefix in model search to surface unsloth variants by @rolandtannous in #4749
- Fix forward compatibility with transformers 5.x by @danielhanchen in #4752
- Architecture-aware KV cache VRAM estimation (5-path) by @danielhanchen in #4757 (see the KV-cache sizing sketch after this list)
- Fix save_pretrained_merged for full-finetuned models by @danielhanchen in #4755
- Feat/prebuiltllamacpp by @mmathew23 in #4741
- Add installer test coverage for prebuilt llama.cpp changes by @danielhanchen in #4756
- fix: studio web search SSL failures and empty page content by @danielhanchen in #4754
- fix: add tokenizers to no-torch deps and TORCH_CONSTRAINT for arm64 macOS py313+ by @danielhanchen in #4748
- fix(studio): allow context length slider to reach model's native limit by @danielhanchen in #4746
- Tests for architecture-aware KV cache estimation by @danielhanchen in #4760
- Fix custom llama.cpp source builds and macos metal source builds by @mmathew23 in #4762
- studio: align composer/code, unify fonts, and remove tool collapse jitter by @Imagineer99 in #4763
- fix(chat): correct loading text for cached models during inference by @AdamPlatin123 in #4764
- fix(security): shell injection in GGML export conversion by @mateeaaaaaaa in #4768
- Add regression test for shell injection fix in GGML conversion by @danielhanchen in #4773
- fix(studio): prevent small models from stalling on tool-calling tasks by @danielhanchen in #4769
- Add regression tests for custom llama prebuilt installer by @danielhanchen in #4772
- Feat/custom llama prebuilt by @mmathew23 in #4771
- studio: fix chat font changes leaking outside chat page by @Imagineer99 in #4775
- feat(studio): display images from Python tool execution in chat UI by @danielhanchen in #4778
- ui improvement by @rolandtannous in #4781
- UI Changes by @danielhanchen in #4782
- fix(studio): improve tool-calling re-prompt for small models by @danielhanchen in #4783
- Pin Gemma-4 transformers requirement to 5.5.0 stable by @danielhanchen in #4784
- Switch llama.cpp default to mainline ggml-org by @danielhanchen in #4785
- Use transformers v5.5-release branch, pin to 5.5.0 by @danielhanchen in #4786
- Fix: pin transformers==4.57.6 in main Studio venv by @danielhanchen in #4788
- fix(studio): build llama.cpp from master for Gemma 4 support by @danielhanchen in #4790
- fix name fixed name by @rolandtannous in #4791
- fix(studio): prioritize curated defaults in Recommended model list by @danielhanchen in #4792
- fix windows llama.cpp compile from source issue by @mmathew23 in #4793
- fix(studio): pin llama.cpp to b8637 (Gemma 4 support) by @danielhanchen in #4796
- fix(studio): don't set trust_remote_code for Gemma 4 training by @danielhanchen in #4795
- fix(studio): revert llama.cpp default tag to latest by @danielhanchen in #4797
- fix(studio): suppress fatal error when ggml-org has no prebuilt manifest by @danielhanchen in #4799
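On #4757: the release notes don't describe the five estimation paths, but the baseline arithmetic for a standard full-attention transformer is easy to state, and it shows why the KV cache dominates VRAM at long context. A hypothetical sketch, with all model dimensions made up for illustration:

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    """Keys + values, one pair per layer:
    2 (K and V) * layers * kv_heads * head_dim * tokens * element size."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Made-up dimensions: a 32-layer model with 8 KV heads, head_dim 128,
# an 8192-token context, and an f16 (2-byte) cache.
gib = kv_cache_bytes(32, 8, 128, 8192, 2) / 2**30
print(f"{gib:.2f} GiB")  # 1.00 GiB
```

An architecture-aware estimator would branch where this flat formula breaks down, for example sliding-window layers that cap the number of cached tokens per layer.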
New Contributors
- @AdamPlatin123 made their first contribution in #4764
- @mateeaaaaaaa made their first contribution in #4768
Full Changelog: v0.1.3-beta...v0.1.35-beta