Models model benchmark quantization unsloth

My biggest Issue with the Gemma-4 Models is the Massive KV Cache!!

Reddit r/LocalLLaMAby /u/Iory1998 https://www.reddit.com/user/Iory1998April 3, 20261 min read1 views

I mean, I have 40GB of Vram and I still cannot fit the entire Unsloth Gemma-4-31B-it-UD-Q8 (35GB) even at 2K context size unless I quantize KV to Q4 with 2K context size? WTF? For comparison, I can fit the entire UD-Q8 Qwen3.5-27B at full context without KV quantization! If I have to run a Q4 Gemma-4-31B-it-UD with a Q8 KV cache, then I am better off just using Qwen3.5-27B. After all, the latter beats the former in basically all benchmarks. What's your experience with the Gemma-4 models so far? submitted by /u/Iory1998 [link] [comments]

Could not retrieve the full article text.

Read on Reddit r/LocalLLaMA →

Original source

Reddit r/LocalLLaMA

https://www.reddit.com/r/LocalLLaMA/comments/1sbe40t/my_biggest_issue_with_the_gemma4_models_is_the/

Was this article helpful?

Ask AI about this article

Ready

Conversation starters

Ask anything about this article…

Daily AI Digest

Get the top 5 AI stories delivered to your inbox every morning.

More about

modelbenchmarkquantization

ProductsLive

AI News This Week: April 05, 2026 - A New Era of Rapid Development and Multimodal Intelligence

AI News This Week: April 05, 2026 - A New Era of Rapid Development and Multimodal Intelligence Published: April 05, 2026 | Reading time: ~10 min This week has been nothing short of phenomenal for the AI community, with breakthroughs and announcements that promise to revolutionize the way we develop and interact with artificial intelligence. From building personal AI agents in a matter of hours to the unveiling of cutting-edge multimodal intelligence models, the pace of innovation is not just accelerating - it's transforming the landscape of what's possible. Whether you're a seasoned developer or just starting to explore the world of AI, this week's news is a must-know, offering insights into how technology is making AI more accessible, powerful, and integrated into our daily lives. Buildin

Dev.to AI

6m34 minutes ago

Open Source AILive

Untitled

You have 50 models. Each trained on different data, different domain, different patient population. You want them to get smarter from each other. So you do the obvious thing — you set up a central aggregator. Round 1: gradients in, averaged weights out. Works fine at N=5. At N=20 you notice the coordinator is sweating. At N=50, round latency has tripled, your smallest sites are timing out, and your bandwidth budget is gone. You tune the hell out of it. Same ceiling. This is not a configuration problem. This is an architecture ceiling. The math underneath it guarantees you hit a wall. A different architecture changes the math. The combinatorics you are not harvesting Start with a fact that has nothing to do with any particular framework: N agents have exactly N(N-1)/2 unique pairwise relati

Dev.to AI

10m38 minutes ago

ProductsLive

This Week in AI: April 05, 2026 - Revolutionizing Development with Personal Agents and Multimodal Intelligence

This Week in AI: April 05, 2026 - Revolutionizing Development with Personal Agents and Multimodal Intelligence Published: April 05, 2026 | Reading time: ~10 min This week has been incredibly exciting for AI enthusiasts and developers alike. With advancements in personal AI agents, multimodal intelligence, and compact models for enterprise documents, the field is rapidly evolving. One of the most significant trends is the ability to build and deploy useful AI prototypes in a remarkably short amount of time. This shift is largely due to innovative tools and ecosystems that are making AI more accessible to individual builders. In this article, we'll dive into the latest AI news, exploring what these developments mean for developers and the broader implications for the industry. Building a Per

Dev.to AI

5m34 minutes ago

Knowledge Map

TopicsEntitiesSource

Connected Articles — Knowledge Graph

This article is connected to other articles through shared AI topics and tags.

Knowledge Graph100 articles · 239 connections

Scroll to zoom · drag to pan · click to open

Discussion

No comments yet — be the first to share your thoughts!

More in Models

ModelsFresh

research-llm-apis 2026-04-04

Release: research-llm-apis 2026-04-04 I'm working on a major change to my LLM Python library and CLI tool. LLM provides an abstraction layer over hundreds of different LLMs from dozens of different vendors thanks to its plugin system, and some of those vendors have grown new features over the past year which LLM's abstraction layer can't handle, such as server-side tool execution. To help design that new abstraction layer I had Claude Code read through the Python client libraries for Anthropic, OpenAI, Gemini and Mistral and use those to help craft curl commands to access the raw JSON for both streaming and non-streaming modes across a range of different scenarios. Both the scripts and the captured outputs now live in this new repo. Tags: llm , apis , json , llms

Simon Willison Blog

1mabout 6 hours ago

ModelsFresh

scan-for-secrets 0.1

Release: scan-for-secrets 0.1 I like publishing transcripts of local Claude Code sessions using my claude-code-transcripts tool but I'm often paranoid that one of my API keys or similar secrets might inadvertently be revealed in the detailed log files. I built this new Python scanning tool to help reassure me. You can feed it secrets and have it scan for them in a specified directory: uvx scan-for-secrets $OPENAI_API_KEY -d logs-to-publish/ If you leave off the -d it defaults to the current directory. It doesn't just scan for the literal secrets - it also scans for common encodings of those secrets e.g. backslash or JSON escaping, as described in the README . If you have a set of secrets you always want to protect you can list commands to echo them in a ~/.scan-for-secrets.conf.sh file. Mi

Simon Willison Blog

2mabout 3 hours ago

ModelsLive

Harvard Proved Emotions Don't Make AI Smarter — That's Exactly Why You Need Soul Spec

The Myth Dies Hard "I'll tip you $200 if you get this right." "This is really important to my career." "I'm so frustrated — please help me." If you've spent any time on AI Twitter, you've seen people swear that emotional prompting makes LLMs perform better. A few anecdotal successes became gospel. The technique spread. Now Harvard has the data. It doesn't work. What the Research Actually Shows A team from Harvard and Bryn Mawr ( arXiv:2604.02236 , April 2026) ran a systematic study across 6 benchmarks, 6 emotions, 3 models (Qwen3-14B, Llama 3.3-70B, DeepSeek-V3.2), and multiple intensity levels. Finding 1: Fixed emotional prefixes have negligible effect. Adding "I'm angry about this" or "This makes me so happy" before your prompt? Across GSM8K, BIG-Bench Hard, MedQA, BoolQ, OpenBookQA, and

Dev.to AI

4m32 minutes ago

ModelsLive

Self-Improving Python Scripts with LLMs: My Journey

As a developer, I've always been fascinated by the idea of self-improving code. Recently, I've been experimenting with using Large Language Models (LLMs) to make my Python scripts more autonomous and efficient. In this article, I'll share my experience with integrating LLMs into my Python workflow and how it has revolutionized my development process. I'll also provide a step-by-step guide on how to get started with making your own Python scripts improve themselves using LLMs. My journey with LLMs began when I stumbled upon the llm_groq module, which allows you to interact with LLMs using a simple and intuitive API. I was impressed by the accuracy and speed of the model, and I quickly realized that it could be used to improve my Python scripts. The first step in making my scripts self-impro

Dev.to AI

4m24 minutes ago