Gemma 4 running on Raspberry Pi 5
To be specific: a Raspberry Pi 5 (8GB) with an SSD (though the speed is the same on the non-SSD unit), running Potato OS with the latest llama.cpp branch compiled from source. The model is Gemma 4 E2B, the Unsloth variety. submitted by /u/jslominski
Read on Reddit r/LocalLLaMA → https://www.reddit.com/r/LocalLLaMA/comments/1sarlb8/gemma_4_running_on_raspberry_pi5/

Monarch v3: 78% Faster LLM Inference with NES-Inspired KV Paging
TL;DR: We implemented NES-inspired memory paging for transformers. On a 1.1B parameter model, inference is now 78% faster (17.01 → 30.42 tok/sec) with nearly zero VRAM overhead. The algorithm is open source, fully benchmarked, and ready to use.

The Problem

KV cache grows linearly with sequence length. By 4K tokens, most of it sits unused—recent tokens matter far more than old ones, yet we keep everything in VRAM at full precision. Standard approaches (quantization, pruning, distillation) are invasive. We wanted something simpler: just move the old stuff out of the way.

The Solution: NES-Inspired Paging

Think of it like a Game Boy's memory banking system. The cache is split into a hot region (recent tokens, full precision) and a cold region (older tokens, compressed). As new tokens arrive,
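A minimal sketch of the hot/cold split as described above. The class name, the demotion policy, and the 8-bit per-vector compression scheme are my own illustration under the post's stated assumptions, not Monarch's actual implementation:

```python
from collections import deque

class PagedKVCache:
    """Toy hot/cold KV cache: recent entries stay at full precision,
    older ones are demoted to a compressed (8-bit) cold region."""

    def __init__(self, hot_capacity=4):
        self.hot_capacity = hot_capacity
        self.hot = deque()   # recent token vectors, full precision
        self.cold = []       # (quantized ints, scale) pairs for old tokens

    def append(self, kv):
        self.hot.append(list(kv))
        # Once the hot region overflows, demote the oldest entry:
        # quantize it to 8-bit integers with a per-vector scale.
        while len(self.hot) > self.hot_capacity:
            old = self.hot.popleft()
            scale = max((abs(x) for x in old), default=0.0) / 127.0 or 1.0
            self.cold.append(([round(x / scale) for x in old], scale))

    def full_cache(self):
        """Reconstruct the whole cache; cold entries are dequantized."""
        return [[q * s for q in qs] for qs, s in self.cold] + \
               [list(v) for v in self.hot]
```

The point of the design is that only `hot_capacity` vectors ever occupy full-precision storage, so VRAM for the cache stops growing with sequence length while old tokens remain recoverable (lossily) from the cold region.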

Quantizers appreciation post
Hey everyone, yesterday I decided to learn how to quantize GGUFs myself with reasonable quality, in order to understand the magic behind the curtain. Holy... I did not expect how much work it is, how long it takes, or that it requires a LOT (500GB!) of storage space for just Gemma-4-26B-A4B in various sizes. There really is an art to configuring them too, with variations between architectures and quant types. Thanks to Unsloth releasing their imatrix file and Hugging Face showing the weight types in their viewer, I managed to cobble something together without LLM assistance. I ran into a few hiccups and some of the information is a bit confusing, so I documented my process in the hopes of making it easier for someone else to learn and experiment. My recipe and full setup guide can be f
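For readers wondering what the imatrix the post mentions actually buys you: it is an importance matrix that weights quantization error by how much each weight's activations matter, so the quantizer spends its limited precision where it hurts least. A toy sketch of that idea (the brute-force scale search and all names here are my own illustration, not llama.cpp's actual quantization code):

```python
def quantize_4bit(weights, importance=None, n_scales=64):
    """Pick a 4-bit scale that minimizes (importance-weighted)
    squared reconstruction error. Returns (ints in [-8, 7], scale)."""
    imp = importance or [1.0] * len(weights)
    wmax = max(abs(w) for w in weights) or 1.0
    best = None
    for k in range(1, n_scales + 1):
        scale = wmax / 7.0 * k / n_scales   # candidate scales to try
        q = [max(-8, min(7, round(w / scale))) for w in weights]
        err = sum(i * (w - qi * scale) ** 2
                  for w, qi, i in zip(weights, q, imp))
        if best is None or err < best[0]:
            best = (err, q, scale)
    return best[1], best[2]
```

With a uniform importance vector this is plain round-to-nearest with a searched scale; passing a skewed importance vector biases the chosen scale toward reconstructing the high-importance weights accurately, which is roughly the role the imatrix plays in real GGUF quantization.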
More in Open Source AI

trunk/6c6e22937db24fe8c7b74452a6d3630c65d1c8b8: Revert "Remove TRITON=yes from CPU-only GCC11 docker configs (#179314)"
This reverts commit 670be7c. Reverted #179314 on behalf of https://github.com/izaitsevfb: reverted automatically by PyTorch's autorevert; to avoid this behaviour, add the tag "autorevert: disable".


