Running 1bit Bonsai 8B on 2GB VRAM (MX150 mobile GPU)
I have an older laptop from ~2018, an Asus Zenbook UX430U. It was quite powerful in its time: an i7-8550U CPU @ 1.80GHz (4 physical cores, plus an Intel iGPU), 16GB RAM, and an additional NVIDIA MX150 GPU with 2GB VRAM. I think the dedicated GPU was intended for CAD applications, Photoshop filters, or the like; it is definitely not a gaming laptop. I'm using Linux Mint with the Cinnamon desktop on the iGPU only, leaving the MX150 free for other uses. I never thought I would run LLMs on this machine, though I've occasionally used the MX150 to train small PyTorch or TensorFlow models; it is maybe 3 times faster than the CPU alone. However, when the 1-bit Bonsai 8B model was released, I couldn't resist trying to run it on this GPU. So I took the llama.cpp fork from PrismML, compil…
Could not retrieve the full article text.
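Even truncated, the premise checks out on paper. A rough back-of-the-envelope sketch of the weight-memory arithmetic (assuming pure 1-bit packing; real 1-bit quantization formats carry extra per-block scale overhead, so treat these as lower bounds):

```python
# Back-of-the-envelope VRAM math for an 8B-parameter model.
# Assumes pure bit-packing with no per-block scales or zero
# points - real 1-bit formats store somewhat more than this.

def packed_weight_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

fp16 = packed_weight_gb(8e9, 16)    # ~16 GB: hopeless on an MX150
one_bit = packed_weight_gb(8e9, 1)  # ~1 GB: leaves VRAM for KV cache
print(f"fp16: {fp16:.1f} GB, 1-bit: {one_bit:.1f} GB")
```

At ~1GB of weights, a 2GB card plausibly has room left for the KV cache and activations, which is presumably why this experiment was worth trying at all.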
Read on Reddit r/LocalLLaMA → https://www.reddit.com/r/LocalLLaMA/comments/1sbnf8y/running_1bit_bonsai_8b_on_2gb_vram_mx150_mobile/

OpenAI’s AGI boss is taking a leave of absence
OpenAI is undergoing another round of C-suite changes, according to an internal memo viewed by The Verge. Fidji Simo, OpenAI's CEO of AGI deployment - who was until recently the company's CEO of Applications - says in the memo that she will be stepping away on medical leave "for the next several weeks" due to […]

Gemma 4 is a KV_cache Pig
Ignoring the 8-bit size of Nvidia's marketed 4-bit quantization of the dense model… the dense model's KV cache uses 3x or more memory than I have seen with other models. It seems like the big choice was a 256 head dim instead of 128. I am looking at 490KB of KV cache per token at 8-bit, versus 128KB on Qwen3. I am running the Nvidia weights at 4-bit on an RTX Pro 6000 with 96GB of VRAM and an 8-bit KV cache, and still only have room for 115k tokens. I was surprised, is all. The model scales well in vLLM and seems quite smart. submitted by /u/IngeniousIdiocy
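The head-dim complaint is easy to sanity-check. Per-token KV cache size scales linearly with head dimension, so going from 128 to 256 doubles it, all else equal. A minimal sketch (the layer and KV-head counts below are illustrative assumptions, not published Gemma 4 numbers):

```python
# Per-token KV cache footprint. 2x accounts for storing both
# K and V. Layer/head counts are made-up placeholders; the point
# is the linear dependence on head_dim.

def kv_bytes_per_token(n_layers: int, n_kv_heads: int,
                       head_dim: int, bytes_per_elem: int = 1) -> int:
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem

small = kv_bytes_per_token(48, 8, 128)  # hypothetical 128-dim config
big = kv_bytes_per_token(48, 8, 256)    # same config at 256-dim
print(big / small)  # -> 2.0
```

The poster's numbers are also internally consistent: 115k tokens at 490KB each is roughly 54GB of KV cache, which plausibly fills what's left of 96GB after 4-bit weights.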

Google's Gemma 4 AI can run on smartphones, no Internet required
The two largest Gemma 4 models – 26B Mixture of Experts and 31B Dense – require an 80GB Nvidia H100 GPU to run unquantized in bfloat16 format. Google claims these models deliver "frontier intelligence on personal computers" for students, researchers, and developers, providing advanced reasoning capabilities for IDEs, coding assistants, and agentic workflows.
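The 80GB figure follows directly from the parameter counts: bfloat16 stores 2 bytes per parameter, so weights alone for the 31B dense model are about 62GB, leaving the remainder of an H100 for activations and KV cache. A quick check:

```python
# Why an unquantized 31B model calls for an 80GB H100:
# bf16 = 2 bytes/parameter, weights only (no KV cache or
# activation memory included).

def bf16_weight_gb(n_params: float) -> float:
    return n_params * 2 / 1e9

print(bf16_weight_gb(31e9))  # 62.0 GB -> fits in 80 GB with headroom
print(bf16_weight_gb(26e9))  # 52.0 GB for the MoE's total parameters
```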
More in Models

Built a zero-allocation, header-only C++ Qwen tokenizer that is nearly 20x faster than OpenAI Tiktoken
I'm into HPC and static, zero-allocation, zero-dependency C++ software. I was studying BPE tokenizers and how they work, so I decided to build this project: a hardcoded Qwen tokenizer for LLM developers. I know that the tokenization phase of LLM inference accounts for less than 2% of total time, so it is practically negligible, but I just "love" this kind of programming; it's an educational project for me to learn and build some intuition. Surprisingly, after combining multiple optimization techniques, it scored really high numbers in benchmarks. I thought it was a fluke at first, tried different tests, and so far it completely holds up. On a 12-thread Ryzen 5 3600 desktop CPU, with a 1 GB English text corpus: - My Frokenizer: 1009 MB/s - OpenAI Tiktoken: ~50 MB/s Fo…
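For readers unfamiliar with what a BPE encoder actually does, here is an educational sketch of the greedy merge loop the post is describing — this is not Frokenizer's implementation (that is zero-allocation C++ with the real Qwen merge table hardcoded), and the merge list below is made up for illustration:

```python
# Educational BPE encode loop: repeatedly merge the adjacent
# token pair with the best (lowest-index) rank in the merge
# table until no listed pair remains. Merges here are toy data.

def bpe_encode(word: str, merges: list[tuple[str, str]]) -> list[str]:
    rank = {pair: i for i, pair in enumerate(merges)}
    tokens = list(word)
    while len(tokens) > 1:
        # score every adjacent pair; inf means "not in the table"
        pairs = [(rank.get((a, b), float("inf")), i)
                 for i, (a, b) in enumerate(zip(tokens, tokens[1:]))]
        best_rank, i = min(pairs)
        if best_rank == float("inf"):
            break  # no applicable merges left
        tokens[i:i + 2] = [tokens[i] + tokens[i + 1]]
    return tokens

merges = [("l", "o"), ("lo", "w"), ("e", "r")]
print(bpe_encode("lower", merges))  # -> ['low', 'er']
```

The quadratic rescan per merge is exactly the kind of thing a performance-oriented implementation replaces with precomputed ranks and in-place buffers, which is where the zero-allocation C++ angle comes in.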


