AMD's Lemonade Just Made Every Nvidia-Only AI Guide Obsolete
Search for "how to run LLMs locally" and count the Nvidia logos. CUDA this, CUDA that. If you own AMD hardware — and statistically, a lot of you do — the local AI ecosystem has treated you like a second-class citizen for years.
That just changed.
Lemonade is an open-source, AMD-backed local AI server that handles LLM chat, image generation, speech synthesis, and transcription — all from a single install, all running on your hardware, all private. It hit 216 points on Hacker News this week, and the discussion thread tells you everything about why AMD users are paying attention.
🍋 What Lemonade actually is: A 2MB native C++ service that auto-configures for your AMD GPU, NPU, or CPU. It exposes an OpenAI-compatible API at localhost:13305, meaning any app that talks to OpenAI (VS Code Copilot, Open WebUI, n8n, Continue, hundreds more) works out of the box — pointed at your own machine instead of the cloud. Zero tokens billed. Zero data leaving your network.
Why This Matters Right Now
The local AI movement has been building momentum for two years. Ollama proved the concept. LM Studio made it pretty. But both share a dirty secret: AMD support is an afterthought. ROCm drivers are a maze. Getting llama.cpp to build with the right GPU target is a weekend project. Most users give up.
Lemonade's value proposition is brutally simple: one install, it detects your hardware, it works.
"If you have an AMD machine and want to run local models with minimal headache… it's really the easiest method. This runs on my NAS, handles my home assistant setup." — HN commenter
But it's not just ease of use. Lemonade is the only open-source OpenAI-compatible server that offers AMD Ryzen AI NPU acceleration. That's a hardware advantage Nvidia literally cannot match — there is no Nvidia NPU in your laptop.
The Architecture: NPU + GPU Hybrid Execution
On Ryzen AI 300/400 series chips (Strix Point, Strix Halo), Lemonade splits the workload:
Prompt processing (prefill) → Offloaded to the NPU, which has superior compute throughput for this specific task. This minimizes Time To First Token (TTFT).
Token generation (decode) → Handed to the integrated GPU (iGPU) or discrete GPU, which has better memory bandwidth for sequential token generation.
This hybrid approach is why a Ryzen AI laptop can feel snappier than raw token-per-second numbers would suggest.
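The split above can be sketched as a back-of-envelope latency model. The prefill and decode rates below are illustrative assumptions for the sketch, not Lemonade measurements:

```python
# Toy latency model for hybrid NPU + GPU execution.
# Rates are hypothetical placeholders, not benchmarked numbers.

def time_to_first_token(prompt_tokens: int, prefill_rate: float) -> float:
    """TTFT is dominated by prefill, which runs on the NPU."""
    return prompt_tokens / prefill_rate

def generation_time(output_tokens: int, decode_rate: float) -> float:
    """Decode is memory-bandwidth bound, so it runs on the (i)GPU."""
    return output_tokens / decode_rate

# Assumed rates: NPU prefill at 400 tok/s, GPU decode at 20 tok/s.
ttft = time_to_first_token(prompt_tokens=1024, prefill_rate=400.0)
total = ttft + generation_time(output_tokens=256, decode_rate=20.0)
print(f"TTFT ~{ttft:.2f}s, total ~{total:.2f}s")
```

The point of the split: a faster prefill engine shrinks TTFT even when the decode rate is unchanged, which is exactly the "feels snappier" effect described above.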
Benchmarks: What Can You Actually Expect?
These are from AMD's own benchmarks on a Ryzen AI 9 HX 370 laptop (Radeon 890M, 32GB LPDDR5X-7500) running DeepSeek-R1-Distill-Llama-8B at INT4:
| Context Length | Time to First Token | Tokens/Second |
| --- | --- | --- |
| 128 tokens | 0.94s | 20.7 tok/s |
| 256 tokens | 1.14s | 20.5 tok/s |
| 512 tokens | 1.65s | 20.0 tok/s |
| 1024 tokens | 2.68s | 19.2 tok/s |
| 2048 tokens | 5.01s | 17.6 tok/s |
Those are integrated graphics numbers. Not a $1,500 discrete GPU — a laptop chip.
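One way to read that table: TTFT grows sublinearly with context, so effective prefill throughput actually rises with longer prompts. A quick calculation over the numbers above (context length and TTFT taken straight from the table):

```python
# Effective prefill throughput derived from the TTFT column above.
# Each row is (context_length_tokens, ttft_seconds).
rows = [(128, 0.94), (256, 1.14), (512, 1.65), (1024, 2.68), (2048, 5.01)]

for ctx, ttft in rows:
    print(f"{ctx:>5} tokens: ~{ctx / ttft:,.0f} tok/s prefill")
```

At 128 tokens the fixed startup overhead dominates (~136 tok/s), while at 2048 tokens prefill reaches roughly 409 tok/s, far above the ~18-21 tok/s decode rate.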
📊 Community benchmarks from Strix Halo (128GB): GPT-OSS 120B at ~50 tok/s • Qwen3-Coder-Next at 43 tok/s (Q4) • Qwen3.5 35B-A3B at 55 tok/s (Q4) • Qwen3.5 27B at 11-12 tok/s (Q4, dense architecture). Yes — a 120B parameter model running at 50 tokens/second on a desktop APU with no discrete GPU.
Setup: From Zero to Running in Under 5 Minutes
Windows (Recommended)
```shell
# 1. Download the installer from GitHub releases:
#    https://github.com/lemonade-sdk/lemonade/releases/latest
#    and run Lemonade_Server_Installer.exe

# 2. Select your models during installation.
#    The installer auto-detects your GPU/NPU.

# 3. Launch from the desktop shortcut — that's it.
#    The server runs at http://localhost:13305
```
Linux (Ubuntu/Fedora)
```shell
# Ubuntu (snap)
sudo snap install lemonade-server

# Fedora (RPM)
sudo dnf install lemonade-server

# Start the server
lemonade run Gemma-3-4b-it-GGUF
```
macOS (Beta)
```shell
# Install via the official installer:
# https://lemonade-server.ai/install_options.html#macos

lemonade run Gemma-3-4b-it-GGUF
```
Once running, pulling and switching models is dead simple:
```shell
# Browse available models
lemonade list

# Pull and run a model
lemonade pull Gemma-3-4b-it-GGUF
lemonade run Gemma-3-4b-it-GGUF

# Multi-modality
lemonade run SDXL-Turbo              # Image generation
lemonade run kokoro-v1               # Speech synthesis
lemonade run Whisper-Large-v3-Turbo  # Transcription
```
Connecting Apps: The OpenAI-Compatible Trick
Because Lemonade exposes an OpenAI-standard API, any app that supports custom OpenAI endpoints works immediately:
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:13305/api/v1",
    api_key="lemonade",  # required but unused
)

response = client.chat.completions.create(
    model="Llama-3.2-1B-Instruct-Hybrid",
    messages=[{"role": "user", "content": "Explain quantum computing"}],
)
print(response.choices[0].message.content)
```
That same endpoint works with VS Code Copilot, Open WebUI, Continue, n8n, and any OpenAI SDK in Python, Node.js, Go, Rust, C#, Java, Ruby, or PHP.
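Streaming works the same way, assuming Lemonade honors the OpenAI SDK's `stream=True` semantics (the norm for OpenAI-compatible servers). A small sketch with an offline demo, since the live call needs a running server:

```python
# Stream tokens as they arrive instead of waiting for the full reply.
from types import SimpleNamespace

def collect_deltas(chunks) -> str:
    """Echo each content delta of a chat-completions stream and join them."""
    parts = []
    for chunk in chunks:
        delta = chunk.choices[0].delta.content
        if delta:
            print(delta, end="", flush=True)
            parts.append(delta)
    return "".join(parts)

# With a running Lemonade server you would pass the real stream:
#   client = OpenAI(base_url="http://localhost:13305/api/v1", api_key="lemonade")
#   stream = client.chat.completions.create(
#       model="Llama-3.2-1B-Instruct-Hybrid",
#       messages=[{"role": "user", "content": "Explain quantum computing"}],
#       stream=True,
#   )
#   reply = collect_deltas(stream)

# Offline demo with fake chunks shaped like the SDK's stream objects:
fake = [SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=t))])
        for t in ["Hello", ", ", "world", None]]
assert collect_deltas(fake) == "Hello, world"
```

The `None` delta mimics the final chunk of a real stream, which carries a finish reason but no content.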
Lemonade vs. Ollama: The Honest Comparison
| Feature | Lemonade | Ollama |
| --- | --- | --- |
| Primary focus | AMD optimization + multi-modality | Cross-platform model serving |
| GPU support | ROCm (AMD), Vulkan, Metal (beta) | CUDA (Nvidia), ROCm, Metal |
| NPU support | ✅ XDNA2 (Ryzen AI 300/400) | ❌ None |
| Modalities | Chat, Vision, Image Gen, TTS, STT | Chat, Vision |
| API compatibility | OpenAI, Ollama, Anthropic | Ollama, OpenAI (partial) |
| Multiple models | ✅ Simultaneously | One at a time |
| Mobile app | ✅ iOS + Android | ❌ |
| Binary size | ~2MB (server) | ~200MB |
Bottom line: If you're on AMD hardware, Lemonade is the better choice. If you need Nvidia CUDA support or the simplest possible cross-platform install, Ollama is still the safer bet.
One HN user ran a direct comparison on an M1 Max MacBook:
"Model: qwen3.59b. Ollama completed in about 1:44. Lemonade completed in about 1:14. So it seems faster in this very limited test."
The NPU Question: Is It Worth It?
What NPUs are good for:

- Low-power "always-on" inference for small models (1-4B parameters)
- Accelerating prompt processing (prefill) in hybrid mode
- Running AI tasks without touching your GPU

What NPUs are NOT good for (yet):

- Running large models (>10B parameters)
- Matching discrete GPU speeds for raw token generation
⚠️ NPU reality check: The NPU kernels used by Lemonade's FastFlowLM backend are proprietary (free for reasonable commercial use). The llama.cpp GPU path remains fully open. If you're on a Strix Halo with 128GB RAM, the GPU path is fast enough that NPU acceleration is a nice-to-have.
What's Coming Next
The Lemonade roadmap is active:
- MLX support — for better Apple Silicon performance
- vLLM support — for high-throughput serving
- Enhanced custom model support — easier GGUF/ONNX imports from Hugging Face
With Ubuntu 26.04 LTS adding native AMD NPU support and Lemonade 10.0 shipping Linux NPU support via FastFlowLM, Linux users are getting first-class treatment too.
The Bigger Picture
The llama.cpp creator Georgi Gerganov just joined Hugging Face — a consolidation event for the open-source local AI stack. Google's TurboQuant paper demonstrated KV cache compression to 3 bits, potentially slashing memory requirements. The infrastructure for running capable AI on consumer hardware is converging fast.
"I find it very frustrating to get LLMs, diffusion, etc. working fast on AMD. It's way too much work." — HN commenter, explaining exactly why Lemonade exists
Lemonade exists because that frustration is real, widespread, and fixable. If you've got AMD silicon, give it a shot. The install is a few minutes, the API is standard, and the models are free.
Links:
- Lemonade Server — Official site
- GitHub Repository — Source code + releases
- AMD Developer Article — Technical deep-dive
- Hacker News Discussion — Community reactions
Originally published on ComputeLeap