F1 Built the Perfect Model. Then the Cars Went Racing.

More about

model

Qwen 27b and Other Dense Models Optimization

Hi All, I hadn't realized the kv cache quant made such a big difference, so I took my 64 gig mac M2 Max Studio and switched from Qwen 3.5 35b a3b to the dense 27b. I love it, it's a huge difference, but I get maybe 3 tokens a second. I have kv cache at q8, offload to gpu, flash attention, mmap, max concurrent 4, eval batch 2048, cpu set to 8, gpu offload full (64). I'm on LM Studios and run everything through Openclaw. Just wondering if there's anything I can do to speed it up. The output is wonderful, but man the slow speed causes some issues, especially for my scheduled jobs, even when I adjust them. If a heartbeat runs up against a regular message I'm f'd, Any tips would be greatly appreciated. submitted by /u/Jordanthecomeback [link] [comments]

Reddit r/LocalLLaMA

1mabout 3 hours ago

ModelsLive

Perplexity has a handful of MIT licensed embedding models

submitted by /u/richardanaya [link] [comments]

Reddit r/LocalLLaMA

1mabout 2 hours ago

Open Source AILive

Qwen 3.5 Tool Calling Fixes for Agentic Use: What's Broken, What's Fixed, What You (may) Still Need

Posted - What follows after this introduction is generated by Claude Opus 4.6 after hundreds of back and forths with log analysis for tool calls that were not working, and Qwen 3.5 models getting confused from local llm providers as well as Nano-Gpt. I fixed it for my own use with Pi coding agent at the time. Some of the fixes that were needed are no longer needed (TLDR at the bottom) but most are still applicable, as validated today. If you use Qwen 3.5 models and are having issues with model performance, tool calls, or general instability, the reference below might be a useful read. In the end, the fixes below on pi coding agent + llamacpp + Bartowski's quants (for stability) is what took my experience to 99% reliability and quality with all Qwen 3.5 models (Q5_k_L). Hope it helps someon

Reddit r/LocalLLaMA

4mabout 2 hours ago

Knowledge Map

TopicsEntitiesSource

Connected Articles — Knowledge Graph

This article is connected to other articles through shared AI topics and tags.

Knowledge Graph100 articles · 169 connections

Scroll to zoom · drag to pan · click to open

More in Models

ModelsFresh

Qwen 27b and Other Dense Models Optimization

Hi All, I hadn't realized the kv cache quant made such a big difference, so I took my 64 gig mac M2 Max Studio and switched from Qwen 3.5 35b a3b to the dense 27b. I love it, it's a huge difference, but I get maybe 3 tokens a second. I have kv cache at q8, offload to gpu, flash attention, mmap, max concurrent 4, eval batch 2048, cpu set to 8, gpu offload full (64). I'm on LM Studios and run everything through Openclaw. Just wondering if there's anything I can do to speed it up. The output is wonderful, but man the slow speed causes some issues, especially for my scheduled jobs, even when I adjust them. If a heartbeat runs up against a regular message I'm f'd, Any tips would be greatly appreciated. submitted by /u/Jordanthecomeback [link] [comments]