Running local models on Macs gets faster with Ollama's MLX support
Ollama, a runtime for running large language models on local hardware, has introduced support for Apple's open source MLX machine learning framework. Ollama also says it has improved caching performance and now supports Nvidia's NVFP4 format for model compression, making memory usage far more efficient for certain models.
Combined, these developments promise significantly improved performance on Macs with Apple Silicon chips (M1 or later)—and the timing couldn't be better, as local models are gaining traction beyond researcher and hobbyist communities for the first time.
The recent runaway success of OpenClaw—which raced to over 300,000 stars on GitHub, made headlines with experiments like Moltbook, and became an obsession in China in particular—has many people experimenting with running models on their own machines.
As developers get frustrated with rate limits and the high cost of top-tier subscriptions to tools like Claude Code or ChatGPT Codex, experimentation with local coding models has heated up. (Ollama also expanded Visual Studio Code integration recently.)
The new support is available in preview (in Ollama 0.19) and currently works with only one model: the 35 billion-parameter variant of Alibaba's Qwen3.5. Hardware requirements are intense by normal users' standards. Users need an Apple Silicon-equipped Mac, sure, but they also need at least 32GB of RAM, according to Ollama's announcement.
Ars Technica AI
https://arstechnica.com/apple/2026/03/running-local-models-on-macs-gets-faster-with-ollamas-mlx-support/
