ml-explore/mlx-lm
Run LLMs with MLX.
MLX LM is a Python package for generating text and fine-tuning large language models on Apple silicon with MLX.
Some key features include:
- Integration with the Hugging Face Hub to easily use thousands of LLMs with a single command.
- Support for quantizing and uploading models to the Hugging Face Hub.
- Low-rank and full model fine-tuning with support for quantized models.
- Distributed inference and fine-tuning with mx.distributed.
The easiest way to get started is to install the mlx-lm package:
With pip:
pip install mlx-lm
With conda:
conda install -c conda-forge mlx-lm
Quick Start
To generate text with an LLM use:
mlx_lm.generate --prompt "How tall is Mt Everest?"
To chat with an LLM use:
mlx_lm.chat
This will give you a chat REPL that you can use to interact with the LLM. The chat context is preserved during the lifetime of the REPL.
Commands in mlx-lm typically take command line options which let you specify the model, sampling parameters, and more. Use -h to see a list of available options for a command, e.g.:
mlx_lm.generate -h
The default model for generation and chat is mlx-community/Llama-3.2-3B-Instruct-4bit. You can specify any MLX-compatible model with the --model flag. Thousands are available in the MLX Community Hugging Face organization.
Python API
You can use mlx-lm as a module:
```python
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")

prompt = "Write a story about Einstein"
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True
)

text = generate(model, tokenizer, prompt=prompt, verbose=True)
```
To see a description of all the arguments you can do:
>>> help(generate)
Check out the generation example to see how to use the API in more detail. Check out the batch generation example to see how to efficiently generate continuations for a batch of prompts.
The mlx-lm package also comes with functionality to quantize and optionally upload models to the Hugging Face Hub.
You can convert models using the Python API:
```python
from mlx_lm import convert

repo = "mistralai/Mistral-7B-Instruct-v0.3"
upload_repo = "mlx-community/My-Mistral-7B-Instruct-v0.3-4bit"

convert(repo, quantize=True, upload_repo=upload_repo)
```
This will generate a 4-bit quantized Mistral 7B and upload it to the repo mlx-community/My-Mistral-7B-Instruct-v0.3-4bit. It will also save the converted model in the path mlx_model by default.
To see a description of all the arguments you can do:
>>> help(convert)
Streaming
For streaming generation, use the stream_generate function. This yields a generation response object.
For example,
```python
from mlx_lm import load, stream_generate

repo = "mlx-community/Mistral-7B-Instruct-v0.3-4bit"
model, tokenizer = load(repo)

prompt = "Write a story about Einstein"
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True
)

for response in stream_generate(model, tokenizer, prompt, max_tokens=512):
    print(response.text, end="", flush=True)
print()
```
Sampling
The generate and stream_generate functions accept sampler and logits_processors keyword arguments. A sampler is any callable which accepts a possibly batched logits array and returns an array of sampled tokens. The logits_processors must be a list of callables which take the token history and current logits as input and return the processed logits. The logits processors are applied in order.
Some standard sampling functions and logits processors are provided in mlx_lm.sample_utils.
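To illustrate the callable contract described above, here is a minimal sketch of a logits processor (a simple repetition penalty) and a greedy sampler. It uses plain Python lists purely for illustration; the real samplers and processors in mlx_lm.sample_utils operate on mx.array values, and the names here are hypothetical:

```python
# Sketch of the logits-processor and sampler contract used by generate and
# stream_generate. Plain Python lists stand in for MLX arrays; this is an
# illustration of the interface, not code from mlx_lm.sample_utils.

def repetition_penalty(history, logits, penalty=1.5):
    """Scale down the logits of tokens that already appear in the history."""
    out = list(logits)
    for tok in set(history):
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out

def greedy_sampler(logits):
    """A sampler maps logits to a sampled token; greedy picks the argmax."""
    return max(range(len(logits)), key=lambda i: logits[i])

history = [2, 2, 0]
logits = [1.0, -0.5, 3.0, 2.4]
processed = repetition_penalty(history, logits)  # token 2's logit drops to 2.0
token = greedy_sampler(processed)                # token 3 now beats token 2
```

Processors are applied in order to the raw logits, and the (possibly processed) logits are then handed to the sampler to pick the next token.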
Command Line
You can also use mlx-lm from the command line with:
mlx_lm.generate --model mistralai/Mistral-7B-Instruct-v0.3 --prompt "hello"
This will download a Mistral 7B model from the Hugging Face Hub and generate text using the given prompt.
For a full list of options run:
mlx_lm.generate --help
To quantize a model from the command line run:
mlx_lm.convert --model mistralai/Mistral-7B-Instruct-v0.3 -q
For more options run:
mlx_lm.convert --help
You can upload new models to Hugging Face by specifying --upload-repo to convert. For example, to upload a quantized Mistral-7B model to the MLX Hugging Face community you can do:
```shell
mlx_lm.convert \
    --model mistralai/Mistral-7B-Instruct-v0.3 \
    -q \
    --upload-repo mlx-community/my-4bit-mistral
```

Models can also be converted and quantized directly in the mlx-my-repo Hugging Face Space.
Long Prompts and Generations
mlx-lm has some tools to scale efficiently to long prompts and generations:
- A rotating fixed-size key-value cache.
- Prompt caching.
To use the rotating key-value cache, pass the argument --max-kv-size n, where n is the maximum cache size. Smaller values (e.g. 512) use very little RAM but can degrade quality; larger values (e.g. 4096 or higher) use more RAM but give better quality.
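Conceptually, a rotating fixed-size cache behaves like a ring buffer over key-value entries: once full, the oldest entries are overwritten, which bounds memory at the cost of discarding distant context. A minimal sketch of the idea (illustrative only, not mlx-lm's actual implementation):

```python
# Illustrative sketch of a rotating fixed-size key-value cache. Once the cache
# reaches max_kv_size entries, the oldest entries are evicted, so RAM stays
# bounded regardless of generation length. Not the mlx-lm implementation.
from collections import deque

class RotatingKVCache:
    def __init__(self, max_kv_size):
        # deque with maxlen silently drops the oldest items on overflow
        self.entries = deque(maxlen=max_kv_size)

    def update(self, keys_values):
        self.entries.extend(keys_values)

    def __len__(self):
        return len(self.entries)

cache = RotatingKVCache(max_kv_size=4)
cache.update([("k1", "v1"), ("k2", "v2"), ("k3", "v3")])
cache.update([("k4", "v4"), ("k5", "v5")])  # ("k1", "v1") is evicted
```

This is why small values of --max-kv-size trade quality for memory: tokens attended to via evicted entries are simply no longer visible to the model.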
Caching prompts can substantially speed up reuse of the same long context with different queries. To cache a prompt, use mlx_lm.cache_prompt. For example:
```shell
cat prompt.txt | mlx_lm.cache_prompt \
    --model mistralai/Mistral-7B-Instruct-v0.3 \
    --prompt - \
    --prompt-cache-file mistral_prompt.safetensors
```

Then use the cached prompt with mlx_lm.generate:
```shell
mlx_lm.generate \
    --prompt-cache-file mistral_prompt.safetensors \
    --prompt "\nSummarize the above text."
```

The cached prompt is treated as a prefix to the supplied prompt. Also note that when using a cached prompt, the model is read from the cache and need not be supplied explicitly.
Prompt caching can also be used in the Python API in order to avoid recomputing the prompt. This is useful in multi-turn dialogues or across requests that use the same context. See the example for more usage details.
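The saving comes from processing the long shared context once and reusing that state, so each query pays only for its own suffix tokens. A toy sketch of this accounting (a counter stands in for the model's key-value cache; all names are illustrative, not the mlx-lm API):

```python
# Conceptual sketch of why prompt caching helps: the state built while
# processing a long shared context is computed once and reused, so each query
# only processes its new suffix tokens. A counter stands in for the real
# key-value cache; this is not the mlx-lm API.

class PromptCache:
    def __init__(self):
        self.tokens_processed = 0

    def process(self, tokens):
        # In a real model this would run the transformer and store K/V tensors.
        self.tokens_processed += len(tokens)
        return self.tokens_processed

long_context = list(range(10_000))
cache = PromptCache()
cache.process(long_context)       # paid once for the shared context

work_before = cache.tokens_processed
cache.process([1, 2, 3])          # a query only processes its 3 suffix tokens
suffix_cost = cache.tokens_processed - work_before
```

Without the cache, every query would re-pay the full 10,000-token context; with it, each query costs only its suffix.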
Supported Models
mlx-lm supports thousands of LLMs available on the Hugging Face Hub. If the model you want to run is not supported, file an issue or better yet, submit a pull request. Many supported models are available in various quantization formats in the MLX Community Hugging Face organization.
For some models the tokenizer may require you to enable the trust_remote_code option. You can do this by passing --trust-remote-code in the command line. If you don't specify the flag explicitly, you will be prompted to trust remote code in the terminal when running the model.
Tokenizer options can also be set in the Python API. For example:
```python
model, tokenizer = load(
    "qwen/Qwen-7B",
    tokenizer_config={"eos_token": "<|endoftext|>", "trust_remote_code": True},
)
```

Large Models
Note
This requires macOS 15.0 or higher to work.
Models which are large relative to the machine's total RAM can be slow. mlx-lm will attempt to speed them up by wiring the memory occupied by the model and cache.
If you see the following warning message:
[WARNING] Generating with a model that requires ...
then the model will likely be slow on the given machine. If the model fits in RAM then it can often be sped up by increasing the system wired memory limit. To increase the limit, set the following sysctl:
sudo sysctl iogpu.wired_limit_mb=N
The value N should be larger than the size of the model in megabytes but smaller than the memory size of the machine.