🔥 microsoft/BitNet
Official inference framework for 1-bit LLMs — Trending on GitHub today with 84 new stars.
bitnet.cpp
Try it out via this demo, or build and run it on your own CPU or GPU.
bitnet.cpp is the official inference framework for 1-bit LLMs (e.g., BitNet b1.58). It offers a suite of optimized kernels that support fast and lossless inference of 1.58-bit models on CPU and GPU (NPU support is coming next).
The first release of bitnet.cpp supports inference on CPUs. bitnet.cpp achieves speedups of 1.37x to 5.07x on ARM CPUs, with larger models experiencing greater performance gains. Additionally, it reduces energy consumption by 55.4% to 70.0%, further boosting overall efficiency. On x86 CPUs, speedups range from 2.37x to 6.17x, with energy reductions between 71.9% and 82.2%. Furthermore, bitnet.cpp can run a 100B BitNet b1.58 model on a single CPU, achieving speeds comparable to human reading (5-7 tokens per second), significantly enhancing the potential for running LLMs on local devices. Please refer to the technical report for more details.
The latest optimization introduces parallel kernel implementations with configurable tiling and embedding quantization support, achieving 1.15x to 2.1x additional speedup over the original implementation across different hardware platforms and workloads. For detailed technical information, see the optimization guide.
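To make the "1.58-bit" label concrete: a weight that can take three values (-1, 0, +1) carries log2(3) ≈ 1.58 bits of information. The snippet below is a minimal sketch, assuming an absmean-style quantizer (scale by the mean absolute weight, round, clip to [-1, 1]). This README does not describe the quantizer, so treat the function name and the `eps` parameter as illustrative rather than as bitnet.cpp's implementation.

```python
import math

def absmean_ternary(weights, eps=1e-8):
    """Quantize floats to {-1, 0, +1} using an absmean scale (illustrative only)."""
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    quantized = [max(-1, min(1, round(w / scale))) for w in weights]
    return quantized, scale

# Three possible states per weight -> log2(3) bits of information.
bits_per_weight = math.log2(3)

weights = [0.9, -0.05, -1.3, 0.4, 0.02, -0.7]
q, scale = absmean_ternary(weights)

print(q)                          # ternary weights
print(round(bits_per_weight, 2))  # information content per weight
```

Multiplying the ternary values by `scale` gives an approximation of the original weights, which is why a single scale per tensor has to be stored alongside the ternary data.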
Demo
A demo of bitnet.cpp running a BitNet b1.58 3B model on Apple M2:
demo.mp4
What's New:
- 01/15/2026 BitNet CPU Inference Optimization
- 05/20/2025 BitNet Official GPU inference kernel
- 04/14/2025 BitNet Official 2B Parameter Model on Hugging Face
- 02/18/2025 Bitnet.cpp: Efficient Edge Inference for Ternary LLMs
- 11/08/2024 BitNet a4.8: 4-bit Activations for 1-bit LLMs
- 10/21/2024 1-bit AI Infra: Part 1.1, Fast and Lossless BitNet b1.58 Inference on CPUs
- 10/17/2024 bitnet.cpp 1.0 released.
- 03/21/2024 The-Era-of-1-bit-LLMs__Training_Tips_Code_FAQ
- 02/27/2024 The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits
- 10/17/2023 BitNet: Scaling 1-bit Transformers for Large Language Models
Acknowledgements
This project is based on the llama.cpp framework. We would like to thank all the authors for their contributions to the open-source community. Also, bitnet.cpp's kernels are built on top of the Lookup Table methodologies pioneered in T-MAC. For inference of general low-bit LLMs beyond ternary models, we recommend using T-MAC.
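The lookup-table idea mentioned above can be illustrated in a few lines. This is a toy sketch, not the T-MAC or bitnet.cpp kernel: the group size `GROUP`, the function names, and the plain Python lists are all assumptions made for illustration, and real kernels work on packed bits with SIMD instructions. The point it shows is that, for ternary weights, every possible partial sum over a small group of activations can be precomputed once and then reused for each weight row, replacing multiplications with table lookups.

```python
from itertools import product

GROUP = 2  # hypothetical group size; row length must be a multiple of it

# Every ternary pattern a group of GROUP weights can take, and its index.
PATTERNS = list(product((-1, 0, 1), repeat=GROUP))
PATTERN_INDEX = {p: i for i, p in enumerate(PATTERNS)}

def encode_row(row):
    """Replace each group of ternary weights with the index of its pattern."""
    return [PATTERN_INDEX[tuple(row[i:i + GROUP])] for i in range(0, len(row), GROUP)]

def build_tables(activations):
    """For each activation group, precompute the partial sum of every pattern."""
    tables = []
    for i in range(0, len(activations), GROUP):
        chunk = activations[i:i + GROUP]
        tables.append([sum(w * a for w, a in zip(p, chunk)) for p in PATTERNS])
    return tables

def lut_matvec(encoded_rows, activations):
    """Matrix-vector product using only table lookups and additions."""
    tables = build_tables(activations)  # built once, shared by all rows
    return [sum(tables[g][idx] for g, idx in enumerate(row)) for row in encoded_rows]

weights = [[1, 0, -1, 1], [0, 0, 1, -1], [-1, 1, 1, 0]]
activations = [2, 3, 5, 7]

encoded = [encode_row(r) for r in weights]
result = lut_matvec(encoded, activations)
reference = [sum(w * a for w, a in zip(r, activations)) for r in weights]
print(result, reference)
```

The table-building cost is paid once per activation vector, so the approach pays off when the matrix has many rows, as weight matrices in LLMs do.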
Official Models
| Model | Parameters | CPU | Kernel: I2_S | Kernel: TL1 | Kernel: TL2 |
|---|---|---|---|---|---|
| BitNet-b1.58-2B-4T | 2.4B | x86 | ✅ | ❌ | ✅ |
| | | ARM | ✅ | ✅ | ❌ |
Supported Models
❗️We use existing 1-bit LLMs available on Hugging Face to demonstrate the inference capabilities of bitnet.cpp. We hope the release of bitnet.cpp will inspire the development of 1-bit LLMs in large-scale settings in terms of model size and training tokens.
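| Model | Parameters | CPU | Kernel: I2_S | Kernel: TL1 | Kernel: TL2 |
|---|---|---|---|---|---|
| bitnet_b1_58-large | 0.7B | x86 | ✅ | ❌ | ✅ |
| | | ARM | ✅ | ✅ | ❌ |
| bitnet_b1_58-3B | 3.3B | x86 | ❌ | ❌ | ✅ |
| | | ARM | ❌ | ✅ | ❌ |
| Llama3-8B-1.58-100B-tokens | 8.0B | x86 | ✅ | ❌ | ✅ |
| | | ARM | ✅ | ✅ | ❌ |
| Falcon3 Family | 1B-10B | x86 | ✅ | ❌ | ✅ |
| | | ARM | ✅ | ✅ | ❌ |
| Falcon-E Family | 1B-3B | x86 | ✅ | ❌ | ✅ |
| | | ARM | ✅ | ✅ | ❌ |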
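When scripting around these tables, it can help to have the support matrix as data. The sketch below transcribes the two tables above into a dictionary and adds a small lookup; `KERNEL_SUPPORT`, `supported_kernels`, `normalize_arch`, and `detect_arch` are hypothetical helper names, not part of bitnet.cpp, and the architecture strings handled by `normalize_arch` cover only common `platform.machine()` values.

```python
import platform

# Transcribed from the kernel support tables in this README.
_COMMON = {"x86": {"I2_S", "TL2"}, "ARM": {"I2_S", "TL1"}}
KERNEL_SUPPORT = {
    "BitNet-b1.58-2B-4T": _COMMON,
    "bitnet_b1_58-large": _COMMON,
    "bitnet_b1_58-3B": {"x86": {"TL2"}, "ARM": {"TL1"}},
    "Llama3-8B-1.58-100B-tokens": _COMMON,
    "Falcon3 Family": _COMMON,
    "Falcon-E Family": _COMMON,
}

def normalize_arch(machine):
    """Map a platform.machine() string to the table's 'x86' or 'ARM' label."""
    mapping = {"x86_64": "x86", "amd64": "x86", "arm64": "ARM", "aarch64": "ARM"}
    return mapping.get(machine.lower())  # None for anything unrecognized

def detect_arch():
    """Best-effort guess of the local CPU family."""
    return normalize_arch(platform.machine())

def supported_kernels(model, arch):
    """Return the kernels marked as supported for a model on a CPU family."""
    return sorted(KERNEL_SUPPORT[model][arch])

print(supported_kernels("bitnet_b1_58-3B", "ARM"))
print(supported_kernels("Falcon3 Family", "x86"))
```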
Installation
Requirements
- python>=3.9
- cmake>=3.22
- clang>=18
For Windows users, install Visual Studio 2022. In the installer, toggle on at least the following options (this also automatically installs the required additional tools like CMake):
- Desktop-development with C++
- C++-CMake Tools for Windows
- Git for Windows
- C++-Clang Compiler for Windows
- MS-Build Support for LLVM-Toolset (clang)
For Debian/Ubuntu users, you can install clang with the automatic installation script:
bash -c "$(wget -O - https://apt.llvm.org/llvm.sh)"
- conda (highly recommended)
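Before building, it may be worth confirming that the toolchain meets the minimum versions listed above. The following is a rough sketch, not something shipped with bitnet.cpp: `MINIMUMS`, `parse_version`, `tool_version`, and `check_requirements` are hypothetical names, and the parser simply assumes that the first `major.minor` number in the `--version` output is the tool version. Vendor builds (for example Apple clang) use their own version numbering, so treat the result as a hint only.

```python
import re
import shutil
import subprocess
import sys

# Minimum versions taken from the requirements list in this README.
MINIMUMS = {"cmake": (3, 22), "clang": (18, 0)}

def parse_version(text):
    """Return the first major.minor pair found in text, or None."""
    match = re.search(r"(\d+)\.(\d+)", text)
    return (int(match.group(1)), int(match.group(2))) if match else None

def tool_version(tool):
    """Run '<tool> --version' if the tool is on PATH and parse its output."""
    path = shutil.which(tool)
    if path is None:
        return None
    output = subprocess.run([path, "--version"], capture_output=True, text=True).stdout
    return parse_version(output)

def check_requirements():
    """Return a list of human-readable problems; empty means all checks passed."""
    problems = []
    if sys.version_info < (3, 9):
        problems.append("python>=3.9 is required")
    for tool, minimum in MINIMUMS.items():
        found = tool_version(tool)
        if found is None:
            problems.append(f"{tool} not found on PATH or version unreadable")
        elif found < minimum:
            problems.append(f"{tool} {found[0]}.{found[1]} is older than {minimum[0]}.{minimum[1]}")
    return problems
```

Calling `check_requirements()` from the shell you intend to build in and printing each returned string gives a quick pre-flight report.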
Build from source
Important
If you are using Windows, please remember to always use a Developer Command Prompt / PowerShell for VS2022 for the following commands. Please refer to the FAQs below if you see any issues.
- Clone the repo
git clone --recursive https://github.com/microsoft/BitNet.git
cd BitNet
- Install the dependencies
# (Recommended) Create a new conda environment
conda create -n bitnet-cpp python=3.9
conda activate bitnet-cpp

pip install -r requirements.txt
- Build the project
# Manually download the model and run with local path
huggingface-cli download microsoft/BitNet-b1.58-2B-4T-gguf --local-dir models/BitNet-b1.58-2B-4T
python setup_env.py -md models/BitNet-b1.58-2B-4T -q i2_s

usage: setup_env.py [-h] [--hf-repo {1bitLLM/bitnet_b1_58-large,1bitLLM/bitnet_b1_58-3B,HF1BitLLM/Llama3-8B-1.58-100B-tokens,tiiuae/Falcon3-1B-Instruct-1.58bit,tiiuae/Falcon3-3B-Instruct-1.58bit,tiiuae/Falcon3-7B-Instruct-1.58bit,tiiuae/Falcon3-10B-Instruct-1.58bit}] [--model-dir MODEL_DIR] [--log-dir LOG_DIR] [--quant-type {i2_s,tl1}] [--quant-embd] [--use-pretuned]

Setup the environment for running inference

optional arguments:
  -h, --help            show this help message and exit
  --hf-repo {1bitLLM/bitnet_b1_58-large,1bitLLM/bitnet_b1_58-3B,HF1BitLLM/Llama3-8B-1.58-100B-tokens,tiiuae/Falcon3-1B-Instruct-1.58bit,tiiuae/Falcon3-3B-Instruct-1.58bit,tiiuae/Falcon3-7B-Instruct-1.58bit,tiiuae/Falcon3-10B-Instruct-1.58bit}, -hr {1bitLLM/bitnet_b1_58-large,1bitLLM/bitnet_b1_58-3B,HF1BitLLM/Llama3-8B-1.58-100B-tokens,tiiuae/Falcon3-1B-Instruct-1.58bit,tiiuae/Falcon3-3B-Instruct-1.58bit,tiiuae/Falcon3-7B-Instruct-1.58bit,tiiuae/Falcon3-10B-Instruct-1.58bit}
                        Model used for inference
  --model-dir MODEL_DIR, -md MODEL_DIR
                        Directory to save/load the model
  --log-dir LOG_DIR, -ld LOG_DIR
                        Directory to save the logging info
  --quant-type {i2_s,tl1}, -q {i2_s,tl1}
                        Quantization type
  --quant-embd          Quantize the embeddings to f16
  --use-pretuned, -p    Use the pretuned kernel parameters
Usage
Basic usage
# Run inference with the quantized model
python run_inference.py -m models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf -p "You are a helpful assistant" -cnv

usage: run_inference.py [-h] [-m MODEL] [-n N_PREDICT] -p PROMPT [-t THREADS] [-c CTX_SIZE] [-temp TEMPERATURE] [-cnv]

Run inference

optional arguments:
  -h, --help            show this help message and exit
  -m MODEL, --model MODEL
                        Path to model file
  -n N_PREDICT, --n-predict N_PREDICT
                        Number of tokens to predict when generating text
  -p PROMPT, --prompt PROMPT
                        Prompt to generate text from
  -t THREADS, --threads THREADS
                        Number of threads to use
  -c CTX_SIZE, --ctx-size CTX_SIZE
                        Size of the prompt context
  -temp TEMPERATURE, --temperature TEMPERATURE
                        Temperature, a hyperparameter that controls the randomness of the generated text
  -cnv, --conversation  Whether to enable chat mode or not (for instruct models.)
                        (When this option is turned on, the prompt specified by -p will be used as the system prompt.)
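If you want to drive `run_inference.py` from another Python program, one option is to assemble the argument list programmatically. The helper below is a sketch under that assumption: `inference_command` is a hypothetical name, and it only uses the flags shown in the usage text above. It builds the command but does not run it.

```python
import shlex

def inference_command(model, prompt, n_predict=None, threads=None,
                      ctx_size=None, temperature=None, conversation=False):
    """Build an argument list for run_inference.py from the documented flags."""
    cmd = ["python", "run_inference.py", "-m", model, "-p", prompt]
    optional = (("-n", n_predict), ("-t", threads),
                ("-c", ctx_size), ("-temp", temperature))
    for flag, value in optional:
        if value is not None:  # leave unset options to the script's defaults
            cmd += [flag, str(value)]
    if conversation:
        cmd.append("-cnv")  # -p is then used as the system prompt
    return cmd

cmd = inference_command(
    "models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf",
    "You are a helpful assistant",
    threads=4,
    conversation=True,
)
print(shlex.join(cmd))  # shell-quoted form, handy for logging
```

Passing the list to `subprocess.run(cmd, check=True)` from the repository root would launch the script; keeping the arguments as a list avoids shell-quoting problems with prompts that contain spaces or quotes.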
Benchmark
We provide scripts to run the inference benchmark on a given model.

usage: e2e_benchmark.py -m MODEL [-n N_TOKEN] [-p N_PROMPT] [-t THREADS]

Setup the environment for running the inference

required arguments:
  -m MODEL, --model MODEL
                        Path to the model file.

optional arguments:
  -h, --help            Show this help message and exit.
  -n N_TOKEN, --n-token N_TOKEN
                        Number of generated tokens.
  -p N_PROMPT, --n-prompt N_PROMPT
                        Prompt to generate text from.
  -t THREADS, --threads THREADS
                        Number of threads to use.
Here's a brief explanation of each argument:
- -m, --model: The path to the model file. This is a required argument that must be provided when running the script.
- -n, --n-token: The number of tokens to generate during the inference. It is an optional argument with a default value of 128.
- -p, --n-prompt: The number of prompt tokens to use for generating text. This is an optional argument with a default value of 512.
- -t, --threads: The number of threads to use for running the inference. It is an optional argument with a default value of 2.
- -h, --help: Show the help message and exit. Use this argument to display usage information.
For example:
python utils/e2e_benchmark.py -m /path/to/model -n 200 -p 256 -t 4
This command would run the inference benchmark using the model located at /path/to/model, generating 200 tokens from a 256 token prompt, utilizing 4 threads.
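To compare thread counts, the same benchmark command can be generated for several values of `-t`. The sketch below only builds the commands and shows the throughput arithmetic; `benchmark_command`, `thread_sweep`, and `tokens_per_second` are hypothetical helpers, the defaults mirror the ones listed above, and parsing the benchmark's output is left out because its format is not described here.

```python
def benchmark_command(model, n_token=128, n_prompt=512, threads=2):
    """Build the e2e_benchmark.py argument list using the documented defaults."""
    return ["python", "utils/e2e_benchmark.py", "-m", model,
            "-n", str(n_token), "-p", str(n_prompt), "-t", str(threads)]

def thread_sweep(model, thread_counts, **kwargs):
    """One benchmark command per thread count, with all other options shared."""
    return [benchmark_command(model, threads=t, **kwargs) for t in thread_counts]

def tokens_per_second(n_tokens, seconds):
    """Generation throughput from a token count and a wall-clock duration."""
    return n_tokens / seconds

for command in thread_sweep("/path/to/model", [1, 2, 4], n_token=200, n_prompt=256):
    print(" ".join(command))

# Example arithmetic: 200 tokens generated in 40 seconds is 5 tokens per second,
# the lower end of the human-reading-speed range quoted earlier.
print(tokens_per_second(200, 40))
```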
For model layouts that are not supported by any public model, we provide scripts to generate a dummy model with the given layout and run the benchmark on your machine:
python utils/generate-dummy-bitnet-model.py models/bitnet_b1_58-large --outfile models/dummy-bitnet-125m.tl1.gguf --outtype tl1 --model-size 125M
Run the benchmark with the generated model, using -m to specify the model path, -p to specify the number of prompt tokens to process, and -n to specify the number of tokens to generate:
python utils/e2e_benchmark.py -m models/dummy-bitnet-125m.tl1.gguf -p 512 -n 128
Convert from .safetensors Checkpoints
# Prepare the .safetensors model file
huggingface-cli download microsoft/bitnet-b1.58-2B-4T-bf16 --local-dir ./models/bitnet-b1.58-2B-4T-bf16

# Convert to gguf model
python ./utils/convert-helper-bitnet.py ./models/bitnet-b1.58-2B-4T-bf16
FAQ (Frequently Asked Questions)📌
Q1: The build dies with errors building llama.cpp due to issues with std::chrono in log.cpp?
A: This is an issue introduced in a recent version of llama.cpp. Please refer to this commit in the discussion to fix this issue.
Q2: How to build with clang in a conda environment on Windows?
A: Before building the project, verify your clang installation and access to Visual Studio tools by running:
clang -v
This command checks that you are using the correct version of clang and that the Visual Studio tools are available. If you see an error message such as:
'clang' is not recognized as an internal or external command, operable program or batch file.
It indicates that your command line window is not properly initialized for Visual Studio tools.
• If you are using Command Prompt, run:
"C:\Program Files\Microsoft Visual Studio\2022\Professional\Common7\Tools\VsDevCmd.bat" -startdir=none -arch=x64 -host_arch=x64
• If you are using Windows PowerShell, run the following commands:
Import-Module "C:\Program Files\Microsoft Visual Studio\2022\Professional\Common7\Tools\Microsoft.VisualStudio.DevShell.dll"
Enter-VsDevShell 3f0e31ad -SkipAutomaticLocation -DevCmdArguments "-arch=x64 -host_arch=x64"
These steps will initialize your environment and allow you to use the correct Visual Studio tools.