EDIT: Apparently I’m not alone in feeling this way; here’s a post from the same day I posted this: All LLMs Will Be Sparse BitNet Hybrids
I took a break from working on my decentralized DB to dive into the latest LLM research. As usual, I focused on opportunities for optimizations and performance improvements.
This post is mainly a summary of the private notes I made while ‘researching’.
I’ve only looked at inference (generating output, i.e. actually using the models) as that should dominate the total compute cost, and it’s what needs to be efficient for local-first computing.
I read through a lot of techniques, but by far the one with the most potential seems to be BitNet.
BitNet uses ternary weights with the values of [-1, 0, 1], which gives it extremely appealing performance characteristics.
I. An Overview of LLM Architecture Link to heading
Before diving into optimization techniques, let me briefly explain where all the computation goes. At a simplified level an LLM consists of:
- Weights: Parameters learned during training (the actual “knowledge”)
- Activations: Intermediate values computed during text generation
- Layers: Stacked transformations, each applying specific operations
- KV Cache: Storage for already-processed tokens (grows with context length)
When generating text, the model processes input through these layers, with the primary computational workhorses being:
- Matrix Multiplications (MatMuls): These dominate compute time. Modern GPUs are built specifically to parallelize these operations, but they’re still expensive.
- Memory Bandwidth: Moving gigabytes of weight data between memory and computation units is often slower than the computation itself.
- KV Cache Growth: As context length increases, the memory needed to store intermediate states grows dramatically. This is why longer chats with LLMs consume far more RAM.
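To make the KV cache point concrete, here’s a back-of-envelope sketch. The layer/head counts below are made up for illustration (roughly typical of an ~8B model), not taken from any specific model:

```python
# Rough KV-cache size estimate (illustrative back-of-envelope only).
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    # Per token, each layer stores one key and one value vector per KV head.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# e.g. 32 layers, 8 KV heads, 128-dim heads, fp16, 32k-token context:
gb = kv_cache_bytes(32, 8, 128, 32_768) / 2**30
print(f"{gb:.1f} GiB")  # grows linearly with context length
```

The takeaway is the linear growth in `seq_len`: double the context, double the cache.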
These bottlenecks become acute when trying to run models locally rather than in data centers. Cloud providers can throw specialized hardware and custom chips at the problem. Your laptop… cannot.
Model sizes Link to heading
Taking a step back, I should point out how LLM sizing will likely end up working in the long run. These are just my assumptions based on the hardware realities of today.
For practical deployment you can think of models in a few different tiers, each with very distinct capabilities:
- Always-on assistant models:
- Sub-8B parameters, heavily quantized to fit in RAM or on accelerators. Useful for simple interactions and tasks, and can hand off to more capable models as needed.
- Local Specialist models:
- 12-30B parameters with special training. You’d run the largest models you can locally for specialized tasks, capable of more autonomy. “Label all my photos with descriptions”, “Research a trip to Cleveland that I would like”.
- Cloud fallback:
- Full-precision 70B+ models for the hardest tasks. Almost everyone will access these models remotely.
The optimal sizes will depend on hardware constraints, but even modest laptops can run Local Specialist Models at acceptable speeds with aggressive quantization. BitNet would give that extra push of potentially doubling the parameter sizes that could be run locally, and would significantly improve the inference speed at the same time.
At small sizes (1B-30B), like the ones most people will interact with frequently, doubling the number of params is a significant increase in capabilities. An 8B model is much more capable than a 4B model.
II. Existing Efficiency Hacks Link to heading
Several approaches aim to address these challenges with big efficiency wins; all of them are in use in various forms today:
Quantization Techniques Link to heading
Quantization reduces the precision of model weights and activations. While 8-bit and 4-bit quantization are now standard practice, research is pushing toward even more extreme compression:
- Traditional quantization (8-bit, 4-bit): 4-bit weights often have very little performance degradation and are quite common in local inference today.
- Extreme quantization (2-bit and below): The current focus of several research groups.
The ParetoQ paper (arXiv:2502.02631) found something particularly interesting: 1.58-bit (3 values that vary per layer) or 2-bit (4 values, varying per layer) quantization may represent an optimal balance between compression and capability. For a fixed compute budget, you’re often better off running a larger model at lower precision than a smaller model at higher precision.
To get the most out of it, though, you need to further finetune the quantized model to regain as much performance as possible. They show that much of the capability ’lost’ when quantizing can be regained by finetuning on 10-30 billion additional tokens after the quantization.
This is a very promising approach, suggesting that post-quantization finetuning may be worthwhile for miniaturizing an existing model.
Note that this is not the same thing as BitNet. The 1.58-bit quants talked about in ParetoQ are of variable values, not the [-1, 0, 1] used by BitNet.
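For intuition on what quantization actually does, here’s a minimal symmetric per-tensor round-trip sketch (illustrative only; ParetoQ’s actual per-layer scheme is more sophisticated than this):

```python
# Minimal symmetric per-tensor quantization sketch (not the ParetoQ method):
# map floats onto a small signed integer grid, then back.
def quantize(ws, bits):
    qmax = 2 ** (bits - 1) - 1                  # e.g. 7 for 4-bit
    scale = max(abs(w) for w in ws) / qmax or 1.0
    q = [max(-qmax, min(qmax, round(w / scale))) for w in ws]
    return q, scale

def dequantize(q, scale):
    return [x * scale for x in q]

ws = [0.31, -0.8, 0.05, 0.6]
q, s = quantize(ws, 4)
print(q)                                        # [3, -7, 0, 5]
print([round(w, 2) for w in dequantize(q, s)])  # [0.34, -0.8, 0.0, 0.57]
```

The round-trip error is the precision you give up; post-quantization finetuning is about teaching the model to work well despite it.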
Mixture of Experts (MoE) Architectures Link to heading
MoE models take a fundamentally different approach to scaling. Instead of activating all parameters for every token, they contain specialized “expert” networks that are selectively activated:
- Only a small fraction of the model’s total parameters process each token
- This allows models to grow in parameter count without proportional compute increases
- For example, Qwen 3 has a model with 30B total parameters but only 3B active for each token.
You get some of the benefit of the full parameter count while only doing computation on a small fraction of it per token. The downside is that you need more memory to keep all the weights loaded.
Faster inference, but higher memory than an equivalently capable ‘dense’ model.
Many of the largest open-weight models do this because they are more efficient to train than dense models of similar capability, and the proprietary models are rumored to be MoE as well.
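A toy sketch of the routing idea (hypothetical structure; real routers are learned networks, nothing like this):

```python
import math

# Toy MoE routing sketch: a router scores every expert, but only the
# top-k experts actually run on the token.
def moe_forward(x, experts, router_scores, k=2):
    # Pick the k highest-scoring experts for this token.
    top = sorted(range(len(experts)), key=lambda i: router_scores[i], reverse=True)[:k]
    # Softmax over just the selected scores to weight their outputs.
    exps = [math.exp(router_scores[i]) for i in top]
    z = sum(exps)
    return sum((e / z) * experts[i](x) for e, i in zip(exps, top))

# 4 "experts" (here just scalar functions); only 2 run per token, so compute
# scales with k while memory scales with the total expert count.
experts = [lambda x, m=m: m * x for m in (1.0, 2.0, 3.0, 4.0)]
print(moe_forward(10.0, experts, [0.1, 0.9, 0.9, 0.2]))  # 25.0
```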
Speculative Decoding Link to heading
This approach uses a smaller, faster model to pre-generate likely outputs, which the larger model then verifies rather than generating from scratch. Google’s Speculative Decoding paper (arXiv:2211.17192) demonstrated speedups of 2-5x using this technique.
It can also be built into the model itself, as in Xiaomi’s MiMo and I’m sure many others.
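The core draft-then-verify loop can be sketched like this (toy deterministic ‘models’; a real implementation verifies all drafted tokens in one batched forward pass and accepts/rejects probabilistically):

```python
# Speculative decoding sketch: a cheap draft model proposes k tokens,
# the big model checks them and keeps the longest agreeing prefix.
def speculative_step(prefix, draft_next, target_next, k=4):
    drafted, ctx = [], list(prefix)
    for _ in range(k):                          # draft k tokens cheaply
        t = draft_next(ctx)
        drafted.append(t)
        ctx.append(t)
    accepted, ctx = [], list(prefix)
    for t in drafted:                           # one batched verify pass in reality
        if target_next(ctx) == t:
            accepted.append(t)                  # draft guessed right: keep it
            ctx.append(t)
        else:
            accepted.append(target_next(ctx))   # replace the first mismatch
            break
    return accepted
```

When the draft model agrees often, you get several tokens per expensive forward pass instead of one, which is where the 2-5x speedup comes from.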
III. BitNet b1.58 Link to heading
In my opinion, the most promising approach that should be pursued further is BitNet. The model sizes are similar to extreme quantization, but it allows for a fundamental change to the underlying math that meshes extremely well with the hardware realities.
What is BitNet b1.58? Link to heading
BitNet represents weights using only three possible values: -1, 0, and 1. This is called ternary quantization, using approximately 1.58 bits per parameter (log₂(3) ≈ 1.58). Unlike traditional quantization that starts with a full-precision model and then compresses it, BitNet models are trained from scratch with these constraints.
(Important note: this 1.58-bit is not the same as in ParetoQ. ParetoQ used values other than [-1, 0, 1] that change per layer.)
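One way to see where the 1.58 bits comes from: since 3⁵ = 243 ≤ 256, five ternary weights fit in a single byte (1.6 bits each). A toy base-3 packing sketch (real kernels use fancier layouts, but the counting argument is the same):

```python
# Base-3 packing sketch: five ternary weights per byte (3**5 = 243 <= 256).
def pack5(trits):               # trits: five values from {-1, 0, 1}
    b = 0
    for t in reversed(trits):
        b = b * 3 + (t + 1)     # map each trit to a base-3 digit {0, 1, 2}
    return b                    # 0..242, fits in one byte

def unpack5(b):
    trits = []
    for _ in range(5):
        trits.append(b % 3 - 1)
        b //= 3
    return trits

print(pack5([-1, 0, 1, 1, 0]))  # 156
print(unpack5(156))             # [-1, 0, 1, 1, 0]
```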
Why It’s Cool Link to heading
BitNet’s approach offers several advantages:
- MatMul Becomes Addition: When weights are limited to -1, 0, and 1, multiplication operations are replaced with much simpler operations:
  - Multiplying by 0: Result is 0 (skip the operation)
  - Multiplying by 1: Result is unchanged (just copy the value)
  - Multiplying by -1: Result is negated (single-cycle operation)
  On modern CPUs, multiplication operations typically take multiple cycles, while additions and negations take just one. Multipliers also require significantly more die space and power.
- Tiny Memory Footprint: Representing weights with just 1.58 bits matches even the smallest quants, allowing much larger models to fit in limited memory.
- Hardware Acceleration Potential: The simplicity of BitNet’s operations makes it ideal for specialized hardware acceleration.
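The first point is easy to see in code: a ternary ‘matvec’ needs no multiplies at all (naive illustrative loop; real kernels operate on packed weights with SIMD):

```python
# Mult-free ternary "matvec" sketch: with weights in {-1, 0, 1}, every
# dot product reduces to additions, subtractions, and skips.
def ternary_matvec(W, x):
    out = []
    for row in W:
        acc = 0.0
        for w, v in zip(row, x):
            if w == 1:
                acc += v        # copy
            elif w == -1:
                acc -= v        # negate
            # w == 0: skip the element entirely
        out.append(acc)
    return out

print(ternary_matvec([[1, 0, -1], [0, 1, 1]], [2.0, 3.0, 5.0]))  # [-3.0, 8.0]
```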
BitNet v2 Link to heading
The BitNet researchers recently published a follow-up paper (arXiv:2504.18415) introducing BitNet v2, which combines 1.58-bit weights with 4-bit activations and a technique called Hadamard transformation. This approach takes advantage of recent hardware support for 4-bit math while maintaining the core benefits of the BitNet approach.
It uses 4-bit math for activations inside the model instead of 8-bit math, though the inputs are still the same size (8 bits).
In short, by switching to 4-bit activations at the same model size they can:
- Keep the same size on disk
- ~Halve the size of the KV Cache, decreasing how much memory is used by the context
- Use INT4 operations, potentially doubling the theoretical computational throughput
Initial results show these extremely compressed models can achieve performance comparable to BitNet b1.58, which itself is similar to the performance of models with larger quant sizes and the same parameter count.
I’ll repeat that last point: they are claiming comparable performance to models that are twice the size in RAM, while also getting rid of the multiplies.
IV. So What? Link to heading
A BitNet model with an optimized library to run it should be extremely efficient even for CPU inference. Here is a chart from the BitNet repo showing inference tokens/second on ‘dummy’ models of different sizes.

Keep in mind that they also claim to achieve performance similar to existing models of the same parameter count. Those numbers are significantly higher than what you’d get from similarly sized models using CPU inference.
Those claimed numbers are roughly comparable to running existing models on my not-very-small GPU.
Synergy with MoE Architectures Link to heading
BitNet and MoE architectures complement each other beautifully. If BitNet allows us to significantly reduce the memory footprint of weights, we could:
- Fit much larger models in the same memory
- Increase the number of experts in an MoE model without increasing memory requirements
- Deploy sophisticated models on devices that previously couldn’t support them
This synergy could be particularly powerful for local computing, where memory is often the primary constraint.
Qwen 3’s largest model is a 235B parameter model with 22B active parameters. A BitNet v2 model of that size would be only ~80 GB and runnable at ~5-10 tokens/second on a high-end desktop, without offloading to the GPU.
The Efficiency Frontier Link to heading
When evaluating model performance, we need to consider not just raw accuracy but efficiency. The relevant question isn’t “How does a 2-bit model compare to a full-precision model?” but rather “How does a larger 2-bit model compare to a smaller full-precision model when both use the same amount of computation or memory?”
Most of the literature compares against models of a similar parameter size because that is normally a good proxy for performance and memory requirements, but that falls apart for BitNet.
There’s surprisingly little data comparing models at matched compute or memory budgets, which I find really odd. If BitNet can scale, it seems like it would be significantly more efficient when compared to models of similar performance.
It’s Not Ready (yet) Link to heading
While BitNet can run on existing hardware, its full potential will come through specialized chips:
- Software implementations on current hardware (modest gains)
- Optimized implementations leveraging existing hardware features (significant gains)
- Custom accelerators designed specifically for BitNet-style operations (transformative gains)
The simplicity of BitNet operations makes it an ideal candidate for custom silicon that could achieve orders of magnitude better efficiency than general-purpose processors, though writing optimized CUDA kernels to take advantage of the INT4 math on the latest NVIDIA GPUs would still be impressive performance-wise.
V. Open Research Questions Link to heading
Several important challenges remain:
- Training efficiency: Training BitNet models from scratch is computationally expensive, but the ternary models seem to require it. I couldn’t find research on quantizing an existing larger model to [-1, 0, 1] weights, but maybe someone will find a usable technique. Regardless, even if it’s more expensive to train, inference is significantly more efficient, so the total compute over a model’s lifetime should be smaller.
- Capability retention: More research is needed to understand which capabilities are preserved and which degrade as you scale up the training and parameter size.
- Hardware/software co-design: The full potential of BitNet will only be realized through tight integration between hardware and software design.
Conclusion Link to heading
BitNet’s 1.58-bit approach represents one of the most promising paths toward dramatically more efficient LLMs. I’m shocked there doesn’t seem to be more work going into it given its potential.
I don’t have the money, but I’m intrigued by the idea of training a decent-sized model (14B-32B range) to better test how it performs. The BitNet team has been doing research on small models up to 2B parameters, so it remains to be seen whether the capabilities scale as well as those of higher-precision quants; proving capabilities in a 32B model would be a really strong proof of concept.
References Link to heading
- Microsoft BitNet
- BitNet v1: arXiv:2402.17764
- BitNet v2: arXiv:2504.18415
- ParetoQ: arXiv:2502.02631