LLM inference is memory-bound. The byte/FLOP ratio is high enough that most of the machine's time goes to moving data, not computing. Quantization reduces the data you move, which reduces the bottleneck.

Weights you can quantize offline: take your time, run calibration, pick good scale factors. Activations are different: they change every token, so you quantize them on the fly, in the hot path.

So here's a problem: LLM activations have outliers. A handful of channels spike 10–100x past the rest, consistently across tokens. If you're computing a single scale factor for the whole token, those outlier channels eat the entire dynamic range. Everything else gets crammed into what's left, which means you end up wasting most of your effective precision on values you barely care about.

There are three ways to address this:
1. Better scale / zero-point selection
2. Fix the distribution before quantizing
3. Per-channel or grouped quantization schemes

Option 3 is the obvious fix: per-channel quantization would handle this cleanly, but the hardware doesn't cooperate. Our engineer Yuma Oda tested the other two options and wrote up what he found, including what worked, what didn't, and why some combinations are less additive than they look.

Read on: https://lnkd.in/eEZ6KkMm
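To make the outlier effect concrete, here is a minimal NumPy sketch (toy channel counts and magnitudes invented for this post, not numbers from the write-up): one per-token int8 scale, a few channels spiking far above the rest, and the resulting error on the well-behaved channels.

```python
import numpy as np

# Toy illustration of the outlier problem with a single per-token int8 scale.
# Channel count and magnitudes are made up for the example.
rng = np.random.default_rng(0)
acts = rng.normal(0.0, 1.0, size=4096)   # well-behaved channels, roughly in [-3, 3]
acts[:4] *= 60.0                          # a handful of outlier channels, ~60x larger

def fake_quantize_int8(x: np.ndarray) -> np.ndarray:
    scale = np.abs(x).max() / 127.0       # one scale for the whole token
    q = np.clip(np.round(x / scale), -127, 127)
    return q * scale                      # dequantize so we can measure the error

err = acts - fake_quantize_int8(acts)
inliers = np.abs(acts) < 10.0
print("RMS error on non-outlier channels:", np.sqrt(np.mean(err[inliers] ** 2)))
# The few outliers force a coarse scale, so most of the int8 range is spent
# on values the layer barely cares about.
```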
Mirai Tech Inc
Fastest inference engine built from scratch for Apple devices to run your models locally.
About us
Mirai is an on-device layer for AI model makers and products, letting you deploy and run models of any architecture directly on user devices. Mirai extends your model's reach to user devices, running local inference for speed and privacy while freeing your cloud GPUs for what truly needs scale. Extend your model beyond the cloud: keep your inference backend and add Mirai to process part of your user requests directly on user devices. Mirai is built natively for iOS and macOS. We made the fastest inference engine from scratch for Apple devices with performance in mind, outperforming MLX and Llama.cpp, with all key SOTA models supported. Run your models locally. Free your cloud.
- Website: https://trymirai.com/
- Industry: Technology, Information and Internet
- Company size: 11-50 employees
- Headquarters: San Francisco, California
- Type: Privately Held
- Founded: 2024
- Specialties: ai, ml, on-device ai, and sdk
Locations
- Primary: San Francisco, California 94129, US
- 111 Pier Ave, Hermosa Beach, California 90254, US
Updates
-
If you've implemented speculative decoding, you've run into this: your speculator predicts the distribution correctly and still gets rejected.

Take an example: "The random number from 1 to 10 inclusive is"

Ideal LLM: uniform 1/10 across all numbers. A good speculator: the same distribution. In that situation, the naive scheme samples a number from the speculator, independently samples another from the LLM, and can only accept the bonus token in the 1-in-10 cases where they happen to match. We should be able to always take the speculator's guess without any loss of generation quality.

The fix is straightforward: sample from both the speculator and the LLM using the same seed via the Gumbel-max trick. Shared randomness means a shared outcome when the distributions match. That gives a 100% acceptance rate in this case, and is near-optimal in practice.

We shipped this in UZU: https://lnkd.in/dYHp4C-m
Full kernel implementation: https://lnkd.in/gxdxHa6y
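For illustration, a minimal NumPy sketch of the shared-seed idea (not the actual UZU kernel; the function and toy uniform distributions here are ours): when the speculator and the target model share the seed and their distributions match, the Gumbel-max trick makes them draw the same token.

```python
import numpy as np

def gumbel_max_sample(logits: np.ndarray, rng: np.random.Generator) -> int:
    """Sample a token via the Gumbel-max trick: argmax(logits + Gumbel noise)."""
    return int(np.argmax(logits + rng.gumbel(size=logits.shape)))

# Toy setup from the post: uniform 1/10 over ten "number" tokens.
logits_target = np.log(np.full(10, 0.1))      # ideal LLM
logits_speculator = np.log(np.full(10, 0.1))  # good speculator, same distribution

seed = 1234  # shared seed => shared Gumbel noise => shared outcome
token_spec = gumbel_max_sample(logits_speculator, np.random.default_rng(seed))
token_llm = gumbel_max_sample(logits_target, np.random.default_rng(seed))
assert token_spec == token_llm  # the bonus token is accepted here
```

With independent draws the two samples only agree about 1 time in 10; with shared noise they agree whenever the distributions match, and stay close when they almost match.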
-
LFM2.5-350M is now available on Mirai. Liquid AI's smallest model outperforms Qwen3.5-0.8B on reasoning and agentic tool use. Running on Mirai in full precision, it exceeds 70 tokens/second on iPhone. We are rolling out bfloat16 support initially, with our own 4-bit and 8-bit quantized checkpoints coming soon. More on https://lnkd.in/ezbfm232
-
Running models on-device is easy, until you step outside standard architectures. And these days, most models aren't standard anymore. Almost every new release introduces custom layers and novel techniques. With Mirai, converting models for edge deployment is seamless:
- Any model built from supported blocks can be described with a simple config (see the sketch below).
- Custom layers are easy to implement when needed.
- Any custom pipeline is easy to implement.
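For a flavor of what "described with a simple config" can look like, here is a purely hypothetical sketch written as a Python dict; the field names and block types are invented for illustration and are not Mirai's actual schema.

```python
# Hypothetical illustration only, not Mirai's real config format.
# The idea: a model assembled from supported blocks is just data, no custom code.
model_config = {
    "architecture": "decoder-only-transformer",
    "vocab_size": 32_000,
    "hidden_size": 2048,
    "num_layers": 24,
    "layer_blocks": [
        {"type": "rmsnorm"},
        {"type": "attention", "heads": 16, "kv_heads": 4, "rope": True},
        {"type": "rmsnorm"},
        {"type": "mlp", "activation": "silu", "intermediate_size": 5632},
    ],
    # Anything outside the supported blocks would be registered as a custom layer.
    "custom_layers": [],
}
```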
-
LLM inference on edge devices is a memory-bound problem. To achieve maximal performance, we need to focus on:
- how weights are quantized
- how memory is laid out
- how kernels are scheduled

We see this all the time: the same model can have completely different latency depending on how it's run. On-device just makes this more obvious. There's nowhere to hide: either your execution is efficient, or it falls apart.
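A rough back-of-envelope sketch of why decode speed is bandwidth-bound, using assumed numbers (a 3B-parameter model and ~100 GB/s of memory bandwidth; both are illustrative, not measurements of any particular device):

```python
# Back-of-envelope: each generated token streams roughly all weights through memory once,
# so decode speed is capped by bandwidth / model-bytes. All numbers are assumptions.
params = 3e9                   # 3B-parameter model
bandwidth_bytes_per_s = 100e9  # ~100 GB/s unified memory (illustrative)

def max_tokens_per_second(bytes_per_weight: float) -> float:
    bytes_per_token = params * bytes_per_weight
    return bandwidth_bytes_per_s / bytes_per_token

print(f"fp16 weights: ~{max_tokens_per_second(2.0):.0f} tok/s")  # ~17 tok/s
print(f"int4 weights: ~{max_tokens_per_second(0.5):.0f} tok/s")  # ~67 tok/s
```

Same model, same device; the ceiling moves with how the weights are quantized and laid out.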
-
We tend to think AI quality = model quality. Bigger model → better answers. However, for local inference, performance depends on model + hardware + execution together. The same model can be much more or less efficient depending on how it's run on-device. That means local AI isn't just a model problem. It's a systems problem: how efficiently your model can run on a real device. That's the shift. Mirai is building the on-device inference layer, the runtime between hardware and models, turning local compute into predictable, efficient, production-ready intelligence.
-
AI isn't moving from cloud → device. It's splitting. Local models already handle a share of real-world queries, while frontier models remain essential for complex tasks. It's redistribution: simple workloads move closer to the user, complex workloads stay in the cloud, and the system becomes hybrid by default. Once part of your AI stack runs locally:
• latency drops
• costs collapse
• privacy improves
• reliability increases
Mirai exists for this new architecture, where intelligence is split across layers but feels like one system.
-
The biggest constraint in AI isn't models. It's energy ⚡. Inference demand is exploding:
• Google Cloud saw a 1300× increase in token processing.
• NVIDIA reported 10× year-over-year growth.
But data centers take years to build and require massive energy infrastructure. So the real question isn't just "How do we build bigger models?" It's "How do we turn energy into intelligence more efficiently?" That's why intelligence-per-watt matters. Just as performance-per-watt moved computing from mainframes to PCs, intelligence-per-watt may move AI from data centers to devices. Mirai is building for that transition.
-
AI today looks a lot like computing in the 1960s. Massive centralized systems. Shared infrastructure. Users renting time on powerful machines. Mainframes didn’t disappear because PCs became more powerful. They disappeared because efficiency improved enough to run useful workloads locally. Performance-per-watt doubled every ~1.5 years. Now we’re seeing the same pattern in AI. Local models are getting better. Local hardware is getting faster. Intelligence per watt is improving rapidly. That’s how computing moved from mainframes to PCs. And it may be how AI moves from data centers to devices. Mirai is building for that transition.
-
Computing progress used to be measured by raw performance. Then the metric shifted to performance per watt, and that shift moved us from mainframes to personal computers to smartphones. AI is entering a similar phase. The metric that matters isn't just model size or tokens per second. It's how much useful inference you get per unit of energy. Why this matters:
– Data centers are energy-constrained
– Most AI queries are lightweight enough to run locally
– Devices already have powerful accelerators
As on-device inference gets more efficient, it naturally moves closer to the user. It's the same pattern that moved computing from centralized systems to personal devices. Mirai is building for that shift.