Mirai Labs (@trymirai) / X

Mirai Labs

87 posts

Frontier on-device AI lab. Models, runtime & infrastructure to make on-device AI interactive, ambient & continuous.

San Francisco

Joined January 2025

Mirai Labs reposted
Artur Chakhvadze
@norpadon
Jun 8
We are releasing our first quantized checkpoints for the Qwen3.5 series of models, co-designed with our inference engine to achieve the maximum possible performance on Apple hardware. Starting with the 0.8B, 2B and 4B models.
Introducing Mirai Quantization: Redefining the speed-quality frontier for local LLMs on Apple...
From trymirai.com
68K
Mirai Labs
@trymirai
Apr 2
Google DeepMind released Gemma 4. Our engineer @norpadon analyzed the architecture
Artur Chakhvadze
@norpadon
Apr 2
Gemma 4 architecture analysis thread. Just like Gemma3n, this thing has a galaxybrained architecture, very much not a standard transformer
3.2K
Mirai Labs
@trymirai
Apr 2
If you've implemented speculative decoding, you've run into this: your speculator predicts the distribution correctly and still gets rejected. Let’s take an example: "The random number from 1 to 10 inclusive is" Ideal LLM: uniform 1/10 across all numbers. Good speculator: same
1.4K
Mirai Labs
@trymirai
Mar 31
Day 0 on-device support for the latest and smallest @liquidai model, LFM2.5-350M, is now available on @trymirai
Liquid AI
@liquidai
Mar 31
Replying to @liquidai
Day 0 support across the stack:
> Hardware: @AMD, @Intel, @Qualcomm
> On-device: @lmstudio, @Cactuscompute, @RunAnywhereAI, @zeticai_, @trymirai
> Customization: @distil_labs
1.4K
Mirai Labs
@trymirai
Mar 31
LFM2.5-350M is now available on Mirai. @liquidai's smallest model outperforms Qwen3.5-0.8B on reasoning and agentic tool use. Running on Mirai in full precision, it exceeds 70 tokens/second on iPhone.
1.9K
Mirai Labs
@trymirai
Mar 31
More on:
LFM2.5-350M on Apple Silicon
From trymirai.com
367
Mirai Labs
@trymirai
Mar 30
LLM activations have outliers. A few channels spike 100x past the rest, every token. Standard INT8 wastes almost all its precision covering them. The fix: rotate the weight space so outliers disappear before quantization. It's why QuaRot and TurboQuant work. Here's how we
Why activation quantization is harder than weight quantization (and what to do about it).
From trymirai.com
1.1K
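To make the outlier argument in the post above concrete, here is a minimal sketch, not Mirai's actual pipeline. It assumes a hypothetical 64-channel activation vector with a single 100x outlier, symmetric per-tensor INT8 rounding, and an orthonormal Walsh-Hadamard transform as the rotation (the kind of rotation QuaRot builds on). The helper names `hadamard_rotate`, `fake_quant_int8` and `mse` are illustrative only.

```python
import math
import random


def hadamard_rotate(x):
    """Apply the orthonormal Walsh-Hadamard transform (length must be a power of two)."""
    n = len(x)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    y = list(x)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = y[i], y[i + h]
                y[i], y[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in y]


def fake_quant_int8(x):
    """Symmetric per-tensor INT8: snap each value to one of 255 levels in [-max|x|, max|x|]."""
    step = max(abs(v) for v in x) / 127.0
    return [round(v / step) * step for v in x]


def mse(a, b):
    """Mean squared error between two equally sized vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)


# Hypothetical activation vector: 63 small channels plus one 100x outlier.
rng = random.Random(0)
x = [rng.uniform(-1.0, 1.0) for _ in range(64)]
x[7] = 100.0

# Quantize directly: the outlier sets the step size for every channel.
plain_mse = mse(x, fake_quant_int8(x))

# Rotate, quantize, rotate back. The normalized transform is its own inverse.
rotated = hadamard_rotate(x)
restored = hadamard_rotate(fake_quant_int8(rotated))
rotated_mse = mse(x, restored)

print(f"plain: {plain_mse:.5f}  rotated: {rotated_mse:.5f}")
```

The rotation spreads the outlier's energy across all channels, so the quantization step shrinks and the small channels keep more of their precision. Here the inverse rotation is applied explicitly only to measure the error in the original basis.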
Mirai Labs
@trymirai
Mar 25
Morton codes for GEMM
Lucky Iyinbor
@Luckyballa
Mar 25
Apple just released its programming guide for Metal Performance Primitives, and they suggest using Morton codes for tiled GEMM, but why? In computer graphics, you use such space-filling curves all the time. It makes objects that are close in space close in memory. There
713
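As a rough illustration of the space-filling curve mentioned in the post above, the sketch below computes a Morton (Z-order) code by interleaving the bits of a tile's row and column index. The function names and the choice of putting the column in the low bit are assumptions made for this example; they are not taken from Apple's guide.

```python
def morton_encode(row, col, bits=16):
    """Interleave the bits of (row, col): column bits go to even positions, row bits to odd."""
    code = 0
    for i in range(bits):
        code |= ((col >> i) & 1) << (2 * i)
        code |= ((row >> i) & 1) << (2 * i + 1)
    return code


def morton_decode(code, bits=16):
    """Inverse of morton_encode: split the interleaved bits back into (row, col)."""
    row = col = 0
    for i in range(bits):
        col |= ((code >> (2 * i)) & 1) << i
        row |= ((code >> (2 * i + 1)) & 1) << i
    return row, col


def tile_order(tile_rows, tile_cols):
    """Visit the tiles of a grid in Morton order instead of row-major order."""
    tiles = [(r, c) for r in range(tile_rows) for c in range(tile_cols)]
    return sorted(tiles, key=lambda t: morton_encode(*t))


if __name__ == "__main__":
    for tile in tile_order(4, 4):
        print(tile, morton_encode(*tile))
```

Consecutive codes stay inside the same 2x2 block of tiles, then the same 4x4 block, and so on, so tiles processed back to back tend to share rows of one operand and columns of the other.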
Mirai Labs
@trymirai
Mar 24
Considering quantized activations
:hikettei🌙
@hikettei
Mar 24
(1/n) I recently joined @trymirai, where we are working on LLM inference targeting Apple Silicon. Lately I've been digging into quantization. LLM inference is mostly memory-bound. The byte/FLOP ratio is high enough that a lot of the machine's time goes to moving data around
717
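A back-of-the-envelope sketch of what "memory-bound" implies for decoding: if every generated token requires reading all weights once, memory bandwidth divided by weight size gives an upper bound on tokens per second. The function name and all numbers below are hypothetical, and the estimate ignores the KV cache, activations and compute time.

```python
def decode_tokens_per_second_bound(param_count, bytes_per_param, bandwidth_bytes_per_s):
    """Upper bound on decode speed when each token reads every weight once from memory."""
    weight_bytes = param_count * bytes_per_param
    return bandwidth_bytes_per_s / weight_bytes


if __name__ == "__main__":
    # Hypothetical device: 100 GB/s memory bandwidth, 4B-parameter model.
    for label, bytes_per_param in [("16-bit", 2.0), ("8-bit", 1.0), ("4-bit", 0.5)]:
        bound = decode_tokens_per_second_bound(4e9, bytes_per_param, 100e9)
        print(f"{label}: at most {bound:.1f} tokens/s")
```

Under these assumptions, halving the bytes per weight doubles the bound, which is why quantization matters so much for this workload.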
Mirai Labs
@trymirai
Mar 23
Efficient quantization is coming soon for on-device inference
Artur Chakhvadze
@norpadon
Mar 23
We are doing really cool hard tech at @trymirai, but until recently our social media feeds were full of linkedinish cringe. We decided to fix it and share more technical content. I am currently working on our quantization pipeline, so here is a thread about LLM quantization
1.4K
Mirai Labs
@trymirai
Mar 22
How to improve acceptance rate in speculative decoding? Inference of LLMs requires reading large amounts of data from memory while doing relatively little compute with that data, which means that compute is significantly underutilized. github.com/trymirai/uzu can turn that
GitHub - trymirai/uzu: A high-performance inference engine for AI models
From github.com
1.8K
Mirai Labs
@trymirai
Mar 22
The simplest way to improve upon this is by realizing that the LLM is often unsure about what the next token should be. Let's take an example: "The random number from 1 to 10 inclusive is". An idealized LLM outputs uniform 1/10 probability for every number 1–10. A good speculator
361
Mirai Labs
@trymirai
Mar 22
We achieve this by sampling from both the speculator and the LLM via the Gumbel-max trick, sharing the same seed. This will achieve a 100% acceptance rate in the toy example above, and a near-optimal acceptance rate in more complicated real-world cases, while being much simpler
308
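The shared-seed idea from the post above can be sketched in a few lines. This is an illustration under stated assumptions, not the code in uzu: it assumes a draft token is accepted only when it equals the token the target model samples, and the names `gumbel_max_sample` and `match_rate` are hypothetical.

```python
import math
import random


def gumbel_max_sample(probs, seed):
    """Sample an index from probs as argmax(log p_i + g_i), with Gumbel noise from a seeded RNG."""
    rng = random.Random(seed)
    best_index, best_score = -1, -math.inf
    for index, p in enumerate(probs):
        # Draw noise for every index so two models with the same seed see identical noise.
        u = rng.random()
        while u == 0.0:
            u = rng.random()
        gumbel = -math.log(-math.log(u))
        if p <= 0.0:
            continue
        score = math.log(p) + gumbel
        if score > best_score:
            best_index, best_score = index, score
    return best_index


def match_rate(draft_probs, target_probs, trials, shared_seed):
    """Fraction of trials in which the speculator and the target model pick the same token."""
    matches = 0
    for t in range(trials):
        draft_seed = t
        target_seed = t if shared_seed else t + 1_000_000
        draft_token = gumbel_max_sample(draft_probs, draft_seed)
        target_token = gumbel_max_sample(target_probs, target_seed)
        matches += draft_token == target_token
    return matches / trials


# Toy example from the thread: uniform 1/10 over the numbers 1 to 10.
uniform = [0.1] * 10

if __name__ == "__main__":
    print("independent seeds:", match_rate(uniform, uniform, 2000, shared_seed=False))
    print("shared seed:      ", match_rate(uniform, uniform, 2000, shared_seed=True))
```

With a shared seed, identical distributions always produce the same token, and nearby distributions mostly do, because the same noise vector perturbs both sets of log-probabilities.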
Mirai Labs
@trymirai
Mar 22
Why Muon performs exceptionally well on quantized models
ryan mathieu
@gapDEEPry
Mar 21
Why does Muon beat Adam for training quantized networks? It comes down to what each optimizer treats as "distance" in weight space. Adam treats a weight matrix as a flat vector of numbers. Muon treats it as a linear map — and measures change by how much the input-output mapping
21K
Mirai Labs
@trymirai
Mar 20
Unlocking Apple’s 'M5-only' matmul API for every M-series chip with one hidden Metal function
eugene
@eugenebokhan
Mar 20
1/ Apple shipped Metal Performance Primitives — a GPU matmul API built on cooperative_tensor. If you look at Apple's open-source code for an example of how to use MPP, you'll find a hardcoded M5 memory layout.
1.5K
Mirai Labs
@trymirai
Mar 19
We tend to think AI quality = model quality. Bigger model → better answers. However, for local inference, performance depends on model + hardware + execution together. The same model can be much more or less efficient depending on how it’s run on-device. That means local AI
376