Mirai Labs
87 posts
Mirai Labs
@trymirai
Frontier on-device AI lab. Models, runtime & infrastructure to make on-device AI interactive, ambient & continuous.
San Francisco
trymirai.com
Joined January 2025
34
Following
634
Followers
  • Mirai Labs reposted
    Artur Chakhvadze
    @norpadon
    Jun 8
    We are releasing our first quantized checkpoints for the Qwen3.5 series of models, co-designed with our inference engine to achieve the maximum possible performance on Apple hardware. Starting from 0.8B, 2B and 4B models
    Introducing Mirai Quantization: Redefining the speed-quality frontier for local LLMs on Apple...
    From trymirai.com
    68K
  • Mirai Labs
    @trymirai
    Apr 2
    Google DeepMind released Gemma 4. Our engineer @norpadon analyzed the architecture
    Artur Chakhvadze
    @norpadon
    Apr 2
    Gemma 4 architecture analysis thread. Just like Gemma3n, this thing has a galaxybrained architecture, very much not a standard transformer
    3.2K
  • Mirai Labs
    @trymirai
    Apr 2
    If you've implemented speculative decoding, you've run into this: your speculator predicts the distribution correctly and still gets rejected. Let’s take an example: "The random number from 1 to 10 inclusive is" Ideal LLM: uniform 1/10 across all numbers. Good speculator: same
    1.4K
  • Mirai Labs
    @trymirai
    Mar 31
    Day 0 on-device support for the latest and smallest @liquidai model, LFM2.5-350M, is now available on @trymirai
    Liquid AI
    @liquidai
    Mar 31
    Replying to @liquidai
    Day 0 support across the stack:
    > Hardware: @AMD, @Intel, @Qualcomm
    > On-device: @lmstudio, @Cactuscompute, @RunAnywhereAI, @zeticai_, @trymirai
    > Customization: @distil_labs
    1.4K
  • Mirai Labs
    @trymirai
    Mar 31
    LFM2.5-350M is now available on Mirai. @liquidai's smallest model outperforms Qwen3.5-0.8B on reasoning and agentic tool use. Running on Mirai in full precision, it exceeds 70 tokens/second on iPhone.
    1.9K
    Mirai Labs
    @trymirai
    Mar 31
    More on:
    LFM2.5-350M on Apple Silicon
    From trymirai.com
    367
  • Mirai Labs
    @trymirai
    Mar 30
    LLM activations have outliers. A few channels spike 100x past the rest, every token. Standard INT8 wastes almost all its precision covering them. The fix: rotate the weight space so outliers disappear before quantization. It's why QuaRot and TurboQuant work. Here's how we
    Why activation quantization is harder than weight quantization (and what to do about it).
    From trymirai.com
    1.1K
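The outlier argument in the post above can be illustrated with a small sketch. It assumes per-tensor absmax INT8 quantization and uses a normalized Walsh-Hadamard transform as the rotation; both choices, the synthetic activation vector, and all function names are illustrative assumptions, not Mirai's pipeline or the exact QuaRot or TurboQuant procedure.

```python
import math

def hadamard_rotate(x):
    """Normalized fast Walsh-Hadamard transform (orthogonal, self-inverse).
    len(x) must be a power of two."""
    y = list(x)
    n = len(y)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = y[j], y[j + h]
                y[j], y[j + h] = a + b, a - b
        h *= 2
    s = 1.0 / math.sqrt(n)
    return [v * s for v in y]

def fake_quantize_int8(x):
    """Quantize to INT8 with one absmax scale, then dequantize."""
    scale = max(abs(v) for v in x) / 127.0
    return [max(-127, min(127, round(v / scale))) * scale for v in x]

def rms_error(a, b):
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)) / len(a))

# Synthetic activations: 64 small channels plus one 100x outlier (assumed data).
x = [math.sin(i) for i in range(64)]
x[5] = 100.0

# Direct quantization: the outlier sets the scale, so small channels get ~3 levels.
err_direct = rms_error(x, fake_quantize_int8(x))

# Rotate, quantize, rotate back: the outlier's energy is spread over all channels.
x_rot = hadamard_rotate(x)
err_rotated = rms_error(x, hadamard_rotate(fake_quantize_int8(x_rot)))
```

Because the rotation is orthogonal, the error measured after rotating back equals the quantization error in the rotated space, so `err_rotated` is directly comparable with `err_direct`.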
  • Mirai Labs
    @trymirai
    Mar 25
    Morton codes for GEMM
    Lucky Iyinbor
    @Luckyballa
    Mar 25
    Apple just released its programming guide for Metal Performance Primitives, and they suggest using Morton codes for tiled GEMM, but why? In computer graphics, you use such space-filling curves all the time. They make objects that are close in space close in memory. There
    713
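For readers unfamiliar with Morton codes, here is a minimal sketch of the standard bit-interleaving construction (Z-order curve) applied to tile coordinates. It is not taken from Apple's guide or the Metal Performance Primitives API; the function names and the 16-bit coordinate limit are assumptions made for illustration.

```python
def part1by1(x):
    """Spread the low 16 bits of x so a zero bit sits between each of them."""
    x &= 0x0000FFFF
    x = (x | (x << 8)) & 0x00FF00FF
    x = (x | (x << 4)) & 0x0F0F0F0F
    x = (x | (x << 2)) & 0x33333333
    x = (x | (x << 1)) & 0x55555555
    return x

def compact1by1(x):
    """Inverse of part1by1: gather every other bit back together."""
    x &= 0x55555555
    x = (x | (x >> 1)) & 0x33333333
    x = (x | (x >> 2)) & 0x0F0F0F0F
    x = (x | (x >> 4)) & 0x00FF00FF
    x = (x | (x >> 8)) & 0x0000FFFF
    return x

def morton_encode(tile_row, tile_col):
    """Interleave row and column bits; row bits take the odd positions."""
    return (part1by1(tile_row) << 1) | part1by1(tile_col)

def morton_decode(code):
    """Recover (tile_row, tile_col) from a Morton code."""
    return compact1by1(code >> 1), compact1by1(code)

# Visit the tiles of a 4x4 output grid in Z-order instead of row-major order.
z_order = [morton_decode(code) for code in range(16)]
# The first four tiles form a 2x2 block: (0, 0), (0, 1), (1, 0), (1, 1)
```

Consecutive codes stay inside small square blocks of tiles, which is the locality property the post refers to: tiles processed close together in time tend to share rows of one input matrix and columns of the other.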
  • Mirai Labs
    @trymirai
    Mar 24
    Considering quantized activations
    :hikettei🌙
    @hikettei
    Mar 24
    (1/n) I recently joined @trymirai, where we are working on LLM inference targeting Apple Silicon. Lately I've been digging into quantization. LLM inference is mostly memory-bound. The byte/FLOP ratio is high enough that a lot of the machine's time goes to moving data around
    717
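The memory-bound claim in the thread above can be made concrete with a back-of-the-envelope count. The sketch below assumes single-token decoding, where each weight matrix is read once for one matrix-vector product, and it ignores activation and KV-cache traffic; the function name and the 4096x4096 layer size are hypothetical.

```python
def matvec_flops_per_byte(rows, cols, bytes_per_weight):
    """Arithmetic intensity of y = W @ x when W is streamed from memory once."""
    flops = 2 * rows * cols                       # one multiply and one add per weight
    weight_bytes = rows * cols * bytes_per_weight # activation traffic ignored (assumption)
    return flops / weight_bytes

fp16_intensity = matvec_flops_per_byte(4096, 4096, 2.0)  # 1.0 FLOP per byte
int8_intensity = matvec_flops_per_byte(4096, 4096, 1.0)  # 2.0 FLOPs per byte
int4_intensity = matvec_flops_per_byte(4096, 4096, 0.5)  # 4.0 FLOPs per byte
```

The intensity does not depend on the layer size, only on the bytes stored per weight, which is why shrinking the weights through quantization directly reduces the data movement per token.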
  • Mirai Labs
    @trymirai
    Mar 23
    Efficient quantization is coming soon for on-device inference
    Artur Chakhvadze
    @norpadon
    Mar 23
    We are doing really cool hard tech at @trymirai, but until recently our social media feeds were full of linkedinish cringe. We decided to fix it and share more technical content. I am currently working on our quantization pipeline, so here is a thread about LLM quantization
    1.4K
  • Mirai Labs
    @trymirai
    Mar 22
    How to improve acceptance rate in speculative decoding? Inference of LLMs requires reading large amounts of data from memory while doing relatively little compute with that data, which means that compute is significantly underutilized. github.com/trymirai/uzu can turn that
    GitHub - trymirai/uzu: A high-performance inference engine for AI models
    From github.com
    1.8K
    Mirai Labs
    @trymirai
    Mar 22
    The simplest way to improve on this is to recognize that the LLM is often unsure about what the next token should be. Take an example: given "The random number from 1 to 10 inclusive is", an idealized LLM outputs a uniform probability of 1/10 for every number from 1 to 10. A good speculator
    361
    Mirai Labs
    @trymirai
    Mar 22
    We achieve this by sampling from both the speculator and the LLM via the Gumbel-max trick, sharing the same seed. This will achieve a 100% acceptance rate in the toy example above, and a near-optimal acceptance rate in more complicated real-world cases, while being much simpler
    308
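The shared-seed idea in the thread above can be sketched with the toy prompt it uses. The posts are truncated, so the baseline here (speculator and LLM sample independently, and a draft token is accepted only when both pick the same token) is an assumption, as are all function names; this is not the uzu implementation.

```python
import math
import random

def gumbel_noise(rng, n):
    """Draw n independent Gumbel(0, 1) samples as -log(-log(U))."""
    out = []
    for _ in range(n):
        u = rng.random()
        while u == 0.0:  # avoid log(0)
            u = rng.random()
        out.append(-math.log(-math.log(u)))
    return out

def gumbel_max_sample(probs, noise):
    """argmax_i(log p_i + g_i) is a sample from the categorical distribution probs."""
    scores = [math.log(p) + g if p > 0 else -math.inf for p, g in zip(probs, noise)]
    return max(range(len(probs)), key=scores.__getitem__)

def acceptance_rate(target_probs, draft_probs, trials, shared_noise, seed=0):
    """Fraction of trials in which the draft token equals the target token."""
    rng = random.Random(seed)
    accepted = 0
    for _ in range(trials):
        draft_noise = gumbel_noise(rng, len(draft_probs))
        # Shared seed: the target reuses the draft's noise instead of drawing its own.
        target_noise = draft_noise if shared_noise else gumbel_noise(rng, len(target_probs))
        draft_token = gumbel_max_sample(draft_probs, draft_noise)
        target_token = gumbel_max_sample(target_probs, target_noise)
        accepted += draft_token == target_token
    return accepted / trials

# Toy example: both models are uniform over the ten numbers.
uniform = [0.1] * 10
shared_rate = acceptance_rate(uniform, uniform, 2000, shared_noise=True)        # exactly 1.0
independent_rate = acceptance_rate(uniform, uniform, 2000, shared_noise=False)  # near 0.1 in expectation
```

With identical distributions and identical noise, both argmax computations see the same scores, so every draft token is accepted. Each model still samples from its own distribution, because the Gumbel-max trick yields an exact categorical sample for any fixed noise source.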
  • Mirai Labs
    @trymirai
    Mar 22
    Why Muon performs exceptionally well on quantized models
    ryan mathieu
    @gapDEEPry
    Mar 21
    Why does Muon beat Adam for training quantized networks? It comes down to what each optimizer treats as "distance" in weight space. Adam treats a weight matrix as a flat vector of numbers. Muon treats it as a linear map — and measures change by how much the input-output mapping
    21K
  • Mirai Labs
    @trymirai
    Mar 20
    Unlocking Apple’s 'M5-only' matmul API for every M-series chip with one hidden Metal function
    eugene
    @eugenebokhan
    Mar 20
    1/ Apple shipped Metal Performance Primitives — a GPU matmul API built on cooperative_tensor. If you look at Apple's open-source code for an example of how to use MPP, you'll find a hardcoded M5 memory layout.
    1.5K
  • Mirai Labs
    @trymirai
    Mar 19
    We tend to think AI quality = model quality. Bigger model → better answers. However, for local inference, performance depends on model + hardware + execution together. The same model can be much more or less efficient depending on how it’s run on-device. That means local AI
    376