Google Developer Experts - Medium

Google Antigravity 2.0 + Gemini 3.5 Flash: The AI That Codes, Tests, and Ships — Without You.

Geeta Kakrani — Tue, 16 Jun 2026 21:33:56 GMT

Antigravity 2.0, Antigravity CLI, and Gemini 3.5 Flash Explained Simply

By Geeta Kakrani | Google Developer Expert in AI & TPU

If you’ve been following Google I/O 2026, you already know something big happened. Google didn’t just release a new model or update an app. They released an entirely new way to think about writing software.

This post is my hands-on walkthrough of three things that shipped at I/O: Google Antigravity 2.0, the Antigravity CLI, and the Gemini 3.5 Flash model that powers all of it. I’ll explain each one simply, show you exactly how to get started, and tell you what I personally saw when I tried it.

No hype. No jargon. Just the real thing.

First, a Quick Backstory

Google launched the original Antigravity in November 2025 as a direct competitor to Cursor — an AI-powered code editor where you describe what you want, and AI writes it for you.

Version 2.0, announced at Google I/O on May 19, 2026, is a completely different product. It’s no longer just an editor. It’s a full platform with five parts:

A desktop app
A CLI (command-line tool)
An SDK (for building your own agents)
Managed Agents in the Gemini API
An enterprise deployment path

This blog covers the first two — the ones most developers will touch first.

Part 1: What Is Antigravity 2.0?

Think of it this way.

A normal code editor (like VS Code) is like having a very smart autocomplete. You still drive. You still type. AI just suggests the next word.

Antigravity 2.0 is completely different. You describe what you want to build. Antigravity figures out the steps, writes the code, runs it, tests it, and reports back.

You are no longer the one typing. You are the one deciding what to build.

What’s actually new in 2.0?

Multiple agents at the same time. The old version ran one task at a time. Antigravity 2.0 lets you spin up multiple AI agents in parallel. One agent can handle the backend, another the frontend, and a third can run tests — all at once. (Google internally used 93 agents to build a working OS in 12 hours. Yes, really.)

Background tasks. You can schedule tasks to run while you sleep. Log off, come back in the morning, and your code is done.

Native integration with Google tools. It connects directly with Google AI Studio, Firebase, and Android Studio. You can export a project from AI Studio and continue it locally in Antigravity without losing any context.

Voice commands. You can now speak your instructions instead of typing them.

Browser subagent. This one is genuinely impressive. Antigravity can open a real Chrome window, navigate to your running app, click buttons, fill forms, take screenshots of what it sees, and loop back to fix what’s broken. It’s not simulated. It’s real browser testing.

How to Download Antigravity 2.0

Go to antigravity.google/download

You’ll see download options for:

macOS — Apple Silicon or Intel
Windows — x64 or ARM64
Linux — x64 or ARM64

Minimum requirements:

macOS: Version 12 (Monterey) or newer. Apple Silicon recommended.
Windows: Windows 10 (64-bit) or newer
Linux: glibc >= 2.28 (Ubuntu 20, Debian 10, Fedora 36, RHEL 8 all work)

It’s free to download and free to use during the public preview.

Setting Up Your First Project

Once installed, you’ll see a sidebar with:

New Conversation — start a fresh task
Projects — your saved workspaces
Conversation History — past sessions
Scheduled Tasks — background automations

Create a new project, give it a folder path, and you’re ready. The agent dropdown at the bottom lets you choose your model — by default it uses Gemini 3.5 Flash (Medium), which we’ll cover in Part 3.

Project Settings Worth Knowing

When you open your project settings, you’ll see three important controls:

Security Preset — Controls how much the agent can do on its own. Set to “Custom” if you want fine-grained control.

Terminal Command Auto Execution — This decides whether the agent can run terminal commands automatically or whether it asks for your approval first. I recommend keeping this on “Require Review” when you’re starting out. You want to know what’s being run on your machine.

Outside of Folders File Access Policy — Controls whether the agent can read files outside your project folder. “Always Ask” is the safe choice here.

Enable Sandbox Mode — Restricts the agent to a secure, isolated environment. Good for testing untrusted code.

The important lesson: Antigravity can do a lot automatically. Set your security settings deliberately before you start.

Part 2: What Is the Antigravity CLI?

CLI stands for Command Line Interface. It’s for developers who prefer working in a terminal rather than a graphical app.

The Antigravity CLI lets you do everything the desktop app does — but from your terminal. You describe a task, and the agent runs it right there in your codebase.

One important thing to know: Google is retiring the old Gemini CLI on June 18, 2026. If you’ve been using Gemini CLI, you need to switch to Antigravity CLI before that date. After June 18, Gemini CLI stops working.

How to Install the Antigravity CLI

Mac or Linux: Open your terminal and run:

curl -fsSL https://antigravity.google/cli/install.sh | bash

Windows PowerShell:

irm https://antigravity.google/cli/install.ps1 | iex

Windows Command Prompt (CMD):

curl -fsSL https://antigravity.google/cli/install.cmd -o install.cmd && install.cmd && del install.cmd

That’s it. One command. The CLI installs itself.

After installation, you can use the CLI in any project folder. Type your task in plain English. The agent reads your files, figures out what needs to change, and makes it happen.

What Can You Do with the CLI?

Build new features by describing them
Debug existing code by explaining the error
Write tests for your functions
Refactor messy code
Deploy to Google Cloud Run (if you add the Cloud Run MCP integration)

The CLI is ideal if you’re already comfortable in a terminal and don’t want to switch to a new desktop app. You keep your existing editor (VS Code, Neovim, whatever you use) and get AI agents working alongside it.

Part 3: Gemini 3.5 Flash — The Model Behind Everything

Every capability in Antigravity 2.0 runs on Gemini 3.5 Flash, released on May 19, 2026 at Google I/O.

Here’s the surprising thing about this model: it’s called “Flash” (which usually means the smaller, cheaper version), but it outperforms Google’s previous best model, Gemini 3.1 Pro, on coding and agentic tasks. That’s the first time in Gemini’s history that a Flash model has beaten the Pro tier.

What Makes Gemini 3.5 Flash Different?

Speed. It generates output 4x faster than comparable frontier models — roughly 280 tokens per second. Tasks that used to take minutes now take seconds.

1 million token context window. That means you can feed it an entire large codebase and it can understand and reason across all of it at once.

Multimodal. It understands text, images, video, and audio. This is why Antigravity’s browser agent can take a screenshot and understand what it sees.

Built for agents. The model is specifically optimized for long, multi-step tasks that run without constant human input. It’s not a chatbot. It’s a worker.

MCP support. Gemini 3.5 Flash natively supports the Model Context Protocol (MCP) standard, which means tools built for other AI systems (like Claude) can often work with it directly.

How Was It Built?

Google says Gemini 3.5 Flash was partly co-developed using Antigravity itself. The tool helped build the model that now powers the tool. That’s a meaningful signal about how mature the platform has become.

Pricing

If you’re using the Gemini API directly:

Input: $1.50 per million tokens
Output: $9.00 per million tokens

It costs more than the older Gemini 3.1 Flash models, but it’s still significantly cheaper than comparable frontier models from OpenAI and Anthropic, while being faster and (on agentic tasks) more capable.

For Antigravity desktop app users:

Free during public preview
AI Pro — $20/month
AI Ultra — $100/month (5x higher usage limits than Pro, new at I/O 2026)

Part 4: Customizations and MCP Servers

One thing that impressed me when I explored the settings was the MCP server marketplace inside Antigravity.

MCP (Model Context Protocol) lets you connect Antigravity to external tools and services. Inside Settings > Customizations > Add MCP Servers, you’ll find options like:

Cloud Run — Deploy your app to Google Cloud with one command (already installed in my setup)
Google Kubernetes Engine (OSS) — Let Antigravity interact with your GKE clusters
Firebase — Connect your project to Firebase directly
PostHog — Run product analytics queries in plain English
GitLab Orbit — Query your GitLab codebase as a knowledge graph
Dart / Flutter — Flutter-specific agent tools

This is where Antigravity becomes genuinely powerful for production work. Instead of jumping between tools, the agent can orchestrate across all of them in one conversation.

Part 5: What About the Gemini API Key Setup?

If you want to use Gemini 3.5 Flash through the API directly (outside of Antigravity), you’ll need an API key.

Here’s the simple version:

Go to Google AI Studio (aistudio.google.com)
Click “Get API Key” → “Create API Key”
Give it a name (e.g. “Generative Language API Key”)
Under “Select API restrictions,” filter by “generative” and check Generative Language API
Click OK

That’s the API key you’ll use in your code. Keep it private. Don’t commit it to GitHub.

My Honest Take

I’ve been in tech for over two decades. I’ve seen a lot of “this changes everything” announcements that didn’t.

This one is different.

Antigravity 2.0 is not a better autocomplete. It’s a different paradigm. You describe. The agent builds. You review and steer. That shift — from typing code to directing agents — is real and it works.

The browser subagent alone is something I haven’t seen done this cleanly anywhere else. The model knows what the app looks like, not just what the code says.

That said: you still need to know architecture. Antigravity removes the friction of typing. It doesn’t remove the need to think. You need to give it clear goals, sensible project structure, and review what it produces.

Used well, it makes you faster. Used lazily, it makes a mess faster.

Start small. Try it on a real project. Set your security settings carefully. And watch what 93 agents can do for you.

Quick Reference

WhatWhereDownload Antigravity 2.0antigravity.google/downloadAntigravity CLI (Mac/Linux)curl -fsSL https://antigravity.google/cli/install.sh | bashGemini APIaistudio.google.comGemini CLI deadlineJune 18, 2026 — switch to Antigravity CLI before thenModel powering AntigravityGemini 3.5 FlashContext window1 million tokensAPI pricing$1.50 input / $9.00 output per million tokens

Geeta Kakrani is an AI Consultant and Google Developer Expert (GDE) in AI/ML and TPU with 22+ years in tech. You can find her at https://www.linkedin.com/in/geetakakrani/ and on YouTube @geetakakrani.

Google Antigravity 2.0 + Gemini 3.5 Flash: The AI That Codes, Tests, and Ships — Without You. was originally published in Google Developer Experts on Medium, where people are continuing the conversation by highlighting and responding to this story.

Building a Multimodal Indonesian Fake-News Detector with JAX, Flax, and Keras Kinetic on Cloud TPU

Esther Irawati Setiawan — Wed, 03 Jun 2026 14:11:58 GMT

How I trained a Stance-Aware Cross-Encoder that classifies Indonesian news headlines against claims — starting on a free Colab TPU and scaling out to Cloud TPU v5p with a single@kinetic.run() decorator

Introduction

Misinformation is one of the defining problems of the social-media era, and Indonesia has been hit particularly hard. Hoaks (the Indonesian shorthand for fake news) spread through WhatsApp groups and Twitter threads faster than any fact-checker can keep up. Most published research on automated fake-news detection focuses on English-language data, which leaves practitioners working with Bahasa Indonesia in a frustrating spot: the techniques exist, but the tooling and pre-trained models are scarce.

This article walks through building a real, working multimodal hoax detector for Indonesian news from scratch. The model takes two inputs — a claim (the original assertion, often from social media) and a headline (a news article headline that mentions the same topic) — and predicts whether the article supports the claim (for), refutes it (against), or merely observes it neutrally (observing).

The architecture is a Stance-Aware Cross-Encoder: a BiLSTM-style encoder for each input, multi-head self-attention, and a cross-attention layer that lets the claim and headline literally read each other before classification. Built end-to-end with JAX and Flax, trained on TPU.

The deployment story has two halves:

Free Colab TPU for prototyping — Google gives every Colab user free access to a v5e-1 TPU, which is enough to train this model end-to-end in under an hour at zero cost.
Cloud TPU v5p via Keras Kinetic for serious training — when you outgrow Colab’s runtime limits, Keras Kinetic lets you ship the same training function to a Cloud TPU pod with a single Python decorator. No Docker, no Kubernetes YAML, no SSH.

By the end, you’ll have:

A reusable Indonesian tokenizer and dataset loader
A 4-layer Transformer encoder with stance-aware cross-attention, written in Flax
A JIT-compiled training loop with optax and orbaxcheckpointing
A working predict.py that runs new claim-headline pairs through the trained model
The exact same code, deployed to Cloud TPU via@kinetic.run()

Let’s get into it.

Why This Problem Is Hard (and Interesting)

Naive fake-news detectors look at one piece of text and try to classify it as “real” or “fake.” That’s both technically weak and ethically uncomfortable — a single text rarely carries enough signal, and the “true/false” framing assumes the model has access to ground truth it can’t possibly have.

The stance detection framing is much more honest. Given a claim and a related news article, the model doesn’t decide whether the claim is true; it decides whether this particular article supports, refutes, or merely observes the claim. That’s a question a model can actually answer, and it’s exactly the input a downstream fact-checker needs to make a final call.

Mathematically, the task is a 3-way classification over theinteraction of two pieces of text. That word — interaction — is what makes the architecture interesting. You can’t just encode each side independently and concatenate. You need a layer that lets the claim attend to the headline and vice versa, so the model can pick up on subtle cues like negation (“Government denies…”), hedging (“alleged…”), or framing (“according to critics…”).

Why JAX, Flax, Keras Kinetic, and TPU?

JAX gives me NumPy-style code with automatic differentiation, JIT compilation via XLA, and transparent acceleration on CPU/GPU/TPU.
Flax sits on top of JAX and lets me write neural networks as nn.Module classes. The model is dense in attention layers, and Flax keeps the parameter management clean.
Optax for optimization (AdamW with linear warmup + cosine decay) and Orbax for checkpointing — both are part of the JAX ecosystem and JIT-friendly.
Keras Kinetic is the deployment glue. One decorator turns a local Python function into a remote TPU job, with container caching, log streaming, and automatic GKE provisioning.
TPU because the workload is dominated by attention matmuls — exactly what TPU systolic arrays are built for. Free on Colab (v5e-1), and Cloud TPU v5p when you need to scale up.

TPU vs GPU for This Workload

A multimodal Transformer with cross-attention is one of the cleanest TPU workloads you can write. Here’s why, and where GPUs still hold their own.

Hardware design

GPU (NVIDIA A100/H100): general-purpose parallel processor, thousands of CUDA cores, great for arbitrary parallel computation.
TPU (v5e or v5p): domain-specific accelerator built around a large systolic array (MXU) optimized for dense matrix multiplications.

What dominates the compute in this model

Multi-head self-attention: softmax(Q Kᵀ / √d) V — three big matmuls per head per layer.
Cross-attention between claim and headline: same shape, just with different inputs feeding Q vs K/V.
Feedforward blocks: two Dense layers with GELU between them.

All of those are dense matmuls with predictable shapes. The TPU systolic array is purpose-built to chew through exactly this pattern at peak FLOPs. The XLA compiler fuses the entire train_step into a few kernels, and after the first compile, every step runs at full throughput.

Where GPUs still win for stance detection / NLP

You’re doing token-level decoding with KV-cache and irregular generation lengths (we’re not — we’re doing classification).
You need a HuggingFace transformers model that’s only available as a PyTorch checkpoint (we’re training from scratch, so this doesn’t apply).
You want to iterate in a notebook with constant Python control flow that doesn’t JIT cleanly (Colab gives you both a TPU anda notebook, so you don’t have to choose).

Where TPUs win for stance detection / NLP

Fixed-length sequences (we pad to 64 tokens) → predictable shapes → great XLA compilation.
The whole train_step JITs into a single fused execution graph.
pmap / shard_map make multi-chip training a one-liner if you want to scale up.
Free on Colab, and Cloud TPU v5e is roughly $0.40/chip-hour on Spot.

Rule of thumb

Quick prototyping in a notebook with Bahasa Indonesia data → free Colab TPU. (This article.)
Iterative R&D using HuggingFace PyTorch checkpoints → GPU.
Production training with batchable, JAX-native workloads → Cloud TPU via Kinetic.
You need to fine-tune a 7B+ Indonesian LLM → that’s a different article (and a different category — vLLM or Tunix).

Now let’s actually build it.

Project Architecture

The pipeline is straightforward:

datasetika.csv (Claim, Judul, Stance)
            │
            ▼
   ┌──────────────────┐
   │ IndonesianTokenizer │ whitespace + punctuation, vocab from corpus
   └──────────────────┘
            │
            ▼
   ┌──────────────────┐
   │  FakeNewsDataset │ stratified train/val/test split, JAX-ready arrays
   └──────────────────┘
            │
            ▼
   ┌──────────────────────────────────────┐
   │  FakeNewsDetector (Flax nn.Module)   │
   │  ┌──────────────┐  ┌──────────────┐  │
   │  │ Token+Pos    │  │ Token+Pos    │  │
   │  │ Embedding    │  │ Embedding    │  │
   │  │ (Claim)      │  │ (Headline)   │  │
   │  └──────┬───────┘  └──────┬───────┘  │
   │         ▼                 ▼          │
   │  ┌──────────────┐  ┌──────────────┐  │
   │  │ Transformer  │  │ Transformer  │  │
   │  │ × N layers   │  │ × N layers   │  │
   │  └──────┬───────┘  └──────┬───────┘  │
   │         └────────┬────────┘          │
   │                  ▼                    │
   │      ┌─────────────────────┐         │
   │      │ Stance Cross-Encoder│         │
   │      │ (cross-attention +  │         │
   │      │  diff & product)    │         │
   │      └──────────┬──────────┘         │
   │                 ▼                    │
   │       Dense → Softmax (3 classes)    │
   └──────────────────────────────────────┘
            │
            ▼
   for / against / observing

The whole thing sits inside one jax.jit-compiled train_step. Now let’s walk through each piece.

Step 1 — Hardware Setup

There are two paths. Pick whichever fits your stage of the project.

Path A: Free Colab TPU (recommended for first run)

Open colab.research.google.com and create a new notebook.
Click Runtime → Change runtime type.
Under Hardware accelerator, select v5e-1 TPU.
Click Save.
Verify in a cell:

import jax
print(jax.devices())
# Expected: [TpuDevice(id=0, ...)]

That’s it. You now have a free TPU v5e chip for the duration of your Colab session.

Path B: Cloud TPU via Keras Kinetic (when you outgrow Colab)

Colab is fantastic for prototyping but has runtime limits and gets bumped under load. When you’re ready to run multi-hour training jobs, switch to a real Cloud TPU. The traditional path means provisioning a TPU VM, SSHing in, installing dependencies, and uploading scripts — Kinetic skips all of that.

On your local laptop:

pip install keras-kinetic
gcloud auth application-default login
gcloud config set project YOUR_PROJECT_ID
kinetic up --accelerator v5p-8 --yes

The last command provisions a GKE Autopilot cluster with a TPU v5p-8 node pool. Takes a few minutes the first time, after which you don’t touch infrastructure again until tear-down.

I’ll show the actual @kinetic.run() deployment in Step 6. For now, let’s build the model.

Step 2 — Indonesian Tokenizer and Dataset

Bahasa Indonesia is morphologically less complex than, say, Turkish or Finnish, so a whitespace + punctuation tokenizer with a learned vocabulary works surprisingly well as a baseline. (For production, swap in IndoBERT — I’ll show how at the end of this section.)

The tokenizer reserves four special tokens, builds a frequency-ranked vocabulary from the training corpus, and emits(token_ids, attention_mask) pairs at a fixed length. Standard stuff, but with one subtlety: we tokenize both the Claim and Judul (headline) columns into a shared vocabulary so the embedding layer can pick up cross-input correlations.

"""
Data Preprocessing for Indonesian Fake News Detection
Tokenizes Claim + Judul columns, encodes Stance labels.
Compatible with JAX/Flax training pipeline.
"""
import re
import numpy as np
import pandas as pd
from collections import Counter
from typing import List, Tuple, Dict
from sklearn.model_selection import train_test_split

LABEL_MAP   = {"for": 0, "against": 1, "observing": 2}
ID_TO_LABEL = {v: k for k, v in LABEL_MAP.items()}

class IndonesianTokenizer:
    """
    Lightweight whitespace + punctuation tokenizer for Indonesian text.

    For production, swap with:
        from transformers import AutoTokenizer
        tok = AutoTokenizer.from_pretrained("indobenchmark/indobert-base-p1")
    """
    SPECIAL_TOKENS = {"": 0, "": 1, "": 2, "": 3}

    def __init__(self, vocab_size: int = 30_000, min_freq: int = 2):
        self.vocab_size = vocab_size
        self.min_freq   = min_freq
        self.word2id: Dict[str, int] = dict(self.SPECIAL_TOKENS)
        self.id2word: Dict[int, str] = {v: k for k, v in self.word2id.items()}

    @staticmethod
    def _clean(text: str) -> str:
        text = text.lower()
        text = re.sub(r"<[^>]+>", " ", text)                   # strip HTML
        text = re.sub(r"[^\w\s]", " ", text, flags=re.UNICODE) # keep alphanum
        text = re.sub(r"\s+", " ", text).strip()
        return text

    @staticmethod
    def tokenize(text: str) -> List[str]:
        return IndonesianTokenizer._clean(text).split()

    def build_vocab(self, texts: List[str]) -> None:
        counter: Counter = Counter()
        for t in texts:
            counter.update(self.tokenize(t))
        top_tokens = [
            word for word, freq in counter.most_common()
            if freq >= self.min_freq
        ][: self.vocab_size - len(self.SPECIAL_TOKENS)]
        for idx, word in enumerate(top_tokens, start=len(self.SPECIAL_TOKENS)):
            self.word2id[word] = idx
            self.id2word[idx]  = word
        print(f"[Tokenizer] Vocabulary size: {len(self.word2id):,} tokens")

    def encode(self, text: str, max_len: int = 128, add_special: bool = True
              ) -> Tuple[np.ndarray, np.ndarray]:
        tokens = self.tokenize(text)
        ids    = [self.word2id.get(t, self.SPECIAL_TOKENS[""]) for t in tokens]
        if add_special:
            ids = [self.SPECIAL_TOKENS[""]] + ids + [self.SPECIAL_TOKENS[""]]
        ids = ids[:max_len]
        pad_len = max_len - len(ids)
        mask    = [1] * len(ids) + [0] * pad_len
        ids     = ids + [self.SPECIAL_TOKENS[""]] * pad_len
        return np.array(ids, dtype=np.int32), np.array(mask, dtype=np.int32)

    def encode_batch(self, texts: List[str], max_len: int = 128
                    ) -> Tuple[np.ndarray, np.ndarray]:
        pairs    = [self.encode(t, max_len) for t in texts]
        ids_arr  = np.stack([p[0] for p in pairs])
        mask_arr = np.stack([p[1] for p in pairs])
        return ids_arr, mask_arr

The dataset class loads the CSV, builds the tokenizer on the combined Claim + Judul corpus, encodes both columns into fixed-length arrays, and produces stratified train/val/test splits:

class FakeNewsDataset:
    """
    Loads datasetika.csv and prepares JAX-compatible NumPy arrays.

    Expected columns:
        id, Claim, idStance, Judul, Stance
        Stance ∈ {for, against, observing}
    """
    def __init__(
        self,
        csv_path: str,
        claim_max_len:    int = 64,
        headline_max_len: int = 64,
        vocab_size:       int = 30_000,
        test_size:        float = 0.15,
        val_size:         float = 0.10,
        random_seed:      int = 42,
    ):
        self.claim_max_len    = claim_max_len
        self.headline_max_len = headline_max_len

        df = pd.read_csv(csv_path)
        print(f"[Dataset] Loaded {len(df):,} samples")
        print(f"[Dataset] Stance distribution:\n{df['Stance'].value_counts()}\n")

        self.tokenizer = IndonesianTokenizer(vocab_size)
        all_texts = df["Claim"].tolist() + df["Judul"].tolist()
        self.tokenizer.build_vocab(all_texts)

        claim_ids,    claim_masks    = self.tokenizer.encode_batch(
            df["Claim"].tolist(),  claim_max_len)
        headline_ids, headline_masks = self.tokenizer.encode_batch(
            df["Judul"].tolist(),  headline_max_len)
        labels = np.array([LABEL_MAP[s] for s in df["Stance"]], dtype=np.int32)

        # Stratified train / val / test
        indices = np.arange(len(df))
        train_idx, test_idx = train_test_split(
            indices, test_size=test_size, stratify=labels, random_state=random_seed)
        train_idx, val_idx = train_test_split(
            train_idx,
            test_size=val_size / (1 - test_size),
            stratify=labels[train_idx],
            random_state=random_seed,
        )

        self.train = self._slice(claim_ids, claim_masks, headline_ids,
                                 headline_masks, labels, train_idx)
        self.val   = self._slice(claim_ids, claim_masks, headline_ids,
                                 headline_masks, labels, val_idx)
        self.test  = self._slice(claim_ids, claim_masks, headline_ids,
                                 headline_masks, labels, test_idx)

    @staticmethod
    def _slice(claim_ids, claim_masks, head_ids, head_masks, labels, idx):
        return {
            "claim_ids":     claim_ids[idx],
            "claim_masks":   claim_masks[idx],
            "headline_ids":  head_ids[idx],
            "headline_masks": head_masks[idx],
            "labels":        labels[idx],
        }

Production swap: if you have GPU/TPU memory to spare, replace IndonesianTokenizer withAutoTokenizer.from_pretrained("indobenchmark/indobert-base-p1") and feed those token IDs into the same downstream model. The training accuracy jump is significant — IndoBERT was pre-trained on hundreds of millions of Indonesian tokens.

Step 3 — The Multimodal Architecture (JAX + Flax)

This is where it gets fun. The model has four logical pieces stacked in a Flax nn.Module:

Token + positional embedding for both inputs
Transformer encoder stack (multi-head self-attention + feedforward) for each input independently
Stance-aware cross-encoder that lets claim and headline attend to each other
Classification head that produces logits over {for, against, observing}

3.1 — Token and Positional Embedding

import jax
import jax.numpy as jnp
import flax.linen as nn
from typing import Tuple

class TokenEmbedding(nn.Module):
    """Learnable token + positional embeddings."""
    vocab_size: int
    embed_dim: int
    max_len: int = 256
    dropout_rate: float = 0.1

    @nn.compact
    def __call__(self, token_ids: jnp.ndarray, train: bool = False) -> jnp.ndarray:
        tok_emb = nn.Embed(self.vocab_size, self.embed_dim)(token_ids)  # (B, L, D)
        pos = jnp.arange(token_ids.shape[1])[None, :]                   # (1, L)
        pos_emb = nn.Embed(self.max_len, self.embed_dim)(pos)           # (1, L, D)
        x = tok_emb + pos_emb
        x = nn.LayerNorm()(x)
        x = nn.Dropout(self.dropout_rate, deterministic=not train)(x)
        return x

3.2 — Multi-Head Attention

A clean, from-scratch multi-head attention. Yes, Flax hasnn.MultiHeadDotProductAttention built in, but writing it explicitly keeps the article educational and lets you see exactly what the attention mask is doing:

class MultiHeadAttention(nn.Module):
    num_heads: int
    head_dim: int
    dropout_rate: float = 0.1

    @nn.compact
    def __call__(self, q: jnp.ndarray, k: jnp.ndarray, v: jnp.ndarray,
                 mask: jnp.ndarray = None, train: bool = False) -> jnp.ndarray:
        B, Lq, D = q.shape
        Lk = k.shape[1]
        H, Dh = self.num_heads, self.head_dim

        Q = nn.Dense(H * Dh)(q).reshape(B, Lq, H, Dh).transpose(0, 2, 1, 3)
        K = nn.Dense(H * Dh)(k).reshape(B, Lk, H, Dh).transpose(0, 2, 1, 3)
        V = nn.Dense(H * Dh)(v).reshape(B, Lk, H, Dh).transpose(0, 2, 1, 3)

        scores = jnp.einsum("bhqd,bhkd->bhqk", Q, K) / jnp.sqrt(Dh)
        if mask is not None:
            # mask: (B, Lk) -> (B, 1, 1, Lk)
            mask = mask[:, None, None, :]
            scores = jnp.where(mask == 0, -1e9, scores)

        attn = jax.nn.softmax(scores, axis=-1)
        attn = nn.Dropout(self.dropout_rate, deterministic=not train)(attn)
        out  = jnp.einsum("bhqk,bhkd->bhqd", attn, V)
        out  = out.transpose(0, 2, 1, 3).reshape(B, Lq, H * Dh)
        out  = nn.Dense(D)(out)
        return out

3.3 — Transformer Block

Standard pre-norm Transformer block with a feedforward expansion factor of 4:

class TransformerBlock(nn.Module):
    num_heads: int
    head_dim: int
    ff_dim: int
    dropout_rate: float = 0.1

    @nn.compact
    def __call__(self, x: jnp.ndarray, mask: jnp.ndarray = None,
                 train: bool = False) -> jnp.ndarray:
        # Self-attention sublayer
        h = nn.LayerNorm()(x)
        h = MultiHeadAttention(self.num_heads, self.head_dim, self.dropout_rate)(
            h, h, h, mask=mask, train=train)
        h = nn.Dropout(self.dropout_rate, deterministic=not train)(h)
        x = x + h

        # Feedforward sublayer
        h = nn.LayerNorm()(x)
        h = nn.Dense(self.ff_dim)(h)
        h = nn.gelu(h)
        h = nn.Dense(x.shape[-1])(h)
        h = nn.Dropout(self.dropout_rate, deterministic=not train)(h)
        x = x + h
        return x

3.4 — Stance Cross-Encoder (the Interesting Part)

This is what makes the architecture multimodal-aware rather than just two encoders bolted together. After the claim and headline are independently encoded, the cross-encoder lets each one attend to the other, then extracts an interaction vector by concatenating the difference and element-wise product of the two pooled representations:

class StanceCrossEncoder(nn.Module):
    """Cross-attention between claim and headline + interaction features."""
    num_heads: int
    head_dim: int
    dropout_rate: float = 0.1

    @nn.compact
    def __call__(self, claim: jnp.ndarray, headline: jnp.ndarray,
                 claim_mask: jnp.ndarray, headline_mask: jnp.ndarray,
                 train: bool = False) -> jnp.ndarray:
        # Claim attends to headline
        claim_attended = MultiHeadAttention(
            self.num_heads, self.head_dim, self.dropout_rate
        )(claim, headline, headline, mask=headline_mask, train=train)

        # Headline attends to claim
        head_attended = MultiHeadAttention(
            self.num_heads, self.head_dim, self.dropout_rate
        )(headline, claim, claim, mask=claim_mask, train=train)

        # Mean-pool with mask
        def masked_mean(x, m):
            m = m[..., None].astype(x.dtype)
            return (x * m).sum(axis=1) / jnp.maximum(m.sum(axis=1), 1.0)

        c = masked_mean(claim_attended, claim_mask)     # (B, D)
        h = masked_mean(head_attended,  headline_mask)  # (B, D)

        # Stance-aware interaction features
        interaction = jnp.concatenate(
            [c, h, jnp.abs(c - h), c * h], axis=-1
        )  # (B, 4D)
        return interaction

The four components — c, h, |c - h|, c * h — capture different aspects of how the two inputs relate. This is a classic NLI (natural language inference) trick and it works well here: |c - h| flags semantic distance, while c * h highlights agreement.

3.5 — Putting It All Together

class FakeNewsDetector(nn.Module):
    vocab_size: int
    embed_dim: int = 256
    num_heads: int = 8
    head_dim: int = 32
    ff_dim: int = 1024
    num_layers: int = 4
    dropout_rate: float = 0.1
    num_classes: int = 3

    @nn.compact
    def __call__(self, claim_ids, claim_mask, headline_ids, headline_mask,
                 train: bool = False):
        # Embed both inputs
        claim_x    = TokenEmbedding(self.vocab_size, self.embed_dim,
                                    dropout_rate=self.dropout_rate)(claim_ids, train)
        headline_x = TokenEmbedding(self.vocab_size, self.embed_dim,
                                    dropout_rate=self.dropout_rate)(headline_ids, train)

        # Independent Transformer stacks
        for _ in range(self.num_layers):
            claim_x = TransformerBlock(self.num_heads, self.head_dim,
                                       self.ff_dim, self.dropout_rate)(
                claim_x, mask=claim_mask, train=train)
            headline_x = TransformerBlock(self.num_heads, self.head_dim,
                                          self.ff_dim, self.dropout_rate)(
                headline_x, mask=headline_mask, train=train)

        # Cross-encoder interaction
        interaction = StanceCrossEncoder(self.num_heads, self.head_dim,
                                         self.dropout_rate)(
            claim_x, headline_x, claim_mask, headline_mask, train=train)

        # Classification head
        h = nn.Dense(self.embed_dim)(interaction)
        h = nn.gelu(h)
        h = nn.Dropout(self.dropout_rate, deterministic=not train)(h)
        logits = nn.Dense(self.num_classes)(h)
        return logits

A 4-layer encoder stack with embed_dim=256, num_heads=8, and ff_dim=1024 lands at around 7M parameters — small enough to train quickly on a free Colab TPU, large enough to learn meaningful stance representations.

Step 4 — Training Pipeline (TPU-Optimized)

The training loop is a standard Optax + JAX pattern, with two TPU-specific touches: everything is jax.jit-compiled, and we use Orbax for checkpointing because it handles the JAX pytreeparameter format natively.

4.1 — Train State and Loss

import optax
from flax.training import train_state

def create_train_state(rng, model, learning_rate, weight_decay,
                       claim_len, head_len, num_warmup_steps, num_total_steps):
    # Dummy inputs to initialize parameters
    dummy_claim_ids   = jnp.ones((1, claim_len), dtype=jnp.int32)
    dummy_claim_mask  = jnp.ones((1, claim_len), dtype=jnp.int32)
    dummy_head_ids    = jnp.ones((1, head_len),  dtype=jnp.int32)
    dummy_head_mask   = jnp.ones((1, head_len),  dtype=jnp.int32)

    params = model.init(
        rng, dummy_claim_ids, dummy_claim_mask,
        dummy_head_ids, dummy_head_mask, train=False,
    )["params"]

    # Linear warmup + cosine decay
    schedule = optax.warmup_cosine_decay_schedule(
        init_value=0.0, peak_value=learning_rate,
        warmup_steps=num_warmup_steps,
        decay_steps=num_total_steps - num_warmup_steps,
        end_value=learning_rate * 0.1,
    )
    optimizer = optax.adamw(schedule, weight_decay=weight_decay)

    return train_state.TrainState.create(
        apply_fn=model.apply, params=params, tx=optimizer)

def cross_entropy_loss(logits, labels):
    one_hot = jax.nn.one_hot(labels, num_classes=logits.shape[-1])
    return -jnp.mean(jnp.sum(one_hot * jax.nn.log_softmax(logits), axis=-1))

4.2 — JIT-Compiled Train and Eval Steps

@jax.jit
def train_step(state, batch, dropout_rng):
    def loss_fn(params):
        logits = state.apply_fn(
            {"params": params},
            batch["claim_ids"], batch["claim_masks"],
            batch["headline_ids"], batch["headline_masks"],
            train=True,
            rngs={"dropout": dropout_rng},
        )
        loss = cross_entropy_loss(logits, batch["labels"])
        return loss, logits

    (loss, logits), grads = jax.value_and_grad(loss_fn, has_aux=True)(state.params)
    state = state.apply_gradients(grads=grads)
    accuracy = (jnp.argmax(logits, axis=-1) == batch["labels"]).mean()
    return state, loss, accuracy

@jax.jit
def eval_step(state, batch):
    logits = state.apply_fn(
        {"params": state.params},
        batch["claim_ids"], batch["claim_masks"],
        batch["headline_ids"], batch["headline_masks"],
        train=False,
    )
    loss = cross_entropy_loss(logits, batch["labels"])
    accuracy = (jnp.argmax(logits, axis=-1) == batch["labels"]).mean()
    return loss, accuracy

The @jax.jit is doing a lot of work here. On the first call, XLA compiles the entire forward pass, loss computation, gradient calculation, and parameter update into a single fused execution graph. After that first ~30-second compile, every subsequent step runs at peak TPU throughput.

4.3 — The Training Loop

import orbax.checkpoint as ocp
from pathlib import Path
import time

def train(args):
    # Load and prepare data
    ds = FakeNewsDataset(
        args.data_path,
        claim_max_len=args.claim_len,
        headline_max_len=args.head_len,
    )

    # Create model and train state
    model = FakeNewsDetector(
        vocab_size=len(ds.tokenizer.word2id),
        embed_dim=args.embed_dim,
        num_heads=args.num_heads,
    )
    rng = jax.random.PRNGKey(0)
    init_rng, train_rng = jax.random.split(rng)

    steps_per_epoch  = len(ds.train["labels"]) // args.batch_size
    num_total_steps  = steps_per_epoch * args.epochs
    num_warmup_steps = int(0.1 * num_total_steps)

    state = create_train_state(
        init_rng, model, args.lr, weight_decay=0.01,
        claim_len=args.claim_len, head_len=args.head_len,
        num_warmup_steps=num_warmup_steps,
        num_total_steps=num_total_steps,
    )

    # Orbax checkpointer
    ckpt_dir = Path(args.output_dir).resolve()
    ckpt_dir.mkdir(parents=True, exist_ok=True)
    checkpointer = ocp.PyTreeCheckpointer()
    best_val_acc = 0.0

    for epoch in range(args.epochs):
        epoch_start = time.time()
        train_rng, shuffle_rng = jax.random.split(train_rng)

        # Shuffle training indices
        perm = jax.random.permutation(shuffle_rng, len(ds.train["labels"]))

        # Training pass
        train_loss, train_acc = 0.0, 0.0
        for step in range(steps_per_epoch):
            batch_idx = perm[step * args.batch_size:(step + 1) * args.batch_size]
            batch = {k: v[batch_idx] for k, v in ds.train.items()}
            train_rng, dropout_rng = jax.random.split(train_rng)
            state, loss, acc = train_step(state, batch, dropout_rng)
            train_loss += float(loss)
            train_acc  += float(acc)
        train_loss /= steps_per_epoch
        train_acc  /= steps_per_epoch

        # Validation pass
        val_loss, val_acc = 0.0, 0.0
        val_steps = len(ds.val["labels"]) // args.batch_size
        for step in range(val_steps):
            batch = {k: v[step * args.batch_size:(step + 1) * args.batch_size]
                     for k, v in ds.val.items()}
            loss, acc = eval_step(state, batch)
            val_loss += float(loss)
            val_acc  += float(acc)
        val_loss /= val_steps
        val_acc  /= val_steps

        elapsed = time.time() - epoch_start
        print(f"Epoch {epoch+1:3d}/{args.epochs} | "
              f"train loss {train_loss:.4f} acc {train_acc:.4f} | "
              f"val loss {val_loss:.4f} acc {val_acc:.4f} | "
              f"{elapsed:.1f}s")

        # Save best checkpoint
        if val_acc > best_val_acc:
            best_val_acc = val_acc
            checkpointer.save(ckpt_dir / "best", state.params, force=True)
            print(f"  ✓ Saved checkpoint (val_acc={val_acc:.4f})")

4.4 — Running on Free Colab TPU

In a Colab cell:

class Args:
    data_path  = "datasetika.csv"
    output_dir = "checkpoints"
    epochs     = 20
    batch_size = 64
    lr         = 2e-4
    claim_len  = 64
    head_len   = 64
    embed_dim  = 256
    num_heads  = 8

train(Args)

Sample output on Colab v5e-1:

[Dataset] Loaded 12,847 samples
[Dataset] Stance distribution:
observing    5,932
for          4,221
against      2,694

[Tokenizer] Vocabulary size: 18,734 tokens
Epoch   1/20 | train loss 0.9932 acc 0.5414 | val loss 0.8821 acc 0.6102 | 41.3s
Epoch   2/20 | train loss 0.7881 acc 0.6587 | val loss 0.7024 acc 0.7011 | 22.8s
Epoch   3/20 | train loss 0.6014 acc 0.7321 | val loss 0.6398 acc 0.7314 | 22.5s
...
Epoch  20/20 | train loss 0.2741 acc 0.9001 | val loss 0.4012 acc 0.8412 | 22.6s

The first epoch is slower because XLA is compiling the train_step. From epoch 2 onward, every epoch takes ~22 seconds on a free Colab v5e-1. Total wall time: about 8 minutes for 20 epochs.

Step 5 — Inference on New Claims

Once trained, classifying a new claim-headline pair is a one-shot forward pass. Load the tokenizer and checkpointed parameters, encode the inputs, run the model:

import pickle
import orbax.checkpoint as ocp

def load_model(checkpoint_path: str, tokenizer_path: str,
               claim_len: int = 64, headline_len: int = 64,
               embed_dim: int = 256, num_heads: int = 8, num_layers: int = 4):
    with open(tokenizer_path, "rb") as f:
        tokenizer: IndonesianTokenizer = pickle.load(f)

    model = FakeNewsDetector(
        vocab_size=len(tokenizer.word2id),
        embed_dim=embed_dim,
        num_heads=num_heads,
        num_layers=num_layers,
    )

    checkpointer = ocp.PyTreeCheckpointer()
    params = checkpointer.restore(checkpoint_path)
    return model, params, tokenizer

def predict(model, params, tokenizer, claim: str, headline: str,
            claim_len: int = 64, headline_len: int = 64):
    claim_ids,    claim_mask    = tokenizer.encode(claim,    claim_len)
    headline_ids, headline_mask = tokenizer.encode(headline, headline_len)

    # Add batch dimension
    claim_ids     = claim_ids[None, :]
    claim_mask    = claim_mask[None, :]
    headline_ids  = headline_ids[None, :]
    headline_mask = headline_mask[None, :]

    logits = model.apply(
        {"params": params},
        claim_ids, claim_mask, headline_ids, headline_mask, train=False,
    )
    probs    = jax.nn.softmax(logits, axis=-1)[0]
    pred_idx = int(jnp.argmax(probs))
    return ID_TO_LABEL[pred_idx], {ID_TO_LABEL[i]: float(probs[i]) for i in range(3)}

# Example
model, params, tokenizer = load_model("checkpoints/best", "tokenizer.pkl")

examples = [
    {"claim":    "Pemerintah resmi menaikkan harga BBM bulan depan",
     "headline": "Kementerian ESDM bantah rencana kenaikan harga BBM",
     "expected": "against"},
    {"claim":    "Vaksin baru efektif 95 persen mencegah penularan",
     "headline": "Studi terbaru: vaksin tunjukkan efikasi 95 persen pada uji klinis fase 3",
     "expected": "for"},
    {"claim":    "Presiden umumkan kebijakan baru ekonomi",
     "headline": "Pengamat: masih perlu dikaji lebih lanjut dampaknya",
     "expected": "observing"},
]

for ex in examples:
    pred, probs = predict(model, params, tokenizer, ex["claim"], ex["headline"])
    print(f"\nClaim:    {ex['claim']}")
    print(f"Headline: {ex['headline']}")
    print(f"Predicted: {pred} (expected: {ex['expected']})")
    print(f"Probabilities: {probs}")

Step 6 — Scaling Up: Deploying to Cloud TPU with Keras Kinetic

Free Colab TPU is great for prototyping. But Colab sessions have runtime limits (12h max, often less under load), no persistent storage, and you can’t run multiple experiments in parallel. When you want to do real hyperparameter sweeps or train on a larger dataset, it’s time to graduate to Cloud TPU.

The traditional path means provisioning a TPU VM, SSHing in, installing dependencies, uploading scripts, and managing the deployment yourself. Keras Kinetic skips all of that. You decorate your training function with @kinetic.run(accelerator="v5p-8") and call it from your laptop. Kinetic packages the code, builds a container, provisions a GKE cluster with TPUs attached, runs the function on the remote pod, and streams logs back to your local terminal.

6.1 — Wrap Your Training Function

Save the existing tokenizer, dataset, model, and training code asfakenews.py. Then create train_kinetic.py:

import kinetic

@kinetic.run(
    accelerator="v5p-8",
    requirements=[
        "jax[tpu]",
        "flax",
        "optax",
        "orbax-checkpoint",
        "numpy",
        "pandas",
        "scikit-learn",
    ],
)
def train_remote(data_gcs_path: str, epochs: int = 50, batch_size: int = 128,
                 lr: float = 2e-4):
    # CRITICAL: imports happen *inside* the function. The body runs on
    # the remote TPU pod, so imports resolve against the pod's installed
    # packages, not your laptop's.
    import os
    os.environ["JAX_PLATFORMS"] = "tpu"

    from fakenews import FakeNewsDataset, FakeNewsDetector, train

    class Args:
        data_path  = data_gcs_path
        output_dir = "/tmp/checkpoints"
        epochs     = epochs
        batch_size = batch_size
        lr         = lr
        claim_len  = 64
        head_len   = 64
        embed_dim  = 256
        num_heads  = 8

    train(Args)
    return {"status": "completed", "epochs": epochs}

if __name__ == "__main__":
    # This runs on the *remote* TPU but feels like a local function call.
    result = train_remote(
        data_gcs_path="gs://my-bucket/datasetika.csv",
        epochs=50,
        batch_size=128,
    )
    print(f"Training finished: {result}")

Three things worth highlighting:

All imports go inside the function — the body runs on the remote pod, so import jax needs to resolve against the pod’s jax[tpu] package, not your laptop’s.
requirements=[...] is your remote requirements.txt — Kinetic uses it to build the container image on the first run. None of these need to be installed locally.
Data lives in GCS — your dataset goes to a Google Cloud Storage bucket, and the function reads it from gs://.... The TPU pod has automatic GCS access via its service account.

6.2 — Launch from Your Laptop

python train_kinetic.py

First run takes ~5 minutes for the container build. Subsequent runs with unchanged dependencies start in under a minute. You’ll see remote logs streamed to your terminal:

Shipping to TPU via Kinetic...
[Stage 1/4] Preflight & packaging...
[Stage 2/4] Building container image (5m, cached after this run)...
[Stage 3/4] Submitting job to GKE cluster...
[Stage 4/4] Executing on TPU v5p-8...
[remote] [Dataset] Loaded 12,847 samples
[remote] [Tokenizer] Vocabulary size: 18,734 tokens
[remote] Epoch   1/50 | train loss 0.9912 acc 0.5421 | val loss 0.8810 acc 0.6112 | 12.8s
[remote] Epoch   2/50 | train loss 0.7821 acc 0.6601 | val loss 0.6987 acc 0.7022 | 4.2s
...
[remote] Epoch  50/50 | train loss 0.1912 acc 0.9301 | val loss 0.3811 acc 0.8612 | 4.3s
Job complete. Streaming results to local...
Training finished: {'status': 'completed', 'epochs': 50}

The v5p-8 has 8 chips vs Colab’s 1, so per-epoch time drops from 22s to ~4s — about 5× faster, which matters when you’re running 50+ epochs or sweeping hyperparameters.

6.3 — Tear Down

The GKE cluster’s control plane costs money even when no TPU nodes are active. Always shut it down when you’re done:

kinetic down --yes

Training Configuration Summary

Architecture: 4-layer Transformer + Stance Cross-Encoder
Embedding dim: 256
Attention heads: 8 (head_dim 32)
Feedforward dim: 1024
Total parameters: ~7M
Optimizer: AdamW with linear warmup (10% of steps) + cosine decay
Learning rate: 2e-4 peak
Weight decay: 0.01
Dropout: 0.1
Batch size: 64 (Colab) / 128 (Cloud TPU v5p-8)
Epochs: 20 (Colab) / 50 (Cloud TPU)
Sequence length: 64 tokens for both claim and headline

Key Takeaways

Free Colab TPU is the best on-ramp for JAX/Flax in 2026

You get a full v5e-1 chip with no signup beyond a Google account. For models in the 5–50M parameter range, that’s enough to do real work, not just toy demos. If you’ve been putting off learning JAX because the cloud setup felt overwhelming, this path is friction-free.

Keras Kinetic makes the leap from Colab to Cloud TPU painless

The biggest practical lesson from this project: when Colab’s runtime limits start hurting, switching to Cloud TPU traditionally means rewriting your deployment story. With Kinetic, you wrap your existing function in a decorator and call it from the same laptop you’ve been using. The mental model — “I have a function; I want it to run on a TPU” — stays intact.

Stance detection is a more honest framing than fake-news classification

Asking a model “is this true?” puts it in an impossible position. Asking “does this article support, refute, or merely observe this claim?” gives it a question it can answer, and gives downstream fact-checkers exactly the signal they need.

Cross-attention beats independent encoders for paired-text tasks

The StanceCrossEncoder is what makes this model genuinely multimodal-aware. Concatenating two independently-encoded vectors and slapping a classifier on top works, but performance jumps significantly when you let the two inputs literally read each other before pooling. The [c, h, |c-h|, c*h] interaction trick is borrowed from NLI literature and consistently outperforms simpler combinations.

TPU is the right pick for fixed-length attention workloads

Multi-head attention with fixed sequence lengths is exactly what TPU systolic arrays were built for. The whole train_step JITs into a single fused execution graph, and after the first compile, every step runs at peak FLOPs. GPUs are still preferable when you need irregular-length sequences, KV-caching for generation, or HuggingFace PyTorch checkpoints — but for from-scratch JAX/Flax training, TPU wins on both speed and cost.

Adapting This to Other Languages and Tasks

The pipeline is generic. To use it on your own data:

Any language → just point the tokenizer at your text corpus. The whitespace tokenizer is language-agnostic. For better quality, swap in the appropriate pre-trained tokenizer (English → BERT, Indonesian → IndoBERT, Arabic → AraBERT, etc.).
Any paired-text classification task → swap the labels. The [c, h, |c-h|, c*h] interaction trick works for paraphrase detection, NLI, semantic textual similarity, duplicate question detection, and more.
Larger corpora → bump embed_dim, num_layers, and switch from a learned tokenizer to a SentencePiece or BPE one. The training loop and Kinetic deployment don’t change.

Resources

Project repository: (your GitHub link here)
JAX: github.com/google/jax
Flax: github.com/google/flax
Optax: github.com/google-deepmind/optax
Orbax: github.com/google/orbax
Keras Kinetic: github.com/keras-team/kinetic
IndoBERT (production tokenizer upgrade):huggingface.co/indobenchmark/indobert-base-p1
Cloud TPU documentation: cloud.google.com/tpu/docs
Google Colab: colab.research.google.com

Acknowledgement

Google Cloud credits were provided for this project. #TPUSprint
Thanks to the JAX, Flax, and Keras teams for building such a clean stack — training custom transformers on TPUs used to be a research-grade pain.

Tags: JAX, Flax, KerasKinetic, TPU, GoogleColab, FakeNewsDetection, StanceDetection, NaturalLanguageProcessing, MachineLearning, Indonesia

Building a Multimodal Indonesian Fake-News Detector with JAX, Flax, and Keras Kinetic on Cloud TPU was originally published in Google Developer Experts on Medium, where people are continuing the conversation by highlighting and responding to this story.

How to Build and Deploy an AI Agent with Gemini CLI and Google ADK: A

Esther Irawati Setiawan — Wed, 03 Jun 2026 14:11:57 GMT

Step-by-Step Vibe Coding Tutorial

A simplified, hands-on walkthrough of Google’s “Agentverse: The Shadowblade’s Codex” codelab, focused on the build itself.

📂 All the commands from this tutorial, in one quick-reference file: github.com/estherirawati/agentverse

Google’s Agentverse codelab wraps everything in a fun fantasy theme. You’re a “Shadowblade,” your terminal is your “primary weapon,” and you’re fighting an entropy monster called The Static. The story is a great hook, and this tutorial keeps a light touch of it while focusing on what you build and how, so you can follow along step by step.

Underneath the theme, this is a genuinely good tour of the modern AI-agent workflow on Google Cloud: prompting code into existence with Gemini CLI, wiring the CLI to external tools, building a real autonomous agent with the Agent Development Kit (ADK), testing it, and shipping it to Cloud Run.

Here’s what the journey actually looks like.

What you’re really building

By the end you’ll have:

A working command-line AI assistant (Gemini CLI) that can read your intent and act on it
A personal website generated almost entirely from a single prompt
The CLI connected to external tools through MCP servers (a self-hosted Git server and an image generator)
A real autonomous agent built with ADK that picks the right “weapon” tool for each task
An automated test suite for that agent
The agent deployed to Cloud Run as a live service

The fantasy theme maps cleanly onto real engineering ideas, and the codelab makes this explicit in its “for non-gamers” sidebars. I’ll translate as we go.

Step 0: Setup (the necessary part)

You’ll need a personal Gmail account — corporate or school accounts won’t work because of the credit grant.

Claim your free Google Cloud credit through the workshop link, sign in, accept the terms.
Open Cloud Shell at console.cloud.google.com, then click Open Editor.
Clone the starter code and run the setup script:

git clone https://github.com/weimeilin79/agentverse-developer.git
chmod +x ~/agentverse-developer/*.sh
cd ~/agentverse-developer
./init.sh    # press Enter to accept the default Project ID

Point your config at the new project and turn on the APIs you’ll need:

gcloud config set project $(cat ~/project_id.txt) --quiet
gcloud services enable compute.googleapis.com artifactregistry.googleapis.com \
  run.googleapis.com cloudbuild.googleapis.com aiplatform.googleapis.com \
  iam.googleapis.com cloudresourcemanager.googleapis.com
npm update -g @google/gemini-cli

That’s the whole foundation. Now the interesting part.

Step 1: Meet your “weapon” — Gemini CLI

Gemini CLI is an open-source AI agent that lives in your terminal. It runs a reason-and-act (ReAct) loop: it reads your high-level request, breaks it into steps, picks the right tool, runs it, and checks the result.

Start it from a fresh folder:

cd ~/agentverse-developer
mkdir playground
cd playground
gemini    # choose "No" when it asks to connect the Cloud Shell editor

A few commands worth knowing right away, run inside the CLI — /help lists commands, /tools shows built-in abilities (ReadFile, WriteFile, GoogleSearch…), the ! prefix runs a normal shell command, and /memorystores context:

/help
/tools
!ls -l
/memory add "My name is [your name]. I'm learning about AI agents."
/memory show

That /memory feature is your first taste of context engineering — deliberately feeding the AI background so its output stays relevant.

Now the headline trick — generating code from a sentence:

Write a Python script called hello.py that prints "I built my first
AI-generated file" and the current date.

Then verify it actually worked:

!ls
!cat hello.py
!python3 hello.py

Exit anytime with Ctrl+C twice. That’s it — you just “vibe-coded.”

What “vibe coding” really means: instead of writing every line by hand, you describe your intent in plain language and let the assistant generate the code and config. The industry term is intent-driven development.

Step 2: Build a real website, then connect external tools

Same idea, bigger ask. Inside the CLI:

Create a personal profile website in the current folder. Dark theme,
electric blue accents. Two files: index.html and styles.css. Use flexbox
for a two-column layout. Include a placeholder spot for a profile picture.
Make the code clean and commented. Don't start any server.

Exit the CLI and preview it:

python -m http.server
# Cloud Shell → Web Preview → port 8000, then Ctrl+C to stop

Giving the CLI hands: MCP servers

So far Gemini CLI can only talk and edit local files. To let it take real action in other systems — Git, databases, APIs — you connect a Model Context Protocol (MCP) server. Think of it as a power cable between the AI’s “brain” and a tool’s “body.”

The codelab spins up Gitea, a self-hosted Git server, as the first tool:

cd ~/agentverse-developer
./gitea.sh
# Web Preview → port 3005 → log in with dev / dev

Then register it in Gemini’s config so the CLI can see it:

if [ ! -f ~/.gemini/settings.json ]; then
  echo '{"mcpServers":{"gitea":{"url":"http://localhost:8085/sse"}}}' > ~/.gemini/settings.json
else
  jq '. * {"mcpServers":{"gitea":{"url":"http://localhost:8085/sse"}}}' ~/.gemini/settings.json > ~/.gemini/settings.json.tmp && mv ~/.gemini/settings.json.tmp ~/.gemini/settings.json
fi

Relaunch the CLI from your project folder and confirm Gitea is wired up:

cd ~/agentverse-developer/playground
gemini

/mcp

With Gitea visible, you can now do version control with plain English:

Create a new Gitea repository named 'my-profile' with description
'My first AI-built website'. Don't add any content yet.

Using the Gitea tool, push index.html and styles.css to the
'my-profile' repository.

File an issue in the my-profile repo titled "Profile image is missing".
Use the Gitea tool and the 'dev' user account.

Close issue #1 in the my-profile repo. Use the 'dev' user account.

That’s the leap the codelab is really showing off: the AI stops being a chatbot and becomes an active participant in your workflow — creating repos, pushing commits, filing and closing issues.

Extensions vs. raw MCP: editing settings.json by hand is the "raw" way — useful for understanding the plumbing. In real life you'd usually install an extension (gemini extensions install ...), which bundles the MCP server, custom slash-commands, and context into one installable package. The codelab uses this approach later for an image-generation extension called Nano Banana.

Step 3: Forge the actual agent with ADK

This is the heart of the lab. You move from “AI helps me write code” to “I build an AI thing that runs on its own.” The framework is Google’s Agent Development Kit (ADK).

First, lay down the rules

Before generating any agent code, you write a GEMINI.md file. The CLI loads it automatically and treats it as persistent, project-level instructions — coding standards, naming conventions, a persona, hard constraints:

cd ~/agentverse-developer/shadowblade
. ~/agentverse-developer/set_env.sh

cat << 'EOF' > GEMINI.md
### Coding Rules for This Project
- Use Python 3 with type hints on every function.
- Every function needs a docstring explaining what it does.
- Use snake_case for variables and functions, PascalCase for classes.
- Keep code clean and readable.
EOF

This is context engineering layer two. The hierarchy is worth remembering:

~/.gemini/settings.json — global user settings
GEMINI.md — project rules, always loaded
Agent Skills (.gemini/skills/) — specialized knowledge loaded only when relevant

That third layer (progressive disclosure) is what keeps the agent fast: a repo can hold dozens of niche skills, but the AI only pulls the one that matches the task at hand.

The agent itself

The codelab has you generate agent.py from a design doc, then — because LLM output is unpredictable — swap in the known-good version:

cp ~/agentverse-developer/working_code/agent.py ~/agentverse-developer/shadowblade/
cp ~/agentverse-developer/working_code/mcp_server.py ~/agentverse-developer/shadowblade/

At its core, the agent is an LlmAgent with a model, a clear instruction block, and a list of tools wired up through an MCPToolset. The "weapons" it can wield are just @mcp.tool()-decorated Python functions in mcp_server.py — each one a small, focused tool with a descriptive docstring the model reads to decide when to use it.

Run it

cd ~/agentverse-developer
python -m venv env
source env/bin/activate
pip install --upgrade pip
pip install -r shadowblade/requirements.txt
adk run shadowblade

Then give it a command. The agent reads the “monster’s weakness,” picks the matching tool, and reports the outcome:

We're stuck against 'Perfectionism'. Its weakness is 'Elegant
Sufficiency'. Break us out!

'Dogma' blocks our path. Its weakness is 'Revolutionary Rewrite'. Take it down.

Strip away the theme and this is exactly how a real support or ops agent works: read the situation, choose the correct internal tool, execute, summarize.

Step 4: Test it (because an untested agent is a liability)

Agents are harder to test than scripts because their behavior emerges from an LLM’s multi-step reasoning. You care about two things: did it produce the right final answer, and did it take the right path (use the right tools)?

ADK gives you two complementary methods:

adk eval runs the agent against a set of predefined cases in a JSON "evalset." Each case defines the input, the expected response, and — crucially — the expected tool_uses. A neat trick the lab teaches here is synthetic test data: you hand the AI one template case and ask it to generate dozens of varied ones, scaling your test coverage cheaply.

pytest wraps the same AgentEvaluator in code, which is what makes it CI-ready:

cp ~/agentverse-developer/working_code/test_agent_initiative.py ~/agentverse-developer/shadowblade/
source ~/agentverse-developer/env/bin/activate
cd ~/agentverse-developer
. ~/agentverse-developer/set_env.sh
pytest test_agent_initiative.py

A passing run (1 passed) means the agent follows its protocol and is ready to drop into an automated pipeline. The scoring criteria split cleanly into tool_trajectory_avg_score (did it do the right thing) and response_match_score (did it say the right thing).

The lab also introduces hooks — scripts that fire at specific points in the agent loop (BeforeTool, AfterTool, etc.) so you can validate, audit, or block actions during a run without touching the agent's code. Great for security and compliance gates.

Step 5: Deploy to Cloud Run, then clean up

Build a container image and ship it as a live, autoscaling service:

. ~/agentverse-developer/set_env.sh

# create the artifact repo (ignore error if it already exists)
gcloud artifacts repositories create $REPO_NAME \
  --repository-format=docker \
  --location=$REGION \
  --description="Agent repo" 2>/dev/null || echo "Already exists, moving on"

# grant the service account the roles it needs
for ROLE in artifactregistry.admin cloudbuild.builds.editor run.admin \
  iam.serviceAccountUser aiplatform.user logging.logWriter logging.viewer; do
    gcloud projects add-iam-policy-binding $PROJECT_ID \
      --member="serviceAccount:$SERVICE_ACCOUNT_NAME" \
      --role="roles/$ROLE" --quiet
done

# prep the Dockerfile, then build and deploy
sed -i 's|COPY ./shadowblade|COPY .|g' ~/agentverse-developer/shadowblade/Dockerfile
sed -i 's|COPY shadowblade|COPY .|g' ~/agentverse-developer/shadowblade/Dockerfile

cd ~/agentverse-developer
gcloud builds submit ./shadowblade \
  --tag ${REGION}-docker.pkg.dev/${PROJECT_ID}/${REPO_NAME}/my-agent:latest

gcloud run deploy my-agent \
  --image=${REGION}-docker.pkg.dev/${PROJECT_ID}/${REPO_NAME}/my-agent:latest \
  --region=${REGION} \
  --allow-unauthenticated \
  --set-env-vars="A2A_HOST=0.0.0.0,A2A_PORT=8080,GOOGLE_GENAI_USE_VERTEXAI=TRUE" \
  --min-instances=1 \
  --project=${PROJECT_ID}

Visit your service URL with /.well-known/agent-card.json appended to confirm the agent is live and describing itself:

https://my-agent-xxxxx-uc.a.run.app/.well-known/agent-card.json

Don’t skip cleanup — this is what stops a free-credit project from quietly billing you:

. ~/agentverse-developer/set_env.sh
gcloud run services delete my-agent --region=${REGION} --quiet
gcloud artifacts repositories delete ${REPO_NAME} --location=${REGION} --quiet
rm -rf ~/agentverse-developer ~/.gemini
rm -f ~/project_id.txt

The five ideas worth keeping

Setting the theme aside, here’s what this codelab actually teaches:

Intent-driven development — describe the outcome, let the AI generate the code, then verify it. The verification habit matters as much as the generation.
MCP servers turn a chatbot into a doer — the moment the CLI can touch Git, a database, or an API, it becomes part of your real workflow.
Context engineering is layered — global settings, persistent project rules (GEMINI.md), and on-demand skills. Good context beats clever one-off prompts.
Agents need real testing — judge both the final answer and the path taken. Synthetic data lets you scale that testing cheaply.
Ship it like software — containerize, set IAM roles, deploy to Cloud Run, and tear it down when you’re done.

The story makes it memorable; the workflow makes it useful. Both are worth keeping.

Want to go through it yourself? The full codelab is “Agentverse: The Shadowblade’s Codex” on Google Codelabs, and the starter code lives at github.com/weimeilin79/agentverse-developer. I’ve also collected every command from this tutorial in a quick-reference file on my own GitHub: github.com/estherirawati/agentverse.

How to Build and Deploy an AI Agent with Gemini CLI and Google ADK: A was originally published in Google Developer Experts on Medium, where people are continuing the conversation by highlighting and responding to this story.

Vertex AI Is Gone. Here Is What Google Built Instead.

Geeta Kakrani — Wed, 03 Jun 2026 14:11:15 GMT

At Cloud Next 2026 in Las Vegas, Google made one of the biggest moves in its cloud history. Vertex AI, the platform that millions of developers have been using since 2021, is gone. Not deprecated. Not just renamed. Replaced, structurally, by something built from the ground up for a different era.

It is called the Gemini Enterprise Agent Platform.

And if you are building anything with AI today, you need to understand what just changed.

The Old World vs The New World

Vertex AI was built for a world where you pick a model, train it, deploy it, and call it a day. One model, one job, one endpoint.

That world is over.

Businesses today are not trying to build one AI assistant. They are trying to run hundreds, sometimes thousands, of AI agents at the same time. Agents that search the web, write emails, call APIs, talk to each other, handle customer requests, and make decisions, all simultaneously, all day long.

Vertex AI was not designed for that. So Google replaced it.

What Is the Gemini Enterprise Agent Platform

Google CEO Thomas Kurian described the strategy in simple terms at the keynote. Competitors, he said, are “handing you the pieces, not the platform.” Google wants to own the entire stack, from its custom TPU chips all the way to the three billion inboxes inside Google Workspace.

The Gemini Enterprise Agent Platform is the result. It brings together model selection, development tools, deployment infrastructure, security, and governance into a single place. Everything you need to build, run, and manage agents, under one roof.

It also absorbed Agentspace, Google’s enterprise AI search and discovery product, into a unified Gemini Enterprise offering. No more juggling separate products.

The Four Things the Platform Does

The platform is organized around four core jobs.

The first is building. For developers who want to write code, there is the Agent Development Kit, or ADK. It just hit stable version 1.0 across four languages: Python, Go, Java, and TypeScript. This is significant. Enterprise teams do not always work in Python. A Java shop can now build production-ready agents without maintaining a separate Python service just to connect to Google’s infrastructure. ADK also now includes a graph-based framework for orchestrating multiple agents working together.

For teams that do not want to write code at all, there is Agent Studio. It is a low-code interface where you describe what you want in plain English and the platform helps you build it. Non-technical teams inside a company can now create agents without filing a ticket with engineering.

The second is scaling. There is a feature called Agent Runtime that Google says delivers sub-second cold starts, meaning new agent instances spin up almost instantly when demand spikes. There is also a new Memory Bank. This gives agents persistent, long-term memory across sessions. Previously, every time you started a new conversation with an agent, it had no idea what happened before. Memory Bank fixes that. An agent can now remember context from a week ago and act on it today.

The third is connecting. The Model Garden now has over 200 models, including Google’s own Gemini models as well as Anthropic Claude and many others. On top of that, partner agents from Box, Workday, Salesforce, ServiceNow, Dun and Bradstreet, and S&P Global are already integrated. You do not have to build everything from scratch. If you need an agent that handles HR self-service or financial data, there is likely already one ready to plug in.

The fourth is governing. This is the part that enterprise IT teams care most about. The platform has a single control plane where every agent deployed inside a company is visible, auditable, and controllable. Model Armor blocks prompt injection attacks. Zero-trust security handles decentralized setups. IAM manages access and keeps audit logs. Every employee can use and share agents, and IT can see all of it.

The Piece Most People Are Sleeping On: A2A Protocol

The flashiest announcements get the attention. But the most strategically important thing Google announced at Cloud Next 2026 might be the Agent2Agent protocol, or A2A, now at version 1.2.

Here is the problem it solves. You might build an agent on Google Cloud. Your partner company builds an agent on Microsoft Azure. Your vendor uses a Salesforce agent. Today, those three agents cannot easily talk to each other. They live in different systems, speak different formats, and have no way to securely pass tasks back and forth.

A2A is the answer. It is an open standard that lets agents built on completely different platforms communicate, delegate tasks, and share state. It does not matter which model or cloud they are built on.

The numbers back up how serious this is. Over 150 organizations are already running A2A in production, not in pilot programs. Real work, real tasks, real companies. Microsoft, AWS, Salesforce, SAP, and ServiceNow are all live. The protocol is now governed by the Linux Foundation’s Agentic AI Foundation, which means no single company controls it.

And for developers already using LangGraph or CrewAI, both frameworks already have native A2A support built in. You do not need to rewrite anything.

Project Mariner: An Agent That Browses the Web For You

[Update — June 2026: Google has officially discontinued the standalone Project Mariner experiment to integrate its web-browsing capabilities directly into Gemini Agent and AI Mode.]

One of the more visible pieces is Project Mariner, built by Google DeepMind and powered by Gemini 2.0.

Mariner is a web-browsing agent. You give it a goal, and it opens browsers, navigates websites, fills out forms, retrieves information, and completes purchases, all on its own. It scores 83.5% on the WebVoyager benchmark, which is the standard test for web agents, and can handle ten tasks running at the same time on cloud-based virtual machines.

Right now it is available to Google AI Ultra subscribers in the United States. The roadmap includes a visual builder called Mariner Studio in the second quarter of 2026, cross-device sync in the third quarter, and an agent marketplace in the fourth quarter.

What This Means If You Are Already on Vertex AI

The change is structural, not just cosmetic. All the Vertex AI features you know, Model Garden, Custom Training, AutoML, Model Registry, Endpoints, and Pipelines, are still there. They have just been reorganised under a “Models” sub-menu inside the Agent Platform.

The underlying API endpoint, aiplatform.googleapis.com, is not going anywhere. Google has committed to keeping it alive for compatibility. If you are reading documentation in 2027 and the API still says aiplatform, do not be surprised.

But new capabilities will not ship as Vertex AI updates. They will ship exclusively through the Gemini Enterprise Agent Platform. The roadmap has moved. If you want access to what Google builds next, that is where it will live.

One important date: if you are using deprecated SDK modules from the old Vertex AI Python SDK, the migration deadline is June 24, 2026. That is soon.

The Bigger Picture

Every major cloud provider is making the same move at the same time. AWS launched AgentCore. Anthropic shipped Claude for Small Business. Google launched this. The consolidation pattern is unmistakable. Every cloud is going agent-first. Every cloud is differentiating on governance, identity, and security. Every cloud is building partner marketplaces to seed adoption.

Google’s bet is that owning the full stack, from the hardware layer to the productivity layer, gives it an advantage that point solutions cannot match. If your company already runs on Google Workspace and Google Cloud, the integration story is genuinely compelling. The economics also make sense when your data already lives in BigQuery and your team is already on Google tools.

If you are pulling data from outside Google’s ecosystem and paying only for the agent layer, the math gets less favorable.

The Bottom Line

Vertex AI served its purpose. It was a solid platform for the era of single models and single deployments. But that era ended.

The Gemini Enterprise Agent Platform is built for the era of agent networks, agent communication, agent governance, and agent scale. Whether you are a developer, a cloud architect, or a business leader trying to figure out where AI is actually going, this is the direction.

Google has drawn a line. The agentic era is not coming. It is here. And Google just bet its entire cloud platform on it.

This article is based on announcements from Google Cloud Next 2026 in Las Vegas on April 22, 2026.

Vertex AI Is Gone. Here Is What Google Built Instead. was originally published in Google Developer Experts on Medium, where people are continuing the conversation by highlighting and responding to this story.

TPU Just Got Split in Two — and It Changes Everything About AI Infrastructure

Geeta Kakrani — Wed, 03 Jun 2026 14:09:37 GMT

By Geeta Kakrani | Google Developer Expert in AI

Imagine you run a restaurant.For years, you had one chef doing everything sourcing ingredients, prepping, cooking, plating, cleaning. One person. All jobs. It worked when you had 20 customers a day.

Now you have 2 million customers. Every single day. And they all want their food in under two seconds.

You don’t hire one super-chef. You split the kitchen.

That’s exactly what Google just did with its TPUs.

A Decade of One Chip Doing Everything

Since 2016, Google’s Tensor Processing Units have been the silent engine behind every Google product you use — Search, Translate, Photos, and Gemini. One chip family, designed to both train AI models and run them.

For years, that was fine.

But then AI agents arrived. Systems that don’t just answer one question — they reason, plan, remember, and take action across multiple steps. Millions of them, running simultaneously, in real time.

Suddenly, one chip doing everything wasn’t fine anymore.

At Google Cloud Next 2026, Google made an announcement ten years in the making: the 8th generation TPU is actually two completely different chips.

Meet TPU 8t and TPU 8i.

The Problem They’re Each Solving

Here’s a simple way to think about it.

Training an AI model is like writing a novel. You lock yourself in a room for months. You need enormous focus, massive resources, and you’re not in a hurry to show anyone the draft. When it’s done, it’s done.

Running an AI model is like performing that novel as a live play — every night, for a million audiences simultaneously. You need to be fast, fluid, and you absolutely cannot pause mid-sentence because you’re waiting for a prop to arrive.

Same story. Completely different skills required.

Google finally stopped asking one chip to do both.

TPU 8t — The Training Powerhouse

The “t” is for training. And the numbers here are staggering.

One TPU 8t superpod holds 9,600 chips working together as a single system — with 2 petabytes of shared memory. That’s roughly the storage equivalent of 400 million books, all accessible at once.

The compute? 121 ExaFLOPS. Nearly triple the previous generation.

If the previous chip could fill an Olympic swimming pool in an hour, TPU 8t fills three.

Google also solved a long-standing bottleneck: data transfer. Previously, chips had to route data through the CPU — like every order in a restaurant going through one overwhelmed manager. TPU 8t bypasses that entirely with TPUDirect Storage, letting chips talk directly to data. Transfer speeds effectively doubled.

The result: 2.7x better training performance per dollar over the last generation.

TPU 8i — Built for the Age of AI Agents

This is where it gets really interesting.

The “i” is for inference — but honestly, it should stand for intelligence at scale. Because TPU 8i wasn’t just designed to run AI models. It was designed specifically for the messy, complex, real-time world of AI agents.

Google made three radical changes:

1. Triple the on-chip memory

When an AI is mid-conversation with you, it holds a running record of everything said — called a KV Cache. On older chips, this record kept overflowing into slower memory, forcing the chip to pause and fetch data. Like a waiter who keeps forgetting orders and running back to the kitchen.

TPU 8i has 3x more on-chip SRAM (384 MB). The entire conversation stays on the chip. No pausing. No fetching. Just flow.

2. A brand new engine for thinking fast

AI agents that reason — the kind that think step by step before answering — constantly need all their cores to synchronize with each other. On old chips, this synchronization was a bottleneck.

Google replaced the old system with something called the Collectives Acceleration Engine (CAE). It handles all that synchronization with near-zero latency. The result: 5x faster on-chip communication. For an agent running a complex chain-of-thought, this is the difference between feeling instant and feeling sluggish.

3. A completely new way chips talk to each other

Imagine a city where every road goes through the town square. That was the old network design — a 3D grid where messages between chips could take up to 16 hops to arrive.

Google redesigned the entire road system with something called Boardfly. It’s a hierarchical network — small groups of chips fully connected to each other, then connected to bigger groups through optical switches. The longest any message has to travel? 7 hops. A 56% reduction.

For AI agents using modern architectures like Mixture-of-Experts — where different parts of the model need to collaborate constantly — this is transformational.

The combined result of all three changes: 80% better price-performance for inference over the previous generation.

By the Numbers

TPU 8tTPU 8iBuilt forTrainingInference & AgentsChips per system9,6001,152On-chip SRAM128 MB384 MBMemory (HBM)216 GB288 GBNetwork design3D TorusBoardfly (7 hops max)Key innovationTPUDirect StorageCAE (5x latency cut)Performance gain2.7x over Ironwood80% better price-performance

And Then Google Did Something It Has Never Done Before

For ten years, TPUs were Google’s private weapon. You could use them on Google Cloud — but you couldn’t own one.

That just changed.

Google announced it will begin selling TPUs directly to select customers — AI labs, financial institutions, and high-performance computing organizations — to run inside their own data centers.

The secret weapon is now a product.

Why This Moment Matters

The split of TPU into 8t and 8i isn’t just a hardware story. It’s Google saying out loud what engineers have known quietly for years:

Training AI and running AI are two fundamentally different problems. It’s time to stop pretending one chip can solve both.

As the world moves deeper into the agent era — where AI systems don’t just respond but reason, plan, and act — the infrastructure underneath has to evolve too. Purpose-built beats general-purpose. Every time.

Both TPU 8t and TPU 8i arrive on Google Cloud later in 2026.

The kitchen has been split. The restaurant is ready for scale.

Sources:

TPU Just Got Split in Two — and It Changes Everything About AI Infrastructure was originally published in Google Developer Experts on Medium, where people are continuing the conversation by highlighting and responding to this story.

Flutter at Google I/O 2026

Abhishek Doshi — Tue, 26 May 2026 02:25:58 GMT

Forget everything you knew about cross-platform constraints. The rules just changed!

The dust has finally settled on Google I/O 2026, and if I’m being completely honest, the Flutter and Dart announcements this year hit differently. We’ve moved well past the era of just trying to achieve multi-platform parity. This year’s roadmap is fundamentally about shifting how we architect applications, integrate AI natively, and squeeze every drop of performance out of the rendering engine.

Image copyright to respective owner

Here is a deep dive into the most significant architectural and workflow changes coming to the ecosystem.

1. The GenUI Revolution & Ephemeral Code Delivery

The most futuristic drop this year is the aggressive push toward “agentive UIs.” We aren’t just shipping static screens anymore. With the introduction of the Flutter GenUI SDK and the A2UI protocol, the framework is paving the way for AI models to dynamically generate and adapt rich user experiences on the fly based on user intent.

But the technical enabler behind this is what’s truly mind-bending: Dart is investigating support for interpreted bytecode within the Dart runtime. This unlocks the ability to deliver “ephemeral” code. Imagine loading highly specific, dynamic portions of your UI on-demand without forcing a full app store update. It’s a massive leap forward for building interfaces that need to adapt in real time.

2. Agent Skills and Local MCP Integration

Forget copying and pasting snippets from a browser window to fix a bug. The debut of Agent Skills for Dart and Flutter, paired with the open Model Context Protocol (MCP), brings the AI directly into the local environment.

Your AI assistant can now hook directly into your local Dart SDK analyzer with zero configuration. Because it understands your exact project context, custom types, and widget tree, it can execute deep architectural refactoring, validate type safety, and even run native test suites right on your machine with complete semantic accuracy.

3. The Universal Canvas and Unlocking Pure Design

This is the architectural shift that is arguably the most exciting for custom product development. Flutter is officially pulling the Material and Cupertino design libraries out of the core flutter/flutter repository. Moving forward, these are treated as unopinionated, independently versioned standalone packages on pub.dev.

By doing this, the core engine transforms into an incredibly fast, lightweight Universal Rendering Canvas. We no longer have to fight the default Material scaffolding. If you are building high-end, minimalist interfaces, those Awwwards-winning bento-grid layouts that rely heavily on perfect negative space and sleek, Apple-style aesthetics, you now have a truly blank, unopinionated engine to paint on.

4. The Pure Impeller Era & Lightning-Fast DevTools

On the performance front, the multi-year Impeller saga is finally reaching its conclusion. The legacy Skia backend is officially being stripped out for Android 10 and above. We are now entering the era of pure Impeller Vulkan rendering on modern Android, which means predictable, jank-free animations and faster startup-to-interaction times across the board.

Tooling also got a massive upgrade. The entire Flutter DevTools suite is now compiled to WasmGC by default. That annoying stutter when analyzing performance? It’s gone. They’ve shaved off over 200ms of lag during telemetry parsing, making the debugging and profiling experience feel completely fluid.

5. The Firebase Intelligence Layer

Finally, for backend management and infrastructure, the Firebase integration just got significantly smarter. Firebase expanded its Agent Skills to explicitly cover Flutter, iOS, and Android environments.

When you rely heavily on Firebase to architect and deploy your backends, having an AI that actually understands the nuances of your specific environment is a game-changer. It gives local coding agents the specialized context needed to handle complex Firebase integrations flawlessly, cutting down on token usage and practically eliminating infrastructure hallucinations.

The Takeaway

Flutter is no longer just a cross-platform UI toolkit; it’s evolving into a comprehensive, full-stack, AI-native development ecosystem. The framework is getting leaner at its core, the tooling is getting exponentially smarter, and the ceiling for what we can build just got pushed a whole lot higher.

Hope you enjoyed this article!

Doubts? Feel free to drop a message @AbhishekDoshi26
Checkout abhishekdoshi.dev for more info 💙

Don’t stop, until you are breathing!💙
- Abhishek Doshi

Flutter at Google I/O 2026 💙 was originally published in Google Developer Experts on Medium, where people are continuing the conversation by highlighting and responding to this story.

[Mar-Apr 2026] AI Community — Activity Highlights and Achievements

Nari Yoon — Mon, 18 May 2026 02:50:49 GMT

We love sharing the accomplishments of the Google AI communities over the month. We appreciate all the hard work and dedication of our community members. Without further ado, here are the key highlights by products!

Antigravity

image source

Confused About Where to Put Your Agent Skills? by Cloud GDE Darren Lester (UK) explains where agentic tools look for skills and highlights the confusion caused by differing locations. He proposes using symlinks to maintain a single source of truth at ~/.agents/skills as a future-proof solution.

How agentic AI resurrected my “Old” side project by Cloud GDE Jean-Philippe BACONNAIS (France) shares how Antigravity helped revive a dormant genealogy tree application by migrating legacy code to Quarkus and improving the UI. Facilitating tasks like documentation updates and UI harmonization allows the author to focus on new features and improvements.

image source

Taking Action on your GCP bill: Automating BigQuery Storage Cleanup by Cloud GDE Marcelo Costa (Brazil) explains how to automate BigQuery storage cleanup using a bash script and Antigravity’s agentic workflow to reduce costs.

Scaling your productivity with spec docs in your IDE — Anti Gravity. by Angular GDE Matthew Christiansen (US) discusses improving developer productivity by using specification documents and Anti Gravity within the IDE. It suggests treating prompts as configuration through modular .md files to reduce mental overhead and create a scalable development process.

Gemini CLI

How I Distilled 27k Lines of AI Chat History into a Local LLM Wiki by AI GDE Guan Wang (Singapore) describes an experiment using the Gemini CLI to distill extensive Gemini chat history into a structured, localized AI. The resulting wiki acts as a personalized assistant summarizing facts, analyzing workflows, and identifying problem-solving habits.

Documentation as Context: A Skill to Automate Your Blueprints for the Agentic Era

Documentation as Context: A Skill to Automate Your Blueprints for the Agentic Era by Cloud GDE Darren Lester (UK) introduces an agent skill called project-documentation skill designed to automate and standardize documentation practices for software projects, especially in the context of AI agents. It emphasizes the importance of up-to-date documentation for both human developers and AI agents, detailing how the skill helps maintain READMEs, architecture documents, UI design guides, and testing documentation.

image source

Using Gemini CLI with a Local LLM by Cloud GDE Masahiko Utsunomiya (Japan) details how to configure Gemini CLI to use a local LLM backend by combining LiteLLM Proxy and Ollama. It provides a practical guide to redirect API requests and address potential issues like missing model aliases.

Gemini

image source

[Hot 👏] Decoding Bronze Age Paperwork: Modern AI vs. Ancient Assyrian Clay Tablets by AI GDE Ertuğrul Demir (Türkiye) explores the use of a modern AI stack to translate ancient Assyrian clay tablets by building an end-to-end pipeline for OCR and translation. highlights the effectiveness of ByT5 for unique character sets and Gemini for scalable data extraction.

Gemini Embedding 2 — Complete Guide by AI GDE Pedro Lourenço (Brazil) is a Colab notebook providing a hands-on guide to Gemini Embedding 2 capabilities, covering word-vector arithmetic and cross-modal searches.

image source

[Hot 👏] Integrate NotebookLM with Gemini CLI, Google Antigravity or Other Agents with MCP by Cloud GDE Darren Lester (UK) explores integrating NotebookLM with AI tools using MCP to enable seamless interaction between models and external tools. It guides how to set up the MCP server and uses natural language commands to manage notebooks within other AI environments.

AI GDE Victor Ashioya (Kenya)

[Workshop] Building a Gemini-level Model from Scratch (repository) by AI GDE Victor Ashioya (Kenya) demystifies the architecture and training of SOTA language models through a hands-on workshop building a transformer-based model. The session shared the evolution of multimodal giants and the core components enabling breakthrough performance via code examples and demonstrations.

ADK

Saying Goodbye to Lengthy Prompts: A Practical Analysis of Google ADK Agent Skill Features by Cloud GDE Yu-wei Liu (Taiwan) introduces the Agent Skill feature in ADK v1.25.0, exploring its architecture, implementation, and benefits like reduced context burden. It discusses advantages such as modularity and team collaboration while addressing potential challenges like metadata quality and dynamic loading latency.

#Gemma 4 Running Google’s Gemma-4b locally with Google ADK and dual A40 GPUs by Cloud GDE Ashmi Banerjee (Germany) guides you through running Gemma-4b locally using ADK and dual A40 GPUs covering environment setup, vLLM serving, and LiteLLM integration.

https://medium.com/media/441b0c30781626a495f91b3383e15039/href

End-to-End AI Agent on GCP: ADK, BigQuery MCP, Agent Engine, and Cloud Run (repository | video) by Cloud GDE Mazlum Tosun (France) explains how to build and deploy an AI agent that queries BigQuery using natural language via ADK and Gemini 2.5 Flash without managing MCP servers.

Gemma

Gemma 4 Tutorial: Build a Local AI Coding Agent with Gradio and Ollama by AI GDE Aashi Dutt (India) explains how to build a local, multimodal AI coding assistant using Gemma 4, Ollama, and Gradio with agentic tool use.

Serve and Inference Gemma 4 on TPU by AI GDE Nitin Tiwari (India) explains how to deploy and run inference with Gemma 4 on TPUs using vLLM. It demonstrates setting up a TPU v6e instance and using a frontend application to achieve significantly lower latency compared to traditional GPUs.

Running Gemma 4 E2B on CPU: Is Local AI Finally Practical? (Kaggle notebook) by AI GDE Gabriel Preda (Romania) tests Gemma 4 E2B on a no-GPU setup to evaluate performance, limitations, and real-world usability. Gabriel also shared From OOM Errors to Working Model: Fine-Tuning Gemma 4 E2B Step-by-Step using Unsloth (Kaggle notebook) for a limited hardware environment.

TPU v6e vs A100 80GB×2 for Gemma 4 31B on vLLM: 21 Benchmarks Show When Each Wins by AI GDE Sho Tanaka (Japan) compares the performance of TPU v6e-4 and NVIDIA A100 across 21 input/output profiles for serving Gemma 4 31B. It demonstrates that TPU excels in short profiles and TPOT latency, while A100 performs better in medium and long profiles.

Deploy Gemma 4 on Cloud Run: Pay Only When You Actually Use It by Cloud GDE Daniel Gwerzman (UK) details how to deploy Gemma 4 on Cloud Run to leverage scale-to-zero capabilities and covers improvements like reasoning and function calling.

The Gemma 4 E2B Fine-Tuning Cookbook by AI GDE Rabimba Karanjai (US) provides a complete recipe for adapting Gemma 4 E2B to a specific domain by covering dataset construction, QLoRA configuration, and production deployment.

Taming the Giant: Fine-Tuning Gemma 4 E2B-IT into an Insurance Expert by AI GDE Guan Wang (Singapore) details the process of fine-tuning Gemma 4 E2B-IT using Tunix on TPU v5litepod-4 to create a high-precision insurance advisor. It covers overcoming memory limitations, data engineering with InsuranceQA-v2, and LoRA application to improve accuracy.

Fine-tuning Gemma

Beyond Classification Labels: Fine-Tuning Gemma 3 1B-IT with Financial Reasoning (part 1, part 2) by AI GDE Luca Massaron (Italy) discusses fine-tuning for financial sentiment analysis using reasoning-augmented data from teacher-student distillation.

Post-Training Gemma 3 for Earth Observation (EO) Understanding: A JAX Stack + TPU Pipeline for Multi-Label Sentinel Satellite Remote Sensing Scene Classification (repository) by AI GDE Henry Ruiz (US) introduces a domain-focused post-training and benchmarking pipeline for adapting Gemma 3 4B IT to Earth Observation tasks using a TPU-native JAX stack.

AI GDE Luca Massaron (Italy)

[Workshop] Fine-tuning your AI by AI GDE Luca Massaron (Italy) was a hands-on workshop at HQ in Milan on fine-tuning Gemma models through a complete QLoRA pipeline on Colab. It covers from baseline evaluation, synthetic data generation, fine-tuning to measuring improvement.

JAX & TPU

image source

[Featured on TPU Developer Hub ✨] dLLM into TPU: An End-to-End Diffusion LM Stack in Pure JAX (repository) by AI GDE Junbum Lee (Korea) shares a standalone JAX backend for dLLM with zero PyTorch or CUDA dependency.

image source

Run Any HuggingFace Model like Gemma3 on TPUs: A Beginner’s Guide to TorchAX (repository) by AI GDE Ahmed Elnaggar (Germany) guides on running HuggingFace models on TPUs using TorchAX to leverage JAX’s high performance without rewriting code. Ahmed also shared a follow-up tutorial, Fine-Tune Any HuggingFace Model like Gemma on TPUs with TorchAX.

Loading and Transform your Dataset using Grain for Model Building in JAX/FLAX by AI GDE Joan Santoso (Indonesia) introduces Grain and demonstrates building a sentiment analysis model in JAX/Flax. It covers creating custom data sources and transformations to build efficient data loaders for training.

Building a Nano MoE Language Model in JAX from Scratch by AI GDE Kartikey Rawat (India) provides a deep dive into the mechanics and importance of MoEs and demonstrates how to build a model using pure JAX/Flax.

Building Neural Networks with Flax NNX by AI GDE Wesley Kambale (Uganda) introduces NNX and covers building a CNN for image classification using production-grade architectures.

Autoscaling LLM Inference on GKE with TPU v5e and vLLM by AI GDE Anubhav Singh (India) shares a practical guide on deploying and autoscaling LLM inference using vLLM on GKE with TPU v5e with details of quota management, capacity planning, and etc.

A guide to speeding up and vectorizing for NumPy users by AI GDE Sho Tanaka (Japan) provides an introductory guide for NumPy users to speed up and vectorize code using JAX. Key features: PRNG differences, jax.jit compilation, jax.vmap vectorization, and automatic differentiation.

Benchmarking TPUs for Search Problems (repository) by AI GDE Vikram Tiwari (US) benchmarks TPUs for search problems using JAX and Antigravity. He creates scripts to identify TPU resources and uses YAML-based experiment setups to perform searches on public datasets.

Pallas

[Featured on TPU Developer Hub ✨] Pallas for people who know JAX but not kernels yet by AI GDE Aritra Roy Gosthipaty (India) introduces and explores Pallas to write custom, high-performance kernels within the JAX ecosystem. It demonstrates how Pallas abstracts hardware complexities, simplifying the process of writing optimized, hardware-level code.

Fused INT8 Weight-Only Quantization in Pallas by AI GDE Rishiraj Acharya (India) shares how he wrote a custom JAX/Pallas kernel for INT8 weight-only quantization to accelerate LLM text generation by streaming compressed weights directly into local SRAM. This approach doubles memory efficiency by decompressing weights on the fly within hardware registers to avoid main memory bottlenecks while maintaining the codebase in Python. Rishiraj also shared two practical guides:

Block-Sparse Attention Kernel via JAX/Pallas: how to build a custom Block-Sparse Attention kernel in JAX/Pallas to fix the massive memory and compute bottlenecks that happen when LLMs process long text.
Ring Attention & Sequence Sharding with JAX: hands-on deep dive into overcoming the KV cache memory bottlenecks of million-token context windows using Ring Attention on TPUs.

Profiling TPU Kernels — XProf, HLO, and the Roofline Model by AI GDE Keshan Sodimana (Sri Lanka) provides a systematic protocol for diagnosing TPU kernel performance bottlenecks using XProf, HLO, and the roofline model. It covers capturing traces and decoding compiler output to optimize custom kernels in Python. Keshan also shared The Ratchet Loop: Optimizing a TPU Kernel abo how to optimize a TPU kernel to the hardware ceiling using the Ratchet Loop for reproducibility and regression prevention.

Fine-tuning Gemma using JAX and TPU

image source

[Featured on TPU Developer Hub ✨] Fine-Tuning Gemma 2B on PubMedQA: Building a Medical Q&A Assistant with LoRA, Keras Kinetic, and Cloud TPU (repository) by AI GDE Kuan Hoong Poo (Malaysia) details fine-tuning Gemma 2B using Keras Kinetic to automate cloud infrastructure and TPU provisioning.

Building a Cardiology Assistant: Synthetic Data and JAX-Based Fine-Tuning by AI GDE Luca Massaron (Italy) demonstrates building a specialized cardiology assistant using a compact model and an efficient JAX/Tunix pipeline. It covers synthetic data generation, fine-tuning, and evaluation to showcase domain adaptation without requiring massive models and datasets.

image source

Write Once, Scale Everywhere by AI GDE Rabimba Karanjai (US) presents an end-to-end pipeline for fine-tuning Gemma 2B using LoRA and serving it via a custom REST API. It utilizes KerasNLP and JAX as a backend to enable flexible execution on both NVIDIA GPUs and Cloud TPUs while demonstrating performance gains through XLA compilation.

Fine-tuning Gemma 3 on Burmese Agriculture QA Dataset (TPU + Tunix + LoRA) (repository) by AI GDE Aye Hninn Khine (Thailand) focuses on fine-tuning Gemma 3 on a Burmese agriculture dataset using TPUs and LoRA.

Implementations & Tools in JAX

jaxgpt: Building LLMs in JAX and TPUs by AI GDE Aakash Nain (India): a showcase of how to build and train scalable LLMs in pure JAX using a multi-host environment
QwenImage Inference on TPU with PyTorch/XLA by AI GDE Sayak Paul (India): PyTorch/XLA based SPMD implementation of the QwenImage image-gen pipeline to run on TPU v6e
google-smi & tpustat by AI GDE Minho Ryu (Korea): a TPU-oriented status CLI in the style of nvidia-smi and the TPU equivalent of the gpustat workflow
MegaText by AI GDE Minho Ryu (Korea): a streamlined pretraining framework for LLMs on TPUs (built on JAX, inspired by MaxText)
flaxchat by AI GDE Taha Bouhsine (US): a minimal, end-to-end LLM training harness for TPU pods (built on JAX/Flax NNX)
gemma3-vllm-tpu-gke-autoscaling by AI GDE Anubhav Singh (India): a deployment guide covering quota management, capacity planning, model compatibility, and HPA-based autoscaling for vLLM on GKE with TPU.
Code Auditor by AI GDE Usha Rengaraju (India): LLM-powered code security auditing tool using NVIDIA NIM + free open-source safety stack

TPU vs. GPU

https://medium.com/media/a268b96ca3fb987997e367545008738a/href

GPU vs TPU: Which one to use for Artificial Intelligence? by AI GDE Carlos Alarcon (Colombia) explains criteria for choosing between GPUs and TPUs for ML projects through real-world tests on Colab. It highlights specific use cases such as inference, self-attention, and massive matrix multiplication to help users optimize hardware selection for efficiency and cost.

ML acceleration guide: TPUs vs GPUs by AI GDE Glen Yu (Canada) compares TPUs and GPUs for ML acceleration by detailing architectures, precision types, and XLA’s importance. He also shares a code example to demonstrate performance differences between the hardware types.

Keras

[kinetic doc] Fine-tuning Gemma 4 on TPU with Kinetic by AI GDE Adonai Vera (Colombia) details how to fine-tune Gemma 4 Instruct 26B on a TPU using Kinetic and LoRA for memory efficiency. It outlines the process for environment setup, weight storage in GCS, and performing inference with the fine-tuned model. He also contributed to the Keras ecosystem by adding — reservation flag to kinetic pool.

[keras.io] Scaling Context-Aware Two-Tower Music Retrieval via JAX Data Parallelism and Keras 3 on TPUs by AI GDE Rishiraj Acharya (India) introduces a production-ready KerasRS implementation for context-aware music retrieval using a dynamic Two-Tower model and the Yambda dataset.

In my Kinetic era — Fine-tuning Gemma 3 to speak Gen Z on a Cloud TPU with one decorator by AI GDE Jigyasa Grover (US) demonstrates supervised fine-tuning for Gen Z style transfer using Gemma 3 1B and TPU v5 Lite. It simplifies deployment to GKE via Kinetic while utilizing Keras Hub/Kaggle for model management.

On-device ML

#Gemma 4 Bringing Multimodal Gemma 4 E2B to the Edge: A Deep Dive into LiteRT-LM and Qualcomm QNN by AI GDE Kartikey Rawat (India) explores the deployment of Gemma 4 E2B on Android devices using LiteRT-LM and Qualcomm QNN for NPU acceleration. It details architectural innovations and engineering changes required for production-ready on-device inference.

Running Gemma 4:E2B on Android: A Minimal Kotlin App

#Gemma 4 Running Gemma 4:E2B on Android: A Minimal Kotlin App (repository) by AI GDE Gabriel Preda (Romania) details a minimal Android chatbot developed in Kotlin using Gemma 4:E2B for fully offline local inference. It outlines key implementation steps such as UI creation, LiteRT-LM integration, and Markdown support.

Cloud

image source

Building a Healthcare Recommender with Keras, Two Towers and Google Cloud by AI GDE Rubens Zimbres (Brazil) shares how he built the Prescription Recommender app, which takes plain language symptoms as input, identifies likely diseases, and suggests medications, diets, and workouts.

Building an AI Agent Mesh with Gemini 3, OpenClaw, and ACPX by Cloud GDE Timothy Olaleke (Portugal) explains how to build an AI agent mesh using Gemini 3.1 Pro, OpenClaw, and ACPX with covering multi-agent orchestration, gateway architecture, and real-world deployment patterns.

ML Research

#Gemma Anthropogenic Regional Adaptation in Multimodal Vision-Language Model by AI GDE Aye Hninn Khine (Thailand) introduces the method to improve the cultural relevance of VLMs in specific regions while maintaining global performance.

#Gemma Investigating Refusal Mechanisms in Gemma 3 Models for Enhanced AI Safety by AI GDE Ruqiya Bin Safi (Saudi Arabia) studies refusal mechanisms in Gemma 3 by isolating the single directional subspace responsible for refusal behavior using vLLM and TPU infrastructure. It explores mechanistic interpretability to modulate model behavior and improve AI safety through controlled experiments and adversarial analysis.

[Mar-Apr 2026] AI Community — Activity Highlights and Achievements was originally published in Google Developer Experts on Medium, where people are continuing the conversation by highlighting and responding to this story.

Serve and Inference Gemma 4 on TPU

Nitin Tiwari — Mon, 04 May 2026 00:19:51 GMT

Introduction

Earlier in April 2026, Google released Gemma 4, the latest family of open multimodal models, and momentum has been building since then. Gemma 4 comes in four sizes: Effective 2B (E2B), Effective 4B (E4B), 26B Mixture of Experts (MoE), and 31B Dense. Native multimodality in the Gemma family first appeared last year with Gemma 3.

What makes Gemma 4 stand out is that it goes beyond standard text-to-text chat, with the ability to handle complex reasoning and agentic workflows. In practice, the real challenge lies in serving these models efficiently. This leads to a natural question: how do LLMs like Gemini deliver sub-second responses? A large part of the answer lies in TPUs.

Tensor Processing Units (TPUs)

Google uses Tensor Processing Units (TPUs) as hardware accelerators for both training and serving models. What puts Google ahead in the AI race is its early investment in designing custom chips, purpose-built for large-scale machine learning workloads.

In practice, these accelerators can deliver significantly higher performance than general-purpose GPUs for specific model architectures and serving scenarios.

In this blog, I go beyond the usual VLM inference on GPUs and show how to create a TPU instance on Google Cloud to serve Gemma 4 using vLLM.

What is vLLM?
vLLM is an open source high performance inference engine for large language models that maximizes hardware utilization and throughput using techniques like PagedAttention and continuous batching.

Now that we know what TPUs and vLLM are, let’s get started.

Prerequisites

Billing account linked to a GCP project
Reserved TPU quota
Access to Gemma family of models on Hugging Face

Since TPUs are expensive and limited in availability, you may need to request quota in advance or use queued resources.

Step 1: Create a TPU instance

Open the Google Cloud Console and activate Cloud Shell. TPUs can be either reserved or allocated using queued resources, meaning they are assigned when capacity becomes available.

In Cloud Shell, run the following commands to set up the required variables:

export PROJECT=YOUR_GCP_PROJECT_NAME
export HF_TOKEN=YOUR_HF_TOKEN
export ZONE=southamerica-east1-c
export TPU_NAME=gemma4-tpu-vllm

Cloud TPUs are available in different versions. You can explore the available generations, including the 8th generation TPUs such as TPU 8t (training) and TPU 8i (inference), announced at Google Cloud Next 2026.

In this tutorial, I will deploy Gemma 4 on TPU 6e (Trillium).

gcloud alpha compute tpus queued-resources create gemma4-tpu-vllm \
  --zone=southamerica-east1-c \
  --accelerator-type=v6e-8 \
  --runtime-version=v2-alpha-tpuv6e \
  --node-id=gemma4-tpu-vllm \
  --provisioning-model=flex-start \
  --max-run-duration=4h \
  --valid-until-duration=4h \
  --labels=purpose=flex-start

The above command creates a queued TPU resource using flex start provisioning, which allows you to specify the duration for which it remains active.

To check the status of your request, run the below

gcloud alpha compute tpus queued-resources describe \
  gemma4-tpu-vllm \
  --zone=southamerica-east1-c

It may take some time to spin up the TPU instance depending upon the availability.

Once provisioned, you can see the status is changed to ACTIVE. Alternatively, you can also check it on Cloud Console.

TPU v6e-8 instance on Cloud Console

Step 2: Configure Firewall

Run the below command to configure firewall rules so that the vLLM Docker image allows incoming traffic.

gcloud compute firewall-rules create allow-vllm-8000 \
  --allow tcp:8000 \
  --target-tags=vllm

Step 3: Connect to TPU instance using SSH

In the Cloud Shell, run the below command to SSH into the TPU instance.

gcloud compute tpus tpu-vm ssh gemma4-tpu-vllm \
  --zone=southamerica-east1-c

Step 4: Download Gemma 4 Docker image

Now, once SSHed into the TPU instance, run the below command to download the Gemma 4 Docker image.

sudo docker run -it --rm --name gemma4-vllm \
  --privileged \
  --network host \
  --shm-size 16g \
  -v /dev/shm:/dev/shm \
  -e HF_TOKEN=$HF_TOKEN \
  vllm/vllm-tpu:gemma4 \
  python -m vllm.entrypoints.openai.api_server \
    --model google/gemma-4-26B-A4B-it \
    --tensor-parallel-size 8 \
    --max-model-len 8192 \
    --limit-mm-per-prompt '{"image": 1, "audio": 0}' \
    --disable_chunked_mm_input \
    --enable-auto-tool-choice \
    --tool-call-parser gemma4 \
    --host 0.0.0.0 \
    --port 8000
    --allowed-local-media-path /home/nitin_tiwari

In this example, we will deploy the gemma-4-26B-A4B-it model. It is a 26B parameter instruction-tuned model with 4B active parameters.

This will take a few minutes to set up the Docker image, as it loads the model weights and initializes the vLLM inference engine.

Once done, you should see the below message in the terminal.

Step 5: Start Inference

We are now ready to start inference on the deployed model. I have built a simple frontend that accepts text and image inputs, forwards them to the TPU instance hosting Gemma 4, and returns the generated response.

Clone the repository to your local machine:

git clone https://github.com/NSTiwari/Gemma-4-on-TPU.git
cd Gemma-4-on-TPU

Once completed, open the index.html file and update line 583 by replacing YOUR_EXTERNAL_IP with the external IP address of your TPU instance:

const res = await fetch("http://YOUR_EXTERNAL_IP:8000/v1/chat/completions"

External IP of TPU instance

You can find the external IP address in the list of TPUs in the Google Cloud Console.

Finally, start the frontend server using the following command:

# Start frontend server.
py -m http.server

Open your web browser, and type localhost:8000 in the address bar to start the frontend application.

Gemma 4 26B-4B-it on TPU v6e using vLLM

As you can see, Gemma 4 can perform a wide range of tasks, with response times of around 2–4 seconds when served on TPUs using vLLM.

Note: The first request may take longer due to cold start overhead.

So, that concludes this blog. I wanted this blog to cover the end-to-end aspects of how to create a TPU instance on Google Cloud, serve Gemma 4 using vLLM, build a frontend application, and send requests to the TPU instance for inference.

I believe it pretty much covered everything you need to get started and serve your own custom models on TPUs, where inference that would otherwise take several seconds to minutes on a typical GPU setup can be significantly faster.

I hope you learned how powerful TPUs can be when combined with vLLM to reduce inference latency. Stay tuned for more such tutorials.

Acknowledgment

This project was developed as part of Google’s AI Developer Programs TPU Sprint. I sincerely thank the Google AIDP Team for their generous support in providing GCP credits to help facilitate this project.

References & Resources

Serve and Inference Gemma 4 on TPU was originally published in Google Developer Experts on Medium, where people are continuing the conversation by highlighting and responding to this story.

⚖️ LexiMini: How I Built an AI Legal Assistant for India — From Scratch, on a TPU

Geeta Kakrani — Mon, 04 May 2026 00:19:38 GMT

Fine-tuning Gemma 3 4B on Indian Law Data · MaxText · Tunix Distillation · HuggingFace

This blog documents the approach behind a TPU-based LLM pipeline I’m currently building.

The focus here is not just on results, but on the full system design — model behavior, distillation strategy, and the practical challenges of working with TPUs in a real setup.

Instead of presenting a polished end-state, I’m breaking down the actual process: how the system is structured, what decisions were made, and how different components evolved over time.

You’ll find everything here — from setup to experimentation — in a way that reflects real-world building, not just ideal scenarios.

Why LexiMini?

India has over 1.4 billion people.

Think about a woman in a village in rural India. Her landlord is threatening to throw her out. She does not know what a tenant agreement means legally. She does not know that she has rights. There is no lawyer in her village. The nearest district court is two hours away. And even if she gets there — she cannot afford to pay someone to explain what is written in that contract.

Or think about a woman who took a small loan from a local moneylender. She does not know what interest rate is legally allowed. She does not know what she can do when the terms change overnight. She does not know who to go to.

These are not rare cases. This is everyday life for millions of women across rural India — where both education and internet access are still limited, where legal literacy is almost zero, and where the gap between knowing your rights and losing everything is just one piece of paper you could not understand.

Normal people face these situations every single day — tenant contracts they cannot read, loans with terms they did not understand, workplace harassment they do not know is illegal, property disputes they have no idea how to fight. These are ordinary problems. But without legal knowledge, ordinary problems become life-altering ones.

When they do go online to search for answers, they find one of two things:

Dense legal text that was written by lawyers, for lawyers.

Or generic AI answers from models that were never trained on Indian law and confidently give wrong information — which in a legal situation, can cause real harm.

I wanted to build something different. A small, focused AI that actually knows Indian law — tenant rights, loan agreements, women’s legal protections, IPC sections, bail procedures, court processes. Not a general model pretending to know. A specialized one, built for the people who need it most.

That is LexiMini.

The Full Pipeline

Before I get into the steps, here is the complete picture of what I built:

GitHub Repo          TPU VM               GCS Bucket
(Indian Law Data) →  (MaxText Training) → (Checkpoints)
                                               ↓
                                     HuggingFace Model
                                     (Gemma 3 4B Fine-tuned)
                                               ↓
                                     Tunix Distillation
                                     (4B Teacher → 1B Student)
                                               ↓
                                     LexiMini Final
                                     (Lightweight, Deployable)

There are four phases:

Phase 1 — Set up the TPU VM and install everything
Phase 2 — Prepare and upload the Indian law dataset
Phase 3 — Fine-tune Gemma 3 4B using MaxText
Phase 4 — Distill the 4B model down to 1B using Tunix

Let me walk through each one.

Tools I Used

Tool What it does Google TPU VM (v6e) Hardware TPU chips for fast training MaxText Google’s open-source JAX-based LLM training framework Tunix Google DeepMind’s post-training + distillation framework, built on JAX HuggingFace Hub Model storage and deployment Google Cloud Storage (GCS) Stores data, weights, and checkpoints during training

One important note: Everything in this guide runs directly on the TPU VM terminal. After you SSH in, you never need to go back to your local machine.

Phase 1: Setting Up the TPU VM

The Setup Flow

Create TPU VM  →  SSH In  →  Install Packages  →  Clone MaxText  →  Install JAX

Step 1 — Create the TPU VM

Let me be honest about something before you run this command.

A TPU v6 costs around $12–$15 per hour on Google Cloud. Fine-tuning a 4B model for 5000 steps takes several hours. Distillation adds even more. The total compute cost of this project, paid out of pocket, would be completely out of reach for most independent researchers or students.

I was able to do this because I received Google Cloud credits through the TPUSprint program as a Google Developer Expert (GDE) in AI/ML.

Google Cloud credits are provided for this project. #TPUSprint

I started by setting up everything from scratch on Google Cloud.

First, I created a new project inside Google Cloud Platform. Once the project was ready, I landed on the main dashboard.

From there, I navigated to Compute Engine → TPUs. This is where all TPU resources are managed.

Since there were no TPUs created yet, the page was empty. I clicked on Create TPU to start the setup.

On the TPU creation screen:

I gave the TPU a name (node-1)
Selected the zone (us-central1-a) — this is important because TPU availability depends on the zone
Chose the TPU type (v5litepod-1 in this case, based on availability)
Selected the TPU software version (v2-alpha-tpuv5-lite)

I kept the rest of the settings as default and clicked Create.

Once the TPU node was created, it appeared in the TPU list and was ready to use.

Now click over ssh to open Terminal:

For reference, the same setup can also be done using CLI:

gcloud compute tpus tpu-vm create leximini-tpu \
  --zone=us-central2-b \
  --accelerator-type=v4-8 \
  --version=tpu-ubuntu2204-base

One key learning here: TPU configuration is highly dependent on availability. You might need to try different zones or TPU types before finding one that works.

All commands shown below were executed directly inside the TPU VM terminal after connecting via SSH.

Step 3 — Install System Packages

sudo apt-get update && sudo apt-get install -y git python3-pip wget curl

Step 4 — Clone MaxText and Install JAX

MaxText is Google’s training framework for large language models. It is built on JAX and designed to run natively on TPU — which is exactly what we need here.

cd ~
git clone https://github.com/google/maxtext.git
cd maxtext
pip install -r requirements.txt
pip install jax[tpu] -f https://storage.googleapis.com/jax-releases/libtpu_releases.html

JAX TPU installation takes around 5 minutes. Let it finish completely before moving on.

Phase 2: Preparing the Indian Law Dataset

The Data Flow

Raw Legal Text      →    Python Script     →    GCS Bucket
(.txt / .csv files)      (convert to JSONL)     (ready for MaxText)

This phase took me longer than I expected — not because of the code, but because of the data itself.

I curated Indian legal text across multiple categories: IPC sections, CrPC procedures, constitutional articles, and court judgment excerpts. The curation matters. MaxText will train on exactly what you give it. Garbage in, garbage out — this is especially true for legal text where precision is everything.

Step 5 — Clone Your Dataset Repo on the TPU VM

cd ~
git clone https://github.com/YOUR_USERNAME/YOUR_REPO.git
cd YOUR_REPO
ls

https://github.com/geeta-gwalior/Leximini_V1

Step 6 — Convert Data to JSONL Format

MaxText needs training data in JSONL format — one JSON object per line, each with a text key. Here is the conversion script:

The system was developed using real legal datasets. However, due to confidentiality considerations, I have included only synthetic data in the public repository.

cat > prepare_data.py << 'EOF'
import json

cat > prepare_data.py << 'EOF'
import json

input_file  = 'indian_law.txt'   # change as needed
output_file = 'train_data.jsonl'

count = 0

with open(input_file, 'r', encoding='utf-8') as f_in, \
     open(output_file, 'w', encoding='utf-8') as f_out:

for line in f_in:
        line = line.strip()

# skip empty or very short lines
        if not line or len(line) < 20:
            continue

record = {"text": line}
        f_out.write(json.dumps(record, ensure_ascii=False) + "\n")
        count += 1

print(f"Total records written: {count}")
EOF

python3 prepare_data.py

This is a simplified version of the preprocessing step. In practice, additional structuring and formatting were applied to improve training quality.

Step 7 — Create a GCS Bucket and Upload Data

MaxText does not read from local disk during training. It reads from Google Cloud Storage. So everything goes to GCS first:

# Create the bucket (pick a globally unique name)
gsutil mb -l us-central2 gs://YOUR-BUCKET-NAME

# Upload training data
gsutil cp ~/YOUR_REPO/data/train_data.jsonl gs://YOUR-BUCKET-NAME/data/train_data.jsonl

# Verify
gsutil ls gs://YOUR-BUCKET-NAME/data/

Phase 3: Fine-Tuning Gemma 3 4B with MaxText

The Training Flow

HF Weights         GCS Upload         MaxText Train       Checkpoint
(Gemma 3 4B)   →  (store weights) →  (5000 steps)    →  (saved to GCS)

This is the heart of the project. We take Gemma 3 4B — Google’s open-source model — and teach it everything about Indian law.

I ran this training step twice. The first time I misconfigured the checkpoint path and lost hours of compute. Small mistakes are expensive on TPU. Double-check your GCS paths before you hit enter.

Step 8 — Download Gemma 3 4B Weights from HuggingFace

You need a HuggingFace account and an access token. Get it from: huggingface.co → Settings → Access Tokens → New token.

pip install huggingface_hub

python3 << 'EOF'
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id='google/gemma3-4b-it',
    local_dir='/home/USER/gemma3-weights',   # replace USER with your username (run: whoami)
    token='hf_YOUR_TOKEN_HERE'
)
print('Download complete!')
EOF

Step 9 — Upload Weights to GCS

# The -m flag enables parallel upload — do not skip it for large files
gsutil -m cp -r ~/gemma3-weights gs://YOUR-BUCKET-NAME/weights/gemma3-4b/

# Verify
gsutil ls gs://YOUR-BUCKET-NAME/weights/gemma3-4b/

Step 10 — Run Fine-Tuning

Everything comes together here. One command, and MaxText takes over:

cd ~/maxtext

cd ~/maxtext

python3 MaxText/train.py MaxText/configs/base.yml \
  base_output_directory=gs://YOUR-BUCKET-NAME/output/ \
  load_parameters_path=gs://YOUR-BUCKET-NAME/weights/gemma3-4b/ \
  dataset_path=gs://YOUR-BUCKET-NAME/data/ \
  model_name=gemma3-4b \
  steps=5000 \
  run_name=leximini-finetune

Key training parameters such as learning rate, batch size, and warmup strategy were tuned conservatively to preserve pretrained knowledge during fine-tuning. Small changes in these values had a noticeable impact on stability and output quality.

Checkpoints are saved automatically every 500 steps to gs://YOUR-BUCKET-NAME/output/leximini-finetune/checkpoints/

Watch the loss value as it prints. When it starts going down — that moment is genuinely satisfying. The model is learning Indian law in real time.

Step 11 — Convert Checkpoint to HuggingFace Format

After training, the checkpoint is in MaxText’s native format. We need to convert it to HuggingFace format so it can be used with the transformers library:

cd ~/maxtext

python3 MaxText/convert_gpt_maxtext_to_hf.py \
  --base_model_path gs://YOUR-BUCKET-NAME/weights/gemma3-4b/ \
  --maxtext_model_path gs://YOUR-BUCKET-NAME/output/leximini-finetune/checkpoints/5000/ \
  --output_path ~/leximini-4b-hf \
  --model_size 4b

ls ~/leximini-4b-hf/

Step 12 — Push the Fine-Tuned 4B Model to HuggingFace

python3 << 'EOF'
from huggingface_hub import HfApi

api = HfApi()

api.create_repo(
    repo_id='YOUR_HF_USERNAME/leximini-4b',
    token='hf_YOUR_TOKEN_HERE',
    private=False
)

api.upload_folder(
    folder_path='/home/USER/leximini-4b-hf',
    repo_id='YOUR_HF_USERNAME/leximini-4b',
    repo_type='model',
    token='hf_YOUR_TOKEN_HERE'
)
print('4B model uploaded!')
EOF

Phase 4: Distilling 4B → 1B with Tunix

The Distillation Flow

LexiMini 4B         Tunix Framework        Gemma 3 1B
(Teacher Model)  →  (Distillation)      →  (Student Model)
                                               ↓
                                        LexiMini 1B
                                     (4x smaller, nearly
                                      as knowledgeable)

Fine-tuning is done. But a 4B parameter model is heavy. It is slow to serve and expensive to run at scale. I wanted something leaner — something that could actually reach more people, including on limited hardware.

That is where Tunix comes in.

What is Knowledge Distillation?

Think of it this way. Imagine a senior advocate who has practiced Indian law for 20 years. Instead of making a junior lawyer read every case file from scratch, the senior sits with them and explains the reasoning — the why behind each judgment, not just the what.

The junior lawyer learns faster, and ends up nearly as capable — in a fraction of the time.

That is distillation. The 4B model (teacher) guides the 1B model (student) to reproduce the same legal understanding — but in a much smaller package.

The key insight is this: the student does not just learn from raw text labels. It learns from the teacher’s output probability distribution — which is far richer. When the teacher says “Section 302 relates to murder”, it doesn’t just output that as a yes/no. It outputs a distribution across thousands of tokens — and that distribution carries nuanced information about related concepts, alternate phrasings, confidence levels. The student absorbs all of that.

What is Tunix?

Tunix is an open-source post-training framework built by Google DeepMind on top of JAX. It handles things like knowledge distillation, RLHF, and model alignment — the work that happens after initial pretraining. It runs natively on TPU and integrates cleanly with JAX checkpoints.

Step 13 — Install Tunix

cd ~
git clone https://github.com/google-deepmind/tunix.git
cd tunix
pip install -e .

Step 14 — Download Gemma 3 1B Base Weights

The student model starts from Gemma 3 1B base weights:

python3 << 'EOF'
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id='google/gemma-3-1b',
    local_dir='/home/USER/gemma3-1b-weights',
    token='hf_YOUR_TOKEN_HERE'
)
print('1B base weights ready!')
EOF

# Upload to GCS
gsutil -m cp -r ~/gemma3-1b-weights gs://YOUR-BUCKET-NAME/weights/gemma3-1b/

Step 15 — Create the Distillation Config

cat > ~/tunix/configs/leximini_distill.py << 'EOF'
teacher_model_path  = 'gs://YOUR-BUCKET-NAME/output/leximini-finetune/checkpoints/5000/'
student_model_path  = 'gs://YOUR-BUCKET-NAME/weights/gemma3-1b/'
output_path         = 'gs://YOUR-BUCKET-NAME/distilled/leximini-1b/'

cat > ~/tunix/configs/leximini_distill.py << 'EOF'

teacher_model_path  = 'gs://YOUR-BUCKET-NAME/output/leximini-finetune/checkpoints/5000/'
student_model_path  = 'gs://YOUR-BUCKET-NAME/weights/gemma3-1b/'
output_path         = 'gs://YOUR-BUCKET-NAME/distilled/leximini-1b/'

teacher_model_name  = 'gemma3-4b'
student_model_name  = 'gemma3-1b'

train_data_path     = 'gs://YOUR-BUCKET-NAME/data/train_data.jsonl'

# training parameters (simplified for clarity)
steps               = 3000
max_target_length   = 1024
temperature         = 3.0 
alpha               = 0.7

dtype               = 'bfloat16'

EOF

Two parameters that matter most:

temperature = 2.0 — This softens the teacher’s output distribution. At temperature 1.0, the teacher gives sharp, confident predictions. At 2.0, probability spreads across more tokens — giving the student richer, more nuanced signals to learn from. This is one of the core insights of the original knowledge distillation paper by Hinton et al.

alpha = 0.7–70% of the loss comes from matching the teacher. 30% comes from the raw training data. This balance keeps the student grounded in real legal text while absorbing the teacher’s reasoning.

In practice, finding the right balance between these factors required experimentation, as different settings can significantly impact how well the student model retains accuracy.

Step 16 — Run Distillation

cd ~/tunix

python3 tunix/distillation/distill.py \
  --config configs/leximini_distill.py \
  --teacher_model_path gs://YOUR-BUCKET-NAME/output/leximini-finetune/checkpoints/5000/ \
  --student_model_path gs://YOUR-BUCKET-NAME/weights/gemma3-1b/ \
  --output_path gs://YOUR-BUCKET-NAME/distilled/leximini-1b/ \
  --train_data gs://YOUR-BUCKET-NAME/data/train_data.jsonl \
  --steps 3000 \
  --learning_rate 5e-5 \
  --per_device_batch_size 4 \
  --temperature 2.0 \
  --alpha 0.7 \
  --dtype bfloat16 \
  --run_name leximini-distilled

Step 17 — Convert Distilled Checkpoint and Push to HuggingFace

cd ~/maxtext

python3 MaxText/convert_gpt_maxtext_to_hf.py \
  --base_model_path gs://YOUR-BUCKET-NAME/weights/gemma3-1b/ \
  --maxtext_model_path gs://YOUR-BUCKET-NAME/distilled/leximini-1b/checkpoints/3000/ \
  --output_path ~/leximini-1b-hf \
  --model_size 1b

ls ~/leximini-1b-hf/

python3 << 'EOF'
from huggingface_hub import HfApi

api = HfApi()

api.create_repo(
    repo_id='YOUR_HF_USERNAME/leximini-1b-final',
    token='hf_YOUR_TOKEN_HERE',
    private=False
)

api.upload_folder(
    folder_path='/home/USER/leximini-1b-hf',
    repo_id='YOUR_HF_USERNAME/leximini-1b-final',
    repo_type='model',
    token='hf_YOUR_TOKEN_HERE'
)
print('LexiMini is live!')
EOF

What I Am Still Working On

I want to be transparent about where this project currently stands.

The fine-tuned model exists and is running. But it still hallucinates — especially on less common legal sections. It sometimes generates plausible-sounding IPC sections that do not exist. This is a known problem with language models trained on limited domain data, and I am actively working on it.

The areas I am focusing on right now:

More data — The current dataset covers core IPC and constitutional law well. Consumer protection, family law, property law, and state-specific laws need more coverage.
Better data formatting — Moving from raw paragraph dumps to structured question-answer pairs, which gives the model clearer learning signal.
Distillation tuning — The 4B → 1B pipeline is implemented but still being refined. The 1B student needs more iterations to fully absorb the teacher’s legal reasoning.

The GitHub and final model links will be added here when the project reaches a stable, reliable state. I would rather share something that works properly than something that looks impressive but misleads people about their legal rights.

What I Learned While Building This

This wasn’t my first time working with fine-tuning. I’ve previously used approaches like LoRA, QLoRA, and PEFT — they are reliable and produce strong results. But they are also time-intensive, especially when working with larger datasets and multiple iterations.

What stood out to me in this project was the shift to JAX-based training using MaxText. The speed difference was significant. Workloads that would typically take hours could be executed in minutes. TPU performance is obviously a big factor here, but the tooling itself also plays a major role.

At the same time, this came with its own challenges.

MaxText doesn’t have the kind of beginner-friendly ecosystem or tutorials that many PyTorch-based workflows have. There isn’t a single place where everything is explained clearly. I had to rely on documentation, experimentation, and AI-assisted exploration to understand how things actually work under the hood.

Distillation was another area where things were not straightforward. While the pipeline works, maintaining accuracy is still a challenge. I’m actively experimenting to find the right balance between compression and performance, especially for a domain like legal text where precision matters.

Another important shift in thinking for me: I don’t just want to train models — I want to move toward reasoning-focused training. That is still a work in progress, and I’m continuing to explore how to integrate it effectively into this pipeline.

Overall, this project wasn’t about learning basics. It was about navigating gaps — missing documentation, unclear workflows, and making practical decisions in a system that is still evolving.

The Bigger Picture

This project is not about building for developers — it’s about building for real users.

Someone dealing with a tenant dispute or a confusing loan agreement is not going to use a notebook or call an API. They need something simple, accessible, and reliable — something that works on limited internet, speaks plain language, and understands Indian law beyond surface-level fluency.

That’s why distillation matters. A 4B model is powerful, but a well-optimized 1B model can run on affordable hardware, load faster, and reach far more people. The goal was never scale for its own sake — it was usability.

LexiMini is still evolving. Hallucination is a real challenge, and I’m actively working on improving reliability. The dataset is also expanding, especially in areas that impact everyday lives — tenant law, microfinance, women’s rights, domestic violence protections, inheritance, and local governance.

It’s not finished — but it’s real, it’s running, and it’s being built with a clear purpose: making Indian law accessible to those who need it most.

This article will be updated as the project progresses. Model and GitHub links will be added upon stable release.

Built with MaxText · Tunix · JAX · HuggingFace

Acknowledgment: Google Cloud credits are provided for this project. #TPUSprint

⚖️ LexiMini: How I Built an AI Legal Assistant for India — From Scratch, on a TPU was originally published in Google Developer Experts on Medium, where people are continuing the conversation by highlighting and responding to this story.

ML acceleration guide: TPUs vs GPUs

Glen Yu — Tue, 28 Apr 2026 00:58:19 GMT

There’s a lot of hype around GPUs and NVIDIA, but how much do you know about TPUs?

Rack of TPUs (left) and GPUs (right)

Article includes code examples you can find near the end

Rise of GPUs

Graphics Processing Units have been around for quite some time and their job is to render 2D and 3D graphics in to millions of pixels, calculating their colour, texture, lighting, in parallel to send to your monitor. For a 60Hz monitor that means producing rendered frames 60 times every second.

Rendering graphics is one thing, but developing code for handling GPUs was a little more difficult. That is, until NVIDIA launched CUDA (Compute Unified Device Architecture) in 2006, which allowed scientific researchers and developers who work in fields that require massive parallel math to take advantage of a GPU’s capabilities. With the rise of machine learning in the early 2010’s, it was discovered that the massive parallel math was exactly what ML engineers needed to train deep neural networks. Since then, the focus of CUDA has been shifting more towards optimizing for machine learning and AI workloads.

Because GPUs were commercially available and relatively inexpensive at the time, the barrier to entry was low. An ML engineer could train models on their NVIDIA graphics card during the day and jump into a game of League of Legends at night on the same hardware.

Honourable mention

AMD’s GPUs with Radeon Open Compute (ROCm) in an open-source software stack designed to compete in the AI ecosystem. Though it’s not as popular as CUDA, this gap is closing with Meta recently signing a deal to expand its existing partnership with AMD.

Tensor Processing Unit

In the early 2010s, Google projected that the growing demands of its AI workloads, particularly the rapid adoption of deep learning across products like Search and Photos, would require doubling its data center computing capacity roughly every year and a half. Rather than scale generic hardware indefinitely, Google sought a more efficient solution purpose-built for neural network computation, and thus the Tensor Processing Unit (TPU) was born. The TPU is a custom application-specific integrated circuit (ASIC) designed by Google specifically to accelerate AI workloads, deployed internally starting in 2015. By specializing the hardware for the dense matrix operations at the heart of neural networks, TPUs achieve dramatically better performance per watt than general-purpose CPUs or GPUs, reducing both energy consumption and cooling demands at data center scale.

Google has a tradition of making tools it uses internally available to the broader world, and TPUs are another example of this. The existence of TPUs was first publicly announced at Google I/O in 2016. In 2018, Cloud TPU v2 became available for external users through Google Cloud, marking the first time developers outside Google could harness the same accelerators powering Google’s own AI systems. TPUs also come in two performance flavours: efficiency and performance to meet different market needs.

NOTE: As of the 8th generation of TPUs announced during Google Next 2026, efficiency and performance TPUs will be renamed inference and training respectively in favour of a more descriptive, workload-based naming convention.

Architecture layout

From an architectural standpoint, GPUs can be thought of as being individual computers with accelerators (picture your home gaming PC). If you want to connect them into a cluster, it would be over network, but no matter how fast the network is, it still has to cross node boundaries, and bandwidth drops as a result.

TPUs are designed from the ground up to be interconnected at a massive scale with a physical layout that involves thousands of TPU chips in a torus topology which gives every chip 6 neighbours (two per axis, one on each side). Recognize that interconnect bandwidth would be the main bottleneck at this scale, Google designed their own proprietary Inter-Chip Interconnect (ICI) network which provides uniform, high-bandwidth, low-latency connections between all the chips in a slice regardless of physical location. With torus topology, there is no concept of crossing a node boundary. When you request TPUs, you do not get the entire TPU cluster or pod. Rather, you get only a small subset or slice. To make this possible, Google developed Optical Circuit Switch (OCS) to be able to rewire physical connections on the fly (entirely in software), allowing the same hardware to serve different workload shapes without any physical reconfiguration.

NOTE: Efficiency TPU versions use a 2D torus topology, while Performance TPUs leverage a 3D torus architecture to give you maximum performance with minimum latency.

Precision and range

A floating-point number consists of three parts:

Sign: Positive or negative (represented by the first bit)
Exponent: Determines the range of the number
Mantissa: Significant digits of a floating-point number, which determines the accuracy

Traditionally, the standard for high-performance computing was FP32. When AI researchers moved to FP16 to save memory, they lost more than just accuracy: they also lost range. FP32 uses 8 bits for the exponent, while FP16 uses only 5. The 3-bit difference in the exponent bits amount to an almost 10³⁴ difference in range (FP32 has a range of 3.4 x 10³⁸, while FP16 only has a range of 6.5 x 10⁴). In deep learning, where gradients can be incredibly tiny, FP16 often suffers from underflow (meaning numbers are being rounded to 0 because it is too small for FP16’s range to represent), requiring a technical workaround called “loss scaling” to keep the math stable.

Google Brain (now part of Google DeepMind) solved this invented Brain Floating Point (bfloat16), which simply shifted 3 bits from the mantissa to exponents:

Table comparing FP32, FP16 and bfloat16

By sacrificing precision for range, bfloat offers the same massive range as FP32, but with the reduced memory and bandwidth of FP16. A huge reason for why this works is that deep learning models are surprisingly noise-tolerant and having more training stability is for more important than having a few extra decimal places of precision. Today, bfloat16 is the de facto standard for training modern LLMs on NVIDIA’s GPUs and Google’s TPUs.

Why XLA matters

Standard Python execution typically takes an eager approach. This means it executes each step as it is being encountered. This is great for debugging because you can insert print statements to inspect variables at any point.

XLA (Accelerated Linear Algebra), on the other hand, is a domain-specific JIT compiler. Instead of executing steps one by one, it analyzes the entire execution graph to optimize and fuse operations before they run. This lazy approach creates an initial warm-up delay, but once the training starts, it is significantly faster than standard methods. The tradeoff is transparency: your step-by-step Python code becomes an optimized “black box”, making traditional debugging strategies more difficult. This is why TPUs are powerhouses for massive enterprise training, while GPUs remain the flexible choice for quick experimentation.

NOTE: Though XLA was built for TPUs, it’s also made its way into the NVIIA GPU ecosystem via tools such as JAX and torch.compile since PyTorch 2.0.

TorchTPU

Google is engineering a TorchTPU stack that will provide native PyTorch support. This would allow you to run models in TPUs as they are with full support for native PyTorch features. TorchTPU is currently in preview, and once it becomes GA, you can be sure I’ll be diving deeper into it!

Code example

I’m including a couple of Jupyter notebooks that I ran via Antigravity + Colab plugin for you to try yourself:

As you will see from the results below, TPU is indeed faster. However, my example isn’t large enough or complex enough to really showcase the true speeds that TPU can bring.

NOTE: I have a Colab Pro account which affords me access to additional GPUs and TPUs. The Colab free tier only includes NVIDIA T4 and TPU v5e-1

Interpreting training results

These are some benchmark trainings (epochs: 50, batch size: 512) in which I used a NVIDIA T4 GPU with (default) FP32 vs Google TPU v5e-1 (single chip TPU) with bfloat16. As expected, TPUs were faster but with lower precision:

T4 GPU (FP32) vs TPU v5e-1 (bfloat16), epochs: 50, batch size: 512

I then trained the same model using the T4 GPU using bfloat16 but noticed a massive performance drop. This was due to the T4 being an older generation GPU that did not support bfloat16 natively and had to emulate which added a lot of overhead. Switching to a newer L4 GPU, I was able to see the (tiny) performance gain along with the reduced precision:

T4 GPU (bfloat16) vs L4 (bfloat16), epochs: 50, batch size: 512

Finally, I thought I’d see how the training would perform on a newer TPU v6e-1 and I was blown away by the improvement:

TPU v6e-1 (bfloat16), epochs: 50, batch size: 512

Conclusion

Comparing GPUs and TPUs isn’t exactly apples-to-apples. They represent fundamentally different philosophies in architecture, memory management, and execution.

In the modern enterprise, it isn’t usually a matter of choosing one over the other, but rather using each where it shines. For rapid iteration and smaller workloads, the flexibility of GPUs is unmatched. However, once a project hits a certain scale, the domain-specific architecture of the TPU becomes the clear winner in efficiency and throughput.

TPUs are as fast as they are because they are a specialized one-trick pony, but to truly harness that power requires a deeper understanding of the stack. The biggest challenge isn’t often the compute itself, but rather: “How do I feed data to the TPUs fast enough and efficiently enough so that it doesn’t become the bottleneck?” and ensuring your input pipeline is fast enough so that the hardware doesn’t sit idle.

In future posts, I’ll dive deeper into these advanced concepts to show how you can optimizing data pipelines to get the most out of your TPUs.

BONUS: Google’s 8th-generation TPUs announced at Google Next

https://medium.com/media/e41887997802e9f7c3e2c573a2b0f3f5/href

ML acceleration guide: TPUs vs GPUs was originally published in Google Developer Experts on Medium, where people are continuing the conversation by highlighting and responding to this story.