Inspiration

Large Language Models are quickly outgrowing the boundaries of conventional infrastructure - some reaching hundreds of gigabytes and pushing far beyond what a single GPU container can handle. We love Google Cloud Run for its simplicity - autoscaling, revisions, IAM, and seamless serverless deployment - but it wasn’t designed for multi-GPU, multi-hundred-GB, or HPC-style workloads.

So we asked ourselves:

Can we run a model that’s “too big for one container”… without leaving the Cloud Run experience?

That question became Commissure.

In neuroscience, a commissure connects the hemispheres of the brain - enabling distributed intelligence.
Our project borrows that metaphor: Commissure connects multiple Cloud Run GPU services into one distributed LLM runtime, each stage acting as a “hemisphere” of a single AI mind.


What It Does

Commissure transforms Cloud Run into an HPC-grade LLM runtime that scales models far beyond a single GPU.

It splits a large model - e.g. Gemma-3-27B-Instruct - into three cooperating GPU microservices:

  • Stage A: Handles HTTP requests, tokenization, and runs the front layers.
  • Stage B: Executes the middle transformer layers.
  • Stage C: Produces the final activations, normalization, and logits.

These services communicate through gRPC, streaming bf16 boundary tensors at high speed.
From the outside, users see a single OpenAI-compatible /v1/chat/completions endpoint, but under the hood, Cloud Run GPUs collaborate as one distributed brain.
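As a rough illustration, a client could talk to Stage A with the stock OpenAI Python SDK; the Cloud Run URL, API key, and model name below are placeholders rather than real deployment values.

```python
# Minimal sketch: calling the Stage A endpoint with the standard OpenAI client.
# The base_url, api_key, and model name are hypothetical placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="https://commissure-stage-a-xxxx-uc.a.run.app/v1",  # placeholder URL
    api_key="not-checked-by-stage-a",                            # placeholder key
)

stream = client.chat.completions.create(
    model="gemma-3-27b-it",
    messages=[{"role": "user", "content": "Explain pipeline parallelism in one sentence."}],
    stream=True,
)

for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```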

Architecture

Each stage loads only a subset of the model’s layers:

$$ \text{Stage A: layers } 0..K_1{-}1,\quad \text{Stage B: } K_1..K_2{-}1,\quad \text{Stage C: } K_2..L{-}1 $$

Intermediate activations are transmitted over gRPC as boundary tensors $x \in \mathbb{R}^{B \times S \times d}$ in bf16 precision - small enough to stream efficiently, precise enough to preserve accuracy.

This architecture allows a 27-billion-parameter model to run within Cloud Run GPU limits, and scales naturally to multiple stages:

$$ L = \sum_i L_i $$

where each $L_i$ corresponds to a Cloud Run service holding its own segment of the model. Together, they form a seamless distributed inference graph, running entirely on managed serverless infrastructure.
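A minimal sketch of the layer-range bookkeeping behind this split; the layer count and boundary indices below are illustrative, not the values we actually deploy with.

```python
# Illustrative layer-range bookkeeping for a three-stage split.
# L, K1, and K2 are hypothetical values, not the deployed configuration.
L = 62                    # total decoder layers (illustrative)
K1, K2 = 20, 42           # stage boundaries (illustrative)

STAGE_RANGES = {
    "A": range(0, K1),    # embeddings + layers 0 .. K1-1
    "B": range(K1, K2),   # layers K1 .. K2-1
    "C": range(K2, L),    # layers K2 .. L-1, plus final norm + LM head
}

def stage_of(layer_idx: int) -> str:
    """Return which stage owns a given absolute decoder-layer index."""
    for name, rng in STAGE_RANGES.items():
        if layer_idx in rng:
            return name
    raise ValueError(f"layer {layer_idx} out of range 0..{L - 1}")
```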


System Diagram (Tensor Streaming)

┌─────────────────────────────────────────────────────────────────────────────┐
│                           User Request (HTTPS)                              │
└────────────────────────────────┬────────────────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│  STAGE A (Cloud Run Service – L4 GPU)                                       │
│  ┌───────────────────────────────────────────────────────────────────────┐  │
│  │ • FastAPI HTTP / SSE endpoint (public-facing, OpenAI-compatible)      │  │
│  │ • Tokenizer (chat templates, stop IDs, text → token IDs)              │  │
│  │ • Embeddings + decoder layers 0..K₁−1 (front of the model)            │  │
│  │ • Maintains its own KV cache for layers 0..K₁−1                       │  │
│  │ • Computes boundary activations: x₀ ∈ ℝ^{B×S×d_model}                 │  │
│  │ • Outputs x₀ as bf16 boundary tensor over gRPC                        │  │
│  └───────────────────────────────────────────────────────────────────────┘  │
└────────────────────────────────┬────────────────────────────────────────────┘
                                 │  gRPC stream (bf16-serialized x₀)
                                 ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│  STAGE B (Cloud Run Service – L4 GPU)                                       │
│  ┌───────────────────────────────────────────────────────────────────────┐  │
│  │ • gRPC bidirectional streaming server (Boundary.Decode)               │  │
│  │ • Middle decoder layers K₁..K₂−1                                      │  │
│  │ • Dynamic KV cache management for its layer range                     │  │
│  │ • Receives boundary tensor x₀                                         │  │
│  │ • Computes x₁ = f_B(x₀) through layers K₁..K₂−1                       │  │
│  │ • Shape preserved: x₀, x₁ ∈ ℝ^{B×S×d_model}                           │  │
│  │ • Sends x₁ as bf16 boundary tensor over gRPC                          │  │
│  └───────────────────────────────────────────────────────────────────────┘  │
└────────────────────────────────┬────────────────────────────────────────────┘
                                 │  gRPC stream (bf16-serialized x₁)
                                 ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│  STAGE C (Cloud Run Service – L4 GPU)                                       │
│  ┌───────────────────────────────────────────────────────────────────────┐  │
│  │ • gRPC bidirectional streaming server (Boundary.Decode)               │  │
│  │ • Final decoder layers K₂..L−1                                        │  │
│  │ • Final LayerNorm + LM Head                                           │  │
│  │ • Dynamic KV cache for its own layers                                 │  │
│  │ • Receives boundary tensor x₁                                         │  │
│  │ • Computes logits via x₂ = f_C(x₁)                                    │  │
│  │ • Token sampling (temperature, top-p)                                 │  │
│  │ • Returns next_token_id back over the same gRPC stream                │  │
│  └───────────────────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────────────────┘
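To make the "bf16-serialized" arrows concrete, here is one way the boundary tensors could be packed for the wire; the helper names are ours, and this is a sketch rather than the project's exact serialization code.

```python
# Sketch of bf16 boundary-tensor (de)serialization for the gRPC hops.
# Helper names are hypothetical; shapes follow x ∈ R^{B×S×d_model}.
import torch

def serialize_bf16(x: torch.Tensor) -> tuple[bytes, list[int]]:
    """Pack an activation tensor as raw bf16 bytes plus its shape."""
    x = x.to(torch.bfloat16).contiguous().cpu()
    # Reinterpret the 16-bit payload as int16 so NumPy can export raw bytes.
    return x.view(torch.int16).numpy().tobytes(), list(x.shape)

def deserialize_bf16(data: bytes, shape: list[int], device: str = "cuda") -> torch.Tensor:
    """Rebuild the bf16 activation tensor on the receiving stage."""
    flat = torch.frombuffer(bytearray(data), dtype=torch.int16)
    return flat.view(torch.bfloat16).reshape(shape).to(device)
```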

Each token step can be expressed as the composition of stage functions:

$$ x_{t+1} = f_C\big(f_B(f_A(\text{input}))\big), $$

where activations are streamed between Cloud Run GPU services as boundary tensors

$$ x \in \mathbb{R}^{B \times S \times d}, $$

preserving the hidden state across stages in bf16 precision.
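For completeness, here is a rough shape of the Boundary.Decode streaming servicer that Stages B and C expose. boundary_pb2 / boundary_pb2_grpc stand in for stubs generated from an assumed boundary.proto, the message fields (data, shape) are illustrative, and serialize_bf16 / deserialize_bf16 are the hypothetical helpers sketched above.

```python
# Sketch of a Boundary.Decode bidirectional-streaming servicer (Stage B or C).
# boundary_pb2 / boundary_pb2_grpc are assumed protoc-generated stubs; the
# BoundaryTensor fields (data, shape) and run_stage() are illustrative, and
# serialize_bf16 / deserialize_bf16 come from the earlier sketch.
from concurrent import futures

import grpc
import torch

import boundary_pb2
import boundary_pb2_grpc


def run_stage(x: torch.Tensor) -> torch.Tensor:
    """Placeholder for this stage's slice of decoder layers (identity here)."""
    return x


class BoundaryServicer(boundary_pb2_grpc.BoundaryServicer):
    def Decode(self, request_iterator, context):
        # One request per token step: unpack x_in, run this stage's layers,
        # and stream the resulting boundary tensor back to the caller.
        for req in request_iterator:
            x_in = deserialize_bf16(req.data, list(req.shape))
            x_out = run_stage(x_in)
            data, shape = serialize_bf16(x_out)
            yield boundary_pb2.BoundaryTensor(data=data, shape=shape)


def serve(port: int = 8080) -> None:
    server = grpc.server(futures.ThreadPoolExecutor(max_workers=4))
    boundary_pb2_grpc.add_BoundaryServicer_to_server(BoundaryServicer(), server)
    server.add_insecure_port(f"[::]:{port}")
    server.start()
    server.wait_for_termination()
```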


How We Built It

We built Commissure in Python 3.11, using:

  • PyTorch 2.8 + CUDA 12.8 - model slicing and tensor streaming
  • FastAPI - for the HTTP/SSE public interface (Stage A); see the endpoint sketch after this list
  • gRPC + Protocol Buffers - for bf16 boundary streaming between services
  • Google Cloud Run (L4 GPUs) - for all compute stages
  • Cloud Build, Artifact Registry, Secret Manager, and GCS - for fully automated deployment
  • A custom CLI - commissure up - that builds the Docker image, uploads weights, deploys all three services, and automatically wires them together (C → B → A)
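To illustrate the Stage A surface mentioned above, here is a bare-bones OpenAI-style SSE endpoint in FastAPI; generate_tokens() is a stand-in for the real A → B → C pipeline, and the chunk payload is trimmed to the essentials.

```python
# Bare-bones sketch of Stage A's OpenAI-compatible SSE endpoint.
# generate_tokens() is a placeholder for the real A -> B -> C pipeline,
# and the chunk payload is reduced to the fields a streaming client needs.
import json

from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()


def generate_tokens(messages: list[dict]):
    # Placeholder: yield decoded text pieces streamed back from Stages B and C.
    yield "Hello"
    yield " from Commissure"


@app.post("/v1/chat/completions")
async def chat_completions(body: dict):
    def event_stream():
        for piece in generate_tokens(body.get("messages", [])):
            chunk = {"choices": [{"index": 0, "delta": {"content": piece}}]}
            yield f"data: {json.dumps(chunk)}\n\n"
        yield "data: [DONE]\n\n"

    return StreamingResponse(event_stream(), media_type="text/event-stream")
```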

We also implemented:

  • Lazy model loading to reduce cold-start latency (sketched after this list)
  • Asynchronous warm-up and caching
  • DynamicCache handling for cross-stage attention reuse
  • bf16 wire serialization for compact activation transfer
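The lazy-loading idea is simple: a stage's container starts serving immediately and only pulls its layer shard onto the GPU on first use. A minimal, thread-safe sketch, with the actual loader replaced by a placeholder:

```python
# Sketch of lazy, thread-safe model loading to cut cold-start time.
# _load_stage_weights() is a placeholder for pulling this stage's layer shard
# (e.g. from GCS) and moving it to the GPU in bf16.
import threading
import torch

_model = None
_lock = threading.Lock()

def _load_stage_weights() -> torch.nn.Module:
    # Placeholder loader; the real implementation is stage-specific.
    return torch.nn.Identity()

def get_model() -> torch.nn.Module:
    """Load the stage's layers on first use and reuse them afterwards."""
    global _model
    if _model is None:
        with _lock:
            if _model is None:          # double-checked locking
                _model = _load_stage_weights()
    return _model
```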

Challenges We Ran Into

  • Memory ceilings: fitting partial model shards while preserving bf16 precision.
  • Inter-stage streaming: optimizing latency while maintaining numerical stability.
  • Checkpoint slicing: dynamically mapping Hugging Face weight prefixes across layer splits (see the sketch after this list).
  • Cold starts: solved via lazy loading, async warm-ups, and cache persistence.
  • Serverless orchestration: ensuring ephemeral GPU services behave like one continuous model.
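As a concrete example of the checkpoint-slicing problem, the sketch below routes Hugging Face weight keys to stages by decoder-layer index; the key pattern, split points, and non-layer heuristics are illustrative assumptions, not the project's actual mapping code.

```python
# Illustrative routing of Hugging Face checkpoint keys to stages by layer index.
# The key pattern, split points, and non-layer heuristics are assumptions.
import re

K1, K2 = 20, 42           # stage boundaries (illustrative)
LAYER_KEY = re.compile(r"^model\.layers\.(\d+)\.")

def stage_for_key(key: str) -> str:
    """Decide which stage should load a given checkpoint tensor."""
    m = LAYER_KEY.match(key)
    if m is None:
        # Non-layer weights: embeddings live on Stage A; final norm and
        # LM head live on Stage C (a simplifying assumption for this sketch).
        return "A" if "embed_tokens" in key else "C"
    idx = int(m.group(1))
    if idx < K1:
        return "A"
    if idx < K2:
        return "B"
    return "C"

# e.g. stage_for_key("model.layers.31.self_attn.q_proj.weight") -> "B"
```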

Accomplishments That We’re Proud Of

  • Successfully ran Gemma-3-27B across three cooperating Cloud Run GPU services, end-to-end.
  • Achieved real-time, streaming chat completions fully compatible with OpenAI’s API format.
  • Created a reproducible one-command deployment pipeline (commissure up) using Google Cloud tools.
  • Demonstrated that Cloud Run can run distributed, multi-service LLM inference - a workload widely assumed to be out of reach for serverless.
  • Most importantly - made large-model inference accessible to teams that love Cloud Run’s simplicity.

What We Learned

  • Cloud Run is more capable than expected - it can host distributed compute graphs, not just APIs.
  • gRPC + bf16 is the key to bridging precision and performance in cross-container inference.
  • Layer partitioning + per-stage DynamicCache lets the servable model size grow roughly linearly with the number of services.
  • Serverless can be fast, scalable, and HPC-grade when you rethink the architecture, not the platform.

What’s Next for Commissure

  • Broader model support: Extend the same split-runtime approach to LLaMA, DeepSeek, MiniMax, GPT-OSS, and future open-weight models.
  • Per-stage quantization: Enable seamless execution of hundreds-of-gigabyte checkpoints through mixed 4-bit / 8-bit compression - with no architecture changes required.
  • Adaptive scaling: Automatically tune layer splits (K₁, K₂, Lᵢ) based on GPU size, region, or workload.
  • Cross-region mesh inference: Link Cloud Run GPUs across continents for global, low-latency collaboration.
  • Distributed training: Reuse the same staged, gRPC-connected topology to experiment with training and fine-tuning across multiple Cloud Run GPUs.

Our next goal is simple - to keep pushing serverless beyond its limits, until Cloud Run thinks like a supercomputer.


Summary

Commissure bridges the gap between serverless and high-performance AI.
It proves that with the right architecture, even massive language models can run fully managed, secure, and elastic — no clusters, no manual scaling, just Cloud Run.
