Inspiration
Large Language Models are quickly outgrowing the boundaries of conventional infrastructure - some reaching hundreds of gigabytes and pushing far beyond what a single GPU container can handle. We love Google Cloud Run for its simplicity - autoscaling, revisions, IAM, and seamless serverless deployment - but it wasn’t designed for multi-GPU, multi-hundred-GB, or HPC-style workloads.
So we asked ourselves:
Can we run a model that’s “too big for one container”… without leaving the Cloud Run experience?
That question became Commissure.
In neuroscience, a commissure connects the hemispheres of the brain - enabling distributed intelligence.
Our project borrows that metaphor: Commissure connects multiple Cloud Run GPU services into one distributed LLM runtime, each stage acting as a “hemisphere” of a single AI mind.
What It Does
Commissure transforms Cloud Run into an HPC-grade LLM runtime that scales models far beyond a single GPU.
It splits a large model - e.g. Gemma-3-27B-Instruct - into three cooperating GPU microservices:
- Stage A: Handles HTTP requests, tokenization, and runs the front layers.
- Stage B: Executes the middle transformer layers.
- Stage C: Produces the final activations, normalization, and logits.
These services communicate through gRPC, streaming bf16 boundary tensors at high speed.
From the outside, users see a single OpenAI-compatible /v1/chat/completions endpoint, but under the hood, Cloud Run GPUs collaborate as one distributed brain.
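To give a concrete feel for the interface, here is a minimal client call against the Stage A endpoint. The service URL below is a placeholder for whatever your deployment prints, and the model name is illustrative - everything else is the standard OpenAI chat-completions payload.

```python
import requests

# Placeholder URL - substitute the public Stage A URL from your own deployment.
BASE_URL = "https://commissure-stage-a-<hash>.a.run.app"

resp = requests.post(
    f"{BASE_URL}/v1/chat/completions",
    json={
        "model": "gemma-3-27b-it",  # illustrative model name
        "messages": [{"role": "user", "content": "What does a commissure do?"}],
        "max_tokens": 128,
        "temperature": 0.7,
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])
```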
Architecture
Each stage loads only a subset of the model’s layers:
$$ \text{Stage A: layers } 0..K_1{-}1,\quad \text{Stage B: } K_1..K_2{-}1,\quad \text{Stage C: } K_2..L{-}1 $$
Intermediate activations are transmitted as boundary tensors x ∈ ℝ^{B×S×d} over gRPC in bf16 precision - small enough to stream efficiently, precise enough to preserve accuracy.
This architecture allows a 27-billion-parameter model to run within Cloud Run GPU limits, and scales naturally to multiple stages:
$$ L = \sum_i L_i $$
where each Lᵢ corresponds to a Cloud Run service holding its own segment of the model. Together, they form a seamless distributed inference graph, running entirely on managed serverless infrastructure.
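As a rough sketch, the split itself can be computed in a few lines of Python. The split_layers helper and its even-split policy below are illustrative assumptions, not the exact logic in our deployer - a real assignment could weight each stage by available GPU memory.

```python
def split_layers(total_layers: int, num_stages: int) -> list[range]:
    """Partition decoder layers 0..L-1 into contiguous per-stage ranges.

    Illustrative even split; the real assignment may weight stages by
    available GPU memory instead.
    """
    base, extra = divmod(total_layers, num_stages)
    ranges, start = [], 0
    for i in range(num_stages):
        size = base + (1 if i < extra else 0)
        ranges.append(range(start, start + size))
        start += size
    return ranges


# e.g. 60 decoder layers over three stages -> A: 0..19, B: 20..39, C: 40..59
print(split_layers(60, 3))  # [range(0, 20), range(20, 40), range(40, 60)]
```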
System Diagram (Tensor Streaming)
┌─────────────────────────────────────────────────────────────────────────────┐
│ User Request (HTTPS) │
└────────────────────────────────┬────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ STAGE A (Cloud Run Service – L4 GPU) │
│ ┌───────────────────────────────────────────────────────────────────────┐ │
│ │ • FastAPI HTTP / SSE endpoint (public-facing, OpenAI-compatible) │ │
│ │ • Tokenizer (chat templates, stop IDs, text → token IDs) │ │
│ │ • Embeddings + decoder layers 0..K₁−1 (front of the model) │ │
│ │ • Maintains its own KV cache for layers 0..K₁−1 │ │
│ │ • Computes boundary activations: x₀ ∈ ℝ^{B×S×d_model} │ │
│ │ • Outputs x₀ as bf16 boundary tensor over gRPC │ │
│ └───────────────────────────────────────────────────────────────────────┘ │
└────────────────────────────────┬────────────────────────────────────────────┘
│ gRPC stream (bf16-serialized x₀)
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ STAGE B (Cloud Run Service – L4 GPU) │
│ ┌───────────────────────────────────────────────────────────────────────┐ │
│ │ • gRPC bidirectional streaming server (Boundary.Decode) │ │
│ │ • Middle decoder layers K₁..K₂−1 │ │
│ │ • Dynamic KV cache management for its layer range │ │
│ │ • Receives boundary tensor x₀ │ │
│ │ • Computes x₁ = f_B(x₀) through layers K₁..K₂−1 │ │
│ │ • Shape preserved: x₀, x₁ ∈ ℝ^{B×S×d_model} │ │
│ │ • Sends x₁ as bf16 boundary tensor over gRPC │ │
│ └───────────────────────────────────────────────────────────────────────┘ │
└────────────────────────────────┬────────────────────────────────────────────┘
│ gRPC stream (bf16-serialized x₁)
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ STAGE C (Cloud Run Service – L4 GPU) │
│ ┌───────────────────────────────────────────────────────────────────────┐ │
│ │ • gRPC bidirectional streaming server (Boundary.Decode) │ │
│ │ • Final decoder layers K₂..L−1 │ │
│ │ • Final LayerNorm + LM Head │ │
│ │ • Dynamic KV cache for its own layers │ │
│ │ • Receives boundary tensor x₁ │ │
│ │ • Computes logits via x₂ = f_C(x₁) │ │
│ │ • Token sampling (temperature, top-p) │ │
│ │ • Returns next_token_id back over the same gRPC stream │ │
│ └───────────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
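To make the Boundary.Decode hop more concrete, here is a minimal client-side sketch of one round-trip. Only the Boundary service and its streaming Decode method come from our design; the generated modules (boundary_pb2, boundary_pb2_grpc), the message and field names, and the insecure channel are simplifying assumptions - the real services sit behind authenticated Cloud Run endpoints.

```python
import grpc

# Hypothetical modules generated from our boundary.proto - the names are assumptions.
import boundary_pb2
import boundary_pb2_grpc


def decode_once(payload: bytes, shape: list[int], target: str) -> int:
    """One Boundary.Decode round-trip: stream one bf16 boundary tensor, read a token id."""
    with grpc.insecure_channel(target) as channel:
        stub = boundary_pb2_grpc.BoundaryStub(channel)

        def request_iter():
            # Message and field names (BoundaryTensor, data, shape, next_token_id)
            # are illustrative, not the exact schema.
            yield boundary_pb2.BoundaryTensor(data=payload, shape=shape)

        # Decode is bidirectional streaming, so both request and reply are iterators.
        replies = stub.Decode(request_iter())
        return next(replies).next_token_id
```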
Each token step can be expressed as the composition of stage functions:
$$ x_{t+1} = f_C\big(f_B(f_A(\text{input}))\big), $$
where activations are streamed between Cloud Run GPU services as boundary tensors
$$ x \in \mathbb{R}^{B \times S \times d}, $$
preserving the hidden state across stages in bf16 precision.
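Put differently, the decode loop is just that composition applied once per generated token. The sketch below treats each stage as a plain callable - in reality stage_b and stage_c are reached over the gRPC streams described above.

```python
from typing import Callable, Sequence

import torch


def generate(
    prompt_ids: Sequence[int],
    stage_a: Callable[[Sequence[int]], torch.Tensor],  # embeddings + layers 0..K1-1
    stage_b: Callable[[torch.Tensor], torch.Tensor],   # layers K1..K2-1
    stage_c: Callable[[torch.Tensor], int],            # layers K2..L-1, norm, LM head, sampling
    eos_token_id: int,
    max_new_tokens: int = 128,
) -> list[int]:
    """Per-token driver loop: each new token is one pass through f_A, f_B, f_C."""
    tokens = list(prompt_ids)
    for _ in range(max_new_tokens):
        x0 = stage_a(tokens)   # boundary tensor x0 with shape (B, S, d_model)
        x1 = stage_b(x0)       # boundary tensor x1, same shape
        next_id = stage_c(x1)  # logits -> sampled next_token_id
        tokens.append(next_id)
        if next_id == eos_token_id:
            break
    return tokens
```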
How We Built It
We built Commissure in Python 3.11, using:
- PyTorch 2.8 + CUDA 12.8 - model slicing and tensor streaming
- FastAPI - for the HTTP/SSE public interface (Stage A)
- gRPC + Protocol Buffers - for bf16 boundary streaming between services
- Google Cloud Run (L4 GPUs) - for all compute stages
- Cloud Build, Artifact Registry, Secret Manager, and GCS - for fully automated deployment
- A custom CLI, commissure up, that builds the Docker image, uploads weights, deploys all three services, and automatically wires them together (C → B → A)
We also implemented:
- Lazy model loading to reduce cold-start latency
- Asynchronous warm-up and caching
- DynamicCache handling for cross-stage attention reuse
- bf16 wire serialization for compact activation transfer
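The bf16 wire format itself is little more than raw bytes plus a shape. Here is a minimal sketch of the round-trip - the helper names are ours, and the real implementation may differ in detail:

```python
import torch


def tensor_to_wire(x: torch.Tensor) -> tuple[bytes, list[int]]:
    """Serialize an activation tensor as raw bf16 bytes plus its shape."""
    x = x.to(torch.bfloat16).contiguous()
    # Reinterpret the 2-byte bf16 payload as int16 so it can be exported as raw bytes.
    return x.view(torch.int16).cpu().numpy().tobytes(), list(x.shape)


def wire_to_tensor(data: bytes, shape: list[int], device: str = "cuda") -> torch.Tensor:
    """Rebuild the bf16 boundary tensor on the receiving stage."""
    flat = torch.frombuffer(bytearray(data), dtype=torch.int16)
    return flat.view(torch.bfloat16).reshape(shape).to(device)
```

At two bytes per element, a single B×S×d_model boundary tensor stays compact enough to stream on every token step.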
Challenges We Ran Into
- Memory ceilings: fitting partial model shards while preserving bf16 precision.
- Inter-stage streaming: optimizing latency while maintaining numerical stability.
- Checkpoint slicing: dynamically mapping Hugging Face prefixes across layer splits (see the sketch after this list).
- Cold starts: solved via lazy loading, async warm-ups, and cache persistence.
- Serverless orchestration: ensuring ephemeral GPU services behave like one continuous model.
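For the checkpoint-slicing point above, the core idea is to keep only the weights whose layer index falls inside a stage's range. The sketch below assumes standard Hugging Face-style key prefixes (model.layers.<i>…, model.embed_tokens, model.norm, lm_head), which vary by model family:

```python
import re

import torch


def slice_state_dict(
    state_dict: dict[str, torch.Tensor],
    layer_range: range,
    is_first: bool,
    is_last: bool,
) -> dict[str, torch.Tensor]:
    """Keep only the weights one stage needs, by layer index and role."""
    layer_re = re.compile(r"^model\.layers\.(\d+)\.")
    kept = {}
    for key, tensor in state_dict.items():
        match = layer_re.match(key)
        if match:
            if int(match.group(1)) in layer_range:
                kept[key] = tensor
        elif key.startswith("model.embed_tokens.") and is_first:
            kept[key] = tensor  # embeddings belong to Stage A
        elif (key.startswith("model.norm.") or key.startswith("lm_head.")) and is_last:
            kept[key] = tensor  # final norm + LM head belong to Stage C
    return kept
```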
Accomplishments That We’re Proud Of
- Successfully ran Gemma-3-27B across three cooperating Cloud Run GPU services, end-to-end.
- Achieved real-time, streaming chat completions fully compatible with OpenAI’s API format.
- Created a reproducible one-command deployment pipeline (commissure up) using Google Cloud tools.
- Delivered proof that Cloud Run can perform distributed inference - something once thought impossible on serverless.
- Most importantly - made large-model inference accessible to teams that love Cloud Run’s simplicity.
What We Learned
- Cloud Run is more capable than expected - it can host distributed compute graphs, not just APIs.
- gRPC + bf16 is the key to bridging precision and performance in cross-container inference.
- Layer partitioning + DynamicCache let the model size we can serve grow roughly linearly with the number of services.
- Serverless can be fast, scalable, and HPC-grade when you rethink the architecture, not the platform.
What’s Next for Commissure
- Broader model support: Extend the same split-runtime approach to LLaMA, DeepSeek, MiniMax, GPT-OSS, and future open-weight models.
- Per-stage quantization: Enable seamless execution of hundreds-of-gigabyte checkpoints through mixed 4-bit / 8-bit compression - with no architecture changes required.
- Adaptive scaling: Automatically tune layer splits (K₁, K₂, Lᵢ) based on GPU size, region, or workload.
- Cross-region mesh inference: Link Cloud Run GPUs across continents for global, low-latency collaboration.
- Distributed training: Reuse the same staged, gRPC-connected topology to experiment with training and fine-tuning across multiple Cloud Run GPUs.
Our next goal is simple - to keep pushing serverless beyond its limits, until Cloud Run thinks like a supercomputer.
Summary
Commissure bridges the gap between serverless and high-performance AI.
It proves that with the right architecture, even massive language models can run fully managed, secure, and elastic — no clusters, no manual scaling, just Cloud Run.