You can now call models running on your own hardware through a hosted AI gateway. Privately, with one command:
`ngrok http 8000 --url https://vllm.internal`
Plus all the public models you know and love. ↓
We spent a lot of time listening to feedback and watching how users worked with our early access release. Super proud of the team and the effort that went into this. If you want to route traffic to AI in your applications, this is the gateway for you.
ngrok.ai
I wired tiny OLED screens onto two Raspberry Pis so I can watch them find and forget each other.
It's a live view of the kernel's ARP table: how every machine tracks its neighbors on a local network and maps IPs → MAC addresses.
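If you want to poke at that table yourself, here's a minimal sketch of reading it. It assumes the Linux `/proc/net/arp` text layout (header row, then IP, HW type, flags, MAC, mask, device); the sample rows and the `parse_arp_table` name are made up for illustration and are not the code running on the Pis.

```python
# Hypothetical sample in the /proc/net/arp layout; addresses are invented.
SAMPLE = """\
IP address       HW type     Flags       HW address            Mask     Device
192.168.1.1      0x1         0x2         aa:bb:cc:dd:ee:01     *        eth0
192.168.1.42     0x1         0x0         00:00:00:00:00:00     *        eth0
192.168.1.7      0x1         0x2         aa:bb:cc:dd:ee:07     *        wlan0
"""


def parse_arp_table(text):
    """Map IP -> MAC for resolved entries in /proc/net/arp-style text."""
    neighbors = {}
    for line in text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 6:
            continue  # ignore blank or truncated rows
        ip, _hw_type, flags, mac, _mask, _device = fields[:6]
        # 0x2 is the "complete" flag: the kernel has a MAC for this IP.
        if int(flags, 16) & 0x2:
            neighbors[ip] = mac
    return neighbors


neighbors = parse_arp_table(SAMPLE)
```

On a Linux box, `parse_arp_table(open("/proc/net/arp").read())` should give the live version. Entries with flags `0x0` are neighbors the kernel has asked about but not (or no longer) resolved, which is the "forgetting" half of the story.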
When you route to a local model, it's fully private connectivity: no publicly addressable URLs, no public IPs to allowlist, no ports to open, wherever you've got those GPUs humming.
Works with Ollama, vLLM, LM Studio, and the neoclouds, too.
One baseURL change and an access key gets you:
→ routing to every model, even dedicated ones on AWS, Azure, etc.
→ automatic failover + retries when a model slows or fails
→ scoped keys per app
→ token, latency, and cost visibility by app, dev, and model
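To make "failover + retries" concrete, here's a rough sketch of the general pattern: try a backend a couple of times, then move on to the next one. This is not how the gateway implements it; `call_with_failover`, the backend names, and the fake model functions are all hypothetical stand-ins.

```python
def call_with_failover(backends, prompt, retries_per_backend=2):
    """Try each (name, callable) backend in order, retrying before moving on."""
    errors = []
    for name, call in backends:
        for attempt in range(1, retries_per_backend + 1):
            try:
                return name, call(prompt)
            except Exception as exc:  # a real client would catch narrower errors
                errors.append(f"{name} attempt {attempt}: {exc}")
    raise RuntimeError("all backends failed: " + "; ".join(errors))


# Hypothetical stand-ins for real model calls.
def flaky_local(prompt):
    raise TimeoutError("local model timed out")


def hosted(prompt):
    return f"echo: {prompt}"


result = call_with_failover(
    [("local-vllm", flaky_local), ("hosted", hosted)], "hi"
)
```

The point of putting this behind a gateway is that your app doesn't carry this loop at all: it makes one request and the routing happens upstream.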
Yes, ngrok gives localhost a public URL.
It's also developer infrastructure that routes and secures traffic to the wild thing you'll plug into prod next.
Reach into customer networks. Put every bit of ingress behind one front door. Run one gateway for devices, APIs, and LLMs.
Our own @samwhoo ported Kubernetes to the browser. Like a real flippin' cluster with lifecycles, DNS, and a simulated network.
It's ~100k lines of TypeScript, almost all written by LLMs, but with every line reviewed by hand to keep it slop-free ↓
Those dots moving around a webpage? Real (enough) pods sending each other requests over a simulated network, all in your browser. Hit pause and the whole cluster freezes, because it runs on a fake clock.
Go poke at the demo for yourself: webernetes-demo.ngrok.app
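Why does a fake clock make pause work? A minimal sketch of the idea, in Python rather than the project's TypeScript and not taken from its code: every timer is scheduled against a counter you advance yourself, so if you stop advancing it, nothing fires. `FakeClock` and its methods are hypothetical names.

```python
import heapq


class FakeClock:
    """A simulated clock: time only moves when advance() is called."""

    def __init__(self):
        self.now = 0
        self.paused = False
        self._queue = []  # heap of (fire_time, sequence, callback)
        self._seq = 0     # tie-breaker so callbacks are never compared

    def call_later(self, delay, fn):
        heapq.heappush(self._queue, (self.now + delay, self._seq, fn))
        self._seq += 1

    def advance(self, dt):
        if self.paused:
            return  # frozen: no time passes, no timers fire
        deadline = self.now + dt
        while self._queue and self._queue[0][0] <= deadline:
            when, _, fn = heapq.heappop(self._queue)
            self.now = when
            fn()
        self.now = deadline


log = []
clock = FakeClock()
clock.call_later(100, lambda: log.append("pod-a -> pod-b"))

clock.paused = True
clock.advance(500)   # ignored while paused
clock.paused = False
clock.advance(100)   # the request fires now
```

Because the whole simulation reads time from one object, freezing that object freezes every pod, timer, and in-flight request at once.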
In 1996, AOL went down for 19 hours.
It pushed "NASA finds evidence of life on Mars" off the front page of the New York Times.
We sponsored an SRE to write a human postmortem of that outage: the stuff five whys leave out.
But in the end, the best material wasn't technical. It was all the very human stories happening around the outage.
So Mac wonders: SREs treat the technology as the protagonist and the people affected as statistics… but maybe we've got it all backwards?