The longer the context, the more memory your LLM needs. We introduce research techniques to compress that memory 200× on the fly without changing the base model.
1/ You can shrink a language model's KV cache by 200×, in a single forward pass, and it still answers correctly.
At 256k context that's 36 GiB of cache down to ~360 MiB, with no change to the base model.
Here's how we did it 👇
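For context on where a number like 36 GiB comes from, here is a minimal sketch of the standard uncompressed KV cache size: keys and values for every layer, KV head, head dimension, and token. The config below (36 layers, 8 KV heads, head_dim 128, bf16, and "256k" read as 262,144 tokens) is a hypothetical choice that happens to reproduce 36 GiB. It is not stated to be the model used here, and `kv_cache_bytes` is an illustrative helper rather than part of any library.

```python
GIB = 2**30  # bytes per GiB


def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    """Size of a standard (uncompressed) KV cache for one sequence.

    The leading factor of 2 accounts for storing both keys and values.
    bytes_per_elem=2 assumes a 16-bit dtype such as bf16.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem


if __name__ == "__main__":
    # Hypothetical config chosen only so the total matches the 36 GiB figure.
    full = kv_cache_bytes(num_layers=36, num_kv_heads=8, head_dim=128,
                          seq_len=256 * 1024)
    print(f"full KV cache: {full / GIB:.1f} GiB")  # prints "full KV cache: 36.0 GiB"
```

The cache grows linearly with `seq_len`, which is why a fixed compression ratio matters most at long context.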
Baseten is live on the Respan Gateway.
Congratulations to the @RespanAI team on launching their Gateway, which brings observability, evals, and routing to agents.
Try Baseten Model APIs now on Respan.
Model selection isn't just a fancy term for "looking at benchmarks." If you're just auto-updating and going off Twitter vibes, you're not adding any value to your business or your customers. Doing this well means you need to deeply understand your use cases, how much
Working on the Training team at Baseten, I often see companies agonize over which model to use. So many people worry about how to keep up with benchmarks and new releases.
But with post-training and specialization, and as we see a rising tide in the intelligence of many
Are you tired of waiting 17 minutes for an AI agent to finish a code change?
As an agent’s context grows, standard transformer attention can turn long runs into a bottleneck.
@NVIDIAAI Nemotron 3 Ultra addresses this with a hybrid architecture that replaces several
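As a rough illustration of why long agent contexts strain standard attention, the sketch below counts the query-key dot products in one causal attention head over a full prefill: token i attends to tokens 0..i, so the total is n(n + 1)/2 and grows quadratically with context length. It ignores heads, layers, and hardware effects, it is not a model of Nemotron 3 Ultra's hybrid architecture, and `causal_attention_pairs` is a hypothetical helper.

```python
def causal_attention_pairs(seq_len: int) -> int:
    """Query-key dot products in one causal attention head over a full prefill.

    Token i attends to tokens 0..i, i.e. i + 1 keys, so the total is
    1 + 2 + ... + n = n * (n + 1) / 2.
    """
    return seq_len * (seq_len + 1) // 2


if __name__ == "__main__":
    # Doubling the context roughly quadruples the work.
    for n in (32_000, 64_000, 128_000):
        print(f"{n:>7} tokens -> {causal_attention_pairs(n):,} dot products")
```

During decoding, each new token does i + 1 dot products against the cache, so every step also gets slower as the context grows.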
Introducing NVIDIA Nemotron 3 Ultra.
A frontier smart open model built for long-running agents that need to plan, reason, use tools and keep working across complex coding, research and enterprise workflows.
Up to 5x faster inference and up to 30% lower cost for agentic tasks.
Today we're announcing MAI-Thinking-1 with Microsoft and it will be available on Baseten soon.
Microsoft built something genuinely different here: a commercial-grade thinking model trained on clean data with no distillation from third-party models and designed to be fine-tuned
I’m thrilled to welcome Gabe Stern to Baseten to lead Legal. Gabe is the whole package: deeply experienced, sharp, highly trusted, and commercially minded. We first got to work together at Slack, where he was an exceptional partner and played a critical role through Slack's