The Agent Open panel lineup is live.
On June 30th in San Francisco, Braintrust and friends host a conversation with leaders building the infrastructure for shipping quality agents.
Then, it's time for pickleball. First come, first serve. Literally.
There have been six generations of AI agents:
- A simple prompt that asks a model a question.
- A fixed pipeline that retrieves context and puts it into the prompt to get a result.
- A ReAct loop, in which the model decides which tools to call and in what order.
- A
When you're building AI systems, you need to know what prompt your LLM received, what it returned, and how many tokens it used. And you need to log tool calls, retrieval, reasoning, and handoffs between subagents.
OpenTelemetry is an OSS framework for capturing that data using
How do you make AI traces readable for non-engineers?
Custom trace views in Braintrust transform a raw trace into a format that a subject matter expert can understand.
For example, you can turn a customer support trace into a ticket card with the entire conversation, the
Braintrust now integrates with Azure AI Foundry, giving you access to OpenAI models and the full Azure model catalog, including Grok, Claude, and DeepSeek.
Configure it with an API key, Entra ID, or workload identity federation for secure cloud provider integrations.
The success or failure of an agent is measured differently depending on the needs of the business. Braintrust lets you define custom facets and track every production trace against what really matters.
Braintrust is presenting a workshop on how to:
- Define the dimensions that
Your AI tools should understand your log schemas, help debug failed evals, and answer questions about your model performance.
Braintrust’s MCP server makes this possible with Cursor, Claude, VS Code, and more.
AI governance is entering a new phase. The EU AI Act enforces legal accountability for any company with EU customers, and ISO 42001 compliance is becoming a requirement in enterprise procurement.
The teams best prepared for these governance frameworks are the teams that already
Every AI team is building with different frameworks, different model providers, and different languages.
Braintrust is designed for this reality. Instrument once via SDKs or OpenTelemetry, and get consistent traces, evals, and debugging across your entire stack.
"What you do is you prioritize the top few benchmarks and then you probably bullshit the rest."
We talk a lot about how AI is making coding easier for non-technical folks, but don't hear much about how the most elite engineers are delegating their most technically complex work.
Because LLMs are non-deterministic, a sudden change in your eval score could just be due to variance.
A binomial test tells you how likely a result at least that extreme would be by chance alone, so you can judge whether a change in eval score is real or just noise.
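To make the idea concrete, here is a minimal sketch of an exact two-sided binomial test using only Python's standard library. The function names, the 80% baseline pass rate, and the 70-of-100 run are illustrative assumptions, not Braintrust's implementation or real results.

```python
from math import comb


def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k passes in n independent cases with pass rate p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def binomial_test(passes: int, cases: int, baseline_rate: float) -> float:
    """Two-sided exact binomial p-value.

    Sums the probability of every outcome that is no more likely than the
    observed one, assuming the true pass rate equals baseline_rate.
    """
    observed = binomial_pmf(passes, cases, baseline_rate)
    # Small relative tolerance so ties are not lost to floating-point rounding.
    threshold = observed * (1 + 1e-7)
    total = sum(
        pmf
        for k in range(cases + 1)
        if (pmf := binomial_pmf(k, cases, baseline_rate)) <= threshold
    )
    return min(1.0, total)


# Hypothetical scenario: the baseline eval passes 80% of cases,
# and a new run passes 70 of 100.
p_value = binomial_test(passes=70, cases=100, baseline_rate=0.80)
print(f"p-value: {p_value:.4f}")
```

A small p-value (a common threshold is 0.05) suggests the drop is unlikely to be variance alone. A large one means the run is consistent with the baseline. The test treats eval cases as independent pass/fail trials, which is itself an assumption worth checking for your dataset.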
How does your team rank when it comes to shipping quality AI products?
Braintrust's AI quality assessment maps your current practices to the next useful step, whether you're still manually checking outputs or already running online scores in production.