Conversation

@shentongmartin (Contributor) commented Dec 22, 2025

PR #635: feat(adk): implement ChatModel retry for ChatModelAgent

Summary

This PR adds automatic retry functionality for ChatModel calls in ChatModelAgent, enabling graceful handling of transient LLM API failures (network timeouts, rate limits, temporary unavailability).

Quick Start

agent, err := adk.NewChatModelAgent(ctx, &adk.ChatModelAgentConfig{
    Name:  "MyAgent",
    Model: myModel,
    ModelRetryConfig: &adk.ModelRetryConfig{
        MaxRetries: 3,  // Up to 3 retries (4 total calls)
        IsRetryAble: func(ctx context.Context, err error) bool {
            return isTransientError(err)  // Your retry logic
        },
        // BackoffFunc is optional - defaults to exponential backoff with jitter
    },
})
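
The isTransientError helper above is user code. Below is a minimal sketch of one way to write it, assuming the provider surfaces standard net errors and HTTP status codes; statusCodeOf is a hypothetical helper for extracting a status code from the provider's error type (uses the standard context, errors, net, and net/http packages):

func isTransientError(err error) bool {
    // Don't retry if the caller has already given up.
    if errors.Is(err, context.Canceled) || errors.Is(err, context.DeadlineExceeded) {
        return false
    }
    // Network-level timeouts are usually worth retrying.
    var netErr net.Error
    if errors.As(err, &netErr) && netErr.Timeout() {
        return true
    }
    // Rate limits and temporary server-side failures are transient.
    switch statusCodeOf(err) { // hypothetical helper
    case http.StatusTooManyRequests, http.StatusBadGateway,
        http.StatusServiceUnavailable, http.StatusGatewayTimeout:
        return true
    }
    return false
}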

Core Design: Two Types of Errors

LLM calls can fail in two ways, and this feature handles both:

Error Type   | When It Happens                                   | How Retry Works
Direct Error | Generate() or Stream() returns error immediately  | Retry internally, user sees only final result
Stream Error | Error appears mid-stream during consumption       | Emit error to user, then retry and emit new stream
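
For the direct-error case there is nothing special to handle on the caller's side: with retry configured, transient Generate()/Stream() failures are absorbed inside the agent, and only the final outcome surfaces. A minimal sketch (handleFinalOutput is a hypothetical helper):

iterator := agent.Run(ctx, input)
for {
    event, ok := iterator.Next()
    if !ok {
        break
    }
    if event.Err != nil {
        // Only the final error surfaces here: it was either non-retryable
        // or retries were exhausted (see Error Handling below).
        log.Printf("LLM call failed permanently: %v", event.Err)
        break
    }
    handleFinalOutput(event) // hypothetical: consume the successful output
}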

Why Stream Errors Are Different

For direct errors, retry is invisible to the user. But for stream errors, the user has already started receiving chunks. We can't "take back" what was sent, so instead:

  1. Wrap the error as WillRetryError to signal "retry is pending"
  2. Emit it in the stream so the user knows what happened
  3. Start a new stream for the retry attempt

If the error won't be retried (non-retryable or max retries exhausted), the original error is returned directly without wrapping.

Handling AgentEvents with Stream Retry

When retry is configured and streaming is enabled, users should handle events like this:

iterator := agent.Run(ctx, input)
for {
    event, ok := iterator.Next()
    if !ok {
        break
    }
    
    // Check for final error (non-retryable or exhausted retries)
    if event.Err != nil {
        handleFinalError(event.Err)
        break
    }
    
    // Process streaming output
    if event.Output != nil && event.Output.MessageOutput != nil && event.Output.MessageOutput.IsStreaming {
        stream := event.Output.MessageOutput.MessageStream
        for {
            msg, err := stream.Recv()
            if err == io.EOF {
                break  // Stream completed successfully
            }
            if err != nil {
                // Check if this error will be retried (more streams coming)
                var willRetry *adk.WillRetryError
                if errors.As(err, &willRetry) {
                    log.Printf("Attempt %d failed, retrying...", willRetry.RetryAttempt)
                    break  // Wait for next event with new stream
                }
                // Original error - won't retry, workflow will stop
                log.Printf("Final error (no retry): %v", err)
                break
            }
            // Display chunk to user
            displayChunk(msg)
        }
        stream.Close() // release the stream once this attempt is finished
    }
}

Key insight: With retry enabled, you may receive multiple streaming events for a single LLM call, one per attempt. Only the successful stream (the last one, when a retry eventually succeeds) should be used as the final response; chunks from failed attempts should be discarded.
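
One way to act on that is to buffer chunks per attempt and keep only the attempt that completes cleanly. Here is a sketch, assuming the stream is a *schema.StreamReader[*schema.Message] and that schema.ConcatMessages from eino's schema package is used to merge chunks:

// collectFinalMessage drains one message stream. It returns the merged message
// if this attempt completed successfully, or the error (WillRetryError or the
// final error) otherwise. Chunks from failed attempts are simply discarded.
func collectFinalMessage(stream *schema.StreamReader[*schema.Message]) (*schema.Message, error) {
    defer stream.Close()
    var chunks []*schema.Message
    for {
        msg, err := stream.Recv()
        if err == io.EOF {
            // Successful attempt: merge buffered chunks into one message.
            return schema.ConcatMessages(chunks)
        }
        if err != nil {
            return nil, err
        }
        chunks = append(chunks, msg)
    }
}

A caller would invoke this once per streaming event and keep only the message from the call that returns a nil error.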

Multi-Agent Workflow Behavior

In sequential workflows (Agent A → Agent B), stream errors from the upstream agent are kept out of the downstream agent's context:

┌─────────────────────────────────────────────────────────────────┐
│  Agent A (with retry)              Agent B                      │
│  ┌─────────────────────┐          ┌─────────────────────┐      │
│  │ Attempt 1: ❌ Error  │          │                     │      │
│  │ Attempt 2: ❌ Error  │    →     │  Only receives      │      │
│  │ Attempt 3: ✅ Success│          │  successful message │      │
│  └─────────────────────┘          └─────────────────────┘      │
└─────────────────────────────────────────────────────────────────┘
  • End-user sees: All attempts (errors wrapped as WillRetryError, then success)
  • Agent B receives: Only the successful message (failed attempts excluded from context)

This prevents partial/errored responses from polluting the context window of downstream agents.
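
For reference, a sequential setup like the diagram might be wired up as follows. This is a sketch assuming adk.NewSequentialAgent and adk.SequentialAgentConfig are available in your adk version; modelA and modelB are placeholders, and only Agent A carries the retry config:

agentA, err := adk.NewChatModelAgent(ctx, &adk.ChatModelAgentConfig{
    Name:             "AgentA",
    Model:            modelA,
    ModelRetryConfig: &adk.ModelRetryConfig{MaxRetries: 3},
})
// ... handle err ...

agentB, err := adk.NewChatModelAgent(ctx, &adk.ChatModelAgentConfig{
    Name:  "AgentB",
    Model: modelB,
})
// ... handle err ...

// Agent A runs first; only its successful message is added to the context
// that Agent B sees. Failed attempts never reach Agent B.
workflow, err := adk.NewSequentialAgent(ctx, &adk.SequentialAgentConfig{
    Name:      "MyWorkflow",
    SubAgents: []adk.Agent{agentA, agentB},
})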

Error Propagation Rules

Scenario                 | End-User Sees                  | Agent B Called?
WillRetryError → Success | Error streams + success stream | ✅ Yes (success only)
Max retries exhausted    | Error streams + original error | ❌ No
Non-retryable error      | Original error                 | ❌ No
No retry config + error  | Original error                 | ❌ No

Configuration Reference

type ModelRetryConfig struct {
    // MaxRetries: 0 = no retries, 3 = up to 3 retries (4 total calls)
    MaxRetries int
    
    // IsRetryAble: Return true for transient errors that should be retried. If nil, all errors are retried.
    IsRetryAble func(ctx context.Context, err error) bool
    
    // BackoffFunc: Custom delay strategy. If nil, uses exponential backoff:
    // 100ms → 200ms → 400ms → ... → 10s max, with 0-50% random jitter
    BackoffFunc func(ctx context.Context, attempt int) time.Duration
}
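
If the default schedule doesn't fit, BackoffFunc can be supplied directly. As an illustration, here is a sketch of a backoff in the same spirit as the documented default (exponential from 100ms, capped at 10s, with up to 50% random jitter); whether attempt is 0- or 1-based isn't stated above, so the sketch normalizes it:

func jitteredExponentialBackoff(ctx context.Context, attempt int) time.Duration {
    const (
        base    = 100 * time.Millisecond
        maxWait = 10 * time.Second
    )
    if attempt < 1 {
        attempt = 1
    }
    // 100ms, 200ms, 400ms, ... capped at 10s.
    d := base
    for i := 1; i < attempt && d < maxWait; i++ {
        d *= 2
    }
    if d > maxWait {
        d = maxWait
    }
    // Add 0-50% jitter so concurrent callers don't retry in lockstep.
    // (ctx is available here for cancellation-aware strategies.)
    return d + time.Duration(rand.Int63n(int64(d/2)+1))
}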

Error Handling

// err here is typically the final event.Err from the run iterator.
// Check whether all retries were exhausted:
if errors.Is(err, adk.ErrExceedMaxRetries) {
    var retryErr *adk.RetryExhaustedError
    if errors.As(err, &retryErr) {
        log.Printf("Failed after %d retries. Last error: %v", 
            retryErr.TotalRetries, retryErr.LastErr)
    }
}

Implementation Notes

  • Transparent wrapper: retryChatModel wraps the model without changing graph topology (see the conceptual sketch after this list)
  • Callback per attempt: Each retry triggers OnChatModelEnd for observability
  • Gob-serializable errors: WillRetryError is registered for checkpoint serialization
  • Context-aware: Both IsRetryAble and BackoffFunc receive context for cancellation support
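
Conceptually, the Generate path of the transparent wrapper looks something like the sketch below. This is an illustration of the retry loop described in this PR, not the actual implementation; model.BaseChatModel is eino's chat-model interface, and the hooks come from ModelRetryConfig:

func retryGenerate(ctx context.Context, inner model.BaseChatModel, cfg *adk.ModelRetryConfig,
    input []*schema.Message, opts ...model.Option) (*schema.Message, error) {

    var lastErr error
    for attempt := 0; attempt <= cfg.MaxRetries; attempt++ {
        if attempt > 0 {
            // Wait before retrying, respecting caller cancellation.
            wait := 100 * time.Millisecond // placeholder; the real default is exponential + jitter
            if cfg.BackoffFunc != nil {
                wait = cfg.BackoffFunc(ctx, attempt)
            }
            select {
            case <-time.After(wait):
            case <-ctx.Done():
                return nil, ctx.Err()
            }
        }
        msg, err := inner.Generate(ctx, input, opts...)
        if err == nil {
            return msg, nil
        }
        lastErr = err
        if cfg.IsRetryAble != nil && !cfg.IsRetryAble(ctx, err) {
            return nil, err // non-retryable: surface the original error unchanged
        }
    }
    // Retries exhausted; the real code reports this via RetryExhaustedError / ErrExceedMaxRetries.
    return nil, lastErr
}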

Files Changed

File                   | Change
adk/retry_chatmodel.go | New: retry wrapper, error types, backoff logic
adk/chatmodel.go       | Config integration, callback handlers for error wrapping
adk/react.go           | Pass retry config to react graph
adk/flow.go            | Skip WillRetryError events in genAgentInput
adk/runctx.go          | StreamErr field changed from string to error
adk/utils.go           | Preserve error type in stream handling
schema/stream.go       | WithErrWrapper option for stream conversion
adk/*_test.go          | Comprehensive test coverage

@codecov (bot) commented Dec 22, 2025

Codecov Report

❌ Patch coverage is 88.75000% with 27 lines in your changes missing coverage. Please review.
✅ Project coverage is 79.94%. Comparing base (2c167dc) to head (4c272bf).
⚠️ Report is 1 commit behind head on main.

Files with missing lines | Patch % | Lines
adk/retry_chatmodel.go   | 85.71%  | 10 Missing and 11 partials ⚠️
adk/chatmodel.go         | 93.22%  | 2 Missing and 2 partials ⚠️
adk/utils.go             | 66.66%  | 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #635      +/-   ##
==========================================
+ Coverage   79.44%   79.94%   +0.50%     
==========================================
  Files         123      124       +1     
  Lines       11602    11812     +210     
==========================================
+ Hits         9217     9443     +226     
+ Misses       1658     1635      -23     
- Partials      727      734       +7     

…onential backoff

- Add RetryAbleError and NonRetryAbleError types for distinguishing error types in streams
- Implement exponential backoff with jitter (base 100ms, max 10s, 0-50% jitter)
- Change StreamErr field from string to error type for proper error propagation
- Add error wrapper support in StreamReaderWithConvert for stream error classification
- Update genAgentInput to skip RetryAbleError events but propagate other errors
- Pass ModelRetryConfig to callback handlers for stream error wrapping
- Add TotalRetries field to RetryExhaustedError
- Update IsRetryAble and BackoffFunc signatures to include context

Test coverage:
- Add sequential workflow tests for retry scenarios
- Add gob encoding tests for RetryAbleError and NonRetryAbleError
- Add test for unregistered error type gob encoding failure
- Update existing tests for StreamErr type change

Change-Id: I9d822665e23fc12cfa1997de02ab3bf85849c0e4
- Rename RetryAbleError to WillRetryError (indicates retry will happen)
- Rename NonRetryAbleError to WontRetryError (indicates no retry)
- Fix genErrWrapper to check max retries before emitting WillRetryError
- Consolidate duplicate stream error tests into table-driven test
- Remove redundant TestChatModelAgentRetry_WithTools_StreamError_SequentialFlow
- Add coverage tests for error string methods and default IsRetryAble

The naming change better reflects the user's perspective: when receiving
WillRetryError in a stream, the retry is about to happen (not already done).
WontRetryError indicates the error is final - either non-retryable or
max retries exhausted.

Change-Id: Ia613c83b1da69c600130a69f2039ff9425552ecd
@shentongmartin force-pushed the feat/chatmodelagent_retry branch from 47d74c5 to b6dfb48 on December 25, 2025 06:17
- Remove WontRetryError type - no need to wrap non-retryable errors
- Update genErrWrapper to return original error when not retrying
- Simpler API: only WillRetryError needs special handling
- Users can directly use errors.Is(err, specificError) for non-retry cases

This makes the error handling more intuitive:
- WillRetryError: retry is pending, expect another stream
- Original error: this is final, no more retries

Change-Id: I13a8db602823a9ed3ab2164b63b585db7e55399b
@shentongmartin merged commit c8dc0ba into main on Dec 25, 2025
19 checks passed
@shentongmartin deleted the feat/chatmodelagent_retry branch on December 25, 2025 13:33