Conversation

@shentongmartin (Contributor) commented Dec 22, 2025

PR #635: feat(adk): implement ChatModel retry for ChatModelAgent

Summary

This PR adds automatic retry functionality for ChatModel calls in ChatModelAgent, enabling graceful handling of transient LLM API failures (network timeouts, rate limits, temporary unavailability).

Quick Start

agent, err := adk.NewChatModelAgent(ctx, &adk.ChatModelAgentConfig{
    Name:  "MyAgent",
    Model: myModel,
    ModelRetryConfig: &adk.ModelRetryConfig{
        MaxRetries: 3,  // Up to 3 retries (4 total calls)
        IsRetryAble: func(ctx context.Context, err error) bool {
            return isTransientError(err)  // Your retry logic
        },
        // BackoffFunc is optional - defaults to exponential backoff with jitter
    },
})
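
The isTransientError helper above is user code. Below is a minimal sketch of one way to write it, assuming the provider surfaces standard net errors and HTTP status codes; statusCodeOf is a hypothetical helper for extracting a status code from the provider's error type (uses the standard context, errors, net, and net/http packages):

func isTransientError(err error) bool {
    // Don't retry if the caller has already given up.
    if errors.Is(err, context.Canceled) || errors.Is(err, context.DeadlineExceeded) {
        return false
    }
    // Network-level timeouts are usually worth retrying.
    var netErr net.Error
    if errors.As(err, &netErr) && netErr.Timeout() {
        return true
    }
    // Rate limits and temporary server-side failures are transient.
    switch statusCodeOf(err) { // hypothetical helper
    case http.StatusTooManyRequests, http.StatusBadGateway,
        http.StatusServiceUnavailable, http.StatusGatewayTimeout:
        return true
    }
    return false
}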

Core Design: Two Types of Errors

LLM calls can fail in two ways, and this feature handles both:

Error Type   | When It Happens                                   | How Retry Works
Direct Error | Generate() or Stream() returns error immediately  | Retry internally, user sees only final result
Stream Error | Error appears mid-stream during consumption       | Emit error to user, then retry and emit new stream
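
For the direct-error case there is nothing special to handle on the caller's side: with retry configured, transient Generate()/Stream() failures are absorbed inside the agent, and only the final outcome surfaces. A minimal sketch (handleFinalOutput is a hypothetical helper):

iterator := agent.Run(ctx, input)
for {
    event, ok := iterator.Next()
    if !ok {
        break
    }
    if event.Err != nil {
        // Only the final error surfaces here: it was either non-retryable
        // or retries were exhausted (see Error Handling below).
        log.Printf("LLM call failed permanently: %v", event.Err)
        break
    }
    handleFinalOutput(event) // hypothetical: consume the successful output
}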

Why Stream Errors Are Different

For direct errors, retry is invisible to the user. But for stream errors, the user has already started receiving chunks. We can't "take back" what was sent, so instead:

  1. Wrap the error as WillRetryError to signal "retry is pending"
  2. Emit it in the stream so the user knows what happened
  3. Start a new stream for the retry attempt

If the error won't be retried (non-retryable or max retries exhausted), the original error is returned directly without wrapping.

Handling AgentEvents with Stream Retry

When retry is configured and streaming is enabled, users should handle events like this:

iterator := agent.Run(ctx, input)
for {
    event, ok := iterator.Next()
    if !ok {
        break
    }
    
    // Check for final error (non-retryable or exhausted retries)
    if event.Err != nil {
        handleFinalError(event.Err)
        break
    }
    
    // Process streaming output
    if event.Output != nil && event.Output.MessageOutput != nil && event.Output.MessageOutput.IsStreaming {
        stream := event.Output.MessageOutput.MessageStream
        for {
            msg, err := stream.Recv()
            if err == io.EOF {
                break  // Stream completed successfully
            }
            if err != nil {
                // Check if this error will be retried (more streams coming)
                var willRetry *adk.WillRetryError
                if errors.As(err, &willRetry) {
                    log.Printf("Attempt %d failed, retrying...", willRetry.RetryAttempt)
                    break  // Wait for next event with new stream
                }
                // Original error - won't retry, workflow will stop
                log.Printf("Final error (no retry): %v", err)
                break
            }
            // Display chunk to user
            displayChunk(msg)
        }
        stream.Close() // release the stream once this attempt is finished
    }
}

Key insight: With retry enabled, you may receive multiple streaming events for a single LLM call, one per attempt. Only the successful stream (the last one, when a retry eventually succeeds) should be used as the final response; chunks from failed attempts should be discarded.
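
One way to act on that is to buffer chunks per attempt and keep only the attempt that completes cleanly. Here is a sketch, assuming the stream is a *schema.StreamReader[*schema.Message] and that schema.ConcatMessages from eino's schema package is used to merge chunks:

// collectFinalMessage drains one message stream. It returns the merged message
// if this attempt completed successfully, or the error (WillRetryError or the
// final error) otherwise. Chunks from failed attempts are simply discarded.
func collectFinalMessage(stream *schema.StreamReader[*schema.Message]) (*schema.Message, error) {
    defer stream.Close()
    var chunks []*schema.Message
    for {
        msg, err := stream.Recv()
        if err == io.EOF {
            // Successful attempt: merge buffered chunks into one message.
            return schema.ConcatMessages(chunks)
        }
        if err != nil {
            return nil, err
        }
        chunks = append(chunks, msg)
    }
}

A caller would invoke this once per streaming event and keep only the message from the call that returns a nil error.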

Multi-Agent Workflow Behavior

In sequential workflows (Agent A → Agent B), stream errors from the upstream agent are kept out of the downstream agent's context:

┌─────────────────────────────────────────────────────────────────┐
│  Agent A (with retry)              Agent B                      │
│  ┌─────────────────────┐          ┌─────────────────────┐      │
│  │ Attempt 1: ❌ Error  │          │                     │      │
│  │ Attempt 2: ❌ Error  │    →     │  Only receives      │      │
│  │ Attempt 3: ✅ Success│          │  successful message │      │
│  └─────────────────────┘          └─────────────────────┘      │
└─────────────────────────────────────────────────────────────────┘
  • End-user sees: All attempts (errors wrapped as WillRetryError, then success)
  • Agent B receives: Only the successful message (failed attempts excluded from context)

This prevents partial/errored responses from polluting the context window of downstream agents.
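
For reference, a sequential setup like the diagram might be wired up as follows. This is a sketch assuming adk.NewSequentialAgent and adk.SequentialAgentConfig are available in your adk version; modelA and modelB are placeholders, and only Agent A carries the retry config:

agentA, err := adk.NewChatModelAgent(ctx, &adk.ChatModelAgentConfig{
    Name:             "AgentA",
    Model:            modelA,
    ModelRetryConfig: &adk.ModelRetryConfig{MaxRetries: 3},
})
// ... handle err ...

agentB, err := adk.NewChatModelAgent(ctx, &adk.ChatModelAgentConfig{
    Name:  "AgentB",
    Model: modelB,
})
// ... handle err ...

// Agent A runs first; only its successful message is added to the context
// that Agent B sees. Failed attempts never reach Agent B.
workflow, err := adk.NewSequentialAgent(ctx, &adk.SequentialAgentConfig{
    Name:      "MyWorkflow",
    SubAgents: []adk.Agent{agentA, agentB},
})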

Error Propagation Rules

Scenario                 | End-User Sees                  | Agent B Called?
WillRetryError → Success | Error streams + success stream | ✅ Yes (success only)
Max retries exhausted    | Error streams + original error | ❌ No
Non-retryable error      | Original error                 | ❌ No
No retry config + error  | Original error                 | ❌ No

Configuration Reference

type ModelRetryConfig struct {
    // MaxRetries: 0 = no retries, 3 = up to 3 retries (4 total calls)
    MaxRetries int
    
    // IsRetryAble: Return true for transient errors that should be retried. If nil, all errors are retried.
    IsRetryAble func(ctx context.Context, err error) bool
    
    // BackoffFunc: Custom delay strategy. If nil, uses exponential backoff:
    // 100ms → 200ms → 400ms → ... → 10s max, with 0-50% random jitter
    BackoffFunc func(ctx context.Context, attempt int) time.Duration
}
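
If the default schedule doesn't fit, BackoffFunc can be supplied directly. As an illustration, here is a sketch of a backoff in the same spirit as the documented default (exponential from 100ms, capped at 10s, with up to 50% random jitter); whether attempt is 0- or 1-based isn't stated above, so the sketch normalizes it:

func jitteredExponentialBackoff(ctx context.Context, attempt int) time.Duration {
    const (
        base    = 100 * time.Millisecond
        maxWait = 10 * time.Second
    )
    if attempt < 1 {
        attempt = 1
    }
    // 100ms, 200ms, 400ms, ... capped at 10s.
    d := base
    for i := 1; i < attempt && d < maxWait; i++ {
        d *= 2
    }
    if d > maxWait {
        d = maxWait
    }
    // Add 0-50% jitter so concurrent callers don't retry in lockstep.
    // (ctx is available here for cancellation-aware strategies.)
    return d + time.Duration(rand.Int63n(int64(d/2)+1))
}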

Error Handling

// err here is typically the final event.Err from the run iterator.
// Check whether all retries were exhausted:
if errors.Is(err, adk.ErrExceedMaxRetries) {
    var retryErr *adk.RetryExhaustedError
    if errors.As(err, &retryErr) {
        log.Printf("Failed after %d retries. Last error: %v", 
            retryErr.TotalRetries, retryErr.LastErr)
    }
}

Implementation Notes

  • Transparent wrapper: retryChatModel wraps the model without changing graph topology (see the conceptual sketch after this list)
  • Callback per attempt: Each retry triggers OnChatModelEnd for observability
  • Gob-serializable errors: WillRetryError is registered for checkpoint serialization
  • Context-aware: Both IsRetryAble and BackoffFunc receive context for cancellation support
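
Conceptually, the Generate path of the transparent wrapper looks something like the sketch below. This is an illustration of the retry loop described in this PR, not the actual implementation; model.BaseChatModel is eino's chat-model interface, and the hooks come from ModelRetryConfig:

func retryGenerate(ctx context.Context, inner model.BaseChatModel, cfg *adk.ModelRetryConfig,
    input []*schema.Message, opts ...model.Option) (*schema.Message, error) {

    var lastErr error
    for attempt := 0; attempt <= cfg.MaxRetries; attempt++ {
        if attempt > 0 {
            // Wait before retrying, respecting caller cancellation.
            wait := 100 * time.Millisecond // placeholder; the real default is exponential + jitter
            if cfg.BackoffFunc != nil {
                wait = cfg.BackoffFunc(ctx, attempt)
            }
            select {
            case <-time.After(wait):
            case <-ctx.Done():
                return nil, ctx.Err()
            }
        }
        msg, err := inner.Generate(ctx, input, opts...)
        if err == nil {
            return msg, nil
        }
        lastErr = err
        if cfg.IsRetryAble != nil && !cfg.IsRetryAble(ctx, err) {
            return nil, err // non-retryable: surface the original error unchanged
        }
    }
    // Retries exhausted; the real code reports this via RetryExhaustedError / ErrExceedMaxRetries.
    return nil, lastErr
}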

Files Changed

File                   | Change
adk/retry_chatmodel.go | New: retry wrapper, error types, backoff logic
adk/chatmodel.go       | Config integration, callback handlers for error wrapping
adk/react.go           | Pass retry config to react graph
adk/flow.go            | Skip WillRetryError events in genAgentInput
adk/runctx.go          | StreamErr field changed from string to error
adk/utils.go           | Preserve error type in stream handling
schema/stream.go       | WithErrWrapper option for stream conversion
adk/*_test.go          | Comprehensive test coverage

@codecov (bot) commented Dec 22, 2025

Codecov Report

❌ Patch coverage is 88.75000% with 27 lines in your changes missing coverage. Please review.
✅ Project coverage is 79.94%. Comparing base (2c167dc) to head (4c272bf).
⚠️ Report is 1 commit behind head on main.

Files with missing lines | Patch % | Lines
adk/retry_chatmodel.go   | 85.71%  | 10 Missing and 11 partials ⚠️
adk/chatmodel.go         | 93.22%  | 2 Missing and 2 partials ⚠️
adk/utils.go             | 66.66%  | 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #635      +/-   ##
==========================================
+ Coverage   79.44%   79.94%   +0.50%     
==========================================
  Files         123      124       +1     
  Lines       11602    11812     +210     
==========================================
+ Hits         9217     9443     +226     
+ Misses       1658     1635      -23     
- Partials      727      734       +7     

…onential backoff

- Add RetryAbleError and NonRetryAbleError types for distinguishing error types in streams
- Implement exponential backoff with jitter (base 100ms, max 10s, 0-50% jitter)
- Change StreamErr field from string to error type for proper error propagation
- Add error wrapper support in StreamReaderWithConvert for stream error classification
- Update genAgentInput to skip RetryAbleError events but propagate other errors
- Pass ModelRetryConfig to callback handlers for stream error wrapping
- Add TotalRetries field to RetryExhaustedError
- Update IsRetryAble and BackoffFunc signatures to include context

Test coverage:
- Add sequential workflow tests for retry scenarios
- Add gob encoding tests for RetryAbleError and NonRetryAbleError
- Add test for unregistered error type gob encoding failure
- Update existing tests for StreamErr type change

Change-Id: I9d822665e23fc12cfa1997de02ab3bf85849c0e4
- Rename RetryAbleError to WillRetryError (indicates retry will happen)
- Rename NonRetryAbleError to WontRetryError (indicates no retry)
- Fix genErrWrapper to check max retries before emitting WillRetryError
- Consolidate duplicate stream error tests into table-driven test
- Remove redundant TestChatModelAgentRetry_WithTools_StreamError_SequentialFlow
- Add coverage tests for error string methods and default IsRetryAble

The naming change better reflects the user's perspective: when receiving
WillRetryError in a stream, the retry is about to happen (not already done).
WontRetryError indicates the error is final - either non-retryable or
max retries exhausted.

Change-Id: Ia613c83b1da69c600130a69f2039ff9425552ecd
@shentongmartin force-pushed the feat/chatmodelagent_retry branch from 47d74c5 to b6dfb48 on December 25, 2025 06:17
- Remove WontRetryError type - no need to wrap non-retryable errors
- Update genErrWrapper to return original error when not retrying
- Simpler API: only WillRetryError needs special handling
- Users can directly use errors.Is(err, specificError) for non-retry cases

This makes the error handling more intuitive:
- WillRetryError: retry is pending, expect another stream
- Original error: this is final, no more retries

Change-Id: I13a8db602823a9ed3ab2164b63b585db7e55399b
@shentongmartin merged commit c8dc0ba into main on Dec 25, 2025
19 checks passed
@shentongmartin deleted the feat/chatmodelagent_retry branch on December 25, 2025 13:33