Conversation
@Dumbris Dumbris commented Jul 8, 2025

  • Implement secure environment variable management
  • Enhance timeout management and logging in client and server components

Dumbris added 3 commits July 7, 2025 20:36
Implement secure environment variable management
Enhance timeout management and logging in client and server components
Refactor code for clarity and consistency
@Dumbris Dumbris merged commit 12a7673 into main Jul 8, 2025
31 checks passed
@Dumbris Dumbris deleted the feature/env-vars-filter branch July 14, 2025 08:53
rannow pushed a commit to rannow/mcpproxy-go that referenced this pull request Sep 23, 2025
* Implement secure environment variable management

- Added `secureenv` package for filtering environment variables based on configuration.
- Introduced `EnvConfig` struct to manage allowed system variables and custom variables.
- Updated `Config` struct to include environment configuration.
- Enhanced `DefaultConfig` to initialize secure environment settings.
- Integrated secure environment manager into upstream and client components.
- Added tests for secure environment functionality and integration scenarios.
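
The env-filtering idea described above lends itself to a small illustration. Below is a minimal sketch assuming hypothetical `EnvConfig` fields (`AllowedSystemVars`, `CustomVars`) and a `Filter` helper; the actual `secureenv` API may differ.

```go
// Sketch only: field and method names are assumptions, not the real secureenv
// API. The idea: pass through only allow-listed system variables, then inject
// configured custom variables.
package secureenv

import "strings"

// EnvConfig controls which variables may reach a spawned server process.
type EnvConfig struct {
	AllowedSystemVars []string          // e.g. "PATH", "HOME"
	CustomVars        map[string]string // always injected, e.g. keys from config
}

// Filter keeps only allowed KEY=VALUE entries from environ, then appends the
// configured custom variables.
func (c EnvConfig) Filter(environ []string) []string {
	allowed := make(map[string]bool, len(c.AllowedSystemVars))
	for _, name := range c.AllowedSystemVars {
		allowed[name] = true
	}
	var out []string
	for _, kv := range environ {
		if name, _, ok := strings.Cut(kv, "="); ok && allowed[name] {
			out = append(out, kv)
		}
	}
	for name, value := range c.CustomVars {
		out = append(out, name+"="+value)
	}
	return out
}
```

A caller would typically feed it os.Environ() before spawning an upstream stdio server.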

* Enhance timeout management and logging in client and server components

- Updated timeout duration for tool listing operations to 30 seconds for better handling of Docker container startup and API calls.
- Improved logging for various operations, including connection errors and tool calls, to provide more detailed insights during debugging.
- Added conditional logging for JSON request/response data based on debug level, enhancing traceability without cluttering logs.

* Refactor code for clarity and consistency

- Removed unnecessary blank lines in config validation and test functions.
- Simplified environment variable splitting logic in tests.
- Introduced constants for OS checks to improve readability and maintainability.
- Updated autostart functionality checks to use constants for OS types.
- Enhanced error handling in file operations for better robustness.
Dumbris added a commit that referenced this pull request Nov 2, 2025
CRITICAL: Identified the most dangerous gap in the Docker recovery implementation.

Problem: The supervisor can spawn multiple containers for the same server when
it fails to detect container liveness, leading to:
- Resource exhaustion (multiple containers per server)
- Port binding conflicts
- Orphaned containers accumulating over time
- Data corruption if containers share state

Root Causes:
1. Race condition in ForceReconnectAll() - no serialization
2. Lost container IDs when cidfile read times out (10s)
3. Random container names don't prevent duplicates
4. No pre-creation check for existing containers
5. Supervisor liveness failures (slow startup, network partition, API errors)

Comprehensive 5-Layer Solution:
- Layer 1: Idempotent container creation (pre-cleanup)
- Layer 2: Container labels for ownership tracking
- Layer 3: Distributed locks to prevent races
- Layer 4: Enhanced container health verification
- Layer 5: Graceful degradation on cidfile timeout

Impact: Prevents resource exhaustion and ensures exactly 1 container
per server at all times.

Effort: +12 hours (40 total vs 28-32 original)
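
To make Layer 1 concrete, here is a hedged sketch of idempotent pre-cleanup; the helper name and label key are illustrative (the actual implementation landed later as ensureNoExistingContainers()).

```go
// Illustrative sketch of Layer 1 (idempotent creation): before starting a new
// container for a server, force-remove any container already carrying that
// server's ownership label. Helper name and label key are assumptions.
package core

import (
	"context"
	"fmt"
	"os/exec"
	"strings"
)

// removeExistingContainers removes every container (running or stopped) whose
// ownership label matches the given server name, so that exactly one fresh
// container can be created afterwards.
func removeExistingContainers(ctx context.Context, serverName string) error {
	out, err := exec.CommandContext(ctx, "docker", "ps", "-aq",
		"--filter", "label=com.mcpproxy.server="+serverName).Output()
	if err != nil {
		return fmt.Errorf("list containers: %w", err)
	}
	for _, id := range strings.Fields(string(out)) {
		if err := exec.CommandContext(ctx, "docker", "rm", "-f", id).Run(); err != nil {
			return fmt.Errorf("remove container %s: %w", id, err)
		}
	}
	return nil
}
```
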
Dumbris added a commit that referenced this pull request Nov 2, 2025
Implements Issues #1, #2, #4, and #11 (Layers 1, 3, 4) from docker-recovery-improvements.md

**Issue #11: Duplicate Container Spawning Prevention**
- Layer 1 (Idempotent Creation): ensureNoExistingContainers() cleans up all existing containers before creating new ones
- Layer 3 (Distributed Locks): Per-server mutex prevents concurrent container creation races
- Layer 4 (Enhanced Health Verification): verifyContainerHealthy() checks Docker container state AND responsiveness
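
A minimal sketch of the Layer 3 per-server lock (the idea behind container_lock.go); the type and method names here are assumptions, not the actual API.

```go
// Sketch of the per-server locking idea behind container_lock.go; names are
// illustrative.
package core

import "sync"

// containerLocks hands out one mutex per server name so that concurrent
// reconnect attempts for the same server serialize container creation.
type containerLocks struct {
	mu    sync.Mutex
	locks map[string]*sync.Mutex
}

func newContainerLocks() *containerLocks {
	return &containerLocks{locks: make(map[string]*sync.Mutex)}
}

// forServer returns the mutex dedicated to serverName, creating it on first use.
func (c *containerLocks) forServer(serverName string) *sync.Mutex {
	c.mu.Lock()
	defer c.mu.Unlock()
	if _, ok := c.locks[serverName]; !ok {
		c.locks[serverName] = &sync.Mutex{}
	}
	return c.locks[serverName]
}
```

Holding the server's mutex across pre-cleanup and docker run is what keeps two goroutines from each creating a container for the same server.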

**Issue #1: Docker-only Filtering**
- ForceReconnectAll() now filters to only reconnect Docker-based servers
- HTTP/SSE/stdio servers are skipped, preventing unnecessary reconnections

**Issue #2: Container Health Verification**
- verifyContainerHealthy() detects frozen containers when Docker is paused
- Uses docker inspect to verify container running + responsive
- Prevents skipping reconnection for zombie containers
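
For the docker inspect half of this check, a rough sketch (the helper name and the separate responsiveness probe are assumptions):

```go
// Sketch of the inspect-based part of verifyContainerHealthy(): a container is
// only treated as healthy if its state is exactly "running" (paused or exited
// containers report a different status). Helper name is illustrative.
package upstream

import (
	"context"
	"fmt"
	"os/exec"
	"strings"
)

func containerRunning(ctx context.Context, containerID string) (bool, error) {
	out, err := exec.CommandContext(ctx, "docker", "inspect",
		"--format", "{{.State.Status}}", containerID).Output()
	if err != nil {
		// inspect fails when the container no longer exists
		return false, fmt.Errorf("inspect %s: %w", containerID, err)
	}
	return strings.TrimSpace(string(out)) == "running", nil
}
```
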

**Issue #4: Add Recovering State**
- New StateCoreRecoveringDocker state provides UX feedback during recovery
- EventDockerRecovered triggers transition from error → recovering → launching
- Tray menu shows "Docker engine recovered - reconnecting servers..."
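
A hedged sketch of how the new state and event could slot into the tray state machine; StateCoreRecoveringDocker and EventDockerRecovered come from this commit, while the other identifiers and the transition-table shape are assumptions.

```go
// Sketch of the error -> recovering -> launching flow; only
// StateCoreRecoveringDocker and EventDockerRecovered are names from the
// commit, the rest is illustrative.
package state

type State string
type Event string

const (
	StateCoreError            State = "core_error" // assumed name
	StateCoreRecoveringDocker State = "core_recovering_docker"
	StateCoreLaunching        State = "core_launching" // assumed name

	EventDockerRecovered Event = "docker_recovered"
	EventReconnectDone   Event = "reconnect_done" // assumed name
)

// transitions captures the flow the tray follows while it shows
// "Docker engine recovered - reconnecting servers...".
var transitions = map[State]map[Event]State{
	StateCoreError:            {EventDockerRecovered: StateCoreRecoveringDocker},
	StateCoreRecoveringDocker: {EventReconnectDone: StateCoreLaunching},
}
```
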

**Files Modified:**
- internal/upstream/core/docker.go: ensureNoExistingContainers(), GetContainerID()
- internal/upstream/core/connection.go: container lock acquisition, pre-cleanup
- internal/upstream/core/container_lock.go: NEW - distributed lock implementation
- internal/upstream/core/client.go: GetContainerID() accessor
- internal/upstream/manager.go: verifyContainerHealthy(), Docker-only filtering
- internal/upstream/managed/client.go: GetContainerID() delegation
- cmd/mcpproxy-tray/internal/state/states.go: StateCoreRecoveringDocker, EventDockerRecovered
- cmd/mcpproxy-tray/main.go: state mapping, handleDockerRecovering()
- internal/tray/connection_state.go: ConnectionStateRecoveringDocker

**Testing:**
- All upstream package tests passing (./internal/upstream/...)
- Code compiles cleanly (go build ./...)
- State machine transitions validated

**Impact:**
- Prevents duplicate container spawning during concurrent reconnections
- Eliminates resource exhaustion from orphaned containers
- Better UX - users see recovery in progress instead of errors
- Only Docker servers affected by Docker recovery (not HTTP/SSE)
Dumbris added a commit that referenced this pull request Nov 2, 2025
Implements Issues #3, #9, and #11 (Layers 2, 5) from docker-recovery-improvements.md

**Issue #11 Layer 2: Container Labels for Ownership Tracking**
- Add Docker labels to all containers (instance ID, server name, PID)
- formatContainerLabels() generates ownership labels automatically
- Labels: com.mcpproxy.managed, com.mcpproxy.instance, com.mcpproxy.server
- Enables accurate cleanup of orphaned containers from crashed instances
- New file: internal/upstream/core/instance.go for instance ID management
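
A sketch of how these labels could be rendered into docker run arguments; formatContainerLabels() is the name from the commit, but its signature and the pid label key here are assumptions.

```go
// Sketch of label generation for container ownership tracking; the function
// signature and the com.mcpproxy.pid key are assumptions.
package core

import (
	"fmt"
	"os"
)

// formatContainerLabels returns --label flags identifying which mcpproxy
// instance owns a container and which upstream server it serves.
func formatContainerLabels(instanceID, serverName string) []string {
	return []string{
		"--label", "com.mcpproxy.managed=true",
		"--label", "com.mcpproxy.instance=" + instanceID,
		"--label", "com.mcpproxy.server=" + serverName,
		"--label", fmt.Sprintf("com.mcpproxy.pid=%d", os.Getpid()),
	}
}
```

Cleanup can then list candidates with docker ps -a --filter label=com.mcpproxy.managed=true and skip containers whose instance label matches the live instance ID.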

**Issue #11 Layer 5: cidfile Timeout Fallback**
- When cidfile read times out (slow image pull), fall back to name lookup
- Use 'docker ps --filter name=...' to recover container ID
- Prevents orphaned containers when cidfile fails
- Graceful degradation maintains container cleanup capability
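
A minimal sketch of the name-based fallback, assuming a hypothetical helper:

```go
// Sketch of the cidfile fallback: when the cidfile has not appeared in time
// (e.g. slow image pull), recover the container ID via the --name passed to
// docker run. Helper name is illustrative.
package core

import (
	"context"
	"fmt"
	"os/exec"
	"strings"
)

func containerIDByName(ctx context.Context, containerName string) (string, error) {
	out, err := exec.CommandContext(ctx, "docker", "ps", "-aq",
		"--filter", "name="+containerName).Output()
	if err != nil {
		return "", fmt.Errorf("docker ps: %w", err)
	}
	ids := strings.Fields(string(out))
	if len(ids) == 0 {
		return "", fmt.Errorf("no container named %q found", containerName)
	}
	return ids[0], nil
}
```
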

**Issue #3: Exponential Backoff Polling**
- Replace fixed 5s Docker polling with exponential backoff
- Intervals: 2s → 5s → 10s → 30s → 60s (max)
- Fast recovery when Docker quickly resumes (2s vs 5s)
- Lower CPU/battery usage when Docker off for extended periods
- Logs total wait time and attempt count for observability
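
The schedule above could be expressed roughly like this; the polling loop and the dockerUp probe are assumptions, the 2s/5s/10s/30s/60s intervals are from the commit.

```go
// Sketch of the exponential backoff polling for Docker availability.
package tray

import (
	"context"
	"time"
)

var backoffSchedule = []time.Duration{
	2 * time.Second,
	5 * time.Second,
	10 * time.Second,
	30 * time.Second,
	60 * time.Second, // later attempts stay at the 60s cap
}

// waitForDocker polls until dockerUp reports true, sleeping according to the
// schedule, and returns the attempt count and total time waited for logging.
func waitForDocker(ctx context.Context, dockerUp func() bool) (attempts int, waited time.Duration) {
	for !dockerUp() {
		idx := attempts
		if idx >= len(backoffSchedule) {
			idx = len(backoffSchedule) - 1
		}
		select {
		case <-ctx.Done():
			return attempts, waited
		case <-time.After(backoffSchedule[idx]):
			waited += backoffSchedule[idx]
		}
		attempts++
	}
	return attempts, waited
}
```
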

**Issue #9: Partial Failure Handling**
- ForceReconnectAll() now returns detailed ReconnectResult struct
- Tracks: total servers, attempted, successful, failed, skipped
- Per-server skip reasons ("container healthy", "not Docker", "disabled")
- Per-server failure errors for debugging
- Runtime logs comprehensive summary of reconnection results
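
A hedged sketch of what the result struct could look like; only the counters named in the bullets are from the commit, the field layout is assumed.

```go
// Sketch of the ForceReconnectAll() reporting; exact field names and types are
// assumptions beyond what the bullets above imply.
package upstream

// ReconnectResult summarizes one reconnection pass across all servers.
type ReconnectResult struct {
	Total      int               // servers considered
	Attempted  int               // Docker-based servers we tried to reconnect
	Successful int               // reconnects that completed
	Failed     int               // reconnects that returned an error
	Skipped    int               // servers left alone
	SkipReason map[string]string // per-server reason, e.g. "container healthy", "not Docker", "disabled"
	Errors     map[string]error  // per-server failure errors for debugging
}
```
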

**Files Modified:**
- internal/upstream/core/instance.go: NEW - instance ID persistence
- internal/upstream/core/isolation.go: add container labels
- internal/upstream/core/docker.go: cidfile timeout fallback
- internal/upstream/manager.go: ReconnectResult struct, detailed tracking
- internal/runtime/lifecycle.go: log reconnection results
- cmd/mcpproxy-tray/main.go: exponential backoff polling, min() helper

**Testing:**
- All upstream tests passing
- Code compiles cleanly
- Exponential backoff logic validated

**Impact:**
- Container labels enable cleanup of orphaned containers across restarts
- cidfile fallback prevents container orphaning on slow pulls
- Exponential backoff saves resources while maintaining fast recovery
- Partial failure tracking improves debugging and observability