# gh-115634: Fix ProcessPoolExecutor deadlock with max_tasks_per_child #140900
## Conversation
gh-115634: Fix ProcessPoolExecutor deadlock with max_tasks_per_child

Fix deadlock in ProcessPoolExecutor when using max_tasks_per_child.

When a worker process exited after completing max_tasks_per_child tasks, the executor could incorrectly believe an idle worker existed due to stale semaphore values, preventing a replacement worker from being spawned and causing the executor to hang indefinitely.

The fix adds a process_died parameter to _adjust_process_count() to distinguish between process death (where semaphore might be wrong) and normal task submission (where semaphore is accurate). When a process dies, we check the actual process count before consulting the semaphore, ensuring a replacement worker is always spawned when needed.

This preserves the existing idle worker reuse optimization while fixing the deadlock scenario reported in pythongh-115634.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Make the pythongh-115634 regression tests more robust:

1. Add try/finally blocks to ensure executor.shutdown() is always called, preventing resource leaks on test failures or timeouts.
2. Replace loops of individual assertions with single list comparisons for clearer failure messages: when a test fails, the assertion will show exactly which results differ.
3. Add descriptive assertion messages that explain the issue context without requiring reading the test code.

These improvements ensure tests clean up properly even when run against unpatched CPython where the deadlock would occur, and make assertion failures more actionable for debugging.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
## TL;DR

This analysis was prepared, drafted, and posted by Claude Code (Sonnet 4.5) at the direct request of @gpshead.

## Analysis: The `_idle_worker_semaphore` Problem

### Executive Summary

The key finding: @gaogaotiantian (PR #115642 author) identified the core issue (source):
### What the Semaphore Was Designed For

#### Original Purpose (bpo-39207, PR #19453)

The semaphore was introduced in Python 3.9 to enable on-demand worker spawning - an optimization to avoid spawning all workers upfront. Intended behavior:
The optimization goal: if you submit 1 task to a pool with a large `max_workers`, only one worker process needs to be spawned; additional workers are spawned on demand as more tasks arrive.

#### How It's Supposed to Work (Ideally)

### Why the Accounting Breaks

#### The Fatal Flaw

Problem: the semaphore is incremented when tasks complete, but it isn't decremented when the pool is already at max capacity, so its value stops reflecting how many workers are actually idle.

What actually happens once at max_workers: each completed task bumps the semaphore, yet the worker that produced the result immediately picks up the next queued task, so it was never actually idle.

The accounting divergence: the semaphore's value drifts above the true number of idle workers, and it can never go below zero to record pending demand.
#### Why a Counter Would Work (Comparison)

Same scenario with a signed integer counter instead: submissions that find no idle worker drive the counter negative, and completions bring it back toward zero, so its value always reflects reality (see the toy comparison below).

Key difference: the counter tracks the net state (idle workers minus waiting tasks), which can be negative. A semaphore can only track positive values, so the accounting breaks when the true state would be negative.
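To make the comparison concrete, here is a small, self-contained toy (not CPython code; the names `sem` and `counter` are purely illustrative) showing how a semaphore silently over-counts idle workers at max capacity while a signed counter stays truthful:

```python
import threading

sem = threading.Semaphore(0)   # semaphore-style accounting: value can't go below 0
counter = 0                    # signed-counter-style accounting: value may go negative

# Pool is at max_workers and three more tasks are submitted: each submission
# would like to "take" an idle worker, but none exists.
for _ in range(3):
    sem.acquire(blocking=False)    # fails each time; value stays at 0
    counter -= 1                   # records the unmet demand: -1, -2, -3

# A busy worker finishes one task and immediately starts the next queued task,
# so it was never actually idle.
sem.release()                      # value becomes 1: a phantom "idle worker"
counter += 1                       # value becomes -2: still no idle workers

print(sem.acquire(blocking=False))  # True -> a spawn check would wrongly skip spawning
print(counter)                      # -2   -> correctly shows more demand than workers
```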
### The Concrete Bug in gh-115634

#### Current Usage of the Semaphore

##### Where It's Used (3 locations)

##### When It's Consulted

Only in one place: the spawn check in `_adjust_process_count()`.

Not consulted:
#### What the Semaphore Actually Achieves

Under max_workers: ✓ Works as intended
At max_workers: ✗ Fundamentally broken
### Maintainer Discussion Insights

#### @pitrou's Analysis (CPython core dev)

(source):
Key insight: The semantic confusion between "acquiring semaphore for spawn check" vs "worker becoming busy".

(source):
#### @gaogaotiantian's Analysis (PR #115642 author)

(source):
(same source):
This is the key insight - the semaphore accounting is fundamentally incompatible with being at max capacity.

### Proposed Solutions

#### 1. PR #115642's Approach: Lock + Counter

Replace the semaphore with a lock-protected counter.

Pros:
Cons:
#### 2. @tabrezm's Approach (our fix): Prioritize Process Count

Check the actual process count (`len(self._processes)`) before consulting the semaphore when a worker process has died.

Pros:
Cons:
#### 3. Remove the Semaphore Entirely

Always check `len(self._processes) < self._max_workers` and spawn whenever the pool is below `max_workers`, regardless of the semaphore.

Pros:
Cons:
#### 4. Fix the Semaphore Accounting (Complex)

Make the semaphore actually track idle workers correctly. Requires:
Consensus: Probably not worth it given the alternatives.

### What Actually Benefits from the Semaphore?

#### The Idle Worker Reuse Optimization

Scenario: Pool has 4 workers, all idle, 1 task submitted.

Without semaphore:

```python
if len(self._processes) < self._max_workers:
    self._spawn_process()  # Wouldn't spawn (4 < 4 is False) ✓
```

Wait... the optimization already exists via process count check!

#### The Real Benefit: Avoiding Lock Contention?

Looking at the code, the semaphore avoids needing to:
Instead, it's a lock-free check via a non-blocking acquire of the semaphore. But: we already access `self._processes` on that same path anyway (see the snippet above).

### Recommendation

#### Short-term: Keep our fix (prioritize process count)
#### Medium-term: Remove the semaphore entirely

Rationale:
Proposed removal:

```python
def _adjust_process_count(self):
    if self._processes is None:
        return
    # Simple and correct: just check if we need more workers
    if len(self._processes) < self._max_workers:
        self._spawn_process()
```

Remove: the `_idle_worker_semaphore` attribute and its acquire/release sites.
Result: ~15 lines deleted, simpler code, no accounting bugs.

#### Long-term: If optimization is really needed

Profile to prove the semaphore provides measurable benefit, then implement it correctly with:
But current evidence suggests the optimization is not worth the complexity.

### Conclusion

Our fix (gh-115634) works around this by not trusting the semaphore when replacing dead workers, but doesn't address the root cause. The semaphore remains broken at max capacity - we just avoid the most severe consequences.

The best long-term solution is likely to remove the semaphore entirely, as:
### References

### Acknowledgments
## Summary
Fixes a deadlock in `ProcessPoolExecutor` when using `max_tasks_per_child`. The executor would hang indefinitely after a worker process exited upon reaching its task limit, preventing any remaining queued tasks from being processed.

Note: This (long) description was written by Claude Code (Sonnet 4.5) with minimal editing by me before I opened the Draft PR. I've reviewed the code which is derived from the existing PR and issue discussions and am mostly happy with it. There remains the valid question of "why does this semaphore even exist?" but I'd like to keep the potential removal of that separate from fixing this issue for simplicity's sake. -- @gpshead
## The Bug
When a worker process exits after completing `max_tasks_per_child` tasks, the executor's idle worker accounting becomes incorrect due to stale semaphore values. The semaphore is incremented when tasks complete but isn't decremented when at max capacity. When `_adjust_process_count()` is called to replace the dead worker, it incorrectly believes an idle worker exists (due to the inflated semaphore) and fails to spawn a replacement, causing a deadlock.

Affected versions: Python 3.11, 3.12, 3.13, 3.14, 3.15
Bug introduced in: Python 3.11 (PR #27373 - adding the max_tasks_per_child feature)
## Reproduction
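The original snippet from the issue isn't reproduced here; the following is a hypothetical reproducer assembled from the description in this PR (the worker/task counts and the `work` helper are assumptions):

```python
import multiprocessing as mp
from concurrent.futures import ProcessPoolExecutor


def work(i):
    return i


if __name__ == "__main__":
    # max_tasks_per_child requires a non-"fork" start method.
    ctx = mp.get_context("spawn")
    with ProcessPoolExecutor(max_workers=1, max_tasks_per_child=2,
                             mp_context=ctx) as executor:
        futures = [executor.submit(work, i) for i in range(3)]
        for fut in futures:
            # On affected versions the single worker exits after two tasks,
            # no replacement is spawned, and the third result never arrives.
            print(fut.result())
```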
Result: Prints 0 and 1, then hangs forever waiting for task 2.
## The Fix
This PR implements a refined version of @tabrezm's proposed solution from the issue/PR discussion. The fix adds a `process_died` parameter to `_adjust_process_count()` to distinguish between two calling contexts: process death, where the semaphore might be stale, and normal task submission, where it is accurate.

Key insight: The process count (`len(self._processes)`) is the authoritative source of truth. The semaphore is a derived optimization hint that can become stale. When a process dies, we must trust the authoritative state.

This approach preserves the existing idle worker reuse optimization on the normal submission path while guaranteeing a replacement worker is spawned when a process dies; a minimal sketch of the idea follows.
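A minimal sketch of the behavior described above, assuming the existing private helpers (`self._processes`, `self._max_workers`, `self._idle_worker_semaphore`, `self._spawn_process()`); this illustrates the shape of the change, not the exact diff in this PR:

```python
def _adjust_process_count(self, process_died=False):
    if self._processes is None:
        return

    if process_died and len(self._processes) < self._max_workers:
        # A worker just exited (e.g. it reached max_tasks_per_child). The
        # idle-worker semaphore may be stale here, so trust the authoritative
        # process count and always spawn a replacement when below max_workers.
        self._spawn_process()
        return

    # Normal task-submission path: the semaphore is accurate here, so an
    # available permit means an idle worker can be reused instead of
    # spawning a new process.
    if self._idle_worker_semaphore.acquire(blocking=False):
        return

    if len(self._processes) < self._max_workers:
        self._spawn_process()
```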
## Comparison to PR #115642
The original PR #115642 attempted to fix semaphore accounting by replacing it with a lock-protected counter. However:
Our approach is surgical: fix only the specific case where the semaphore is wrong (process death), leave the normal path unchanged.
## Testing
Added three comprehensive regression tests:
- `test_max_tasks_per_child_rapid_task_submission_gh115634`: Core bug scenario (1 worker, 10 tasks, max_tasks=2)
- `test_max_tasks_per_child_multiple_worker_restarts_gh115634`: Multiple restart cycles (15 tasks, max_tasks=3)
- `test_max_tasks_per_child_multiple_workers_gh115634`: Multiple workers (2 workers, 12 tasks, max_tasks=2)

All tests include:
- `try/finally` blocks
- `support.SHORT_TIMEOUT`

Test results:
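For illustration, a hypothetical sketch of how the first test above could be structured, combining the `try/finally` and `support.SHORT_TIMEOUT` points (the `square` helper, the class name, and the spawn context are assumptions; this is not the code in the PR's diff):

```python
import multiprocessing as mp
import unittest
from concurrent.futures import ProcessPoolExecutor
from test import support


def square(x):
    return x * x


class MaxTasksPerChildDeadlockSketch(unittest.TestCase):
    def test_max_tasks_per_child_rapid_task_submission_gh115634(self):
        # Core bug scenario: 1 worker, 10 tasks, max_tasks_per_child=2.
        executor = ProcessPoolExecutor(
            max_workers=1, max_tasks_per_child=2,
            mp_context=mp.get_context("spawn"))
        try:
            futures = [executor.submit(square, i) for i in range(10)]
            # Without the fix the executor deadlocks once the worker exits;
            # SHORT_TIMEOUT turns that hang into a test failure.
            results = [f.result(timeout=support.SHORT_TIMEOUT) for f in futures]
            # Single list comparison so a failure shows exactly which results differ.
            self.assertEqual(results, [i * i for i in range(10)])
        finally:
            executor.shutdown()


if __name__ == "__main__":
    unittest.main()
```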
## Files Changed
- `Lib/concurrent/futures/process.py` (+19 lines): The fix
- `Lib/test/test_concurrent_futures/test_process_pool.py` (+56 lines): Regression tests
- `Misc/NEWS.d/next/Library/2025-11-02-05-28-56.gh-issue-115634.JbcNnF.rst`: NEWS entry

## Related Issues
## Acknowledgments