-
-
Notifications
You must be signed in to change notification settings - Fork 33.7k
bpo-37224: Fix test__xxsubinterpreters failure and GIL handling in interp_destroy() #16293
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
An alternative solution to this issue might be adding a flag to Let me know what potential solution would be more viable. |
|
For the above idea regarding the conditional check to see if // Look up the interpreter.
PyInterpreterState *interp = _PyInterpreterID_LookUp(id);
if (interp == NULL) {
return NULL;
}
//Check if the interpreter is already being destroyed
if (interp->destroy_called) {
return NULL;
}
interp->destroy_called = 1; |
If I'm not mistaken, cpython/Include/internal/pycore_pystate.h Line 60 in 9e71917
|
|
Sorry for not getting to this sooner, @aeros. I'm going to start a review right now and try to finish it today. |
|
BTW, thanks for tackling this! |
|
@aeros, please update the PR (and the NEWS entry) to relate to https://bugs.python.org/issue37224. |
|
@aeros, I'm pretty sure this PR will have no effect on the problem. Have you been able to reproduce the problem by running I'm going to leave a comment on https://bugs.python.org/issue37224, so we can discuss it there. |
That's been the most tricky part about this particular issue, I was having trouble replicating the test failure in the first place in my local environment (OS: Arch Linux, Kernel: Linux 5.3.11, CPU: Intel i5-4460): $ ./python -m test test__xxsubinterpreters -j100 -F
...
== Tests result: INTERRUPTED ==
Test suite interrupted by signal SIGINT.
1000 tests OK.Perhaps I could try running the tests for a few hours instead of SIGINTing after 1k iterations or set up a FreeBSD virtual machine and run the tests again, to replicate the environment where the failure occurred as much as possible. Also, I figured that you might have an idea of whether or not the proposed solution (in the PR or in the comments above) could help with the underlying issue. Admittedly this was a bit of a stab in the dark, which is why I said that I wasn't confident about the proposed solution. Looking back in retrospect, I think that I should have focused more on replicating the issue in the first place, but at the time (~Sep 20th) it had been my first attempt at working on a complex race condition.
Thanks for the review, I'll try to do some further investigation. I'll continue the discussion in the bpo issue. |
If it's okay, I'd prefer to wait on fixing the Misc/NEWS entry (and improving upon it) until after we're certain of the exact cause and solution for the issue, so that I don't have to continually revise it. I can certainly adjust the title of the PR though. |
|
Closing this PR, as I've confirmed that it does not correctly address the test failure. I've found a better solution (more details in https://bugs.python.org/issue37224) that I'll likely open as a PR within the next week or so, after further testing. |
This is primarily targeted at addressing the BuildBot (AMD64 FreeBSD) failure for
test__xxsubinterpreters:I suspect the failure is occurring when multiple threads are attempting to destroy the same sub-interpreter. To fix this issue, the appropriate solution seems to be ensuring that the current thread attempting to end the sub-interpreter is able to do so without interruption.
From my understanding, this can be accomplished by ensuring the current thread has acquired the GIL. To ensure it is done safely,
PyEval_RestoreThreadis called only when the thread is not finalizing and has not already acquired the GIL.Note: This is my first attempt at fixing an issue involving sub-interpreters, and my first fix within the C-API. My main areas of experience thus far have been in the documentation and in exclusively Python code changes. If I have made any mistakes or misunderstandings, I would greatly appreciate it if they could be explained in detail.
https://bugs.python.org/issue38154