Skip to content

Conversation

@aeros
Copy link
Contributor

@aeros aeros commented Sep 20, 2019

This is primarily targeted at addressing the BuildBot (AMD64 FreeBSD) failure for test__xxsubinterpreters:

FAIL: test_still_running (test.test__xxsubinterpreters.DestroyTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/home/buildbot/python/3.x.koobs-freebsd-current/build/Lib/test/test__xxsubinterpreters.py", line 765, in test_still_running
    interpreters.destroy(interp)
AssertionError: RuntimeError not raised

I suspect the failure is occurring when multiple threads are attempting to destroy the same sub-interpreter. To fix this issue, the appropriate solution seems to be ensuring that the current thread attempting to end the sub-interpreter is able to do so without interruption.

From my understanding, this can be accomplished by ensuring the current thread has acquired the GIL. To ensure it is done safely, PyEval_RestoreThread is called only when the thread is not finalizing and has not already acquired the GIL.

Note: This is my first attempt at fixing an issue involving sub-interpreters, and my first fix within the C-API. My main areas of experience thus far have been in the documentation and in exclusively Python code changes. If I have made any mistakes or misunderstandings, I would greatly appreciate it if they could be explained in detail.

https://bugs.python.org/issue38154

@aeros aeros added the type-bug An unexpected behavior, bug, or error label Sep 20, 2019
@aeros
Copy link
Contributor Author

aeros commented Sep 20, 2019

/cc @vstinner @ericsnowcurrently

@aeros
Copy link
Contributor Author

aeros commented Sep 20, 2019

An alternative solution to this issue might be adding a flag to PyInterpreterState named destroy_called and adding a conditional check at the start of interp_destroy() to ensure that it hasn't already been called on a specific sub-interpreter.

Let me know what potential solution would be more viable.

@aeros aeros added skip news and removed skip news labels Sep 20, 2019
@aeros
Copy link
Contributor Author

aeros commented Sep 25, 2019

For the above idea regarding the conditional check to see if interp_destroy was already called on the interpreter, here's where it would be positioned within the function and what it would actually look like:

// Look up the interpreter.
PyInterpreterState *interp = _PyInterpreterID_LookUp(id);
if (interp == NULL) {
    return NULL;
}

//Check if the interpreter is already being destroyed
if (interp->destroy_called) {
        return NULL;
}
interp->destroy_called = 1;

@aeros
Copy link
Contributor Author

aeros commented Oct 7, 2019

An alternative solution to this issue might be adding a flag to PyInterpreterState named destroy_called

If I'm not mistaken, destroy_called would be added to the internal struct for PyInterpreterState, _is:

@ericsnowcurrently
Copy link
Member

Sorry for not getting to this sooner, @aeros. I'm going to start a review right now and try to finish it today.

@ericsnowcurrently
Copy link
Member

BTW, thanks for tackling this!

@ericsnowcurrently
Copy link
Member

@aeros, please update the PR (and the NEWS entry) to relate to https://bugs.python.org/issue37224.

@ericsnowcurrently
Copy link
Member

@aeros, I'm pretty sure this PR will have no effect on the problem. Have you been able to reproduce the problem by running ./python -m test test__xxsubinterpreters -F -j30? Does the problem go away with this change applied?

I'm going to leave a comment on https://bugs.python.org/issue37224, so we can discuss it there.

@aeros
Copy link
Contributor Author

aeros commented Nov 23, 2019

@ericsnowcurrently

Have you been able to reproduce the problem by running ./python -m test test__xxsubinterpreters -F -j30? Does the problem go away with this change applied?

That's been the most tricky part about this particular issue, I was having trouble replicating the test failure in the first place in my local environment (OS: Arch Linux, Kernel: Linux 5.3.11, CPU: Intel i5-4460):

$ ./python -m test test__xxsubinterpreters -j100 -F
...
== Tests result: INTERRUPTED ==
Test suite interrupted by signal SIGINT.

1000 tests OK.

Perhaps I could try running the tests for a few hours instead of SIGINTing after 1k iterations or set up a FreeBSD virtual machine and run the tests again, to replicate the environment where the failure occurred as much as possible.

Also, I figured that you might have an idea of whether or not the proposed solution (in the PR or in the comments above) could help with the underlying issue. Admittedly this was a bit of a stab in the dark, which is why I said that I wasn't confident about the proposed solution. Looking back in retrospect, I think that I should have focused more on replicating the issue in the first place, but at the time (~Sep 20th) it had been my first attempt at working on a complex race condition.

I'm going to leave a comment on https://bugs.python.org/issue37224, so we can discuss it there.

Thanks for the review, I'll try to do some further investigation. I'll continue the discussion in the bpo issue.

@aeros
Copy link
Contributor Author

aeros commented Nov 23, 2019

@ericsnowcurrently

please update the PR (and the NEWS entry) to relate to https://bugs.python.org/issue37224.

If it's okay, I'd prefer to wait on fixing the Misc/NEWS entry (and improving upon it) until after we're certain of the exact cause and solution for the issue, so that I don't have to continually revise it. I can certainly adjust the title of the PR though.

@aeros aeros changed the title bpo-38154: Fix test__xxsubinterpreters failure and GIL handling in interp_destroy() bpo-37224: Fix test__xxsubinterpreters failure and GIL handling in interp_destroy() Nov 23, 2019
@aeros
Copy link
Contributor Author

aeros commented Dec 14, 2019

Closing this PR, as I've confirmed that it does not correctly address the test failure. I've found a better solution (more details in https://bugs.python.org/issue37224) that I'll likely open as a PR within the next week or so, after further testing.

@aeros aeros closed this Dec 14, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

awaiting review type-bug An unexpected behavior, bug, or error

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants