bpo-37224: Fix test__xxsubinterpreters failure and GIL handling in interp_destroy() #16293

aeros · 2019-09-20T08:08:32Z

This is primarily targeted at addressing the BuildBot (AMD64 FreeBSD) failure for test__xxsubinterpreters:

FAIL: test_still_running (test.test__xxsubinterpreters.DestroyTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/home/buildbot/python/3.x.koobs-freebsd-current/build/Lib/test/test__xxsubinterpreters.py", line 765, in test_still_running
    interpreters.destroy(interp)
AssertionError: RuntimeError not raised

I suspect the failure is occurring when multiple threads are attempting to destroy the same sub-interpreter. To fix this issue, the appropriate solution seems to be ensuring that the current thread attempting to end the sub-interpreter is able to do so without interruption.

From my understanding, this can be accomplished by ensuring the current thread has acquired the GIL. To ensure it is done safely, PyEval_RestoreThread is called only when the thread is not finalizing and has not already acquired the GIL.

Note: This is my first attempt at fixing an issue involving sub-interpreters, and my first fix within the C-API. My main areas of experience thus far have been in the documentation and in exclusively Python code changes. If I have made any mistakes or misunderstandings, I would greatly appreciate it if they could be explained in detail.

https://bugs.python.org/issue38154

aeros · 2019-09-20T08:11:39Z

/cc @vstinner @ericsnowcurrently

aeros · 2019-09-20T08:16:04Z

An alternative solution to this issue might be adding a flag to PyInterpreterState named destroy_called and adding a conditional check at the start of interp_destroy() to ensure that it hasn't already been called on a specific sub-interpreter.

Let me know what potential solution would be more viable.

aeros · 2019-09-25T19:52:42Z

For the above idea regarding the conditional check to see if interp_destroy was already called on the interpreter, here's where it would be positioned within the function and what it would actually look like:

// Look up the interpreter.
PyInterpreterState *interp = _PyInterpreterID_LookUp(id);
if (interp == NULL) {
    return NULL;
}

//Check if the interpreter is already being destroyed
if (interp->destroy_called) {
        return NULL;
}
interp->destroy_called = 1;

aeros · 2019-10-07T06:00:17Z

An alternative solution to this issue might be adding a flag to PyInterpreterState named destroy_called

If I'm not mistaken, destroy_called would be added to the internal struct for PyInterpreterState, _is:

cpython/Include/internal/pycore_pystate.h

Line 60 in 9e71917

struct _is {

ericsnowcurrently · 2019-11-22T23:47:07Z

Sorry for not getting to this sooner, @aeros. I'm going to start a review right now and try to finish it today.

ericsnowcurrently · 2019-11-22T23:48:38Z

BTW, thanks for tackling this!

ericsnowcurrently · 2019-11-22T23:59:57Z

@aeros, please update the PR (and the NEWS entry) to relate to https://bugs.python.org/issue37224.

ericsnowcurrently · 2019-11-23T00:07:40Z

@aeros, I'm pretty sure this PR will have no effect on the problem. Have you been able to reproduce the problem by running ./python -m test test__xxsubinterpreters -F -j30? Does the problem go away with this change applied?

I'm going to leave a comment on https://bugs.python.org/issue37224, so we can discuss it there.

aeros · 2019-11-23T05:33:32Z

@ericsnowcurrently

Have you been able to reproduce the problem by running ./python -m test test__xxsubinterpreters -F -j30? Does the problem go away with this change applied?

That's been the most tricky part about this particular issue, I was having trouble replicating the test failure in the first place in my local environment (OS: Arch Linux, Kernel: Linux 5.3.11, CPU: Intel i5-4460):

$ ./python -m test test__xxsubinterpreters -j100 -F
...
== Tests result: INTERRUPTED ==
Test suite interrupted by signal SIGINT.

1000 tests OK.

Perhaps I could try running the tests for a few hours instead of SIGINTing after 1k iterations or set up a FreeBSD virtual machine and run the tests again, to replicate the environment where the failure occurred as much as possible.

Also, I figured that you might have an idea of whether or not the proposed solution (in the PR or in the comments above) could help with the underlying issue. Admittedly this was a bit of a stab in the dark, which is why I said that I wasn't confident about the proposed solution. Looking back in retrospect, I think that I should have focused more on replicating the issue in the first place, but at the time (~Sep 20th) it had been my first attempt at working on a complex race condition.

I'm going to leave a comment on https://bugs.python.org/issue37224, so we can discuss it there.

Thanks for the review, I'll try to do some further investigation. I'll continue the discussion in the bpo issue.

aeros · 2019-11-23T06:54:34Z

@ericsnowcurrently

please update the PR (and the NEWS entry) to relate to https://bugs.python.org/issue37224.

If it's okay, I'd prefer to wait on fixing the Misc/NEWS entry (and improving upon it) until after we're certain of the exact cause and solution for the issue, so that I don't have to continually revise it. I can certainly adjust the title of the PR though.

aeros · 2019-12-14T04:11:28Z

Closing this PR, as I've confirmed that it does not correctly address the test failure. I've found a better solution (more details in https://bugs.python.org/issue37224) that I'll likely open as a PR within the next week or so, after further testing.

aeros added 3 commits September 20, 2019 03:41

Fix destroy_interp() GIL handling

136d307

Adjust interp_destroy() comment

cfbcddf

Fix whitespace in interp_destroy()

4a28d9a

the-knights-who-say-ni added the CLA signed label Sep 20, 2019

bedevere-bot added the awaiting review label Sep 20, 2019

aeros added the type-bug An unexpected behavior, bug, or error label Sep 20, 2019

aeros requested a review from ericsnowcurrently September 20, 2019 08:11

aeros added skip news and removed skip news labels Sep 20, 2019

📜🤖 Added by blurb_it.

4e7fd63

aeros changed the title ~~bpo-38154: Fix test__xxsubinterpreters failure and GIL handling in interp_destroy()~~ bpo-37224: Fix test__xxsubinterpreters failure and GIL handling in interp_destroy() Nov 23, 2019

aeros closed this Dec 14, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

bpo-37224: Fix test__xxsubinterpreters failure and GIL handling in interp_destroy() #16293

bpo-37224: Fix test__xxsubinterpreters failure and GIL handling in interp_destroy() #16293

Uh oh!

aeros commented Sep 20, 2019 •

edited by bedevere-bot

Loading

Uh oh!

aeros commented Sep 20, 2019

Uh oh!

aeros commented Sep 20, 2019 •

edited

Loading

Uh oh!

aeros commented Sep 25, 2019 •

edited

Loading

Uh oh!

aeros commented Oct 7, 2019 •

edited

Loading

Uh oh!

ericsnowcurrently commented Nov 22, 2019

Uh oh!

ericsnowcurrently commented Nov 22, 2019

Uh oh!

ericsnowcurrently commented Nov 22, 2019

Uh oh!

ericsnowcurrently commented Nov 23, 2019

Uh oh!

aeros commented Nov 23, 2019 •

edited

Loading

Uh oh!

aeros commented Nov 23, 2019 •

edited

Loading

Uh oh!

aeros commented Dec 14, 2019

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Uh oh!

bpo-37224: Fix test__xxsubinterpreters failure and GIL handling in interp_destroy() #16293

bpo-37224: Fix test__xxsubinterpreters failure and GIL handling in interp_destroy() #16293

Uh oh!

Conversation

aeros commented Sep 20, 2019 • edited by bedevere-bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

aeros commented Sep 20, 2019

Uh oh!

aeros commented Sep 20, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

aeros commented Sep 25, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

aeros commented Oct 7, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ericsnowcurrently commented Nov 22, 2019

Uh oh!

ericsnowcurrently commented Nov 22, 2019

Uh oh!

ericsnowcurrently commented Nov 22, 2019

Uh oh!

ericsnowcurrently commented Nov 23, 2019

Uh oh!

aeros commented Nov 23, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

aeros commented Nov 23, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

aeros commented Dec 14, 2019

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

aeros commented Sep 20, 2019 •

edited by bedevere-bot

Loading

aeros commented Sep 20, 2019 •

edited

Loading

aeros commented Sep 25, 2019 •

edited

Loading

aeros commented Oct 7, 2019 •

edited

Loading

aeros commented Nov 23, 2019 •

edited

Loading

aeros commented Nov 23, 2019 •

edited

Loading