Skip to content

Conversation

@eduardo-elizondo
Copy link
Contributor

@eduardo-elizondo eduardo-elizondo commented Jan 17, 2020

Currently, during runtime destruction, _PyImport_Cleanup is clearing the interpreter state before clearing out the modules themselves. This leads to a segfault on modules that rely on the module state to clear themselves up.

For example, let's take the small snippet added in the issue by @DinoV :

import _struct

class C:
    def __init__(self):
        self.pack = _struct.pack
    def __del__(self):
        self.pack('I', -42)

_struct.x = C()

The module _struct uses the module state to run pack. Therefore, the module state has to be alive until after the module has been cleared out to successfully run C.__del__. This happens at line 606, when _PyImport_Cleanup calls _PyModule_Clear. In fact, the loop that calls _PyModule_Clear has in its comments:

Now, if there are any modules left alive, clear their globals to minimize potential leaks. All C extension modules actually end up here, since they are kept alive in the interpreter state.

That means that we can't clear the module state (which is used by C Extensions) before we run that loop.

Moving _PyInterpreterState_ClearModules until after it, fixes the segfault in the code snippet.

Finally, this updates a test in io to correctly assert the error that it now throws (since it now finds the io module state). The test that uses this is: test_create_at_shutdown_without_encoding. Given this test is now working is a proof that the module state now stays alive even when __del__ is called at module destruction time. Thus, I didn't add a new tests for this.

https://bugs.python.org/issue38076

Automerge-Triggered-By: @encukou

@eduardo-elizondo eduardo-elizondo changed the title bpo-38076 Clear the interpreter state only after clearing module globals [WIP - Waiting for Tests]bpo-38076 Clear the interpreter state only after clearing module globals Jan 17, 2020
@eduardo-elizondo eduardo-elizondo changed the title [WIP - Waiting for Tests]bpo-38076 Clear the interpreter state only after clearing module globals bpo-38076 Clear the interpreter state only after clearing module globals Jan 17, 2020
@pablogsal pablogsal added the 🔨 test-with-buildbots Test PR w/ buildbots; report in status section label Jan 17, 2020
@bedevere-bot
Copy link

🤖 New build scheduled with the buildbot fleet by @pablogsal for commit f2cfc13 🤖

If you want to schedule another build, you need to add the ":hammer: test-with-buildbots" label again.

@bedevere-bot bedevere-bot removed the 🔨 test-with-buildbots Test PR w/ buildbots; report in status section label Jan 17, 2020
@DinoV
Copy link
Contributor

DinoV commented Jan 17, 2020

I'm about to hop on a plane so I can't check it now but I'm curious what happens with:

import _struct, sys

class C:
    def __init__(self):
        self.pack = _struct.pack
    def __del__(self):
        self.pack('I', -42)

sys.x = C()

@pablogsal
Copy link
Member

I'm about to hop on a plane so I can't check it now but I'm curious what happens with:

Still segfaults on my system:

❯ cat question.py  
import _struct, sys

class C:
    def __init__(self):
        self.pack = _struct.pack
    def __del__(self):
        self.pack('I', -42)

sys.x = C()

~/github/python/master runtime-destruction-fix*
❯ ./python question.py 
[1]    25816 segmentation fault (core dumped)  ./python question.py


(gdb) bt
#0  PyModule_GetState (m=0x0) at Objects/moduleobject.c:565
#1  0x00007ffff7fa1fd8 in cache_struct_converter (fmt='I', ptr=ptr@entry=0x7fffffffd170)
    at /home/pablogsal/github/python/master/Modules/_struct.c:2100
#2  0x00007ffff7fa22a8 in pack (self=<optimized out>, args=0x7ffff788a550, nargs=2)
    at /home/pablogsal/github/python/master/Modules/_struct.c:2162
#3  0x00005555557827fa in cfunction_vectorcall_FASTCALL (func=<built-in method pack of module object at remote 0x7ffff7821530>, 
    args=0x7ffff788a550, nargsf=<optimized out>, kwnames=<optimized out>) at Objects/methodobject.c:366
#4  0x000055555567d102 in _PyObject_VectorcallTstate (kwnames=0x0, nargsf=9223372036854775810, args=0x7ffff788a550, 
    callable=<built-in method pack of module object at remote 0x7ffff7821530>, tstate=0x555555952fc0) at ./Include/cpython/abstract.h:111
#5  _PyObject_Vectorcall (kwnames=0x0, nargsf=9223372036854775810, args=0x7ffff788a550, 
    callable=<built-in method pack of module object at remote 0x7ffff7821530>) at ./Include/cpython/abstract.h:120
#6  call_function (kwnames=0x0, oparg=<optimized out>, pp_stack=<synthetic pointer>, tstate=0x555555952fc0) at Python/ceval.c:4931
#7  _PyEval_EvalFrameDefault (
    f=Frame 0x7ffff788a3d0, for file /home/pablogsal/github/python/master/question.py, line 7, in __del__ (self=<C(pack=<built-in method pack of module object at remote 0x7ffff7821530>) at remote 0x7ffff7792140>), throwflag=<optimized out>) at Python/ceval.c:3383
#8  0x00005555555bd12d in _PyEval_EvalFrame (throwflag=0, f=0x7ffff788a3d0, tstate=0x555555952fc0) at ./Include/internal/pycore_ceval.h:43
#9  function_code_fastcall (tstate=0x555555952fc0, co=<optimized out>, args=0x7fffffffd390, nargs=1, globals=<optimized out>)
    at Objects/call.c:327
#10 0x00005555555bda74 in _PyFunction_Vectorcall (func=<optimized out>, stack=<optimized out>, nargsf=<optimized out>, 
    kwnames=<optimized out>) at Objects/call.c:364
#11 0x0000555555612336 in _PyObject_VectorcallTstate (kwnames=0x0, nargsf=9223372036854775809, args=0x7fffffffd388, 
    callable=<function at remote 0x7ffff77cbd70>, tstate=0x555555952fc0) at ./Include/cpython/abstract.h:111
#12 _PyObject_CallOneArg (arg=<unknown at remote 0x555555952fc0>, func=<function at remote 0x7ffff77cbd70>)
    at ./Include/cpython/abstract.h:162
#13 call_unbound_noarg (unbound=<optimized out>, func=func@entry=<function at remote 0x7ffff77cbd70>, 
    self=self@entry=<C(pack=<built-in method pack of module object at remote 0x7ffff7821530>) at remote 0x7ffff7792140>)
    at Objects/typeobject.c:1466
#14 0x000055555561d098 in slot_tp_finalize (
    self=<C(pack=<built-in method pack of module object at remote 0x7ffff7821530>) at remote 0x7ffff7792140>) at Objects/typeobject.c:6858
#15 0x00005555555f87cf in PyObject_CallFinalizer (
    self=self@entry=<C(pack=<built-in method pack of module object at remote 0x7ffff7821530>) at remote 0x7ffff7792140>)
    at Objects/object.c:311
#16 0x00005555555fb1e2 in PyObject_CallFinalizerFromDealloc (
    self=self@entry=<C(pack=<built-in method pack of module object at remote 0x7ffff7821530>) at remote 0x7ffff7792140>)
    at Objects/object.c:329
#17 0x00005555556144fc in subtype_dealloc (
    self=<C(pack=<built-in method pack of module object at remote 0x7ffff7821530>) at remote 0x7ffff7792140>) at Objects/typeobject.c:1220
#18 0x00005555555f9d20 in _Py_Dealloc (op=<optimized out>) at Objects/object.c:2291
#19 0x00005555555eae7d in _Py_DECREF (op=<optimized out>, lineno=546, filename=0x5555557cb173 "./Include/object.h") at ./Include/object.h:479
#20 _Py_XDECREF (op=<optimized out>) at ./Include/object.h:546
#21 insertdict (mp=mp@entry=0x7ffff78d2650, key=key@entry='x', hash=6776896660283084101, value=value@entry=None) at Objects/dictobject.c:1102
#22 0x00005555555ebe4c in PyDict_SetItem (
    op=op@entry={'__name__': None, '__doc__': None, '__package__': None, '__loader__': None, '__spec__': None, 'addaudithook': None, 'audit': None, 'breakpointhook': None, '_clear_type_cache': None, '_current_frames': None, 'displayhook': None, 'exc_info': None, 'excepthook': None, 'exit': None, 'getdefaultencoding': None, 'getdlopenflags': None, 'getallocatedblocks': None, 'getfilesystemencoding': None, 'getfilesystemencodeerrors': None, 'gettotalrefcount': None, 'getrefcount': None, 'getrecursionlimit': None, 'getsizeof': None, '_getframe': None, 'intern': None, 'is_finalizing': None, 'setswitchinterval': None, 'getswitchinterval': None, 'setdlopenflags': None, 'setprofile': None, 'getprofile': None, 'setrecursionlimit': None, 'settrace': None, 'gettrace': None, 'call_tracing': None, '_debugmallocstats': None, 'set_coroutine_origin_tracking_depth': None, 'get_coroutine_origin_tracking_depth': None, 'set_asyncgen_hooks': None, 'get_asyncgen_hooks': None, 'unraisablehook': None, 'modules': None, 'stderr': None, '__stderr__':...(truncated), key='x', value=None) at Objects/dictobject.c:1545
#23 0x00005555555f7ee6 in _PyModule_ClearDict (
    d={'__name__': None, '__doc__': None, '__package__': None, '__loader__': None, '__spec__': None, 'addaudithook': None, 'audit': None, 'breakpointhook': None, '_clear_type_cache': None, '_current_frames': None, 'displayhook': None, 'exc_info': None, 'excepthook': None, 'exit': None, 'getdefaultencoding': None, 'getdlopenflags': None, 'getallocatedblocks': None, 'getfilesystemencoding': None, 'getfilesystemencodeerrors': None, 'gettotalrefcount': None, 'getrefcount': None, 'getrecursionlimit': None, 'getsizeof': None, '_getframe': None, 'intern': None, 'is_finalizing': None, 'setswitchinterval': None, 'getswitchinterval': None, 'setdlopenflags': None, 'setprofile': None, 'getprofile': None, 'setrecursionlimit': None, 'settrace': None, 'gettrace': None, 'call_tracing': None, '_debugmallocstats': None, 'set_coroutine_origin_tracking_depth': None, 'get_coroutine_origin_tracking_depth': None, 'set_asyncgen_hooks': None, 'get_asyncgen_hooks': None, 'unraisablehook': None, 'modules': None, 'stderr': None, '__stderr__':...(truncated)) at Objects/moduleobject.c:629
#24 0x000055555569f898 in _PyImport_Cleanup (tstate=tstate@entry=0x555555952fc0) at Python/import.c:617
#25 0x00005555556b8108 in Py_FinalizeEx () at Python/pylifecycle.c:1412
#26 0x00005555555adc58 in Py_RunMain () at Modules/main.c:634
#27 0x00005555555adccd in pymain_main (args=args@entry=0x7fffffffd670) at Modules/main.c:662
#28 0x00005555555add91 in Py_BytesMain (argc=<optimized out>, argv=<optimized out>) at Modules/main.c:686
--Type <RET> for more, q to quit, c to continue without paging--
#29 0x00005555555ac6d2 in main (argc=<optimized out>, argv=<optimized out>) at ./Programs/python.c:16


@eduardo-elizondo
Copy link
Contributor Author

eduardo-elizondo commented Jan 17, 2020

That of course fails since we are clearing the module state before we clear up the sys globals. You bring up a good point though. I'll just push this past the clearing of every module dictionary.

I just tested that locally and it worked, even on your new example. I will add a new test and update the PR

@DinoV
Copy link
Contributor

DinoV commented Jan 17, 2020

Another possibility would be restoring sys to its original form like we do builtins: https://github.com/eduardo-elizondo/cpython/blob/f2cfc13f65e94f755b5d47218f4479bca30118fc/Python/import.c#L564

And then collecting and then killing off sys/builtin and collecting again.

Copy link
Contributor

@DinoV DinoV left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It'd be good to add the various test cases in the bug to the PR as well.

@eduardo-elizondo
Copy link
Contributor Author

@DinoV given that sys is also treated specially just like builtins, it might make sense to also add this. Also, agreed now that you've pointed out a couple of other edge cases, I'll add more tests.

@eduardo-elizondo
Copy link
Contributor Author

eduardo-elizondo commented Jan 18, 2020

Actually, I decided against creating a sysdict copy to avoid increasing the heap memory needed for the interpreter state. Just pushing further back the module state clearing does the trick.

By the way, as you might have noticed in the sys test, C.__del__ is not called when sys holds an instance to it. This is the normal behavior even before any changes were added related to the module state.

}
_PyModule_ClearDict(interp->builtins);

/* Clear module dict copies stored in the interpreter state */
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm tempted to change this comment to: "Clear module dict copies stored in the interpreter state only after all the modules have been cleared" or something along those lines.

@pablogsal pablogsal self-requested a review January 18, 2020 00:36
@pablogsal pablogsal self-assigned this Jan 18, 2020
@Jongy
Copy link
Contributor

Jongy commented Jan 18, 2020

Wow. Funny story.

I was debugging some Python crash I got lately, and realized it has something to do with mishandled refcounts. I hooked _Py_Dealloc, logged all freed objects and identified the object freed incorrectly.

After "solving" that with a temporary Py_INCREF(op); return from dealloc, I got another unrelated crash in which it seemed like the PyModuleObject of _struct is missing? PyState_FindModule returnsNULL. So I started tracing how did this happen using the _Py_Dealloc hook, but decided to stop for today because it's getting late here.

Just before getting off the computer, I hopped by the latest PRs of CPython, to see what kind of interpreter bugs people usually encounter and fix. And so it happens the very first PR I stumble upon is this exact problem. Luckiest click ever.

TL;DR it was funny and I can confirm moving _PyInterpreterState_ClearModules down 50 lines fixes the problem for me 🤷‍♂️

@brettcannon brettcannon removed their request for review January 20, 2020 22:19
@DinoV DinoV requested review from nascheme and pitrou and removed request for pitrou January 22, 2020 19:25
@nascheme
Copy link
Member

nascheme commented Jan 24, 2020

Thank you for the patch Eddie. This is a dark and ugly corner of CPython, as you probably realized. Over the years, many revisions have been made to the interpreter shutdown logic. E.g. doing things in various orders, adding extra steps, GC passes, etc. I think in nearly each revision, someone had an example where the old shutdown process produced a behavior they didn't like and they were motivated to tweak until their certain problem went away. It's a bit like playing "wack a mole", hammer one bug down, another pops up. Not saying your change is bad or wrong, I'm just giving you some background history. I suspect it is possible your change will result in someone else's program failing to shutdown gracefully. E.g. somehow they depend on interpreter state getting cleared first. Based on a quick review, the ordering you propose looks more sensible and robust.

I've been thinking about a better way to do shutdown for a while now. To me, it seems intimately tied to the GC and how we handle finalizers. I.e. if there wasn't any finalizers, it should not matter what order we destroy things. Could we use gcmodule.c finalize_garbage() and handle_legacy_finalizers() to deal with finalizers and then not have them executing while we are destroying things? Making shutdown work that way would be a big change.

@pablogsal
Copy link
Member

Could we use gcmodule.c finalize_garbage() and handle_legacy_finalizers() to deal with finalizers and then not have them executing while we are destroying things? Making shutdown work that way would be a big change.

I was recently working on a prototype for this idea based on the desired that the maximum amount of objects are destroyed when the interpreter ends. I could open an issue so we could evaluate that

@nascheme
Copy link
Member

I would be interested to see that prototype Pablo. I don't know what we should do with this PR then. Based on a quick review, it looks sane to me but I don't feel confident I understand the full impact of the change. It is sort of a "band aid" fix and I'm okay with that if it improves things overall. Fixing a segfault is certainly a good thing.

Copy link
Member

@encukou encukou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on a quick review, it looks sane to me but I don't feel confident I understand the full impact of the change. It is sort of a "band aid" fix and I'm okay with that if it improves things overall. Fixing a segfault is certainly a good thing.

I'm in the same position.
Let's apply the band-aid now, so it gets real-world testing in the alphas.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants