Skip to content

Further adventures with dead IO threads and threading timeouts #432

@bitprophet

Description

@bitprophet

This is a follow-up to #351, specifically:

  • The general cause of that issue was errors within IO or other worker threads causing them to stop consuming from a subprocess' pipe, which in turn locked the subprocess once the pipe filled, which then locked up the main Invoke process (as the subprocess never exits).
  • The fix was to add a 1s timeout for (non-stdin) IO thread joins, which appeared sufficient in our test cases.
  • Integration tests were added to exhibit this, specifically barfing out /usr/share/dict/words (~2.4 MB of text)
    • On the side (later? can't recall, too lazy to look up) another test was added that incidentally exposes the same issue - one that deals with Watcher errors, and since Watchers are run in the stdout-handling worker thread body (Runner._handle_output), exceptions from them will have the same effect as an unexpected IO-handling exception.
  • Much later, when performing incidentally similar tests in Fabric 2, I found that large output was getting mysteriously truncated. Immediately thought of "oh this must be a shutdown/flush problem", and yup...it was that 1s thread join timeout! Setting it to None makes the problem mysteriously go away.
    • Generally, the problem is that reading/parsing/etc 2.4 MB of text chunkwise over the network - even to localhost - takes long enough that waiting 1s after process exit is insufficient (In my testing, the thread is naturally joined after about 2-3s, and of course it is reasonable to expect even larger remote jobs to take even longer to wrap up.)
  • So how to reconcile these two things?
    • Is there a better way to deal with the thread death than this timeout? Feels like there should be; we have 'dead thread' detection so wait() ought to be able to continue to cleanup even if the subprocess hasn't finished...
    • Otherwise, we need to deal with it as a regular timeout issue of making it configurable and probably having Fabric 2 default it to a longer time, like 5 or 10 seconds.

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions