This is a follow-up to #351, specifically:
- The general cause of that issue was errors within IO or other worker threads causing them to stop consuming from a subprocess' pipe; once the pipe filled, the subprocess blocked, which in turn hung the main Invoke process (as the subprocess never exits). A minimal reproduction appears after this list.
- The fix was to add a 1s timeout for (non-stdin) IO thread joins, which appeared sufficient in our test cases.
- Integration tests were added to exhibit this, specifically barfing out `/usr/share/dict/words` (~2.4 MB of text).
- On the side (later? can't recall, too lazy to look up) another test was added that incidentally exposes the same issue - one that deals with Watcher errors, and since Watchers are run in the stdout-handling worker thread body (`Runner._handle_output`), exceptions from them will have the same effect as an unexpected IO-handling exception.
- Much later, when performing incidentally similar tests in Fabric 2, I found that large output was getting mysteriously truncated. Immediately thought "oh this must be a shutdown/flush problem", and yup...it was that 1s thread join timeout! Setting it to `None` makes the problem mysteriously go away. (The second sketch after this list simulates the truncation.)
- Generally, the problem is that reading/parsing/etc 2.4 MB of text chunkwise over the network - even to localhost - takes long enough that waiting 1s after process exit is insufficient. (In my testing, the thread is naturally joined after about 2-3s, and of course it is reasonable to expect even larger remote jobs to take even longer to wrap up.)
- So how to reconcile these two things?
  - Is there a better way to deal with the thread death than this timeout? Feels like there should be; we have 'dead thread' detection, so `wait()` ought to be able to continue to cleanup even if the subprocess hasn't finished... (the third sketch after this list outlines this approach).
  - Otherwise, we need to deal with it as a regular timeout issue: make it configurable, and probably have Fabric 2 default it to something longer, like 5 or 10 seconds.
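
For illustration, here's a minimal standalone reproduction of the #351 failure mode - not Invoke's actual internals; the broken reader stands in for a crashed IO worker or Watcher:

```python
import subprocess
import threading

# The child produces ~2.4 MB, far more than an OS pipe buffer holds.
proc = subprocess.Popen(
    ["cat", "/usr/share/dict/words"], stdout=subprocess.PIPE
)

def broken_reader():
    proc.stdout.read(1024)      # consume a single chunk...
    raise RuntimeError("boom")  # ...then die, as a buggy Watcher would

t = threading.Thread(target=broken_reader, daemon=True)
t.start()
t.join()     # returns immediately - the thread is already dead
proc.wait()  # hangs forever: the child is blocked write()-ing into the full pipe
```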
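And the flip side, a simulation of the Fabric 2 truncation: here the child exits quickly, but draining/parsing its output takes longer than 1s (the `time.sleep` stands in for chunkwise reads/parsing over the network), so `join(timeout=1)` abandons a perfectly healthy thread mid-drain. Again illustrative only, not Fabric's real code:

```python
import subprocess
import threading
import time

proc = subprocess.Popen(
    ["cat", "/usr/share/dict/words"], stdout=subprocess.PIPE
)
captured = []

def slow_reader():
    data = proc.stdout.read()  # drain the pipe quickly so the child can exit
    for i in range(0, len(data), 65536):
        time.sleep(0.1)        # stand-in for chunkwise parsing/Watcher work
        captured.append(data[i:i + 65536])

t = threading.Thread(target=slow_reader, daemon=True)
t.start()
proc.wait()        # returns almost immediately; output is fully buffered
t.join(timeout=1)  # ...but the drain takes ~2-3s, so this gives up partway
print(sum(map(len, captured)), "bytes captured - short of the full ~2.4 MB")
```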
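Finally, a rough sketch of the 'dead thread detection' alternative from the last bullets: unbounded joins for healthy threads (so no truncation), with a crashed worker - rather than a wall-clock timeout - being what lets `wait()` bail out and clean up. `ExceptionTrackingThread` and `wait()` here are hypothetical stand-ins, not Invoke's actual API:

```python
import threading
import time

class ExceptionTrackingThread(threading.Thread):
    """Thread subclass that records any exception raised by its body."""
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.exception = None

    def run(self):
        try:
            super().run()
        except BaseException as e:
            self.exception = e  # flags this worker as 'dead'

def wait(proc, io_threads, poll=0.05):
    # Return once the subprocess exits, OR bail early if any IO worker has
    # died - the child may be wedged writing into a full pipe, so reap it
    # instead of spinning forever (the #351 hang).
    while proc.poll() is None:
        if any(t.exception is not None for t in io_threads):
            proc.kill()
            proc.wait()
            break
        time.sleep(poll)
    # Now every join terminates without a timeout: dead threads join
    # instantly, and healthy ones hit EOF once the child is gone, so no
    # output gets truncated no matter how long the drain takes.
    for t in io_threads:
        t.join()
```

The obvious caveat is a reader blocked on a pipe kept open by orphaned grandchild processes - that case never hits EOF, so some bounded escape hatch might still be needed there.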