-
Notifications
You must be signed in to change notification settings - Fork 384
Prevent UnicodeErrors when forwarding child stdout to the console #242
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
In invoke.runners.Local, the stdout/stderr of child processes is read and forwarded to the console (i.e. invoke's stdout/stderr). The incoming data is read with locale.getdefaultencoding(), while output happens with sys.stdout.encoding. If these two encodings differ, it may not be possible to write the received data to stdout, e.g. if an UTF-8 Yen character is received, but sys.stdout.encoding is 'latin1' or 'ascii'. If that happens, the resulting exception causes the thread to end, and the child process hangs due to the child filling its output buffer. This is mostly a concern on Windows, where the preferred encoding and the console I/O encoding often differ. Cf. #241
|
This doesn't work because the test stdio streams don't support writing to the underlying bytes buffer. And it's wrong anyway, as by writing raw bytes to (the buffer under) a text stream you can end up with a stream with mixed encodings. What you're trying to do is replace unencodable bytes. So you should do exactly that: clean_data = data.encode(dst.encoding, errors='replace').decode(dst.encoding)
dst.write(clean_data)That should work for any text stream with an encoding. |
Avoid accessing sys.stdout.buffer to write raw bytes. Manually decode the bytes and write them as string instead. This approach avoids problems during testing, when sys.stdout can be mocked.
|
Thanks for the suggestions, @pfmoore. I updated the PR. The tests are failing because |
|
Hey, as per my note in the other ticket, you should wait a couple days (hopefully no more than that...) for me to finish and merge the big run() reorganization, it will at least partly obviate your PR and you'll want to update it again for that. I am +1 on allowing Spec to learn about encoding, it's been privileged to remain ignorant of such things prior to now, but just because I personally haven't needed it. |
|
The following would probably be a good utility function for "write some text without the possibility of errors (in human readable form)" def safe_write(data, file=sys.stdout, errors='backslashreplace'):
if hasattr(file, 'encoding') and getattr(file, 'errors') == 'strict':
# If the file object is encoding text in strict mode, make sure
# no unencodable characters are in the data being written, by
# encoding with a more permissive error handling method, and then
# decoding back to text
# Python 2.6 compatibility - don't use a keyword argument for the
# 'errors' parameter to encode.
data = data.encode(file.encoding, errors).decode(file.encoding)
# Data should now only contain characters that can be safely written
file.write(data)There should probably be some specific unit tests for this function, as the way the test suite captures output means that the existing tests won't cover how a "real" stdout (strict mode encoding) would behave. Assuming the incoming changes don't remove the need for this, we could use this when writing subprocess output to The assumption here is that subprocess output is only written to |
|
@pfmoore Unfortunately the genesis of the changes to merge, was to add output stream selection support! I.e. let folks drop in some arbitrary file-like object to be written to, defaulting to sys.stdout/stderr. (Also, I can guarantee some users will be wanting to pipe invoke output elsewhere, either via that feature or via their shell. Not the default use case, but a not-uncommon one.) So we may need to make the encoding process more selectable on top of that, perhaps? (I.e. have one or more methods sort of like this, which can be swapped out depending on stream or user needs/desires). Let's revisit once I've merged that stuff and we'll see how it shakes out. |
|
@bitprophet Ah, the irony :-) Agreed, let's leave it for now. When we get to it, there's probably a good argument for treating sys.stdout as a special case (maybe with an isatty test added as well). I'm not actually convinced that Python's choice to use the strict error handler for console output is such a good idea in the first place... |
|
That big ol' ticket has been merged to Invoke master now, btw. I'm context switching to Fabric 2 dev to make use of the new class structure & functionality, hopefully I won't find more changes that need doing - but if I do they'll be much smaller in scale :) |
|
#260 has a reproducible case for this too. On my system, at least, the issue is sort of the inverse of the OP's scenario: we're taking what I assume is a Latin-1-derived string, and trying to print it to a stdout that thinks of itself as UTF-8. What seems to happen here is that the (Python 2 - Python 3 ends up replacing the characters with junk instead of exploding) That's why the Also, this has come up before, but I really hate how the Python core implementation of |
|
Hey guys, I'll test out this pull request on my system in a moment since I experienced this bug. Yeah, unicode behaves strangely in Python 2.x, I guess that's what Python 3 was attempting to overcome. A fallback to latin-1 would have been more logical, but whatcha gonna do 😄 I'll report back soon with my findings |
|
I did a dodgy manual merge, so it's possible I made a booboo, but my system doesn't like the errors kwarg in the decode function: |
|
@fgimian I'd check to see if that is a Python 2 vs Python 3 ism, smells like one offhand. |
|
Apologies for not looking at this again sooner mate, I'm going to try spend some time on it this week as I'd like to incorporate invoke into all my current project's automation and this is causing me some drama 😄 |
|
OK, so this problem is very easy to repro. The problem is that e.g. OS X 10.10Python 2.6 Python 2.6.9 (unknown, Sep 9 2014, 15:05:12)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.39)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.stdout.write(u'└──')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)Python 2.7 Python 2.7.10 (default, Jul 9 2015, 13:34:07)
[GCC 4.2.1 Compatible Apple LLVM 6.1.0 (clang-602.0.53)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.stdout.write(u'└──\n')
└──RHEL 6.6Python 2.6.6 (r266:84292, Nov 21 2013, 10:50:32)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.stdout.write(u'└──\n')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)Ubuntu 14.04Python 2.7.6 (default, Mar 22 2014, 22:59:56)
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.stdout.write(u'└──\n')
└──More to follow 😄 |
|
Ahh, this is a bug in Python 2.6 specifically. I have an idea for a fix but it's a little ugly. Let me think this through further for a moment 😄 |
|
Please find my pull request at #262 😄 |
|
@fgimian Nice catch re: the Python 2.6 bug, that's frustrating. Pity 2.6 is still so widespread or I'd start considering dropping it for final release =/ I'm going to reply to #262 here just so we're keeping the conversation in one thread, also because I'd like to compare that approach to the one here in #242. Specifically:
|
|
@bitprophet I object in principle to encoding to bytes and then writing, as conceptually the stream is a text stream, and doing so is incorrect. But that's a Python 3 POV, which I apply even to Python 2 because it ensures that the logical separation between bytes and strings is retained. In this instance, practicality may beat purtity - as long as the abuse is isolated to Python 2.X. But I personally have little interest in 2.X (except to advocate that code avoids bad assumptions that may transalate badly to 3.x) and specifically I don't care at all about 2.6. The I've only superficially looked at the small question you asked, though. I haven't looked at why the tests are failing or anything like that, I'm afraid - sorry. |
|
Thanks @pfmoore - and I am generally +1 on the "round trip to scrub bad characters" idea, I'd rather have garbled output than explosions. Users will need educating about these hilarious edge cases regardless, so. (also, if/when I merge some form of these fixes, I would definitely put some comments in :)) |
|
The bit about a comment wasn't a dig :-) What's in the patch looks about right. I just think it's a bit annoying that there's no Python builtin function for doing the Actually, maybe a small helper would be useful, for documentation purposes: def sanitise(s, encoding, errors='replace'):
"""Make sure string s doesn't contain any characters that can't be handled by encoding.
Use the specified codec error handler to determine how to handle unencodable characters."""
return s.encode(encoding, errors).decode(encoding) |
|
Hey there guys, firstly thanks for all your help 😄 I'd like to highlight a few points if I may:
Keep in mind that most people looking to use your library likely started with a Makefile which works everywhere 😄 Ideally this wonderful library would also work in as many systems as possible to make it easily to switch from GNU make. All the best and thanks a lot! |
|
Don't have time to reply to all of that right now (but thanks!!) but I can note that the 2.6 thing was, for me, a "I wish I could drop support" - meaning, I can't, for exactly the reasons you state ;) I am a sysadmin by trade so I place more emphasis on "old version support" when balancing tradeoffs, than many devs do. |
|
Thanks @bitprophet, really appreciate your time and help with this 😄 |
|
I bumped into this issue today, but I haven't noticed this PR, so I created my PR #274, which seems to be cleaner, passes tests for Python 2.x (it seems that 3.x tests are broken not because of my PR), and it works for me like expected. |
|
Note to self, I believe this problem & its potential solution(s) still apply despite the IO stuff getting a moderate rework recently (for #289). Bumping to next milestone for now but it's still something I want to work out pre 1.0. |
|
@bitprophet Just a note, I have been heavily using my patched (PR #274, PR #275) PyInvoke on Linux, Windows, and OS X since the time I have fixed it. It works perfectly well. |
|
@bitprophet I want to point out that this fix doesn't solve the Unicode errors at all. This PR should be closed in favor of #274. |
|
Hey guys, I found this problem too. Currently I have either to patch this line to make from invoke import run as orig_run, task
from functools import partial
run = partial(
orig_run,
encoding='utf-8',
out_stream=codecs.getwriter('utf8')(sys.stdout))This way, all output from child process is decoded from utf-8 (thanks to |
|
@svetlyak40wt Please, check out PR #274. It has a bullet-proof fix for this issue. When I started, I also thought, that the fix is simple, but unicode turned out to be a tricky thing especially when Python 2/3 support is required. |
|
This should be obviated by #350 at this point. |
In invoke.runners.Local, the stdout/stderr of child processes is read
and forwarded to the console (i.e. invoke's stdout/stderr). The incoming
data is read with
locale.getdefaultencoding(), while output happens withsys.stdout.encoding.If these two encodings differ, it may not be possible to write the
received data to stdout, e.g. if an UTF-8 Yen character is received, but
sys.stdout.encodingis'latin1'or'ascii'. If that happens, theresulting exception causes the thread to end, and the child process
hangs due to the child filling its output buffer.
This is mostly a concern on Windows, where the preferred encoding and
the console I/O encoding often differ. In my testing, this fixes #241. The
console output has replacement characters, but invoke doesn't crash.