Fixed a number of encoding issues (in run context) #274
Conversation
invoke/runners.py
- E501 line too long (80 > 79 characters)
Please suggest how I should fix this.
NOTE: As you can see, Python 2 weirdly reports ANSI (ASCII) instead of UTF-8 when …
NOTE2: We should avoid …
@bitprophet Could you please review this small but quite important fix?
invoke/runners.py
This seems wrong. The original code is pulling data from the results of iterdecode - i.e., data is a string. The replacement code is pulling data from iterencode or get, so it's bytes.
This would be the sort of error that Python 2 lets you make because of its incomplete separation of bytes and strings. But it should fail on Python 3 (and give potentially corrupted results on Python 2).
1. `get` always yields `bytes` (`str` in Python 2.x)
2. `encode(decode(bytes))` returns `bytes`
3. `output.write` works more safely with `bytes`
I don't see any mistake here.
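For illustration, a minimal sketch of the round trip described in points (1)-(3), with made-up variable names and an assumed UTF-8 encoding (not the invoke code):
chunk = b'\xc3\xbf'              # the kind of value get() yields: bytes
text = chunk.decode('utf-8')     # decode(bytes) -> unicode (str on Python 3)
data = text.encode('utf-8')      # encode(decode(bytes)) -> bytes again
# output.write(data)             # a bytes-based stream accepts this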
Regarding (3), is output opened as a bytes stream or as a string stream? The original code clearly implies it's a string stream.
The original code did decode(get()), returning a string. You have changed the type of data without changing the code that uses it.
Regarding (3):
- There is no "text" mode on Linux and OS X; `write` will encode unicode to bytes under the hood if we don't, so it is usually advised to pass bytes when writing or sending anything in Python
- The tests use CarbonCopy, which is based on BytesIO (bytes)
I changed the code because it was buggy. It was just luck that it used to work; it had exactly the sort of error that you referred to in your first comment.
OK. I've looked at the surrounding code. I think I see what you're trying to achieve here.
You're right, there's a bug, but your fix is wrong (see below - I'm taking a purist view here, I suspect your fix probably works, but it's very hard to analyze, so I can't be sure). According to the comment at the top of the io() function, reader should return bytes, and output should take bytes as the argument to its write() method. In Python 3, that's explicitly typed, and using the wrong types won't work. In Python 2, you can accidentally supply a string when you mean bytes and it will work, but give the wrong output. This is where Python 2's string/bytes model sucks - you have to track which objects are bytes and which are strings for yourself, the language doesn't do it for you.
The original code got bytes from get(), put them through iterdecode to get strings, then wrote the strings with output.write. That is broken, because you're writing strings to an output function that should get bytes.
But the comment about output is muddled:
If `hide` is `False`, write bytes to `output`, a stream such as `sys.stdout`.
That doesn't make sense because sys.stdout is not a stream you can write bytes to! You can write bytes to the test CarbonCopy stream, so the tests don't spot this. But you cannot supply bytes to sys.stdout.write. On Python 2 you get mojibake, on Python 3 you get a type error.
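A tiny sketch of the mismatch being described, assuming a UTF-8, text-mode stdout (illustrative snippet, not the invoke code):
import sys, codecs
for piece in codecs.iterdecode(iter([b'\xc3\xbf']), 'utf-8'):
    sys.stdout.write(piece)               # str: fine for a text-mode stream
# sys.stdout.write(b'\xc3\xbf')           # bytes: TypeError on Python 3, mojibake risk on Python 2
# sys.stdout.buffer.write(b'\xc3\xbf')    # Python 3's explicit bytes-level interface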
I think your code juggles its way round this, by making data be either bytes or string in a way that matches what output expects. But it's complicated, fragile code that doesn't keep a clean type model (something that in my experience is the source of a lot of bugs).
I would strongly recommend a more type-consistent fix, cleanly separating bytes and strings (conceptually in Python 2, but in terms of actual types in Python 3). That might mean needing to explicitly pass around encodings in places where it's implied at the moment.
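As a sketch of what a more type-consistent shape could look like (assuming reader is a callable returning bytes chunks with b'' at EOF, and output is a text stream; a hypothetical helper, not a proposed patch):
import codecs

def io_loop(reader, output, encoding):
    decoder = codecs.getincrementaldecoder(encoding)()
    for chunk in iter(reader, b''):
        output.write(decoder.decode(chunk))   # only ever hand str to the text stream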
But working round the issue in the way your patch does may be a "practicality beats purity" issue - I'll leave that decision to @bitprophet. But someone should at least test it on Python 3 and Windows (and by "test" I mean use it for real, with an actual stdout, not running the tests which stub out the string-based sys.stdout with a bytes-based CarbonCopy instance!!)
Actually, if the test harness is mocking out sys.stdout with a BytesIO object, that's pretty broken. It'll make it pretty much impossible to spot or check issues like this where strings and bytes get muddled.
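For example, a text-mode test double (an assumed alternative, not the existing CarbonCopy helper) would surface exactly this class of bug:
import io
fake_stdout = io.TextIOWrapper(io.BytesIO(), encoding='utf-8')
fake_stdout.write(u'echo \xff')    # str is accepted and encoded underneath
# fake_stdout.write(b'\xff')       # raises TypeError, just like a real text-mode stdout on Python 3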
You are right, there is indeed a difference between open(mode='w') and open(mode='wb') in Python 3.x, and sys.stdout is opened in text mode. I will rework my PR.
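A quick illustration of that distinction on Python 3 (throwaway example files, nothing to do with the PR itself):
with open('demo.txt', 'w', encoding='utf-8') as f:   # text mode: expects str
    f.write(u'echo \xff')
with open('demo.bin', 'wb') as f:                    # binary mode: expects bytes
    f.write(u'echo \xff'.encode('utf-8'))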
Thanks. And thanks for your patience - rereading, my initial explanations weren't very clear :-(
@pfmoore @bitprophet I've just pushed an improved version. Please review it. Tests pass for Python 2.6, 2.7, 3.3 and 3.4 on Linux, but they fail on 3.2; please advise on how to fix that. This PR fixes a number of issues related to running non-ASCII commands as well as handling non-ASCII output. NOTE: I've updated six.py for …
@frol - This looks OK to me now. Thanks for updating. The unicode stuff in invoke is nasty, and help in getting it right is much appreciated.
All that stuff switching to …
I'm slightly bothered about …
So yeah, I'm OK with this.
@bitprophet @pfmoore My concern about …
I personally prefer options (1) and (3).
Amusingly, I ran into this myself while running the internal but rarely-used …
I also skimmed #242 and it reminded me about the 2.6 vs 2.7 issue; I confirmed also that under 2.7 on master, I get no exception but DO get garbage characters. This branch, under 2.7, also looks clean.
Also mentioned in #242 is the issue of testing; I'm irritated that the old tests I wrote (which specifically include …
Re: unicode literals vs Python 3.2, another related discussion is http://python-future.org/unicode_literals.html . I don't have a strong opinion at this time; we waffle a lot on 3.2 support; there've been times when it was dropped and then added again. Most of the good arguments against …
@bitprophet All sounds good. I will prepare and test a patch with …
@bitprophet I had a couple of minutes right after I wrote the previous comment, so the patch is already pushed. Travis is happy now!
Cool, thank you! BTW, if you find time before I do to figure out why these integration tests were passing prior to your branch, and fix them so they fail without & pass with, that would be super great.
@bitprophet I guess it is because of …
@bitprophet I see you have pushed a number of commits recently. Could you please merge this PR (and probably some other PRs) while they can be merged successfully?
@bitprophet I have fixed the integration tests, so they will catch the encoding errors, but Travis will fail on …
I had to rebase my PR since …
Everything is tested and Travis is happy. Please merge.
This needs reconciliation with both recent master-merged changes and the notes from pfmoore in #289 re: centralizing and tweaking encoding stuff. Adding to next milestone so it doesn't get (re-)lost.
Just for your info: …
@frol Just to clarify, could you please explain the issue you are getting, and how to reproduce it? If I follow the code and tests, it's related somehow to cases when the command is a bytestring (as opposed to a Unicode string), and when the command output is bytes not in the console encoding (e.g., ANSI escape sequences). But I may be misunderstanding - so if you could explain how I can reproduce the issue, that'd be a great help. From your initial description in this thread, you say you ran a program with your system set to POSIX locale, but the program output was Unicode (you don't clarify what encoding - I guess UTF8?). I'm not quite sure how that relates to the above?
@pfmoore First, just try to run the tests (as usual and with …). Having …:
import invoke
invoke.run('cat ./integration/_support/tree.out')
Even with … Fixed version: … Fixed version with …
@frol OK, thanks. So just to help me understand (my environment is Windows, so I want to be clear on expected behaviour on Unix): …
OK. I'll need to set up a Unix VM to do some testing, but I think I see what you're after now. I'll take a look at how the latest invoke code fares, and see what needs to be changed from there. Then hopefully I'll be in a position to review and re-base your code changes :-)
@pfmoore Try …
OK, I see the issue (…).
@bitprophet what I'd like to do is to open a new issue specifically to get the string/bytes separation correct. I can then create a series of PRs to both document and address the issues once and for all. Would that be acceptable? If it's going to cause merge conflicts with work you're doing, I'd rather hold off - as you've already said you're still getting your head round the string/bytes stuff, so I don't want to leave you having to resolve conflicts with code that makes encoding assumptions that I haven't explained clearly enough for you to follow :-)
What I'd suggest then: …
How does that sound?
BTW, @frol given where (I think) we've got to with the Unicode situation, I'm no longer sure I'm happy with your approach of wrapping the output stream. It's basically wrapping a "not quite sure if it's bytes or unicode" stream to make it a "definitely Unicode" stream. And I (now) think we should in fact be trapping all of the uncertainty about bytes vs unicode at the extreme edges of the program, not internally like this is doing. OTOH, I think we're a lot closer to that ideal now than we were when this patch was originally written, so at least there's progress :-)
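The "edges only" idea in miniature (hypothetical helper names, purely illustrative):
def read_edge(raw_bytes, encoding):
    # byte world -> text world, exactly once, as early as possible
    return raw_bytes.decode(encoding, errors='replace')

def write_edge(text_stream, text):
    # text world -> the OS, exactly once, as late as possible
    text_stream.write(text)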
You will need that …
@frol agreed. But (IMO) the output wrapping is a question of "what do we want Invoke to do when it needs to print out arbitrary Unicode?" Python's answer is "encode it using the OS-defined encoding, error if not". Choosing to change that behaviour (wrapping …
enc = sys.getfilesystemencoding()  # or whatever
my_str.encode(enc, errors='backslashreplace').decode(enc)
Which is better depends on the application. (I won't comment on whether Python's default behaviour is ever the best approach - we live with what we're given on that one...) On Unix, the behaviour of …
@pfmoore I chose to implement non-failing printing since none of the tools known to me (like …) … Please also learn about …
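One way to get non-failing printing looks roughly like this (a sketch of the idea with an assumed UTF-8 target encoding; not necessarily what this PR actually does):
import codecs, sys
raw = getattr(sys.stdout, 'buffer', sys.stdout)        # bytes layer on Python 3, stdout itself on Python 2
safe_stdout = codecs.getwriter('utf-8')(raw, errors='replace')
safe_stdout.write(u'echo \xff\n')                      # never raises UnicodeEncodeError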
@frol Ta - agreed an encode/decode dance is not ideal (and I'm not necessarily recommending it, just deferring a decision for now). But the "have we got an encoding/a buffer" code in your wrapper is potentially an issue too (it's not str/bytes clean to sometimes use buffer and other times not). Nothing's ideal.
Your default encoding workarounds aren't in master at the moment. If there's an issue they solve, they can be included either now or independently (they don't affect bytes/unicode separation directly). From the URL you gave I can't track back to a specific issue number, though.
Anyway, that's probably enough talk for now. I need @bitprophet to confirm he's OK with my suggestion before it's worth doing anything more.
Because all of those encoding workarounds are part of this PR, which should have been merged a long time ago; later, if somebody didn't like the design (which I have not violated, as I have only patched the exact issue), it could have been refactored. Instead, we are having this conversation while the issue is still there.
Ah, sorry. The direct link didn't indicate that it was related to this PR. My bad. You say in there:
…
but you don't say what that issue is. Could you elaborate?
Well, that's not something I can affect - @bitprophet only has so much time to work on invoke, I guess. All I can offer is to look at this and prepare a new patch if I can understand your issue(s). (Given that you don't have the time to rework this PR so that it applies to the current codebase.) So far it seems we have two, possibly three: …
Note that (1) and (3) combined could be taken as saying "why don't we just take the bytes from the subprocess and dump them straight out to the console unchanged?" And indeed that is precisely what Python 2 originally did - but it hits all sorts of issues as soon as you try to interact with that byte stream as if it were a string (for example, inspecting the result like Invoke allows you to do) unless the encodings are all set correctly. (The latter is why, in many cases, Python 2 "works" - but ask someone Japanese if you want to hear about encoding problems!)
In Python 3, however, the issue is forced - either you write all your internal interactions with the byte stream as bytes (which is a dreadful UI) or you encode/decode - and the only way of doing that while keeping your sanity (and standing any chance at all of knowing the right encoding) is to decode to Unicode as early as possible, and re-encode to bytes as late as you can. You can transport a byte stream cleanly via Unicode-based code, but you need to use the "surrogateescape" error handler, which is a Python 3 only solution.
To summarise, unless we want to drop support for Python 3, or force users to deal with things like …
But this is the essay I need to write for that document I mentioned in my plan above. And I'm not ready to do that yet :-)
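The surrogateescape round trip mentioned above, in a few lines (Python 3 only; throwaway data):
raw = b'\xff\xfe plain ascii'
text = raw.decode('utf-8', errors='surrogateescape')            # never raises, lossless
assert text.encode('utf-8', errors='surrogateescape') == raw    # original bytes recovered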
Please read my second comment on this issue (I have just slightly updated it).
In fact, four:
import invoke
invoke.run(u'echo \xFF', echo=True)
$ python bug_274_2.py
Traceback (most recent call last):
File "/tmp/qq.py", line 3, in <module>
invoke.run(u'echo \xFF', echo=True)
File "/tmp/env/lib/python2.7/site-packages/invoke/__init__.py", line 27, in run
return Context().run(command, **kwargs)
File "/tmp/env/lib/python2.7/site-packages/invoke/context.py", line 53, in run
return runner_class(context=self).run(command, **kwargs)
File "/tmp/env/lib/python2.7/site-packages/invoke/runners.py", line 223, in run
print("\033[1;37m{0}\033[0m".format(command))
UnicodeEncodeError: 'ascii' codec can't encode character u'\xff' in position 5: ordinal not in range(128)
This is exactly what is done in this PR.
OK. Regarding the encoding one, I don't have the Linux knowledge to comment. Regarding …:
print(u'echo \xff')
If it doesn't work, your Python setup is apparently unable to print the Unicode character "LATIN SMALL LETTER Y WITH DIAERESIS". Which makes no sense because you say your terminal is set to use UTF-8. If your console was set to use ASCII, then I'd say that Python's (and as a consequence invoke's) behaviour is a little unfriendly, but valid. We can (if we choose) do better in invoke, but something's still lying to Python about your terminal's capabilities. If the above does work, invoke is somehow corrupting the configuration of sys.stdout - which I doubt.
Sigh. This is why I hate Python 2. Let's just drop support for it and all move to Python 3 ;-)
@pfmoore Everything is fine with my Linux environment; it is just a broken implementation in invoke. Here is what happens:
>>> print(b"{0}".format(u"echo \xff"))
UnicodeEncodeError: 'ascii' codec can't encode character u'\xff' in position 5: ordinal not in range(128)
You may say, "OK, let's use …":
$ env LANG=C python
>>> print(u"{0}".format(u"echo \xff"))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xff' in position 5: ordinal not in range(128)
BTW, in the "usual" cases everything is "fine" indeed:
$ python
>>> print(u"echo \xFF")
echo ÿ
>>> print("echo \xFF")
echo �
Well, your first example is obviously wrong - …
Your second example sets the locale to C, so again, yeah, of course it only likes ASCII.
Your third example is what I'd expect.
Your fourth shows why Python 2 is awful :-) What it is doing is printing the byte stream …
It's quite possible that your first and fourth example indicate what the problem is in invoke - and the solution is quite likely to use …
And in case you have somehow missed it, this is how it is implemented in invoke: https://github.com/pyinvoke/invoke/blob/master/invoke/runners.py#L223
And if you read the comments above, you will find that this was discussed before I made such a move!
Do you think it is possible to decode the output of every command that invoke can call? Perhaps Result.stdout needs to keep the raw data (in addition to attempting a decode..)? Forcing Unicode decoding/encoding errors on unsuspecting users "is bad"(tm). |
tl;dr - We have to decode everything to meet the contracted API for invoke. That's hard, but not optional, IMO.
This is a difficult question to answer. The Unix and Windows philosophies are very different here, so I'll try to give a platform-neutral answer, but I am aware that there are nuances in each case (and I'd ask that anyone reading this also be careful to remember that Unix isn't the only environment here).
First of all, we need to know what the data returned by any command invoke runs means, because: …
In particular, …
OK, so we have to be able to interpret the data returned by a command as text. Or we have to know that it's not intended as text (and in that case, we currently have no way in the API of dealing with that properly). To interpret the data as text we need to know the encoding. There is no encoding supplied with the data, which is a fundamental flaw in the stdout interface (but that interface goes back to the days when Unicode was unheard of, so it's not surprising). So we need to work out the encoding using other data.
Most of the time, systems are configured with a known encoding that we can assume is used by commands. On Windows, that's pretty reliable, as it's defined by the OS and the C runtime, so we really only need to watch out for programs that don't use the C runtime and don't follow its conventions. On Unix, it's defined by the user's environment variables - but there are then C runtime and OS APIs that interpret that data (and those have their own issues, such as …).
So it's a mess. The solutions we have are: …
The problem cases are systems where the "normal conventions" aren't working as expected. I'm being very careful not to describe such systems as "broken", because this is an area where pointing the finger of blame is particularly unhelpful. But my assumption is that the systems affected (i.e. cases that aren't covered by the two approaches above) are relatively few and unusual. That is something we need to confirm, though. For those systems, we have the following options: …
My preference is to make sure the code works reliably for all "normal" cases, and degrades gracefully for odd situations. I do not want the user to ever have to work around this issue in their own code, other than by explicitly stating the correct encoding to use. But having said that, I am OK with us giving errors (but ones we choose, not raw Python encode/decode exceptions) if we find that we can't properly determine the encoding on the user's behalf.
Of course, it's not clear to me how much work there is in getting to the "ideal world" situation I describe above, and while I am offering to do that work, until it's done people will still have this type of issue to deal with. And this is all just my opinion anyway - ultimately it needs @bitprophet to say if he agrees with this approach, or if he prefers something else.
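A sketch of that preference (hypothetical decode_output helper, not a proposed invoke API): decode with a best-guess encoding, let the caller override it explicitly, and degrade rather than crash.
import locale

def decode_output(raw_bytes, encoding=None):
    enc = encoding or locale.getpreferredencoding(False)    # explicit override wins
    try:
        return raw_bytes.decode(enc)
    except UnicodeDecodeError:
        return raw_bytes.decode(enc, errors='replace')      # degrade gracefully, don't crash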
…u'' prefix, which was the case because of unicode_literals
For those who care, I have rebased and merged together this PR and #275. All tests are green.
This brings us back to using `getpreferredencoding(False)` but the addition of preferring `getdefaultlocale` presumably catches any cases where the former was giving bad info.
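If I read that correctly, the selection order is roughly the following (my own sketch of the described logic, not the literal diff):
import locale

def guess_encoding():
    loc_encoding = locale.getdefaultlocale()[1]    # e.g. 'UTF-8', or None if undetermined
    return loc_encoding or locale.getpreferredencoding(False)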
Unicode handling is nasty and quite hard to get right, but this PR makes Invoke reliable in a non-ASCII world.
Old Description
I've had a session with a POSIX locale on Linux, and a program that I ran pushed non-ASCII (UTF-8 in my case) output anyway. This revealed a bug when the user-requested/`locale.getpreferredencoding()` encoding is UTF-8, but `sys.stdout` is ASCII.