bpo-29240: PEP 540: Add a new UTF-8 mode#855
bpo-29240: PEP 540: Add a new UTF-8 mode#855vstinner merged 6 commits intopython:masterfrom vstinner:pep540
Conversation
|
@Haypo, thanks for your PR! By analyzing the history of the files in this pull request, we identified @zooba, @benjaminp and @serhiy-storchaka to be potential reviewers. |
zooba
left a comment
There was a problem hiding this comment.
Would like to see some improved documentation before this is merged, but I think the rest of the change looks okay.
Doc/library/sys.rst
Outdated
There was a problem hiding this comment.
Either: "UTF-8 mode can now change the encoding" or "UTF-8 mode now changes the encoding", neither of which are very good at explaining what the impact is.
There was a problem hiding this comment.
Is the point that specifying both UTF-8 mode and legacy encoding is not supported? Or specifying UTF-8 mode will override the legacy encoding setting?
Doc/using/cmdline.rst
Outdated
Include/fileobject.h
Outdated
There was a problem hiding this comment.
Is this an important piece of information? Is it so important that it has to be in the limited API? What would I use this for?
Modules/main.c
Outdated
There was a problem hiding this comment.
wcsicmp? Any need to be case-sensitive here?
|
Ping @ncoghlan - more changes to startup sequence :) |
|
FYI I didn't look again at my code. I only recreated this pull request to not loose the code, since my old PR was on the old CPython repository. I still have to update the PEP and post it to python-dev. Questions like "wcsicmp? Any need to be case-sensitive here?" should be asked on the PEP directly, not on the implementation. You should wait until the PEP is accepted before reviewing the implementation. The PEP may still change. |
|
@Haypo Ah fair enough :) I haven't been paying attention to all the email recently, so I figured it was accepted already. |
|
The implementation is not complete: the command line arguments and environment variables are not re-decoded from UTF-8 once Py_Main() detects that the user requested the UTF-8 mode (-X utf8 or PYTHONUTF8). But thanks to my refactoring work on Modules/main.c ( bugs.python.org/issue32030 ), fixing this issue should now be much simpler. |
* Add -X utf8 command line option, PYTHONUTF8 environment variable and a new sys.flags.utf8_mode flag. * If the LC_CTYPE locale is "C" at startup: enable automatically the UTF-8 mode. * Add _winapi.GetACP(). encodings._alias_mbcs() now calls _winapi.GetACP() to get the ANSI code page * locale.getpreferredencoding() now returns 'UTF-8' in the UTF-8 mode. As a side effect, open() now uses the UTF-8 encoding by default in this mode. * Py_DecodeLocale() and Py_EncodeLocale() now use the UTF-8 encoding in the UTF-8 mode. * Update subprocess._args_from_interpreter_flags() to handle -X utf8 * Skip some tests relying on the current locale if the UTF-8 mode is enabled. * Add test_utf8mode.py. * _Py_DecodeUTF8_surrogateescape() gets a new optional parameter to return also the length (number of wide characters). * pymain_get_global_config() and pymain_set_global_config() now always copy flag values, rather than only copying if the new value is greater than the old value.
|
I updated the PR for the 4th version of the PEP 540: getpreferredencoding() now returns UTF-8 in the UTF-8 Mode. |
Programs/python.c
Outdated
| Py_UTF8Mode = 1; | ||
| } | ||
| setlocale(LC_CTYPE, ctype); | ||
| PyMem_RawFree(ctype); |
There was a problem hiding this comment.
The line 64 changes the locale to the user LC_CTYPE locale. This line restores the locale to its previous value.
There was a problem hiding this comment.
Shouldn't previous value be got by setlocale(LC_CTYPE, NULL) instead of setlocale(LC_CTYPE, "")?
There was a problem hiding this comment.
Oh, you're right. I fixed it in a new commit.
Fix conflict in Modules/main.c: pymain_set_global_config() and _Py_CheckHashBasedPycsMode.
Mock _winapi.GetACP(), not _bootlocale.getpreferredencoding().
|
@methane: You approved my PEP. This is implementation, all tests pass on Linux and Windows. While the implementation is not perfect, I propose to merge it right now to be get more time before the Python 3.7 final to test it. Remaining issue: when the UTF-8 mode is enabled manually (-X utf8), the command line arguments and environment variables are decoded from the locale encoding instead of being decoded from UTF-8. The fix should be easy, but I didn't have time yet to implement it. Another less important issue: the code to check if the LC_CTYPE locale is the POSIX locale is not currently shared between locale coercion (PEP 538) and UTF-8 Mode (PEP 540). |
Programs/python.c
Outdated
| /* UTF-8 mode */ | ||
| char *old_ctype = setlocale(LC_CTYPE, NULL); | ||
| if (old_ctype != NULL) { | ||
| old_ctype = _PyMem_RawStrdup(old_ctype); |
There was a problem hiding this comment.
Maybe, it can be:
if (_Py_LegacyLocaleDetected()) {
Py_UTF8Mode = 1;
_Py_CoerceLegacyLocale();
}See here
Lines 75 to 77 in 4ae06c5
There was a problem hiding this comment.
"if (_Py_LegacyLocaleDetected()) { Py_UTF8Mode = 1;"
Right. It works as expected, I updated my PR.
|
I feel adding more code to And when |
It is.
main() and Py_Main() are very complex. With the PEP 432, Nick Coghlan, Eric Snow and me are working on making this code better. See for example https://bugs.python.org/issue32030 Currently, Py_Main() (Modules/main.c) and _PyPathConfig_Calculate() (Modules/getpath.c and PC/getpathp.c) are fully implemented with wchar_t*. Parsing the command line options using char* on Unix but wchar_t* would make the code more complex. I expect that many lines of code would have to be duplicate, one version for char*, one version for wchar_t*. For all these reasons, I propose to merge this uncomplete PR and write a different PR for the most complex part, re-encode wchar_t* command line arguments, implement Py_UnixMain() or another even better option? |
|
Oops, I forgot to push my main.c change. It's now done. |
mode
using -X utf8 or -X utf8=strict command line options.
enabled.
return also the length (number of wide characters).
mode
https://bugs.python.org/issue29240