-
-
Notifications
You must be signed in to change notification settings - Fork 33.7k
bpo-29990: Range checking in GB18030 decoder #999
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@animalize, thanks for your PR! By analyzing the history of the files in this pull request, we identified @hyeshik, @bitdancer and @Yhg1s to be potential reviewers. |
zhangyangyu
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need an entry in Misc/NEWS.
| c4 = INBYTE4; | ||
| if (c < 0x81 || c3 < 0x81 || c4 < 0x30 || c4 > 0x39) | ||
| if (c < 0x81 || c > 0xFE || c3 < 0x81 || c3 > 0xFE || | ||
| c4 < 0x30 || c4 > 0x39) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the issue I saw discussions about different standards (GB18030-2000 and GB18030-2005).
In order to avoid ambiguity, I would add at the beginning of the file a comment with a link to the standard implemented by this code.
Additional links to sections that describes specific exceptions/algorithms/ranges (such as the one being changed by this patch) are also useful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The two different standards are compatible (ignore the draft one, it's not a standard) and doesn't make any difference here.
And about links, I think the only one worth considering here is the wikipedia one, others are all non-authoritative. The truly authoritative one is reserved by the company and you have to pay for it :-(. But if anyone is interested in it, google GB18030 will lead you to wikipedia.
| (b"abc\x81\x30\x81\x30def", "strict", 'abc\x80def'), | ||
| (b"abc\x86\x30\x81\x30def", "replace", 'abc\ufffd0\ufffd0def'), | ||
| # issue29990 | ||
| (b"\x81\x30\xFF\x30", "strict", None), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would add a few more tests, in particular:
- one where the first byte is equal to
\xFF; - a couple of tests with the
"replace"error handler.
|
I will reply later, have a good weekend :) |
I found this doesn't exist, when the first byte is It can't pass the test So I removed
I have tested, our GB18030-2000 codec is follow this file. This page also mentioned the issue of 0x80: I investigated our GB2312/GBK/GB18030 codecs two years ago, I'm glad to share my summary:
Make a long story short: no (big) problem in GB2312/GBK/GB18030 codecs.
IMO Wikipedia is not authoritative enough because everyone can edit it, but it's a very good referrence.
Libray is a choice, I searched, "National library of China" has a copy of GB18030-2000 standard, but it seems we don't need to look up it. |
|
Why close? |
recreate for CLA check.
http://bugs.python.org/issue29990