bpo-29990: Range checking in GB18030 decoder #999

ghost · 2017-04-05T06:28:47Z

Recreated for the CLA check.
http://bugs.python.org/issue29990

mention-bot · 2017-04-05T06:28:49Z

@animalize, thanks for your PR! By analyzing the history of the files in this pull request, we identified @hyeshik, @bitdancer and @Yhg1s to be potential reviewers.

zhangyangyu

This needs an entry in Misc/NEWS.

ezio-melotti · 2017-04-06T16:25:49Z

Modules/cjkcodecs/_codecs_cn.c

            c4 = INBYTE4;
-            if (c < 0x81 || c3 < 0x81 || c4 < 0x30 || c4 > 0x39)
+            if (c < 0x81 || c > 0xFE || c3 < 0x81 || c3 > 0xFE ||
+                c4 < 0x30 || c4 > 0x39)


In the issue I saw discussions about different standards (GB18030-2000 and GB18030-2005).
In order to avoid ambiguity, I would add at the beginning of the file a comment with a link to the standard implemented by this code.
Additional links to the sections that describe specific exceptions/algorithms/ranges (such as the one being changed by this patch) would also be useful.

zhangyangyu · 2017-04-07

The two standards are compatible (ignore the draft one, it's not a standard), and the difference between them doesn't matter here.

As for links, I think the only one worth considering here is the Wikipedia one; the others are all non-authoritative. The truly authoritative one is reserved by the company and you have to pay for it :-(. But if anyone is interested, googling GB18030 will lead you to Wikipedia.

ezio-melotti · 2017-04-06T16:30:54Z

Lib/test/test_codecencodings_cn.py

        (b"abc\x81\x30\x81\x30def", "strict", 'abc\x80def'),
        (b"abc\x86\x30\x81\x30def", "replace", 'abc\ufffd0\ufffd0def'),
+        # issue29990
+        (b"\x81\x30\xFF\x30", "strict", None),


I would add a few more tests, in particular:

one where the first byte is equal to \xFF;

a couple of tests with the "replace" error handler.
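The suggested cases can be sketched outside the test suite roughly as follows. This is only an illustration, assuming an interpreter whose gb18030 decoder already rejects an out-of-range third byte. `decode_or_none` is a hypothetical helper, not part of the test suite, and `None` plays the same role as in the test tuples above (decoding must fail).

```python
def decode_or_none(data, errors="strict"):
    """Decode GB18030 bytes; return None when decoding raises (hypothetical helper)."""
    try:
        return data.decode("gb18030", errors)
    except UnicodeDecodeError:
        return None


# (input bytes, error handler, expected result); None means "must raise".
cases = [
    (b"abc\x81\x30\x81\x30def", "strict", "abc\x80def"),  # existing test case
    (b"\x81\x30\xFF\x30", "strict", None),                # third byte is 0xFF
    (b"\xFF\x30\x81\x30", "strict", None),                # first byte is 0xFF
]

for data, errors, expected in cases:
    assert decode_or_none(data, errors) == expected, (data, errors)

# With "replace", the invalid input should yield at least one U+FFFD
# instead of raising; the exact output is deliberately not asserted here.
replaced = decode_or_none(b"\x81\x30\xFF\x30", "replace")
assert replaced is not None and "\ufffd" in replaced
```

The exact "replace" output depends on how many bytes the decoder reports as invalid at each step, so a real test would pin it down only after checking it against the patched decoder.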

ghost · 2017-04-07T11:45:48Z

I will reply later, have a good weekend :)

ghost · 2017-04-08T03:42:24Z

I would add a few more tests, in particular:
one where the first byte is equal to \xFF

I found that this case cannot occur. When the first byte is \xFF, the minimum possible value of lseq is:
0x10000 + ((0xFF-0x81-15) * 10 + 0) * 1260 + 0 * 10 + 0 = 1464136 = 0x165748

This is greater than 0x10FFFF, so it cannot pass the test if (lseq <= 0x10FFFF), and the code returns 1 without producing any output.

        ...
        else if (c >= 15) { /* U+10000 - U+10FFFF */
            lseq = 0x10000 + (((Py_UCS4)c-15) * 10 + c2)
                * 1260 + (Py_UCS4)c3 * 10 + c4;
            if (lseq <= 0x10FFFF) {
                OUTCHAR(lseq);
                NEXT_IN(4);
                continue;
            }
        }
        return 1;

So I removed the c > 0xFE check from this patch, which makes it slightly faster.
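The arithmetic above can be reproduced with a short Python sketch. It mirrors the C expression for the supplementary-plane branch; the function names are hypothetical, and the byte offsets (0x81 for bytes 1 and 3, 0x30 for bytes 2 and 4) are an assumption inferred from the 0xFF-0x81 term and the range checks in the patch.

```python
MAX_CODE_POINT = 0x10FFFF


def four_byte_to_lseq(b1, b2, b3, b4):
    """Mirror the C expression for the U+10000 - U+10FFFF branch.

    Assumption: each byte is offset to zero before use
    (0x81 for bytes 1 and 3, 0x30 for bytes 2 and 4).
    """
    c, c2, c3, c4 = b1 - 0x81, b2 - 0x30, b3 - 0x81, b4 - 0x30
    return 0x10000 + ((c - 15) * 10 + c2) * 1260 + c3 * 10 + c4


def min_lseq(first_byte):
    """Smallest lseq reachable for a first byte (other bytes at their minimum)."""
    return four_byte_to_lseq(first_byte, 0x30, 0x81, 0x30)


if __name__ == "__main__":
    for first in (0x90, 0xE3, 0xE4, 0xFF):
        value = min_lseq(first)
        in_range = value <= MAX_CODE_POINT
        print(f"first byte 0x{first:02X}: min lseq 0x{value:X}, in range: {in_range}")
```

Under these assumptions every first byte from 0xE4 upward already exceeds 0x10FFFF, so 0xFF is rejected by the existing lseq test without a separate c > 0xFE comparison.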

In the issue I saw discussions about different standards (GB18030-2000 and GB18030-2005).
In order to avoid ambiguity, I would add at the beginning of the file a comment with a link to the standard implemented by this code.
Additional links to the sections that describe specific exceptions/algorithms/ranges (such as the one being changed by this patch) would also be useful.

I have tested this: our GB18030-2000 codec follows this file:
http://source.icu-project.org/repos/icu/data/trunk/charset/data/xml/gb-18030-2000.xml
0x80 (€) is not included in the .xml file.

This page also mentions the 0x80 issue:
http://icu-project.org/docs/papers/gb18030.html

I investigated our GB2312/GBK/GB18030 codecs two years ago, and I'm glad to share my summary:

GBK codec

This implementation can be seen as either "GBK without PUA code points"
or "CP936 v2.01 without the euro sign (0x80 <-> U+20AC)".
The GBK standard defines a total of 23940 two-byte sequences, 2149 of which
are mapped to the PUA of the BMP. This implementation discards those 2149
sequences, so it has 21791 (=23940-2149) two-byte sequences.
The 2149 (=2054+95) sequences are mapped to 0xE000-0xE864 in the PUA.
2054 of them are empty positions to which nothing was assigned.
The other 95 were assigned characters. When the GBK standard was published
in 1995, these 95 characters were not yet in Unicode, so they were mapped
to the PUA. They have been included in later versions of Unicode.
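The counts in this summary can be cross-checked with a small Python sketch. It assumes the usual GBK two-byte layout (lead byte 0x81-0xFE, trail byte 0x40-0xFE except 0x7F), which is not spelled out in this thread, and `count_decodable` is a hypothetical helper. Whether it returns exactly 21791 on a given interpreter is left for the reader to verify.

```python
# Assumed GBK two-byte layout: lead 0x81..0xFE, trail 0x40..0xFE without 0x7F.
LEAD_BYTES = range(0x81, 0xFF)
TRAIL_BYTES = [b for b in range(0x40, 0xFF) if b != 0x7F]

total_sequences = len(LEAD_BYTES) * len(TRAIL_BYTES)  # size of the two-byte code space
pua_sequences = 0xE864 - 0xE000 + 1                   # size of the PUA range quoted above


def count_decodable(codec="gbk"):
    """Count two-byte sequences that the given codec decodes without error."""
    count = 0
    for lead in LEAD_BYTES:
        for trail in TRAIL_BYTES:
            try:
                bytes([lead, trail]).decode(codec)
            except UnicodeDecodeError:
                continue
            count += 1
    return count


if __name__ == "__main__":
    print("two-byte code space:", total_sequences)
    print("PUA range size:", pua_sequences)
    print("expected after discarding PUA:", total_sequences - pua_sequences)
    print("decodable by the gbk codec:", count_decodable("gbk"))
```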

GB18030 codec

This codec implements the full GB18030-2000 standard.
25 characters are mapped to the PUA of the BMP.

To implement the full GB18030-2005 codec, only the following mappings need to change:
sequence    GB18030-2000    GB18030-2005
A8BC        U+E7C7          U+1E3F
8135F437    U+1E3F          U+E7C7

U+1E3F is "LATIN SMALL LETTER M WITH ACUTE".

To make a long story short: there are no (big) problems in the GB2312/GBK/GB18030 codecs.
So I would suggest not adding more descriptions to _codecs_cn.c.

@zhangyangyu:

I think the only one worth considering here is the Wikipedia one

IMO Wikipedia is not authoritative enough because anyone can edit it, but it's a very good reference.

The truly authoritative one is reserved by the company and you have to pay for it

A library is another option. I searched, and the National Library of China has a copy of the GB18030-2000 standard, but it seems we don't need to look it up.

zhangyangyu · 2017-04-14T11:04:14Z

Why was this closed?

Range checking in GB18030 decoder

9acefcd

the-knights-who-say-ni added the CLA signed label Apr 5, 2017

zhangyangyu added the type-bug and needs backport to 2.7 labels Apr 6, 2017

zhangyangyu reviewed Apr 6, 2017


wjssz added 2 commits April 6, 2017 13:05

entry in Misc/NEWS

8d29000

fix -> Fix

e6183bf

zhangyangyu approved these changes Apr 6, 2017


ezio-melotti requested changes Apr 6, 2017


wjssz added 2 commits April 7, 2017 19:38

improve

cedf760

tab -> space

60c6881

First byte is 0xFF

2ee92cf

ghost closed this Apr 14, 2017

Mariatta removed the needs backport to 2.7 label Apr 14, 2017

This pull request was closed.
