Skip to content

Conversation

@ghost
Copy link

@ghost ghost commented Apr 5, 2017

recreate for CLA check.
http://bugs.python.org/issue29990

@mention-bot
Copy link

@animalize, thanks for your PR! By analyzing the history of the files in this pull request, we identified @hyeshik, @bitdancer and @Yhg1s to be potential reviewers.

@zhangyangyu zhangyangyu added type-bug An unexpected behavior, bug, or error needs backport to 2.7 labels Apr 6, 2017
Copy link
Member

@zhangyangyu zhangyangyu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need an entry in Misc/NEWS.

c4 = INBYTE4;
if (c < 0x81 || c3 < 0x81 || c4 < 0x30 || c4 > 0x39)
if (c < 0x81 || c > 0xFE || c3 < 0x81 || c3 > 0xFE ||
c4 < 0x30 || c4 > 0x39)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the issue I saw discussions about different standards (GB18030-2000 and GB18030-2005).
In order to avoid ambiguity, I would add at the beginning of the file a comment with a link to the standard implemented by this code.
Additional links to sections that describes specific exceptions/algorithms/ranges (such as the one being changed by this patch) are also useful.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The two different standards are compatible (ignore the draft one, it's not a standard) and doesn't make any difference here.

And about links, I think the only one worth considering here is the wikipedia one, others are all non-authoritative. The truly authoritative one is reserved by the company and you have to pay for it :-(. But if anyone is interested in it, google GB18030 will lead you to wikipedia.

(b"abc\x81\x30\x81\x30def", "strict", 'abc\x80def'),
(b"abc\x86\x30\x81\x30def", "replace", 'abc\ufffd0\ufffd0def'),
# issue29990
(b"\x81\x30\xFF\x30", "strict", None),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would add a few more tests, in particular:

  • one where the first byte is equal to \xFF;
  • a couple of tests with the "replace" error handler.

@ghost
Copy link
Author

ghost commented Apr 7, 2017

I will reply later, have a good weekend :)

@ghost
Copy link
Author

ghost commented Apr 8, 2017

I would add a few more tests, in particular:
one where the first byte is equal to \xFF

I found this doesn't exist, when the first byte is \xFF, the minimum possible value of lseq is:
0x10000 + ((0xFF-0x81-15) * 10 + 0) * 1260 + 0 * 10 + 0 = 1464136 = 0x165748

It can't pass the test if (lseq <= 0x10FFFF), then the code will return 1 without touching anything.

        ...
        else if (c >= 15) { /* U+10000 - U+10FFFF */
            lseq = 0x10000 + (((Py_UCS4)c-15) * 10 + c2)
                * 1260 + (Py_UCS4)c3 * 10 + c4;
            if (lseq <= 0x10FFFF) {
                OUTCHAR(lseq);
                NEXT_IN(4);
                continue;
            }
        }
        return 1;

So I removed c > 0xFE test in this patch, make it faster a bit.

In the issue I saw discussions about different standards (GB18030-2000 and GB18030-2005).
In order to avoid ambiguity, I would add at the beginning of the file a comment with a link to the standard implemented by this code.
Additional links to sections that describes specific exceptions/algorithms/ranges (such as the one being changed by this patch) are also useful.

I have tested, our GB18030-2000 codec is follow this file.
http://source.icu-project.org/repos/icu/data/trunk/charset/data/xml/gb-18030-2000.xml
0x80 (€) is not included in .xml file.

This page also mentioned the issue of 0x80:
http://icu-project.org/docs/papers/gb18030.html

I investigated our GB2312/GBK/GB18030 codecs two years ago, I'm glad to share my summary:

GBK codec

This implmentation can be seem as either "GBK without PUA code points",
or "CP936 v2.01 without euro sign (0x80 <-> U+20AC)".
GBK standard gave a total of 23940 two-byte sequences, 2149 of them
were mapped to PUA of BMP. In this implmentation, these 2149 sequences
were discarded, so this implmentation has 21791 (=23940-2149) two-byte
sequences.
The 2149 (=2054+95) sequences are mapped to 0xE000-0xE864 in PUA.
The 2054 are empty positions, nothing was assigned to these positions.
The 95 were assigned with characters. When GBK standard was published
in 1995, these 95 characters were not included by Unicode, so mapped
them to PUA. They are included by later version of Unicode.

GB18030 codec

This codec implemented full GB18030-2000 standard.
25 characters were mapped to PUA of BMP.

To implement full GB18030-2005 codec, just modify as below:
sequence GB18030-2000 GB18030-2005
A8BC U+E7C7 U+1E3F
8135F437 U+1E3F U+E7C7

  • U+1E3F is "LATIN SMALL LETTER M WITH ACUTE".

Make a long story short: no (big) problem in GB2312/GBK/GB18030 codecs.
So I would suggest don't add more descriptions to _codecs_cn.c.

@zhangyangyu:

I think the only one worth considering here is the wikipedia one

IMO Wikipedia is not authoritative enough because everyone can edit it, but it's a very good referrence.

The truly authoritative one is reserved by the company and you have to pay for it

Libray is a choice, I searched, "National library of China" has a copy of GB18030-2000 standard, but it seems we don't need to look up it.

@ghost ghost closed this Apr 14, 2017
@zhangyangyu
Copy link
Member

Why close?

This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

type-bug An unexpected behavior, bug, or error

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants