gh-74902: add unicode grapheme cluster break algorithm #2673
|
Hello, and thanks for your contribution! I'm a bot set up to make sure that the project can legally accept your contribution by verifying you have signed the PSF contributor agreement (CLA). Unfortunately we couldn't find an account corresponding to your GitHub username on bugs.python.org (b.p.o) to verify you have signed the CLA (this might be simply due to a missing "GitHub Name" entry in your b.p.o account settings). This is necessary for legal reasons before we can look at your contribution. Please follow the steps outlined in the CPython devguide to rectify this issue. Thanks again for your contribution and we look forward to looking at it! |
0f82f82 to
62fd6e0
Compare
62fd6e0 to
a47de54
Compare
|
Hello? Someone here? |
Modules/unicodedata.c
Outdated
| 0, /*tp_setattro*/ | ||
| 0, /*tp_as_buffer*/ | ||
| Py_TPFLAGS_DEFAULT, | ||
| "Internal grapheme cluster iterator object.", /* tp_doc */ |
I think the words "internal" and "object" are redundant.
"Internal", "iterator" and "object" are all redundant. "Grapheme cluster iterator" seems just right. What do you think?
|
Sorry for the long wait. Are we good concerning the changes? Anything to add? |
|
To try and help move older pull requests forward, we are going through and backfilling 'awaiting' labels on pull requests that are lacking the label. Based on the current reviews, the best we can tell in an automated fashion is that a core developer requested changes to be made to this pull request. If/when the requested changes have been made, please leave a comment that says, |
|
@Vermeille, please take a look at the most recent comments on the bug tracker for this issue. It looks like the suggested path forward is different than the solution you proposed here. Thanks! |
|
This PR is stale because it has been open for 30 days with no activity. |
|
Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool. If this change has little impact on Python users, wait for a maintainer to apply the |
serhiy-storchaka
left a comment
I apologize that it took so long to start reviewing this PR seriously.
Now we need this algorithm to calculate the width of text in columns, which is needed to support wide characters in many parts of the stdlib (REPL, tracebacks, etc). So we will add its implementation anyway. If you are busy or have lost interest, I will finish this work myself (keeping your credit), but if you are still interested, I would be happy to work together.
I wonder, what is the source of the state machine table? Did you create it from the original rules or from the table in GraphemeBreakTest.html? Or did you copy it from another source? I am afraid that it is outdated and only supports legacy grapheme clusters. I can fix this, but maybe you already have a ready solution?
| self: self | ||
| unistr: unicode | ||
| start: int = 0 |
It should be Py_ssize_t. Some other variables should be Py_ssize_t, not int.
| self: self | ||
| unistr: unicode | ||
| start: int = 0 | ||
| end: Py_ssize_t(c_default="PY_SSIZE_T_MAX - 1") = sys.maxsize |
It should be PY_SSIZE_T_MAX.
Although I am not sure that the end parameter is needed. The user can simply stop iteration at any time.
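A minimal sketch of how a caller might bound iteration without an `end` parameter. The `clusters` function is a hypothetical stand-in that yields one code point at a time; it is not the iterator from this PR. Only `itertools.islice` is a real API here.

```python
from itertools import islice

def clusters(text):
    # Hypothetical stand-in for the grapheme cluster iterator under review;
    # it simply yields one code point at a time.
    yield from text

# Take only the first three items instead of passing an end index.
first_three = list(islice(clusters("abcdef"), 3))

# Or stop on a condition with a plain break.
before_d = []
for item in clusters("abcdef"):
    if item == "d":
        break
    before_d.append(item)
```

Note that this bounds the number of items consumed, not a string offset; a caller who needs to stop at a given index would have to track the position while iterating.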
| @staticmethod | ||
| def check_version(testfile): | ||
| hdr = testfile.readline() | ||
| return unicodedata.unidata_version in hdr |
What does the file header look like?
With string contains tests, I worry about things like "8.0" in "18.0" matching wrongly. Could the full line be compared?
# GraphemeBreakTest-17.0.0.txt
We have the same check for normalization tests.
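To illustrate the concern about substring matching, here is a small sketch comparing the `in` test with a full-line comparison. It assumes the header has exactly the shape quoted above (`# GraphemeBreakTest-<version>.txt`); the helper name `header_matches` is hypothetical and not part of the PR or the test suite.

```python
def header_matches(header_line, version):
    # Assumed header shape, taken from the reply above:
    # "# GraphemeBreakTest-17.0.0.txt"
    expected = f"# GraphemeBreakTest-{version}.txt"
    return header_line.strip() == expected

header = "# GraphemeBreakTest-18.0.0.txt\n"

# The substring test accepts a wrong version, because "8.0.0" occurs in "18.0.0".
substring_result = "8.0.0" in header

# The full-line comparison rejects it and accepts only the exact version.
strict_wrong = header_matches(header, "8.0.0")
strict_right = header_matches(header, "18.0.0")
```

In practice the collision needs a version whose string is a suffix of another, so the substring check may be acceptable as a trade-off; the strict form just removes the ambiguity.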
I have added GraphemeBreakProperty to UnicodeData.
An automaton to compute the rules for breaking grapheme clusters according to TR29 is included. It passes all the tests provided in GraphemeBreakTest.txt.
https://bugs.python.org/issue30717
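For readers unfamiliar with the TR29 rules mentioned in the description, the following is a deliberately simplified sketch in Python. It is not the table-driven automaton from this PR and does not pass GraphemeBreakTest.txt: it covers only GB3 (CR LF), GB4/GB5 (controls), GB9 (Extend and ZWJ) and the default rule GB999. It also approximates the Extend property by the general categories `Mn` and `Me`, which is an assumption made for brevity. All function names are hypothetical.

```python
import unicodedata

ZWJ = "\u200d"

def _is_control(ch):
    # Approximation of the Control/CR/LF classes via general category.
    return unicodedata.category(ch) in ("Cc", "Zl", "Zp")

def _is_extend(ch):
    # Approximation of Extend: nonspacing and enclosing marks, plus ZWJ.
    return ch == ZWJ or unicodedata.category(ch) in ("Mn", "Me")

def is_break(prev, cur):
    if prev == "\r" and cur == "\n":            # GB3: never split CR LF
        return False
    if _is_control(prev) or _is_control(cur):   # GB4, GB5: break around controls
        return True
    if _is_extend(cur):                         # GB9: no break before Extend or ZWJ
        return False
    return True                                 # GB999: otherwise break

def simple_clusters(text):
    # Yield substrings between consecutive break positions.
    start = 0
    for i in range(1, len(text)):
        if is_break(text[i - 1], text[i]):
            yield text[start:i]
            start = i
    if text:
        yield text[start:]

result = list(simple_clusters("e\u0301a\r\nb"))
```

Here `result` groups the combining acute accent with its base letter and keeps CR LF together. Hangul syllables, regional indicators, emoji ZWJ sequences and the other TR29 rules are intentionally left out.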