gh-74902: add unicode grapheme cluster break algorithm #2673

Vermeille · 2017-07-11T23:12:53Z

I have added GraphemeBreakProperty to UnicodeData.
An automaton that applies the grapheme cluster break rules of TR29 is included. It passes all the tests provided in GraphemeBreakTest.txt.

https://bugs.python.org/issue30717

Issue: Add unicode grapheme cluster break algorithm #74902
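For readers unfamiliar with TR29, here is a minimal sketch of the kind of boundary rules the automaton encodes. It is not the code from this PR: the function names are hypothetical, only rules GB3, GB4 and GB9 are covered, and the Extend property is approximated with general categories Mn/Me plus ZWJ because the real Grapheme_Cluster_Break data is assumed to be unavailable here.

```python
import unicodedata


def _is_extend_approx(ch):
    # Assumption: approximate Grapheme_Cluster_Break=Extend with
    # nonspacing/enclosing marks, and treat ZWJ (U+200D) the same way.
    return unicodedata.category(ch) in ("Mn", "Me") or ch == "\u200d"


def iter_clusters_approx(text):
    """Yield approximate grapheme clusters (GB3, GB4 and GB9 only)."""
    start = 0
    for i in range(1, len(text)):
        prev, cur = text[i - 1], text[i]
        if prev == "\r" and cur == "\n":
            continue  # GB3: never break between CR and LF
        if prev not in "\r\n" and _is_extend_approx(cur):
            continue  # GB9: never break before Extend or ZWJ
        # GB4 (break after CR/LF) and the default rule (break everywhere)
        yield text[start:i]
        start = i
    if start < len(text):
        yield text[start:]
```

A table-driven automaton makes the same decisions by looking up a (state, property) pair instead of chaining conditions, which is what allows it to cover the full rule set, including emoji and Hangul sequences that this sketch ignores.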

the-knights-who-say-ni · 2017-07-11T23:12:56Z

Hello, and thanks for your contribution!

I'm a bot set up to make sure that the project can legally accept your contribution by verifying you have signed the PSF contributor agreement (CLA).

Unfortunately we couldn't find an account corresponding to your GitHub username on bugs.python.org (b.p.o) to verify you have signed the CLA (this might be simply due to a missing "GitHub Name" entry in your b.p.o account settings). This is necessary for legal reasons before we can look at your contribution. Please follow the steps outlined in the CPython devguide to rectify this issue.

Thanks again for your contribution and we look forward to looking at it!

Vermeille · 2017-08-02T15:43:47Z

Hello? Someone here?

serhiy-storchaka · 2017-08-03T10:40:42Z

Modules/unicodedata.c

+    0,                         /*tp_setattro*/
+    0,                         /*tp_as_buffer*/
+    Py_TPFLAGS_DEFAULT,
+    "Internal grapheme cluster iterator object.",           /* tp_doc */


I think the words "internal" and "object" are redundant.

"Internal", "iterator" and "object" are all redundant. "Grapheme cluster iterator" seems just right. What do you think?

Vermeille · 2018-01-11T10:47:54Z

Sorry for the long wait.

Are we good concerning the changes? Anything to add?

brettcannon · 2018-02-02T22:04:50Z

To try and help move older pull requests forward, we are going through and backfilling 'awaiting' labels on pull requests that are lacking the label. Based on the current reviews, the best we can tell in an automated fashion is that a core developer requested changes to be made to this pull request.

If/when the requested changes have been made, please leave a comment that says, I have made the requested changes; please review again. That will trigger a bot to flag this pull request as ready for a follow-up review.

csabella · 2020-05-23T16:46:41Z

@Vermeille, please take a look at the most recent comments on the bug tracker for this issue. It looks like the suggested path forward is different than the solution you proposed here. Thanks!

github-actions · 2022-02-20T00:09:47Z

This PR is stale because it has been open for 30 days with no activity.

github-actions · 2022-08-29T00:13:26Z

This PR is stale because it has been open for 30 days with no activity.

github-actions · 2023-03-18T00:10:47Z

This PR is stale because it has been open for 30 days with no activity.

bedevere-app · 2025-12-17T19:43:58Z

Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool.

If this change has little impact on Python users, wait for a maintainer to apply the skip news label instead.

bedevere-app · 2025-12-17T20:23:35Z

Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool.

If this change has little impact on Python users, wait for a maintainer to apply the skip news label instead.

serhiy-storchaka

I apologize that it took so long to start reviewing this PR seriously.

Now we need this algorithm to calculate the width of text in columns, which is needed to support wide characters in many parts of the stdlib (REPL, tracebacks, etc.). So we will add an implementation anyway. If you are busy or have lost interest, I will finish this work myself (keeping your credit), but if you are still interested, I would be happy to work together.

I wonder, what is the source of the state machine table? Did you create it from the original rules or from the table in GraphemeBreakTest.html, or copy it from another source? I am afraid that it is outdated and only supports legacy grapheme clusters. I can fix this, but maybe you already have a ready solution?
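To illustrate the column-width use case mentioned above, here is a rough per-code-point sketch. It is an assumption-laden approximation, not a proposed stdlib API: `approx_width` is a hypothetical name, zero-width handling is reduced to categories Mn, Me and Cf, and it does not segment into grapheme clusters at all, which is precisely the gap this PR is meant to close.

```python
import unicodedata


def approx_width(text):
    """Approximate the number of terminal columns `text` occupies."""
    width = 0
    for ch in text:
        if unicodedata.category(ch) in ("Mn", "Me", "Cf"):
            continue  # assumed zero-width: combining marks and format characters
        # Wide (W) and Fullwidth (F) characters take two columns.
        width += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return width
```

Counting per code point goes wrong for sequences such as emoji joined with ZWJ, where several wide code points render as a single two-column glyph; a cluster-aware version would measure each cluster once.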

serhiy-storchaka · 2025-12-17T20:24:46Z

Modules/unicodedata.c

+
+    self: self
+    unistr: unicode
+    start: int = 0


It should be Py_ssize_t. Some other variables should be Py_ssize_t, not int.

serhiy-storchaka · 2025-12-17T20:26:39Z

Modules/unicodedata.c

+    self: self
+    unistr: unicode
+    start: int = 0
+    end: Py_ssize_t(c_default="PY_SSIZE_T_MAX - 1") = sys.maxsize


It should be PY_SSIZE_T_MAX.

Although I am not sure that the end parameter is needed. The user can simply stop iteration at any time.
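A minimal sketch of stopping iteration on the caller's side instead of passing `end`. The iterator here is a hypothetical stand-in that yields one code point per segment together with its offset; it is not the iterator from this PR, whose items may look different.

```python
from itertools import takewhile


def iter_segments(text):
    # Hypothetical stand-in for a grapheme iterator: yields
    # (offset, segment) pairs, one code point per segment.
    for offset, ch in enumerate(text):
        yield offset, ch


def segments_before(text, end):
    """Collect the segments that start before offset `end`."""
    pairs = takewhile(lambda pair: pair[0] < end, iter_segments(text))
    return [segment for _, segment in pairs]
```

This only works if the caller can see where each segment starts, so whether `end` is redundant depends on what the iterator yields.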

merwok · 2025-12-17T22:05:51Z

Lib/test/test_unicodedata.py

+    @staticmethod
+    def check_version(testfile):
+        hdr = testfile.readline()
+        return unicodedata.unidata_version in hdr


What does the file header look like?

With substring containment tests, I worry about things like "8.0" in "18.0" matching wrongly. Could the full line be compared?

# GraphemeBreakTest-17.0.0.txt

We have the same check for normalization tests.
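A sketch of a stricter check, assuming the header always has the form shown above. The helper name `header_matches` and the regular expression are illustrative assumptions, not the code in the test suite.

```python
import re

# Assumed header shape: "# GraphemeBreakTest-<major>.<minor>.<patch>.txt"
_HEADER_RE = re.compile(r"# GraphemeBreakTest-(\d+\.\d+\.\d+)\.txt\s*")


def header_matches(header, version):
    """Return True only if the header names exactly `version`."""
    match = _HEADER_RE.fullmatch(header)
    return match is not None and match.group(1) == version
```

With the header above, `"7.0.0" in header` is true even though the file is for 17.0.0, while `header_matches(header, "7.0.0")` is false.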

bedevere-app · 2025-12-17T22:13:23Z

Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool.

If this change has little impact on Python users, wait for a maintainer to apply the skip news label instead.

add unicodedata.grapheme_cluster_break()

b79f969

the-knights-who-say-ni added the CLA not signed label Jul 11, 2017

Vermeille changed the title ~~WIP: add grapheme cluster break algorithm~~ bpo-30717: WIP: add grapheme cluster break algorithm Jul 11, 2017

generate unicodedata.c.h with clinic

c9a4211

the-knights-who-say-ni added CLA signed and removed CLA not signed labels Jul 13, 2017

patchcheck

7f56b78

Vermeille force-pushed the grapheme_cluster_break branch from 0f82f82 to 62fd6e0 on July 13, 2017 23:29

Vermeille changed the title ~~bpo-30717: WIP: add grapheme cluster break algorithm~~ bpo-30717: add grapheme cluster break algorithm Jul 14, 2017

Vermeille changed the title ~~bpo-30717: add grapheme cluster break algorithm~~ bpo-30717: add unicode grapheme cluster break algorithm Jul 14, 2017

add the grapheme cluster break automaton

a47de54

Vermeille force-pushed the grapheme_cluster_break branch from 62fd6e0 to a47de54 on July 14, 2017 03:21

add my name to Misc/ACKS

c152171

serhiy-storchaka reviewed Aug 3, 2017

Modules/unicodedata.c (outdated, resolved)

Modules/unicodedata.c (outdated, resolved)

Modules/grapheme_cluster_break_automaton.h (outdated, resolved)

Tools/unicode/makeunicodedata.py (outdated, resolved)

serhiy-storchaka added the type-feature label Aug 3, 2017

code review fixes

b103be7

methane requested changes Aug 3, 2017

Modules/unicodedata.c (outdated, resolved)

Modules/unicodedata.c (outdated, resolved)

Modules/unicodedata.c (outdated, resolved)

rename break_graphemes to iter_graphemes

c9848e2

serhiy-storchaka reviewed Aug 3, 2017

Vermeille added 2 commits August 3, 2017 18:23

make GraphemeClusterIterator a GC type

2dee91e

allow iterating only over a range of indices

a5b3c10

brettcannon added the awaiting changes label Feb 2, 2018

github-actions bot added the stale label Feb 20, 2022

ezio-melotti removed the CLA signed label Jul 13, 2022

github-actions bot removed the stale label Jul 28, 2022

github-actions bot added the stale label Aug 29, 2022

arhadthedev changed the title ~~bpo-30717: add unicode grapheme cluster break algorithm~~ gh-74902: add unicode grapheme cluster break algorithm Feb 14, 2023

github-actions bot removed the stale label Feb 15, 2023

github-actions bot added the stale label Mar 18, 2023

serhiy-storchaka mentioned this pull request Dec 11, 2025

Add functions to get the width in columns of a character #56777

Open

Vermeille mannequin mentioned this pull request Dec 11, 2025

Add unicode grapheme cluster break algorithm #74902

Open

Merge branch 'main' into grapheme_cluster_break

ba1e9b7

Add some tests.

ae06154

serhiy-storchaka reviewed Dec 17, 2025

Save Grapheme_Cluster_Break for unassigned code points.

1965ed6

merwok reviewed Dec 17, 2025

Make "Any" the first entry.

164d19b

serhiy-storchaka marked this pull request as draft December 17, 2025 22:16

bedevere-app bot removed the awaiting changes label Dec 17, 2025

serhiy-storchaka self-assigned this Dec 17, 2025

github-actions bot removed the stale label Dec 18, 2025
