gh-91576: Speed up iteration of strings #91574
6eeeee0 to 0a84504
Happy to help review this, let me know when you're ready.
@JelleZijlstra Finished.
gvanrossum
left a comment
Why not use the specialized iterator for all Latin-1 strings?
That would add one more branch instruction, which I was trying to avoid, and Latin-1 is rare compared to ASCII.
gvanrossum
left a comment
You should just be able to test
PyUnicode_KIND(unicode) == PyUnicode_1BYTE_KIND
to decide which iterator to create, right? Or can kind be changed (once the object is "ready")?
Given that this is a fixed cost (once per iterator construction) I think the extra branch won't be noticeable. Latin-1 may be rare compared to ASCII but it's still got some common characters and it would be essentially free.
No, the cost is a branch instruction on each iteration, as ASCII and Latin-1 use different structures.
Hm, couldn't you just store a pointer to the array of bytes (and another to the end) rather than an index? Or is it possible that the bytes move around somehow?
See the LATIN1 macro in unicodeobject.c.
It requires checking whether ch is less than 128, and then it indexes a different array depending on the result of that comparison.
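As a rough illustration of the point about the LATIN1 macro, the per-character lookup can be modeled in Python. This is only a sketch under assumptions: the names `ASCII_TABLE`, `LATIN1_TABLE`, `ascii_lookup`, and `latin1_lookup` are hypothetical stand-ins for two separate caches of one-character strings, not CPython's actual C code.

```python
# Hypothetical model: two separate caches of one-character strings.
ASCII_TABLE = [chr(i) for i in range(128)]        # code points 0..127
LATIN1_TABLE = [chr(i) for i in range(128, 256)]  # code points 128..255


def ascii_lookup(ch: int) -> str:
    """ASCII-only path: a single table, so no comparison is needed."""
    return ASCII_TABLE[ch]


def latin1_lookup(ch: int) -> str:
    """Latin-1 path: a comparison decides which table to index."""
    if ch < 128:
        return ASCII_TABLE[ch]
    return LATIN1_TABLE[ch - 128]
```

In this model the comparison in `latin1_lookup` runs once per character, which is the per-iteration branch the ASCII-only path avoids.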
How does this affect performance when ASCII and non-ASCII characters are mixed together in the same string?
Oh, I see. That's a bit unfortunate but I see your point and I guess ASCII strings are somewhat special anyways.
In that case the representation of the whole string will not use the "compact ASCII" format and we'll be using the regular (slow) iterator. @kumaraditya303 Please address the other review comments.
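From Python code, whether a string qualifies can be approximated with `str.isascii()`. The sketch below assumes that an all-ASCII string corresponds to the "compact ASCII" format mentioned above; `is_ascii_only` is a hypothetical helper name, and which iterator the interpreter actually selects is an internal detail this code cannot observe.

```python
def is_ascii_only(s: str) -> bool:
    """Return True if every character in s is ASCII (assumed fast-path proxy)."""
    return s.isascii()


ascii_text = "hello"
mixed_text = "h\u00e9llo"  # one non-ASCII character (e with acute accent)

# A single non-ASCII character makes the whole string non-ASCII.
ascii_flag = is_ascii_only(ascii_text)
mixed_flag = is_ascii_only(mixed_text)

# Iteration yields the same kind of result either way: one-character strings.
ascii_chars = list(ascii_text)
mixed_chars = list(mixed_text)
```

Either way the observable behavior of iteration is identical; only the internal path taken may differ.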
Added some tests and addressed comments.
🤖 New build scheduled with the buildbot fleet by @kumaraditya303 for commit ad2d676 🤖 If you want to schedule another build, you need to add the ":hammer: test-with-buildbots" label again.
🤖 New build scheduled with the buildbot fleet by @kumaraditya303 for commit 56d110c 🤖 If you want to schedule another build, you need to add the ":hammer: test-with-buildbots" label again.
erlend-aasland
left a comment
Looks good!
gvanrossum
left a comment
I'm happy now!
Benchmark Script:
Results:
Closes #91576