bpo-41972: Use the "Two-Way" algorithm when searching for long substrings#22679
bpo-41972: Use the "Two-Way" algorithm when searching for long substrings#22679sweeneyde wants to merge 21 commits intopython:masterfrom
Conversation
eamanu
left a comment
There was a problem hiding this comment.
IMO you should send the PR when it is ready, to avoid maintain WIP PR for a long time
Disagree in this case: this is a change to extremely important core functionality, so needs to be made as easy as possible for others to try. The changes are all in one file that's typically altered less than once per year (for example, the most recent change was an edit to the comments over a year ago - the most recent non-trivial change was in March of 2017). |
|
The most recent batch of commits added a jump table. The benchmark data is here: https://pastebin.com/raw/bzQ4xQgM |
|
@sweeneyde, could you share how exactly you're running the benchmarks? I'd like to check this out myself :) |
|
I used this static file containing a bunch of randomly generated strings:
import random
from string import ascii_uppercase as alphabet
zipf = [1/x for x in range(1, 1+26)]
def zipf_string(length):
letters = random.choices(alphabet, weights=zipf, k=length)
return ''.join(letters)Then I ran the benchmarks with this script ( from lots_of_benches import needles, haystack
needles: list[str]
haystack: str
from pyperf import Runner
runner = Runner()
for needle in needles:
n = len(needle)
abbrev = needle if n <= 10 else f"{needle[:10]}..."
runner.timeit(
name=f"length={n}, value={abbrev}",
stmt=f"needle in haystack",
globals=globals(),
)Run as |
|
@maartenbreddels if you are interested in string algorithms, this may tick your curiosity. |
|
@sweeneyde, why did you close this? |
|
The source code was too heavily based on glibc's, which has a more restrictive license that forbids commercial use. I opened this as a "clean-room" implementation instead: |
https://bugs.python.org/issue41972