bpo-41972: Use the "Two-Way" algorithm when searching for long substrings#22679

sweeneyde · 2020-10-12T23:43:31Z

https://bugs.python.org/issue41972

eamanu

IMO you should send the PR when it is ready, to avoid maintain WIP PR for a long time

tim-one · 2020-10-13T03:23:58Z

IMO you should send the PR when it is ready, to avoid maintain WIP PR for a long time

Disagree in this case: this is a change to extremely important core functionality, so needs to be made as easy as possible for others to try. The changes are all in one file that's typically altered less than once per year (for example, the most recent change was an edit to the comments over a year ago - the most recent non-trivial change was in March of 2017).

Objects/stringlib/fastsearch.h

sweeneyde · 2020-10-14T21:12:04Z

The most recent batch of commits added a jump table.
Comparing master with this PR now, 151 cases are slower than master and 463 are faster.
The slower cases are at most twice as slow, but the faster cases are often 10-20x faster.
I could add a cutoff to use a simpler algorithm instead, for needles of length less than ~10,
but I wanted to get the "purer" data out before making that change.
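For readers unfamiliar with the idea, here is a minimal Python sketch of a jump (shift) table combined with a length cutoff. It is an assumption-laden illustration, not the PR's C code: it uses a plain Horspool-style table rather than the Two-Way algorithm, and the names `build_shift_table`, `shift_table_find`, `naive_find`, `find`, and the default `cutoff=10` are hypothetical.

```python
def build_shift_table(needle):
    """Map each character to how far the search window may slide when
    that character is aligned with the needle's last position."""
    last = len(needle) - 1
    # Later occurrences overwrite earlier ones, so each entry ends up
    # holding the smallest safe shift for that character.
    return {ch: last - i for i, ch in enumerate(needle[:-1])}


def naive_find(haystack, needle):
    """Simple scan, used here as the fallback for short needles."""
    for i in range(len(haystack) - len(needle) + 1):
        if haystack.startswith(needle, i):
            return i
    return -1


def shift_table_find(haystack, needle):
    """Horspool-style search driven by the shift table."""
    n, m = len(haystack), len(needle)
    if m == 0:
        return 0
    table = build_shift_table(needle)
    i = 0
    while i + m <= n:
        if haystack.startswith(needle, i):
            return i
        # Characters absent from the table allow a full-length jump.
        i += table.get(haystack[i + m - 1], m)
    return -1


def find(haystack, needle, cutoff=10):
    """Dispatch on needle length; the cutoff value is illustrative."""
    if len(needle) < cutoff:
        return naive_find(haystack, needle)
    return shift_table_find(haystack, needle)
```

The table costs time to build, which is one reason a cutoff for very short needles can pay off: for those, the setup may outweigh the jumps it enables.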

The benchmark data is here: https://pastebin.com/raw/bzQ4xQgM

taleinat · 2020-10-18T21:20:43Z

@sweeneyde, could you share how exactly you're running the benchmarks? I'd like to check this out myself :)

sweeneyde · 2020-10-18T21:34:44Z

I used this static file containing a bunch of randomly generated strings:
https://gist.github.com/sweeneyde/f77ccf0d25f9c41d41163b11fe64e9b4
This file has:

- One length-1_000_000 haystack
- Needles with lengths ranging from 1 to 10, then increasing by 50% + 1 up until 100_000
- ~20 needles of each length

All were generated with this function:

import random
from string import ascii_uppercase as alphabet
zipf = [1/x for x in range(1, 1+26)]

def zipf_string(length):
    letters = random.choices(alphabet, weights=zipf, k=length)
    return ''.join(letters)
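As a rough sketch of how a data set with that shape could be rebuilt, the block below repeats the generator and adds a length schedule. The helper `needle_lengths`, the reading of "increasing by 50% + 1" as `n + n // 2 + 1`, the fixed seed, and the reduced sizes are all assumptions made for illustration; the gist linked above is the actual data.

```python
import random
from string import ascii_uppercase as alphabet

# Zipf-like weights: 'A' is most likely, 'Z' least likely.
zipf = [1 / x for x in range(1, 1 + 26)]


def zipf_string(length):
    letters = random.choices(alphabet, weights=zipf, k=length)
    return "".join(letters)


def needle_lengths(limit=100_000):
    """Lengths 1..10, then grow by 50% + 1 while staying within limit.

    The integer rounding (n // 2) is an assumption, not from the thread.
    """
    lengths = list(range(1, 11))
    n = lengths[-1]
    while True:
        n = n + n // 2 + 1
        if n > limit:
            break
        lengths.append(n)
    return lengths


random.seed(0)  # only to make this sketch reproducible
# Scaled down from the sizes described above to keep the example quick.
haystack = zipf_string(10_000)
needles = [zipf_string(n) for n in needle_lengths(1_000) for _ in range(3)]
```

Raising the haystack size to 1_000_000, the limit to 100_000, and the repeat count to about 20 would approximate the sizes described in the comment.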

Then I ran the benchmarks with this script (string_benchmarks.py):

from lots_of_benches import needles, haystack
needles: list[str]
haystack: str

from pyperf import Runner
runner = Runner()

for needle in needles:
    n = len(needle)
    abbrev = needle if n <= 10 else f"{needle[:10]}..."
    runner.timeit(
        name=f"length={n}, value={abbrev}",
        stmt="needle in haystack",
        globals=globals(),
    )

Run as python string_benchmarks.py -o whatever.json
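Running the script once per build gives two result files, and counts of slower and faster cases like those quoted earlier in the thread can then be tallied. The sketch below shows one possible way to do that on plain dictionaries of mean timings; the function name `tally`, the 5% tolerance, and every number in the sample data are made up for illustration and are not taken from the benchmark results.

```python
def tally(base, changed, tolerance=0.05):
    """Count benchmarks that got faster or slower between two runs.

    base and changed map benchmark name -> mean time in seconds.
    Ratios within +/- tolerance of 1.0 are counted as unchanged.
    """
    counts = {"faster": 0, "slower": 0, "same": 0}
    worst_slowdown = 1.0  # largest changed/base ratio seen
    best_speedup = 1.0    # largest base/changed ratio seen
    for name, old in base.items():
        new = changed[name]
        ratio = new / old
        if ratio > 1 + tolerance:
            counts["slower"] += 1
            worst_slowdown = max(worst_slowdown, ratio)
        elif ratio < 1 - tolerance:
            counts["faster"] += 1
            best_speedup = max(best_speedup, old / new)
        else:
            counts["same"] += 1
    return counts, worst_slowdown, best_speedup


# Hypothetical timings, keyed the way the script above names its benchmarks.
master = {"length=1": 1.0e-4, "length=16": 2.0e-3, "length=1019": 4.0e-3}
patched = {"length=1": 1.8e-4, "length=16": 2.0e-3, "length=1019": 2.5e-4}

counts, worst, best = tally(master, patched)
print(counts, f"worst slowdown {worst:.1f}x", f"best speedup {best:.1f}x")
```

pyperf also ships a `compare_to` command for comparing result files directly, which is likely the more convenient route in practice.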

pitrou · 2020-10-19T14:34:42Z

@maartenbreddels if you are interested in string algorithms, this may pique your curiosity.

taleinat · 2020-10-26T08:47:57Z

@sweeneyde, why did you close this?

sweeneyde · 2020-10-26T08:51:57Z

The source code was too heavily based on glibc's, which is released under a more restrictive license (the LGPL) than CPython's.

I opened this as a "clean-room" implementation instead:
#22904

sweeneyde added 4 commits October 11, 2020 16:56

initial implementation

899f289

refactoring

743a382

formatting fixes

fdb6800

add shift and bloom

737ac8a

the-knights-who-say-ni added the CLA signed label Oct 12, 2020

bedevere-bot added the awaiting review label Oct 12, 2020

📜🤖 Added by blurb_it.

658038d

sweeneyde changed the title ~~bpo-41972: Use the "Two-Way" algorithm for substring search~~ bpo-41972: (WIP) Use the "Two-Way" algorithm for substring search Oct 12, 2020

eamanu suggested changes Oct 13, 2020

bedevere-bot added awaiting core review and removed awaiting review labels Oct 13, 2020

DavidMertz reviewed Oct 13, 2020

Objects/stringlib/fastsearch.h (outdated)

sweeneyde added 5 commits October 13, 2020 23:59

add alternating find_char calls

b62e4c6

USe a shift table

25a61fb

compute a shift for the last character

64b9a0a

Remove unnecessary special case

5568ca2

Merge branch 'two-way' of https://github.com/sweeneyde/cpython into t…

f2054e0

…wo-way

sweeneyde added 4 commits October 15, 2020 03:10

removed unneeded shift computation

415f492

restore original code with special case for long needles

06c3678

Minor code cleanups

89bdc34

Restore comment and fix typo

9d7bbc3

sweeneyde marked this pull request as ready for review October 17, 2020 20:01

sweeneyde changed the title ~~bpo-41972: (WIP) Use the "Two-Way" algorithm for substring search~~ bpo-41972: Use the "Two-Way" algorithm for substring search Oct 17, 2020

sweeneyde changed the title ~~bpo-41972: Use the "Two-Way" algorithm for substring search~~ bpo-41972: Use the "Two-Way" algorithm when searching for long substrings Oct 17, 2020

sweeneyde and others added 4 commits October 17, 2020 16:03

Update 2020-10-12-23-46-49.bpo-41972.0pHodE.rst

c331307

Add test cases catered to the new algorithm

66377ce

Merge branch 'two-way' of https://github.com/sweeneyde/cpython into t…

e718e5a

…wo-way

Fix typo

cf4e398

sweeneyde added 3 commits October 19, 2020 01:23

add a cutoff for haystack length

3933992

simplify a couple of lines

c8e54c6

Add better threshholds

8208853

sweeneyde closed this Oct 19, 2020

sweeneyde deleted the two-way branch December 19, 2021 05:48
