Finding multiple substrings in a string without iterating over it multiple times

Question

I need to find if items from a list appear in a string, and then add the items to a different list. This code works:

data =[]
line = 'akhgvfalfhda.dhgfa.lidhfalihflaih**Thing1**aoufgyafkugafkjhafkjhflahfklh**Thing2**dlfkhalfhafli...'
_legal = ['thing1', 'thing2', 'thing3', 'thing4',...] 
for i in _legal:
    if i in line:
        data.append(i)

However, the code iterates over line (which could be long) multiple times- as many times as there are item in _legal (which could be a lot). That's too slow for me, and I'm searching for a way to do it faster. line doesn't have any specific format, so using .split() couldn't work, as far as I know. Edit: changed line so that it better represents the problems.

@IoaTzimas No, all values in the list _legal are different. — Yarden Gal
– Yarden Gal, Commented Oct 5, 2020 at 19:16
probably a stretch, but a modified version of KMP might be able to achieve this in a single pass, especially if strings in _legal have common prefixes — DeepSpace
– DeepSpace, Commented Oct 5, 2020 at 19:17
You are looking for aho-corasick algorithm. en.m.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm — pen_and_paper
– pen_and_paper, Commented Oct 5, 2020 at 20:49

SomeDude · Accepted Answer · 2020-10-05 21:03:53Z

4

One way I could think of to improve is:

Get all unique lengths of the words in _legal
Build a dictionary of words from line of those particular lengths using a sliding window technique. The complexity should be O( len(line)*num_of_unique_lengths ), this should be better than brute force.
Now look for each thing in the dictionary in O(1).

Code:

line = 'thing1 thing2 456 xxualt542l lthin. dfjladjfj lauthina '
_legal = ['thing1', 'thing2', 'thing3', 'thing4', 't5', '5', 'fj la']
ul = {len(i) for i in _legal}
s=set()
for l in ul:
    s = s.union({line[i:i+l] for i in range(len(line)-l)})
print(s.intersection(set(_legal)))

Output:

{'thing1', 'fj la', 'thing2', 't5', '5'}

edited Oct 5, 2020 at 21:03

answered Oct 5, 2020 at 19:33

SomeDude

14.3k5 gold badges26 silver badges49 bronze badges

Sign up to request clarification or add additional context in comments.

2 Comments

IoaTzimas Over a year ago

I added an answer based on your thinking. Feel free to copy it and let me know so that i will remove mine

SomeDude Over a year ago

@IoaTzimas Thanks. I thought I will add my own code and test things like words with spaces and added.

s3dev · Accepted Answer · 2020-10-05 20:23:20Z

1

One approach is to build a very simple regex pattern, and use re.findall() to find/extract any matched words in the string.

import re

line = 'akhgvfalfhda.dhgfa.lidhfalihflaih**Thing1**aoufgyafkugafkjhafkjhflahfklh**Thing2**dlfkhalfhafli...'
_legal = ['thing1', 'thing2', 'thing3', 'thing4']

exp = re.compile(r'|'.join(_legal), re.IGNORECASE)
exp.findall(line)

>>> ['Thing1', 'Thing2']

edited Oct 5, 2020 at 20:23

answered Oct 5, 2020 at 19:32

s3dev

9,9293 gold badges34 silver badges50 bronze badges

Comments

pho · Accepted Answer · 2020-10-05 20:25:34Z

0

The following should work quite efficiently. It implements the suggestion from @SomeDude given above

lens=set([len(i) for i in _legal])
d={}
for k in lens:
    d[k]=[line[i:i+k] for i in range(len(line)-k)]
s=set(sum(d.values(), []))
result=list(s.intersection(set(_legal)))

for the following data (i changed "line" a bit as it returned an empty list due to uppecase in Thing1 and Thing2)

line = 'akhgvfalfhda.dhgfa.lidhfalihflaih**thing1**aoufgyafkugafkjhafkjhflahfklh**thing2**dlfkhalfhafli...'
_legal = ['thing1', 'thing2', 'thing3', 'thing4',...]

Output is:

print(result)

['thing2', 'thing1']

Explanation: We saved all possible lengths of the words in substrings. For these lengths, we created all possible substrings in text (set s). Finally we found common items in s and substrings, which is the answer to the question.

edited Oct 5, 2020 at 20:25

pho

25.7k8 gold badges48 silver badges76 bronze badges

answered Oct 5, 2020 at 20:04

IoaTzimas

10.7k2 gold badges15 silver badges32 bronze badges

6 Comments

s3dev Over a year ago

Quick observation, the line value in OP shows Thing1 and Thing2, in title case.

IoaTzimas Over a year ago

yes, i changed it a little, as it would return an empty list otherwise

s3dev Over a year ago

Not being funny, I believe hacking source data to match a desired output is slightly frowned upon ...

IoaTzimas Over a year ago

Are you really in the philosophy of the question?

pho Over a year ago

Implementing somebody else's answer as code without crediting them is also frowned upon... In fact, I'd go so far as to say it's pretty dishonest

|

Collectives™ on Stack Overflow

Finding multiple substrings in a string without iterating over it multiple times

3 Answers 3

2 Comments

Comments

6 Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

2 Comments

Comments

6 Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related