9

I need to find if items from a list appear in a string, and then add the items to a different list. This code works:

data =[]
line = 'akhgvfalfhda.dhgfa.lidhfalihflaih**Thing1**aoufgyafkugafkjhafkjhflahfklh**Thing2**dlfkhalfhafli...'
_legal = ['thing1', 'thing2', 'thing3', 'thing4',...] 
for i in _legal:
    if i in line:
        data.append(i)

However, the code iterates over line (which could be long) multiple times- as many times as there are item in _legal (which could be a lot). That's too slow for me, and I'm searching for a way to do it faster. line doesn't have any specific format, so using .split() couldn't work, as far as I know. Edit: changed line so that it better represents the problems.

13
  • Do you have duplicate values in the list? Commented Oct 5, 2020 at 19:14
  • are the items of equal length? Commented Oct 5, 2020 at 19:16
  • @IoaTzimas No, all values in the list _legal are different. Commented Oct 5, 2020 at 19:16
  • 2
    probably a stretch, but a modified version of KMP might be able to achieve this in a single pass, especially if strings in _legal have common prefixes Commented Oct 5, 2020 at 19:17
  • 3
    You are looking for aho-corasick algorithm. en.m.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm Commented Oct 5, 2020 at 20:49

3 Answers 3

4

One way I could think of to improve is:

  • Get all unique lengths of the words in _legal
  • Build a dictionary of words from line of those particular lengths using a sliding window technique. The complexity should be O( len(line)*num_of_unique_lengths ), this should be better than brute force.
  • Now look for each thing in the dictionary in O(1).

Code:

line = 'thing1 thing2 456 xxualt542l lthin. dfjladjfj lauthina '
_legal = ['thing1', 'thing2', 'thing3', 'thing4', 't5', '5', 'fj la']
ul = {len(i) for i in _legal}
s=set()
for l in ul:
    s = s.union({line[i:i+l] for i in range(len(line)-l)})
print(s.intersection(set(_legal)))

Output:

{'thing1', 'fj la', 'thing2', 't5', '5'}
Sign up to request clarification or add additional context in comments.

2 Comments

I added an answer based on your thinking. Feel free to copy it and let me know so that i will remove mine
@IoaTzimas Thanks. I thought I will add my own code and test things like words with spaces and added.
1

One approach is to build a very simple regex pattern, and use re.findall() to find/extract any matched words in the string.

import re

line = 'akhgvfalfhda.dhgfa.lidhfalihflaih**Thing1**aoufgyafkugafkjhafkjhflahfklh**Thing2**dlfkhalfhafli...'
_legal = ['thing1', 'thing2', 'thing3', 'thing4']

exp = re.compile(r'|'.join(_legal), re.IGNORECASE)
exp.findall(line)

>>> ['Thing1', 'Thing2']

Comments

0

The following should work quite efficiently. It implements the suggestion from @SomeDude given above

lens=set([len(i) for i in _legal])
d={}
for k in lens:
    d[k]=[line[i:i+k] for i in range(len(line)-k)]
s=set(sum(d.values(), []))
result=list(s.intersection(set(_legal)))

for the following data (i changed "line" a bit as it returned an empty list due to uppecase in Thing1 and Thing2)

line = 'akhgvfalfhda.dhgfa.lidhfalihflaih**thing1**aoufgyafkugafkjhafkjhflahfklh**thing2**dlfkhalfhafli...'
_legal = ['thing1', 'thing2', 'thing3', 'thing4',...]

Output is:

print(result)

['thing2', 'thing1']

Explanation: We saved all possible lengths of the words in substrings. For these lengths, we created all possible substrings in text (set s). Finally we found common items in s and substrings, which is the answer to the question.

6 Comments

Quick observation, the line value in OP shows Thing1 and Thing2, in title case.
yes, i changed it a little, as it would return an empty list otherwise
Not being funny, I believe hacking source data to match a desired output is slightly frowned upon ...
Are you really in the philosophy of the question?
Implementing somebody else's answer as code without crediting them is also frowned upon... In fact, I'd go so far as to say it's pretty dishonest
|

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.