Given a string, typically a sentence, I want to extract all substrings of lengths 3, 4, 5, and 6 from each of its words. How can I achieve this efficiently using only Python's standard library? Here is my approach; I am looking for a faster one. To me it seems the three nested loops are inevitable either way, but maybe there is a low-level optimized solution using itertools or similar.
import time

def naive(test_sentence, start, end):
    grams = []
    for word in test_sentence:
        for size in range(start, end):
            for i in range(len(word)):
                k = word[i:i+size]
                if len(k)==size:
                    grams.append(k)
    return grams

n = 10**6
start, end = 3, 7
test_sentence = "Hi this is a wonderful test sentence".split(" ")
start_time = time.time()
for _ in range(n):
    naive(test_sentence, start, end)
end_time = time.time()
print(f"{end-start} seconds for naive approach")
Output of naive():
['thi', 'his', 'this', 'won', 'ond', 'nde', 'der', 'erf', 'rfu', 'ful', 'wond', 'onde', 'nder', 'derf', 'erfu', 'rful', 'wonde', 'onder', 'nderf', 'derfu', 'erful', 'wonder', 'onderf', 'nderfu', 'derful', 'tes', 'est', 'test', 'sen', 'ent', 'nte', 'ten', 'enc', 'nce', 'sent', 'ente', 'nten', 'tenc', 'ence', 'sente', 'enten', 'ntenc', 'tence', 'senten', 'entenc', 'ntence']
Second version:
def naive2(test_sentence,start,end):
    grams = []
    for word in test_sentence:
        if len(word) >= start:
            for size in range(start,end):
                for i in range(len(word)-size+1):
                    grams.append(word[i:i+size])
    return grams
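For comparison, the same three loops can be written as a single list comprehension. This is only a sketch: the name `ngrams_comprehension` is made up for illustration, and whether it is actually faster than `naive2` on a given Python version has to be measured rather than assumed. The explicit `len(word) >= start` guard is dropped because `range(len(word) - size + 1)` is already empty for words that are too short.

```python
def ngrams_comprehension(words, start, end):
    """Collect every substring of length start..end-1 from each word."""
    # Same iteration order as naive2: per word, then per size, then per offset.
    return [
        word[i:i + size]
        for word in words
        for size in range(start, end)
        for i in range(len(word) - size + 1)  # empty range for short words
    ]


words = "Hi this is a wonderful test sentence".split(" ")
grams = ngrams_comprehension(words, 3, 7)
print(grams[:3])   # ['thi', 'his', 'this']
print(len(grams))  # 46
```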
The len(k)==size check can be eliminated: the only way it can fail is if the slice starts too close to the end of the word, and that is better handled by reducing the range of the for i loop. Also, do you really need all of the substrings to exist at the same time, in a list? Memory usage could be vastly reduced by yielding them one at a time from a generator function.

{end-start} seconds is not right. Could you fix that and also show your times for the two solutions?
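The suggestions in the comments above could be sketched roughly as follows. The generator name `iter_ngrams` is hypothetical, the repetition count is reduced from the question's 10**6 to keep the run short, and `time.perf_counter()` is swapped in for `time.time()` as an assumption about what suits interval timing. The elapsed time is computed from `start_time` and `end_time`, not from the size bounds.

```python
import time


def iter_ngrams(words, start, end):
    """Yield substrings of length start..end-1 one at a time (no list is built)."""
    for word in words:
        for size in range(start, end):
            # Stopping at len(word) - size + 1 makes a length check unnecessary.
            for i in range(len(word) - size + 1):
                yield word[i:i + size]


words = "Hi this is a wonderful test sentence".split(" ")

n = 10**4  # assumed smaller repetition count, for a quick run
start_time = time.perf_counter()
for _ in range(n):
    for _gram in iter_ngrams(words, 3, 7):
        pass  # consume the generator without storing the substrings
end_time = time.perf_counter()
print(f"{end_time - start_time} seconds for generator approach")
```

A generator saves memory rather than time, so it only pays off when the substrings are consumed one by one; calling `list(iter_ngrams(...))` rebuilds the full list.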