
Suppose I have two lists list_1 and list_2

list_1 = [1, 5, 10]
list_2 = [3, 4, 15]

I want to get a list of tuples containing elements from both list_1 and list_2 such that the difference between the numbers in a tuple is under some constant c.

E.g., if c is 2, the tuples I would get would be: [(1, 3), (5, 3), (5, 4)]

Of course one can iterate over both lists and check that the difference between the two elements is less than c, but that has O(n^2) complexity, and I would rather reduce that.
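For reference, the naive pairwise check can be sketched as follows (note that, to match the example output, a difference of exactly c counts, i.e. the comparison is <=):

```python
def close_pairs_naive(list_1, list_2, c):
    # Check every pair: O(len(list_1) * len(list_2)) comparisons.
    return [(x, y) for x in list_1 for y in list_2 if abs(x - y) <= c]

list_1 = [1, 5, 10]
list_2 = [3, 4, 15]
print(close_pairs_naive(list_1, list_2, 2))
# [(1, 3), (5, 3), (5, 4)]
```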

  • Since in the worst case the output will have O(n^2) pairs, I don't see how you could avoid an algorithm with that complexity, although that doesn't mean that you can't do better than the naive nested-loop approach. Commented Oct 28, 2021 at 17:14
  • You can slightly improve by sorting one list and using bisect to find the range of relevant elements, but, as John noted, the worst case will remain O(n^2). Commented Oct 28, 2021 at 17:23
  • How large are the ranges? I mean, how large can c be? Commented Oct 28, 2021 at 17:32
  • I think you could do marginally better if both lists were sorted. Your example has both lists sorted; is that a guarantee in your problem or just a coincidence? Commented Oct 28, 2021 at 17:35
  • Could you provide code to generate larger data that's realistic for you so we can benchmark more meaningfully? Commented Oct 29, 2021 at 19:34
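As the first comment notes, the output size alone can be quadratic, so no algorithm can beat O(n^2) in the worst case. A minimal sketch demonstrating that worst case, assuming two lists whose values are all within c of each other:

```python
# Worst case for ANY algorithm: every element of one list is within c of
# every element of the other, so the output itself has n * n pairs.
n = 100
list_1 = [0] * n
list_2 = [1] * n
c = 2

pairs = [(x, y) for x in list_1 for y in list_2 if abs(x - y) <= c]
print(len(pairs))  # 10000, i.e. n * n
```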

2 Answers


Here is an implementation of Marat's idea from the comments:

import bisect

def close_pairs(list1, list2, c):
    # assumes that list2 is sorted
    for x in list1:
        i = bisect.bisect_left(list2, x - c)   # first index with list2[i] >= x - c
        j = bisect.bisect_right(list2, x + c)  # first index with list2[j] > x + c
        yield from ((x, y) for y in list2[i:j])

list_1 = [1, 5, 10]
list_2 = [3, 4, 15]
print(list(close_pairs(list_1, list_2, 2)))
# prints [(1, 3), (5, 3), (5, 4)]

To demonstrate the potential improvement of this strategy over what might be thought of as the "naive" approach, let's timeit.

import timeit

setup_naive = '''
import numpy
list_a = numpy.random.randint(0, 2500, 500).tolist()
list_b = numpy.random.randint(0, 2500, 500).tolist()
c = 2
def close_pairs(list_a, list_b, c):
    yield from ((x,y) for x in list_a for y in list_b if abs(x-y) <= c)
'''

setup_john_coleman = '''
import bisect
import numpy
list_a = numpy.random.randint(0, 2500, 500).tolist()
list_b = numpy.random.randint(0, 2500, 500).tolist()
c = 2
def close_pairs(list_a, list_b, c):
    list_a = sorted(list_a)
    list_b = sorted(list_b)
    for x in list_a:
        i = bisect.bisect_left(list_b,x-c)
        j = bisect.bisect_right(list_b,x+c)
        yield from ((x,y) for y in list_b[i:j])
'''

print(f"john_coleman: {timeit.timeit('list(close_pairs(list_a, list_b, c))', setup=setup_john_coleman, number=1000):.2f}")
print(f"naive: {timeit.timeit('list(close_pairs(list_a, list_b, c))', setup=setup_naive, number=1000):.2f}")

On a handy laptop that gives results like:

john_coleman: 0.50
naive: 18.35

3 Comments

Just a heads up that with some simple test data on my system, this is an order of magnitude faster than the naive approach even after including the required sort. I can add the timeit code to this answer if you like.
@JonSG You could add it here if you want, or have a separate answer. A separate answer could be helpful since it could be extended if someone else were to post an alternative, perhaps one that is able to exploit both lists being sorted.
@JonSG Yes please, it would be very useful to see a comparison of both.

If the lists are already sorted, as your example suggests, you can remove the two sorting calls below, and then this has runtime complexity O(M+N+P), where M and N are the list sizes and P is the number of close pairs. It keeps an index i such that ys[i] is the smallest y-value that is not too small, and then walks over ys[i:...] for as long as the values are not too large, yielding each pair.

def close_pairs(xs, ys, c):
    xs = sorted(xs)
    ys = sorted(ys) + [float('inf')]  # sentinel so the inner walk always stops
    i = 0
    for x in xs:
        while x - ys[i] > c:   # skip y-values too small for this (and every later) x
            i += 1
        j = i
        while ys[j] - x <= c:  # yield y-values within distance c of x
            yield x, ys[j]
            j += 1
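A quick sanity check of the two-pointer walk against the question's example (the function is repeated here so the snippet runs on its own):

```python
def close_pairs(xs, ys, c):
    xs = sorted(xs)
    ys = sorted(ys) + [float('inf')]  # sentinel stops the inner walk
    i = 0
    for x in xs:
        while x - ys[i] > c:
            i += 1
        j = i
        while ys[j] - x <= c:
            yield x, ys[j]
            j += 1

print(list(close_pairs([1, 5, 10], [3, 4, 15], 2)))
# [(1, 3), (5, 3), (5, 4)]
```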

Benchmark results with lists/ranges 1000 times larger than your example:

 904.4 ms  close_pairs_naive
   4.9 ms  close_pairs_John_Coleman
   1.8 ms  close_pairs_Kelly_Bundy

Benchmark code:

from timeit import timeit
import random
import bisect
from collections import deque

def close_pairs_naive(list_a, list_b, c):
    yield from ((x,y) for x in list_a for y in list_b if abs(x-y) <= c)

def close_pairs_John_Coleman(list_a, list_b, c):
    list_a = sorted(list_a)
    list_b = sorted(list_b)
    for x in list_a:
        i = bisect.bisect_left(list_b,x-c)
        j = bisect.bisect_right(list_b,x+c)
        yield from ((x,y) for y in list_b[i:j])

def close_pairs_Kelly_Bundy(xs, ys, c):
    xs = sorted(xs)
    ys = sorted(ys) + [float('inf')]
    i = 0
    for x in xs:
        while x - ys[i] > c:
            i += 1
        j = i
        while ys[j] - x <= c:
            yield x, ys[j]
            j += 1

funcs = [
    close_pairs_naive,
    close_pairs_John_Coleman,
    close_pairs_Kelly_Bundy,
]

xs = random.choices(range(15000), k=3000)
ys = random.choices(range(15000), k=3000)
c = 2
args = xs, ys, c

expect = sorted(funcs[0](*args))
for func in funcs:
    result = sorted(func(*args))
    print(result == expect, func.__name__, len(result))
print()

for _ in range(3):
    for func in funcs:
        t = timeit(lambda: deque(func(*args), 0), number=1)
        print('%6.1f ms ' % (t * 1e3), func.__name__)
    print()

