7

I'm looking for speedy alternatives to my function. the goal is to make a list of 32 bit integers based of any length integers. The length is explicitly given in a tuple of (value, bitlength). This is part of a bit-banging procedure for a asynchronous interface which takes 4 32 bit integers per bus transaction.

All ints are unsigned, positive or zero, the length can vary between 0 and 2000

My inputs is a list of these tuples, the output should be integers with implicit 32 bit length, with the bits in sequential order. The remaining bits not fitting into 32 should also be returned.

input: [(0,128),(1,12),(0,32)]
output:[0, 0, 0, 0, 0x100000], 0, 12

I've spent a day or two on profiling with cProfile, and trying different methods, but I seem to be kind of stuck with functions that takes ~100k tuples in one second, which is kinda slow. Ideally i would like a 10x speedup, but I haven't got enough experience to know where to start. The ultimate goal for the speed of this is to be faster than 4M tuples per second.

Thanks for any help or suggestions.

the fastest i can do is:

def foo(tuples):
    '''make a list of tuples of (int, length) into a list of 32 bit integers [1,2,3]'''
    length = 0
    remlen = 0
    remint = 0
    i32list = []
    for a, b in tuples:
        n = (remint << (32-remlen)) | a #n = (a << (remlen)) | remint
        length += b
        if length > 32:
            len32 = int(length/32)
            for i in range(len32):
                i32list.append((n >> i*32) & 0xFFFFFFFF)
            remint = n >> (len32*32)
            remlen = length - len32*32
            length = remlen
        elif length == 32:
            appint = n & 0xFFFFFFFF
            remint = 0
            remlen = 0
            length -= 32
            i32list.append(appint)
        else:
            remint = n
            remlen = length
    return i32list, remint, remlen

this has very similar performance:

def tpli_2_32ili(tuples):
    '''make a list of tuples of (int, length) into a list of 32 bit integers [1,2,3]'''
#    binarylist = "".join([np.binary_repr(a, b) for a, b in inp]) # bin(a)[2:].rjust(b, '0')
    binarylist = "".join([bin(a)[2:].rjust(b, '0') for a, b in tuples])
    totallength = len(binarylist)
    tot32 = int(totallength/32)
    i32list = [int(binarylist[i:i+32],2) for i in range(0, tot32*32, 32) ]
    remlen = totallength - tot32*32
    remint = int(binarylist[-remlen:],2)
    return i32list, remint, remlen
2
  • @martineau : sorry, fixed. Commented May 2, 2018 at 14:40
  • Possible suggestions: switch to pypy and check the speed (note: the pypy JIT compiler takes a long time to warm up. Run it on several seconds worth of data at least), then start digging in to the JIT traces and seeing what's holding your code up, like here . Off the bat, function calls in python are known to have substantial overhead, so maybe try if possible to make your function process a whole bunch of tuples at once. Or maybe look into Cython. Commented May 2, 2018 at 17:42

2 Answers 2

2

The best I could come up with so far is a 25% speed-up

from functools import reduce

intMask = 0xffffffff

def f(x,y):
    return (x[0] << y[1]) + y[0], x[1] + y[1]

def jens(input):
    n, length = reduce( f , input, (0,0) )
    remainderBits = length % 32
    intBits = length - remainderBits
    remainder = ((n & intMask) << (32 - remainderBits)) >> (32 - remainderBits)
    n >>= remainderBits

    ints = [n & (intMask << i) for i in range(intBits-32, -32, -32)]
    return ints, remainderBits, remainder

print([hex(x) for x in jens([(0,128),(1,12),(0,32)])[0]])

It uses a long to sum up the tuple values according to their bit length, and then extract the 32-bit values and the remaining bits from this number. The computastion of the overall length (summing up the length values of the input tuple) and the computation of the large value are done in a single loop with reduce to use an intrinsic loop.

Running matineau's benchmark harness prints, the best numbers I have seen are:

Fastest to slowest execution speeds using Python 3.6.5
(1,000 executions, best of 3 repetitions)

          jens :  0.004151 secs, rel speed  1.00x,     0.00% slower
 First snippet :  0.005259 secs, rel speed  1.27x,    26.70% slower
Second snippet :  0.008328 secs, rel speed  2.01x,   100.64% slower

You could probably gain a better speed-up if you use some C extension implementing a bit array.

Sign up to request clarification or add additional context in comments.

Comments

1

This isn't an answer with a faster implementation. Instead it's the code in the two snippets you have in your question placed within an extensible benchmarking framework that makes comparing different approaches very easy.

Comparing just those two testcases, it indicates that your second approach does not have very similar performance to the first, based on the output shown. In fact, it's almost twice as slow.

from collections import namedtuple
import sys
from textwrap import dedent
import timeit
import traceback

N = 1000  # Number of executions of each "algorithm".
R = 3  # Number of repetitions of those N executions.

# Common setup for all testcases (executed before any algorithm specific setup).
COMMON_SETUP = dedent("""
    # Import any resources needed defined in outer benchmarking script.
    #from __main__ import ??? # Not needed at this time
""")


class TestCase(namedtuple('CodeFragments', ['setup', 'test'])):
    """ A test case is composed of separate setup and test code fragments. """
    def __new__(cls, setup, test):
        """ Dedent code fragment in each string argument. """
        return tuple.__new__(cls, (dedent(setup), dedent(test)))


testcases = {
    "First snippet": TestCase("""
        def foo(tuples):
            '''make a list of tuples of (int, length) into a list of 32 bit integers [1,2,3]'''
            length = 0
            remlen = 0
            remint = 0
            i32list = []
            for a, b in tuples:
                n = (remint << (32-remlen)) | a #n = (a << (remlen)) | remint
                length += b
                if length > 32:
                    len32 = int(length/32)
                    for i in range(len32):
                        i32list.append((n >> i*32) & 0xFFFFFFFF)
                    remint = n >> (len32*32)
                    remlen = length - len32*32
                    length = remlen
                elif length == 32:
                    appint = n & 0xFFFFFFFF
                    remint = 0
                    remlen = 0
                    length -= 32
                    i32list.append(appint)
                else:
                    remint = n
                    remlen = length

            return i32list, remint, remlen
        """, """
        foo([(0,128),(1,12),(0,32)])
        """

    ),
    "Second snippet": TestCase("""
        def tpli_2_32ili(tuples):
            '''make a list of tuples of (int, length) into a list of 32 bit integers [1,2,3]'''
            binarylist = "".join([bin(a)[2:].rjust(b, '0') for a, b in tuples])
            totallength = len(binarylist)
            tot32 = int(totallength/32)
            i32list = [int(binarylist[i:i+32],2) for i in range(0, tot32*32, 32) ]
            remlen = totallength - tot32*32
            remint = int(binarylist[-remlen:],2)
            return i32list, remint, remlen
        """, """
        tpli_2_32ili([(0,128),(1,12),(0,32)])
        """
    ),
}

# Collect timing results of executing each testcase multiple times.
try:
    results = [
        (label,
         min(timeit.repeat(testcases[label].test,
                           setup=COMMON_SETUP + testcases[label].setup,
                           repeat=R, number=N)),
        ) for label in testcases
    ]
except Exception:
    traceback.print_exc(file=sys.stdout)  # direct output to stdout
    sys.exit(1)

# Display results.
major, minor, micro = sys.version_info[:3]
print('Fastest to slowest execution speeds using Python {}.{}.{}\n'
      '({:,d} executions, best of {:d} repetitions)'.format(major, minor, micro, N, R))
print()

longest = max(len(result[0]) for result in results)  # length of longest label
ranked = sorted(results, key=lambda t: t[1]) # ascending sort by execution time
fastest = ranked[0][1]
for result in ranked:
    print('{:>{width}} : {:9.6f} secs, rel speed {:5,.2f}x, {:8,.2f}% slower '
          ''.format(
                result[0], result[1], round(result[1]/fastest, 2),
                round((result[1]/fastest - 1) * 100, 2),
                width=longest))

Output:

Fastest to slowest execution speeds using Python 3.6.5
(1,000 executions, best of 3 repetitions)

 First snippet :  0.003024 secs, rel speed  1.00x,     0.00% slower
Second snippet :  0.005085 secs, rel speed  1.68x,    68.13% slower

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.