
In Python, I would like to quickly compute an order-invariant hash for the lines of a file, as a way to "uniquely" identify its content. These files are, for example, the output of a select ... from table, so the order of the lines is random.

Here is an example that achieves what I want (using one of the hashers in hashlib), but at the expense of having to sort the lines. Note that sorting the lines is just a way to achieve the goal, i.e. to get a hash that doesn't depend on the ordering of the lines in the file. But clearly, I'd like to avoid the O(n*log(n)) cost, esp. when the files are much longer.

import hashlib
import os

def get_hexdigest(filename, hasher, blocksize=65536, order_invariant=False):
    if not os.path.isfile(filename):
        return None
    if order_invariant:
        # Sort the lines in memory so the digest doesn't depend on line order.
        with open(filename, 'r') as f:
            for line in sorted(f):
                hasher.update(line.encode())
    else:
        # Order-sensitive: just stream binary chunks into the hasher.
        with open(filename, 'rb') as f:
            while True:
                buf = f.read(blocksize)
                hasher.update(buf)
                if len(buf) < blocksize:
                    break
    return hasher.hexdigest()

So, for e.g. a 1 MB, 50K-line file:

%%time
get_hexdigest('some_file', hashlib.sha1())
# Wall time: 1.71 ms

But:

%%time
get_hexdigest('some_file', hashlib.sha1(), order_invariant=True)
# Wall time: 77.4 ms

What is a better/faster way to do that?

As noted in this answer, Scala has an order-invariant hash based on MurmurHash, but I assume it is the 32-bit version of mmh3 (too collision-prone for my usage), and I would also rather use some standard library available in Python than implement something in C or Cython. MurmurHash3 has a 128-bit version, but its output is different on x64 vs x86. I would like to have machine-independent results.

So, in summary, I would like:

  • consistent results across machine architectures
  • low collision rate, i.e. at least 128 bits with good dispersion (but I don't need the hash to be cryptographic)
  • reasonably fast, i.e. under 5 ms for a 1 MB, 50K-line file.
  • readily available, if possible, as a library on PyPi or Conda.
  • amenable to files with repeated lines (so just XORing per-line hashes is a non-starter, as any pair of identical lines would cancel each other out; see the short demonstration below).
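
To make the last point concrete, here is a minimal sketch (using per-line SHA-1 purely as an example) of why XOR folding breaks down on repeated lines:

import hashlib
from functools import reduce

def xor_line_hashes(lines):
    # XOR the per-line digests together, interpreted as integers.
    digests = (int.from_bytes(hashlib.sha1(line).digest(), 'big') for line in lines)
    return reduce(lambda a, b: a ^ b, digests, 0)

# Any pair of identical lines cancels out: both calls print the same value,
# even though the second input has two extra (duplicated) lines.
print(xor_line_hashes([b'a\n', b'b\n']))
print(xor_line_hashes([b'a\n', b'b\n', b'c\n', b'c\n']))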

Edits and notes: Thanks to several comments, the code above has been updated to sort the lines in memory. The original version for order_invariant=True was:

    with os.popen('sort {}'.format(filename)) as f:
        for line in f:
            hasher.update(line.encode(encoding='utf-8'))
    return hasher.hexdigest()

The associated wall time (for the file used above) was then 238 ms. This is now reduced to 77 ms, but still much slower than not sorting the lines. Sorting adds an n*log(n) cost for n lines.

Encoding (to UTF-8) and reading in mode 'r' rather than 'rb' are necessary when reading lines, as we then get strings rather than bytes. I don't want to rely on the assumption that the files contain only ASCII data; reading in 'rb' could lead to lines not being split properly. I don't have the same concern when order_invariant is False, because then I don't have to split the file, so the fastest way is to slurp chunks of binary data and feed them to the hasher.

  • first improvement: you could probably sort the lines in Python using for line in sorted(f) instead of calling an external process... Commented Feb 17, 2017 at 20:21
  • As a side note, if you're working with lines (based on 'order-invariant hash for the lines of a file'), why are you using a buffer in the order-sensitive case? Just pump the hasher with .readline() output. Commented Feb 17, 2017 at 20:33
  • Also: it's probably much faster to encode the whole file and then split the lines, than calling .encode on every line: with open(...) as f: for line in sorted(f.read().encode('utf-8').split(b'\n')): hasher.update(line) (yes, it loads the whole file into memory, but sorting requires this). Commented Feb 17, 2017 at 20:36
  • Or even better, open the file in binary mode, so no decoding/re-encoding will take place. Commented Feb 17, 2017 at 20:44
  • @ThierryLathuille - his minimal unit for hashing is the line, so reading the whole file into a buffer is a moot point. Then again, he's opening the file in binary mode for some reason too, pretty much guaranteeing that the above will not return the same hash for two otherwise identical files with the line ordering changed - in the first case he's not even hashing it line by line. Commented Feb 17, 2017 at 20:50

4 Answers


I think you should sort the file before (select ... from table order by ...) or come up with another solution for your actual problem.

Anyways, a possible approach in Python using a frozenset:

#!/usr/bin/python

lines1 = ['line1', 'line2', 'line3', 'line4']
lines2 = ['line2', 'line1', 'line3', 'line4']  # same as lines1 but different order
lines3 = ['line1', 'line1', 'line3', 'line4', 'line5']


for lines in [lines1, lines2, lines3]:
    print(lines)
    print(hash(frozenset(lines)))
    print('')

Output

['line1', 'line2', 'line3', 'line4']
8013284786872469720

['line2', 'line1', 'line3', 'line4']
8013284786872469720

['line1', 'line1', 'line3', 'line4', 'line5']
7430298023231386903

I doubt it will match your performance constraints. I don't know the time complexity (big-O) of frozenset(). It also assumes lines are unique. Again, I highly suggest tackling the underlying problem differently.
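
To make the uniqueness caveat concrete: frozenset collapses duplicates, so two files that differ only in how many times a line repeats will hash identically. A tiny sketch:

# frozenset drops duplicate lines, so these two different inputs collide:
a = frozenset(['line1', 'line2'])
b = frozenset(['line1', 'line1', 'line2'])
print(hash(a) == hash(b))  # True: the repeated 'line1' is invisible to the hash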


4 Comments

  • Frozenset is a good fit for this task. And since duplicate lines are eliminated while building the frozenset, there is no risk of two duplicate lines cancelling each other out.
  • The downside is that the hash value is only 32 or 64 bits.
  • Cool, +1 for thinking of the frozenset. However, I did some measurements for increasing sizes and found that the time taken by this approach grows more than linearly with the number of lines. That said, the overall constants are low. For example, it takes 15 ms for 65536 lines, the best I've seen so far. But then it takes 7.6 s for 11.8M lines, which is not the best.
  • BTW, the files come as-is from a different system over which I have little or no control; there are many of them, and I cannot affect the way they are produced. I wish they were always sorted. Often they are not.

How about this Merkle-style map-reduce (hash the concatenation of the per-line hashes, with an optional sort after the map step for order invariance):

import hashlib

def hasher(data):
    # Hash one string and return its hex digest.
    h = hashlib.sha1()
    h.update(data.encode('utf-8'))
    return h.hexdigest()


def get_digest_by_line(filename, line_invariant=False, hasher=hasher):
    with open(filename, 'r') as f:
        hashes = (hasher(line) for line in f)
        if line_invariant:
            # Sorting the fixed-length per-line digests makes the final
            # digest independent of the order of the lines in the file.
            hashes = sorted(hashes)
        return hasher(''.join(hashes))
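
A quick sanity check (hypothetical, using temporary files) that the digest is indeed invariant to line order when line_invariant=True:

import os
import tempfile

# Write the same lines, in two different orders, to two temporary files.
lines = ['alpha\n', 'beta\n', 'gamma\n']
paths = []
for permutation in (lines, list(reversed(lines))):
    fd, path = tempfile.mkstemp(text=True)
    with os.fdopen(fd, 'w') as tmp:
        tmp.writelines(permutation)
    paths.append(path)

print(get_digest_by_line(paths[0], line_invariant=True) ==
      get_digest_by_line(paths[1], line_invariant=True))  # True

for path in paths:
    os.remove(path)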

1 Comment

  • Hey, good idea. In fact, with some minor mods, it could also be distributed for very large files, e.g. in a map-reduce scheme with some binning on hash prefix.

Thank you all for the interesting comments and answers so far.

At this time, the best answer for large files (>350K lines) is (a) below. It is based on MurmurHash3, adding up the mmh3.hash128() of each line. For smaller files, it is (b) below: a variant of the frozenset approach proposed by Rolf, which I adapted to produce a 128-bit hash (although I wouldn't vouch for the quality of those 128 bits).

a) mmh3.hash128() for each line and add

import mmh3

def get_digest_mmh3_128_add(filename):
    a = 0
    with open(filename, 'rb') as f:
        for line in f:
            # Sum the 128-bit line hashes; addition is commutative, hence order-invariant.
            a += mmh3.hash128(line)
    # Mask once at the end to keep 128 bits (see note 2 below).
    return '{:032x}'.format(a & 0xffffffffffffffffffffffffffffffff)

In my setting: constant 0.4 second per million lines.

b) two frozenset hash

def get_digest_hash_frozenset128(filename):
    with open(filename, 'rb') as f:
        frz = frozenset(f.readlines())
    # Combine two 64-bit hashes: the frozenset itself, and the frozenset
    # augmented with one extra line that is unlikely to appear in any file.
    return '{:032x}'.format(((hash(frz) << 64) + hash(frz.union([b'not a line']))) & 0xffffffffffffffffffffffffffffffff)

In my setting: between 0.2 and 0.6 second per million lines.

Notes

  1. After consideration, I decided it was OK to read the lines of the file in binary mode, even if they potentially contain UTF-8 text. The concern is that if some multi-byte character happened to contain the byte '\n', the line would be accidentally split at that point (this cannot actually happen in UTF-8, where continuation bytes are always >= 0x80, but it could with other encodings). The file would then get the same digest as another one in which the two parts of that line were arranged differently (or even split apart and placed elsewhere in the file), but the probability of this is extremely low and I can live with it.

  2. Adding all the 128-bit hashes in (a) is done using Python's arbitrary-precision ints. At first, I tried to keep the sum within 128 bits (by repeatedly AND-ing with the 0xfff...fff constant), but that turns out to be slower than letting Python use arbitrary precision and doing the masking once at the end (a small sketch of the two styles follows after these notes).

  3. I am trying to get 128 bits out of the regular hash of a frozenset by taking two hashes: that of the frozenset itself, and that of the frozenset augmented with a line that is unlikely to appear in any file (kind of the same as using a different seed for the hash, I guess).
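
Regarding note 2, here is a minimal sketch (illustrative only, not the exact code from the notebook) of the two summation styles; both return the same value, the difference is only in when the masking happens:

import mmh3

MASK_128 = (1 << 128) - 1

def sum_hashes_mask_each_step(lines):
    # Keep the accumulator within 128 bits at every iteration.
    a = 0
    for line in lines:
        a = (a + mmh3.hash128(line)) & MASK_128
    return a

def sum_hashes_mask_once(lines):
    # Let Python's arbitrary-precision int grow, and mask a single time at the end.
    a = 0
    for line in lines:
        a += mmh3.hash128(line)
    return a & MASK_128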

Complete results

A full notebook is available here. It creates pseudo-random files of arbitrary sizes and tries several digest approaches, measuring the time taken by each. This was run on an EC2 instance (r3.4xlarge, with an EBS volume for storage of the pseudo-random files), in a Jupyter notebook with Python 3.6.

For 46341 lines, we get

fun                              lines millis
get_digest_xxh64_order_sensitive 46341    0.4 *
get_digest_sha1                  46341    1.7 *
get_digest_hash_frozenset64      46341    8.7
get_digest_hash_frozenset128     46341   10.8
get_digest_sha1_by_lines         46341   14.1 *
get_digest_mmh3_128_add_cy       46341   18.6
get_digest_mmh3_128_add          46341   19.7
get_digest_sha1_sort_binary      46341   44.3
get_digest_sha1_sort             46341   65.9

*: These are order-dependent, just here for comparison.

get_digest_hash_frozenset64 isn't really suitable as it gives only 64 bits.

get_digest_mmh3_128_add_cy is a cythonized version of the function given above in (a), but there is little difference.

get_digest_xxh64_order_sensitive is extremely fast, but it is order-dependent. My attempts (not listed here) to derive an order-invariant version all gave pretty slow results. The reason, I think, is the high cost of initializing and finalizing the hash for every line.
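
For reference, here is a sketch of the kind of order-invariant xxHash variant that was attempted (assuming the xxhash package from PyPI; note that it yields only 64 bits, so it would not meet the 128-bit requirement anyway):

import xxhash

def get_digest_xxh64_add(filename):
    # Hash each line with xxh64 and sum the results; creating and finalizing
    # a hasher per line is precisely the per-line overhead mentioned above.
    a = 0
    with open(filename, 'rb') as f:
        for line in f:
            a += xxhash.xxh64(line).intdigest()
    return '{:016x}'.format(a & 0xffffffffffffffff)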

For larger files, get_digest_mmh3_128_add_cy wins among the order-invariant candidates. Here are the numbers for 11.8M lines:

fun                                 lines    millis
get_digest_xxh64_order_sensitive 11863283      97.8 *
get_digest_sha1                  11863283     429.3 *
get_digest_sha1_by_lines         11863283    3453.0 *
get_digest_mmh3_128_add_cy       11863283    4692.8
get_digest_mmh3_128_add          11863283    4956.6
get_digest_hash_frozenset64      11863283    6418.2
get_digest_hash_frozenset128     11863283    7663.6
get_digest_sha1_sort_binary      11863283   27851.3
get_digest_sha1_sort             11863283   34806.4

Focusing on the two leading order-invariant contenders, here is how much time they take as a function of size (number of lines). The y-axis is microseconds per line and the x-axis is the number of lines in the file. Note how get_digest_mmh3_128_add_cy spends a constant time (0.4 µs) per line.

[Figure: time per line of the two order-invariant digests as a function of file size]

Next steps

Sorry for the long-winded answer. This is only an interim answer, as I might (time permitting) later experiment further with Numba or Cython (or C++) on a direct implementation of MurmurHash3.


In case someone is interested in a concise (and unbenchmarked) solution:

import hashlib

def hash(data):  # note: this shadows the built-in hash()
    return hashlib.md5(data).digest()

# Order-invariant (since sum is commutative) hash of strings, byte-wise.
def hash_up(*args):
    hashes = [hash(d.encode()) for d in args]
    return ''.join(format(sum(t) % 256, '02x') for t in zip(*hashes))
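
For example (the result does not depend on argument order, and repeated inputs do not cancel out, they just shift the byte-wise sums):

print(hash_up('line1', 'line2') == hash_up('line2', 'line1'))  # True
print(hash_up('line1', 'line1') == hash_up('line1'))           # False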

