67

Based on this comment and the referenced documentation, pickle protocol 4 (available since Python 3.4) should be able to pickle byte objects larger than 4 GB.

However, using Python 3.4.3 or Python 3.5.0b2 on Mac OS X 10.10.4, I get an error when I try to pickle a large byte array:

>>> import pickle
>>> x = bytearray(8 * 1000 * 1000 * 1000)
>>> fp = open("x.dat", "wb")
>>> pickle.dump(x, fp, protocol = 4)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
OSError: [Errno 22] Invalid argument

Is there a bug in my code or am I misunderstanding the documentation?

15
  • There's no problem for me. Python 3.4.1 on Windows. Commented Jul 17, 2015 at 3:59
  • 5
    Breaks on OS X. This doesn't actually have anything to do with pickle. open('/dev/null', 'wb').write(bytearray(2**31 - 1)) works, but open('/dev/null', 'wb').write(bytearray(2**31)) throws that error. Python 2 doesn't have this issue. Commented Jul 17, 2015 at 4:13
  • @Blender: What throws an error for you works for me with both Python 2.7.10 and Python 3.4.3 (on OS X, MacPorts versions). Commented Jul 17, 2015 at 4:15
  • 1
    @Blender, @EOL open('/dev/null', 'wb').write(bytearray(2**31)) fails for me as well with the MacPorts Python 3.4.3. Commented Jul 17, 2015 at 4:25
  • 4
    Bug reported: bugs.python.org/issue24658. Commented Jul 18, 2015 at 3:00

7 Answers

39

Here is a simple workaround for issue 24658: use pickle.dumps and pickle.loads, and break the pickled bytes object into chunks of at most 2**31 - 1 bytes when writing it to or reading it back from the file.

import pickle
import os.path

file_path = "pkl.pkl"
n_bytes = 2**31
max_bytes = 2**31 - 1
data = bytearray(n_bytes)

## write: serialize in memory, then write the pickled bytes to disk in
## chunks of at most max_bytes, so no single write() exceeds 2**31 - 1 bytes
bytes_out = pickle.dumps(data)
with open(file_path, 'wb') as f_out:
    for idx in range(0, len(bytes_out), max_bytes):
        f_out.write(bytes_out[idx:idx+max_bytes])

## read: read the file back in chunks into a bytearray, then unpickle from memory
bytes_in = bytearray(0)
input_size = os.path.getsize(file_path)
with open(file_path, 'rb') as f_in:
    for _ in range(0, input_size, max_bytes):
        bytes_in += f_in.read(max_bytes)
data2 = pickle.loads(bytes_in)

assert data == data2

2 Comments

Thank you. This helped greatly. One thing: for the write loop, should for idx in range(0, n_bytes, max_bytes): be for idx in range(0, len(bytes_out), max_bytes):?
@lunguini, for the write chunk, instead of range(0, n_bytes, max_bytes), should it be range(0, len(bytes_out), max_bytes)? The reason I'm suggesting this is that (on my machine, anyway) n_bytes = 1024 but len(bytes_out) = 1062, and for others coming to this solution, looping over the length of the example data rather than the pickled output isn't necessarily correct in real-world scenarios.
25

To sum up what was answered in the comments:

Yes, Python can pickle byte objects bigger than 4 GB. The observed error is caused by a bug in the implementation (see Issue 24658).
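
As the comments under the question show, the failure is not in pickle itself but in any single write() call of 2**31 bytes or more on the affected OS X builds. A minimal reproduction sketch (only fails on an affected interpreter; /dev/null is used so nothing is actually stored):

with open("/dev/null", "wb") as f:
    f.write(bytearray(2**31 - 1))    # works
    f.write(bytearray(2**31))        # OSError: [Errno 22] Invalid argument on affected builds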

3 Comments

How is this issue still not fixed? Insane
It's 2018 and the bug is still there. Does anyone know why?
It’s been fixed for 3.6.8, 3.7.2 and 3.8 in October 2018; the issue remains open because the author wanted to backport to 2.7. In 6 weeks time that’ll be moot as Python 2.x reaches EOL.
15

Here is the full workaround, though it seems pickle.load no longer trips over huge files (I am on Python 3.5.2), so strictly speaking only pickle.dump needs this wrapper to work properly.

import pickle

class MacOSFile(object):

    def __init__(self, f):
        self.f = f

    def __getattr__(self, item):
        return getattr(self.f, item)

    def read(self, n):
        # print("reading total_bytes=%s" % n, flush=True)
        if n >= (1 << 31):
            buffer = bytearray(n)
            idx = 0
            while idx < n:
                batch_size = min(n - idx, (1 << 31) - 1)  # cap each chunk at 2**31 - 1 bytes
                # print("reading bytes [%s,%s)..." % (idx, idx + batch_size), end="", flush=True)
                buffer[idx:idx + batch_size] = self.f.read(batch_size)
                # print("done.", flush=True)
                idx += batch_size
            return buffer
        return self.f.read(n)

    def write(self, buffer):
        n = len(buffer)
        print("writing total_bytes=%s..." % n, flush=True)
        idx = 0
        while idx < n:
            batch_size = min(n - idx, (1 << 31) - 1)  # cap each chunk at 2**31 - 1 bytes
            print("writing bytes [%s, %s)... " % (idx, idx + batch_size), end="", flush=True)
            self.f.write(buffer[idx:idx + batch_size])
            print("done.", flush=True)
            idx += batch_size


def pickle_dump(obj, file_path):
    with open(file_path, "wb") as f:
        return pickle.dump(obj, MacOSFile(f), protocol=pickle.HIGHEST_PROTOCOL)


def pickle_load(file_path):
    with open(file_path, "rb") as f:
        return pickle.load(MacOSFile(f))
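
A minimal usage sketch of the two helpers above (hypothetical file name; the ~2 GB payload needs a few GB of free memory and disk space):

data = bytearray(2**31)              # just big enough to trigger the original OSError
pickle_dump(data, "big.pkl")         # chunked writes via MacOSFile avoid the >= 2 GB write
restored = pickle_load("big.pkl")
assert restored == data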

Comments

9

You can specify the protocol for the dump. If you do pickle.dump(obj, file, protocol=4) it should work.
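
A minimal sketch of that call (hypothetical object and file name; protocol 4 is the one that adds support for objects over 4 GB, and pickle.HIGHEST_PROTOCOL selects at least 4 on Python 3.4+):

import pickle

obj = list(range(10))                      # placeholder; in practice this is your large object
with open("obj.pkl", "wb") as f:
    pickle.dump(obj, f, protocol=4)        # or protocol=pickle.HIGHEST_PROTOCOL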

1 Comment

What I did was: pickle.dump(data,w,protocol=pickle.HIGHEST_PROTOCOL). It worked!
4

Reading a file in 2 GB chunks takes twice as much memory as needed if the chunks are concatenated as bytes objects, so my approach to loading pickles is based on a preallocated bytearray:

class MacOSFile(object):
    def __init__(self, f):
        self.f = f

    def __getattr__(self, item):
        return getattr(self.f, item)

    def read(self, n):
        if n >= (1 << 31):
            buffer = bytearray(n)
            pos = 0
            while pos < n:
                size = min(n - pos, (1 << 31) - 1)  # cap each chunk at 2**31 - 1 bytes
                chunk = self.f.read(size)
                buffer[pos:pos + size] = chunk
                pos += size
            return buffer
        return self.f.read(n)

Usage:

with open("/path", "rb") as fin:
    obj = pickle.load(MacOSFile(fin))

2 Comments

Will the above code work on any platform? If so, it's more of a "FileThatAlsoCanBeLoadedByPickleOnOSX", right? Just trying to understand... It's not like pickle.load(MacOSFile(fin)) will break on Linux, correct? @markhor
Also, would you implement a write method?
1

Had the same issue and fixed it by upgrading to Python 3.6.8.

This seems to be the PR that did it: https://github.com/python/cpython/pull/9937
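
If you want to check whether the interpreter you are running already carries the fix, here is a rough version-check sketch (the fixed versions, 3.6.8 / 3.7.2 / 3.8, are taken from the comment on an earlier answer):

import sys

v = sys.version_info
has_fix = v >= (3, 7, 2) or (3, 6, 8) <= v < (3, 7)
print("large single writes should work natively:", has_fix)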

Comments

0

I also ran into this issue and worked around it by splitting the computation into several iterations. In my case I have 50,000 documents for which I have to compute TF-IDF and do kNN classification. When I process all 50,000 in one run I get that error, so I process them in chunks instead.

tokenized_documents = self.load_tokenized_preprocessing_documents()
idf = self.load_idf_41227()
doc_length = len(documents)
for iteration in range(0, 9):
    tfidf_documents = []
    for index in range(iteration, 4000):
        doc_tfidf = []
        for term in idf.keys():
            tf = self.term_frequency(term, tokenized_documents[index])
            doc_tfidf.append(tf * idf[term])
        doc = documents[index]
        tfidf = [doc_tfidf, doc[0], doc[1]]
        tfidf_documents.append(tfidf)
        print("{} from {} document {}".format(index, doc_length, doc[0]))

    self.save_tfidf_41227(tfidf_documents, iteration)

Comments
