Loading large STATA file throws "OSError: [Errno 22] Invalid argument" #10641

@makmanalp

Description

Hello, this error happens when loading a large (~3.3GB) Stata 13 file:

/Users/makmana/colombia/colombia/datasets.py in <lambda>()
    111
    112 industry4digit_department = {
--> 113     "read_function": lambda: pd.read_stata("/Users/makmana/ciddata/Subnationals/Atlas/Colombia/beta/output2008_2013.dta"),
    114     "field_mapping": pila_to_atlas,
    115     "classification_fields": {

/Users/makmana/colombia/env/lib/python3.4/site-packages/pandas/io/stata.py in read_stata(filepath_or_buffer, convert_dates, convert_categoricals, encoding, index, convert_missing, preserve_dtypes, columns, order_categoricals, chunksize, iterator)
    160         return reader
    161
--> 162     return reader.read()
    163
    164 _date_formats = ["%tc", "%tC", "%td", "%d", "%tw", "%tm", "%tq", "%th", "%ty"]

/Users/makmana/colombia/env/lib/python3.4/site-packages/pandas/io/stata.py in read(self, nrows, convert_dates, convert_categoricals, index, convert_missing, preserve_dtypes, columns, order_categoricals)
   1349         self.path_or_buf.seek(self.data_location + offset)
   1350         read_lines = min(nrows, self.nobs - self._lines_read)
-> 1351         data = np.frombuffer(self.path_or_buf.read(read_len), dtype=dtype,
   1352                              count=read_lines)
   1353         self._lines_read += read_lines

OSError: [Errno 22] Invalid argument
> /Users/makmana/colombia/env/lib/python3.4/site-packages/pandas/io/stata.py(1351)read()
   1350         read_lines = min(nrows, self.nobs - self._lines_read)
-> 1351         data = np.frombuffer(self.path_or_buf.read(read_len), dtype=dtype,
   1352                              count=read_lines)

where read_len is 3525947880 and self.path_or_buf is <_io.BufferedReader name='/Users/makmana/ciddata/Subnationals/Atlas/Colombia/beta/output2008_2013.dta'>.

At that point, the question in my head was "well, what /is/ a reasonable read_len?" So I binary-searched until I converged on a value around 721000000, but then I quit a bunch of other applications and somehow the original read started working again! This makes me think it may have to do with available memory.
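For reference, here is roughly the binary search I did, as a minimal sketch (the max_single_read helper is just illustrative; the path and the 3525947880 starting point come from the traceback above). For what it's worth, 3525947880 is larger than 2**31 - 1, so a per-call limit on read() size might also be in play:

def max_single_read(path, upper=3525947880):
    """Find the largest single .read() size that succeeds on this file."""
    lo, hi = 0, upper
    while lo < hi:
        mid = (lo + hi + 1) // 2
        try:
            with open(path, "rb") as f:
                f.read(mid)  # actually reads up to `mid` bytes, so each probe is slow
            lo = mid         # a read of `mid` bytes succeeded; try larger
        except OSError:
            hi = mid - 1     # OSError: [Errno 22]; try smaller
    return lo

print(max_single_read("/Users/makmana/ciddata/Subnationals/Atlas/Colombia/beta/output2008_2013.dta"))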

Another funny thing is that this happens on Python 3.4.2 (default, Oct 19 2014, 17:52:17) [GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.51)] on darwin, but when I do the same read_stata on Python 2.7.9 (default, Jan 7 2015, 11:50:42) [GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.56)] on darwin, it doesn't choke. Both are using numpy 1.9.2 and pandas 0.16.2.

A final observation is that with that read_len it fails very quickly (quicker than it could possibly have read the file), whereas with smaller read_len values, say half, it waits a long time and then fails.

This issue and this one might be related.

I can't share the file because of data confidentiality, but I'd be happy to dig in and figure out what might be going wrong if someone has ideas or pointers.
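In the meantime, reading in chunks might sidestep the one huge read(); chunksize and iterator are right there in the read_stata signature above. A sketch (untested against this particular failure; the 100000-row chunk size is arbitrary, and on this pandas version you may need reader.get_chunk() in a loop instead of iterating directly):

import pandas as pd

path = "/Users/makmana/ciddata/Subnationals/Atlas/Colombia/beta/output2008_2013.dta"

# Ask for a StataReader that yields DataFrames chunk by chunk instead of
# doing one ~3.3GB read, then stitch the chunks back together.
reader = pd.read_stata(path, chunksize=100000)
df = pd.concat(reader, ignore_index=True)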
