Loading large STATA file throws "OSError: [Errno 22] Invalid argument" #10641

@makmanalp

Description

Hello, this error happens when loading a large (~3.3GB) Stata 13 file:

/Users/makmana/colombia/colombia/datasets.py in <lambda>()
    111
    112 industry4digit_department = {
--> 113     "read_function": lambda: pd.read_stata("/Users/makmana/ciddata/Subnationals/Atlas/Colombia/beta/output2008_2013.dta"),
    114     "field_mapping": pila_to_atlas,
    115     "classification_fields": {

/Users/makmana/colombia/env/lib/python3.4/site-packages/pandas/io/stata.py in read_stata(filepath_or_buffer, convert_dates, convert_categoricals, encoding, index, convert_missing, preserve_dtypes, columns, order_categoricals, chunksize, iterator)
    160         return reader
    161
--> 162     return reader.read()
    163
    164 _date_formats = ["%tc", "%tC", "%td", "%d", "%tw", "%tm", "%tq", "%th", "%ty"]

/Users/makmana/colombia/env/lib/python3.4/site-packages/pandas/io/stata.py in read(self, nrows, convert_dates, convert_categoricals, index, convert_missing, preserve_dtypes, columns, order_categoricals)
   1349         self.path_or_buf.seek(self.data_location + offset)
   1350         read_lines = min(nrows, self.nobs - self._lines_read)
-> 1351         data = np.frombuffer(self.path_or_buf.read(read_len), dtype=dtype,
   1352                              count=read_lines)
   1353         self._lines_read += read_lines

OSError: [Errno 22] Invalid argument
> /Users/makmana/colombia/env/lib/python3.4/site-packages/pandas/io/stata.py(1351)read()
   1350         read_lines = min(nrows, self.nobs - self._lines_read)
-> 1351         data = np.frombuffer(self.path_or_buf.read(read_len), dtype=dtype,
   1352                              count=read_lines)

where read_len is 3525947880 and self.path_or_buf is <_io.BufferedReader name='/Users/makmana/ciddata/Subnationals/Atlas/Colombia/beta/output2008_2013.dta'>.

At that point, the question in my head was "well, what /is/ a reasonable read_len?" So I binary-searched until I converged on a value around 721000000, but then I quit a bunch of other applications and somehow the original read started working again! This makes me think it may have to do with available memory.
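For reference, here is roughly the binary search I did, as a minimal sketch (the max_single_read helper is just illustrative; the path and the 3525947880 starting point come from the traceback above). For what it's worth, 3525947880 is larger than 2**31 - 1, so a per-call limit on read() size might also be in play:

def max_single_read(path, upper=3525947880):
    """Find the largest single .read() size that succeeds on this file."""
    lo, hi = 0, upper
    while lo < hi:
        mid = (lo + hi + 1) // 2
        try:
            with open(path, "rb") as f:
                f.read(mid)  # actually reads up to `mid` bytes, so each probe is slow
            lo = mid         # a read of `mid` bytes succeeded; try larger
        except OSError:
            hi = mid - 1     # OSError: [Errno 22]; try smaller
    return lo

print(max_single_read("/Users/makmana/ciddata/Subnationals/Atlas/Colombia/beta/output2008_2013.dta"))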

Another funny thing is that this happens on Python 3.4.2 (default, Oct 19 2014, 17:52:17) [GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.51)] on darwin, but when I do the same read_stata on Python 2.7.9 (default, Jan 7 2015, 11:50:42) [GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.56)] on darwin, it doesn't choke. Both are using numpy 1.9.2 and pandas 0.16.2.

A final observation is that with that read_len it fails very quickly (quicker than it could possibly have read the file), whereas with smaller read_len values, say half, it waits a long time and then fails.

This issue and this one might be related.

I can't share the file because of data confidentiality, but I'd be happy to dig in and figure out what might be going wrong if someone has ideas or pointers.
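In the meantime, reading in chunks might sidestep the one huge read(); chunksize and iterator are right there in the read_stata signature above. A sketch (untested against this particular failure; the 100000-row chunk size is arbitrary, and on this pandas version you may need reader.get_chunk() in a loop instead of iterating directly):

import pandas as pd

path = "/Users/makmana/ciddata/Subnationals/Atlas/Colombia/beta/output2008_2013.dta"

# Ask for a StataReader that yields DataFrames chunk by chunk instead of
# doing one ~3.3GB read, then stitch the chunks back together.
reader = pd.read_stata(path, chunksize=100000)
df = pd.concat(reader, ignore_index=True)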
