Description
Hello, this error happens when loading a large (~3.3GB) stata 13 file:
/Users/makmana/colombia/colombia/datasets.py in <lambda>()
111
112 industry4digit_department = {
--> 113 "read_function": lambda: pd.read_stata("/Users/makmana/ciddata/Subnationals/Atlas/Colombia/beta/output2008_2013.dta"),
114 "field_mapping": pila_to_atlas,
115 "classification_fields": {
/Users/makmana/colombia/env/lib/python3.4/site-packages/pandas/io/stata.py in read_stata(filepath_or_buffer, convert_dates, convert_categoricals, encoding, index, convert_missing, preserve_dtypes, columns, order_categoricals, chunksize, iterator)
160 return reader
161
--> 162 return reader.read()
163
164 _date_formats = ["%tc", "%tC", "%td", "%d", "%tw", "%tm", "%tq", "%th", "%ty"]
/Users/makmana/colombia/env/lib/python3.4/site-packages/pandas/io/stata.py in read(self, nrows, convert_dates, convert_categoricals, index, convert_missing, preserve_dtypes, columns, order_categoricals)
1349 self.path_or_buf.seek(self.data_location + offset)
1350 read_lines = min(nrows, self.nobs - self._lines_read)
-> 1351 data = np.frombuffer(self.path_or_buf.read(read_len), dtype=dtype,
1352 count=read_lines)
1353 self._lines_read += read_lines
OSError: [Errno 22] Invalid argument
> /Users/makmana/colombia/env/lib/python3.4/site-packages/pandas/io/stata.py(1351)read()
1350 read_lines = min(nrows, self.nobs - self._lines_read)
-> 1351 data = np.frombuffer(self.path_or_buf.read(read_len), dtype=dtype,
1352 count=read_lines)
where read_len is 3525947880 and self.path_or_buf is <_io.BufferedReader name='/Users/makmana/ciddata/Subnationals/Atlas/Colombia/beta/output2008_2013.dta'>.
At that point, the question in my head was "well, what /is/ a reasonable read_len?" So I binary-searched until I converged on a value of around 721000000. But then I quit a bunch of other applications and it somehow started working again! This makes me think it might have to do with available memory.
Another funny thing is that this happens on Python 3.4.2 (default, Oct 19 2014, 17:52:17) [GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.51)] on darwin, but when I do the same read_stata on Python 2.7.9 (default, Jan 7 2015, 11:50:42) [GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.56)] on darwin, it doesn't choke. Both are using numpy 1.9.2 and pandas 0.16.2.
A final insight is that it fails very quickly (quicker than it could have possibly loaded the file) with that read_len. With smaller read_len values, say dividing by 2, it waits for a long time and then fails.
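One way to test whether the size of a single read call is the trigger is to fetch the same bytes in several smaller reads. The following is only a minimal sketch under that assumption, not what pandas does and not a confirmed fix: `read_in_chunks` and its `max_chunk` parameter are hypothetical names invented for illustration, and the 1 GiB default is an arbitrary choice.

```python
import io


def read_in_chunks(fileobj, total_len, max_chunk=1 << 30):
    """Read up to total_len bytes from fileobj, at most max_chunk bytes per call.

    Hypothetical helper: it only splits one large read into smaller ones.
    Returns a bytearray, which is shorter than total_len if EOF comes first.
    """
    buf = bytearray(total_len)
    pos = 0
    with memoryview(buf) as view:
        while pos < total_len:
            end = min(pos + max_chunk, total_len)
            # readinto fills the window in place, so no extra copy is made
            with view[pos:end] as window:
                n = fileobj.readinto(window)
            if not n:  # 0 (or None) means no more data is available
                break
            pos += n
    # All memoryviews are released here, so the bytearray can be trimmed
    del buf[pos:]
    return buf


# Small demonstration on an in-memory file
data = bytes(range(256)) * 10
out = read_in_chunks(io.BytesIO(data), len(data), max_chunk=100)
print(len(out), bytes(out) == data)
```

If the chunked read succeeds where the single 3525947880-byte read fails, that would point at the per-call size rather than at the file contents. The returned bytearray can be passed to np.frombuffer in place of the bytes object from the original read.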
This issue and this one perhaps might be related.
I can't really share the file because of data confidentiality, but I'd be happy to dig in and figure out what might be going wrong if someone has ideas and pointers.