-
-
Notifications
You must be signed in to change notification settings - Fork 1.2k
Roundtrip unicode strings even when written as character arrays #1648
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
jhamman
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@shoyer - looks pretty good. I had a few small comments. I should say, I tend to gloss over pretty quickly when looking at Python's string encoding stuff so hopefully someone else can review some of the nitty-gritty logic/tests.
doc/io.rst
Outdated
|
|
||
| xarray can write unicode strings to netCDF files in two ways: | ||
|
|
||
| - As variable lengths strings. This is only supported on netCDF4 (HDF5) files. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To use the nc4/hdf lingo, perhaps "variable length arrays" is clearer. Also, typo in "lengths".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
netCDF4-python refers to "VLEN strings" in a section entitled "Variable length (VLEN) types": http://unidata.github.io/netcdf4-python/#section11
and h5py talks about "variable-length UTF-8":
http://docs.h5py.org/en/latest/strings.html#variable-length-utf-8
These both sound like variable length strings to me.
(I fixed the typo.)
| '(https://github.com/Unidata/netcdf4-python/issues/730). ' | ||
| "Either remove '_FillValue' from encoding on variable %r " | ||
| "or set {'dtype': 'S1'} in encoding to use the fixed width " | ||
| 'NC_CHAR type.' % name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This error is raised in the h5netcdf backend. It may be confusing to get an error that says, "netCDF4 doesn't support what you're trying to do." If you think it is important to call out netCDF4 here, perhaps lump both backends together, i.e. "the h5netcdf/netCDF4 backends do not yet support..."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. Fixed to refer to an h5netcdf issue instead.
|
|
||
| def test_roundtrip_string_encoded_characters(self): | ||
| # Override method in DatasetIOTestCases - not applicable to dask | ||
| pass |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we use pytest.skip() here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure this is a skipped test, so much as a test that really shouldn't exist at all :). We slightly abuse the notion of a "backend" for dask here.
xarray/tests/test_backends.py
Outdated
| with self.roundtrip(original) as actual: | ||
| self.assertDatasetIdentical(expected, actual) | ||
| except NotImplementedError: | ||
| pass |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we use pytest.skip() here and the pass statement above.
e854bdc to
f8142a3
Compare
shoyer
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I cleaned up my _FillValue tests to explicit check for errors -- that seems like a better approach for now, rather than marking them as expected failures or skipping them.
doc/io.rst
Outdated
|
|
||
| xarray can write unicode strings to netCDF files in two ways: | ||
|
|
||
| - As variable lengths strings. This is only supported on netCDF4 (HDF5) files. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
netCDF4-python refers to "VLEN strings" in a section entitled "Variable length (VLEN) types": http://unidata.github.io/netcdf4-python/#section11
and h5py talks about "variable-length UTF-8":
http://docs.h5py.org/en/latest/strings.html#variable-length-utf-8
These both sound like variable length strings to me.
(I fixed the typo.)
| '(https://github.com/Unidata/netcdf4-python/issues/730). ' | ||
| "Either remove '_FillValue' from encoding on variable %r " | ||
| "or set {'dtype': 'S1'} in encoding to use the fixed width " | ||
| 'NC_CHAR type.' % name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. Fixed to refer to an h5netcdf issue instead.
|
|
||
| def test_roundtrip_string_encoded_characters(self): | ||
| # Override method in DatasetIOTestCases - not applicable to dask | ||
| pass |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure this is a skipped test, so much as a test that really shouldn't exist at all :). We slightly abuse the notion of a "backend" for dask here.
@jhamman This reminds me of myself three years ago when I wrote these original messy tests! The mostly but not entirely compatible semantics of Python 3, NumPy, netCDF and HDF5 makes this very hard to get right. |
Unicode strings (
stron Python 3) are now round-tripped successfully even when written as character arrays (e.g., as netCDF3 files or when usingengine='scipy'). This is controlled by the_Encodingattribute convention, which is also understood directly by the netCDF4-Python interface.This PR also resolves some long-standing technical debt in the test suite related to the hacky use of
decode_bytesinassert_allclose(recently encountered by @jhamman in #1609). Once we're sure that we don't need it anymore, I'd like to deprecate and eventually remove thedecode_bytesoption.Note that there are still a few unresolved issues with regards to serializing missing values in strings, so I've intentionally held off on documenting the handling of
_FillValuefor now. I'd like to resolve those separately after discussion in #1647, but ideally this could make it in for the v0.10 release.open_dataset#1638git diff upstream/master | flake8 --diffwhats-new.rstfor all changes andapi.rstfor new API