-
-
Notifications
You must be signed in to change notification settings - Fork 1.2k
fix: be more cautious when guessing what a backend can open #10804
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: be more cautious when guessing what a backend can open #10804
Conversation
shoyer
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great, thanks @ianhi !
xarray/backends/pydap_.py
Outdated
| if not (isinstance(filename_or_obj, str) and is_remote_uri(filename_or_obj)): | ||
| return False | ||
|
|
||
| # Check file extension to avoid claiming non-OPeNDAP URLs (e.g., remote Zarr stores) | ||
| _, ext = os.path.splitext(filename_or_obj.rstrip("/")) | ||
| # Pydap handles OPeNDAP endpoints, which typically have no extension or .nc/.nc4 | ||
| # Reject URLs with non-OPeNDAP extensions like .zarr | ||
| return ext not in {".zarr", ".zip", ".tar", ".gz"} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not 100% sure on this. We could go further and require "dap" to be in the URL
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think there's a standard extension for OpenDAP URLs. @Mikejmnez do you know?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I checked with a co-worker on slack. He said:
There's no standard extension for DAP URLs. Explicitly excluding .zarr seems good enough for this disambiguation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting. Yes, there is no standard extension for opendap urls. OPeNDAP servers produce urls with the filename at the end, but for example NASA does something completely different. Excluding .zarr should be good.
What I am trying to push for this, is an opendap protocol-ization via the URL scheme. This is "dap2://<file_url>" vs "dap4://<file_url>". I already added it to the documentation back then dap2vdap4 Right now, if an opendap begins with http, then it is assumed to be dap2. This is completely on the client side and not a server thing. But pydap and python-netcdf4 support this, some NASA subsetting tools do this. Perhaps this may help separating opendap urls from non-opendap urls
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, actually, Thredds (TDS) does have this "standard" way to specify the protocol that may help to discern between opendap url vs non-opendap url: a TDS dap2 url will have a dodsC in its urls. A TDS dap4 url will have a dap4 in its url. (see here). However, an organization running an opendap server may decide how their own urls are exposed.
|
Conceptually, I think there is an ambiguity about Based on how it's used in open_dataset, I think we should pick defaults closer to "definitely" (which is what you do in this PR). It's a better user experience to require an explicit |
I agree. I can make a follow up PR to the backends page to try make this more explicit. For the case of the dap protocol I skimmed the specification https://opendap.github.io/dap4-specification/DAP4.html and didn't see anything in particular that made me feel confident about searching the URL for But for this PR I feel pretty happy with where it is and would defer further dap improvements to later, modulo a small fix to lower case the extension to be slightly more robust that I'll push shortly |
Actually im not sure that's reasonable. Other backends do a similar check without thinking about case: xarray/xarray/backends/scipy_.py Lines 365 to 367 in eed12c4
is that more correct? |
Really hard to say. Honestly, I think even this list of extensions is pretty generous:
|
b06a0a5 to
7ed1f0a
Compare
The only think that I can think of, that would be opendap specific, and part of the spec (dap4), appears when using a constraint expression in the url. These can often be part of the user-provided URL. For example: url = "http://test.opendap.org/opendap/dap4/StaggeredGrid.nc4" # full data
url_ce2 = "http://test.opendap.org/opendap/dap4/StaggeredGrid.nc4?dap4.ce=/time[0:1:0];/Y[0:1:39];/X[0:1:39];/Eta[0:1:0][0:1:39][0:1:39]"
url_ce2 ="http://test.opendap.org/opendap/dap4/StaggeredGrid.nc4?dap4.ce=/time=[0:1:0];/Y=[0:1:39];/X=[0:1:39];/time;/X;/Y;/Eta"
|
Since it is often but not "always" what would you recommend conditioning guessing true on these? It would be a signfiicant increase in strictness over what is there now. The current changes are a medium increase. I'm not an dap user so I don't have a strong opinion. |
|
guess_can_open should be pretty conservative, something like 95% confident
that the given backend can open the given file. It is a way better user
experience to ask them to explicitly supply an engine than to guess wrong
and get a mysterious error message. So pydap should require something
explict to identify DAP, rather than guessing it can open any HTTP url.
…On Thu, Oct 2, 2025 at 11:13 AM Ian Hunt-Isaak ***@***.***> wrote:
*ianhi* left a comment (pydata/xarray#10804)
<#10804 (comment)>
dap4.ce= must appear in the query parameter. That is a exclusively opendap
that is implemented by any dap4 server.
hese can often be part of the user-provided URL.
Since it is often but not "always" what would you recommend conditioning
guessing true on these? It would be a signfiicant increase in strictness
over what is there now. The current changes are a medium increase. I'm not
an dap user so I don't have a strong opinion.
—
Reply to this email directly, view it on GitHub
<#10804 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAJJFVRUMOHILDO2VR5JWMD3VVTMXAVCNFSM6AAAAACH5N2T7SVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZTGNRSGQYTOMJVGY>
.
You are receiving this because you commented.Message ID:
***@***.***>
|
|
If the url has either |
So the NETCDF4 backend should also say yes to urls with dap in them? I tried our your URL with both pydap and netcdf4 backends, but ran into an error for both, unfortunately
seems to work but then fails with So i wasn't able to manually verify |
I'll defer to @Mikejmnez, but my impression is that we should definitely be preferring pydap to netCDF4 for DAP. I think DAP support is optional in netCDF-C. So I would lean towards not claiming DAP urls in the netcdf4 backend, or maybe just being sure we try pydap before netcdf4. |
It looks like the test server was down overnight, and got restarted this morning. I tried it and it worked for me.
My preference, would be to have any |
|
@shoyer While it's possible to build netcdf-c without DAP support, it's almost always built on and is frequently used. For instance, the netcdf4-python wheels are built with DAP support turned on in netCDF-C, as are the pacakges on conda-forge. So, PLEASE leave netcdf4-python able to read DAP by default. EDIT: My default environment has neither h5netcdf nor pydap, just netcdf4. I'd really like for that environment to not suddenly start breaking my existing code/examples. |
|
And I am up for keeping the defaults as is, and definitely NOT break people's workflows. I think the issue at hand is how to identify dap urls. |
I had a thought the other day when working on a custom zarr backend, that it would be nice to have a robust and cross-file-type system for expressing user preference for backend resolution order. It currently seems to be alphabetical, but with a special case of reordering the 3 built in netcdf backend. So the default guessing order will be: from the sorting here: xarray/xarray/backends/plugins.py Lines 94 to 100 in eed12c4
but if i add a zarr backend for a subtype of zarr it will only come first due to alphbetical order, but if i had a backend named like
I will make sure to not remove the current situation where netcdf4 can report as being able to a. This is just about the automatic engine resolution, not changing anything for an explicit |
|
This is also related to this issue i guess: #10810 (comment) |
|
Ok. Following the meeting today I am now happy with the state of this PR. It improves on the current state by having engine's not guess things that cannot read. However, it does not change the resolution order in the case that a backend can read something. This means we are still left the ambiguity of what a url ending in I have also added a section to the |
66c9e1a to
079b290
Compare
xarray/tests/test_backends.py
Outdated
| ("DAP4://example.com/dataset", "pydap"), # uppercase scheme | ||
| ("https://example.com/services/DAP2/dataset", "pydap"), # uppercase in path |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Following #10804 (comment) can we check that netCDF4 gets chosen for these if pydap is not installed; you can use has_pydap from tests/__init__.py for the check and remove the requires_pydap decorator
dcherian
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Just one minor test request
|
The remaining failures are not casued by the content of this PR. Docs build: failing due to intersphinx isssue flaky tests. Not caused by this PR but related. This URL is returning a 404 http://test.opendap.org/opendap/dap4/unaligned_simple_datatree.nc.h5.dmr i think the test server is down? Or possibly the content of the server was modified. looking here http://test.opendap.org/opendap/dap4/contents.html it sseems that the file we were looking for is not present @Mikejmnez maybe knows more here? |
|
Ok let's try this out! |
| return True | ||
| # Helper to check if magic number is netCDF or HDF5 | ||
| def _is_netcdf_magic(magic: bytes) -> bool: | ||
| return magic.startswith((b"CDF", b"\211HDF\r\n\032\n")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this what runs when trying URLs too? If so, netcdf4 will never be the default and opendap URLs will fail without an engine specification. I'm seeing crashes due to this in many libraries. Or the is_remote_uri is failing? (Looks like is_remote no longer runs here.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do you have a public example for which this is failing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess any opendap URL that doesn't specify engine=netcdf4, but expect it to be the default. Here is an example:
import xarray
url = "https://www.neracoos.org/erddap/griddap/WW3_EastCoast_latest"
ds = xarray.open_dataset(url) # Doesn't work anymore unless we add `engine=netcdf4`,.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried it - If you specify the dap2 protocol in the url scheme, it picks up engine="pydap". The following works.
url = "dap2://www.neracoos.org/erddap/griddap/WW3_EastCoast_latest"
ds = xr.open_dataset(url)(erddap only implements dap2, but I think they're moving towards dap4/dmrpp as well). I imagined that without specifying the protocol it would have been picked up by netcdf4
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, but the problem is that this is a breaking change, no? In the past URLs could be passed to netcdf4 engine as the default, not it is forcing most users to either specify it or install another dependency. Like @dopplershift comment here, netcdf4-python is likely installed and dap enable in most, it not all, scientific environments.
PS: The original URL is auto-generated via ERDDAP, one cannot know before hand and edit it. Seems like unnecessary extra steps just to get something to work, which used to be easy and automatic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd like to avoid requiring every biology user of xarray needing to do extra work to specify the engine.
I can appreciate that, but this is currently being exchanged for all of my existing users who rely on netcdf4 for DAP2 now needing to specify the engine (or install pydap). Which is why I commented above, and I'm a bit disappointed it was still broken.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
guessing things using code is always hard to get right and should generally be avoided as much as possible since it has a non-negligible maintenance burden.
We do it anyways because it may be helpful to guess file types in interactive workflows. However, I do think that anyone who knows the file type in advance (or at least knows the engine that can open it) should specify the engine explicitly, and that we should nudge people in that direction. Something along the lines of, "this is a guessing mechanism for interactive use. If you need long-term stability please specify the engine parameter explicitly". (I'm not sure on the exact mechanism we can use to advertise this. A UserWarning, maybe? That can get pretty annoying in the interactive use case, though)
Edit: what I'm trying to say is that I personally would like to have specifying the engine parameter explicitly perceived as the default, and that the guessing feature is only used during the exploratory steps when writing a workflow (if we were to ever write a linter for xarray code, it should complain for open_{dataarray,dataset,datatree,groups} calls if the engine parameter is missing).
From a library maintenance perspective, I would argue that 2 is actually the more desirable option. As-in, we roll back the change and release again soon, and then start a deprecation cycle for the current guessing behavior.
@dcherian, @shoyer, would you have some time to look at this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I appreciate that this behavior (defaulting to netCDF4/DAP for URL strings) is long-standing in Xarray, but I don't think it's a sustainable design in the long-term. It is fundamentally highly ambiguous what a URL means (is it an DAP endpoint? is it an HTTP server? is it some Zarr end-point?), and there's no easy way to disambiguate (this is different from the situation for local files, where we can inspect headers).
So I agree that we should restoring netCDF4 grabbing most URLs for now (sudden breaking changes are bad!), but I think this behavior should be deprecated, and users should be encouraged to make an explicit choice of backend for HTTP URLs. The users should see something like FutureWarning: Xarray will not no longer guess that netCDF4 can open all URLs in the future. Set engine='netcdf4' explicitly (or another xarray backend engine) to silence this message.
There are many ways we could allow users to be explicit about the sort of files they are opening, and in the long term I think this is a way better strategy that adding more clever "guessing" logic. Some possibilities, which hopefully are not too much worse than xr.open_dataset('https://myserver.org/dataset'):
- Explicit protocol in the URI:
xr.open_dataset('dap://myserver.org/dataset') - Explicit suffix in the URI:
xr.open_dataset('https://myserver.org/dataset.dap') - Explicit engine keyword:
xr.open_dataset('https://myserver.org/dataset', engine='netcdf4') - Explicit constructors:
xr.open_dap_dataset('https://myserver.org/dataset')(and maybe alsoopen_dap_datatree()andopen_dap_array()) - String matching anywhere in a URI:
xr.open_dataset('https://myserver.org/erddap/dataset')(I really don't like this option, it seems very error prone/surprising)
The explicit constructor option is probably my favorite. We would only need a few of these built into Xarray (e.g., open_netcdf_dataset(), open_dap_dataset(), open_zarr_dataset()) and it would be entirely explicit for both users and type-systems, with just a few extra characters typed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the one big downside with explicit constructors is that you'd get an explosion in opening functions, as you'd have one per engine and output type. We have scipy, h5netcdf, netcdf4, zarr and pydap as builtin engines and the DataArray, Dataset, and DataTree types, giving us an additional 15 constructor functions. And if I'm missing something and you'd combine the functions for the netcdf files you'd still need to have an engine parameter to choose the netcdf library.
As such, I'm very much in favor of the explicit engine parameter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would suggest making separate constructors for separate filetypes, not backend engines. But that's still at least 9 different versions.
| if is_remote: | ||
| path = strip_uri_params(path) | ||
| _, ext = os.path.splitext(path) | ||
| return ext in {".nc", ".nc4", ".cdf"} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Mikejmnez the problem is here. You are expecting an extension, when it is for some opendap URLs.
Previously pydap or netcdf if installed would grab any remote URL according the order of backend resolution.
whats-new.rstapi.rst