-
-
Notifications
You must be signed in to change notification settings - Fork 1.2k
Migrate iterators.py for datatree. #8879
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Migrate iterators.py for datatree. #8879
Conversation
| from abc import abstractmethod | ||
| from collections import abc | ||
| from typing import Callable, Iterator, List, Optional | ||
| from collections.abc import Iterator |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change looked a bit unexpected, but typing.Iterator is a deprecated alias for collections.abc.Iterator
|
|
||
|
|
||
| class AbstractIter(abc.Iterator): | ||
| def __init__( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Considering this __init__, using @dataclass could be an option here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Illviljan - sorry for not replying to this comment before. In the latest PR, the AbstractIter class was removed in favour or importing directly from anytree, so I think this comment is now addressed. Let me know if not, though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you want, LevelOrderIter can still use @dataclass.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wasn't familiar with this decorator prior to this PR, but it looks like it generates some magic methods for the class. I'm a little wary of adding them (for example, do we want LevelOrderIter.__eq__?). Is there a strong argument here for this class needing them?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could make this a dataclass, but I don't think we need to bother. The main advantage of using dataclass is automatically defining a bunch of methods that you know you want, but here we aren't defining a bunch of property methods / comparison operators so it wouldn't save us many lines of code. It's also nice to be able to just say "this came directly from anytree as-is".
|
As this module comes directly from the anytree library, we might need to copy its license into xarray/licenses. |
xarray/core/treenode.py
Outdated
| from xarray.core.iterators import PreOrderIter | ||
|
|
||
| return iterators.PreOrderIter(self) | ||
| return PreOrderIter(self) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So we currently have two patterns of iteration used: depth-first (PreOrderIter) and breadth-first (LevelOrderIter). It looks like PreOrderIter is used in TreeNode.subtree, and LevelOrderIter is used in mapping.diff_treestructure. diff_treestructure is called inside check_isomorphic, and every other time we iterate over the tree it just calls .subtree.
Why the distinction? In diff_treestructure when comparing two trees with multiple points at which their structure deviates, the order of iteration will affect which deviation is raised as an error first. @flamingbear and I think here is it more intuitive to raise errors from the top-level of the tree first, so we agree that LevelOrderIter is the right choice here.
For mapping over the nodes of the tree in .subtree, I had previously thought that there was a good reason to use PreOrderIter. My reasoning was to do with how dt['/a/b/c/d/'] = data would immediately create the nodes /a, /a/b, /a/b/c even if they didn't already exist. But actually I don't think that's relevant here.
Another reason to use PreOrderIter might be around using map_over_subtree to map an operation that then fails. For example taking dt.mean(dim='time') when some of the nodes doesn't have a time dimension. Again the order of iteration affects which node will raise the first error.
However, if we actually think that LevelOrderIter is fine for the .subtree case too, we can potentially simplify this PR considerably by replacing the whole AbstractIter/PreOrderIter/LevelOrderIter framework with just a single iterate_breadth_first function that returns an iterator, and use that in all cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Also then we wouldn't need the anytree license)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@TomNicholas - I've made some updates (or rather nabbed some work @flamingbear did and made some mypy tweaks).
A couple of things here:
- This now contains the single class, but hasn't wrapped it in a function. I think it gets at what you were after, but just FYI. (Although, part of me wonders if this now warrants a module all of it's own?)
- I left the
anytreelicense in the PR, because theLevelOrderIteris still 99% the code fromanytree.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This now contains the single class, but hasn't wrapped it in a function.
That's fine, thanks.
(Although, part of me wonders if this now warrants a module all of it's own?)
If you wanted to move it to treenode.py instead that would also make sense. But I don't have a strong opinion, and it's trivial to change later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay - I'm happy to leave this in it's own module for now.
xarray/core/iterators.py
Outdated
| """Iterate over tree applying level-order strategy starting at `node`. | ||
| This is the iterator used by `DataTree` to traverse nodes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| """Iterate over tree applying level-order strategy starting at `node`. | |
| This is the iterator used by `DataTree` to traverse nodes. | |
| """ | |
| Iterate over tree applying level-order strategy starting at `node`. | |
| This is the iterator used by `DataTree` to traverse nodes. |
| for ``node``. | ||
| maxlevel : int, optional | ||
| Maximum level to descend in the node hierarchy. | ||
| Examples |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| Examples | |
| Examples |
xarray/core/iterators.py
Outdated
| """ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| """ | |
| """ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made the suggested whitespace changes in this commit.
Well I made 2.5 of them - I left the first line of the class documentation string for LevelOrderIter on the same line, as this seems consistent with other documentation strings in the repository. I did de-dent the second line, though, because that definitely looked a bit off.
| descendants = root.descendants | ||
| expected = [ | ||
| "b", | ||
| "c", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test has no typing, why doesn't mypy complain?
New test files shouldn't be allowed to stay untyped:
Line 158 in a07e16c
| "xarray.tests.*", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With the values of expected, I'm guessing that because it's all just built-in Python types (a list of str values) mypy can understand that (here is a random similar example from test_dataset.py.
For the places where create_test_tree is called (so root in this test), the typing is done within that test function, so my guess is that mypy is pulling the typing from there (see L242 - L250).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's leave worrying about mypy on tests for #8926
| "/set2", | ||
| "/set3", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice. Interesting how few tests explicit rely on the order of iteration.
TomNicholas
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me.
| descendants = root.descendants | ||
| expected = [ | ||
| "b", | ||
| "c", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's leave worrying about mypy on tests for #8926
|
|
||
|
|
||
| class AbstractIter(abc.Iterator): | ||
| def __init__( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could make this a dataclass, but I don't think we need to bother. The main advantage of using dataclass is automatically defining a bunch of methods that you know you want, but here we aren't defining a bunch of property methods / comparison operators so it wouldn't save us many lines of code. It's also nice to be able to just say "this came directly from anytree as-is".
|
This looks good to me now too @TomNicholas are you waiting for another approval? |
This PR continues the overall work of migrating DataTree into xarray.
iterators.pydoes not have direct tests. In discussions with @TomNicholas and @flamingbear, we concurred that other unit tests utilise this functionality.iterators.pyTrack merging datatree into xarray #8572Tests addedwhats-new.rstNew functions/methods are listed inapi.rst