Add argwhere, *nonzero, where (cond) #2539
Provides a basic implementation of NumPy's nonzero for Dask. Makes use of `compress` to make it work. As `compress` seems to be eagerly evaluated, this means `nonzero` is as well. So it is worth keeping that in mind. However it is possible the `compress` implementation could be revisited and made lazy, which would ensure `nonzero` is lazy as well.
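The `compress`-based approach can be sketched with plain NumPy (the function name and structure here are illustrative, not the actual Dask implementation):

```python
import numpy as np

def nonzero_via_compress(a):
    """Illustrative sketch: build ``nonzero`` out of ``compress``.

    Flatten a mask of the non-zero elements, then compress one
    flattened index grid per axis with that mask.
    """
    a = np.asarray(a)
    mask = (a != 0).ravel()
    grids = np.indices(a.shape)  # one grid of indices per axis
    return tuple(np.compress(mask, g.ravel()) for g in grids)

a = np.array([[0, 1], [2, 0]])
print(nonzero_via_compress(a))  # same result as np.nonzero(a)
```

With Dask Arrays the same idea applies, but since `compress` is evaluated eagerly (as noted above), the result is as well.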
Simply calls the nonzero function from the method.
Performs a simple comparison between the results obtained from the `nonzero` function and method for Dask Arrays and those provided by NumPy.
I don't think it makes sense to add new dask.array functions that aren't lazy. So I would only support this if
Simply takes a Dask Array, flattens it, uses `nonzero`, and drops the unneeded singleton tuple. This straightforward strategy reproduces the behavior of NumPy's implementation.
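That strategy is short enough to show in full; here is a minimal NumPy sketch (the function name is hypothetical):

```python
import numpy as np

def flatnonzero_sketch(a):
    # Flatten, take nonzero, and unwrap the singleton tuple of indices.
    return np.nonzero(np.asarray(a).ravel())[0]

print(flatnonzero_sketch([[0, 3], [4, 0]]))  # -> [1 2]
```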
Simply compares the behavior of the Dask Array implementation against the NumPy implementation to verify that they are the same.
Provides a simple implementation of `count_nonzero` for Dask Arrays. Does the simplest thing one might expect in this scenario: namely, checks which elements are non-zero, then sums over the user-provided axis or axes (if any). This matches the NumPy implementation nicely without much work.
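The described approach amounts to mask-and-sum, sketched here in NumPy (the explicit 64-bit accumulator is an assumption, mirroring the Windows `int` note below):

```python
import numpy as np

def count_nonzero_sketch(a, axis=None):
    # Mark non-zero elements, then sum over the given axis or axes.
    # Cast to int64 so 32-bit default ints (e.g. on Windows) don't surprise us.
    return (np.asarray(a) != 0).astype(np.int64).sum(axis=axis)

print(count_nonzero_sketch([[0, 1], [2, 3]]))          # -> 3
print(count_nonzero_sketch([[0, 1], [2, 3]], axis=0))  # -> [1 2]
```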
So I guess the question goes to @jakirkham: do you have any interest in also fixing

Unfortunately I don't really have time for such an activity. Sorry. 😞
Provides some basic comparison tests for `count_nonzero` using both the NumPy and Dask implementations to make sure they are in line with each other.
As Windows treats `int` as 32-bit, we need to force 64-bit to get the expected result.
Just to follow up on the concerns raised, I don't think there are any technical limitations that would block these functions from working with a lazy version of
@shoyer, do you still have concerns here?

Well, I suppose we could merge these but not add them to the public API for dask.array (i.e., omit them from

OK, it sounds like you'd be more in favor of removing

It looks like this might be a non-issue after #2555.
Appears that Dask Array's `stack` does not work with Dask Arrays that have an unknown dimension length. While that makes sense in the general case (we don't know their lengths), it doesn't apply to this case where we do know their lengths. In any event, we can update our tests accordingly as we do here.
This was only needed when `compress` was unable to handle the unspecified `axis` case. However, as `compress` is now able to handle this case, there is no need for this argument. After all, both arrays are flattened anyway.
To keep the computation of the nonzero indices compact, run `compress` on the non-zero indices before splitting them out into separate arrays in a `tuple`. This also improves readability a bit.
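The shape of that change can be illustrated with NumPy: compute the 2-D array of non-zero indices once, then split it per axis into a tuple (sketch only; the helper name is made up):

```python
import numpy as np

def nonzero_from_argwhere(a):
    # argwhere gives a (n_nonzero, ndim) index array; splitting its
    # columns yields the per-axis tuple that nonzero returns.
    a = np.asarray(a)
    idx = np.argwhere(a)
    return tuple(idx[:, i] for i in range(a.ndim))

a = np.array([[0, 7], [8, 0]])
print(nonzero_from_argwhere(a))  # same as np.nonzero(a)
```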
Yep, makes perfect sense. Will push a change. Sorry, was following up on another thread.
Adds a private function that performs a non-zero check on Python strings that acts the same way NumPy does. This is then used inside `isnonzero`.
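For plain NumPy arrays, such a check could look like the following (the helper names are hypothetical; the empty-string rule is a simplifying assumption about how NumPy treats string truthiness):

```python
import numpy as np

def _nonzero_str(x):
    # Empty strings count as "zero"; anything else is non-zero.
    # (Simplified assumption; edge cases like "0" vary across NumPy versions.)
    return x != ""

# Vectorize so the scalar check maps over whole string arrays.
isnonzero_str = np.vectorize(_nonzero_str, otypes=[bool])

print(isnonzero_str(np.array(["", "a", "bc"])))  # -> [False  True  True]
```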
Is that more what you are looking for, or do you want `vectorize` to be applied and assigned at the module level too?
I actually didn't know what to expect from `np.vectorize`. Here are some numbers:

```
In [1]: import numpy as np

In [2]: import cloudpickle

In [3]: from distributed.utils_test import inc

In [4]: len(cloudpickle.dumps(inc))
Out[4]: 33

In [5]: len(cloudpickle.dumps(np.vectorize(inc)))
Out[5]: 238

In [6]: len(cloudpickle.dumps(np.vectorize(lambda x: x + 1)))
Out[6]: 545

In [7]: %timeit len(cloudpickle.dumps(inc))
The slowest run took 4.29 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 15.9 µs per loop

In [8]: %timeit len(cloudpickle.dumps(np.vectorize(inc)))
10000 loops, best of 3: 116 µs per loop

In [9]: %timeit len(cloudpickle.dumps(np.vectorize(lambda x: x + 1)))
1000 loops, best of 3: 291 µs per loop
```

Either way is fine. This isn't in core functionality so it's not a big deal either way. I'm inclined to just move on.
Ah, just saw your comment. Went ahead and moved the assignment out just to be safe. After all, there is always the chance this goes awry if someone tries to load the pickled vectorized function in a separate process when it isn't defined at module level. It's a good point that `vectorize` is slow. Then again, this is only used when a NumPy array of strings is in play. Not too much we can do about that, as we need to work around issues numpy/numpy#9479 and numpy/numpy#9462. Though I am open to suggestions if you know a better way.
Got some spurious test failures in one build, as can be seen. Could that build be restarted please?

Edit: This was done. Thanks.

Looks like it passes now. Any more thoughts on this?
Renames `_isnonzero_str` to `_isnonzero_vec`. Also changes its behavior to use `count_nonzero` on each value provided. This leverages NumPy's own non-zero implementation to determine whether something is non-zero instead of rolling our own. The result should be a more robust implementation with respect to changes in NumPy. Also, this implementation can be used on all manner of types (not just strings).
Try a simple test conversion for the type in question to see if NumPy can in fact convert it to `bool`. From quick testing, this seems to properly pass through all numeric types and object arrays. It catches string types correctly and passes them on to the vectorization function. So it retains the behavior we had before, but in a more forward-looking manner. The nice thing about this check is that it is actually testing the operation Dask will need to perform to make the conversion happen, so it is a good indicator of whether it can proceed or not. Further, if NumPy fixes any cases that currently fail (or vice versa), we should still be able to catch them correctly and handle them appropriately. All of this is done before generating the Dask Array.
Does anyone have any further comments on this? If not, then I'll plan to merge sometime early tomorrow.
Given that `argwhere` is used to implement `nonzero` and other related functions, it makes more sense to test `argwhere`'s ability to handle `str`s instead of `nonzero`'s. The latter will effectively be proved by the former working as intended.
Make sure `asarray` is called immediately after entering `argwhere` and `count_nonzero`. Also drop the `asarray` call from `isnonzero` as it should be handled before entering this function. This also provides the added benefit of allowing `isnonzero` to work on NumPy arrays as is.
To ensure the `condition` is properly converted to a Dask Array and uses our implementation of `nonzero`, call the function `nonzero` (as opposed to the method) on `condition`.
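In other words, the one-argument form reduces to `nonzero`, as in this NumPy sketch (illustrative only):

```python
import numpy as np

def where_cond(condition):
    # where(condition) with no x/y arguments is just nonzero(condition).
    return np.nonzero(np.asarray(condition))

a = np.array([[0, 5], [6, 0]])
print(where_cond(a != 0))  # same as np.where(a != 0)
```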
Thanks for all of the effort here, @jakirkham. Merged!
Hmm... seems the docs are not building correctly. Namely, it is not showing the docs for these functions. Seems to be an import error (https://readthedocs.org/projects/dask/builds/5756209/). My only guess is it is getting confused and importing from the dask that was pip-installed due to the distributed requirement (https://github.com/dask/dask/blob/a48abc1d6e84f8c597e0c58fba27661b45b558dc/docs/requirements-docs.txt#L7) instead of from source.

Could I suggest that the RTD settings get moved into a .readthedocs.yml (http://docs.readthedocs.io/en/latest/yaml-config.html) file in the repo? It would make it easier to see what is going on and tweak it to get better behavior.

Fine with me. I don't know much about this personally.
OK, have opened issue #2568 to follow up on this.
Also would add that the functionality provided here could be used to support
Fixes #1076
Provides a basic implementation of NumPy's `argwhere` for Dask Arrays. Makes use of `compress` to make it work. Leverages this implementation of `argwhere` to make an implementation of NumPy's `nonzero` for Dask Arrays. Also builds on top of this to provide `flatnonzero`. Independently also adds `count_nonzero`. Finally adds support for `where` when only the condition is provided, by using `nonzero` in this case. All of these implementations are made to be lazy and use unknown dimension lengths as needed.

cc @shoyer @mrocklin
argwherefor Dask Arrays. Makes use ofcompressto make it work. Leverages this implementation ofargwhereto make an implementation of NumPy'snonzerofor Dask Arrays. Also builds on top of this to provideflatnonzero. Independently also addscount_nonzero. Finally adds support forwherewhen only the condition is provided by usingnonzeroin this case. All of these implementations are made to be lazy and use unknown dimension lengths as needed.cc @shoyer @mrocklin