
Conversation

@jakirkham
Member

@jakirkham jakirkham commented Jul 24, 2017

Fixes #1076

Provides a basic implementation of NumPy's argwhere for Dask Arrays, using compress under the hood. Builds on argwhere to implement NumPy's nonzero for Dask Arrays, and on nonzero to provide flatnonzero. Independently adds count_nonzero. Finally, adds support for where when only the condition is provided, by using nonzero in that case. All of these implementations are lazy and use unknown dimension lengths as needed.
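A quick usage sketch of the functions this PR adds, assuming the merged `dask.array` API names (`da.argwhere`, `da.nonzero`, `da.flatnonzero`, `da.count_nonzero`, `da.where`):

```python
import numpy as np
import dask.array as da

x = da.from_array(np.array([[0, 3], [4, 0]]), chunks=2)

# Coordinates of non-zero elements, one row per element
print(da.argwhere(x).compute())       # [[0 1], [1 0]]
print(int(da.count_nonzero(x).compute()))  # 2
print(da.flatnonzero(x).compute())    # [1 2]

# nonzero returns one index array per dimension
rows, cols = da.nonzero(x)
print(rows.compute(), cols.compute())  # [0 1] [1 0]
```

All of these return lazy Dask Arrays whose chunk lengths are unknown until computed, which is why `.compute()` is needed to see concrete values.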

cc @shoyer @mrocklin

Provides a basic implementation of NumPy's nonzero for Dask, using `compress` under the hood. As `compress` seems to be eagerly evaluated, `nonzero` is as well, which is worth keeping in mind. However, the `compress` implementation could be revisited and made lazy, which would make `nonzero` lazy too.
Simply calls the nonzero function from the method.
Performs a simple comparison between the results obtained from the nonzero function and method for Dask Arrays and those provided by NumPy.
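The compress-based strategy described in the commit messages above can be sketched in plain NumPy terms (this is an illustrative sketch with a hypothetical helper name, not the actual Dask implementation):

```python
import numpy as np

def argwhere_sketch(a):
    # Build an (a.size, a.ndim) array of element coordinates, then use
    # `compress` to keep only the rows whose element is non-zero.
    a = np.asarray(a)
    coords = np.indices(a.shape).reshape(a.ndim, a.size).T
    return np.compress(a.ravel() != 0, coords, axis=0)

a = np.array([[0, 3], [4, 0]])
print(argwhere_sketch(a))  # [[0 1], [1 0]], matching np.argwhere(a)
```

In the Dask version the same idea is applied blockwise, and since the number of surviving rows is data-dependent, the output has unknown chunk lengths.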
@shoyer
Member

shoyer commented Jul 24, 2017

I don't think it makes sense to add new dask.array functions that aren't lazy. So I would only support this if compress is made lazy first.

Simply takes a Dask Array, flattens it, calls `nonzero`, and drops the unneeded singleton tuple. This straightforward strategy reproduces the behavior of NumPy's implementation.
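That strategy is small enough to sketch directly (hypothetical helper name; the real `da.flatnonzero` works the same way):

```python
import numpy as np
import dask.array as da

def flatnonzero_sketch(a):
    # Flatten to 1-D, call `nonzero`, and drop the singleton tuple.
    return da.nonzero(da.ravel(a))[0]

x = da.from_array(np.array([[0, 5], [6, 0]]), chunks=2)
print(flatnonzero_sketch(x).compute())  # [1 2]
```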
@jakirkham jakirkham changed the title Add nonzero Add nonzero, flatnonzero Jul 24, 2017
Compares the behavior of the Dask Array implementation against the NumPy implementation to verify that they are the same.
Provides a simple implementation of `count_nonzero` for Dask Arrays. Does the simplest thing one might expect here: checks which elements are non-zero, then sums over the user-provided axis or axes (if any). This matches the NumPy implementation nicely without much work.
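The "simplest thing" amounts to a one-liner (sketch with a hypothetical name, assuming the `dask.array` API):

```python
import numpy as np
import dask.array as da

def count_nonzero_sketch(a, axis=None):
    # Check which elements are non-zero, then sum over the given axis/axes.
    return (a != 0).sum(axis=axis)

x = da.from_array(np.array([[0, 1], [2, 3]]), chunks=2)
print(int(count_nonzero_sketch(x).compute()))     # 3
print(count_nonzero_sketch(x, axis=0).compute())  # [1 2]
```

Unlike `nonzero`, the output shape here is known up front, so this function stays fully lazy with known chunk sizes.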
@mrocklin
Member

> I don't think it makes sense to add new dask.array functions that aren't lazy. So I would only support this if compress is made lazy first.

So I guess the question goes to @jakirkham : do you have any interest in also fixing compress? Presumably this is now doable because of the support for unknown chunk sizes with np.nan?

@jakirkham
Member Author

Unfortunately I don't really have time for such an activity. Sorry. 😞

Provides some basic comparison tests for count_nonzero using both the NumPy and Dask implementations to make sure they are in line with each other.
As Windows treats `int` as 32-bit, we need to force 64-bit to get the
expected result.
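The Windows note above can be illustrated in plain NumPy: a boolean sum accumulates in the platform default `int`, which is 32-bit on Windows, so the tests request 64 bits explicitly. A minimal illustration:

```python
import numpy as np

mask = np.array([[0, 1], [2, 3]]) != 0
# Boolean sums accumulate in the platform default `int` (32-bit on
# Windows); requesting a 64-bit accumulator gives a stable dtype everywhere.
total = mask.sum(dtype=np.int64)
print(total, total.dtype)  # 3 int64
```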
@jakirkham jakirkham changed the title Add nonzero, flatnonzero Add nonzero, flatnonzero, count_nonzero Jul 24, 2017
@jakirkham
Member Author

jakirkham commented Jul 24, 2017

Just to follow up on the concerns raised: I don't think there are any technical limitations that would block these functions from working with a lazy version of compress. Also, nothing that I know of in these functions is non-lazy in itself; it is merely that a non-lazy function from Dask's API is being used in their implementation. Given that the original issue noted that eager evaluation was required (and later noted it may not be required), this still meets those requirements. While I certainly understand the value of a lazy implementation of compress (and have raised issue #2540 on this point), I'm not convinced it is reasonable to block these functions on that account.

@mrocklin
Member

@shoyer, do you still have concerns here?

@shoyer
Member

shoyer commented Jul 25, 2017

Well, I suppose we could merge these but not add them to the public API for dask.array (i.e., omit them from dask/array/__init__.py). I still don't think it's a good idea to have non-lazy functions in the public API, since that is highly confusing.

@mrocklin
Member

OK, it sounds like you'd be more in favor of removing compress until it is lazy-ified rather than merge something like this.

It looks like this might be a non-issue after #2555

It appears that Dask Array's `stack` does not work with Dask Arrays that have an unknown dimension length. While that makes sense in the general case (we don't know their lengths), it doesn't apply here, where we do know them. In any event, we can update our tests accordingly, as we do here.
This was only needed when `compress` was unable to handle the unspecified `axis` case. However, as `compress` can now handle this case, there is no need for the argument. After all, both arrays are flattened anyway.
To keep the computation of the non-zero indices compact, run `compress` on them before splitting them out into separate arrays in a `tuple`. This also improves readability a bit.
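In NumPy terms, the refactor described above looks roughly like this (illustrative sketch with a hypothetical name, not Dask's actual code):

```python
import numpy as np

def nonzero_sketch(a):
    a = np.asarray(a)
    coords = np.indices(a.shape).reshape(a.ndim, a.size).T
    # Run `compress` once while the coordinates are still a single 2-D
    # array, and only afterwards split them into one index array per axis.
    nonzero_coords = np.compress(a.ravel() != 0, coords, axis=0)
    return tuple(nonzero_coords[:, i] for i in range(a.ndim))

print(nonzero_sketch([[0, 3], [4, 0]]))  # (array([0, 1]), array([1, 0]))
```

Compressing once and splitting afterwards avoids repeating the data-dependent filtering step for every dimension.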
@jakirkham jakirkham changed the title Add nonzero, flatnonzero, count_nonzero Add nonzero, flatnonzero, count_nonzero, where (special case) Jul 26, 2017
@jakirkham
Member Author

Yep makes perfect sense. Will push a change. Sorry was following up on another thread.

Adds a private function that performs a non-zero check on Python strings
that acts the same way NumPy does. This is then used inside `isnonzero`.
@jakirkham
Member Author

Is that more what you are looking for or do you want vectorize to be applied and assigned at the module level too?

@mrocklin
Member

I actually didn't know what to expect from np.vectorize. Here are some numbers:

In [1]: import numpy as np

In [2]: import cloudpickle

In [3]: from distributed.utils_test import inc

In [4]: len(cloudpickle.dumps(inc))
Out[4]: 33

In [5]: len(cloudpickle.dumps(np.vectorize(inc)))
Out[5]: 238

In [6]: len(cloudpickle.dumps(np.vectorize(lambda x: x + 1)))
Out[6]: 545

In [7]: %timeit len(cloudpickle.dumps(inc))
The slowest run took 4.29 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 15.9 µs per loop

In [8]: %timeit len(cloudpickle.dumps(np.vectorize(inc)))
10000 loops, best of 3: 116 µs per loop

In [9]: %timeit len(cloudpickle.dumps(np.vectorize(lambda x: x + 1)))
1000 loops, best of 3: 291 µs per loop

Either way is fine. This isn't in core functionality so it's not a big deal either way. I'm inclined to just move on.

@jakirkham
Member Author

jakirkham commented Jul 27, 2017

Ah, just saw your comment. Went ahead and moved the assignment out just to be safe. After all there is always the chance this goes awry if someone tries to load the pickled vectorized function in a separate process if it isn't defined at module level.

It's a good point that vectorize is slow. Then again, this is only used when a NumPy array of strings is in play. There's not much we can do about that, as we need to work around issues ( numpy/numpy#9479 ) and ( numpy/numpy#9462 ). Though I am open to suggestions if you know a better way.

@jakirkham
Member Author

jakirkham commented Jul 27, 2017

Got some spurious test failures in one build as can be seen. Could that build be restarted please?

Edit: This was done. Thanks.

@jakirkham
Member Author

Looks like it passes now. Any more thoughts on this?

Renames `_isnonzero_str` to `_isnonzero_vec`. Also changes its behavior to use `count_nonzero` on each value provided. This leverages NumPy's own non-zero implementation to determine whether something is non-zero instead of rolling our own. The result should be an implementation more robust to changes in NumPy. Also, this implementation can be used on all manner of types (not just strings).
Try a simple test conversion for the type in question to see if NumPy can in fact convert it to `bool`. From quick testing, this seems to properly pass through all numeric types and object arrays. It catches string types correctly and passes them on to the vectorization function. So it seems to retain the behavior we had before, but in a more forward-looking manner.

The nice thing about this check is that it tests the very operation Dask will need to perform to make the conversion happen, so it is a good indicator of whether we can proceed. Further, if NumPy fixes any cases that currently fail (or vice versa), we should still be able to catch and handle them appropriately. All of this is done before generating the Dask Array.
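The probing idea can be sketched like this (hypothetical helper name; the PR performs an equivalent check inside `isnonzero`):

```python
import numpy as np

def bool_castable(dtype):
    # Probe the exact conversion Dask will later perform: try casting a
    # zero-valued scalar array of this dtype to `bool`.
    try:
        np.zeros(tuple(), dtype=dtype).astype(bool)
    except ValueError:
        return False  # e.g. string dtypes, routed to the vectorized path
    return True

print(bool_castable(np.float64))      # True
print(bool_castable(np.dtype("U3")))  # False
```

Because the probe is the same `astype(bool)` call used later, any future change in NumPy's casting behavior is picked up automatically.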
@mrocklin
Member

Does anyone have any further comments on this? If not then I'll plan to merge sometime early tomorrow.

Given that `argwhere` is used to implement `nonzero` and other related
functions, it makes more sense to test `argwhere`'s ability to handle
`str`s instead of `nonzero`'s. The latter will effectively be proved by
the former working as intended.
Make sure `asarray` is called immediately after entering `argwhere` and
`count_nonzero`. Also drop the `asarray` call from `isnonzero` as it
should be handled before entering this function. This also provides the
added benefit of allowing `isnonzero` to work on NumPy arrays as is.
To ensure the `condition` is properly converted to a Dask Array and uses
our implementation of `nonzero`, call the function `nonzero` (as opposed
to the method) on `condition`.
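The single-argument `where` path can be exercised like this (assuming the merged `dask.array` API):

```python
import numpy as np
import dask.array as da

# With only `condition` given, `where` behaves like `nonzero`; calling the
# `nonzero` *function* also converts a NumPy `condition` to a Dask Array.
cond = np.array([[True, False], [False, True]])
rows, cols = da.where(cond)
print(rows.compute(), cols.compute())  # [0 1] [0 1]
```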
@mrocklin mrocklin merged commit a48abc1 into dask:master Jul 28, 2017
@mrocklin
Member

Thanks for all of the effort here @jakirkham . Merged!

@jakirkham jakirkham deleted the add_nonzero branch July 28, 2017 13:23
@jakirkham
Member Author

Thanks for the reviews @mrocklin and @shoyer. Also thanks @jcrist for rewriting compress to be lazy.

@jakirkham
Member Author

Hmm... it seems the docs are not building correctly. Namely, they do not show the docs for these functions. It appears to be an import error. My only guess is that the build is getting confused and importing from the dask that was pip-installed due to the distributed requirement instead of from source.

Could I suggest that the RTD settings get moved into a .readthedocs.yml file in the repo? It would make it easier to see what is going on and tweak it to get better behavior.

@mrocklin
Member

mrocklin commented Jul 28, 2017 via email

@jakirkham
Member Author

Ok, have opened issue ( #2568 ) to follow-up on this.

@jakirkham
Member Author

Also, I would add that the functionality provided here could be used to support `__getitem__` with a Dask Array mask. It may also be possible to support `__setitem__` through `where`. Just thinking these may come up down the road.
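For the mask-selection idea, a hypothetical sketch of how it could look (not part of this PR; modern Dask supports this pattern):

```python
import numpy as np
import dask.array as da

# Boolean-mask selection flattens the result, and since the number of
# selected elements is data-dependent, the output has unknown chunk
# lengths, just like `nonzero` and friends.
x = da.from_array(np.arange(6).reshape(2, 3), chunks=2)
mask = x % 2 == 0
print(x[mask].compute())  # [0 2 4]
```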
