
Conversation

@jakirkham
Member

@jakirkham jakirkham commented Jul 24, 2017

Fixes #1076

Provides a basic implementation of NumPy's argwhere for Dask Arrays, using compress under the hood. Builds on argwhere to implement NumPy's nonzero for Dask Arrays, and on nonzero to provide flatnonzero. Independently adds count_nonzero. Finally, adds support for where when only the condition is provided, by using nonzero in that case. All of these implementations are lazy and use unknown dimension lengths as needed.
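A quick usage sketch of the functions this PR adds, assuming the merged `dask.array` API names (`da.argwhere`, `da.nonzero`, `da.flatnonzero`, `da.count_nonzero`, `da.where`):

```python
import numpy as np
import dask.array as da

x = da.from_array(np.array([[0, 3], [4, 0]]), chunks=2)

# Coordinates of non-zero elements, one row per element
print(da.argwhere(x).compute())       # [[0 1], [1 0]]
print(int(da.count_nonzero(x).compute()))  # 2
print(da.flatnonzero(x).compute())    # [1 2]

# nonzero returns one index array per dimension
rows, cols = da.nonzero(x)
print(rows.compute(), cols.compute())  # [0 1] [1 0]
```

All of these return lazy Dask Arrays whose chunk lengths are unknown until computed, which is why `.compute()` is needed to see concrete values.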

cc @shoyer @mrocklin

Provides a basic implementation of NumPy's nonzero for Dask, using `compress` under the hood. As `compress` seems to be eagerly evaluated, `nonzero` is as well, which is worth keeping in mind. However, the `compress` implementation could be revisited and made lazy, which would make `nonzero` lazy too.
Simply calls the nonzero function from the method.
Performs a simple comparison between the results obtained from the nonzero function and method for Dask Arrays and those provided by NumPy.
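The compress-based strategy described in the commit messages above can be sketched in plain NumPy terms (this is an illustrative sketch with a hypothetical helper name, not the actual Dask implementation):

```python
import numpy as np

def argwhere_sketch(a):
    # Build an (a.size, a.ndim) array of element coordinates, then use
    # `compress` to keep only the rows whose element is non-zero.
    a = np.asarray(a)
    coords = np.indices(a.shape).reshape(a.ndim, a.size).T
    return np.compress(a.ravel() != 0, coords, axis=0)

a = np.array([[0, 3], [4, 0]])
print(argwhere_sketch(a))  # [[0 1], [1 0]], matching np.argwhere(a)
```

In the Dask version the same idea is applied blockwise, and since the number of surviving rows is data-dependent, the output has unknown chunk lengths.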
@shoyer
Member

shoyer commented Jul 24, 2017

I don't think it makes sense to add new dask.array functions that aren't lazy. So I would only support this if compress is made lazy first.

Simply takes a Dask Array, flattens it, calls `nonzero`, and drops the unneeded singleton tuple. This straightforward strategy reproduces the behavior of NumPy's implementation.
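That strategy is small enough to sketch directly (hypothetical helper name; the real `da.flatnonzero` works the same way):

```python
import numpy as np
import dask.array as da

def flatnonzero_sketch(a):
    # Flatten to 1-D, call `nonzero`, and drop the singleton tuple.
    return da.nonzero(da.ravel(a))[0]

x = da.from_array(np.array([[0, 5], [6, 0]]), chunks=2)
print(flatnonzero_sketch(x).compute())  # [1 2]
```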
@jakirkham jakirkham changed the title Add nonzero Add nonzero, flatnonzero Jul 24, 2017
Compares the behavior of the Dask Array implementation against the NumPy implementation to verify that they are the same.
Provides a simple implementation of `count_nonzero` for Dask Arrays. Does the simplest thing one might expect here: checks which elements are non-zero, then sums over the user-provided axis or axes (if any). This matches the NumPy implementation nicely without much work.
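The "simplest thing" amounts to a one-liner (sketch with a hypothetical name, assuming the `dask.array` API):

```python
import numpy as np
import dask.array as da

def count_nonzero_sketch(a, axis=None):
    # Check which elements are non-zero, then sum over the given axis/axes.
    return (a != 0).sum(axis=axis)

x = da.from_array(np.array([[0, 1], [2, 3]]), chunks=2)
print(int(count_nonzero_sketch(x).compute()))     # 3
print(count_nonzero_sketch(x, axis=0).compute())  # [1 2]
```

Unlike `nonzero`, the output shape here is known up front, so this function stays fully lazy with known chunk sizes.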
@mrocklin
Member

> I don't think it makes sense to add new dask.array functions that aren't lazy. So I would only support this if compress is made lazy first.

So I guess the question goes to @jakirkham : do you have any interest in also fixing compress? Presumably this is now doable because of the support for unknown chunk sizes with np.nan?

@jakirkham
Member Author

Unfortunately I don't really have time for such an activity. Sorry. 😞

Provides some basic comparison tests for count_nonzero using both the NumPy and Dask implementations to make sure they are in line with each other.
As Windows treats `int` as 32-bit, we need to force 64-bit to get the
expected result.
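The Windows note above can be illustrated in plain NumPy: a boolean sum accumulates in the platform default `int`, which is 32-bit on Windows, so the tests request 64 bits explicitly. A minimal illustration:

```python
import numpy as np

mask = np.array([[0, 1], [2, 3]]) != 0
# Boolean sums accumulate in the platform default `int` (32-bit on
# Windows); requesting a 64-bit accumulator gives a stable dtype everywhere.
total = mask.sum(dtype=np.int64)
print(total, total.dtype)  # 3 int64
```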
@jakirkham jakirkham changed the title Add nonzero, flatnonzero Add nonzero, flatnonzero, count_nonzero Jul 24, 2017
@jakirkham
Member Author

jakirkham commented Jul 24, 2017

Just to follow up on the concerns raised: I don't think there are any technical limitations that would block these functions from working with a lazy version of compress. Also, nothing that I know of in these functions is non-lazy in itself; it is merely that a non-lazy function from Dask's API is being used in their implementation. Given that the original issue noted that eager evaluation was required (and later noted it may not be required), this still meets those requirements. While I certainly understand the value of a lazy implementation of compress (and have raised issue #2540 on this point), I'm not convinced it is reasonable to block these functions on that account.

@mrocklin
Member

@shoyer, do you still have concerns here?

@shoyer
Member

shoyer commented Jul 25, 2017

Well, I suppose we could merge these but not add them to the public API for dask.array (i.e., omit them from dask/array/__init__.py). I still don't think it's a good idea to have non-lazy functions in the public API, since that is highly confusing.

@mrocklin
Member

OK, it sounds like you'd be more in favor of removing compress until it is lazy-ified rather than merge something like this.

It looks like this might be a non-issue after #2555

It appears that Dask Array's `stack` does not work with Dask Arrays that have an unknown dimension length. While that makes sense in the general case (we don't know their lengths), it doesn't apply here, where we do know them. In any event, we can update our tests accordingly, as we do here.
This was only needed when `compress` was unable to handle the unspecified `axis` case. However, as `compress` can now handle this case, there is no need for the argument. After all, both arrays are flattened anyway.
To keep the computation of the non-zero indices compact, run `compress` on them before splitting them out into separate arrays in a `tuple`. This also improves readability a bit.
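In NumPy terms, the refactor described above looks roughly like this (illustrative sketch with a hypothetical name, not Dask's actual code):

```python
import numpy as np

def nonzero_sketch(a):
    a = np.asarray(a)
    coords = np.indices(a.shape).reshape(a.ndim, a.size).T
    # Run `compress` once while the coordinates are still a single 2-D
    # array, and only afterwards split them into one index array per axis.
    nonzero_coords = np.compress(a.ravel() != 0, coords, axis=0)
    return tuple(nonzero_coords[:, i] for i in range(a.ndim))

print(nonzero_sketch([[0, 3], [4, 0]]))  # (array([0, 1]), array([1, 0]))
```

Compressing once and splitting afterwards avoids repeating the data-dependent filtering step for every dimension.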
@jakirkham jakirkham changed the title Add nonzero, flatnonzero, count_nonzero Add nonzero, flatnonzero, count_nonzero, where (special case) Jul 26, 2017
@jakirkham
Member Author

Yep makes perfect sense. Will push a change. Sorry was following up on another thread.

Adds a private function that performs a non-zero check on Python strings
that acts the same way NumPy does. This is then used inside `isnonzero`.
@jakirkham
Member Author

Is that more what you are looking for or do you want vectorize to be applied and assigned at the module level too?

@mrocklin
Member

I actually didn't know what to expect from np.vectorize. Here are some numbers:

In [1]: import numpy as np

In [2]: import cloudpickle

In [3]: from distributed.utils_test import inc

In [4]: len(cloudpickle.dumps(inc))
Out[4]: 33

In [5]: len(cloudpickle.dumps(np.vectorize(inc)))
Out[5]: 238

In [6]: len(cloudpickle.dumps(np.vectorize(lambda x: x + 1)))
Out[6]: 545

In [7]: %timeit len(cloudpickle.dumps(inc))
The slowest run took 4.29 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 15.9 µs per loop

In [8]: %timeit len(cloudpickle.dumps(np.vectorize(inc)))
10000 loops, best of 3: 116 µs per loop

In [9]: %timeit len(cloudpickle.dumps(np.vectorize(lambda x: x + 1)))
1000 loops, best of 3: 291 µs per loop

Either way is fine. This isn't in core functionality so it's not a big deal either way. I'm inclined to just move on.

@jakirkham
Member Author

jakirkham commented Jul 27, 2017

Ah, just saw your comment. Went ahead and moved the assignment out just to be safe. After all there is always the chance this goes awry if someone tries to load the pickled vectorized function in a separate process if it isn't defined at module level.

It's a good point that vectorize is slow. Then again, this is only used when a NumPy array of strings is in play. There's not much we can do about that, as we need to work around issues ( numpy/numpy#9479 ) and ( numpy/numpy#9462 ). Though I am open to suggestions if you know a better way.

@jakirkham
Member Author

jakirkham commented Jul 27, 2017

Got some spurious test failures in one build as can be seen. Could that build be restarted please?

Edit: This was done. Thanks.

@jakirkham
Member Author

Looks like it passes now. Any more thoughts on this?

Renames `_isnonzero_str` to `_isnonzero_vec`. Also changes its behavior to use `count_nonzero` on each value provided. This leverages NumPy's own non-zero implementation to determine whether something is non-zero instead of rolling our own. The result should be an implementation more robust to changes in NumPy. Also, this implementation can be used on all manner of types (not just strings).
Try a simple test conversion for the type in question to see if NumPy can in fact convert it to `bool`. From quick testing, this seems to properly pass through all numeric types and object arrays. It catches string types correctly and passes them on to the vectorization function. So it seems to retain the behavior we had before, but in a more forward-looking manner.

The nice thing about this check is that it tests the very operation Dask will need to perform to make the conversion happen, so it is a good indicator of whether we can proceed. Further, if NumPy fixes any cases that currently fail (or vice versa), we should still be able to catch and handle them appropriately. All of this is done before generating the Dask Array.
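The probing idea can be sketched like this (hypothetical helper name; the PR performs an equivalent check inside `isnonzero`):

```python
import numpy as np

def bool_castable(dtype):
    # Probe the exact conversion Dask will later perform: try casting a
    # zero-valued scalar array of this dtype to `bool`.
    try:
        np.zeros(tuple(), dtype=dtype).astype(bool)
    except ValueError:
        return False  # e.g. string dtypes, routed to the vectorized path
    return True

print(bool_castable(np.float64))      # True
print(bool_castable(np.dtype("U3")))  # False
```

Because the probe is the same `astype(bool)` call used later, any future change in NumPy's casting behavior is picked up automatically.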
@mrocklin
Member

Does anyone have any further comments on this? If not then I'll plan to merge sometime early tomorrow.

Given that `argwhere` is used to implement `nonzero` and other related
functions, it makes more sense to test `argwhere`'s ability to handle
`str`s instead of `nonzero`'s. The latter will effectively be proved by
the former working as intended.
Make sure `asarray` is called immediately after entering `argwhere` and
`count_nonzero`. Also drop the `asarray` call from `isnonzero` as it
should be handled before entering this function. This also provides the
added benefit of allowing `isnonzero` to work on NumPy arrays as is.
To ensure the `condition` is properly converted to a Dask Array and uses
our implementation of `nonzero`, call the function `nonzero` (as opposed
to the method) on `condition`.
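The single-argument `where` path can be exercised like this (assuming the merged `dask.array` API):

```python
import numpy as np
import dask.array as da

# With only `condition` given, `where` behaves like `nonzero`; calling the
# `nonzero` *function* also converts a NumPy `condition` to a Dask Array.
cond = np.array([[True, False], [False, True]])
rows, cols = da.where(cond)
print(rows.compute(), cols.compute())  # [0 1] [0 1]
```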
@mrocklin mrocklin merged commit a48abc1 into dask:master Jul 28, 2017
@mrocklin
Member

Thanks for all of the effort here @jakirkham . Merged!

@jakirkham jakirkham deleted the add_nonzero branch July 28, 2017 13:23
@jakirkham
Member Author

Thanks for the reviews @mrocklin and @shoyer. Also thanks @jcrist for rewriting compress to be lazy.

@jakirkham
Member Author

Hmm... it seems the docs are not building correctly. Namely, they do not show the docs for these functions. It appears to be an import error. My only guess is that the build is getting confused and importing from the dask that was pip-installed due to the distributed requirement instead of from source.

Could I suggest that the RTD settings get moved into a .readthedocs.yml file in the repo? It would make it easier to see what is going on and tweak it to get better behavior.

@mrocklin
Member

mrocklin commented Jul 28, 2017 via email

@jakirkham
Member Author

Ok, have opened issue ( #2568 ) to follow-up on this.

@jakirkham
Member Author

Also, I would add that the functionality provided here could be used to support `__getitem__` with a Dask Array mask. It may also be possible to support `__setitem__` through `where`. Just thinking these may come up down the road.
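For the mask-selection idea, a hypothetical sketch of how it could look (not part of this PR; modern Dask supports this pattern):

```python
import numpy as np
import dask.array as da

# Boolean-mask selection flattens the result, and since the number of
# selected elements is data-dependent, the output has unknown chunk
# lengths, just like `nonzero` and friends.
x = da.from_array(np.arange(6).reshape(2, 3), chunks=2)
mask = x % 2 == 0
print(x[mask].compute())  # [0 2 4]
```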
