-
-
Notifications
You must be signed in to change notification settings - Fork 3.5k
perf: optimise random_k_out_graph
#7702
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
random_k_out_graphrandom_k_out_graph
|
Thanks very much for this! The CI test errors are because we make sure that NetworkX still works even for environments which can't/won't install numpy. So we have a "base" test environment without numpy installed. What this means for the PR is that you need to move the For backward compatibility, it would be good to make this function work even if the import of numpy doesn't work. The code is only about 25 lines long, so maybe we should include both versions (with the native-only version including your improvement of updating the out-degree, but not the numpy parts). And the choice of which to use could be based on a flag set by whether the numpy import worked...??? There are others ways to implement this in a way that it could work in both settings. Some algorithms go so far as to separate the versions into two functions -- but I don't recall any that are generators. I think for now, could you put the other option in an if-clause like try:
import numpy as np
no_numpy = False
except ImportError as exc:
no_numpy = True
<setup>
if no_numpy:
<other stuff>
return G
<main stuff>Do other folks have thoughts about how to set this up for numpy and not-numpy? |
rossbar
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do other folks have thoughts about how to set this up for numpy and not-numpy?
This remains an open question IMO. I definitely agree that this should continue to work for users who don't have numpy installed, but I can't think of a way to provide interfaces to the user to choose (or automatically select) which function is used that doesn't have downsides.
Given the first constraint (that at least some version of this function must continue working even when numpy is not available), we're talking about maintaining two separate implementations. Call the (improved) Python-only impl def _random_k_out_graph_python and the one here def _random_k_out_graph_numpy. Then there are a couple of things that could be done:
- Automatically select which version is used based on whether numpy is available, i.e. something like:
def random_k_out_graph(...):
try:
import numpy as np
return _random_k_out_graph_numpy(...)
except ImportError:
return _random_k_out_graph_python(...)The most obvious problem here is that users are likely to get different results from the same inputs (including seed!) based on their environment. Introducing an implicit1 dependency on the environment seems like a terrible idea from the perspective of reproducibility/code transparency.
- Provide a switch for users:
def random_k_out_graph(..., *, use="numpy"):
if use == "numpy":
return _random_k_out_graph_numpy(...)
else:
return _random_k_out_graph_python(...)This avoids the problem from step 1 by being explicit, but introduces additional complexity/stuff to read in the docstring. And given this switch, it'd only be natural to wonder: are there other fnctions in networkx that behave like this? Do they all work this way (current answer: no). What if I want to use "pytorch"? Is this a backend? If not, why not? - how is this different from backends?
So - all that's to say: I don't really know what's best. Either way, that's a larger discussion and I don't want to let it block the nice improvements proposed here (thanks @TLouf !)
For this PR, I'd propose to actually include both the numpy implementation and the improvements to the Python implementation (perhaps as separate, private functions). Then we can hook both into the test suite and decouple from the API discussion.
Footnotes
-
I say implicit here because numpy is very commonly installed even when the user doesn't explicitly do
pip install numpyas it's a dependency for so many other things. This is opposed to backends, where the installation is explicit. ↩
|
I agree with the path forward that @rossbar puts forward. @TLouf can you create two functions (named something like I'll put together a quick change to tests to parametrize the current tests to run twice -- once with each private helper function, and I'll add a deprecation process. Unless of course you want to have fun with those implementations. If this is more than you are up for, let us know. If you are up for adding the "needs_numpy" entry in conftest.py, adding deprecation warnings and running tests on both private functions, go ahead and take it on. Do as much as you want and tell us what you'd prefer we do. Thanks again! |
|
Thanks for the positive feedback :) I can write the two functions and parametrize the tests, no problem. I'll let you @dschult add the deprecation process, you probably know best how to write the message. To add my two cents on the API discussion: I agree with @rossbar that adding an implicit dependency on the environment doesn't sound great. Having an explicit switch as a function kwarg and/or config flag would make a lot of sense to me. I'd argue it'd be better as a boolean (like |
rossbar
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks very much for including both the new implementation and the improvements to the existing implementation! Everything LGTM - I left a couple minor suggestions that occurred to me while reading through, but they are entirely subjective and can be ignored.
I approve adding these implementations - as it stands the PR violates the "the default should work without numpy" principle, but we can address the API point separately. IMO that discussion shouldn't block the addition of these nice improvements!
Co-authored-by: Ross Barnowski <rossbar@caltech.edu>
|
|
||
| @py_random_state(4) | ||
| @nx._dispatchable(graphs=None, returns_graph=True) | ||
| def random_k_out_graph(n, k, alpha, self_loops=True, seed=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From the perspective of a backend developer, it's nice to have @py_random_state or @np_random_state before @nx._dispatchable so that the seed parameter is resolved and the same as used in networkx, which is convenient and can make it easier to match networkx results (but this depends on implementation of any particular function).
What does everyone think about adding e.g. @np_or_py_random_state(4) decorator to make it easier for this pattern to be used elsewhere?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be tricky to make a @np_or_py_random_state decorator do what we want with this. The choice of seed object has to be made after checking which code to run.
Perhaps we could make a decorator specific to this situation... where we know that we will run the numpy code if numpy is installed and the python code if numpy is not installed. Does the name @np_random_state_if_possible describe what we want? Is there a better way to do this? It seems like we are setting ourselves up for making a decorator for each logical choice to be made. (but maybe there aren't many possible variants).
dschult
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found this generator to be particularly slow, so I dug into it and made several optimizations. The main bottleneck I found was the re-computation of out-degrees at every single step. Updating their values at every step already allowed for a ~20x performance improvement for
nx.random_k_out_graph(1000, 100, 0.1). I then introduced numpy to optimize further to reach a ~60x improvement for this same set of parameters (from 2 minutes down to 2 seconds). There are more lines of code but not really more complexity, as most lines are for variable initialisations and updates.