Skip to content

Conversation

@TLouf
Copy link
Contributor

@TLouf TLouf commented Oct 30, 2024

I found this generator to be particularly slow, so I dug into it and made several optimizations. The main bottleneck I found was the re-computation of out-degrees at every single step. Updating their values at every step already allowed for a ~20x performance improvement for nx.random_k_out_graph(1000, 100, 0.1). I then introduced numpy to optimize further to reach a ~60x improvement for this same set of parameters (from 2 minutes down to 2 seconds). There are more lines of code but not really more complexity, as most lines are for variable initialisations and updates.

@TLouf TLouf changed the title perf: use numpy in random_k_out_graph perf: optimise random_k_out_graph Oct 30, 2024
@dschult
Copy link
Member

dschult commented Oct 30, 2024

Thanks very much for this!

The CI test errors are because we make sure that NetworkX still works even for environments which can't/won't install numpy. So we have a "base" test environment without numpy installed. What this means for the PR is that you need to move the import numpy as np inside the function (wrapped in a try/except for ImportError) and add this function to the list in networkx/conftest.py in the section indicating needs_numpy (about line 200).

For backward compatibility, it would be good to make this function work even if the import of numpy doesn't work. The code is only about 25 lines long, so maybe we should include both versions (with the native-only version including your improvement of updating the out-degree, but not the numpy parts). And the choice of which to use could be based on a flag set by whether the numpy import worked...???

There are others ways to implement this in a way that it could work in both settings. Some algorithms go so far as to separate the versions into two functions -- but I don't recall any that are generators. I think for now, could you put the other option in an if-clause like

try:
    import numpy as np
    no_numpy = False
except ImportError as exc:
    no_numpy = True

<setup>

if no_numpy:
    <other stuff>
    return G

<main stuff>

Do other folks have thoughts about how to set this up for numpy and not-numpy?

Copy link
Contributor

@rossbar rossbar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do other folks have thoughts about how to set this up for numpy and not-numpy?

This remains an open question IMO. I definitely agree that this should continue to work for users who don't have numpy installed, but I can't think of a way to provide interfaces to the user to choose (or automatically select) which function is used that doesn't have downsides.

Given the first constraint (that at least some version of this function must continue working even when numpy is not available), we're talking about maintaining two separate implementations. Call the (improved) Python-only impl def _random_k_out_graph_python and the one here def _random_k_out_graph_numpy. Then there are a couple of things that could be done:

  1. Automatically select which version is used based on whether numpy is available, i.e. something like:
def random_k_out_graph(...):
    try:
        import numpy as np
        return _random_k_out_graph_numpy(...)
    except ImportError:
        return _random_k_out_graph_python(...)

The most obvious problem here is that users are likely to get different results from the same inputs (including seed!) based on their environment. Introducing an implicit1 dependency on the environment seems like a terrible idea from the perspective of reproducibility/code transparency.

  1. Provide a switch for users:
def random_k_out_graph(..., *, use="numpy"):
    if use == "numpy":
        return _random_k_out_graph_numpy(...)
    else:
        return _random_k_out_graph_python(...)

This avoids the problem from step 1 by being explicit, but introduces additional complexity/stuff to read in the docstring. And given this switch, it'd only be natural to wonder: are there other fnctions in networkx that behave like this? Do they all work this way (current answer: no). What if I want to use "pytorch"? Is this a backend? If not, why not? - how is this different from backends?


So - all that's to say: I don't really know what's best. Either way, that's a larger discussion and I don't want to let it block the nice improvements proposed here (thanks @TLouf !)

For this PR, I'd propose to actually include both the numpy implementation and the improvements to the Python implementation (perhaps as separate, private functions). Then we can hook both into the test suite and decouple from the API discussion.

Footnotes

  1. I say implicit here because numpy is very commonly installed even when the user doesn't explicitly do pip install numpy as it's a dependency for so many other things. This is opposed to backends, where the installation is explicit.

@dschult
Copy link
Member

dschult commented Oct 30, 2024

I agree with the path forward that @rossbar puts forward.

@TLouf can you create two functions (named something like _random_k_out_graph_python and _random_k_out_graph_numpy) without doc_strings or dispatchable decorator -- just the *_random_state decorators as needed. Leave the original function (with doc_string and dispatchable decorator) and have it call the numpy version for now... We'll need to figure out whether we want a keyword arg or an environment choice to be made.

I'll put together a quick change to tests to parametrize the current tests to run twice -- once with each private helper function, and I'll add a deprecation process. Unless of course you want to have fun with those implementations.

If this is more than you are up for, let us know. If you are up for adding the "needs_numpy" entry in conftest.py, adding deprecation warnings and running tests on both private functions, go ahead and take it on. Do as much as you want and tell us what you'd prefer we do.

Thanks again!

@TLouf
Copy link
Contributor Author

TLouf commented Oct 30, 2024

Thanks for the positive feedback :)

I can write the two functions and parametrize the tests, no problem. I'll let you @dschult add the deprecation process, you probably know best how to write the message.

To add my two cents on the API discussion: I agree with @rossbar that adding an implicit dependency on the environment doesn't sound great. Having an explicit switch as a function kwarg and/or config flag would make a lot of sense to me. I'd argue it'd be better as a boolean (like use_numpy) that should be False by default, rather than a kind of ""backend"" selection, though. So to have it opt-in, rather than opt-out, to avoid reproducibility issues for unaware users. This reminds me a bit of how GeoPandas introduced PyGEOS, a faster backend for spatial operations, except that there opt-out made sense, as it was harder to have that niche library in your environment without knowing it.

Copy link
Contributor

@rossbar rossbar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks very much for including both the new implementation and the improvements to the existing implementation! Everything LGTM - I left a couple minor suggestions that occurred to me while reading through, but they are entirely subjective and can be ignored.

I approve adding these implementations - as it stands the PR violates the "the default should work without numpy" principle, but we can address the API point separately. IMO that discussion shouldn't block the addition of these nice improvements!

Comment on lines -417 to 423

@py_random_state(4)
@nx._dispatchable(graphs=None, returns_graph=True)
def random_k_out_graph(n, k, alpha, self_loops=True, seed=None):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the perspective of a backend developer, it's nice to have @py_random_state or @np_random_state before @nx._dispatchable so that the seed parameter is resolved and the same as used in networkx, which is convenient and can make it easier to match networkx results (but this depends on implementation of any particular function).

What does everyone think about adding e.g. @np_or_py_random_state(4) decorator to make it easier for this pattern to be used elsewhere?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be tricky to make a @np_or_py_random_state decorator do what we want with this. The choice of seed object has to be made after checking which code to run.

Perhaps we could make a decorator specific to this situation... where we know that we will run the numpy code if numpy is installed and the python code if numpy is not installed. Does the name @np_random_state_if_possible describe what we want? Is there a better way to do this? It seems like we are setting ourselves up for making a decorator for each logical choice to be made. (but maybe there aren't many possible variants).

Copy link
Member

@dschult dschult left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @TLouf !!
and @rossbar @eriknw for reviews

@dschult dschult merged commit ce8c378 into networkx:main Nov 11, 2024
@jarrodmillman jarrodmillman added this to the 3.5 milestone Nov 11, 2024
@TLouf TLouf deleted the k_out_perf branch November 14, 2024 10:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Development

Successfully merging this pull request may close these issues.

6 participants