perf: optimise `random_k_out_graph` #7702

TLouf · 2024-10-30T09:41:02Z

I found this generator to be particularly slow, so I dug into it and made several optimizations. The main bottleneck I found was the re-computation of out-degrees at every single step. Updating their values at every step already allowed for a ~20x performance improvement for nx.random_k_out_graph(1000, 100, 0.1). I then introduced numpy to optimize further to reach a ~60x improvement for this same set of parameters (from 2 minutes down to 2 seconds). There are more lines of code but not really more complexity, as most lines are for variable initialisations and updates.

dschult · 2024-10-30T15:53:20Z

Thanks very much for this!

The CI test errors are because we make sure that NetworkX still works even for environments which can't/won't install numpy. So we have a "base" test environment without numpy installed. What this means for the PR is that you need to move the import numpy as np inside the function (wrapped in a try/except for ImportError) and add this function to the list in networkx/conftest.py in the section indicating needs_numpy (about line 200).

For backward compatibility, it would be good to make this function work even if the import of numpy doesn't work. The code is only about 25 lines long, so maybe we should include both versions (with the native-only version including your improvement of updating the out-degree, but not the numpy parts). And the choice of which to use could be based on a flag set by whether the numpy import worked...???

There are others ways to implement this in a way that it could work in both settings. Some algorithms go so far as to separate the versions into two functions -- but I don't recall any that are generators. I think for now, could you put the other option in an if-clause like

try:
    import numpy as np
    no_numpy = False
except ImportError as exc:
    no_numpy = True

<setup>

if no_numpy:
    <other stuff>
    return G

<main stuff>

Do other folks have thoughts about how to set this up for numpy and not-numpy?

rossbar

Do other folks have thoughts about how to set this up for numpy and not-numpy?

This remains an open question IMO. I definitely agree that this should continue to work for users who don't have numpy installed, but I can't think of a way to provide interfaces to the user to choose (or automatically select) which function is used that doesn't have downsides.

Given the first constraint (that at least some version of this function must continue working even when numpy is not available), we're talking about maintaining two separate implementations. Call the (improved) Python-only impl def _random_k_out_graph_python and the one here def _random_k_out_graph_numpy. Then there are a couple of things that could be done:

Automatically select which version is used based on whether numpy is available, i.e. something like:

def random_k_out_graph(...):
    try:
        import numpy as np
        return _random_k_out_graph_numpy(...)
    except ImportError:
        return _random_k_out_graph_python(...)

The most obvious problem here is that users are likely to get different results from the same inputs (including seed!) based on their environment. Introducing an implicit¹ dependency on the environment seems like a terrible idea from the perspective of reproducibility/code transparency.

Provide a switch for users:

def random_k_out_graph(..., *, use="numpy"):
    if use == "numpy":
        return _random_k_out_graph_numpy(...)
    else:
        return _random_k_out_graph_python(...)

This avoids the problem from step 1 by being explicit, but introduces additional complexity/stuff to read in the docstring. And given this switch, it'd only be natural to wonder: are there other fnctions in networkx that behave like this? Do they all work this way (current answer: no). What if I want to use "pytorch"? Is this a backend? If not, why not? - how is this different from backends?

So - all that's to say: I don't really know what's best. Either way, that's a larger discussion and I don't want to let it block the nice improvements proposed here (thanks @TLouf !)

For this PR, I'd propose to actually include both the numpy implementation and the improvements to the Python implementation (perhaps as separate, private functions). Then we can hook both into the test suite and decouple from the API discussion.

I say implicit here because numpy is very commonly installed even when the user doesn't explicitly do pip install numpy as it's a dependency for so many other things. This is opposed to backends, where the installation is explicit. ↩

dschult · 2024-10-30T20:27:02Z

I agree with the path forward that @rossbar puts forward.

@TLouf can you create two functions (named something like _random_k_out_graph_python and _random_k_out_graph_numpy) without doc_strings or dispatchable decorator -- just the *_random_state decorators as needed. Leave the original function (with doc_string and dispatchable decorator) and have it call the numpy version for now... We'll need to figure out whether we want a keyword arg or an environment choice to be made.

I'll put together a quick change to tests to parametrize the current tests to run twice -- once with each private helper function, and I'll add a deprecation process. Unless of course you want to have fun with those implementations.

If this is more than you are up for, let us know. If you are up for adding the "needs_numpy" entry in conftest.py, adding deprecation warnings and running tests on both private functions, go ahead and take it on. Do as much as you want and tell us what you'd prefer we do.

Thanks again!

TLouf · 2024-10-30T21:26:43Z

Thanks for the positive feedback :)

I can write the two functions and parametrize the tests, no problem. I'll let you @dschult add the deprecation process, you probably know best how to write the message.

To add my two cents on the API discussion: I agree with @rossbar that adding an implicit dependency on the environment doesn't sound great. Having an explicit switch as a function kwarg and/or config flag would make a lot of sense to me. I'd argue it'd be better as a boolean (like use_numpy) that should be False by default, rather than a kind of ""backend"" selection, though. So to have it opt-in, rather than opt-out, to avoid reproducibility issues for unaware users. This reminds me a bit of how GeoPandas introduced PyGEOS, a faster backend for spatial operations, except that there opt-out made sense, as it was harder to have that niche library in your environment without knowing it.

rossbar

Thanks very much for including both the new implementation and the improvements to the existing implementation! Everything LGTM - I left a couple minor suggestions that occurred to me while reading through, but they are entirely subjective and can be ignored.

I approve adding these implementations - as it stands the PR violates the "the default should work without numpy" principle, but we can address the API point separately. IMO that discussion shouldn't block the addition of these nice improvements!

networkx/generators/tests/test_directed.py

networkx/generators/directed.py

Co-authored-by: Ross Barnowski <rossbar@caltech.edu>

eriknw · 2024-11-07T17:50:53Z

networkx/generators/directed.py


-@py_random_state(4)
 @nx._dispatchable(graphs=None, returns_graph=True)
 def random_k_out_graph(n, k, alpha, self_loops=True, seed=None):


From the perspective of a backend developer, it's nice to have @py_random_state or @np_random_state before @nx._dispatchable so that the seed parameter is resolved and the same as used in networkx, which is convenient and can make it easier to match networkx results (but this depends on implementation of any particular function).

What does everyone think about adding e.g. @np_or_py_random_state(4) decorator to make it easier for this pattern to be used elsewhere?

It might be tricky to make a @np_or_py_random_state decorator do what we want with this. The choice of seed object has to be made after checking which code to run.

Perhaps we could make a decorator specific to this situation... where we know that we will run the numpy code if numpy is installed and the python code if numpy is not installed. Does the name @np_random_state_if_possible describe what we want? Is there a better way to do this? It seems like we are setting ourselves up for making a decorator for each logical choice to be made. (but maybe there aren't many possible variants).

dschult

Thanks @TLouf !!
and @rossbar @eriknw for reviews

perf: use numpy in random_k_out_graph

05ffe18

TLouf changed the title ~~perf: use numpy in random_k_out_graph~~ perf: optimise random_k_out_graph Oct 30, 2024

dschult added the type: Enhancements label Oct 30, 2024

rossbar reviewed Oct 30, 2024

View reviewed changes

TLouf added 2 commits November 1, 2024 11:40

Add more optimal pure Python implementation

1bb97c6

Parametrize tests to run for both implementations

fe65cac

ruidazeng approved these changes Nov 5, 2024

View reviewed changes

rossbar approved these changes Nov 7, 2024

View reviewed changes

networkx/generators/tests/test_directed.py Outdated Show resolved Hide resolved

networkx/generators/directed.py Outdated Show resolved Hide resolved

TLouf and others added 3 commits November 7, 2024 09:31

Simplify the self-loop conditional branches by merging them

a18c7d8

Co-authored-by: Ross Barnowski <rossbar@caltech.edu>

parametrize test on function directly

380004e

fmt

244941f

eriknw reviewed Nov 7, 2024

View reviewed changes

dschult mentioned this pull request Nov 11, 2024

Clean up after random_k_out_graph speed-up #7716

Closed

3 tasks

dschult approved these changes Nov 11, 2024

View reviewed changes

dschult merged commit ce8c378 into networkx:main Nov 11, 2024

jarrodmillman added this to the 3.5 milestone Nov 11, 2024

TLouf deleted the k_out_perf branch November 14, 2024 10:25

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

perf: optimise `random_k_out_graph` #7702

perf: optimise `random_k_out_graph` #7702

TLouf commented Oct 30, 2024

Uh oh!

dschult commented Oct 30, 2024

Uh oh!

rossbar left a comment

Uh oh!

dschult commented Oct 30, 2024

Uh oh!

TLouf commented Oct 30, 2024 •

edited

Loading

Uh oh!

rossbar left a comment

Uh oh!

Uh oh!

Uh oh!

eriknw Nov 7, 2024

Uh oh!

dschult Nov 7, 2024

Uh oh!

dschult left a comment

Uh oh!

Reviewers

Assignees

Labels

Milestone

Development

Uh oh!

6 participants

Uh oh!

perf: optimise random_k_out_graph #7702

perf: optimise random_k_out_graph #7702

Conversation

TLouf commented Oct 30, 2024

Uh oh!

dschult commented Oct 30, 2024

Uh oh!

rossbar left a comment

Choose a reason for hiding this comment

Footnotes

Uh oh!

dschult commented Oct 30, 2024

Uh oh!

TLouf commented Oct 30, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

rossbar left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

eriknw Nov 7, 2024

Choose a reason for hiding this comment

Uh oh!

dschult Nov 7, 2024

Choose a reason for hiding this comment

Uh oh!

dschult left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Milestone

Development

Uh oh!

6 participants

perf: optimise `random_k_out_graph` #7702

perf: optimise `random_k_out_graph` #7702

TLouf commented Oct 30, 2024 •

edited

Loading