Skip to content

Conversation

@dcherian
Copy link
Contributor

@dcherian dcherian commented Oct 18, 2025

OK we were incredibly wasteful earlier!

| Change   | Before [b5e4b0e0] <main>   | After [c9432cfc] <min-concat-mem>   |   Ratio | Benchmark (Parameter)           |
|----------|----------------------------|-------------------------------------|---------|---------------------------------|
| -        | 4.82G                      | 920M                                |    0.19 | combine.Concat1d.peakmem_concat |
| -        | 574±20ms                   | 54.0±0.6ms                          |    0.09 | combine.Concat1d.time_concat    |

cc @mjwillson

Would be good to add a benchmark for the reindexing case at some point

Closes pydata#10864

```
| Change   | Before [b5e4b0e] <main>   | After [c9432cfc] <min-concat-mem>   |   Ratio | Benchmark (Parameter)           |
|----------|----------------------------|-------------------------------------|---------|---------------------------------|
| -        | 4.82G                      | 920M                                |    0.19 | combine.Concat1d.peakmem_concat |
| -        | 574±20ms                   | 54.0±0.6ms                          |    0.09 | combine.Concat1d.time_concat    |
```
@github-actions github-actions bot added CI Continuous Integration tools dependencies Pull requests that update a dependency file labels Oct 18, 2025
@dcherian
Copy link
Contributor Author

with reduced sizes for CI:

| Change   | Before [b5e4b0e0] <main>   | After [beb45036] <min-concat-mem~1>   |   Ratio | Benchmark (Parameter)           |
|----------|----------------------------|---------------------------------------|---------|---------------------------------|
| -        | 935M                       | 259M                                  |    0.28 | combine.Concat1d.peakmem_concat |
| -        | 91.5±1ms                   | 6.35±0.4ms                            |    0.07 | combine.Concat1d.time_concat    |

Copy link
Contributor

@kmuehlbauer kmuehlbauer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dcherian This is already very far away from my initial mediocre solution. Thanks, this will have extreme impact on our workflows. 🥇

file_start_indexes = np.append(0, np.cumsum(concat_dim_lengths))
concat_index = np.arange(file_start_indexes[-1])
concat_index_size = concat_index.size
concat_index_size = np.sum(concat_dim_lengths)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might squeeze a bit more, if we combine the calculation of the sum with the above np.cumsum.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might even think about adding the np.cumsum - trick you did further below and pre-allocate file_start_indexes. Not sure how much that gives, though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

file_start_indexes is only ever allocated once, so doesn't seem worth it. We can use np.cumulative_sum(concat_dim_lengths, include_initial=True) once we require numpy>=2 I believe

* main:
  Update docs to reflect open_mfdataset default chunk behaviour (pydata#10567)
@dcherian dcherian added the plan to merge Final call for comments label Oct 21, 2025
@dcherian
Copy link
Contributor Author

This is already very far away from my initial mediocre solution.

Kai, your solution solved a ~10year old bug IIRC! I should've spotted this at review. I think I assumed it scaled with number of files, ~O(10_000), instead of dimension size O(10_000_000).

@kmuehlbauer
Copy link
Contributor

I think I assumed it scaled with number of files, ~O(10_000), instead of dimension size O(10_000_000).

At least, someone complained about it now 😀

@kmuehlbauer kmuehlbauer enabled auto-merge (squash) October 22, 2025 06:03
@kmuehlbauer kmuehlbauer merged commit 19f2973 into pydata:main Oct 22, 2025
35 of 36 checks passed
@kmuehlbauer
Copy link
Contributor

Thanks again @dcherian! Concatenators and combiners will have some spare time for doing more science now. 🎉

@dcherian dcherian deleted the min-concat-mem branch October 23, 2025 20:46
@dcherian dcherian mentioned this pull request Nov 14, 2025
19 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CI Continuous Integration tools dependencies Pull requests that update a dependency file plan to merge Final call for comments topic-performance

Projects

None yet

Development

Successfully merging this pull request may close these issues.

xr.concat has over 3x the peak memory usage and 5x slower than np.concatenate, even with large chunk sizes

2 participants