branch-4.0: [opt](ann index) Make chunk size of index train configurable #58645 #58727

Merged
yiguolei merged 1 commit into branch-4.0 from auto-pick-58645-branch-4.0
Dec 5, 2025

@github-actions github-actions bot commented Dec 4, 2025

Cherry-picked from #58645

### What problem does this PR solve?
Previous PR: #57623

The current granularity for index training and data ingestion is set to
1M and is hard-coded, which makes index construction unnecessarily slow
in some scenarios. This should be made configurable and reduced when
appropriate.

For example, suppose there are 1M vectors to add and the stream load
batch size is set to 0.3M, which means about 3 stream load requests. If
a request carrying 0.3M vectors ends up with only 1 thread for adding,
the whole load process will be very slow. Typical CPU usage looks like
this:
<img width="1902" height="552" alt="image"
src="https://github.com/user-attachments/assets/65728e56-f333-4bd5-a54a-8c12d01668f1"
/>

We need to make the batch size configurable so that it can be adjusted
when needed.

For example, with the batch size set to 30K, average CPU usage is
noticeably higher, like this:
<img width="1890" height="554" alt="image"
src="https://github.com/user-attachments/assets/7d664b0e-b017-4a2e-bed8-e40f56ff97b7"
/>

**The default value is still 1M; a small batch size will hurt the
recall of the HNSW index.**
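
The arithmetic behind the example above can be sketched roughly as follows. This is not Doris code: `split_into_chunks` is a hypothetical helper, and the sketch assumes that each chunk is one unit of work a thread can pick up, which the description suggests but does not spell out.

```python
def split_into_chunks(num_rows: int, chunk_size: int) -> list[int]:
    """Return the sizes of the chunks that num_rows is split into.

    Hypothetical helper for illustration only; not part of Doris.
    """
    if num_rows < 0 or chunk_size <= 0:
        raise ValueError("num_rows must be >= 0 and chunk_size must be > 0")
    full_chunks, remainder = divmod(num_rows, chunk_size)
    # Full-sized chunks first, then one smaller trailing chunk if rows are left.
    return [chunk_size] * full_chunks + ([remainder] if remainder else [])


rows_per_request = 300_000  # the 0.3M stream load request from the example

for chunk_size in (1_000_000, 30_000):  # hard-coded default vs. 30K
    chunks = split_into_chunks(rows_per_request, chunk_size)
    print(f"chunk_size={chunk_size}: {len(chunks)} chunk(s)")
```

Under that assumption, a 0.3M request forms a single chunk at the 1M default but ten chunks at 30K, which is consistent with the higher CPU usage in the second screenshot.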
@github-actions github-actions bot requested a review from yiguolei as a code owner December 4, 2025 12:47

Thearas commented Dec 4, 2025

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@dataroaring dataroaring closed this Dec 4, 2025
@dataroaring dataroaring reopened this Dec 4, 2025

Thearas commented Dec 4, 2025

run buildall

@yiguolei yiguolei merged commit 56b02a5 into branch-4.0 Dec 5, 2025
24 of 26 checks passed
@github-actions github-actions bot deleted the auto-pick-58645-branch-4.0 branch December 5, 2025 03:50