[improve](ann index) Accumulate multiple small batches before training #57623
airborne12 merged 5 commits into apache:master
Conversation
Force-pushed from de23f84 to 88c3eca.
run buildall
ClickBench: Total hot run time: 27.76 s
BE UT Coverage Report: increment line coverage report
run buildall
ClickBench: Total hot run time: 29.09 s
BE UT Coverage Report: increment line coverage report
BE Regression && UT Coverage Report: increment line coverage report
```cpp
// VectorIndex should be weakly shared by AnnIndexWriter and VectorIndexReader.
// This should be a weak_ptr.
std::shared_ptr<VectorIndex> _vector_index;
std::vector<float> _ann_vec;
```
Review comment: replace std::vector with DorisVector for memory safety.
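To illustrate the suggested member change, here is a minimal sketch assuming DorisVector is a std::vector alias bound to a memory-tracking allocator; TrackingAllocator and AnnWriterBuffers are hypothetical stand-ins, not the PR's actual code:

```cpp
// Sketch only: the real DorisVector alias lives in the Doris codebase, where
// its allocator reports allocation sizes to a memory tracker.
#include <memory>
#include <vector>

template <typename T>
struct TrackingAllocator : std::allocator<T> {
    // A real version would hook allocate()/deallocate() to account memory.
};

template <typename T>
using DorisVector = std::vector<T, TrackingAllocator<T>>;

struct AnnWriterBuffers {          // illustrative holder, not the PR's class
    DorisVector<float> _ann_vec;   // was std::vector<float>; now tracked
};
```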
```cpp
if (i > 0) {
    vectorized::Int64 offset = i * dim;
    std::copy(_ann_vec.begin() + offset, _ann_vec.end(), _ann_vec.begin());
```
Review comment: the cost of this memory copy can be avoided by keeping pending batches in a std::list<std::shared_ptr<DorisVector>> instead of compacting a single flat buffer.
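A minimal sketch of that suggestion follows; ChunkAccumulator and its members are illustrative names, not the PR's actual code, and std::vector stands in for DorisVector:

```cpp
// Sketch only: instead of compacting one flat buffer by std::copy-ing the
// unconsumed tail to its front, keep each incoming batch as its own chunk
// and drop whole chunks once they have been handed to the index.
#include <cstddef>
#include <list>
#include <memory>
#include <vector>

using Chunk = std::vector<float>; // stand-in for DorisVector<float>

class ChunkAccumulator {
public:
    explicit ChunkAccumulator(std::size_t dim) : _dim(dim) {}

    void append(std::shared_ptr<Chunk> batch) {
        _total_rows += batch->size() / _dim;
        _chunks.push_back(std::move(batch)); // O(1): no element-wise copy
    }

    void consume_front() {
        // Dropping a consumed chunk is O(1); the flat-buffer version paid an
        // O(remaining) std::copy to shift data down to offset 0.
        _total_rows -= _chunks.front()->size() / _dim;
        _chunks.pop_front();
    }

    std::size_t total_rows() const { return _total_rows; }

private:
    std::size_t _dim;
    std::size_t _total_rows = 0;
    std::list<std::shared_ptr<Chunk>> _chunks;
};
```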
Force-pushed from d81e6ab to bc19e89.
run buildall
TPC-H: Total hot run time: 34427 ms
TPC-DS: Total hot run time: 187644 ms
ClickBench: Total hot run time: 27.73 s
BE UT Coverage Report: increment line coverage report
BE Regression && UT Coverage Report: increment line coverage report
```cpp
size_t block_size = CHUNK_SIZE * build_parameter.dim;
// The array capacity will not change after resizing
_float_array.resize(block_size);
```
Review comment: use reserve instead of resize; resize value-initializes block_size elements up front, while reserve only allocates capacity.
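This short demo shows the difference the reviewer is pointing at for an append-style buffer (demo() and its locals are illustrative only):

```cpp
// Sketch only: resize vs. reserve for a buffer that is filled by appending.
#include <cstddef>
#include <vector>

void demo(std::size_t block_size) {
    std::vector<float> a;
    a.resize(block_size);  // size == block_size; every element zero-initialized,
                           // and a later push_back would append *after* the zeros.

    std::vector<float> b;
    b.reserve(block_size); // size == 0, capacity >= block_size; appends fill the
                           // pre-allocated storage with no zero-fill pass.
}
```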
```cpp
size_t block_size = CHUNK_SIZE * build_parameter.dim;
// The array capacity will not change after resizing
_float_array.resize(block_size);
_array_offset = 0;
```
Review comment: _array_offset is not needed; with an append-style buffer, the vector's own size() already tracks the fill position.
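Putting both review points together, a sketch of the simplified pattern; FloatBlock and its members are illustrative, not the PR's class:

```cpp
// Sketch only: with reserve + insert, the vector's size() is the fill cursor,
// so a separate _array_offset member becomes redundant.
#include <cstddef>
#include <vector>

struct FloatBlock {
    std::vector<float> _float_array;

    void reset(std::size_t block_size) {
        _float_array.clear();
        _float_array.reserve(block_size); // fixed capacity up front, size stays 0
    }

    void append(const float* src, std::size_t n) {
        // size() advances automatically; no offset bookkeeping needed.
        _float_array.insert(_float_array.end(), src, src + n);
    }

    std::size_t rows(std::size_t dim) const { return _float_array.size() / dim; }
};
```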
PR approved by anyone and no changes requested.
PR approved by at least one committer and no changes requested.
run buildall
TPC-H: Total hot run time: 36144 ms
TPC-DS: Total hot run time: 187633 ms
ClickBench: Total hot run time: 27.84 s
BE UT Coverage Report: increment line coverage report
run cloud_p0
run external
BE Regression && UT Coverage Report: increment line coverage report
PR approved by at least one committer and no changes requested.
[improve](ann index) Accumulate multiple small batches before training (apache#57623)

Accumulate multiple small batches to avoid the following error when training: `Error: 'nx >= k' failed: Number of training points should be at least as large as number of clusters`, and to significantly reduce the time spent in faiss train/add.
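As a concrete illustration of the accumulate-then-train idea, here is a minimal self-contained sketch against the public faiss C++ API; BatchingAnnBuilder, its threshold handling, and all names are illustrative assumptions, not the PR's actual writer code:

```cpp
// Sketch only: accumulate rows across small write batches, then call faiss
// train/add once. This keeps the IVF training set at least as large as the
// number of clusters (avoiding the 'nx >= k' failure) and pays the train/add
// overhead once instead of per small batch.
#include <faiss/IndexFlat.h>
#include <faiss/IndexIVFFlat.h>
#include <cstddef>
#include <vector>

class BatchingAnnBuilder {
public:
    BatchingAnnBuilder(int dim, std::size_t nlist, std::size_t train_threshold)
            : _dim(dim),
              _quantizer(dim),
              _index(&_quantizer, dim, nlist),
              _threshold(train_threshold) {}

    void add_batch(const float* data, std::size_t rows) {
        _buf.insert(_buf.end(), data, data + rows * _dim);
        if (!_index.is_trained && _buf.size() / _dim >= _threshold) {
            flush(); // enough points accumulated: train once, then add
        }
    }

    // Note: if the final flush still holds fewer rows than nlist before
    // training, faiss raises the same 'nx >= k' error; a real writer must
    // handle that tail case.
    void flush() {
        std::size_t rows = _buf.size() / _dim;
        if (rows == 0) return;
        if (!_index.is_trained) _index.train(rows, _buf.data());
        _index.add(rows, _buf.data());
        _buf.clear();
    }

private:
    int _dim;
    faiss::IndexFlatL2 _quantizer;
    faiss::IndexIVFFlat _index;
    std::size_t _threshold;
    std::vector<float> _buf;
};
```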
Follow-up PR (apache#58645):

### What problem does this PR solve?

Previous PR: #57623. The granularity for index training and data ingestion is hard-coded at 1M rows, which makes index construction unnecessarily slow in some scenarios; it should be configurable and reduced when appropriate. For example, with 1M vectors to add and a stream load batch size of 0.3M, the load issues three stream load requests. If one request carrying 0.3M rows ends up with a single thread doing the add, the whole load becomes very slow. A typical CPU usage profile looks like this:

(screenshot: CPU usage with the hard-coded 1M batch) https://github.com/user-attachments/assets/65728e56-f333-4bd5-a54a-8c12d01668f1

Making the batch size configurable lets us tune it when needed. For example, with the batch size set to 30K, the average CPU usage is much higher:

(screenshot: CPU usage with a 30K batch) https://github.com/user-attachments/assets/7d664b0e-b017-4a2e-bed8-e40f56ff97b7

**The default value is still 1M; a small batch size damages the recall of the HNSW index.**
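To make that trade-off concrete, here is an illustrative shape for such a knob; the real setting lives in Doris BE configuration, and its actual name and declaration mechanism may differ (everything below is an assumption):

```cpp
// Sketch only: a hypothetical config knob for the ANN training batch size.
#include <cstddef>

namespace config {
// Hypothetical setting: rows accumulated before faiss train/add runs.
// Default stays at 1M; lowering it (e.g. to 30K) raises CPU utilization
// during load but can hurt HNSW recall, per the PR description.
inline std::size_t ann_index_train_batch_rows = 1'000'000;
} // namespace config
```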
What problem does this PR solve?

Accumulate multiple small batches to avoid the following error when training: `Error: 'nx >= k' failed: Number of training points should be at least as large as number of clusters`, and significantly reduce the time for faiss train/add.

Release note: None