[improve](ann index) Accumulate multiple small batches before training #57623
airborne12 merged 5 commits into apache:master
Conversation
Force-pushed from de23f84 to 88c3eca.
run buildall
ClickBench: Total hot run time: 27.76 s
BE UT Coverage Report: increment line coverage report
run buildall
ClickBench: Total hot run time: 29.09 s
BE UT Coverage Report: increment line coverage report
BE Regression && UT Coverage Report: increment line coverage report
```cpp
// VectorIndex should be weakly shared by AnnIndexWriter and VectorIndexReader.
// This should be a weak_ptr.
std::shared_ptr<VectorIndex> _vector_index;
std::vector<float> _ann_vec;
```
Review comment: replace std::vector with DorisVector for memory safety.
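To illustrate the suggested member change, here is a minimal sketch assuming DorisVector is a std::vector alias bound to a memory-tracking allocator; TrackingAllocator and AnnWriterBuffers are hypothetical stand-ins, not the PR's actual code:

```cpp
// Sketch only: the real DorisVector alias lives in the Doris codebase, where
// its allocator reports allocation sizes to a memory tracker.
#include <memory>
#include <vector>

template <typename T>
struct TrackingAllocator : std::allocator<T> {
    // A real version would hook allocate()/deallocate() to account memory.
};

template <typename T>
using DorisVector = std::vector<T, TrackingAllocator<T>>;

struct AnnWriterBuffers {          // illustrative holder, not the PR's class
    DorisVector<float> _ann_vec;   // was std::vector<float>; now tracked
};
```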
```cpp
if (i > 0) {
    vectorized::Int64 offset = i * dim;
    std::copy(_ann_vec.begin() + offset, _ann_vec.end(), _ann_vec.begin());
```
Review comment: the cost of this memory copy can be avoided by keeping pending batches in a std::list<std::shared_ptr<DorisVector>> instead of compacting a single flat buffer.
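A minimal sketch of that suggestion follows; ChunkAccumulator and its members are illustrative names, not the PR's actual code, and std::vector stands in for DorisVector:

```cpp
// Sketch only: instead of compacting one flat buffer by std::copy-ing the
// unconsumed tail to its front, keep each incoming batch as its own chunk
// and drop whole chunks once they have been handed to the index.
#include <cstddef>
#include <list>
#include <memory>
#include <vector>

using Chunk = std::vector<float>; // stand-in for DorisVector<float>

class ChunkAccumulator {
public:
    explicit ChunkAccumulator(std::size_t dim) : _dim(dim) {}

    void append(std::shared_ptr<Chunk> batch) {
        _total_rows += batch->size() / _dim;
        _chunks.push_back(std::move(batch)); // O(1): no element-wise copy
    }

    void consume_front() {
        // Dropping a consumed chunk is O(1); the flat-buffer version paid an
        // O(remaining) std::copy to shift data down to offset 0.
        _total_rows -= _chunks.front()->size() / _dim;
        _chunks.pop_front();
    }

    std::size_t total_rows() const { return _total_rows; }

private:
    std::size_t _dim;
    std::size_t _total_rows = 0;
    std::list<std::shared_ptr<Chunk>> _chunks;
};
```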
Force-pushed from d81e6ab to bc19e89.
run buildall
TPC-H: Total hot run time: 34427 ms
TPC-DS: Total hot run time: 187644 ms
ClickBench: Total hot run time: 27.73 s
BE UT Coverage Report: increment line coverage report
BE Regression && UT Coverage Report: increment line coverage report
```cpp
size_t block_size = CHUNK_SIZE * build_parameter.dim;
// The array capacity will not change after resizing
_float_array.resize(block_size);
```
Review comment: use reserve instead of resize; resize value-initializes block_size elements up front, while reserve only allocates capacity.
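This short demo shows the difference the reviewer is pointing at for an append-style buffer (demo() and its locals are illustrative only):

```cpp
// Sketch only: resize vs. reserve for a buffer that is filled by appending.
#include <cstddef>
#include <vector>

void demo(std::size_t block_size) {
    std::vector<float> a;
    a.resize(block_size);  // size == block_size; every element zero-initialized,
                           // and a later push_back would append *after* the zeros.

    std::vector<float> b;
    b.reserve(block_size); // size == 0, capacity >= block_size; appends fill the
                           // pre-allocated storage with no zero-fill pass.
}
```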
```cpp
size_t block_size = CHUNK_SIZE * build_parameter.dim;
// The array capacity will not change after resizing
_float_array.resize(block_size);
_array_offset = 0;
```
Review comment: _array_offset is not needed; with an append-style buffer, the vector's own size() already tracks the fill position.
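Putting both review points together, a sketch of the simplified pattern; FloatBlock and its members are illustrative, not the PR's class:

```cpp
// Sketch only: with reserve + insert, the vector's size() is the fill cursor,
// so a separate _array_offset member becomes redundant.
#include <cstddef>
#include <vector>

struct FloatBlock {
    std::vector<float> _float_array;

    void reset(std::size_t block_size) {
        _float_array.clear();
        _float_array.reserve(block_size); // fixed capacity up front, size stays 0
    }

    void append(const float* src, std::size_t n) {
        // size() advances automatically; no offset bookkeeping needed.
        _float_array.insert(_float_array.end(), src, src + n);
    }

    std::size_t rows(std::size_t dim) const { return _float_array.size() / dim; }
};
```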
PR approved by anyone and no changes requested.
PR approved by at least one committer and no changes requested.
run buildall
TPC-H: Total hot run time: 36144 ms
TPC-DS: Total hot run time: 187633 ms
ClickBench: Total hot run time: 27.84 s
BE UT Coverage Report: increment line coverage report
run cloud_p0
run external
BE Regression && UT Coverage Report: increment line coverage report
PR approved by at least one committer and no changes requested.
[improve](ann index) Accumulate multiple small batches before training (apache#57623)

Accumulate multiple small batches to avoid the following error when training: `Error: 'nx >= k' failed: Number of training points should be at least as large as number of clusters`, and to significantly reduce the time spent in faiss train/add.
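As a concrete illustration of the accumulate-then-train idea, here is a minimal self-contained sketch against the public faiss C++ API; BatchingAnnBuilder, its threshold handling, and all names are illustrative assumptions, not the PR's actual writer code:

```cpp
// Sketch only: accumulate rows across small write batches, then call faiss
// train/add once. This keeps the IVF training set at least as large as the
// number of clusters (avoiding the 'nx >= k' failure) and pays the train/add
// overhead once instead of per small batch.
#include <faiss/IndexFlat.h>
#include <faiss/IndexIVFFlat.h>
#include <cstddef>
#include <vector>

class BatchingAnnBuilder {
public:
    BatchingAnnBuilder(int dim, std::size_t nlist, std::size_t train_threshold)
            : _dim(dim),
              _quantizer(dim),
              _index(&_quantizer, dim, nlist),
              _threshold(train_threshold) {}

    void add_batch(const float* data, std::size_t rows) {
        _buf.insert(_buf.end(), data, data + rows * _dim);
        if (!_index.is_trained && _buf.size() / _dim >= _threshold) {
            flush(); // enough points accumulated: train once, then add
        }
    }

    // Note: if the final flush still holds fewer rows than nlist before
    // training, faiss raises the same 'nx >= k' error; a real writer must
    // handle that tail case.
    void flush() {
        std::size_t rows = _buf.size() / _dim;
        if (rows == 0) return;
        if (!_index.is_trained) _index.train(rows, _buf.data());
        _index.add(rows, _buf.data());
        _buf.clear();
    }

private:
    int _dim;
    faiss::IndexFlatL2 _quantizer;
    faiss::IndexIVFFlat _index;
    std::size_t _threshold;
    std::vector<float> _buf;
};
```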
Follow-up PR (apache#58645):

### What problem does this PR solve?

Previous PR: #57623. The granularity for index training and data ingestion is hard-coded at 1M rows, which makes index construction unnecessarily slow in some scenarios; it should be configurable and reduced when appropriate. For example, with 1M vectors to add and a stream load batch size of 0.3M, the load issues three stream load requests. If one request carrying 0.3M rows ends up with a single thread doing the add, the whole load becomes very slow. A typical CPU usage profile looks like this:

(screenshot: CPU usage with the hard-coded 1M batch) https://github.com/user-attachments/assets/65728e56-f333-4bd5-a54a-8c12d01668f1

Making the batch size configurable lets us tune it when needed. For example, with the batch size set to 30K, the average CPU usage is much higher:

(screenshot: CPU usage with a 30K batch) https://github.com/user-attachments/assets/7d664b0e-b017-4a2e-bed8-e40f56ff97b7

**The default value is still 1M; a small batch size damages the recall of the HNSW index.**
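To make that trade-off concrete, here is an illustrative shape for such a knob; the real setting lives in Doris BE configuration, and its actual name and declaration mechanism may differ (everything below is an assumption):

```cpp
// Sketch only: a hypothetical config knob for the ANN training batch size.
#include <cstddef>

namespace config {
// Hypothetical setting: rows accumulated before faiss train/add runs.
// Default stays at 1M; lowering it (e.g. to 30K) raises CPU utilization
// during load but can hurt HNSW recall, per the PR description.
inline std::size_t ann_index_train_batch_rows = 1'000'000;
} // namespace config
```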
What problem does this PR solve?

Accumulate multiple small batches to avoid the following error when training: `Error: 'nx >= k' failed: Number of training points should be at least as large as number of clusters`, and significantly reduce the time for faiss train/add.

Release note: None