[opt](multi-catalog) Optimize file split size. #58858

Merged: morningman merged 1 commit into apache:master, Jan 5, 2026
Conversation

**Contributor**: Thank you for your contribution to Apache Doris. Please clearly describe your PR.

**Contributor (Author)**: run buildall

Force-pushed 794ef06 to 2a5c1e0.

**Contributor (Author)**: run buildall

Force-pushed 2a5c1e0 to 467baba.

**Contributor (Author)**: run buildall

- TPC-H: Total hot run time: 36237 ms
- TPC-DS: Total hot run time: 181948 ms
- ClickBench: Total hot run time: 27.27 s

Force-pushed 467baba to 97cbf6a.

**Contributor (Author)**: run buildall

- TPC-H: Total hot run time: 36184 ms
- TPC-DS: Total hot run time: 180842 ms
- ClickBench: Total hot run time: 27.16 s

**Contributor**: FE Regression Coverage Report: Increment line coverage

Force-pushed 97cbf6a to 64bfbb1.

**Contributor (Author)**: run buildall

Force-pushed 64bfbb1 to 828927e.

**Contributor (Author)**: run buildall

(1 similar comment)

**Contributor (Author)**: run buildall

- TPC-H: Total hot run time: 35915 ms
- TPC-DS: Total hot run time: 181296 ms
- ClickBench: Total hot run time: 27.14 s

Force-pushed be00184 to 01fe74b.
Force-pushed 01fe74b to 32b5147.

**Contributor (Author)**: run buildall

- TPC-H: Total hot run time: 36706 ms
- TPC-DS: Total hot run time: 181276 ms
- ClickBench: Total hot run time: 27.79 s

**Contributor**: FE UT Coverage Report: Increment line coverage

Force-pushed 20e2b8a to da983ea.

**Contributor (Author)**: run buildall

- TPC-H: Total hot run time: 36478 ms
- TPC-DS: Total hot run time: 179069 ms
- ClickBench: Total hot run time: 27.18 s

**Contributor (Author)**: run feut

**Contributor**: FE UT Coverage Report: Increment line coverage

**Contributor**: PR approved by at least one committer and no changes requested.

**Contributor**: PR approved by anyone and no changes requested.
suxiaogang223 approved these changes, Jan 5, 2026.
hubgeter approved these changes, Jan 5, 2026.
zzzxl1993 pushed a commit to zzzxl1993/doris that referenced this pull request, Jan 13, 2026:
### What problem does this PR solve?

### Release note

This PR introduces a **dynamic and progressive file split size adjustment mechanism** to improve scan parallelism and resource utilization for external table scans, while avoiding excessive small splits or inefficiently large initial splits.

#### 1. Split Size Adjustment Strategy

##### 1.1 Non-Batch Split Mode

In non-batch split mode, a **two-phase split size selection strategy** is applied based on the total size of all input files:

* The total size of all splits is calculated in advance.
* If the total size **exceeds `maxInitialSplitNum * maxInitialSplitSize`**:
  * `split_size = maxSplitSize` (default **64MB**)
* Otherwise:
  * `split_size = maxInitialSplitSize` (default **32MB**)

This strategy reduces the number of splits for small datasets while improving parallelism for large-scale scans.

##### 1.2 Batch Split Mode

In batch split mode, a **progressive split size adjustment strategy** is introduced:

* As the total file size increases,
* and the number of files gradually **exceeds `maxInitialSplitNum`**,
* the `split_size` is **smoothly increased from `maxInitialSplitSize` (32MB) toward `maxSplitSize` (64MB)**.

This approach avoids generating too many small splits at the early stage while gradually increasing scan parallelism as the workload grows, resulting in more stable scheduling and execution behavior.

##### 1.3 User-Specified Split Size (Backward Compatibility)

This PR **preserves the session variable `file_split_size`** for user-defined split size configuration:

* If `file_split_size` is explicitly set by the user:
  * The user-defined value takes precedence.
  * The dynamic split size adjustment logic is bypassed.
* This ensures full backward compatibility with existing configurations and tuning practices.

#### 2. Support Status by Data Source

| Data Source | Non-Batch Split Mode | Batch Split Mode | Notes |
| ----------- | -------------------- | ---------------- | ----- |
| Hive | ✅ Supported | ✅ Supported | Uses Doris internal HDFS FileSplitter |
| Iceberg | ✅ Supported | ❌ Not supported | File splitting is currently delegated to Iceberg APIs |
| Paimon | ✅ Supported | ❌ Not supported | Only non-batch split mode is implemented |

#### 3. New Hive HDFS FileSplitter Logic

For Hive HDFS files, this PR introduces an enhanced file splitting strategy:

1. **Splits never span multiple HDFS blocks**
   * Prevents cross-block reads and avoids unnecessary IO overhead.
2. **Tail split optimization**
   * If the remaining file size is smaller than `split_size * 2`,
   * the remaining part is **evenly divided** into splits,
   * preventing the creation of very small tail splits and improving overall scan efficiency.

#### Summary

* Introduces dynamic and progressive split size adjustment
* Supports both batch and non-batch split modes
* Preserves user-defined split size configuration for backward compatibility
* Optimizes Hive HDFS file splitting to reduce small tail splits and cross-block IO
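The non-batch size selection and the tail-split rule described above can be sketched as follows. This is an illustrative sketch, not the actual Doris FE code: the class name `SplitSizeSketch`, the helper method signatures, and the `MAX_INITIAL_SPLIT_NUM` value of 128 are assumptions; only the 32MB/64MB defaults and the `split_size * 2` tail threshold come from the PR description.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitSizeSketch {
    static final long MAX_INITIAL_SPLIT_SIZE = 32L << 20; // 32MB default (from PR)
    static final long MAX_SPLIT_SIZE = 64L << 20;         // 64MB default (from PR)
    static final int MAX_INITIAL_SPLIT_NUM = 128;         // assumed value for illustration

    // Non-batch mode: pick the split size from the pre-computed total input size.
    static long chooseSplitSize(long totalFileSize) {
        if (totalFileSize > (long) MAX_INITIAL_SPLIT_NUM * MAX_INITIAL_SPLIT_SIZE) {
            return MAX_SPLIT_SIZE;      // large dataset: fewer, bigger splits
        }
        return MAX_INITIAL_SPLIT_SIZE;  // small dataset: keep initial splits small
    }

    // Tail optimization: once the remainder drops below 2 * splitSize,
    // divide it evenly instead of leaving one very small tail split.
    static List<Long> splitFile(long fileSize, long splitSize) {
        List<Long> lengths = new ArrayList<>();
        long remaining = fileSize;
        while (remaining > 2 * splitSize) {
            lengths.add(splitSize);
            remaining -= splitSize;
        }
        if (remaining > 0) {
            long first = (remaining + 1) / 2;   // split the tail roughly in half
            lengths.add(first);
            if (remaining - first > 0) {
                lengths.add(remaining - first);
            }
        }
        return lengths;
    }
}
```

For example, a 100MB file with a 32MB split size yields two 32MB splits plus two 18MB splits, instead of three 32MB splits and a 4MB fragment. (Block-boundary alignment is omitted here; the real splitter additionally never crosses HDFS block boundaries.)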
morningman pushed a commit that referenced this pull request, Jan 29, 2026:
…revent OOM in file scan (#58759)

### What problem does this PR solve?

- Related PR: #58858

## Problem Summary

When querying external table catalogs (Hive, Iceberg, Paimon, etc.), Doris splits files into multiple splits for parallel processing. In some cases, especially with numerous small files, this can generate an excessive number of splits, potentially causing:

1. **Memory pressure**: Too many splits consume significant memory in FE
2. **OOM issues**: Excessive split generation can lead to OutOfMemoryError
3. **Performance degradation**: Managing too many splits increases query planning overhead

Previously, there was no upper limit on the number of splits in non-batch mode, which could lead to problems when querying tables with many small files.

## Solution

This PR introduces a new session variable `max_file_split_num` to limit the maximum number of splits allowed per table scan in non-batch mode.

### Changes

1. **New Session Variable**: `max_file_split_num`
   - Type: `int`
   - Default: `100000`
   - Description: "The maximum number of splits allowed per table scan in non-batch mode, to prevent OOM caused by generating too many splits."
   - Forward to BE: `true`
2. **Implementation in FileQueryScanNode**:
   - Added method `applyMaxFileSplitNumLimit(long targetSplitSize, long totalFileSize)`
   - Dynamically raises the minimum split size so that the split count cannot exceed the limit
   - Formula: `minSplitSizeForMaxNum = (totalFileSize + maxFileSplitNum - 1) / maxFileSplitNum`
   - Returns: `Math.max(targetSplitSize, minSplitSizeForMaxNum)`
3. **Applied to multiple scan nodes**:
   - `HiveScanNode`
   - `IcebergScanNode`
   - `PaimonScanNode`
   - `TVFScanNode`
4. **Unit Tests**:
   - `FileQueryScanNodeTest`: Tests the base logic
   - `HiveScanNodeTest`, `IcebergScanNodeTest`, `PaimonScanNodeTest`, `TVFScanNodeTest`: Test the source-specific implementations

## Usage

Users can control the maximum number of splits per table scan by setting the session variable:

```sql
-- Set a maximum of 50000 splits
SET max_file_split_num = 50000;

-- Disable the limit (set to 0 or negative)
SET max_file_split_num = 0;
```
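The cap described above works by raising the split size, not by dropping splits: a ceiling division computes the smallest split size that keeps the count under the limit. A minimal sketch, using the method name and formula quoted in the PR description; the enclosing class and the explicit `maxFileSplitNum` parameter are illustrative (in Doris the value would come from the session variable):

```java
public class SplitNumLimitSketch {
    // Raise the split size so that ceil(totalFileSize / result) <= maxFileSplitNum.
    static long applyMaxFileSplitNumLimit(long targetSplitSize, long totalFileSize,
                                          int maxFileSplitNum) {
        if (maxFileSplitNum <= 0) {
            return targetSplitSize; // limit disabled (session variable set to 0 or negative)
        }
        // Ceiling division: smallest split size whose split count stays within the cap.
        long minSplitSizeForMaxNum = (totalFileSize + maxFileSplitNum - 1) / maxFileSplitNum;
        return Math.max(targetSplitSize, minSplitSizeForMaxNum);
    }
}
```

With the default cap of 100000, a scan over 10TB of data would thus never use splits smaller than roughly 105MB, regardless of the dynamically chosen target size.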
kaka11chen added a commit to kaka11chen/doris that referenced this pull request, Feb 10, 2026.
yiguolei pushed a commit that referenced this pull request, Feb 12, 2026:
### What problem does this PR solve?

Problem Summary:

### Release note

Cherry-pick #58858

### Check List (For Author)

- Test
  - [ ] Regression test
  - [ ] Unit Test
  - [ ] Manual test (add detailed scripts or steps below)
  - [ ] No need to test or manual test. Explain why:
    - [ ] This is a refactor/code format and no logic has been changed.
    - [ ] Previous test can cover this change.
    - [ ] No code files have been changed.
    - [ ] Other reason
- Behavior changed:
  - [ ] No.
  - [ ] Yes.
- Does this need documentation?
  - [ ] No.
  - [ ] Yes.

### Check List (For Reviewer who merge this PR)

- [ ] Confirm the release note
- [ ] Confirm test cases
- [ ] Confirm document
- [ ] Add branch pick label
suxiaogang223 added a commit to suxiaogang223/doris that referenced this pull request, Feb 13, 2026:
…revent OOM in file scan (apache#58759) (cherry picked from commit 3e5a70f)