branch-3.0: [fix](csv reader) fix csv parse error when use enclose with multi-char column separator (#54581) #55052

Merged
dataroaring merged 1 commit into apache:branch-3.0 from sollhui:pick_d0f3af0 on Aug 22, 2025

@sollhui sollhui commented Aug 20, 2025

pick #54581

The variable idx marks the position up to which the buffer has been parsed.

If the buffer does not yet contain a complete row, as shown in the following figure, idx advances to the length of the buffer and the buffer is then expanded. If a multi-character column separator happens to straddle the end of the buffer, so that only some of its characters have been read, then when reading resumes after the expansion the complete column separator can no longer be matched, which results in a parse error.
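
The failure mode described above can be reproduced in miniature. The following is only an illustrative sketch, not the Doris reader or the actual patch: `split_row`, its `rewind` flag, and the two-character separator `||` are hypothetical names and values chosen for the example, and enclose handling is left out entirely. It assumes that a scan position that has reached the end of the buffer resumes from that same position after the buffer grows.

```python
def split_row(chunks, sep, rewind):
    """Split one newline-terminated row that arrives in several chunks.

    idx is the position up to which the buffer has been scanned.
    With rewind=False, scanning resumes exactly where it stopped, so a
    separator cut in two by the old buffer end is never seen whole.
    With rewind=True (a hypothetical remedy), scanning resumes
    len(sep) - 1 characters earlier, but never before the current field.
    """
    buf = ""
    idx = 0          # scan position in buf
    field_start = 0  # start of the field currently being read
    fields = []
    for chunk in chunks:
        buf += chunk  # "expand" the buffer with newly read data
        if rewind:
            idx = max(field_start, idx - (len(sep) - 1))
        while idx < len(buf):
            if buf.startswith(sep, idx):
                fields.append(buf[field_start:idx])
                idx += len(sep)
                field_start = idx
            elif buf[idx] == "\n":
                fields.append(buf[field_start:idx])
                return fields
            else:
                idx += 1
    return None  # no complete row was read


# The separator "||" is split across two reads: "a|" then "|b\n".
print(split_row(["a|", "|b\n"], "||", rewind=False))  # ['a||b']
print(split_row(["a|", "|b\n"], "||", rewind=True))   # ['a', 'b']
```

Without the rewind, the first `|` is consumed as ordinary data before the second one arrives, so the row comes back as a single field. Backing up by at most `len(sep) - 1` characters is one way to let the separator be matched in full. The real change is in pick #54581 and may differ.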

What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary:

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@sollhui sollhui requested a review from dataroaring as a code owner August 20, 2025 06:50
@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

sollhui commented Aug 20, 2025

run buildall

@doris-robot

TPC-H: Total hot run time: 40381 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 4032203934aeaf787b502388596ea1a07168f493, data reload: false

------ Round 1 ----------------------------------
query	cold (ms)	hot run 1 (ms)	hot run 2 (ms)	min hot (ms)
q1	17594	6864	6774	6774
q2	2034	181	179	179
q3	10593	1138	1176	1138
q4	10242	748	718	718
q5	7729	2861	2829	2829
q6	210	135	136	135
q7	984	622	622	622
q8	9362	1998	2057	1998
q9	6654	6387	6428	6387
q10	7039	2276	2315	2276
q11	477	263	267	263
q12	409	225	221	221
q13	17794	3006	3034	3006
q14	236	208	204	204
q15	505	462	453	453
q16	483	378	373	373
q17	973	624	551	551
q18	7348	6743	6859	6743
q19	1399	1112	1050	1050
q20	498	203	217	203
q21	3957	3280	3330	3280
q22	1109	982	978	978
Total cold run time: 107629 ms
Total hot run time: 40381 ms

----- Round 2, with runtime_filter_mode=off -----
query	cold (ms)	hot run 1 (ms)	hot run 2 (ms)	min hot (ms)
q1	6716	6656	6603	6603
q2	329	226	235	226
q3	3025	2952	2964	2952
q4	2060	1894	1838	1838
q5	5770	5817	5773	5773
q6	216	132	131	131
q7	2272	1845	1846	1845
q8	3425	3591	3603	3591
q9	8892	9017	8939	8939
q10	3575	3551	3570	3551
q11	599	487	498	487
q12	823	568	596	568
q13	8175	3242	3209	3209
q14	316	269	284	269
q15	508	489	470	470
q16	501	445	439	439
q17	1872	1655	1655	1655
q18	8314	7792	8091	7792
q19	1704	1536	1607	1536
q20	2100	1907	1914	1907
q21	5479	5236	5126	5126
q22	1113	1071	1001	1001
Total cold run time: 67784 ms
Total hot run time: 59908 ms

@doris-robot

TPC-DS: Total hot run time: 194247 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 4032203934aeaf787b502388596ea1a07168f493, data reload: false

query	cold (ms)	hot run 1 (ms)	hot run 2 (ms)	min hot (ms)
query1	957	422	412	412
query2	6231	2001	1919	1919
query3	8682	200	198	198
query4	33227	23741	24075	23741
query5	4009	471	481	471
query6	316	191	189	189
query7	4207	320	327	320
query8	318	228	234	228
query9	9299	2596	2578	2578
query10	485	267	271	267
query11	17831	15244	15296	15244
query12	153	107	103	103
query13	1547	444	418	418
query14	9534	7372	7744	7372
query15	269	175	176	175
query16	8167	519	535	519
query17	1704	620	619	619
query18	2210	324	330	324
query19	375	167	165	165
query20	124	126	123	123
query21	219	110	104	104
query22	4650	4494	4384	4384
query23	35159	34474	34530	34474
query24	11335	2955	2938	2938
query25	679	443	447	443
query26	1224	182	184	182
query27	2272	371	364	364
query28	7184	2159	2155	2155
query29	887	479	478	478
query30	268	167	160	160
query31	1066	849	835	835
query32	101	54	55	54
query33	810	305	313	305
query34	1042	522	554	522
query35	910	733	725	725
query36	1146	977	981	977
query37	140	72	79	72
query38	4164	4027	4015	4015
query39	1536	1802	1472	1472
query40	214	105	102	102
query41	50	48	48	48
query42	114	102	101	101
query43	547	503	492	492
query44	1296	808	842	808
query45	191	172	174	172
query46	1167	755	740	740
query47	2021	1912	1939	1912
query48	495	391	385	385
query49	885	410	401	401
query50	839	453	443	443
query51	7421	7180	7261	7180
query52	113	98	89	89
query53	274	192	193	192
query54	1266	490	480	480
query55	88	83	81	81
query56	284	266	246	246
query57	1297	1207	1203	1203
query58	234	219	214	214
query59	3204	3180	3016	3016
query60	305	275	266	266
query61	109	109	113	109
query62	909	681	706	681
query63	230	204	196	196
query64	4031	692	651	651
query65	3352	3365	3320	3320
query66	841	332	297	297
query67	16536	15710	15592	15592
query68	4414	592	607	592
query69	426	265	275	265
query70	1192	1091	1093	1091
query71	354	262	271	262
query72	6335	4044	4065	4044
query73	766	346	362	346
query74	10211	9026	9215	9026
query75	3426	2676	2701	2676
query76	2769	1074	1185	1074
query77	414	289	286	286
query78	10550	9577	9621	9577
query79	2601	623	633	623
query80	1076	446	439	439
query81	558	217	226	217
query82	569	96	89	89
query83	244	143	142	142
query84	233	83	86	83
query85	1688	307	295	295
query86	477	302	296	296
query87	4437	4237	4215	4215
query88	4338	2397	2392	2392
query89	417	298	298	298
query90	1924	188	191	188
query91	185	151	152	151
query92	67	50	51	50
query93	2195	572	574	572
query94	835	301	305	301
query95	372	263	265	263
query96	624	274	280	274
query97	3283	3136	3183	3136
query98	232	206	202	202
query99	1546	1321	1348	1321
Total cold run time: 301508 ms
Total hot run time: 194247 ms

@doris-robot

ClickBench: Total hot run time: 30.29 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 4032203934aeaf787b502388596ea1a07168f493, data reload: false

query	cold (s)	hot run 1 (s)	hot run 2 (s)
query1	0.03	0.03	0.04
query2	0.06	0.04	0.03
query3	0.22	0.07	0.07
query4	1.61	0.11	0.11
query5	0.51	0.51	0.51
query6	1.13	0.72	0.72
query7	0.02	0.02	0.02
query8	0.04	0.04	0.03
query9	0.56	0.50	0.50
query10	0.56	0.54	0.56
query11	0.15	0.12	0.10
query12	0.14	0.11	0.11
query13	0.61	0.60	0.59
query14	0.77	0.81	0.80
query15	0.84	0.84	0.83
query16	0.39	0.39	0.38
query17	1.00	1.00	1.05
query18	0.25	0.23	0.22
query19	1.87	1.80	1.90
query20	0.01	0.02	0.01
query21	15.41	0.59	0.58
query22	2.41	1.91	1.73
query23	17.13	0.91	0.88
query24	3.37	1.92	0.93
query25	0.32	0.13	0.19
query26	0.36	0.14	0.14
query27	0.05	0.04	0.04
query28	9.37	0.50	0.56
query29	12.56	3.20	3.19
query30	0.25	0.06	0.06
query31	2.84	0.40	0.40
query32	3.22	0.46	0.45
query33	2.97	3.02	3.04
query34	17.13	4.49	4.50
query35	4.58	4.59	4.58
query36	0.65	0.48	0.49
query37	0.09	0.06	0.06
query38	0.05	0.04	0.03
query39	0.03	0.03	0.02
query40	0.17	0.13	0.12
query41	0.08	0.02	0.03
query42	0.04	0.02	0.02
query43	0.04	0.03	0.02
Total cold run time: 103.89 s
Total hot run time: 30.29 s

@doris-robot

BE UT Coverage Report

Increment line coverage 50.00% (2/4) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 42.08% (11239/26711)
Line Coverage 32.59% (96166/295122)
Region Coverage 30.53% (55233/180902)
Branch Coverage 26.85% (27335/101796)

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 50.00% (2/4) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 75.24% (19716/26204)
Line Coverage 68.44% (201072/293807)
Region Coverage 66.53% (120421/181008)
Branch Coverage 59.88% (61151/102120)

@dataroaring dataroaring left a comment

LGTM

@dataroaring dataroaring merged commit df7194e into apache:branch-3.0 Aug 22, 2025
22 of 24 checks passed
@gavinchou gavinchou mentioned this pull request Sep 1, 2025