
[fix](export) remove export task executor in TransientTaskExecutor and fix concurrency issue (#42950)(#43051)(#43109)(#43250) #43305

Merged
morningman merged 4 commits into apache:branch-3.0 from morningman:pick_42950_to_upstream-apache_branch-3.0
Nov 6, 2024

@morningman commented Nov 6, 2024

cherry pick from (#42950)(#43051)(#43109)(#43250)

@morningman

run buildall

@doris-robot

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR

Since 2024-03-18, the Document has been moved to doris-website.
See Doris Document.

…he#43051)

### What problem does this PR solve?

Related PR: apache#42950

Problem Summary:

PR apache#42950 changed some logic in ExportJob by removing
`taskIdToExecutor`, which is a thread-safe ConcurrentHashMap.
However, when an export job is cancelled, the `jobExecutorList` in
ExportJob is cleared, and this `jobExecutorList` may be traversed at the
same time while the export job is being created,
causing a ConcurrentModificationException.

This PR fixes it by acquiring the writeLock of ExportMgr when cancelling
the export job.
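
The locking idea can be sketched as follows. This is a minimal illustration, not the Doris code: the class `ExportLockSketch` and its methods are hypothetical, and it assumes the creation path already traverses the list while holding the same write lock that the cancel path now takes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical stand-in for ExportMgr plus one job's executor list.
class ExportLockSketch {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private final List<String> jobExecutorList = new ArrayList<>();

    ExportLockSketch(List<String> executors) {
        jobExecutorList.addAll(executors);
    }

    // Creation path (assumed): traverse the list while holding the write lock.
    List<String> registerAll() {
        lock.writeLock().lock();
        try {
            List<String> registered = new ArrayList<>();
            jobExecutorList.forEach(registered::add);
            return registered;
        } finally {
            lock.writeLock().unlock();
        }
    }

    // Cancel path: take the same write lock before clearing, so the clear
    // cannot interleave with a traversal in registerAll().
    void cancel() {
        lock.writeLock().lock();
        try {
            jobExecutorList.clear();
        } finally {
            lock.writeLock().unlock();
        }
    }

    // Read-only access only needs the read lock.
    int size() {
        lock.readLock().lock();
        try {
            return jobExecutorList.size();
        } finally {
            lock.readLock().unlock();
        }
    }
}
```

Because both methods take the write lock, a cancel issued from another thread waits until any in-progress traversal has finished, and vice versa.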
### What problem does this PR solve?

Problem Summary:
```
2024-11-01 19:42:52,521 WARN (mysql-nio-pool-117|9514) [StmtExecutor.execute():616] Analyze failed. stmt[250257, 59c581a512e7468f-b1cfd7d4b63fed33]
org.apache.doris.common.NereidsException: errCode = 2, detailMessage = java.util.ConcurrentModificationException
        at org.apache.doris.qe.StmtExecutor.executeByNereids(StmtExecutor.java:780) ~[doris-fe.jar:1.2-SNAPSHOT]
        at org.apache.doris.qe.StmtExecutor.execute(StmtExecutor.java:601) ~[doris-fe.jar:1.2-SNAPSHOT]
        at org.apache.doris.qe.StmtExecutor.queryRetry(StmtExecutor.java:564) ~[doris-fe.jar:1.2-SNAPSHOT]
        at org.apache.doris.qe.StmtExecutor.execute(StmtExecutor.java:554) ~[doris-fe.jar:1.2-SNAPSHOT]
        at org.apache.doris.qe.ConnectProcessor.executeQuery(ConnectProcessor.java:340) ~[doris-fe.jar:1.2-SNAPSHOT]
        at org.apache.doris.qe.ConnectProcessor.handleQuery(ConnectProcessor.java:243) ~[doris-fe.jar:1.2-SNAPSHOT]
        at org.apache.doris.qe.MysqlConnectProcessor.handleQuery(MysqlConnectProcessor.java:208) ~[doris-fe.jar:1.2-SNAPSHOT]
        at org.apache.doris.qe.MysqlConnectProcessor.dispatch(MysqlConnectProcessor.java:236) ~[doris-fe.jar:1.2-SNAPSHOT]
        at org.apache.doris.qe.MysqlConnectProcessor.processOnce(MysqlConnectProcessor.java:413) ~[doris-fe.jar:1.2-SNAPSHOT]
        at org.apache.doris.mysql.ReadListener.lambda$handleEvent$0(ReadListener.java:52) ~[doris-fe.jar:1.2-SNAPSHOT]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
        at java.lang.Thread.run(Thread.java:840) ~[?:?]
Caused by: org.apache.doris.common.AnalysisException: errCode = 2, detailMessage = java.util.ConcurrentModificationException
        ... 13 more
Caused by: java.util.ConcurrentModificationException
        at java.util.ArrayList.forEach(ArrayList.java:1513) ~[?:?]
        at org.apache.doris.load.ExportMgr.addExportJobAndRegisterTask(ExportMgr.java:120) ~[doris-fe.jar:1.2-SNAPSHOT]
        at org.apache.doris.nereids.trees.plans.commands.ExportCommand.run(ExportCommand.java:149) ~[doris-fe.jar:1.2-SNAPSHOT]
        at org.apache.doris.qe.StmtExecutor.executeByNereids(StmtExecutor.java:749) ~[doris-fe.jar:1.2-SNAPSHOT]
        ... 12 more
```
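
The innermost frame of the trace is `ArrayList.forEach`, which throws when the list is structurally modified during the traversal. The sketch below reproduces that on a single thread: the `clear()` inside the callback stands in for the concurrent cancel, so the interleaving is deterministic. The class name `CmeRepro` and the element values are made up for illustration.

```java
import java.util.ArrayList;
import java.util.ConcurrentModificationException;
import java.util.List;

class CmeRepro {
    // Returns true if clearing the list during forEach raises
    // ConcurrentModificationException.
    static boolean clearDuringForEachThrows() {
        List<String> executors = new ArrayList<>(List.of("task-1", "task-2", "task-3"));
        try {
            // Simulates a cancel that clears the list mid-traversal.
            executors.forEach(e -> executors.clear());
            return false;
        } catch (ConcurrentModificationException ex) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(clearDuringForEachThrows());
    }
}
```

In the real failure the modification comes from a different thread, so it only shows up when the cancel and the creation happen to overlap.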

### Check List (For Committer)

- Test <!-- At least one of them must be included. -->

    - [ ] Regression test
    - [ ] Unit Test
    - [ ] Manual test (add detailed scripts or steps below)
    - [x] No need to test or manual test. Explain why:
        - [ ] This is a refactor/code format and no logic has been changed.
        - [x] Previous test can cover this change.
        - [ ] No code files have been changed.
        - [ ] Other reason <!-- Add your reason?  -->

- Behavior changed:

    - [x] No.
    - [ ] Yes. <!-- Explain the behavior change -->

- Does this need documentation?

    - [ ] No.
    - [ ] Yes. <!-- Add document PR link here. eg: apache/doris-website#1214 -->

- Release note

    <!-- bugfix, feat, behavior changed need a release note -->
    <!-- Add one line release note for this PR. -->
    None

### Check List (For Reviewer who merge this PR)

- [ ] Confirm the release note
- [ ] Confirm test cases
- [ ] Confirm document
- [ ] Add branch pick label <!-- Add branch pick label that this PR should merge into -->
…"create export job" edit log (apache#43250)

Problem Summary:

Fixes a bug like the following:
```
java.lang.NullPointerException: Cannot invoke "org.apache.doris.load.ExportJob.replayExportJobState(org.apache.doris.load.ExportJobState)" because "job" is null
        at org.apache.doris.load.ExportMgr.replayUpdateJobState(ExportMgr.java:475) ~[doris-fe.jar:1.2-SNAPSHOT]
        at org.apache.doris.persist.EditLog.loadJournal(EditLog.java:390) ~[doris-fe.jar:1.2-SNAPSHOT]
        at org.apache.doris.catalog.Env.replayJournal(Env.java:2965) ~[doris-fe.jar:1.2-SNAPSHOT]
        at org.apache.doris.master.Checkpoint.doCheckpoint(Checkpoint.java:135) ~[doris-fe.jar:1.2-SNAPSHOT]
        at org.apache.doris.master.Checkpoint.runAfterCatalogReady(Checkpoint.java:80) ~[doris-fe.jar:1.2-SNAPSHOT]
        at org.apache.doris.common.util.MasterDaemon.runOneCycle(MasterDaemon.java:58) ~[doris-fe.jar:1.2-SNAPSHOT]
        at org.apache.doris.common.util.Daemon.run(Daemon.java:119) ~[doris-fe.jar:1.2-SNAPSHOT]
```

This happens because the export job finishes before it has been added.
The "finish" log is written before the "create" log, so when the
"finish" log is replayed,
the export job cannot be found.

This PR changes the order of the "writing create export job log" and
`adding export task` events to avoid this error.
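
The ordering problem can be illustrated with a toy journal. This is a sketch under assumptions, not the Doris edit log: the `create:<id>` and `finish:<id>` entry format and the class `ReplayOrderSketch` are hypothetical, and the replay returns `false` at the point where the real replay would dereference a null job.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ReplayOrderSketch {
    // Replays "create:<id>" and "finish:<id>" entries in order. Returns false
    // if a "finish" entry refers to a job that has not been created yet.
    static boolean replay(List<String> journal) {
        Map<String, String> jobs = new HashMap<>();
        for (String entry : journal) {
            String[] parts = entry.split(":", 2);
            String op = parts[0];
            String id = parts[1];
            if (op.equals("create")) {
                jobs.put(id, "PENDING");
            } else if (op.equals("finish")) {
                if (!jobs.containsKey(id)) {
                    return false; // the job lookup would yield null here
                }
                jobs.put(id, "FINISHED");
            }
        }
        return true;
    }

    // Old order: the task is submitted first and finishes before the
    // "create" entry is written.
    static List<String> taskFirst() {
        List<String> journal = new ArrayList<>();
        journal.add("finish:1");
        journal.add("create:1");
        return journal;
    }

    // New order: the "create" entry is written before the task is submitted,
    // so "finish" can only be logged after it.
    static List<String> logFirst() {
        List<String> journal = new ArrayList<>();
        journal.add("create:1");
        journal.add("finish:1");
        return journal;
    }
}
```

With these definitions, `replay(taskFirst())` fails on the first entry, while `replay(logFirst())` succeeds.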
@morningman changed the title from "[fix](export) remove export task executor in TransientTaskExecutor (#42880) #42950" to "[fix](export) remove export task executor in TransientTaskExecutor and fix concurrency issue (#42880)(#43051)(#43109)(#43250)" Nov 6, 2024
@morningman

run buildall

@doris-robot

TPC-H: Total hot run time: 40357 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 54cc58cc0393cc53d3be45483d9cca61d6d470bb, data reload: false

------ Round 1 ----------------------------------
q1	17562	7336	7272	7272
q2	2039	155	153	153
q3	10714	1036	1092	1036
q4	10546	761	751	751
q5	7763	2772	2747	2747
q6	234	145	147	145
q7	971	620	623	620
q8	9373	1886	1998	1886
q9	6442	6428	6409	6409
q10	7000	2245	2315	2245
q11	447	269	296	269
q12	400	212	211	211
q13	17772	2955	2968	2955
q14	242	221	216	216
q15	565	516	488	488
q16	670	601	599	599
q17	957	545	532	532
q18	7157	6604	6477	6477
q19	1804	1077	1034	1034
q20	472	203	202	202
q21	3925	3186	3139	3139
q22	1079	971	986	971
Total cold run time: 108134 ms
Total hot run time: 40357 ms

----- Round 2, with runtime_filter_mode=off -----
q1	7309	7199	7201	7199
q2	329	232	228	228
q3	2847	2811	2880	2811
q4	2026	1784	1827	1784
q5	5628	5717	5719	5717
q6	225	138	145	138
q7	2177	1753	1756	1753
q8	3320	3509	3519	3509
q9	8769	8874	8829	8829
q10	3507	3486	3504	3486
q11	617	510	496	496
q12	797	588	591	588
q13	16443	3119	3122	3119
q14	311	273	267	267
q15	556	505	520	505
q16	718	659	660	659
q17	1866	1639	1584	1584
q18	8219	7759	7667	7667
q19	7822	1571	1586	1571
q20	2051	1880	1868	1868
q21	5298	5319	5245	5245
q22	1129	1040	1037	1037
Total cold run time: 81964 ms
Total hot run time: 60060 ms

@doris-robot

TPC-DS: Total hot run time: 193467 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 54cc58cc0393cc53d3be45483d9cca61d6d470bb, data reload: false

query1	1240	966	901	901
query2	6232	2104	1984	1984
query3	10801	3890	3767	3767
query4	68432	30212	23592	23592
query5	5271	438	439	438
query6	412	176	174	174
query7	5641	305	313	305
query8	310	215	217	215
query9	9232	2635	2579	2579
query10	481	266	245	245
query11	17748	15367	15624	15367
query12	151	112	108	108
query13	1508	425	423	423
query14	10462	6829	6959	6829
query15	215	187	166	166
query16	7219	505	466	466
query17	1035	544	557	544
query18	1847	312	294	294
query19	203	150	155	150
query20	112	104	105	104
query21	207	99	101	99
query22	4417	4172	4062	4062
query23	34222	33523	33520	33520
query24	6016	2785	2755	2755
query25	515	407	402	402
query26	682	158	161	158
query27	1708	296	296	296
query28	4255	2505	2484	2484
query29	677	435	437	435
query30	230	158	158	158
query31	957	781	829	781
query32	64	56	54	54
query33	454	276	271	271
query34	904	477	503	477
query35	800	728	730	728
query36	1065	909	923	909
query37	119	79	74	74
query38	3919	3837	3807	3807
query39	1479	1426	1416	1416
query40	203	99	97	97
query41	51	50	48	48
query42	113	102	103	102
query43	505	476	485	476
query44	1129	773	789	773
query45	199	169	165	165
query46	1116	683	707	683
query47	1902	1835	1799	1799
query48	475	363	368	363
query49	742	384	383	383
query50	811	403	400	400
query51	7122	6980	6964	6964
query52	99	96	97	96
query53	255	185	185	185
query54	576	451	443	443
query55	80	79	77	77
query56	264	234	245	234
query57	1140	1115	1082	1082
query58	215	204	211	204
query59	3144	3171	3150	3150
query60	278	244	235	235
query61	99	94	97	94
query62	765	648	636	636
query63	208	184	176	176
query64	1693	615	612	612
query65	3252	3130	3173	3130
query66	707	308	292	292
query67	15677	15092	15328	15092
query68	4507	549	549	549
query69	420	247	253	247
query70	1151	1130	1131	1130
query71	399	261	258	258
query72	6435	3919	3936	3919
query73	763	338	346	338
query74	10315	8891	8853	8853
query75	3334	2632	2636	2632
query76	1798	922	951	922
query77	472	268	274	268
query78	10743	9696	9477	9477
query79	9921	584	594	584
query80	1670	406	408	406
query81	531	243	248	243
query82	1647	119	115	115
query83	268	157	137	137
query84	285	79	75	75
query85	1094	280	276	276
query86	415	301	300	300
query87	4432	4233	4222	4222
query88	5307	2397	2380	2380
query89	525	284	287	284
query90	2172	183	180	180
query91	171	138	146	138
query92	64	49	47	47
query93	6600	542	546	542
query94	861	283	276	276
query95	338	255	254	254
query96	639	276	293	276
query97	3374	3112	3124	3112
query98	218	202	197	197
query99	1679	1302	1310	1302
Total cold run time: 336610 ms
Total hot run time: 193467 ms

@morningman merged commit e929034 into apache:branch-3.0 Nov 6, 2024
@morningman changed the title from "[fix](export) remove export task executor in TransientTaskExecutor and fix concurrency issue (#42880)(#43051)(#43109)(#43250)" to "[fix](export) remove export task executor in TransientTaskExecutor and fix concurrency issue (#42950)(#43051)(#43109)(#43250)" Nov 6, 2024