Skip to content

[fix](deadlock) avoid deadlock on tabletInvertedIndex#54197

Merged
dataroaring merged 1 commit intoapache:masterfrom
dataroaring:tablet_inverted_index_lock
Aug 19, 2025
Merged

[fix](deadlock) avoid deadlock on tabletInvertedIndex#54197
dataroaring merged 1 commit intoapache:masterfrom
dataroaring:tablet_inverted_index_lock

Conversation

@dataroaring
Copy link
Contributor

@dataroaring dataroaring commented Aug 1, 2025

StampLock used by tabletInvertedIndex is a fair lock.

Comments from StampLock:
In particular, we use the phase-fair anti-barging rule: If an incoming reader arrives while read lock is held but there is a queued writer, this incoming reader is queued.

A dead lock happens like below:

ReportThread: hold TabletInvertedIndex.readLock and try to acquire DatabaseTransactionMgr.readLock.

CommitThread: try to acquire TabletInvertedIndex.readLock.

TabletOperatorThread: try to acquire TabletInvertedIndex.writeLock.

CommitThread: hold DatabaseTransactionMgr.writeLock and waiting for editlog's lock.

The DatabaseTransactionMgr.writeLock is held when writing editlog, so sometimes it is time consuming and results in time consuming tablet report.

What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary:

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@dataroaring dataroaring added usercase Important user case type label dev/3.0.x dev/3.1.x labels Aug 1, 2025
@Thearas
Copy link
Contributor

Thearas commented Aug 1, 2025

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@dataroaring
Copy link
Contributor Author

run buildall

@hello-stephen
Copy link
Contributor

FE UT Coverage Report

Increment line coverage 100.00% (3/3) 🎉
Increment coverage report
Complete coverage report


}

private TransactionState unprotectedGetTransactionState(Long transactionId) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unprotectedGetTransactionState 这个函数调用的地方很多啊,都有加速写锁,看看需不需要去掉啊

@yujun777
Copy link
Contributor

yujun777 commented Aug 3, 2025

when do "CommitThread: hold DatabaseTransactionMgr.writeLock and try to acquire TabletInvertedIndex.readLock. " ? only find function checkCommitStatus try to acquire the TabletInvertedIndex.lock, but when calling function checkCommitStatus seems no hold the DatabaseTransactionMgr.writeLock

StampLock used by tabletInvertedIndex is a fair lock.

Comments from StampLock:
In particular, we use the phase-fair anti-barging rule: If an
incoming reader arrives while read lock is held but there is a
queued writer, this incoming reader is queued.

A dead lock happens like below:

ReportThread: hold TabletInvertedIndex.readLock and try to acquire
DatabaseTransactionMgr.readLock.

CommitThread: hold DatabaseTransactionMgr.writeLock and try to acquire
TabletInvertedIndex.readLock.

TabletOperatorThread: try to acquire TabletInvertedIndex.writeLock.
@dataroaring dataroaring force-pushed the tablet_inverted_index_lock branch from d37fd15 to 4468d0b Compare August 11, 2025 01:33
@dataroaring
Copy link
Contributor Author

run buildall

@doris-robot
Copy link

TPC-H: Total hot run time: 33553 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 4468d0b9e0b03fc876ce09dd553f286dc8bba8cc, data reload: false

------ Round 1 ----------------------------------
q1	17651	5437	5313	5313
q2	1928	297	189	189
q3	10297	1357	718	718
q4	10223	996	516	516
q5	7516	2313	2373	2313
q6	186	170	131	131
q7	911	796	614	614
q8	9302	1294	995	995
q9	6633	4996	5139	4996
q10	6910	2320	1977	1977
q11	456	278	253	253
q12	345	369	221	221
q13	17775	3501	2944	2944
q14	222	253	214	214
q15	520	464	453	453
q16	418	407	376	376
q17	551	806	348	348
q18	7411	7013	6934	6934
q19	1482	978	542	542
q20	317	308	206	206
q21	3363	2995	2288	2288
q22	1076	1074	1012	1012
Total cold run time: 105493 ms
Total hot run time: 33553 ms

----- Round 2, with runtime_filter_mode=off -----
q1	5388	5287	5353	5287
q2	241	314	217	217
q3	2161	2653	2280	2280
q4	1331	1741	1372	1372
q5	4252	4366	4396	4366
q6	226	185	139	139
q7	1944	1952	1690	1690
q8	2464	2577	2599	2577
q9	7313	7441	7236	7236
q10	3217	3344	2913	2913
q11	545	511	490	490
q12	689	797	624	624
q13	3630	3669	3429	3429
q14	330	335	281	281
q15	506	462	477	462
q16	454	492	457	457
q17	1220	1484	1376	1376
q18	8010	7809	7574	7574
q19	9149	1022	1278	1022
q20	1970	1991	1872	1872
q21	15033	4275	4150	4150
q22	1047	1035	996	996
Total cold run time: 71120 ms
Total hot run time: 50810 ms

@doris-robot
Copy link

TPC-DS: Total hot run time: 159959 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 4468d0b9e0b03fc876ce09dd553f286dc8bba8cc, data reload: false

catalog_page	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 22:43:58	2023-12-26 22:44:01	NULL	utf-8	NULL	NULL	
reason	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 22:43:56	2023-12-26 22:44:01	NULL	utf-8	NULL	NULL	
============================================
query1	999	397	439	397
query2	6553	1903	1707	1707
query3	6751	225	221	221
query4	26844	23474	22713	22713
query5	4354	609	525	525
query6	317	225	225	225
query7	4637	534	290	290
query8	293	235	240	235
query9	8613	2976	2956	2956
query10	479	338	297	297
query11	15466	14946	14831	14831
query12	184	139	136	136
query13	1689	586	457	457
query14	8799	5986	6110	5986
query15	217	188	167	167
query16	7848	673	466	466
query17	1226	768	662	662
query18	2016	448	327	327
query19	226	210	183	183
query20	150	143	137	137
query21	262	127	120	120
query22	4028	3882	3896	3882
query23	34497	34125	34129	34125
query24	7612	2350	2448	2350
query25	498	505	448	448
query26	716	309	161	161
query27	2392	505	349	349
query28	3045	2309	2318	2309
query29	681	596	482	482
query30	290	233	190	190
query31	887	774	710	710
query32	90	72	77	72
query33	553	419	372	372
query34	789	862	540	540
query35	783	848	739	739
query36	994	1041	936	936
query37	139	111	92	92
query38	3970	3920	3919	3919
query39	1437	1380	1397	1380
query40	234	145	134	134
query41	62	58	59	58
query42	138	129	124	124
query43	509	509	476	476
query44	1427	876	888	876
query45	211	206	186	186
query46	944	1047	669	669
query47	1786	1843	1744	1744
query48	399	456	315	315
query49	678	531	413	413
query50	675	673	414	414
query51	4145	4165	4147	4147
query52	133	134	117	117
query53	262	295	220	220
query54	657	644	556	556
query55	91	89	86	86
query56	356	361	351	351
query57	1180	1207	1152	1152
query58	334	324	337	324
query59	2617	2614	2607	2607
query60	401	413	408	408
query61	120	153	120	120
query62	779	744	642	642
query63	256	215	211	211
query64	2457	1090	805	805
query65	4240	4139	4139	4139
query66	998	440	329	329
query67	query68	16683	852	846	846
query69	1134	277	288	277
query70	1327	1049	1106	1049
query71	720	317	326	317
query72	9180	2316	2128	2128
query73	3429	698	418	418
query74	9079	9060	8850	8850
query75	7786	3124	2678	2678
query76	8877	1211	784	784
query77	1152	422	335	335
query78	9590	11608	query79	query80	5030	507	483	483
query81	570	230	243	230
query82	973	119	114	114
query83	400	260	263	260
query84	296	86	83	83
query85	1281	337	330	330
query86	410	278	302	278
query87	4465	4088	4115	4088
query88	10618	2259	2250	2250
query89	573	377	310	310
query90	2458	235	240	235
query91	148	140	113	113
query92	95	74	67	67
query93	8395	1026	650	650
query94	1433	405	282	282
query95	432	329	315	315
query96	505	583	285	285
query97	2699	2686	2638	2638
query98	255	243	214	214
query99	1454	1437	1335	1335
Total cold run time: 289270 ms
Total hot run time: 159959 ms

@doris-robot
Copy link

ClickBench: Total hot run time: 33.18 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 4468d0b9e0b03fc876ce09dd553f286dc8bba8cc, data reload: false

query1	0.04	0.03	0.03
query2	0.08	0.04	0.04
query3	0.26	0.07	0.07
query4	1.62	0.11	0.11
query5	0.41	0.42	0.44
query6	1.16	0.68	0.66
query7	0.03	0.02	0.02
query8	0.05	0.03	0.04
query9	0.56	0.47	0.46
query10	0.53	0.54	0.54
query11	0.16	0.10	0.11
query12	0.15	0.11	0.12
query13	0.65	0.65	0.62
query14	0.93	1.18	0.95
query15	0.94	0.90	0.91
query16	0.39	0.39	0.39
query17	1.05	1.03	1.04
query18	0.21	0.22	0.20
query19	1.90	1.87	1.91
query20	0.02	0.01	0.01
query21	15.36	0.89	0.56
query22	0.76	1.30	0.78
query23	14.72	1.22	0.64
query24	6.49	2.00	0.79
query25	0.48	0.22	0.19
query26	0.60	0.14	0.15
query27	0.06	0.05	0.06
query28	9.86	0.87	0.46
query29	12.56	3.79	3.29
query30	3.04	2.96	3.00
query31	2.81	0.56	0.40
query32	3.23	0.56	0.48
query33	3.02	3.15	3.20
query34	16.00	5.36	4.90
query35	4.89	4.98	5.00
query36	0.70	0.52	0.50
query37	0.10	0.08	0.07
query38	0.05	0.05	0.05
query39	0.03	0.02	0.03
query40	0.16	0.15	0.13
query41	0.08	0.02	0.02
query42	0.03	0.02	0.03
query43	0.04	0.04	0.04
Total cold run time: 106.21 s
Total hot run time: 33.18 s

@hello-stephen
Copy link
Contributor

FE UT Coverage Report

Increment line coverage 100.00% (3/3) 🎉
Increment coverage report
Complete coverage report

Copy link
Contributor

@yujun777 yujun777 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@github-actions
Copy link
Contributor

PR approved by anyone and no changes requested.

Copy link
Contributor

@deardeng deardeng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

@liaoxin01 liaoxin01 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@github-actions github-actions bot added the approved Indicates a PR has been approved by one committer. label Aug 19, 2025
@github-actions
Copy link
Contributor

PR approved by at least one committer and no changes requested.

@dataroaring dataroaring merged commit 7ba1aeb into apache:master Aug 19, 2025
31 of 34 checks passed
github-actions bot pushed a commit that referenced this pull request Aug 19, 2025
StampLock used by tabletInvertedIndex is a fair lock.

Comments from StampLock:
In particular, we use the phase-fair anti-barging rule: If an incoming
reader arrives while read lock is held but there is a queued writer,
this incoming reader is queued.

A dead lock happens like below:

ReportThread: hold TabletInvertedIndex.readLock and try to acquire
DatabaseTransactionMgr.readLock.

CommitThread: try to acquire TabletInvertedIndex.readLock.

TabletOperatorThread: try to acquire TabletInvertedIndex.writeLock.

CommitThread: hold DatabaseTransactionMgr.writeLock and waiting for
editlog's lock.

The DatabaseTransactionMgr.writeLock is held when writing editlog, so
sometimes it is time consuming and results in time consuming tablet
report.

Co-authored-by: Yongqiang YANG <yangyogqiang@selectdb.com>
github-actions bot pushed a commit that referenced this pull request Aug 19, 2025
StampLock used by tabletInvertedIndex is a fair lock.

Comments from StampLock:
In particular, we use the phase-fair anti-barging rule: If an incoming
reader arrives while read lock is held but there is a queued writer,
this incoming reader is queued.

A dead lock happens like below:

ReportThread: hold TabletInvertedIndex.readLock and try to acquire
DatabaseTransactionMgr.readLock.

CommitThread: try to acquire TabletInvertedIndex.readLock.

TabletOperatorThread: try to acquire TabletInvertedIndex.writeLock.

CommitThread: hold DatabaseTransactionMgr.writeLock and waiting for
editlog's lock.

The DatabaseTransactionMgr.writeLock is held when writing editlog, so
sometimes it is time consuming and results in time consuming tablet
report.

Co-authored-by: Yongqiang YANG <yangyogqiang@selectdb.com>
morrySnow pushed a commit that referenced this pull request Aug 19, 2025
 (#54997)

Cherry-picked from #54197

Co-authored-by: Yongqiang YANG <yangyongqiang@selectdb.com>
Co-authored-by: Yongqiang YANG <yangyogqiang@selectdb.com>
dataroaring added a commit that referenced this pull request Aug 20, 2025
 (#54996)

Cherry-picked from #54197

Co-authored-by: Yongqiang YANG <yangyongqiang@selectdb.com>
Co-authored-by: Yongqiang YANG <yangyogqiang@selectdb.com>
@gavinchou gavinchou mentioned this pull request Sep 1, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by one committer. dev/2.1.x dev/2.1.x-conflict dev/3.0.8-merged dev/3.1.0-merged reviewed usercase Important user case type label

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants