Skip to content

[fix](cloud) Fix SingleFlight deadlock in CloudTabletMgr::get_tablet during warmup#60622

Merged
dataroaring merged 1 commit intoapache:masterfrom
liaoxin01:fix/singleflight-deadlock-master
Feb 10, 2026
Merged

[fix](cloud) Fix SingleFlight deadlock in CloudTabletMgr::get_tablet during warmup#60622
dataroaring merged 1 commit intoapache:masterfrom
liaoxin01:fix/singleflight-deadlock-master

Conversation

@liaoxin01
Copy link
Contributor

What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary:
During initial tablet loading, the load_tablet lambda inside SingleFlight calls sync_tablet_rowsets -> _add_rowsets_directly -> _submit_segment_download_task -> rowset_meta->fs(), which internally calls ExecEnv::get_tablet() for the same tablet_id. Since the tablet hasn't been inserted into cache yet, this triggers a second SingleFlight lookup for the same key, causing event.wait() to deadlock (waiting for itself).

Fix by pre-setting the encryption algorithm on RowsetMeta before calling fs() in _submit_segment_download_task, so _determine_encryption_once returns the cached value without re-entering get_tablet.

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

Copilot AI review requested due to automatic review settings February 9, 2026 11:20
@hello-stephen
Copy link
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@liaoxin01
Copy link
Contributor Author

run buildall

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes a SingleFlight self-deadlock that can occur during cloud tablet warmup / initial tablet loading by preventing RowsetMeta::fs() from re-entering ExecEnv::get_tablet() for the same tablet.

Changes:

  • Add a RowsetMeta API to pre-initialize the cached encryption algorithm used by RowsetMeta::fs().
  • Pre-set the encryption algorithm in CloudTablet::_submit_segment_download_task() before calling rowset_meta->fs().

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
be/src/olap/rowset/rowset_meta.h Adds an API to pre-initialize the encryption algorithm cached via DorisCallOnce.
be/src/cloud/cloud_tablet.cpp Uses the new API before RowsetMeta::fs() in the segment warmup download path to avoid re-entrant tablet lookup.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 1603 to 1607
// Pre-set encryption algorithm to avoid re-entrant get_tablet() call inside
// RowsetMeta::fs() which causes SingleFlight deadlock when the tablet is not
// yet cached (during initial load_tablet).
rowset_meta->set_encryption_algorithm(_tablet_meta->encryption_algorithm());
auto file_system = rowset_meta->fs();
Copy link

Copilot AI Feb 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This pre-set is only done in _submit_segment_download_task(), but _add_rowsets_directly() can also submit inverted-index warmup tasks. When config::file_cache_enable_only_warm_up_idx is enabled (segments skipped), _submit_inverted_index_download_task() will still call rowset_meta->fs() without the pre-set and can re-enter ExecEnv::get_tablet() during initial load, so the SingleFlight deadlock can still occur. Apply the same pre-set for the inverted-index download path too (or centralize it before any warmup task calls RowsetMeta::fs()).

Copilot uses AI. Check for mistakes.
@liaoxin01 liaoxin01 force-pushed the fix/singleflight-deadlock-master branch from b8ce1b3 to 38d4442 Compare February 9, 2026 12:27
@liaoxin01
Copy link
Contributor Author

run buildall

…during warmup

During initial tablet loading, the load_tablet lambda inside SingleFlight calls
sync_tablet_rowsets -> _add_rowsets_directly -> _submit_segment_download_task ->
rowset_meta->fs(), which internally calls ExecEnv::get_tablet() for the same
tablet_id. Since the tablet hasn't been inserted into cache yet, this triggers
a second SingleFlight lookup for the same key, causing event.wait() to deadlock
(waiting for itself).

Fix by pre-setting the encryption algorithm on RowsetMeta before calling fs()
in _submit_segment_download_task, so _determine_encryption_once returns the
cached value without re-entering get_tablet.
@liaoxin01 liaoxin01 force-pushed the fix/singleflight-deadlock-master branch from 38d4442 to 00bf565 Compare February 9, 2026 12:32
@liaoxin01
Copy link
Contributor Author

run buildall

@github-actions github-actions bot added the approved Indicates a PR has been approved by one committer. label Feb 9, 2026
@github-actions
Copy link
Contributor

github-actions bot commented Feb 9, 2026

PR approved by at least one committer and no changes requested.

@github-actions
Copy link
Contributor

github-actions bot commented Feb 9, 2026

PR approved by anyone and no changes requested.

@doris-robot
Copy link

TPC-H: Total hot run time: 30325 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 00bf565d477a2b80c2c5c7c9ae440088623813b6, data reload: false

------ Round 1 ----------------------------------
q1	17607	4369	4262	4262
q2	2065	342	225	225
q3	10181	1319	743	743
q4	10208	769	314	314
q5	7533	2208	1966	1966
q6	193	178	150	150
q7	906	769	614	614
q8	9279	1412	1154	1154
q9	4643	4626	4609	4609
q10	6771	1948	1556	1556
q11	547	310	301	301
q12	341	390	236	236
q13	17760	4057	3273	3273
q14	246	255	210	210
q15	892	822	791	791
q16	674	667	650	650
q17	730	896	547	547
q18	6704	5670	5622	5622
q19	1238	988	626	626
q20	517	500	395	395
q21	2630	1847	1804	1804
q22	357	322	277	277
Total cold run time: 102022 ms
Total hot run time: 30325 ms

----- Round 2, with runtime_filter_mode=off -----
q1	4332	4338	4349	4338
q2	255	356	262	262
q3	2114	2688	2182	2182
q4	1363	1745	1311	1311
q5	4305	4207	4288	4207
q6	216	179	140	140
q7	1856	1768	1661	1661
q8	2453	2746	2597	2597
q9	7608	7417	7507	7417
q10	2882	3363	2599	2599
q11	572	476	453	453
q12	692	766	677	677
q13	3815	4433	3504	3504
q14	306	311	284	284
q15	853	829	854	829
q16	679	759	699	699
q17	1166	1399	1419	1399
q18	8310	7755	8193	7755
q19	966	919	888	888
q20	2109	2201	2031	2031
q21	4887	4497	4288	4288
q22	606	562	533	533
Total cold run time: 52345 ms
Total hot run time: 50054 ms

@doris-robot
Copy link

ClickBench: Total hot run time: 28.29 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 00bf565d477a2b80c2c5c7c9ae440088623813b6, data reload: false

query1	0.05	0.05	0.05
query2	0.09	0.05	0.04
query3	0.25	0.08	0.08
query4	1.60	0.11	0.11
query5	0.27	0.24	0.25
query6	1.18	0.67	0.67
query7	0.02	0.02	0.02
query8	0.06	0.04	0.04
query9	0.57	0.50	0.48
query10	0.56	0.55	0.55
query11	0.14	0.10	0.10
query12	0.14	0.10	0.11
query13	0.63	0.60	0.62
query14	1.07	1.07	1.06
query15	0.87	0.85	0.88
query16	0.41	0.40	0.42
query17	1.13	1.10	1.11
query18	0.23	0.21	0.21
query19	2.05	2.02	1.98
query20	0.02	0.01	0.02
query21	15.42	0.23	0.14
query22	5.44	0.05	0.05
query23	16.06	0.29	0.11
query24	1.68	0.24	0.31
query25	0.10	0.08	0.08
query26	0.14	0.14	0.13
query27	0.09	0.05	0.08
query28	3.52	1.16	0.97
query29	12.58	3.87	3.14
query30	0.28	0.14	0.11
query31	2.82	0.62	0.41
query32	3.23	0.60	0.51
query33	3.33	3.30	3.28
query34	16.19	5.42	4.80
query35	4.79	4.70	4.89
query36	0.64	0.49	0.49
query37	0.11	0.08	0.07
query38	0.09	0.04	0.04
query39	0.05	0.03	0.04
query40	0.19	0.16	0.16
query41	0.09	0.04	0.03
query42	0.04	0.04	0.03
query43	0.04	0.04	0.03
Total cold run time: 98.26 s
Total hot run time: 28.29 s

@hello-stephen
Copy link
Contributor

BE UT Coverage Report

Increment line coverage 0.00% (0/5) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 52.72% (19445/36885)
Line Coverage 36.21% (181040/499907)
Region Coverage 32.60% (140552/431173)
Branch Coverage 33.63% (60885/181049)

@hello-stephen
Copy link
Contributor

BE Regression && UT Coverage Report

Increment line coverage 0.00% (0/5) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 71.66% (25899/36140)
Line Coverage 54.29% (270740/498668)
Region Coverage 51.83% (225767/435553)
Branch Coverage 53.26% (96801/181753)

Copy link
Contributor

@dataroaring dataroaring left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@dataroaring dataroaring merged commit 33d9168 into apache:master Feb 10, 2026
29 of 32 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by one committer. dev/4.0.x dev/4.0.x-conflict reviewed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants