[enhance](job) support adaptive batch param for routine load job #56930
liaoxin01 merged 2 commits into apache:master
PR approved by at least one committer and no changes requested.
…che#56930)

### What problem does this PR solve?

Users may set `max batch interval` to a relatively small value for data visibility, which can cause insufficient throughput and a data backlog when traffic is high. We propose an adaptive max batch interval scheme that **prioritizes throughput over visibility during data backlog**.
…ptive timeout (apache#57967)

Fix: the transaction timeout does not match the routine load task's adaptive timeout (introduced by apache#56930).
…oad job (#58846)

### What problem does this PR solve?

Pick #56930 and #57967.

Users may set the max batch interval to a relatively small value for data visibility, which can cause insufficient throughput and a data backlog when traffic is high. We propose an adaptive max batch interval scheme that prioritizes throughput over visibility during data backlog.

### Release note

None
### What problem does this PR solve?

## Problem

When routine load uses the adaptive batch interval (introduced in #56930), the FE transaction timeout does not match the BE task timeout, so transactions are aborted by `txnCleaner` while the BE is still processing them.

### Root Cause

In `RoutineLoadTaskScheduler.scheduleOneTask()`, the execution order is:

1. `beginTxn()` - creates the transaction with the **original** `timeoutMs`
2. `createRoutineLoadTask()` → `adaptiveBatchParam()` - updates `timeoutMs` to the **adaptive** value

This causes a mismatch:

- FE transaction timeout: original value (e.g., 200 seconds)
- BE task timeout: adaptive value (e.g., 360 seconds)
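The ordering problem in the root cause above can be illustrated with a small, self-contained sketch. It is not the Doris code: `TxnOrderingSketch`, its nested `Task` class, and the `buggyOrder`/`fixedOrder` helpers are hypothetical stand-ins, and `fixedOrder` shows only one possible way to avoid the mismatch (adapting the timeout before the transaction is created). The source does not state how the actual fix is implemented.

```java
/** Illustrative only; names here are hypothetical and not the Doris classes. */
final class TxnOrderingSketch {

    /** Stand-in for a routine load task that carries a mutable timeout. */
    static final class Task {
        long timeoutMs;

        Task(long timeoutMs) {
            this.timeoutMs = timeoutMs;
        }

        /** Stand-in for adaptiveBatchParam(): replaces the timeout with the adaptive value. */
        void adaptTimeout(long adaptiveTimeoutMs) {
            this.timeoutMs = adaptiveTimeoutMs;
        }
    }

    /** Stand-in for beginTxn(): the transaction keeps the timeout it sees at creation time. */
    static long beginTxn(Task task) {
        return task.timeoutMs;
    }

    /** Order from the root cause: transaction first, adaptation second. */
    static long[] buggyOrder(long originalMs, long adaptiveMs) {
        Task task = new Task(originalMs);
        long txnTimeoutMs = beginTxn(task);   // captures the original timeout
        task.adaptTimeout(adaptiveMs);        // task timeout changes afterwards
        return new long[] {txnTimeoutMs, task.timeoutMs};
    }

    /** One possible remedy (an assumption): adapt first, then begin the transaction. */
    static long[] fixedOrder(long originalMs, long adaptiveMs) {
        Task task = new Task(originalMs);
        task.adaptTimeout(adaptiveMs);
        long txnTimeoutMs = beginTxn(task);   // now sees the adaptive timeout
        return new long[] {txnTimeoutMs, task.timeoutMs};
    }

    public static void main(String[] args) {
        long[] buggy = buggyOrder(200_000L, 360_000L);
        long[] fixed = fixedOrder(200_000L, 360_000L);
        System.out.println("buggy: txn=" + buggy[0] + " ms, task=" + buggy[1] + " ms");
        System.out.println("fixed: txn=" + fixed[0] + " ms, task=" + fixed[1] + " ms");
    }
}
```

With the example values from the description (200 seconds and 360 seconds), the first ordering leaves the transaction with the shorter timeout, which is the window in which `txnCleaner` could abort it while the task is still running.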
…ut_value (#60664)

### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #56930

Problem Summary:

When `isEof = false`, the timeout calculation in `updateAdaptiveTimeout` also needs to take `Config.routine_load_task_min_timeout_sec` into account.

### Release note

None
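A minimal sketch of the point in the problem summary above, assuming the minimum timeout acts as a lower bound on whichever timeout is chosen. The class `AdaptiveTimeoutSketch`, the method `timeoutSec`, and its parameters are hypothetical; the real `updateAdaptiveTimeout` may derive its inputs differently, and `minTimeoutSec` merely stands in for `Config.routine_load_task_min_timeout_sec`.

```java
/** Illustrative only; not the Doris implementation of updateAdaptiveTimeout. */
final class AdaptiveTimeoutSketch {

    /**
     * Picks a task timeout in seconds.
     *
     * @param isEof              whether the previous task reached the end of its data
     * @param regularTimeoutSec  timeout used when there is no backlog (assumed input)
     * @param adaptiveTimeoutSec timeout used when isEof is false (assumed input)
     * @param minTimeoutSec      stand-in for Config.routine_load_task_min_timeout_sec
     */
    static long timeoutSec(boolean isEof, long regularTimeoutSec,
                           long adaptiveTimeoutSec, long minTimeoutSec) {
        long candidate = isEof ? regularTimeoutSec : adaptiveTimeoutSec;
        // Apply the floor on both branches, including the isEof == false branch.
        return Math.max(candidate, minTimeoutSec);
    }

    public static void main(String[] args) {
        System.out.println(timeoutSec(false, 20L, 30L, 60L));
        System.out.println(timeoutSec(false, 20L, 360L, 60L));
    }
}
```

Taking the maximum on both branches means a small adaptive value can never produce a timeout below the configured minimum.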
### What problem does this PR solve?

Users may set `max batch interval` to a relatively small value for data visibility, which can cause insufficient throughput and a data backlog when traffic is high. We propose an adaptive max batch interval scheme that prioritizes throughput over visibility during data backlog.

### Release note

None
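The adaptive max batch interval idea from the PR summary could look roughly like the sketch below. It rests on assumptions: the backlog signal, the `backlogIntervalSec` parameter, and the class and method names are all hypothetical, and the rule shown (use the larger of the two intervals while a backlog exists) is only one plausible reading of "prioritizing throughput over visibility".

```java
/** Illustrative only; the selection rule and all names are assumptions. */
final class AdaptiveIntervalSketch {

    /**
     * Chooses the batch interval for the next task, in seconds.
     *
     * @param hasBacklog         whether unconsumed data is piling up (assumed signal)
     * @param userIntervalSec    the interval the user configured for visibility
     * @param backlogIntervalSec a larger interval favoring throughput (hypothetical)
     */
    static long chooseIntervalSec(boolean hasBacklog, long userIntervalSec,
                                  long backlogIntervalSec) {
        if (!hasBacklog) {
            // No backlog: respect the user's visibility-oriented setting.
            return userIntervalSec;
        }
        // Backlog: never shrink the interval, only grow it.
        return Math.max(userIntervalSec, backlogIntervalSec);
    }

    public static void main(String[] args) {
        System.out.println(chooseIntervalSec(false, 10L, 60L));
        System.out.println(chooseIntervalSec(true, 10L, 60L));
    }
}
```

Once the backlog clears, the function returns the user's interval again, so visibility is only traded away while data is piling up.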