[Feature](Streaming Job) Extend streaming job to support MySQL synchronization#58898
Merged
JNSimba merged 49 commits intoapache:masterfrom Dec 22, 2025
Merged
[Feature](Streaming Job) Extend streaming job to support MySQL synchronization#58898JNSimba merged 49 commits intoapache:masterfrom
JNSimba merged 49 commits intoapache:masterfrom
Conversation
# Conflicts: # build.sh
Contributor
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
There was a problem hiding this comment.
Pull request overview
This PR extends streaming jobs to support MySQL synchronization via CDC (Change Data Capture), enabling users to sync data from MySQL databases to Doris in real-time. The implementation includes a new CDC client service and modifications to the streaming job framework.
Key Changes:
- Introduces a CDC client Spring Boot application that interfaces with MySQL using Flink CDC connectors
- Adds support for FROM MySQL TO Database syntax in job creation
- Implements split-based data reading for both snapshot and binlog phases
- Adds RPC endpoints for BE-FE communication to handle CDC operations
Reviewed changes
Copilot reviewed 85 out of 85 changed files in this pull request and generated no comments.
Show a summary per file
| File | Description |
|---|---|
| regression-test/suites/job_p0/streaming_job/cdc/test_streaming_mysql_job.groovy | Regression test for MySQL streaming job with CDC |
| gensrc/proto/internal_service.proto | Adds RPC interface for CDC client communication |
| fs_brokers/cdc_client/** | Complete CDC client implementation using Spring Boot |
| fe/fe-core/.../streaming/** | Extends streaming job framework with multi-table task support |
| fe/fe-core/.../offset/jdbc/** | JDBC offset provider for tracking MySQL binlog positions |
| fe/fe-core/.../util/StreamingJobUtils.java | Utility functions for streaming job management |
| docker/thirdparties/docker-compose/mysql/my.cnf | Enables MySQL binlog for CDC |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
This comment was marked as outdated.
This comment was marked as outdated.
16 tasks
JNSimba
added a commit
that referenced
this pull request
Jan 13, 2026
### What problem does this PR solve? fix show task error info when task timeout Related PR: #58898
github-actions bot
pushed a commit
that referenced
this pull request
Jan 13, 2026
### What problem does this PR solve? fix show task error info when task timeout Related PR: #58898
This was referenced Jan 13, 2026
zzzxl1993
pushed a commit
to zzzxl1993/doris
that referenced
this pull request
Jan 13, 2026
…pache#59705) ### What problem does this PR solve? Issue Number: close #xxx Related PR: apache#58898
zzzxl1993
pushed a commit
to zzzxl1993/doris
that referenced
this pull request
Jan 13, 2026
…pty tables (apache#59735) ### What problem does this PR solve? Fix the issue of synchronization failure under empty tables Related PR: apache#58898
zzzxl1993
pushed a commit
to zzzxl1993/doris
that referenced
this pull request
Jan 13, 2026
…e#59784) ### What problem does this PR solve? fix show task error info when task timeout Related PR: apache#58898
zzzxl1993
pushed a commit
to zzzxl1993/doris
that referenced
this pull request
Jan 13, 2026
…pache#59760) ### What problem does this PR solve? fix get remote meta failed to pause streaming job Releate PR: apache#58898
16 tasks
JNSimba
added a commit
that referenced
this pull request
Jan 15, 2026
… remainsplit relay problem (#59883) ### What problem does this PR solve? Related PR: #58898 After the Job is created for the first time, starting from the initial offset, the task for the first split is scheduled, When the task status is running or failed, If FE restarts, the split needs to be restore from the meta again.
github-actions bot
pushed a commit
that referenced
this pull request
Jan 15, 2026
… remainsplit relay problem (#59883) ### What problem does this PR solve? Related PR: #58898 After the Job is created for the first time, starting from the initial offset, the task for the first split is scheduled, When the task status is running or failed, If FE restarts, the split needs to be restore from the meta again.
This was referenced Jan 21, 2026
JNSimba
added a commit
that referenced
this pull request
Feb 3, 2026
### What problem does this PR solve? Related PR: #58898 #59461 This PR primarily optimizes the speed of incremental and snapshot reads. 1. For incremental reads: - Binding the fetch logic to an interval allows fetching data within that interval. - Splitting the fetch and write logic asynchronously. 2. For snapshot reads: - Introducing the `snapshot_split_size` and `snapshot_parallelism` parameters. - `snapshot_split_size`: Adjusts the size of each chunk during the split phase, allowing each split to fetch more data. - `snapshot_parallelism`: The degree of parallelism during the snapshot read phase, i.e., how many chunks can run simultaneously, and how many chunks are scheduled in a single task.
github-actions bot
pushed a commit
that referenced
this pull request
Feb 3, 2026
### What problem does this PR solve? Related PR: #58898 #59461 This PR primarily optimizes the speed of incremental and snapshot reads. 1. For incremental reads: - Binding the fetch logic to an interval allows fetching data within that interval. - Splitting the fetch and write logic asynchronously. 2. For snapshot reads: - Introducing the `snapshot_split_size` and `snapshot_parallelism` parameters. - `snapshot_split_size`: Adjusts the size of each chunk during the split phase, allowing each split to fetch more data. - `snapshot_parallelism`: The degree of parallelism during the snapshot read phase, i.e., how many chunks can run simultaneously, and how many chunks are scheduled in a single task.
Merged
16 tasks
JNSimba
added a commit
that referenced
this pull request
Feb 5, 2026
…l/pg streaming job (#60473) ### What problem does this PR solve? Related PR: #58898 #59461 In some scenarios, it is necessary to tolerate a certain amount of erroneous data. Supported parameters: `load.strict_mode`: Whether to enable strict mode, defaults to false. `load.max_filter_ratio`: The maximum allowed filtering rate within the sampling window, defaults to zero tolerance. The sampling window is `max_interval * 10`. That is, if the number of erroneous rows/total rows exceeds `max_filter_ratio` within the sampling window, the job will be paused, requiring manual intervention to check data quality issues. eg: ``` CREATE JOB test_streaming_mysql_job_errormsg ON STREAMING FROM MYSQL ( "jdbc_url" = "jdbc:mysql://127.0.0.1:3308", ...... ) TO DATABASE database ( "table.create.properties.replication_num" = "1" ... "load.max_filter_ratio" = "1" ) ```
github-actions bot
pushed a commit
that referenced
this pull request
Feb 5, 2026
…l/pg streaming job (#60473) ### What problem does this PR solve? Related PR: #58898 #59461 In some scenarios, it is necessary to tolerate a certain amount of erroneous data. Supported parameters: `load.strict_mode`: Whether to enable strict mode, defaults to false. `load.max_filter_ratio`: The maximum allowed filtering rate within the sampling window, defaults to zero tolerance. The sampling window is `max_interval * 10`. That is, if the number of erroneous rows/total rows exceeds `max_filter_ratio` within the sampling window, the job will be paused, requiring manual intervention to check data quality issues. eg: ``` CREATE JOB test_streaming_mysql_job_errormsg ON STREAMING FROM MYSQL ( "jdbc_url" = "jdbc:mysql://127.0.0.1:3308", ...... ) TO DATABASE database ( "table.create.properties.replication_num" = "1" ... "load.max_filter_ratio" = "1" ) ```
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What problem does this PR solve?
Issue Number: close #58896
Related PR: #xxx
Problem Summary:
Release note
None
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)