[Feature](Streaming Job) Extend streaming job to support Postgres synchronization #59461

JNSimba merged 16 commits into apache:master
Conversation
Thank you for your contribution to Apache Doris. Please clearly describe your PR:

run buildall
TPC-H: Total hot run time: 34820 ms
TPC-DS: Total hot run time: 174319 ms
ClickBench: Total hot run time: 26.89 s
run buildall

run nonConcurrent
PR approved by at least one committer and no changes requested.

PR approved by anyone and no changes requested.
skip buildall |
### What problem does this PR solve?

Related PR: #59461

1. PostgreSQL uses replication slots for data consumption, but only one client can use a slot at a time. Therefore, after consuming data in the WAL phase, the slot needs to be closed. This doesn't affect MySQL, but the connection can be closed there as well to avoid consuming connections.
2. Create the PG slot first when creating the job.
3. Fix an unstable test case.
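For context, a minimal sketch of the Postgres-side slot lifecycle described above, assuming a logical slot named `doris_slot` and the `pgoutput` plugin (both names are illustrative, not taken from this PR):

```
-- Create the logical replication slot up front (this change moves slot
-- creation to job-creation time).
SELECT pg_create_logical_replication_slot('doris_slot', 'pgoutput');

-- A slot can only be consumed by one client at a time; 'active' shows whether
-- a connection currently holds it, which is why the consumer connection is
-- closed after each WAL read phase.
SELECT slot_name, active FROM pg_replication_slots;

-- Drop the slot when the streaming job is removed.
SELECT pg_drop_replication_slot('doris_slot');
```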
### What problem does this PR solve?

Related PR: #58898 #59461

This PR primarily optimizes the speed of incremental and snapshot reads.

1. For incremental reads:
   - Binding the fetch logic to an interval allows fetching data within that interval.
   - Splitting the fetch and write logic so they run asynchronously.
2. For snapshot reads:
   - Introducing the `snapshot_split_size` and `snapshot_parallelism` parameters.
   - `snapshot_split_size`: Adjusts the size of each chunk during the split phase, allowing each split to fetch more data.
   - `snapshot_parallelism`: The degree of parallelism during the snapshot read phase, i.e., how many chunks can run simultaneously and how many chunks are scheduled in a single task.
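A hedged sketch of how these two parameters might be supplied, following the CREATE JOB shape quoted later in this thread; the job name, jdbc_url, target database, concrete values, and the placement of the properties are assumptions, and only `snapshot_split_size` and `snapshot_parallelism` come from the description above:

```
CREATE JOB streaming_job_snapshot_tuning
ON STREAMING
FROM MYSQL (
    "jdbc_url" = "jdbc:mysql://127.0.0.1:3308",
    ......
)
TO DATABASE target_db (
    "table.create.properties.replication_num" = "1",
    -- larger splits let each chunk fetch more rows during the snapshot phase
    "snapshot_split_size" = "65536",
    -- how many chunks are read concurrently (and scheduled per task)
    "snapshot_parallelism" = "4"
)
```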
…l/pg streaming job (#60473)

### What problem does this PR solve?

Related PR: #58898 #59461

In some scenarios, it is necessary to tolerate a certain amount of erroneous data. Supported parameters:

- `load.strict_mode`: Whether to enable strict mode, defaults to false.
- `load.max_filter_ratio`: The maximum allowed filtering rate within the sampling window, defaults to zero tolerance. The sampling window is `max_interval * 10`. That is, if erroneous rows / total rows exceeds `max_filter_ratio` within the sampling window, the job will be paused, requiring manual intervention to check data quality issues.

eg:

```
CREATE JOB test_streaming_mysql_job_errormsg
ON STREAMING
FROM MYSQL (
    "jdbc_url" = "jdbc:mysql://127.0.0.1:3308",
    ......
)
TO DATABASE database (
    "table.create.properties.replication_num" = "1"
    ...
    "load.max_filter_ratio" = "1"
)
```
…e#60560)

### What problem does this PR solve?

Related PR: #59461

To enhance partition table synchronization, add `publish_via_partition_root` when creating a PUBLICATION instance, specifically for PG 13+.
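For reference, a minimal sketch of the Postgres statement this option corresponds to on PG 13+; the publication name and table are illustrative and not taken from this PR:

```
-- publish_via_partition_root makes changes to individual partitions appear as
-- changes to the partitioned (root) table, so downstream consumers see one
-- logical table instead of one stream per partition.
CREATE PUBLICATION doris_publication
    FOR TABLE public.orders
    WITH (publish_via_partition_root = true);
```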
### What problem does this PR solve?

Issue #58896 and PR #58898 implement multi-table synchronization for MySQL. The main purpose of this PR is to extend the data source to Postgres.
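As a rough illustration only, a hypothetical Postgres counterpart of the MySQL job quoted earlier in this thread; the `FROM POSTGRES` keyword, the jdbc_url, and the database names are assumptions for illustration and are not confirmed by this PR:

```
-- Hypothetical sketch; syntax mirrors the MySQL example above and is not
-- taken from this PR.
CREATE JOB test_streaming_pg_job
ON STREAMING
FROM POSTGRES (
    "jdbc_url" = "jdbc:postgresql://127.0.0.1:5432/source_db",
    ......
)
TO DATABASE target_db (
    "table.create.properties.replication_num" = "1"
)
```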
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merges this PR)