Features
Version 25.3.9396
CData Sync offers powerful features that help you replicate any data source to any database, data warehouse, and more. The following features are briefly explained in this document:
- Incremental replication
- Sync intervals (data integrity)
- Deletion captures (data and records)
- Data types
- Data transformations
- API access
- Firewall traversal
- Schema changes
Initial Replication
The first time that you run a job, CData Sync processes the entirety of the source’s historical data. This data can encompass a huge amount of information. Therefore, Sync uses several strategies to maximize efficiency, performance, and integrity. Sync also provides user-controlled options that you can use to optimize the synchronization strategy for your specific data set.
You can control how Sync processes an initial replication by using the following options. (These options are available on the Advanced tab, under the Job Settings category that is displayed when you open a job.)
- Replicate Start Value: Sync begins replicating data from the source’s minimum date or minimum integer value of the auto-increment column (that is, from the source’s earliest available records). Some APIs do not provide a way to request the minimum date or integer value for an entity. If no minimum value is available, you can configure it manually by following these steps:
  - Under Job Settings, click the Tables tab. Then, click the table that is displayed on that tab to open the Task Settings modal.
  - Click the Advanced tab in the modal.
  - Locate the Replicate Start Value option. Manually set the minimum start date/integer value by adding a date in either of these forms: yyyy-mm-dd or yyyy-mm-dd hh:mm:ss.

  If you do not specify a start date, Sync executes a query on the source that obtains every record in one request. However, this process can cause problems when your source table is very large because any error causes Sync to rerun the query from the beginning of the data.
- Replicate Intervals: After Sync determines the minimum start date, the application moves the remainder of your data by an interval that you define until it reaches the end of the data. You define the intervals with these options, which are also available on the Advanced tab of the Job Settings page:
  - Replicate Interval: Paired with Replicate Interval Unit, this option enables you to set the time interval at which to split the data during data retrieval. Sync uses this interval to batch the updates so that if a failure occurs or the replication is interrupted, the next attempt can start where the last run ended. By default, Sync uses an interval of 180 days. However, you can make this interval larger or smaller, depending on how much data you have and how that data is dispersed over time (see the sketch after this list).
  - Replicate Interval Unit: Paired with Replicate Interval, this option enables you to set the unit of the time interval at which to split the data during data retrieval. Accepted values for this option are minutes, hours, days, weeks, months, and years.
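For illustration, the following is a conceptual sketch of how interval batching works, assuming a Replicate Start Value of 2020-01-01 and the default 180-day interval. The table and column names are hypothetical, and the actual queries that Sync issues vary by source:

```sql
-- Conceptual sketch only (not literal Sync output). Each batch covers one
-- 180-day interval, so an interrupted run can resume at the last completed
-- interval instead of starting over from the beginning of the data.
SELECT * FROM Orders
WHERE ModifiedDate >= '2020-01-01' AND ModifiedDate < '2020-06-29';

SELECT * FROM Orders
WHERE ModifiedDate >= '2020-06-29' AND ModifiedDate < '2020-12-26';
-- ...and so on, in 180-day steps, until the batches reach the present.
```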
Incremental Replication
After the initial replication, CData Sync moves data via incremental replication. Instead of querying the entirety of your data each time, Sync queries only for data that is added or changed since the last job run. Then, Sync merges that data into your data warehouse. This feature greatly reduces the workload, and it minimizes bandwidth use and latency of synchronization, especially when you are working with large data sets.
Many cloud systems use APIs, and pulling full data from those APIs into a data warehouse is often a slow process. In addition, many APIs use daily quotas where you cannot pull all the data daily even if you wanted to, much less every hour or every fifteen minutes. By moving data in increments, Sync gives you tremendous flexibility when you are dealing with slow APIs or daily quotas.
Sync pulls incremental replications by using two main methods: an incremental check column and change data capture. These methods are explained in the following sections.
Incremental Check Column
An incremental check column is either a datetime or integer-based column that Sync uses to identify new or modified records when it is replicating data. Each time that a record is added or updated in the source, the value for this column is increased. Sync uses this column as a criterion during the extraction to ensure that only new or changed records are returned. Then, Sync stores the new maximum value of the column so it can be used in the next replication.
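For example, the extraction query conceptually resembles the following sketch (the table and column names are hypothetical, not literal Sync output):

```sql
-- Only records added or changed since the stored maximum are returned.
-- After the run, Sync saves the new maximum value for the next replication.
SELECT * FROM Accounts
WHERE LastModifiedDate > '2024-05-01 10:30:00';  -- last saved maximum value
```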
Replication with an incremental check column works with two different data types:
- DateTime incremental check columns: A Last Modified or a Date Updated column that represents when the record was last updated.
- Integer-based incremental check columns: An auto-incrementing Id or a rowversion type that increments each time a record is added or updated.
Change Data Capture
Some sources support change data capture (CDC), where a source uses a log file to log events (Insert, Update, or Delete) that cause changes in the database. Rather than querying the source table for changes, Sync reads the log file for any change events. Then, the application extracts those changes for replication and stores the current log position for the next replication.
The sources listed below support CDC capability:
- Informix (Native): Uses enhanced CDC.
- MariaDB: Uses binary logs.
- Microsoft Dynamics 365: Uses change tracking.
- MySQL: Uses binary logs.
- Oracle: Uses Oracle LogMiner or Oracle Flashback. If both methods are enabled on a table, Sync uses Oracle LogMiner.
- PostgreSQL: Uses logical replication.
- SQL Server: Uses either CDC or change tracking. If both methods are enabled on a table, Sync uses CDC.
Functions for Time-Based Incremental Filtering
CData Sync supports two special functions for time-based incremental replication:
- REPLICATE_LASTMODTIME(): Returns the last value that is saved in the Status table. If no value exists, Sync defaults to the replication start date (ReplicateStartDate).
- REPLICATE_NEXTINTERVAL(): Returns the next value, based on the ReplicateInterval and ReplicateIntervalUnit parameters. If those parameters are not set, Sync generates an error.
These functions are useful when the source (for example, Workday RaaS) requires filtering on multiple date prompts, as shown in this example:
WHERE from_date_prompt = REPLICATE_LASTMODTIME()
AND to_date_prompt = REPLICATE_NEXTINTERVAL()
You can use these functions across multiple columns. The REPLICATE_LASTMODTIME() function can be used on its own. However, the REPLICATE_NEXTINTERVAL() function requires both the interval options and REPLICATE_LASTMODTIME() in the same query.
After a replication run, Sync processes filter columns as follows:
- If all filter columns use functions, Sync saves the REPLICATE_NEXTINTERVAL() value.
- If all filter columns are normal columns, Sync saves the maximum value that is returned.
- If the filter columns are mixed, Sync saves the REPLICATE_NEXTINTERVAL() value.
These functions integrate with the standard incremental-replication strategy and are supported across all sources.
Parallel Processing
You can configure CData Sync jobs to use parallel processing, which means that the application uses multiple worker threads to process one job. With parallel processing, Sync can divide its workload into multiple processes, allowing it to move more than one table simultaneously. As a result, more data is moved in less time, greatly increasing job efficiency. With Sync, you can assign as many workers as you want, on a per-job basis.
To enable parallel processing:
- Click the job that you want to run (from the Jobs page in Sync). This action opens the Jobs/YourJobName page.
- On the Overview tab, click the Edit Settings icon in the Settings category. This selection opens the Edit Settings dialog box.
- Select Enable under the Parallel Processing property.
- Enter the number of workers that you want to assign to the job in the Worker Pool field. This value controls how many tasks can run in parallel at one time.
- Click Save Changes (upper right of the Job Settings header bar).
After you save these changes, your job will use parallel processing when you run it.
Metadata Caching
In CData Sync, metadata caching is a mechanism that stores structural information about source and destination tables so that the system can reuse it across jobs. By keeping metadata readily available in memory or local storage, caching eliminates the need for Sync to repeatedly query the source or destination to retrieve table structures, which is especially valuable for sources and destinations with slower storage or network response times. This functionality results in faster file access, quicker data retrieval, and better overall system responsiveness.
In the Sync application, you activate metadata caching on a per-connector basis. Anytime you configure a source or destination connector that supports metadata caching, you can configure that property at the bottom of the Settings tab on the New Connection page for the connector.
You can select from the following options in the Refresh Interval list:
- Never: Metadata never refreshes.
- Start of Job: Metadata refreshes when you start a job.
- Hourly (default): Metadata refreshes after an hour passes.
- Daily: Metadata refreshes after a day passes.
If a connector does not support metadata caching, the Metadata Caching section displays a note to that effect.
Note: Because cached metadata does not automatically detect every schema change in real time, you should refresh the cache in a Sync job whenever the structure of a source or destination table is modified. For example, if columns are added, removed, or have their data types changed, click the Refresh button to update the cache and ensure that Sync has the most up-to-date view of the schema. This practice prevents replication errors and guarantees that transformations, mappings, and validations are applied accurately.
Automatic Job Retry
CData Sync provides an Automatic Job Retry feature that attempts to rerun a job in which one or more tasks fail, whether from task errors or network issues. The Automatic Job Retry process applies only to the failed tasks within a job. For example, if one task fails, the process retries only that one task.
This process works for all failed jobs (except cancelled jobs), but the retry attempt occurs only once per job.
To enable the Automatic Job Retry feature:
- Click your job name on the Jobs page (or select … > Edit) to open the YourJobName Settings page.
- On the Overview tab, click the Edit Settings icon in the Settings category. This selection opens the Edit Settings dialog box.
- Select the Enable checkbox for the Automatic Job Retry setting. Then, click Save to save that setting update.
After the Automatic Job Retry process runs for a failed job, you can check the job details (… > Run Details > Tasks) for information about the retry attempt.
- If the retry attempt is successful, the message RETRY SUCCESSFUL is displayed in the Last Run column for the task or tasks that failed.
- If the retry attempt fails, the message RETRY FAILED appears in the Last Run column next to the task or tasks that failed originally. In addition, a View Details link is displayed for each failed task. Click the down arrow to the left of View Details to display an error message with details about the failure.
Sync Intervals (Data Integrity)
As part of any data-integration strategy, it is important to ensure that data is consistent between the original source and the destination. If an error occurs in your data pipeline, or if a job is interrupted, you need your data pipeline to resume where it stopped. This behavior ensures that no data is lost between updates or in the event of an error. CData Sync manages that automatically, without the need to configure complex data-loading scripts or processes.
Sync processes data in sync intervals. This means that, rather than attempting to move all of your data at once, Sync breaks the data into manageable intervals (or “chunks” of data) and processes one interval at a time, keeping the source and destination tables consistent as it goes. This feature greatly increases performance, and it enables Sync to maintain data integrity in the event of an error: if an error occurs, Sync discards all data from the current interval and can restart processing from that point.
For example, suppose a large sync job is almost completed when an error occurs. Instead of starting the entire job from the beginning, Sync restarts the job from the last successful interval, which saves both time and resources.
Note: Some APIs have access limitations that restrict the number of times you can access them within a given time period. These limitations can cause errors. In the event of such an error, Sync discards the incomplete sync records and begins again from that point at the next scheduled job. You can set the interval size, dictating how much data is pulled in each interval, and limit the amount of data that needs to be moved if an error occurs.
Deletion Captures
CData Sync automatically captures deleted records, which ensures accuracy in your destination. Sync retrieves a list of deleted records from the source by calling the API or by using the Change Tracking feature.
If the source allows Sync to detect data that is deleted, you can control how Sync handles those deletions by using the Deletion Behavior option, as explained in Advanced Job Options:
- Hard Delete (the default): If a deletion is detected in the source, Sync removes that record from the destination table.
- Soft Delete: Sync adds the _cdata_deleted column to your destination table. If a deletion is detected in the source, Sync sets that column to true in the destination (see the sketch after this list).
- Skip Delete: Sync ignores deleted records in the source.
Note: Some APIs do not allow Sync to detect deleted records, as indicated in the Source Information. In these cases, Deletion Behavior is ignored.
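For illustration, here is a hypothetical query against a destination table that is replicated with Soft Delete (the table and column names other than _cdata_deleted are illustrative):

```sql
-- Records deleted in the source remain in the destination, flagged by the
-- _cdata_deleted column that Sync adds.
SELECT Id, Name, _cdata_deleted
FROM Accounts
WHERE _cdata_deleted = true;  -- rows that were deleted in the source
```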
Data Types
CData Sync recognizes a large number of data types and, in situations where the data type is not strictly defined, Sync can infer the data type based on your data. Sync recognizes the following data types:
- Boolean
- Date
- Time
- TimeStamp
- Decimal
- Float
- Double
- SmallInt
- Integer
- Long
- Binary
- Varchar
- GUID
Known Data Types
For many sources—mostly relational databases and some APIs (SQL Server, Oracle, and so on)—Sync automatically detects the data types for the schema. When the column data types for the source are known, Sync automatically creates matching data types in the destination.
Inferred Data Types
In situations where a data type is not specified, Sync can infer the data type by scanning the first few rows of data to determine what the column data type should be.
When Sync detects a string type with an unknown column size, the default size of the column is treated as 2000. In a relational database such as SQL Server, Sync creates a varchar(2000) field for this type.
For fields where the data types are not strictly defined, Sync reads the first row and automatically selects the smallest data type for each column. Then, the application reads the next row and ensures that the data still fits in those data types. If the data does not fit, Sync increases the size of the data type. Sync repeats this process up to the row scan depth (RowScanDepth, typically 50 or 100 rows). When the scan finishes, Sync has determined the data types.
For example, in a source like CSV, Sync uses RowScan to read the first rows of the file and determine dynamically the data types for each column.
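For illustration, row scanning a CSV source might yield a destination table like the following sketch (the table name, column names, and exact DDL are hypothetical and vary by destination):

```sql
CREATE TABLE Contacts (
    Id       INTEGER,        -- every scanned value was a whole number
    Amount   FLOAT,          -- a later row contained a decimal, so the type widened
    Comments VARCHAR(2000)   -- string column of unknown size defaults to 2000
);
```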
Transformations
In data pipelines, transformations are a method for reshaping, cleansing, or aggregating data so that it is ready for reporting or data analysis. Sync supports two common ways to manage data transformations when you build data pipelines:
- ETL: The extract, transform, load (ETL) process has been the traditional approach in analytics for several decades. ETL was designed originally to work with relational databases, which have dominated the market historically. ETL requires transformations to occur before the replication process. Data is extracted from data sources and then deposited into a staging area. Data is then cleaned, enriched, transformed, and loaded into the data warehouse. For more about ETL, see In-Flight ETL.
- ELT: The extract, load, transform (ELT) process is a method of data extraction where modifications to the data, such as transformations, take place after the replication process. Because modern cloud data warehouses offer vastly greater storage and scaling capabilities, you can move data in its entirety and then modify it after the move.
ELT transformations are SQL scripts that are executed in the destination on your data. Transformations use the processing power of the data warehouse to quickly aggregate, join, and clean your data based on your analytics and reporting needs. By organizing your data with transformations and mapping, you can receive your data in the form that is most useful to you as it moves through your pipeline. Similar to jobs, transformations support multiple queries that are separated by a semicolon (;), execute on a schedule, and send email alerts after the transformations finish. For more about ELT, see Post-Job ELT.
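For example, a post-job ELT transformation might resemble the following sketch. The table and column names are hypothetical, and the exact SQL syntax (such as DATE_TRUNC) varies by warehouse:

```sql
-- Two illustrative statements, separated by a semicolon: aggregate the
-- replicated orders into a reporting table, then clean a staging table.
INSERT INTO OrdersByMonth (OrderMonth, TotalAmount)
SELECT DATE_TRUNC('month', OrderDate), SUM(Amount)
FROM Orders
GROUP BY DATE_TRUNC('month', OrderDate);

DELETE FROM Orders_Staging WHERE Amount IS NULL;
```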
The primary difference between ETL and ELT is the order in which the steps take place.
Masking for Personally Identifiable Information
Masking for personally identifiable information (PII) is a data-protection technique that enhances the privacy and security of sensitive information within databases or data tables. PII includes information such as names, addresses, social security numbers, and other data that can be used to identify individuals.
Masking restricts access to specific columns that contain PII by hiding the actual data in those columns. Within Sync, masking is a point-and-click transformation option. When you mask your data in Sync, each character in the data is replaced with an asterisk (*). Masking is also a one-way operation. That is, once you apply masking, you cannot revert the data to its previous state. In addition, masking is crucial for compliance with data-protection regulations such as EU General Data Protection Regulation (GDPR), the US Health Insurance Portability and Accountability Act (HIPAA), and others that mandate the secure handling of personal and sensitive information.
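For illustration, after masking is applied to a hypothetical SSN column, querying the destination returns only asterisks for that column:

```sql
SELECT Name, SSN FROM Contacts;
-- Name     | SSN
-- Jane Doe | ***********   (each character replaced with an asterisk)
```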
For more details about masking and other transformation options, see Applying SQL Transformations.
History Mode
CData Sync history mode provides a method for analyzing historical data in your data sources. History mode implements a slowly changing dimension: it stores and manages relatively static data (current and historical) in a data warehouse, where that data can change slowly (but unpredictably) over time.
In CData, you can use history mode (via the History Mode property) to track the change history for data rows (records) and see how your data changes over time. History mode is available for all CDC and standard jobs, provided that the source connection supports incremental replication.
CData supports a combined approach to analyzing historical data. That is, the History Mode property offers both robust tracking for auditing and support for time-series analysis.
History mode works on a per-table basis. So, you can decide which tables you want to analyze and then activate the option for those tables only.
In standard mode, Sync merges and updates existing rows, whereas in history mode, Sync appends updated rows to the destination table.
History Mode Variations
CData Sync offers two variations of history mode. Both variations preserve historical data, but they differ in how they handle previous versions of a record.
- History Mode: This mode appends new versions of changed records and updates the metadata of previous versions to mark them inactive. That is, Sync maintains a full history of changes for each record in the source table and writes those change versions to the corresponding table in your destination database.
- History Mode - Append Only: This mode appends new versions of changed records but does not update previous versions or their metadata. Older rows remain unchanged in the destination. History Mode - Append Only is enabled automatically for the Avro, CSV, and Parquet destinations. For all other destinations, you must enable it manually.
The following table summarizes the differences between the two modes:
| Feature | History Mode | History Mode – Append Only |
|---|---|---|
| Appends new versions of changed records | ✓ | ✓ |
| Updates metadata for old versions | ✓ | ✗ |
| Previous versions remain unchanged | ✗ | ✓ |
To support history mode, Sync can include up to five columns in the destination table. Standard mode uses three of these columns, change data capture (CDC) jobs in History Mode include all five, and History Mode - Append Only uses just two.
These columns are defined in the following table:
| Column Name | Column Type | Description | Mode That Uses Column |
|---|---|---|---|
| _cdatasync_active | Boolean | Specifies whether a record is active. | Standard Mode, History Mode |
| _cdatasync_start | Datetime | Specifies the datetime value of the incremental check column at the time the data record becomes active. This value indicates when the record was created or modified in the source table, based on a timestamp that increments with each data update. | Standard Mode, History Mode, History Mode - Append Only |
| _cdatasync_end | Datetime | Specifies the datetime value of the incremental check column at the time the data record becomes inactive. A null value in this column indicates that the record is active. | Standard Mode, History Mode |
| _cdatasync_operation | Varchar | Specifies the operation to use: Insert (I), Update (U), or Delete (D). Note: This column applies only when you use change data capture (CDC). | History Mode, History Mode - Append Only |
| _cdatasync_version | Varchar(100) | Specifies the version for each change in the format that is saved in the CSRS table. Note: This column applies only when you use CDC. | History Mode - Append Only |
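For illustration, here is a hypothetical destination table in History Mode after a record (Id 1) was inserted and later updated. The Id and Status columns stand in for source columns:

```sql
SELECT Id, Status, _cdatasync_active, _cdatasync_start, _cdatasync_end
FROM Accounts_History;
-- Id | Status | _cdatasync_active | _cdatasync_start    | _cdatasync_end
-- 1  | Open   | False             | 2024-01-05 09:00:00 | 2024-02-10 14:30:00
-- 1  | Closed | True              | 2024-02-10 14:30:00 | NULL
```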
Restrictions and Limitations
With history mode, the following restrictions apply:
- The source table must support an incremental check column.
- The source table must contain a primary key. (For History Mode - Append Only, a primary key is not required.)
- The incremental check column must be a timestamp (datetime) column.
- The incremental check column cannot be a pseudocolumn because pseudocolumns do not have a value in the response and are used as criteria only.
- The destination table must not already exist. (Use the Drop Table setting on the Advanced tab to re-create the table when history mode is active.)
Activate History Mode and History Mode - Append Only for a Job
To activate history mode for a job in Sync:
- Click the Jobs tab to open the Jobs page.
- Click the name of the job to open its overview page. Then, click the Advanced tab.
- In the Replicate Options category, click the Edit Replicate Options icon.
- Scroll to the bottom of the Edit Replication Options dialog box and locate the History Mode setting. Activate the setting by selecting the Enable checkbox. Once activated, Sync appends a timestamped entry to your destination for every change that occurs in the source. To deactivate history mode for the job, clear the checkbox.
- Click Save to apply your change and close the dialog box.
To activate History Mode - Append Only functionality:
- Follow the steps above to enable History Mode.
- In the Additional Options text box (located below History Mode), enter the HistoryModeAppendOnly=true option.
- Click Save to apply your change and close the dialog box.
Activate History Mode and History Mode - Append Only for a Task
To activate history mode for a task:
- Click the Task tab for your job.
- Click the name of the task that you want to modify. Then, click the Advanced tab.
- In the Replicate Options category, click the Edit icon. Note: Tasks inherit the History Mode status of the job, so this setting displays either Not Enabled (Inherited) or Enabled (Inherited), depending on how it is set for the job at the time. To override the inherited setting, select the Override job settings checkbox at the top of the dialog box. Then continue with the following steps.
- Scroll to the bottom of the Edit Replication Options dialog box and locate the History Mode setting.
- Activate history mode by selecting the Enable checkbox under History Mode. To deactivate history mode for the task, clear the checkbox.
- Click Save to apply your change and close the dialog box.
Note: History mode is deactivated in the task settings if the associated table does not support incremental check columns.
To activate History Mode - Append Only functionality for a task:
- Follow the steps above to enable History Mode.
- Add the HistoryModeAppendOnly=true option in the Additional Options text box that follows the History Mode option.
- Click Save to apply your change and close the dialog box.
Effects of Changing the Source Table
When you change the source table by inserting, updating, or deleting rows, the destination is affected in various ways, as described in the following table:
| Source Change | Destination Effect |
|---|---|
| Inserted Row | A row is added to the destination table. _cdatasync_active is set to True, and _cdatasync_start is set to the incremental check-column value. |
| Updated Row | The current row in the destination table is marked inactive: _cdatasync_active is set to False, and _cdatasync_end is set to the incremental check-column value. A new row that contains the updated values is added, with _cdatasync_active set to True and _cdatasync_start set to the incremental check-column value. |
| Deleted Row | The current row in the destination table is updated. _cdatasync_active is set to False. |
API Access
CData Sync contains a built-in representational state transfer (REST) API that provides a flexible way to manage the application. You can use RESTful API calls to accomplish everything that you can accomplish in the administration console UI.
The REST API comprises the following:
- The Job Management API enables you to create, update, and execute jobs.
- The Connection Management API enables you to list, create, modify, delete, and test connections.
- The User Management API enables you to modify users and to list all users.
For more information about API access, see REST-API.
In-Network Installation
You can run CData Sync anywhere, which makes the application ideal for customers who have some systems that reside in the cloud and other systems that reside in their internal networks. You can install Sync to run inside your network, thereby avoiding exposure of ports over the internet or open firewalls, the need to create VPN connections, and so on.
The ability to run the Sync application anywhere also greatly reduces the amount of latency. Because you can run Sync close to the source or destination, performance of your ETL or ELT jobs is improved.
Schema Changes
Your data changes constantly, and Sync ensures that those changes are represented accurately at all times. In every run, CData Sync compares the source schema to the destination schema to detect differences. If Sync detects a difference in structure between the two schemas, the application modifies the destination schema, as described below, to ensure that the source data fits:
- If the source table contains a column that does not exist in the destination table, Sync alters the destination table by adding the column.
- If the data type in the source table increases in size, Sync alters the destination table by updating the size of the column. Sync increases the column size of a string column (for example, varchar(255) -> varchar(2000)) or the byte size of nonstring columns (for example, smallint -> integer). (See the sketch after this list.)
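For illustration, these changes are conceptually equivalent to the following DDL (hypothetical table and column names; the exact statements vary by destination database):

```sql
ALTER TABLE Accounts ADD Region VARCHAR(255);           -- column added in the source
ALTER TABLE Accounts ALTER COLUMN Notes VARCHAR(2000);  -- string column widened from varchar(255)
```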
Notes:
- Sync never deletes a column from your destination table if that column is removed from your source table.
- Sync never reduces the size of a destination column if the size of the data type is reduced in the source (for example, varchar(2000) -> varchar(255)).