The OceanBase data source supports reading data from and writing data to OceanBase. You can use an OceanBase data source to configure a data synchronization task. This topic describes the data synchronization capabilities that DataWorks provides for OceanBase.
Supported versions
OceanBase Reader and OceanBase Writer support the following OceanBase versions for reading and writing offline data.
OceanBase 2.x
OceanBase 3.x
OceanBase 4.x
Limits
Offline read
OceanBase supports the Oracle and MySQL tenant modes. When you configure the where filter condition and the function columns in the column parameter, the configuration must comply with the SQL syntax of the corresponding tenant mode. Otherwise, the SQL statement may fail to be executed.
You can read data from views.
Do not modify data that is being synchronized during an offline read from OceanBase. Otherwise, data quality issues, such as data duplication or data loss, may occur.
If the data source is configured with Read By Partition, the account used to access the data source must have system permissions.
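For example, function columns must use the dialect of the tenant mode. The following reader-parameter sketch is illustrative only; the column and where values are placeholders, and a MySQL-mode function such as now() would need an Oracle-mode counterpart (for example, sysdate) in an Oracle-mode tenant.

```json
{
    "column": [
        "id",
        "name",
        "now()" // Valid in MySQL tenant mode; use an Oracle-mode function such as sysdate in Oracle tenant mode
    ],
    "where": "gmt_modified >= '2024-01-01'"
}
```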
Offline write
The account used for the task requires at least the permissions to execute the insert into... statement. Other permissions may be required depending on the statements that you specify in the preSql and postSql parameters.
Data is written in batch mode. A write request is initiated after the number of rows reaches a specified threshold.
OceanBase supports the Oracle and MySQL tenant modes. The statements that you specify in the preSql and postSql parameters must comply with the SQL syntax of the corresponding tenant mode. Otherwise, the SQL statements may fail to be executed.
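As an illustrative sketch of dialect-sensitive preSql and postSql statements, where the table name, column names, and filter values are placeholders:

```json
{
    "preSql": [
        "delete from demo_table where ds = '2024-01-01'"
    ],
    "postSql": [
        "update demo_table set sync_time = now() where ds = '2024-01-01'" // now() is MySQL-mode syntax; an Oracle-mode tenant would use a function such as systimestamp
    ]
}
```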
Real-time read
OceanBase is a distributed relational database that integrates data from multiple physically distributed databases into a single logical database. However, when you synchronize data from OceanBase to AnalyticDB for MySQL in real time, you can synchronize data from only a single physical database. You cannot synchronize data from a logical database.
Data sources added using a connection string are not supported for real-time tasks that synchronize an entire database.
The database version must be 3.0 or later for a real-time sync task.
Preparations before data synchronization
Before you synchronize data in DataWorks, you must prepare the OceanBase environment. This ensures that the OceanBase data synchronization task can be configured and run as expected. The following sections describe the required preparations.
Configure a whitelist
Add the VPC CIDR block of the serverless resource group or exclusive resource group for Data Integration to the whitelist of your OceanBase instance. For more information, see Add a whitelist.
Create an account and grant permissions
Create a database logon account. This account must have the required operation permissions on OceanBase. For more information, see Create an account.
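A minimal sketch of what such an account might look like in MySQL tenant mode. The user name, password, and database name are placeholders, and the exact grants depend on the statements that your task runs:

```sql
-- MySQL tenant mode; names and password are placeholders
CREATE USER 'sync_user' IDENTIFIED BY '********';

-- Offline read tasks need at least SELECT on the source tables
GRANT SELECT ON demo_db.* TO 'sync_user';

-- Offline write tasks need at least INSERT; add DELETE or UPDATE
-- if your preSql or postSql statements require them
GRANT INSERT ON demo_db.* TO 'sync_user';
```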
Add a data source
Before you develop a synchronization task in DataWorks, you must add the required data source to DataWorks by following the instructions in Data Source Management. When you add a data source, you can view the infotips of parameters in the DataWorks console to understand their meanings.
Develop a data synchronization task
For information about the entry point and the procedure for configuring a synchronization task, see the following configuration guides.
Configure a batch sync task for a single table
For more information about the procedure, see Configure a task in the codeless UI and Configure a task in the code editor.
For information about all parameters and a script demo for the code editor, see Appendix: Script demo and parameter descriptions.
Configure a real-time sync task for an entire database
For more information about the procedure, see Configure a real-time sync task in DataStudio.
Configure a full and incremental (real-time) read sync task for a single table or an entire database
For more information about the procedure, see Configure a real-time sync task for an entire database.
Appendix: Script demo and parameter descriptions
Configure a batch synchronization task by using the code editor
If you want to configure a batch synchronization task by using the code editor, you must configure the related parameters in the script based on the unified script format requirements. For more information, see Configure a task in the code editor. The following information describes the parameters that you must configure for data sources when you configure a batch synchronization task by using the code editor.
Reader script demo
{
"type": "job",
"steps": [
{
"stepType": "apsaradb_for_OceanBase", // Plugin name
"parameter": {
"datasource": "", // Data source name
"where": "",
"column": [ // Fields
"id",
"name"
],
"splitPk": ""
},
"name": "Reader",
"category": "reader"
},
{
"stepType": "stream",
"parameter": {
"print": false,
"fieldDelimiter": ","
},
"name": "Writer",
"category": "writer"
}
],
"version": "2.0",
"order": {
"hops": [
{
"from": "Reader",
"to": "Writer"
}
]
},
"setting": {
"errorLimit": {
"record": "0" // Number of error records
},
"speed": {
"throttle": true, // Specifies whether to enable throttling. If set to true, throttling is enabled. If set to false, throttling is disabled and the mbps parameter does not take effect.
"concurrent": 1, // Number of concurrent jobs
"mbps":"12"// Throttling rate. 1 mbps = 1 MB/s.
}
}
}
Reader script parameters
Parameter | Description | Required | Default value |
datasource | The name of the ApsaraDB for OceanBase data source that you added. You can use this parameter if your DataWorks version supports adding ApsaraDB for OceanBase data sources. Alternatively, you can configure the connection by using the jdbcUrl and username parameters. | Yes | None |
jdbcUrl | The JDBC connection information of the source database. Use a JSON array to specify multiple connection addresses for a database. If you configure multiple addresses, ApsaraDB for OceanBase Reader probes the connectivity of the addresses in sequence until a valid address is found. If all connections fail, ApsaraDB for OceanBase Reader reports an error. Note: The jdbcUrl parameter must be included in the connection configuration unit. The jdbcUrl format must comply with the official ApsaraDB for OceanBase specifications. You can also specify connection attachment control information. | No | None |
username | The username for the data source. | No | None |
password | The password for the specified username. | No | None |
table | The tables to synchronize. Use a JSON array to specify multiple tables. If you configure multiple tables, make sure that the tables have the same schema. ApsaraDB for OceanBase Reader does not check whether the tables have a unified logical structure. Note: The table parameter must be included in the connection configuration unit. | Yes | None |
column | The set of columns to synchronize from the configured tables. Use a JSON array to describe the fields. By default, all columns are used. Example: [ * ]. | Yes | None |
splitPk | If you specify the splitPk parameter, ApsaraDB for OceanBase Reader uses the specified field to shard the data, and Data Integration runs parallel threads to read the shards. This way, data can be synchronized more efficiently. | No | Empty |
where | ApsaraDB for OceanBase Reader concatenates an SQL statement based on the specified column, table, and where parameters, and uses the statement to extract data. For example, you can set the where condition to limit 10 for testing. In actual business scenarios, you typically synchronize data for the current day by setting the where condition to a date-based filter. | No | None |
querySql | In some business scenarios, the where parameter is not sufficient to describe the filter conditions. In such cases, you can use this parameter to specify a custom filter SQL statement. If you configure querySql, ApsaraDB for OceanBase Reader ignores the table, column, where, and splitPk parameters and uses the content of querySql to filter data. | No | None |
fetchSize | The number of data records to retrieve in each batch. A larger value reduces the number of network interactions between Data Integration and the server and can significantly improve data extraction performance. Note: A fetchSize value greater than 2048 may cause an out-of-memory (OOM) error for the data synchronization process. | No | 1,024 |
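The reader demo above identifies the source by the datasource parameter and does not show the connection configuration unit. The following hedged sketch assembles the jdbcUrl, username, table, and related parameters described in this table. The host, port, credentials, and table names are placeholders, and the URL format should be checked against the official ApsaraDB for OceanBase specifications.

```json
{
    "stepType": "apsaradb_for_OceanBase",
    "parameter": {
        "username": "sync_user", // Placeholder account
        "password": "********",
        "column": ["id", "name"],
        "where": "gmt_modified >= '2024-01-01'", // Illustrative filter
        "splitPk": "id",
        "fetchSize": 1024,
        "connection": [
            {
                "table": ["demo_table"],
                "jdbcUrl": ["jdbc:oceanbase://demo-host:2883/demo_db"] // Placeholder address
            }
        ]
    },
    "name": "Reader",
    "category": "reader"
}
```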
Writer script demo
{
"type":"job",
"version":"2.0",// Version number.
"steps":[
{
"stepType":"stream",
"parameter":{},
"name":"Reader",
"category":"reader"
},
{
"stepType":"apsaradb_for_OceanBase",// Plugin name.
"parameter":{
"datasource": "Data source name",
"column": [// Fields.
"id",
"name"
],
"table": "apsaradb_for_OceanBase_table",// Table name.
"preSql": [ // The SQL statement to execute before the data synchronization task runs.
"delete from @table where db_id = -1"
],
"postSql": [// The SQL statement to execute after the data synchronization task runs.
"update @table set db_modify_time = now() where db_id = 1"
],
"obWriteMode": "insert",
},
"name":"Writer",
"category":"writer"
}
],
"setting":{
"errorLimit":{
"record":"0"// Number of error records.
},
"speed":{
"throttle":true,// Specifies whether to enable throttling. If set to true, throttling is enabled. If set to false, throttling is disabled and the mbps parameter does not take effect.
"concurrent":1, // Number of concurrent jobs.
"mbps":"12"// Throttling rate. 1 mbps = 1 MB/s.
}
},
"order":{
"hops":[
{
"from":"Reader",
"to":"Writer"
}
]
}
}
Writer script parameters
Parameter | Description | Required | Default value |
datasource | The name of the ApsaraDB for OceanBase data source that you added. You can use this parameter if your DataWorks version supports adding ApsaraDB for OceanBase data sources. Alternatively, you can configure the connection by using the jdbcUrl and username parameters. | No | None |
jdbcUrl | The JDBC connection information of the destination database. Note: The jdbcUrl parameter must be included in the connection configuration unit. | Yes | None |
username | The username for the data source. | Yes | None |
password | The password for the specified username. | Yes | None |
table | The name of the table to which you want to write data. Use a JSON array to specify the table name. Note The table parameter must be included in the connection configuration unit. | Yes | None |
column | The fields in the destination table to which you want to write data. Separate the fields with commas (,). Example: "id", "name". Note: You must specify the column parameter. It cannot be empty. | Yes | None |
obWriteMode | The mode used to write data to the destination table. | No | insert |
onClauseColumns | The field or fields used to determine write conflicts. This parameter is used in Oracle tenant mode. Set this parameter to the primary key field or a unique constraint field. To specify multiple fields, separate them with commas (,). | No | None |
obUpdateColumns | The fields to update when a data write conflict occurs. To specify multiple fields, separate them with commas (,). | No | All fields |
preSql | The standard statement to execute before data is written to the destination table. If the SQL statement needs to operate on the destination table, use @table to reference it, as shown in the writer script demo. | No | None |
postSql | The standard statement to execute after data is written to the destination table. | No | None |
batchSize | The number of records to commit in each batch. A larger value reduces the number of network interactions between the data synchronization system and the server and improves overall throughput. Note: A large batchSize value (greater than 2048) may cause an out-of-memory (OOM) error during data synchronization. | No | 1,024 |
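To tie the conflict-handling parameters together, the following is a hedged sketch of a writer parameter fragment for an Oracle-mode tenant. The field names are placeholders, and onClauseColumns should be validated against the destination table's primary key or unique constraints.

```json
{
    "obWriteMode": "update", // Update conflicting rows instead of inserting duplicates
    "onClauseColumns": "id", // Placeholder: the primary key or unique constraint field
    "obUpdateColumns": "name,gmt_modified" // Placeholder: the fields to update on conflict
}
```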