Sync TiDB Data Offline via MySQL Protocol - DataWorks

The TiDB data source enables you to read data offline. This topic describes how DataWorks supports data synchronization for TiDB.

Supported TiDB versions

Offline read: TiDB database versions 7.x and 8.x are supported.
Offline write: Not supported.
Real-time read and write: Not supported.

Note

TiDB is highly compatible with the MySQL protocol and supports the common features and syntax of MySQL 5.7 and MySQL 8.0. DataWorks uses the MySQL protocol to read data from TiDB for offline data synchronization. For more information about the compatibility between TiDB and MySQL, see Compatibility with MySQL.

Supported field types

For more information about all supported TiDB field types, see Data Type Overview. The following table lists the support for major field types.

Field type	Offline read (TiDB Reader)
TINYINT	Supported
SMALLINT	Supported
MEDIUMINT	Support
INTEGER	Support
BIGINT	Supported
FLOAT	Supported
DOUBLE	Support
DECIMAL	Supported
CHAR	Supported
VARCHAR	Support
JSON	Support
TEXT	Support
TINYTEXT	Support
MEDIUMTEXT	Support
LONGTEXT	Support
VARBINARY	Support
BINARY	Support
BLOB	Support
TINYBLOB	Supported
MEDIUMBLOB	Support
LONGBLOB	Support
ENUM	Support
SET	Support
BOOLEAN	Supported
BIT	Support
DATE	Support
DATETIME	Support
TIMESTAMP	Support
TIME	Support
YEAR	Support

Prerequisites: Prepare the TiDB environment

Before you synchronize data in DataWorks, you must prepare the TiDB environment. This ensures that you can correctly configure and run TiDB data synchronization tasks. The following sections describe the preparation steps.

Confirm the TiDB version

Data Integration requires TiDB version 7.x or 8.x. Ensure that your TiDB database meets this version requirement. To check the current version, you can run the following statement in your TiDB database.

SELECT TIDB_VERSION()\G

*************************** 1. row ***************************
TIDB_VERSION(): Release Version: v8.1.1
Edition: Community
Git Commit Hash: 821e491a20fbab36604b36b647b5bae26a2c1418
Git Branch: HEAD
UTC Build Time: 2024-08-27 19:16:25
GoVersion: go1.21.10
Race Enabled: false
Check Table Before Drop: false
Store: tikv
1 row in set (0.00 sec)

Configure account permissions

Create a dedicated TiDB account for DataWorks to access the data source.

(Optional) Create an account. For more information, see User Account Management.
Configure permissions.
For offline data reads from TiDB, this account must have the read (SELECT) permission on the tables that you want to synchronize.
You can run the following commands to grant permissions to the account or grant the SUPER permission directly. In the following statements, replace 'sync_account' with the account that you created.
```
-- CREATE USER 'sync_account'@'%' IDENTIFIED BY 'password'; // Create a sync account and set a password. This allows the account to log on to the database from any host. The percent sign (%) indicates any host.
GRANT SELECT ON *.* TO 'sync_account'@'%'; // Grant the SELECT permission on the database to the sync account.
```
*.* grants the account the specified permissions on all tables in all databases. You can also grant the account the permissions on a specific table in a target database. For example, to grant the account the permissions on the user table in the test database, you can run the GRANT SELECT ON test.user TO 'sync_account'@'%'; statement.
Note
The REPLICATION SLAVE statement grants a global permission. You cannot use this statement to grant permissions on a specific table in a target database to the sync account.

Add a data source

Before you develop a synchronization task in DataWorks, you must add the required data source to DataWorks by following the instructions in Data source management. You can view parameter descriptions in the DataWorks console to understand the meanings of the parameters when you add a data source.

Appendix: TiDB script demo and parameter descriptions

Configure a batch synchronization task by using the code editor

If you want to configure a batch synchronization task by using the code editor, you must configure the related parameters in the script based on the unified script format requirements. For more information, see Configure a task in the code editor. The following information describes the parameters that you must configure for data sources when you configure a batch synchronization task by using the code editor.

Reader script demo

The following code provides a sample script for reading data:

Note

The comments in the following JSON sample are for demonstration purposes only. You must remove the comments when you configure the script.

{
  "type": "job",
  "version": "2.0",
  "steps":
  [
    {
      "stepType": "tidb",
      "parameter":
      {
        "column":
        [
          "id",
          "name"
        ],
        "where": "",
        "splitPk": "id",
        "connection":
        [
          {
            "selectedDatabase": "test_database",
            "datasource": "test_datasource",
            "table":
            [
              "test_table"
            ]
          }
        ]
      },
      "name": "Reader",
      "category": "reader"
    },
    {
      "stepType": "odps",
      "parameter":
      {
      },
      "name": "Writer",
      "category": "writer"
    }
  ],
  "setting":
  {
    "errorLimit":
    {
      "record": "0"
    },
    "speed":
    {
      "throttle": false,
      "concurrent": 3
    }
  },
  "order":
  {
    "hops":
    [
      {
        "from": "Reader",
        "to": "Writer"
      }
    ]
  }
}

Reader script parameters

Script parameter name	Description	Required	Default value
datasource	The name of the data source. The code editor supports adding data sources. The value of this parameter must be the same as the name of the added data source.	Yes	None
table	The name of the table from which to synchronize data. The following examples show advanced uses of the table parameter to configure a range: You can read data from sharded tables by configuring a range. For example, `'table_[0-99]'` specifies that data is read from `'table_0'`, `'table_1'`, `'table_2'`, up to `'table_99'`. If the numeric suffixes of your table names have the same length, such as `'table_000'`, `'table_001'`, `'table_002'`, up to `'table_999'`, you can set the parameter to `'"table": ["table_00[0-9]", "table_0[10-99]", "table_[100-999]"]'`.	Yes	None
column	The columns to synchronize from the specified table. Use a JSON array to describe the fields. To synchronize all columns, set this parameter to `["*"]`. Column pruning: You can select specific columns to export. Column reordering: You can export columns in an order different from the table schema. Constant configuration: You must follow the MySQL SQL syntax. Example: `["id","table","1","'test_constant'","'null'","to_char(a+1)","2.3","true"]` . `id` is a regular column name. `table` is a column name that is also a reserved word. `1` is an integer constant. `'test_constant'` is a string constant. Note that it must be enclosed in single quotation marks. About null: `""` indicates an empty string. `null` indicates a null value. `'null'` indicates the string "null". The `to_char(a+1)` function calculates string length. `2.3` is a floating-point number. `true` is a Boolean value. The `column` parameter must explicitly specify the set of columns to synchronize. It cannot be empty.	Yes	None
splitPk	When TiDB Reader extracts data, if you specify splitPk, the system uses the specified field to partition the data. This enables concurrent tasks for data synchronization and improves efficiency. Use the primary key of the table for splitPk. Primary keys are usually distributed evenly. This helps prevent data hot spots in the resulting shards. Currently, splitPk supports only integer-based data partitioning. It does not support other data types such as strings, floating-point numbers, or dates. If you specify an unsupported data type, the splitPk feature is ignored and data is synchronized through a single channel. If you do not specify splitPk, or if its value is empty, data is synchronized from the table through a single channel.	No	None
where	The filter condition. In many business scenarios, you might synchronize data for the current day. To do this, set the where condition to `gmt_create>$bizdate`. The where condition allows for efficient incremental data synchronization. If you do not specify a where clause, or if the key or value is not provided, all data is synchronized. The limit syntax is not supported in the where condition.	No	None