DataX is an open source offline data synchronization tool provided by Alibaba Group. It supports reading from relational databases such as MySQL and Oracle, Hadoop Distributed File System (HDFS), Hive, MaxCompute, HBase, FTP, and more. Use DataX Doris Writer to write the data to ApsaraDB for SelectDB.
Data flow:
Source (e.g., MySQL) -> mysqlreader -> DataX channel -> doriswriter -> ApsaraDB for SelectDB
Prerequisites
Before you begin, make sure that you have:
- Maven installed
- Python 3.6 or later installed
- An ApsaraDB for SelectDB instance
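The prerequisites can be checked with a short optional script before you start; this is a sketch, not part of DataX. It uses `shutil.which` to look for the `mvn` executable, which assumes Maven is on your `PATH`:

```python
import shutil
import sys

def check_python(version=sys.version_info):
    """Return True if the interpreter meets the Python 3.6+ requirement."""
    return (version[0], version[1]) >= (3, 6)

def check_maven():
    """Return True if the mvn executable is found on PATH."""
    return shutil.which("mvn") is not None

print("Python 3.6+:", check_python())
print("Maven found:", check_maven())
```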
Import data from MySQL to ApsaraDB for SelectDB
The following example walks through a complete import on Linux, using a MySQL database as the source.
Step 1: Set up the DataX environment
1. Download the DataX package from the Doris extension on GitHub.

2. Run the initialization script:

   ```shell
   sh init-env.sh
   ```

3. Compile the DataX project. If you need the `hdfsreader`, `hdfswriter`, `ossreader`, or `osswriter` plug-ins, additional JAR packages are required. To skip these plug-ins, remove their modules from `DataX/pom.xml` before compiling.

   ```shell
   cd DataX/
   mvn package assembly:assembly -Dmaven.test.skip=true
   ```

   The compiled output is stored in the `target/datax/datax/` directory.

4. Compile the `mysqlreader` and `doriswriter` plug-ins separately:

   ```shell
   mvn clean install -pl plugin-rdbms-util,mysqlreader,doriswriter -DskipTests
   ```
Step 2: Prepare test data in MySQL
1. Create a test table in MySQL:

   ```sql
   CREATE TABLE `employees` (
     `emp_no` int NOT NULL,
     `birth_date` date NOT NULL,
     `first_name` varchar(14) NOT NULL,
     `last_name` varchar(16) NOT NULL,
     `gender` enum('M','F') NOT NULL,
     `hire_date` date NOT NULL,
     PRIMARY KEY (`emp_no`)
   ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb3;
   ```

2. Generate test data using Data Management (DMS).
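If DMS is not available, sample rows for the `employees` table can also be produced with a short script. This is an optional sketch; the names and date ranges below are made up for illustration:

```python
import random
from datetime import date, timedelta

def random_date(start, end):
    """Pick a uniformly random date between start and end (inclusive)."""
    return start + timedelta(days=random.randrange((end - start).days + 1))

def generate_inserts(n):
    """Yield INSERT statements matching the employees schema above."""
    for emp_no in range(1, n + 1):
        birth = random_date(date(1960, 1, 1), date(1995, 12, 31))
        hire = random_date(date(2010, 1, 1), date(2023, 12, 31))
        gender = random.choice(["M", "F"])
        yield (
            "INSERT INTO employees VALUES "
            f"({emp_no}, '{birth}', 'first_{emp_no}', 'last_{emp_no}', '{gender}', '{hire}');"
        )

for stmt in generate_inserts(3):
    print(stmt)
```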
Step 3: Create the destination table in ApsaraDB for SelectDB
1. Connect to your ApsaraDB for SelectDB instance over the MySQL protocol. For details, see Connect to an instance.

2. Create a database and a destination table:

   ```sql
   CREATE DATABASE test_db;
   USE test_db;
   CREATE TABLE employees (
     emp_no int NOT NULL,
     birth_date date,
     first_name varchar(20),
     last_name varchar(20),
     gender char(2),
     hire_date date
   )
   UNIQUE KEY(`emp_no`)
   DISTRIBUTED BY HASH(`emp_no`) BUCKETS 1;
   ```
Step 4: Run the DataX import job
1. Apply for a public endpoint for your ApsaraDB for SelectDB instance. For details, see Apply for or release a public endpoint.

2. Add the public IP address of the DataX host to the IP address whitelist of the instance. For details, see Configure an IP address whitelist.

3. Create the job configuration file `mysqlToSelectDB.json`:

   ```json
   {
     "job": {
       "content": [
         {
           "reader": {
             "name": "mysqlreader",
             "parameter": {
               "column": ["emp_no", "birth_date", "first_name", "last_name", "gender", "hire_date"],
               "where": "emp_no>0",
               "connection": [
                 {
                   "jdbcUrl": ["jdbc:mysql://host:port/test_db?useUnicode=true&allowPublicKeyRetrieval=true&characterEncoding=utf-8"],
                   "table": ["employees"]
                 }
               ],
               "password": "123456",
               "splitPk": "emp_no",
               "username": "admin"
             }
           },
           "writer": {
             "name": "doriswriter",
             "parameter": {
               "loadUrl": ["selectdb-cn-xxx-public.selectdbfe.rds.aliyuncs.com:8080"],
               "loadProps": {
                 "format": "json",
                 "strip_outer_array": "true"
               },
               "column": ["emp_no", "birth_date", "first_name", "last_name", "gender", "hire_date"],
               "username": "admin",
               "password": "123456",
               "postSql": [],
               "preSql": [],
               "connection": [
                 {
                   "jdbcUrl": "jdbc:mysql://selectdb-cn-xxx-public.selectdbfe.rds.aliyuncs.com:9030/test_db",
                   "table": ["employees"],
                   "selectedDatabase": "test_db"
                 }
               ],
               "maxBatchRows": 1000000,
               "batchSize": 536870912000
             }
           }
         }
       ],
       "setting": {
         "errorLimit": {
           "percentage": 0.02,
           "record": 0
         },
         "speed": {
           "channel": 5
         }
       }
     }
   }
   ```

4. Submit the job:

   ```shell
   cd target/datax/datax/bin
   python datax.py ../mysqlToSelectDB.json
   ```
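Before submitting, it can help to verify that the writer section of the configuration file contains the required keys. The snippet below is an optional sanity check, not a DataX feature; the key names follow the configuration above:

```python
import json

# Required doriswriter keys, per the Parameters table in this topic.
REQUIRED_WRITER_KEYS = {"loadUrl", "username", "password", "column", "connection"}

def missing_writer_keys(config):
    """Return the required keys absent from job.content[0].writer.parameter."""
    writer = config["job"]["content"][0]["writer"]["parameter"]
    return sorted(REQUIRED_WRITER_KEYS - writer.keys())

# Example: a stripped-down config that is missing the password key.
sample = {
    "job": {
        "content": [
            {
                "writer": {
                    "parameter": {
                        "loadUrl": ["host:8080"],
                        "username": "admin",
                        "column": ["emp_no"],
                        "connection": [],
                    }
                }
            }
        ]
    }
}
print(missing_writer_keys(sample))  # → ['password']
```

In practice you would load the real file with `json.load(open("mysqlToSelectDB.json"))` and fail fast if the returned list is non-empty.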
Parameters
The following table describes the `doriswriter` parameters. The `jdbcUrl` and `loadUrl` parameters accept either the virtual private cloud (VPC) endpoint or the public endpoint of the ApsaraDB for SelectDB instance. Use the VPC endpoint if the DataX host and the instance are in the same VPC. Otherwise, use the public endpoint.
To find the endpoints, log on to the ApsaraDB for SelectDB console, go to the Instance Details page, and check the Network Information section on the Basic Information tab.
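For illustration, the two connection strings can be assembled from one endpoint with a small helper. This helper is hypothetical; the default ports (9030 for MySQL, 8080 for HTTP) follow the examples in this topic and may differ for your instance:

```python
def build_urls(endpoint, database, mysql_port=9030, http_port=8080):
    """Build the jdbcUrl and loadUrl strings from a single endpoint.

    Check the Network Information section of your instance for the
    actual endpoint and port values.
    """
    jdbc_url = f"jdbc:mysql://{endpoint}:{mysql_port}/{database}"
    load_url = f"{endpoint}:{http_port}"
    return jdbc_url, load_url

jdbc, load = build_urls("selectdb-cn-xxx-public.selectdbfe.rds.aliyuncs.com", "test_db")
print(jdbc)  # jdbc:mysql://selectdb-cn-xxx-public.selectdbfe.rds.aliyuncs.com:9030/test_db
print(load)  # selectdb-cn-xxx-public.selectdbfe.rds.aliyuncs.com:8080
```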
| Parameter | Required | Default value | Description |
|---|---|---|---|
| `jdbcUrl` | Yes | None | Java Database Connectivity (JDBC) URL for the instance. Format: `jdbc:mysql://<endpoint>:<MySQL-port>/<database>`. Example: `jdbc:mysql://selectdb-cn-4xl3jv1****.selectdbfe.rds.aliyuncs.com:9030/test_db`. |
| `loadUrl` | Yes | None | HTTP endpoint for Stream Load. Format: `<endpoint>:<HTTP-port>`. Example: `selectdb-cn-4xl3jv1****.selectdbfe.rds.aliyuncs.com:8080`. |
| `username` | Yes | None | Username for the instance. |
| `password` | Yes | None | Password for the username. |
| `connection.selectedDatabase` | Yes | None | Name of the destination database. |
| `connection.table` | Yes | None | Name of the destination table. |
| `column` | Yes | None | Columns to write data to. Separate multiple columns with commas. Example: `["id", "name", "age"]`. |
| `preSql` | No | None | SQL statements to run before data is written. |
| `postSql` | No | None | SQL statements to run after data is written. |
| `maxBatchRows` | No | 500000 | Maximum number of rows per batch. A batch is flushed when either `maxBatchRows` or `batchSize` is reached first. |
| `batchSize` | No | 104857600 (100 MB) | Maximum data size per batch. A batch is flushed when either `batchSize` or `maxBatchRows` is reached first. |
| `maxRetries` | No | 3 | Maximum number of retries after a batch import fails. |
| `labelPrefix` | No | `datax_doris_writer_` | Prefix for the import label. Each batch uses a unique label composed of this prefix and a UUID to prevent duplicate imports. |
| `loadProps` | No | None | Additional Stream Load parameters. For the full list of supported parameters, see the Parameters section in "Import data by using Stream Load". |
| `flushInterval` | No | 30000 | Interval between batch writes, in milliseconds. |
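The interaction between `maxBatchRows`, `batchSize`, and `labelPrefix` can be sketched as a simplified model. This is illustrative only, not the actual `doriswriter` implementation:

```python
import uuid

def should_flush(rows, bytes_buffered, max_batch_rows=500000, batch_size=104857600):
    """A batch is flushed when either threshold is reached, whichever comes first."""
    return rows >= max_batch_rows or bytes_buffered >= batch_size

def new_label(prefix="datax_doris_writer_"):
    """Each batch gets a unique label (prefix + UUID) so retries do not double-import."""
    return prefix + uuid.uuid4().hex

print(should_flush(500000, 0))       # True: row-count threshold reached
print(should_flush(10, 104857600))   # True: byte-size threshold reached
print(should_flush(10, 1024))        # False: keep buffering
print(new_label())
```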
Configure data formats
By default, DataX Doris Writer converts data to CSV format, using `\t` as the column delimiter and `\n` as the line delimiter.
Use custom delimiters
To change the delimiters, set them in `loadProps`. Use non-printable characters such as `\x01` and `\x02` to avoid conflicts with data that contains tabs or newlines:
```json
"loadProps": {
    "format": "csv",
    "column_separator": "\\x01",
    "line_delimiter": "\\x02"
}
```
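The snippet below illustrates why non-printable separators are safer when field values can contain tabs. Note that `loadProps` uses the escaped form (`\\x01`), while the Python string below holds the literal byte:

```python
# A row whose third field contains an embedded tab character.
row = ["10001", "1987-05-01", "O'Reilly\tJr.", "Smith", "M", "2020-01-15"]

# With the default tab separator, the embedded tab corrupts the row:
tab_encoded = "\t".join(row)
print(len(tab_encoded.split("\t")))    # 7 columns instead of 6

# With \x01 as the separator, the row survives intact:
sep_encoded = "\x01".join(row)
print(len(sep_encoded.split("\x01")))  # 6
```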
Use JSON format
To import data in JSON format instead:
```json
"loadProps": {
    "format": "json",
    "strip_outer_array": true
}
```
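With this setting, each batch is posted as a single JSON array of row objects, and `strip_outer_array` tells the server to strip the outer array so that every element becomes one row. A small sketch of the payload shape:

```python
import json

# Two rows batched into one JSON array, as sent by the writer.
batch = [
    {"emp_no": 10001, "first_name": "Georgi"},
    {"emp_no": 10002, "first_name": "Bezalel"},
]
payload = json.dumps(batch)
print(payload)

# With strip_outer_array enabled, each element of the array is imported as a row.
parsed = json.loads(payload)
print(len(parsed))  # 2 rows
```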