
ApsaraDB for SelectDB: Import data using DataX

Last Updated: Mar 30, 2026

DataX is an open-source offline data synchronization tool provided by Alibaba Group. It can read from relational databases such as MySQL and Oracle, as well as from Hadoop Distributed File System (HDFS), Hive, MaxCompute, HBase, FTP, and other sources. You can use the DataX Doris Writer plug-in to write the data that DataX reads to ApsaraDB for SelectDB.

Data flow:

Source (e.g., MySQL) -> mysqlreader -> DataX channel -> doriswriter -> ApsaraDB for SelectDB

Prerequisites

Before you begin, make sure that you have:

  • Maven installed

  • Python 3.6 or later installed

  • An ApsaraDB for SelectDB instance
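Before starting, you can confirm that the build tools are on your PATH. The following is a minimal sketch of such a check; the tool names match the prerequisites above, but adjust them for your system (for example, python3 instead of python):

```shell
# check_tool: report whether a required tool is available on PATH.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "$1 found"
  else
    echo "$1 missing"
  fi
}

check_tool mvn     # Maven compiles the DataX project
check_tool python  # Python runs datax.py
```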

Import data from MySQL to ApsaraDB for SelectDB

The following example walks through a complete import on Linux, using a MySQL database as the source.

Step 1: Set up the DataX environment

  1. Download the DataX package from the Doris extension on GitHub.

  2. Run the initialization script:

    sh init-env.sh
  3. Compile the DataX project:

    If you need the hdfsreader, hdfswriter, ossreader, or osswriter plug-ins, you must obtain additional JAR packages. If you do not need these plug-ins, remove their modules from DataX/pom.xml before you compile.
    cd DataX/
    mvn package assembly:assembly -Dmaven.test.skip=true

    The compiled output is stored in the target/datax/datax/ directory.

  4. Compile the mysqlreader and doriswriter plug-ins separately:

    mvn clean install -pl plugin-rdbms-util,mysqlreader,doriswriter -DskipTests

Step 2: Prepare test data in MySQL

  1. Create a test table in MySQL:

    CREATE TABLE `employees` (
      `emp_no`     int          NOT NULL,
      `birth_date` date         NOT NULL,
      `first_name` varchar(14)  NOT NULL,
      `last_name`  varchar(16)  NOT NULL,
      `gender`     enum('M','F') NOT NULL,
      `hire_date`  date         NOT NULL,
      PRIMARY KEY (`emp_no`)
    ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb3;
  2. Generate test data using Data Management (DMS).
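If you prefer not to use DMS, you can generate INSERT statements yourself. The following is a minimal sketch with synthetic values; the column order matches the employees table defined above:

```python
# Generate INSERT statements for the employees test table.
# Values are synthetic and for illustration only.
import datetime

def make_inserts(n):
    rows = []
    for emp_no in range(1, n + 1):
        # Derive valid dates deterministically from emp_no (day capped at 28).
        birth = datetime.date(1960 + emp_no % 30, 1 + emp_no % 12, 1 + emp_no % 28)
        hire = datetime.date(1990 + emp_no % 20, 1 + emp_no % 12, 1 + emp_no % 28)
        gender = 'M' if emp_no % 2 else 'F'
        rows.append(
            "INSERT INTO employees VALUES "
            f"({emp_no}, '{birth}', 'First{emp_no}', 'Last{emp_no}', '{gender}', '{hire}');"
        )
    return rows

print("\n".join(make_inserts(3)))
```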

Step 3: Create the destination table in ApsaraDB for SelectDB

  1. Connect to your ApsaraDB for SelectDB instance over the MySQL protocol. For details, see Connect to an instance.

  2. Create a database and a destination table:

    CREATE DATABASE test_db;
    
    USE test_db;
    CREATE TABLE employees (
        emp_no     int NOT NULL,
        birth_date date,
        first_name varchar(20),
        last_name  varchar(20),
        gender     char(2),
        hire_date  date
    )
    UNIQUE KEY(`emp_no`)
    DISTRIBUTED BY HASH(`emp_no`) BUCKETS 1;

Step 4: Run the DataX import job

  1. Apply for a public endpoint for your ApsaraDB for SelectDB instance. For details, see Apply for or release a public endpoint.

  2. Add the public IP address of the DataX host to the IP address whitelist of the instance. For details, see Configure an IP address whitelist.

  3. Create the job configuration file mysqlToSelectDB.json:

    {
      "job": {
        "content": [
          {
            "reader": {
              "name": "mysqlreader",
              "parameter": {
                "column": [
                  "emp_no", "birth_date", "first_name",
                  "last_name", "gender", "hire_date"
                ],
                "where": "emp_no>0",
                "connection": [
                  {
                    "jdbcUrl": [
                      "jdbc:mysql://host:port/test_db?useUnicode=true&allowPublicKeyRetrieval=true&characterEncoding=utf-8"
                    ],
                    "table": ["employees"]
                  }
                ],
                "password": "123456",
                "splitPk": "emp_no",
                "username": "admin"
              }
            },
            "writer": {
              "name": "doriswriter",
              "parameter": {
                "loadUrl": [
                  "selectdb-cn-xxx-public.selectdbfe.rds.aliyuncs.com:8080"
                ],
                "loadProps": {
                  "format": "json",
                  "strip_outer_array": "true"
                },
                "column": [
                  "emp_no", "birth_date", "first_name",
                  "last_name", "gender", "hire_date"
                ],
                "username": "admin",
                "password": "123456",
                "postSql": [],
                "preSql": [],
                "connection": [
                  {
                    "jdbcUrl": "jdbc:mysql://selectdb-cn-xxx-public.selectdbfe.rds.aliyuncs.com:9030/test_db",
                    "table": ["employees"],
                    "selectedDatabase": "test_db"
                  }
                ],
                "maxBatchRows": 1000000,
                "batchSize": 536870912000
              }
            }
          }
        ],
        "setting": {
          "errorLimit": {
            "percentage": 0.02,
            "record": 0
          },
          "speed": {
            "channel": 5
          }
        }
      }
    }
  4. Submit the job:

    cd target/datax/datax/bin
    python datax.py ../mysqlToSelectDB.json
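A malformed job file is a common cause of failed runs. The following hypothetical helper sanity-checks that the doriswriter section of a job file contains the required keys before you submit it; the key list reflects the parameters used in the example above:

```python
# Sanity-check a DataX job file before submission (hypothetical helper,
# not part of DataX itself).
import json

REQUIRED_WRITER_KEYS = {"loadUrl", "username", "password", "column", "connection"}

def check_job(path):
    """Return a sorted list of required doriswriter keys missing from the job file."""
    with open(path) as f:
        job = json.load(f)
    writer_params = job["job"]["content"][0]["writer"]["parameter"]
    return sorted(REQUIRED_WRITER_KEYS - writer_params.keys())
```

An empty return value means all required keys are present.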

Parameters

The following table describes the doriswriter parameters. The jdbcUrl and loadUrl parameters accept either the virtual private cloud (VPC) endpoint or the public endpoint of the ApsaraDB for SelectDB instance. Use the VPC endpoint if the DataX host and the instance are in the same VPC. Otherwise, use the public endpoint.

To find the endpoints, log on to the ApsaraDB for SelectDB console, go to the Instance Details page, and check the Network Information section on the Basic Information tab.

| Parameter | Required | Default value | Description |
| --- | --- | --- | --- |
| jdbcUrl | Yes | None | Java Database Connectivity (JDBC) URL for the instance. Format: jdbc:mysql://<endpoint>:<MySQL-port>/<database>. Example: jdbc:mysql://selectdb-cn-4xl3jv1****.selectdbfe.rds.aliyuncs.com:9030/test_db |
| loadUrl | Yes | None | HTTP endpoint for Stream Load. Format: <endpoint>:<HTTP-port>. Example: selectdb-cn-4xl3jv1****.selectdbfe.rds.aliyuncs.com:8080 |
| username | Yes | None | Username used to connect to the instance. |
| password | Yes | None | Password of the username. |
| connection.selectedDatabase | Yes | None | Name of the destination database. |
| connection.table | Yes | None | Name of the destination table. |
| column | Yes | None | Columns to write data to. Separate multiple columns with commas. Example: ["id", "name", "age"]. |
| preSql | No | None | SQL statements to run before data is written. |
| postSql | No | None | SQL statements to run after data is written. |
| maxBatchRows | No | 500000 | Maximum number of rows per batch. A batch is flushed when either maxBatchRows or batchSize is reached, whichever comes first. |
| batchSize | No | 104857600 (100 MB) | Maximum data size per batch. A batch is flushed when either batchSize or maxBatchRows is reached, whichever comes first. |
| maxRetries | No | 3 | Maximum number of retries after a batch import fails. |
| labelPrefix | No | datax_doris_writer_ | Prefix of the import label. Each batch uses a unique label composed of this prefix and a UUID to prevent duplicate imports. |
| loadProps | No | None | Additional Stream Load parameters. For the full list of supported parameters, see the Parameters section in "Import data by using Stream Load". |
| flushInterval | No | 30000 (milliseconds) | Interval between batch writes. |
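The interaction between maxBatchRows and batchSize can be summarized as a simple rule: a batch is flushed as soon as either limit is reached. The following is a minimal sketch of that rule, not DataX's actual implementation:

```python
# Flush policy: a batch is written out when either the row limit or the
# byte limit is reached, whichever comes first. Defaults match the
# doriswriter defaults in the table above.
def should_flush(row_count, byte_size,
                 max_batch_rows=500_000,    # default maxBatchRows
                 batch_size=104_857_600):   # default batchSize (100 MB)
    return row_count >= max_batch_rows or byte_size >= batch_size
```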

Configure data formats

By default, DataX Doris Writer converts data to CSV format, using \t as the column delimiter and \n as the line delimiter.

Use custom delimiters

To change the delimiters, set them in loadProps. Use non-printable characters such as \x01 and \x02 to avoid conflicts with data that contains tabs or newlines:

"loadProps": {
    "format": "csv",
    "column_separator": "\\x01",
    "line_delimiter": "\\x02"
}

Use JSON format

To import data in JSON format instead:

"loadProps": {
    "format": "json",
    "strip_outer_array": true
}

What's next