DataX is an open source offline data synchronization tool provided by Alibaba Group. It supports reading from relational databases such as MySQL and Oracle, Hadoop Distributed File System (HDFS), Hive, MaxCompute, HBase, FTP, and more. Use DataX Doris Writer to write the data to ApsaraDB for SelectDB.
Data flow:
Source (e.g., MySQL) -> mysqlreader -> DataX channel -> doriswriter -> ApsaraDB for SelectDB
Prerequisites
Before you begin, make sure that you have:
- Maven installed
- Python 3.6 or later installed
- An ApsaraDB for SelectDB instance
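The prerequisites can be checked with a short optional script before you start; this is a sketch, not part of DataX. It uses `shutil.which` to look for the `mvn` executable, which assumes Maven is on your `PATH`:

```python
import shutil
import sys

def check_python(version=sys.version_info):
    """Return True if the interpreter meets the Python 3.6+ requirement."""
    return (version[0], version[1]) >= (3, 6)

def check_maven():
    """Return True if the mvn executable is found on PATH."""
    return shutil.which("mvn") is not None

print("Python 3.6+:", check_python())
print("Maven found:", check_maven())
```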
Import data from MySQL to ApsaraDB for SelectDB
The following example walks through a complete import on Linux, using a MySQL database as the source.
Step 1: Set up the DataX environment
1. Download the DataX package from the Doris extension on GitHub.

2. Run the initialization script:

   ```shell
   sh init-env.sh
   ```

3. Compile the DataX project. If you need the `hdfsreader`, `hdfswriter`, `ossreader`, or `osswriter` plug-ins, additional JAR packages are required. To skip these plug-ins, remove their modules from `DataX/pom.xml` before compiling.

   ```shell
   cd DataX/
   mvn package assembly:assembly -Dmaven.test.skip=true
   ```

   The compiled output is stored in the `target/datax/datax/` directory.

4. Compile the `mysqlreader` and `doriswriter` plug-ins separately:

   ```shell
   mvn clean install -pl plugin-rdbms-util,mysqlreader,doriswriter -DskipTests
   ```
Step 2: Prepare test data in MySQL
1. Create a test table in MySQL:

   ```sql
   CREATE TABLE `employees` (
     `emp_no` int NOT NULL,
     `birth_date` date NOT NULL,
     `first_name` varchar(14) NOT NULL,
     `last_name` varchar(16) NOT NULL,
     `gender` enum('M','F') NOT NULL,
     `hire_date` date NOT NULL,
     PRIMARY KEY (`emp_no`)
   ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb3;
   ```

2. Generate test data using Data Management (DMS).
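If DMS is not available, sample rows for the `employees` table can also be produced with a short script. This is an optional sketch; the names and date ranges below are made up for illustration:

```python
import random
from datetime import date, timedelta

def random_date(start, end):
    """Pick a uniformly random date between start and end (inclusive)."""
    return start + timedelta(days=random.randrange((end - start).days + 1))

def generate_inserts(n):
    """Yield INSERT statements matching the employees schema above."""
    for emp_no in range(1, n + 1):
        birth = random_date(date(1960, 1, 1), date(1995, 12, 31))
        hire = random_date(date(2010, 1, 1), date(2023, 12, 31))
        gender = random.choice(["M", "F"])
        yield (
            "INSERT INTO employees VALUES "
            f"({emp_no}, '{birth}', 'first_{emp_no}', 'last_{emp_no}', '{gender}', '{hire}');"
        )

for stmt in generate_inserts(3):
    print(stmt)
```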
Step 3: Create the destination table in ApsaraDB for SelectDB
1. Connect to your ApsaraDB for SelectDB instance over the MySQL protocol. For details, see Connect to an instance.

2. Create a database and a destination table:

   ```sql
   CREATE DATABASE test_db;
   USE test_db;
   CREATE TABLE employees (
     emp_no int NOT NULL,
     birth_date date,
     first_name varchar(20),
     last_name varchar(20),
     gender char(2),
     hire_date date
   )
   UNIQUE KEY(`emp_no`)
   DISTRIBUTED BY HASH(`emp_no`) BUCKETS 1;
   ```
Step 4: Run the DataX import job
1. Apply for a public endpoint for your ApsaraDB for SelectDB instance. For details, see Apply for or release a public endpoint.

2. Add the public IP address of the DataX host to the IP address whitelist of the instance. For details, see Configure an IP address whitelist.

3. Create the job configuration file `mysqlToSelectDB.json`:

   ```json
   {
     "job": {
       "content": [
         {
           "reader": {
             "name": "mysqlreader",
             "parameter": {
               "column": ["emp_no", "birth_date", "first_name", "last_name", "gender", "hire_date"],
               "where": "emp_no>0",
               "connection": [
                 {
                   "jdbcUrl": ["jdbc:mysql://host:port/test_db?useUnicode=true&allowPublicKeyRetrieval=true&characterEncoding=utf-8"],
                   "table": ["employees"]
                 }
               ],
               "password": "123456",
               "splitPk": "emp_no",
               "username": "admin"
             }
           },
           "writer": {
             "name": "doriswriter",
             "parameter": {
               "loadUrl": ["selectdb-cn-xxx-public.selectdbfe.rds.aliyuncs.com:8080"],
               "loadProps": {
                 "format": "json",
                 "strip_outer_array": "true"
               },
               "column": ["emp_no", "birth_date", "first_name", "last_name", "gender", "hire_date"],
               "username": "admin",
               "password": "123456",
               "postSql": [],
               "preSql": [],
               "connection": [
                 {
                   "jdbcUrl": "jdbc:mysql://selectdb-cn-xxx-public.selectdbfe.rds.aliyuncs.com:9030/test_db",
                   "table": ["employees"],
                   "selectedDatabase": "test_db"
                 }
               ],
               "maxBatchRows": 1000000,
               "batchSize": 536870912000
             }
           }
         }
       ],
       "setting": {
         "errorLimit": {
           "percentage": 0.02,
           "record": 0
         },
         "speed": {
           "channel": 5
         }
       }
     }
   }
   ```

4. Submit the job:

   ```shell
   cd target/datax/datax/bin
   python datax.py ../mysqlToSelectDB.json
   ```
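Before submitting, it can help to verify that the writer section of the configuration file contains the required keys. The snippet below is an optional sanity check, not a DataX feature; the key names follow the configuration above:

```python
import json

# Required doriswriter keys, per the Parameters table in this topic.
REQUIRED_WRITER_KEYS = {"loadUrl", "username", "password", "column", "connection"}

def missing_writer_keys(config):
    """Return the required keys absent from job.content[0].writer.parameter."""
    writer = config["job"]["content"][0]["writer"]["parameter"]
    return sorted(REQUIRED_WRITER_KEYS - writer.keys())

# Example: a stripped-down config that is missing the password key.
sample = {
    "job": {
        "content": [
            {
                "writer": {
                    "parameter": {
                        "loadUrl": ["host:8080"],
                        "username": "admin",
                        "column": ["emp_no"],
                        "connection": [],
                    }
                }
            }
        ]
    }
}
print(missing_writer_keys(sample))  # → ['password']
```

In practice you would load the real file with `json.load(open("mysqlToSelectDB.json"))` and fail fast if the returned list is non-empty.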
Parameters
The following table describes the `doriswriter` parameters. The `jdbcUrl` and `loadUrl` parameters accept either the virtual private cloud (VPC) endpoint or the public endpoint of the ApsaraDB for SelectDB instance. Use the VPC endpoint if the DataX host and the instance are in the same VPC. Otherwise, use the public endpoint.
To find the endpoints, log on to the ApsaraDB for SelectDB console, go to the Instance Details page, and check the Network Information section on the Basic Information tab.
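For illustration, the two connection strings can be assembled from one endpoint with a small helper. This helper is hypothetical; the default ports (9030 for MySQL, 8080 for HTTP) follow the examples in this topic and may differ for your instance:

```python
def build_urls(endpoint, database, mysql_port=9030, http_port=8080):
    """Build the jdbcUrl and loadUrl strings from a single endpoint.

    Check the Network Information section of your instance for the
    actual endpoint and port values.
    """
    jdbc_url = f"jdbc:mysql://{endpoint}:{mysql_port}/{database}"
    load_url = f"{endpoint}:{http_port}"
    return jdbc_url, load_url

jdbc, load = build_urls("selectdb-cn-xxx-public.selectdbfe.rds.aliyuncs.com", "test_db")
print(jdbc)  # jdbc:mysql://selectdb-cn-xxx-public.selectdbfe.rds.aliyuncs.com:9030/test_db
print(load)  # selectdb-cn-xxx-public.selectdbfe.rds.aliyuncs.com:8080
```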
| Parameter | Required | Default value | Description |
|---|---|---|---|
| `jdbcUrl` | Yes | None | Java Database Connectivity (JDBC) URL for the instance. Format: `jdbc:mysql://<endpoint>:<MySQL-port>/<database>`. Example: `jdbc:mysql://selectdb-cn-4xl3jv1****.selectdbfe.rds.aliyuncs.com:9030/test_db`. |
| `loadUrl` | Yes | None | HTTP endpoint for Stream Load. Format: `<endpoint>:<HTTP-port>`. Example: `selectdb-cn-4xl3jv1****.selectdbfe.rds.aliyuncs.com:8080`. |
| `username` | Yes | None | Username for the instance. |
| `password` | Yes | None | Password for the username. |
| `connection.selectedDatabase` | Yes | None | Name of the destination database. |
| `connection.table` | Yes | None | Name of the destination table. |
| `column` | Yes | None | Columns to write data to. Separate multiple columns with commas. Example: `["id", "name", "age"]`. |
| `preSql` | No | None | SQL statements to run before data is written. |
| `postSql` | No | None | SQL statements to run after data is written. |
| `maxBatchRows` | No | 500000 | Maximum number of rows per batch. A batch is flushed when either `maxBatchRows` or `batchSize` is reached first. |
| `batchSize` | No | 104857600 (100 MB) | Maximum data size per batch. A batch is flushed when either `batchSize` or `maxBatchRows` is reached first. |
| `maxRetries` | No | 3 | Maximum number of retries after a batch import fails. |
| `labelPrefix` | No | `datax_doris_writer_` | Prefix for the import label. Each batch uses a unique label composed of this prefix and a UUID to prevent duplicate imports. |
| `loadProps` | No | None | Additional Stream Load parameters. For the full list of supported parameters, see the Parameters section in "Import data by using Stream Load". |
| `flushInterval` | No | 30000 | Interval between batch writes, in milliseconds. |
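The interaction between `maxBatchRows`, `batchSize`, and `labelPrefix` can be sketched as a simplified model. This is illustrative only, not the actual `doriswriter` implementation:

```python
import uuid

def should_flush(rows, bytes_buffered, max_batch_rows=500000, batch_size=104857600):
    """A batch is flushed when either threshold is reached, whichever comes first."""
    return rows >= max_batch_rows or bytes_buffered >= batch_size

def new_label(prefix="datax_doris_writer_"):
    """Each batch gets a unique label (prefix + UUID) so retries do not double-import."""
    return prefix + uuid.uuid4().hex

print(should_flush(500000, 0))       # True: row-count threshold reached
print(should_flush(10, 104857600))   # True: byte-size threshold reached
print(should_flush(10, 1024))        # False: keep buffering
print(new_label())
```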
Configure data formats
By default, DataX Doris Writer converts data to CSV format, using `\t` as the column delimiter and `\n` as the line delimiter.
Use custom delimiters
To change the delimiters, set them in `loadProps`. Use non-printable characters such as `\x01` and `\x02` to avoid conflicts with data that contains tabs or newlines:
```json
"loadProps": {
    "format": "csv",
    "column_separator": "\\x01",
    "line_delimiter": "\\x02"
}
```
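The snippet below illustrates why non-printable separators are safer when field values can contain tabs. Note that `loadProps` uses the escaped form (`\\x01`), while the Python string below holds the literal byte:

```python
# A row whose third field contains an embedded tab character.
row = ["10001", "1987-05-01", "O'Reilly\tJr.", "Smith", "M", "2020-01-15"]

# With the default tab separator, the embedded tab corrupts the row:
tab_encoded = "\t".join(row)
print(len(tab_encoded.split("\t")))    # 7 columns instead of 6

# With \x01 as the separator, the row survives intact:
sep_encoded = "\x01".join(row)
print(len(sep_encoded.split("\x01")))  # 6
```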
Use JSON format
To import data in JSON format instead:
```json
"loadProps": {
    "format": "json",
    "strip_outer_array": true
}
```
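With this setting, each batch is posted as a single JSON array of row objects, and `strip_outer_array` tells the server to strip the outer array so that every element becomes one row. A small sketch of the payload shape:

```python
import json

# Two rows batched into one JSON array, as sent by the writer.
batch = [
    {"emp_no": 10001, "first_name": "Georgi"},
    {"emp_no": 10002, "first_name": "Bezalel"},
]
payload = json.dumps(batch)
print(payload)

# With strip_outer_array enabled, each element of the array is imported as a row.
parsed = json.loads(payload)
print(len(parsed))  # 2 rows
```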