This topic describes how to use DataX Doris Writer to synchronize data to an ApsaraDB for SelectDB instance.
Overview
DataX is an open source tool provided by Alibaba Group for offline data synchronization between disparate data sources, such as relational databases (for example, MySQL and Oracle), Hadoop Distributed File System (HDFS), Hive, MaxCompute, HBase, and FTP. DataX provides stable and efficient data synchronization capabilities. You can use DataX to read data from an upstream data source and then use DataX Doris Writer to write the data to ApsaraDB for SelectDB.
Prerequisites
A Maven environment is installed.
Python 3.6 or later is installed.
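For example, you can confirm that the environment meets these requirements by checking the tool versions:
mvn -v
python --version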
Example
The following example shows how to import data from a MySQL database to an ApsaraDB for SelectDB instance by using DataX in Linux.
Step 1: Configure a DataX environment
Download the DataX package. For more information, visit the doris repository at GitHub.
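For example, assuming that you obtain the source code by cloning the Apache Doris repository, which hosts the DataX doriswriter extension and the init-env.sh script, you can run commands similar to the following. The directory layout is an assumption and may differ depending on the version that you use:
# clone the repository and switch to the DataX extension directory (path may vary by version)
git clone https://github.com/apache/doris.git
cd doris/extension/DataX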
Run the init-env.sh script in the DataX package to configure the DataX development environment.
sh init-env.sh
Compile the mysqlreader and doriswriter plug-ins.
Compile the DataX project.
cd DataX/
mvn package assembly:assembly -Dmaven.test.skip=true
The compilation results are stored in the target/datax/datax/ directory.
Note: If you want to use the hdfsreader, hdfswriter, ossreader, and osswriter plug-ins, additional JAR packages are required. If you do not need these plug-ins, you can remove the modules of these plug-ins from the DataX/pom.xml file.
Separately compile the mysqlreader and doriswriter plug-ins.
mvn clean install -pl plugin-rdbms-util,mysqlreader,doriswriter -DskipTests
Step 2: Generate the data to be imported
Create a test MySQL table.
CREATE TABLE `employees` (
  `emp_no` int NOT NULL,
  `birth_date` date NOT NULL,
  `first_name` varchar(14) NOT NULL,
  `last_name` varchar(16) NOT NULL,
  `gender` enum('M','F') NOT NULL,
  `hire_date` date NOT NULL,
  PRIMARY KEY (`emp_no`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb3;
Use Data Management (DMS) to generate test data. For more information, see Generate test data.
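If you do not want to use DMS, you can also insert a small amount of test data manually. A minimal example based on the table schema above, using placeholder values:
INSERT INTO employees VALUES
(10001, '1990-01-01', 'John', 'Doe', 'M', '2020-01-01'),
(10002, '1992-05-20', 'Jane', 'Roe', 'F', '2021-03-15');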
Step 3: Configure an ApsaraDB for SelectDB instance
Connect to an ApsaraDB for SelectDB instance over the MySQL protocol. For more information, see Connect to an instance.
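For example, assuming that the MySQL command-line client is installed on the DataX host, you can connect by using the endpoint, MySQL port number, and account of your instance. The values below are placeholders:
mysql -h selectdb-cn-xxx-public.selectdbfe.rds.aliyuncs.com -P 9030 -u admin -p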
Create a test database and a test table.
Execute the following statement to create a test database:
CREATE DATABASE test_db;
Execute the following statements to create a test table:
USE test_db;
CREATE TABLE employees (
  emp_no int NOT NULL,
  birth_date date,
  first_name varchar(20),
  last_name varchar(20),
  gender char(2),
  hire_date date
)
UNIQUE KEY(`emp_no`)
DISTRIBUTED BY HASH(`emp_no`) BUCKETS 1;
Step 4: Synchronize data from the MySQL database to the ApsaraDB for SelectDB instance by using DataX
Apply for a public endpoint for the ApsaraDB for SelectDB instance. For more information, see Apply for or release a public endpoint.
Add the public IP address of the DataX host to the IP address whitelist of the instance. For more information, see Configure an IP address whitelist.
Create the mysqlToSelectDB.json configuration file and configure the information about an import job in the file. Sample code:
{
    "job":{
        "content":[
            {
                "reader":{
                    "name": "mysqlreader",
                    "parameter": {
                        "column": ["emp_no", "birth_date", "first_name", "last_name", "gender", "hire_date"],
                        "where": "emp_no>0",
                        "connection": [
                            {
                                "jdbcUrl": ["jdbc:mysql://host:port/test_db?useUnicode=true&allowPublicKeyRetrieval=true&characterEncoding=utf-8"],
                                "table": ["employees"]
                            }
                        ],
                        "password": "123456",
                        "splitPk": "emp_no",
                        "username": "admin"
                    }
                },
                "writer":{
                    "name":"doriswriter",
                    "parameter":{
                        "loadUrl":["selectdb-cn-xxx-public.selectdbfe.rds.aliyuncs.com:8080"],
                        "loadProps":{
                            "format":"json",
                            "strip_outer_array":"true"
                        },
                        "column":["emp_no", "birth_date", "first_name", "last_name", "gender", "hire_date"],
                        "username":"admin",
                        "password":"123456",
                        "postSql":[],
                        "preSql":[],
                        "connection":[
                            {
                                "jdbcUrl":"jdbc:mysql://selectdb-cn-xxx-public.selectdbfe.rds.aliyuncs.com:9030/test_db",
                                "table":["employees"],
                                "selectedDatabase":"test_db"
                            }
                        ],
                        "maxBatchRows":1000000,
                        "batchSize":536870912000
                    }
                }
            }
        ],
        "setting":{
            "errorLimit":{
                "percentage":0.02,
                "record":0
            },
            "speed":{
                "channel":5
            }
        }
    }
}
Parameters
jdbcUrl (required; no default value)
The Java Database Connectivity (JDBC) URL that is used to connect to the ApsaraDB for SelectDB instance, in the jdbc:mysql://<ip>:<port> format.
ip: the virtual private cloud (VPC) endpoint or public endpoint of the instance.
port: the MySQL port number of the instance.
To obtain the endpoint and MySQL port number of the ApsaraDB for SelectDB instance, log on to the ApsaraDB for SelectDB console, go to the Instance Details page, and view the Network Information section on the Basic Information page.
Example: jdbc:mysql://selectdb-cn-4xl3jv1****.selectdbfe.rds.aliyuncs.com:9030
Note: If the DataX host and the ApsaraDB for SelectDB instance reside in the same VPC, use the VPC endpoint of the instance. If they reside in different VPCs, use the public endpoint of the instance.
loadUrl (required; no default value)
The URL that is used to connect to the ApsaraDB for SelectDB instance over HTTP, in the <ip>:<port> format.
ip: the VPC endpoint or public endpoint of the instance.
port: the HTTP port number of the instance.
To obtain the endpoint and HTTP port number of the ApsaraDB for SelectDB instance, log on to the ApsaraDB for SelectDB console, go to the Instance Details page, and view the Network Information section on the Basic Information page.
Example: selectdb-cn-4xl3jv1****.selectdbfe.rds.aliyuncs.com:8080
Note: If the DataX host and the ApsaraDB for SelectDB instance reside in the same VPC, use the VPC endpoint of the instance. If they reside in different VPCs, use the public endpoint of the instance.
username (required; no default value)
The username that is used to connect to the ApsaraDB for SelectDB instance.
password (required; no default value)
The password of the username that is used to connect to the ApsaraDB for SelectDB instance.
connection.selectedDatabase (required; no default value)
The name of the database in the ApsaraDB for SelectDB instance to which you want to write data.
connection.table (required; no default value)
The name of the table in the ApsaraDB for SelectDB instance to which you want to write data.
column (required; no default value)
The fields of the destination table to which you want to write data. These fields are used in the generated JSON-formatted data. Separate multiple fields with commas (,). Example: "column": ["id","name","age"].
preSql (optional; no default value)
The standard statements to be executed before data is written to the destination table. See the example after this list.
postSql (optional; no default value)
The standard statements to be executed after data is written to the destination table.
maxBatchRows (optional; default value: 500000)
The maximum number of rows that can be written to the destination table in a batch. This parameter is used together with the batchSize parameter to control the amount of data that is written in a batch. When a batch reaches either of the two thresholds, the batch starts to be imported.
batchSize (optional; default value: 104857600)
The maximum amount of data, in bytes, that can be written to the destination table in a batch. This parameter is used together with the maxBatchRows parameter to control the amount of data that is written in a batch. When a batch reaches either of the two thresholds, the batch starts to be imported. The default value is equivalent to 100 MB.
maxRetries (optional; default value: 3)
The maximum number of retries that are allowed after a batch of data fails to be imported to the destination table.
labelPrefix (optional; default value: datax_doris_writer_)
The label prefix of each batch of imported data. A unique label is composed of the label prefix and a UUID to ensure that data is not repeatedly imported.
loadProps (optional; no default value)
The request parameters that are passed through to the underlying Stream Load job. For more information, see the Parameters section of the "Import data by using Stream Load" topic. The format parameter specifies the format of the data to be imported. By default, data is imported in the CSV format. You can also import data in the JSON format. For more information, see the Convert data types section of this topic.
flushInterval (optional; default value: 30000)
The interval at which buffered data is written in batches. Unit: milliseconds.
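For example, preSql is commonly used to clear the destination table before a full import. A minimal sketch, assuming that you want to overwrite the existing data in the employees table:
"preSql": ["TRUNCATE TABLE employees"]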
Run the following commands to submit the job:
cd target/datax/datax/bin
python datax.py ../mysqlToSelectDB.json
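After the job completes, you can verify the result in the ApsaraDB for SelectDB instance, for example by comparing row counts between the source and destination tables:
SELECT COUNT(*) FROM test_db.employees;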
Convert data types
By default, the imported data is converted into strings, \t is used as the column delimiter, and \n is used as the row delimiter. The data is assembled into CSV-formatted content and imported to the ApsaraDB for SelectDB instance. That is, data is imported in the CSV format by default. If you want to change the column or row delimiter, modify the corresponding parameters in the loadProps parameter. Sample code:
"loadProps": {
"format": "csv",
"column_separator": "\\x01",
"line_delimiter": "\\x02"
}
If you want to import data in the JSON format, modify the format parameter in the loadProps parameter. Sample code:
"loadProps": {
"format": "json",
"strip_outer_array": true
}
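In the sample configuration in Step 4, strip_outer_array is set to true because the data of each batch is assembled into a JSON array; this Stream Load parameter instructs the instance to strip the outer array and load each element as a separate row.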