ApsaraDB for SelectDB: Import data by using DataX

Last Updated: Jun 21, 2024

This topic describes how to use DataX Doris Writer to synchronize data to an ApsaraDB for SelectDB instance.

Overview

DataX is an open source tool that is provided by Alibaba Group for offline synchronization between disparate data sources, including relational databases such as MySQL and Oracle, Hadoop Distributed File System (HDFS), Hive, MaxCompute, HBase, and FTP. DataX provides stable and efficient data synchronization capabilities. You can use DataX to read data from upstream data sources and then use DataX Doris Writer to write data to ApsaraDB for SelectDB.

Prerequisites

  • A Maven environment is installed.

  • Python 3.6 or later is installed.
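
You can check these prerequisites before you proceed. A minimal check in Bash:

    # Confirm that Maven is available and that the installed Python version is 3.6 or later.
    mvn -v
    python3 --version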

Example

The following example shows how to import data from a MySQL database to an ApsaraDB for SelectDB instance by using DataX in Linux.

Step 1: Configure a DataX environment

  1. Download the DataX package. For more information, visit the doris repository on GitHub.

  2. Run the init-env.sh script in the DataX package to configure the DataX development environment.

    sh init-env.sh
  3. Compile the mysqlreader and doriswriter plug-ins.

    1. Compile the DataX project.

      cd DataX/
      mvn package assembly:assembly -Dmaven.test.skip=true

      The compilation results are stored in the target/datax/datax/ directory.

      Note

      If you want to use the hdfsreader, hdfswriter, ossreader, and osswriter plug-ins, additional JAR packages are required. If you do not need these plug-ins, you can remove the modules of these plug-ins from the DataX/pom.xml file.

    2. Separately compile the mysqlreader and doriswriter plug-ins.

      mvn clean install -pl plugin-rdbms-util,mysqlreader,doriswriter -DskipTests
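
After the compilation completes, you can confirm that the compiled plug-ins are available before you continue. The following Bash sketch assumes the default DataX build layout, in which plug-ins are placed under target/datax/datax/plugin; the exact paths may vary depending on your DataX version.

    # List the compiled reader and writer plug-ins in the DataX build output.
    # These paths assume the default build layout and may differ in your environment.
    ls target/datax/datax/plugin/reader/
    ls target/datax/datax/plugin/writer/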

Step 2: Generate the data to be imported

  1. Create a test MySQL table.

    CREATE TABLE `employees` (
      `emp_no` int NOT NULL,
      `birth_date` date NOT NULL,
      `first_name` varchar(14) NOT NULL,
      `last_name` varchar(16) NOT NULL,
      `gender` enum('M','F') NOT NULL,
      `hire_date` date NOT NULL,
      PRIMARY KEY (`emp_no`)
    ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb3;
  2. Use Data Management (DMS) to generate test data. For more information, see Generate test data.
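
If you prefer not to use DMS, you can also insert a small amount of test data manually. The following Bash sketch uses placeholder connection values for the source MySQL database; replace the host, username, password, and database name with your own values.

    # Insert a few sample rows into the source table (placeholder connection values).
    mysql -h <mysql_host> -P 3306 -u <username> -p -e "
      INSERT INTO <database_name>.employees VALUES
        (10001, '1990-01-01', 'Jack',  'Lee',  'M', '2015-07-01'),
        (10002, '1992-03-15', 'Alice', 'Chen', 'F', '2018-02-20');"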

Step 3: Configure an ApsaraDB for SelectDB instance

  1. Connect to an ApsaraDB for SelectDB instance over the MySQL protocol. For more information, see Connect to an instance.

  2. Create a test database and a test table.

    1. Execute the following statement to create a test database:

      CREATE DATABASE test_db;
    2. Execute the following statements to create a test table:

      USE test_db;
      CREATE TABLE employees (
          emp_no       int NOT NULL,
          birth_date   date,
          first_name   varchar(20),
          last_name    varchar(20),
          gender       char(2),
          hire_date    date
      )
      UNIQUE KEY(`emp_no`)
      DISTRIBUTED BY HASH(`emp_no`) BUCKETS 1;
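
You can optionally confirm that the table was created as expected by querying its schema over the MySQL protocol. A minimal check with placeholder connection values:

    # Describe the destination table in the ApsaraDB for SelectDB instance (placeholder values).
    mysql -h <selectdb_endpoint> -P <mysql_port> -u admin -p -e "DESC test_db.employees;"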

Step 4: Synchronize data from the MySQL database to the ApsaraDB for SelectDB instance by using DataX

  1. Apply for a public endpoint for the ApsaraDB for SelectDB instance. For more information, see Apply for or release a public endpoint.

  2. Add the public IP address of the DataX host to the IP address whitelist of the instance. For more information, see Configure an IP address whitelist.

  3. Create the mysqlToSelectDB.json configuration file and configure the information about an import job in the file. Sample code:

    {
      "job":{
        "content":[
          {
            "reader":{
                "name": "mysqlreader",
                "parameter": {
                    "column": [
                        "emp_no",
                        "birth_date",
                        "first_name",
                        "last_name",
                        "gender",
                        "hire_date"
                    ],
                    "where": "emp_no>0",
                    "connection": [
                        {
                            "jdbcUrl": [
                                "jdbc:mysql://host:port/test_db?useUnicode=true&allowPublicKeyRetrieval=true&characterEncoding=utf-8"
                            ],
                            "table": [
                                "employees"
                            ]
                        }
                    ],
                    "password": "123456",
                    "splitPk": "emp_no",
                    "username": "admin"
                }
            },
            "writer":{
              "name":"doriswriter",
              "parameter":{
                "loadUrl":[
                  "selectdb-cn-xxx-public.selectdbfe.rds.aliyuncs.com:8080"
                ],
                "loadProps":{
                  "format":"json",
                  "strip_outer_array":"true"
                },
                "column":[
                    "emp_no",
                    "birth_date",
                    "first_name",
                    "last_name",
                    "gender",
                    "hire_date"
                ],
                "username":"admin",
                "password":"123456",
                "postSql":[
    
                ],
                "preSql":[
    
                ],
                "connection":[
                  {
                    "jdbcUrl":"jdbc:mysql://selectdb-cn-xxx-public.selectdbfe.rds.aliyuncs.com:9030/test_db",
                    "table":[
                      "employees"
                    ],
                    "selectedDatabase":"test_db"
                  }
                ],
                "maxBatchRows":1000000,
                "batchSize":536870912000
              }
            }
          }
        ],
        "setting":{
          "errorLimit":{
            "percentage":0.02,
            "record":0
          },
          "speed":{
            "channel":5
          }
        }
      }
    }

  4. The following list describes the parameters of the doriswriter plug-in.

    jdbcUrl

    Required: Yes. Default value: N/A.

    The Java Database Connectivity (JDBC) URL that is used to connect to the ApsaraDB for SelectDB instance, in the jdbc:mysql://<ip>:<port> format.

    To obtain the endpoint of the ApsaraDB for SelectDB instance, log on to the ApsaraDB for SelectDB console and go to the instance details page. In the Network Information section of the Basic Information page, view the endpoint and the MySQL port number.

    ip: the virtual private cloud (VPC) endpoint or public endpoint of the instance.

    port: the MySQL port number of the instance.

    Example: jdbc:mysql://selectdb-cn-4xl3jv1****.selectdbfe.rds.aliyuncs.com:9030.

    Note

    If the DataX host and the ApsaraDB for SelectDB instance reside in the same VPC, use the VPC endpoint of the instance. If they reside in different VPCs, use the public endpoint of the instance.

    loadUrl

    Required: Yes. Default value: N/A.

    The URL that is used to connect to the ApsaraDB for SelectDB instance over HTTP, in the <ip>:<port> format.

    To obtain the endpoint of the ApsaraDB for SelectDB instance, log on to the ApsaraDB for SelectDB console and go to the instance details page. In the Network Information section of the Basic Information page, view the endpoint and the HTTP port number.

    ip: the VPC endpoint or public endpoint of the instance.

    port: the HTTP port number of the instance.

    Example: selectdb-cn-4xl3jv1****.selectdbfe.rds.aliyuncs.com:8080.

    Note

    If the DataX host and the ApsaraDB for SelectDB instance reside in the same VPC, use the VPC endpoint of the instance. If they reside in different VPCs, use the public endpoint of the instance.

    username

    Required: Yes. Default value: N/A.

    The username that is used to connect to the ApsaraDB for SelectDB instance.

    password

    Required: Yes. Default value: N/A.

    The password of the username that is used to connect to the ApsaraDB for SelectDB instance.

    connection.selectedDatabase

    Required: Yes. Default value: N/A.

    The name of the database to which you want to write data in the ApsaraDB for SelectDB instance.

    connection.table

    Required: Yes. Default value: N/A.

    The name of the table to which you want to write data in the ApsaraDB for SelectDB instance.

    column

    Required: Yes. Default value: N/A.

    The fields of the destination table to which you want to write data. These fields are used in the generated JSON-formatted data. Separate multiple fields with commas (,). Example: "column": ["id","name","age"].

    preSql

    Required: No. Default value: N/A.

    The standard SQL statements to be executed before data is written to the destination table.

    postSql

    Required: No. Default value: N/A.

    The standard SQL statements to be executed after data is written to the destination table.

    maxBatchRows

    Required: No. Default value: 500000.

    The maximum number of rows that can be written to the destination table in a batch. This parameter is used together with the batchSize parameter to control the amount of data in a batch. When a batch reaches either of the two thresholds, the batch is imported.

    batchSize

    Required: No. Default value: 104857600.

    The maximum amount of data, in bytes, that can be written to the destination table in a batch. This parameter is used together with the maxBatchRows parameter to control the amount of data in a batch. When a batch reaches either of the two thresholds, the batch is imported. The default value is equivalent to 100 MB.

    maxRetries

    Required: No. Default value: 3.

    The maximum number of retries that are allowed after a batch of data fails to be imported to the destination table.

    labelPrefix

    Required: No. Default value: datax_doris_writer_.

    The label prefix of each batch of imported data. A unique label is composed of the label prefix and a UUID to ensure that data is not repeatedly imported.

    loadProps

    Required: No. Default value: N/A.

    The request parameters, which are the same as those of a Stream Load job. For more information, see the Parameters section of the "Import data by using Stream Load" topic. The format property specifies the format of the data to be imported. By default, data is imported in the CSV format. Data can also be imported in the JSON format. For more information, see the Convert data types section of this topic.

    flushInterval

    Required: No. Default value: 30000.

    The interval at which batches of data are written, in milliseconds.

  5. Run the following commands to submit the job:

    cd target/datax/datax/bin
    python datax.py ../mysqlToSelectDB.json
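
After the job is complete, you can verify the import by counting the rows in the destination table. A minimal check with placeholder connection values:

    # Count the imported rows in the destination table (placeholder endpoint and credentials).
    mysql -h <selectdb_endpoint> -P <mysql_port> -u admin -p -e "SELECT COUNT(*) FROM test_db.employees;"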

Convert data types

By default, the data to be imported is converted into strings and assembled into CSV-formatted data, in which \t is used as the column delimiter and \n is used as the row delimiter, before it is imported to the ApsaraDB for SelectDB instance. If you want to change the column or row delimiter, modify the properties in the loadProps parameter. Sample code:

"loadProps": {
    "format": "csv",
    "column_separator": "\\x01",
    "line_delimiter": "\\x02"
}

If you want to import data in the JSON format, set the format property in the loadProps parameter to json. Sample code:

"loadProps": {
    "format": "json",
    "strip_outer_array": true
}