DataX 101: What Is It and How Does It Work?

DataX is an offline data synchronization tool that is widely used within Alibaba Group. It synchronizes data efficiently between heterogeneous data sources, including MySQL, SQL Server, Oracle, PostgreSQL, HDFS, Hive, HBase, OTS, and ODPS.

Features

As a data synchronization framework, DataX abstracts each data source into two kinds of plug-ins: a Reader plug-in that reads data from the source and a Writer plug-in that writes data to the target. In principle, this lets DataX synchronize data between any pair of data sources. The plug-in system also works as an ecosystem: once a new data source has a Reader or Writer plug-in, it can exchange data with every data source that is already supported.
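The Reader/Writer split can be pictured with a small conceptual sketch. This is not DataX's plug-in API (DataX itself is written in Java); the registries, function names, and record shape below are all hypothetical and only illustrate why any registered reader can be paired with any registered writer.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Record = List[object]
Reader = Callable[[], Iterator[Record]]
Writer = Callable[[Iterable[Record]], int]

# Hypothetical registries standing in for a plug-in directory.
READERS: Dict[str, Reader] = {}
WRITERS: Dict[str, Writer] = {}


def fixed_reader() -> Iterator[Record]:
    """Emit three fixed records, standing in for a real data source."""
    for i in range(3):
        yield [i, "hello"]


def console_writer(records: Iterable[Record]) -> int:
    """Print each record as a tab-separated line and return the count."""
    count = 0
    for record in records:
        print("\t".join(str(field) for field in record))
        count += 1
    return count


def counting_writer(records: Iterable[Record]) -> int:
    """Discard the records and only return how many were received."""
    return sum(1 for _ in records)


def sync(reader_name: str, writer_name: str) -> int:
    """Pipe the named reader into the named writer."""
    return WRITERS[writer_name](READERS[reader_name]())


READERS["fixed"] = fixed_reader
WRITERS["console"] = console_writer
WRITERS["counting"] = counting_writer
```

Registering one more reader here would immediately make it usable with both writers, which is the ecosystem effect described above.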

System Requirements

Linux
JDK (1.8 or later; 1.8 is recommended)
Python (2 or 3)
Apache Maven 3.x (needed only to compile DataX from source)
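A quick way to check the prerequisites above is to look for the required executables on the PATH. The following is a minimal sketch, not part of DataX; the function name and the assumption that the executables are called java and mvn are illustrative.

```python
import shutil
import sys
from typing import Callable, List, Optional


def missing_tools(
    tools: List[str],
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the tools that cannot be found on the PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    # Executable names are assumptions; adjust them for your system.
    print("Python:", sys.version.split()[0])
    missing = missing_tools(["java", "mvn"])
    print("Missing:", ", ".join(missing) if missing else "none")
```

This only confirms that the tools are installed; it does not verify the JDK or Maven version.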

Quick Start

Install DataX:

Method 1: Download the DataX toolkit: DataX direct download

After downloading the file, decompress it to a local directory and enter the bin directory to run the synchronization task:

$ cd {YOUR_DATAX_HOME}/bin
$ python datax.py {YOUR_JOB.json}

Self-test script: python {YOUR_DATAX_HOME}/bin/datax.py {YOUR_DATAX_HOME}/job/job.json
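If you want to launch jobs from another Python program, the command above can be wrapped with the standard subprocess module. This is a sketch under assumptions: the helper names are hypothetical, and it reuses the current interpreter (sys.executable) instead of whatever python resolves to on your PATH.

```python
import os
import subprocess
import sys
from typing import List


def build_datax_command(datax_home: str, job_file: str) -> List[str]:
    """Build the argument list for: python {datax_home}/bin/datax.py {job_file}."""
    launcher = os.path.join(datax_home, "bin", "datax.py")
    return [sys.executable, launcher, job_file]


def run_job(datax_home: str, job_file: str) -> int:
    """Run the job and return the exit code of the datax.py process."""
    return subprocess.call(build_datax_command(datax_home, job_file))
```

For the self-test, run_job(home, os.path.join(home, "job", "job.json")) would be the equivalent call, where home is your DataX directory.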

Method 2: Download the DataX source code and compile it: DataX source code

(1) Download the DataX source code:

$ git clone git@github.com:alibaba/DataX.git

(2) Package the code with Maven:

$ cd {DataX_source_code_home}
$ mvn -U clean package assembly:assembly -Dmaven.test.skip=true

When packaging succeeds, the build log ends with output like the following:

[INFO] BUILD SUCCESS
[INFO] -----------------------------------------------------------------
[INFO] Total time: 08:12 min
[INFO] Finished at: 2015-12-13T16:26:48+08:00
[INFO] Final Memory: 133M/960M
[INFO] -----------------------------------------------------------------

After packaging, the DataX package is located in {DataX_source_code_home}/target/datax/datax/. Its directory structure is as follows:

$ cd {DataX_source_code_home}
$ ls ./target/datax/datax/
bin        conf        job        lib        log        log_perf    plugin        script        tmp

Configuration example: read data from a stream and print it to the console.

Step 1: Create a configuration file (in JSON format)

You can view the configuration template using the following command: python datax.py -r {YOUR_READER} -w {YOUR_WRITER}.

$ cd {YOUR_DATAX_HOME}/bin
$ python datax.py -r streamreader -w streamwriter
DataX (UNKNOWN_DATAX_VERSION), From Alibaba !
Copyright (C) 2010-2015, Alibaba Group. All Rights Reserved.
Please refer to the streamreader document:
    https://github.com/alibaba/DataX/blob/master/streamreader/doc/streamreader.md 

Please refer to the streamwriter document:
     https://github.com/alibaba/DataX/blob/master/streamwriter/doc/streamwriter.md 
 
Please save the following configuration as a json file and use
     python {DATAX_HOME}/bin/datax.py {JSON_FILE_NAME}.json 
to run the job.

{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "streamreader", 
                    "parameter": {
                        "column": [], 
                        "sliceRecordCount": ""
                    }
                }, 
                "writer": {
                    "name": "streamwriter", 
                    "parameter": {
                        "encoding": "", 
                        "print": true
                    }
                }
            }
        ], 
        "setting": {
            "speed": {
                "channel": ""
            }
        }
    }
}

Fill in the template and save the result as stream2stream.json:

{
  "job": {
    "content": [
      {
        "reader": {
          "name": "streamreader",
          "parameter": {
            "sliceRecordCount": 10,
            "column": [
              {
                "type": "long",
                "value": "10"
              },
              {
                "type": "string",
                "value": "hello，你好，世界-DataX"
              }
            ]
          }
        },
        "writer": {
          "name": "streamwriter",
          "parameter": {
            "encoding": "UTF-8",
            "print": true
          }
        }
      }
    ],
    "setting": {
      "speed": {
        "channel": 5
      }
    }
  }
}
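The same configuration can also be generated programmatically, which helps when many similar jobs are needed. The sketch below mirrors the structure of stream2stream.json using only the standard json module; the function names are hypothetical and not part of DataX.

```python
import json
from typing import Any, Dict, List


def make_stream_job(
    columns: List[Dict[str, str]],
    slice_record_count: int,
    channel: int,
) -> Dict[str, Any]:
    """Build a streamreader-to-streamwriter job with the layout shown above."""
    return {
        "job": {
            "content": [
                {
                    "reader": {
                        "name": "streamreader",
                        "parameter": {
                            "sliceRecordCount": slice_record_count,
                            "column": columns,
                        },
                    },
                    "writer": {
                        "name": "streamwriter",
                        "parameter": {"encoding": "UTF-8", "print": True},
                    },
                }
            ],
            "setting": {"speed": {"channel": channel}},
        }
    }


def write_job(job: Dict[str, Any], path: str) -> None:
    """Write the job as UTF-8 JSON, keeping non-ASCII values readable."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(job, handle, ensure_ascii=False, indent=2)
```

Calling write_job(make_stream_job(columns, 10, 5), "stream2stream.json") with the two column definitions from the example would produce an equivalent file.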

Step 2: Start DataX

$ cd {YOUR_DATAX_HOME}/bin
$ python datax.py ./stream2stream.json

When the synchronization finishes, the log ends with a summary like the following:

...
2015-12-17 11:20:25.263 [job-0] INFO  JobContainer -
Task start time                 : 2015-12-17 11:20:15
Task end time                   : 2015-12-17 11:20:25
Total time consumption          :                 10s
Average task traffic            :              205B/s
Record write speed              :              5rec/s
Total read records              :                  50
Write and Read Failures         :                   0
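The summary figures can be cross-checked against the job configuration. The sketch below assumes that each channel emits sliceRecordCount records, which is an inference from the numbers in this example (5 x 10 = 50) rather than something the log states. The function name and the use of integer division for the speed are also assumptions.

```python
from typing import Tuple


def expected_summary(
    channel: int,
    slice_record_count: int,
    elapsed_seconds: int,
) -> Tuple[int, int]:
    """Return (total records, records per second) under the stated assumption."""
    total_records = channel * slice_record_count
    records_per_second = total_records // elapsed_seconds
    return total_records, records_per_second


# Values taken from stream2stream.json and the log summary above.
total, speed = expected_summary(channel=5, slice_record_count=10, elapsed_seconds=10)
print(total, speed)
```

With the example values this yields 50 records and 5 records per second, matching the "Total read records" and "Record write speed" lines.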

Contact Us

Google Groups: DataX-user
