
DataX 101: What Is It and How Does It Work?

This short article explains what DataX is and gives an overview of its features.

DataX is an offline data synchronization tool and platform that is widely used within Alibaba Group. It synchronizes data efficiently between heterogeneous data sources, including MySQL, SQL Server, Oracle, PostgreSQL, HDFS, Hive, HBase, OTS, and ODPS.

Features

As a data synchronization framework, DataX splits the synchronization of any data source into two parts: a Reader plug-in that reads data from the source and a Writer plug-in that writes data to the target. In theory, this lets DataX synchronize data for any type of data source. The plug-in system also works as an ecosystem: once a new data source is connected through its plug-ins, it can exchange data with every data source that is already supported.
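
To make the Reader/Writer split concrete, here is a minimal Python sketch of the idea. It is not DataX code (DataX plug-ins are written against its own Java interfaces); the names `Reader`, `Writer`, `ListReader`, `ListWriter`, and `sync` are hypothetical and exist only for this illustration.

```python
from abc import ABC, abstractmethod
from typing import Iterable, Iterator, List, Tuple

Record = Tuple  # one row of column values


class Reader(ABC):
    """Source side: knows how to pull records out of one kind of data source."""

    @abstractmethod
    def read(self) -> Iterator[Record]:
        ...


class Writer(ABC):
    """Target side: knows how to push records into one kind of data source."""

    @abstractmethod
    def write(self, records: Iterable[Record]) -> int:
        ...


class ListReader(Reader):
    """Toy reader backed by an in-memory list (stands in for a real source)."""

    def __init__(self, rows: Iterable[Record]) -> None:
        self.rows: List[Record] = list(rows)

    def read(self) -> Iterator[Record]:
        yield from self.rows


class ListWriter(Writer):
    """Toy writer that collects records in memory (stands in for a real target)."""

    def __init__(self) -> None:
        self.sink: List[Record] = []

    def write(self, records: Iterable[Record]) -> int:
        count = 0
        for record in records:
            self.sink.append(record)
            count += 1
        return count


def sync(reader: Reader, writer: Writer) -> int:
    """The framework sees only the two interfaces, never a concrete source."""
    return writer.write(reader.read())
```

Because `sync` depends only on the two interfaces, adding one new reader makes it usable with every existing writer, and the reverse holds for a new writer. That is the sense in which a newly added data source can talk to all existing ones.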

System Requirements

Quick Start

Install DataX:

After downloading the DataX package, decompress it to a local directory, then enter the bin directory to run a synchronization job:

$ cd {YOUR_DATAX_HOME}/bin
$ python datax.py {YOUR_JOB.json}

Self-test script: python {YOUR_DATAX_HOME}/bin/datax.py {YOUR_DATAX_HOME}/job/job.json
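
If you want to launch jobs from a script instead of typing the command, a small wrapper like the following may help. This is a sketch, not part of DataX: the helper names `build_datax_command` and `run_datax_job` are hypothetical, and it assumes that the `python` on your PATH is an interpreter your DataX version supports.

```python
import os
import subprocess


def build_datax_command(datax_home, job_file, python_exe="python"):
    """Build the same command line the article uses: python bin/datax.py job.json."""
    launcher = os.path.join(datax_home, "bin", "datax.py")
    return [python_exe, launcher, job_file]


def run_datax_job(datax_home, job_file, python_exe="python"):
    """Run the job as a child process and return its exit code."""
    cmd = build_datax_command(datax_home, job_file, python_exe)
    return subprocess.run(cmd).returncode


# Show the command that would be executed (the path is a placeholder).
print(build_datax_command("/path/to/datax", "job.json"))
```

Passing the command as a list avoids shell quoting problems when the DataX home directory or the job file name contains spaces.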

Alternatively, you can build DataX from the source code:

(1) Download the DataX source code:

$ git clone git@github.com:alibaba/DataX.git

(2) Package the code with Maven:

$ cd {DataX_source_code_home}
$ mvn -U clean package assembly:assembly -Dmaven.test.skip=true

When packaging succeeds, the log ends with output like the following:

[INFO] BUILD SUCCESS
[INFO] -----------------------------------------------------------------
[INFO] Total time: 08:12 min
[INFO] Finished at: 2015-12-13T16:26:48+08:00
[INFO] Final Memory: 133M/960M
[INFO] -----------------------------------------------------------------

After packaging, the DataX package is located in {DataX_source_code_home}/target/datax/datax/. Its structure is listed below:

$ cd {DataX_source_code_home}
$ ls ./target/datax/datax/
bin        conf        job        lib        log        log_perf    plugin        script        tmp

Configuration Example: Read data from a stream and print it to the console.

  • Step 1: Create a configuration file (in JSON format)

You can view the configuration template using the following command: python datax.py -r {YOUR_READER} -w {YOUR_WRITER}.

$ cd {YOUR_DATAX_HOME}/bin
$ python datax.py -r streamreader -w streamwriter
DataX (UNKNOWN_DATAX_VERSION), From Alibaba !
Copyright (C) 2010-2015, Alibaba Group. All Rights Reserved.
Please refer to the streamreader document:
    https://github.com/alibaba/DataX/blob/master/streamreader/doc/streamreader.md 

Please refer to the streamwriter document:
     https://github.com/alibaba/DataX/blob/master/streamwriter/doc/streamwriter.md 
 
Please save the following configuration as a json file and use
     python {DATAX_HOME}/bin/datax.py {JSON_FILE_NAME}.json 
to run the job.

{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "streamreader", 
                    "parameter": {
                        "column": [], 
                        "sliceRecordCount": ""
                    }
                }, 
                "writer": {
                    "name": "streamwriter", 
                    "parameter": {
                        "encoding": "", 
                        "print": true
                    }
                }
            }
        ], 
        "setting": {
            "speed": {
                "channel": ""
            }
        }
    }
}

Fill in the template and save the result as stream2stream.json:

{
  "job": {
    "content": [
      {
        "reader": {
          "name": "streamreader",
          "parameter": {
            "sliceRecordCount": 10,
            "column": [
              {
                "type": "long",
                "value": "10"
              },
              {
                "type": "string",
                "value": "hello,你好,世界-DataX"
              }
            ]
          }
        },
        "writer": {
          "name": "streamwriter",
          "parameter": {
            "encoding": "UTF-8",
            "print": true
          }
        }
      }
    ],
    "setting": {
      "speed": {
        "channel": 5
      }
    }
  }
}
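
If you need many similar jobs, you can also generate this file instead of editing it by hand. The sketch below builds the same structure as the configuration above in Python; the function names `build_stream_job` and `job_to_json` are hypothetical helpers for this example, not part of DataX.

```python
import json


def build_stream_job(slice_record_count, channel, columns):
    """Return a streamreader -> streamwriter job with the layout shown above."""
    return {
        "job": {
            "content": [
                {
                    "reader": {
                        "name": "streamreader",
                        "parameter": {
                            "sliceRecordCount": slice_record_count,
                            "column": columns,
                        },
                    },
                    "writer": {
                        "name": "streamwriter",
                        "parameter": {"encoding": "UTF-8", "print": True},
                    },
                }
            ],
            "setting": {"speed": {"channel": channel}},
        }
    }


def job_to_json(job):
    """Serialize the job; ensure_ascii=False keeps non-ASCII column values readable."""
    return json.dumps(job, indent=2, ensure_ascii=False)


columns = [
    {"type": "long", "value": "10"},
    {"type": "string", "value": "hello,你好,世界-DataX"},
]
print(job_to_json(build_stream_job(10, 5, columns)))
```

Write the printed text to stream2stream.json with UTF-8 encoding so that the non-ASCII string value survives intact.
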

  • Step 2: Start DataX

$ cd {YOUR_DATAX_HOME}/bin
$ python datax.py ./stream2stream.json

When the synchronization finishes, the log ends with a summary like the following:

...
2015-12-17 11:20:25.263 [job-0] INFO  JobContainer -
Task start time                   : 2015-12-17 11:20:15
Task end time                    : 2015-12-17 11:20:25
Total time consumption     :                 10s
Average task traffic            :                205B/s
Record write speed             :               5rec/s
Total read records               :                   50
Write and Read Failures     :                   0
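
The figures in this summary line up with the job configuration. The short check below assumes that each of the 5 channels emits its own slice of sliceRecordCount = 10 records; that reading is an assumption, although it matches the 50 records and 5 rec/s reported above. The function name `expected_summary` is hypothetical.

```python
def expected_summary(slice_record_count, channel, elapsed_seconds):
    """Estimate total records and records per second for a stream job.

    Assumes one slice of `slice_record_count` records per channel.
    """
    total_records = slice_record_count * channel
    records_per_second = total_records // elapsed_seconds
    return total_records, records_per_second


# Values taken from stream2stream.json and the 10-second run in the log.
print(expected_summary(10, 5, 10))  # (50, 5)
```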

Contact Us

Google Groups: DataX-user
