Gordon
Assistant Engineer

[Others]Programming Guide for DataX Plugin

Posted time: Sep 13, 2016 16:23
Introduction
DataX is an offline data synchronization tool/platform that is widely used within Alibaba Group. It synchronizes data efficiently among heterogeneous data sources, including MySQL, Oracle, HDFS, Hive, OceanBase, HBase, OTS, and ODPS. DataX uses a framework + plug-in architecture and is open source, with the code hosted on GitHub.


Address of hosted code:
https://github.com/alibaba/DataX
For development of plug-ins, see:
https://github.com/alibaba/DataX/wiki/DataX%E6%8F%92%E4%BB%B6%E5%BC%80%E5%8F%91%E5%AE%9D%E5%85%B8


New modules
First, clone the code from https://github.com/alibaba/DataX and run the following command:
mvn clean package -DskipTests assembly:assembly

The command generates datax.tar.gz, which can be deployed to the target environment for data import and export.
Once a plug-in has been developed and tested, the same command integrates it into the DataX package.


Then, import the code into Eclipse (or another IDE) and create a new Maven module under the project.
After the configuration described in the next section, the module is ready for development.


Configuration
DataX uses Maven to manage the project. The framework and the plug-ins are organized as modules of one multi-module build, and every module is packaged with maven-assembly-plugin.
Two things need to be configured during development:
1. The plug-in module itself: its directories and files must follow the DataX convention.
2. The DataX main module: two files need to be changed, ./pom.xml and ./package.xml.
Plug-in configurations:
A newly created plug-in module should have the following source layout: pom.xml at the module root, src/main/assembly/package.xml, and plugin.json and plugin_job_template.json under src/main/resources.


POM configurations:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <parent>
        <groupId>com.alibaba.datax</groupId>
        <artifactId>datax-all</artifactId>
        <version>0.0.1-SNAPSHOT</version>
    </parent>
    <modelVersion>4.0.0</modelVersion>

    <artifactId>qvxfilereader</artifactId>

    <properties>
        <datax-project-version>0.0.1-SNAPSHOT</datax-project-version>
    </properties>

    <dependencies>
        <dependency>
            <groupId>com.alibaba.datax</groupId>
            <artifactId>datax-common</artifactId>
            <version>${datax-project-version}</version>
        </dependency>
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <plugin>
                <artifactId>maven-assembly-plugin</artifactId>
                <configuration>
                    <descriptors>
                        <descriptor>src/main/assembly/package.xml</descriptor>
                    </descriptors>
                    <finalName>datax</finalName>
                </configuration>
                <executions>
                    <execution>
                        <id>dwzip</id>
                        <phase>package</phase>
                        <goals>
                            <goal>single</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>


Notes:
1. A dependency on datax-common is added. This is how the plug-in depends on the DataX framework.
2. The maven-assembly-plugin is configured to use src/main/assembly/package.xml. This package.xml defines the directory structure produced by packaging, and that structure must follow the convention DataX expects. Details are given below:
<assembly
        xmlns="http://maven.apache.org/plugins/maven-assembly-plugin/assembly/1.1.0"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="http://maven.apache.org/plugins/maven-assembly-plugin/assembly/1.1.0 http://maven.apache.org/xsd/assembly-1.1.0.xsd">
    <id></id>
    <formats>
        <format>dir</format>
    </formats>
    <includeBaseDirectory>false</includeBaseDirectory>
    <fileSets>
        <fileSet>
            <directory>src/main/resources</directory>
            <includes>
                <include>plugin.json</include>
                <include>plugin_job_template.json</include>
            </includes>
            <outputDirectory>plugin/reader/qvxfilereader</outputDirectory>
        </fileSet>
        <fileSet>
            <directory>target/</directory>
            <includes>
                <include>qvxfilereader-0.0.1-SNAPSHOT.jar</include>
            </includes>
            <outputDirectory>plugin/reader/qvxfilereader</outputDirectory>
        </fileSet>
    </fileSets>

    <dependencySets>
        <dependencySet>
            <useProjectArtifact>false</useProjectArtifact>
            <outputDirectory>plugin/reader/qvxfilereader/libs</outputDirectory>
            <scope>runtime</scope>
        </dependencySet>
    </dependencySets>
</assembly>


Besides the two .xml files above, two .json files are required.
plugin.json describes the plug-in, and the framework uses this file to load it. For example:
{
    "name": "qvxfilereader",
    "class": "com.alibaba.datax.plugin.reader.qvxfilereader.QvxFileReader",
    "description": "useScene: test. mechanism: use datax framework to transport data from qvx file. warn: The more you know about the data, the less problems you encounter.",
    "developer": "dtstack.com"
}
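
To illustrate why the "class" value in plugin.json has to be a fully qualified, loadable class name, here is a minimal sketch of reflection-based loading. It is not DataX's actual loader. PluginClassSketch and instantiate are hypothetical names, and the sketch assumes the named class is on the classpath and has a no-argument constructor.

```java
// Hypothetical sketch: NOT DataX's loader, only an illustration of
// turning a plugin.json-style "class" string into an object.
class PluginClassSketch {
    /**
     * Instantiates the class with the given fully qualified name.
     * Assumes the class is on the classpath and has a no-argument constructor.
     */
    static Object instantiate(String className) {
        try {
            Class<?> clazz = Class.forName(className);
            return clazz.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            // A misspelled or missing class name ends up here.
            throw new IllegalArgumentException("Cannot load plug-in class: " + className, e);
        }
    }
}
```

For example, PluginClassSketch.instantiate("java.util.ArrayList") returns a new list, while a name that does not resolve to a class raises IllegalArgumentException. A typo in the "class" field of plugin.json would fail in a comparable way at load time.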


plugin_job_template.json: the configuration template for the plug-in. When using the plug-in, users write their job configuration according to this template. For example:
{
    "name": "qvxfilereader",
    "parameter": {
        "path": [],
        "fieldDelimiter": ""
    }
}


Configurations of DataX main modules
In the root pom.xml, add the module name of the plug-in. The module name is the plug-in's artifactId.
<modules>
    <module>qvxfilereader</module>
</modules>


In the root package.xml, add the packaged output of this plug-in so that DataX includes it in the overall plug-in distribution.
<fileSet>
    <directory>qvxfilereader/target/datax/</directory>
    <includes>
        <include>**/*.*</include>
    </includes>
    <outputDirectory>datax</outputDirectory>
</fileSet>


Development
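Below is pseudocode showing how a plug-in writes data into the channel through RecordSender:
// Called in the reader task; every record sent here is passed on to the writer side.
public void startRead(RecordSender recordSender) {
    // Build one record with a long column and a string column.
    Record record = recordSender.createRecord();
    record.addColumn(new LongColumn(1));
    record.addColumn(new StringColumn("hello,world!"));
    // Hand the record to the channel, then flush anything still buffered.
    recordSender.sendToWriter(record);
    recordSender.flush();
}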
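
The pseudocode above sends a single hard-coded record. As a rough sketch of how the same three calls (createRecord, sendToWriter, flush) might be driven by real input, the example below splits delimited lines into columns and sends one record per line. StubRecord and StubRecordSender are simplified stand-ins declared locally so that the example compiles on its own. The real Record and RecordSender come from datax-common and work with Column objects rather than plain strings. LineReaderSketch and its parameters are hypothetical, and a non-empty fieldDelimiter is assumed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Stand-in for a record: just an ordered list of string columns.
class StubRecord {
    final List<String> columns = new ArrayList<>();
    void addColumn(String value) { columns.add(value); }
}

// Stand-in for a record sender: collects records in memory instead of a channel.
class StubRecordSender {
    final List<StubRecord> sent = new ArrayList<>();
    boolean flushed = false;
    StubRecord createRecord() { return new StubRecord(); }
    void sendToWriter(StubRecord record) { sent.add(record); }
    void flush() { flushed = true; }
}

class LineReaderSketch {
    // Splits each line on the literal delimiter and sends one record per line.
    // Assumes fieldDelimiter is non-empty.
    static void startRead(List<String> lines, String fieldDelimiter, StubRecordSender sender) {
        // Quote the delimiter so characters such as '|' are not treated as regex syntax.
        String regex = Pattern.quote(fieldDelimiter);
        for (String line : lines) {
            StubRecord record = sender.createRecord();
            // The -1 limit keeps trailing empty fields instead of dropping them.
            for (String field : line.split(regex, -1)) {
                record.addColumn(field);
            }
            sender.sendToWriter(record);
        }
        // Flush once after all records have been sent.
        sender.flush();
    }
}
```

In a real plug-in the loop body would wrap each field in the appropriate Column type before adding it to the record, as the pseudocode does with LongColumn and StringColumn.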


Testing
When development is complete, run the following command to build DataX:
mvn clean package -DskipTests assembly:assembly

The resulting package can then be used for testing.
Because each run of this command recompiles and repackages every plug-in in DataX, it is relatively slow. To speed it up, you can edit the pom.xml and package.xml of DataX to keep only the common modules and the plug-in under development.