This topic describes the data types and parameters supported by HDFS Writer and how to configure it by using the code editor.

HDFS Writer allows you to write text, Optimized Row Columnar (ORC), or Parquet files to the specified directory in Hadoop Distributed File System (HDFS). In addition, you can associate the fields in the files with those in Hive tables. You must configure a connection before configuring HDFS Writer. For more information, see Configure an HDFS connection.


Limits

  • Currently, HDFS Writer can write only text, ORC, and Parquet files that store logical two-dimensional tables to HDFS.
  • HDFS is a distributed file system and does not have a schema. Therefore, you cannot write only some columns in a file to HDFS.
  • Currently, Hive data types such as DECIMAL, BINARY, ARRAY, MAP, STRUCT, and UNION are not supported.
  • HDFS Writer can write data to only one partition in a partitioned Hive table at a time.
  • To write a text file to HDFS, make sure that the delimiter in the file is the same as that in the Hive table to be associated with the file. Otherwise, you cannot associate the fields in the file stored in HDFS with those in the Hive table (see the example after this list).
  • Currently, HDFS Writer can be used in environments where Hive 1.1.1 and Hadoop 2.7.1 (JDK 1.7) are installed. HDFS Writer can also write files to HDFS properly in test environments where Hadoop 2.5.0, Hadoop 2.6.0, or Hive 1.2.0 is installed.
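
For example, the following fragment is a minimal sketch that assumes a hypothetical Hive table stored as a text table whose fields are terminated by a tab character (\t). The fieldDelimiter parameter must specify the same character:

```json
"parameter": {
    "fileType": "text",
    "fieldDelimiter": "\t"// Must match the delimiter of the associated Hive table, for example, FIELDS TERMINATED BY '\t'.
}
```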

How it works

HDFS Writer writes files to HDFS in the following way:
  1. Creates a temporary directory in HDFS based on the path parameter that you specified. The name of the temporary directory does not conflict with that of any existing directory.

    The name of the temporary directory is in the path_<random suffix> format.

  2. Writes files that are read by a Data Integration reader to the temporary directory.
  3. After all the files are written, moves the files in the temporary directory to the specified directory in HDFS. HDFS Writer guarantees that the file names do not conflict with existing files in HDFS when moving the files.
  4. Deletes the temporary directory. If the deletion is interrupted because HDFS Writer fails to connect to HDFS, you must manually delete the temporary directory and files that are written to the directory.
Note To synchronize data, use an administrator account that has read and write permissions.

Data types

HDFS Writer supports most Hive data types. Make sure that your data types are supported.

The following table lists the Hive data types supported by HDFS Writer.

Note The types of the specified columns must be the same as those of columns in the Hive table.
| Category | Hive data type |
| --- | --- |
| Integer | TINYINT, SMALLINT, INT, and BIGINT |
| Floating point | FLOAT and DOUBLE |
| String | CHAR, VARCHAR, and STRING |
| Boolean | BOOLEAN |
| Date and time | DATE and TIMESTAMP |
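
For example, assume a hypothetical Hive table that contains an INT column, a STRING column, a DOUBLE column, and a DATE column. A minimal sketch of the corresponding column configuration, using the same types as the Hive table, is as follows:

```json
"column": [
    {
        "name": "id",
        "type": "int"// Matches the INT column in the hypothetical Hive table.
    },
    {
        "name": "name",
        "type": "string"// Matches the STRING column.
    },
    {
        "name": "score",
        "type": "double"// Matches the DOUBLE column.
    },
    {
        "name": "created",
        "type": "date"// Matches the DATE column.
    }
]
```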

Parameters

The following parameters are supported. For each parameter, the description notes whether the parameter is required and its default value.
defaultFS: The address of the HDFS NameNode, such as hdfs://127.0.0.1:9000. The default resource group does not support configuring advanced Hadoop parameters related to the high availability feature. In this case, you can add a custom resource group. For more information, see Add a custom resource group. Required: Yes. Default value: None.
fileType: The format of the files to be written to HDFS. Valid values:
  • text: the text file format.
  • orc: the ORC file format.
  • parquet: the Parquet file format.
Required: Yes. Default value: None.
path: The directory in HDFS to which the files are written. HDFS Writer writes multiple files to the directory concurrently based on the concurrency setting.

To associate the fields in a file with those in a Hive table, set the path parameter to the storage path of the Hive table in HDFS. For example, if the storage path specified for the Hive data warehouse is /user/hive/warehouse/, the storage path of the hello table created in the test database is /user/hive/warehouse/test.db/hello.

Required: Yes. Default value: None.
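
For example, to write files for the hello table mentioned above, a minimal sketch of the related settings is as follows. The NameNode address is only illustrative:

```json
"defaultFS": "hdfs://127.0.0.1:9000",// An illustrative NameNode address.
"path": "/user/hive/warehouse/test.db/hello"// The storage path of the hello table in the test database.
```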
fileName: The name prefix of the files to be written to HDFS. A random suffix is appended to the specified prefix to form the actual file name used by each thread. Required: Yes. Default value: None.
column: The columns to be written to HDFS. You cannot write only some of the columns in a file to HDFS.

To associate the fields in a file with those in a Hive table, specify the name and type parameters for each field.

You can also specify the column parameter in the following way:
"column": 
[
    {
        "name": "userName",
        "type": "string"
    },
    {
        "name": "age",
        "type": "long"
    }
]
Required: Yes. This parameter is not required if the fileType parameter is set to parquet. Default value: None.
writeMode: The mode in which HDFS Writer writes the files. Valid values:
  • append: writes the files based on the specified file name prefix and guarantees that the actual file names do not conflict with those of existing files.
  • nonConflict: returns an error if a file with the specified file name prefix exists in the destination directory.
  • truncate: deletes all existing files with the specified file name prefix in the destination directory before writing files to the directory. For example, if you set the fileName parameter to abc, all files whose names start with abc are deleted.
Note Parquet files do not support the append mode. They support only the nonConflict mode.
Required: Yes. Default value: None.
fieldDelimiter: The column delimiter used in the files to be written to HDFS. Make sure that you use the same delimiter as that in the Hive table. Otherwise, you cannot query data in the Hive table. Required: Yes. This parameter is not required if the fileType parameter is set to parquet. Default value: None.
compress: The compression format of the files to be written to HDFS. By default, this parameter is left empty, that is, files are not compressed.

For a text file, the GZIP and BZIP2 compression formats are supported. For an ORC file, the SNAPPY compression format is supported. To compress an ORC file, you must install SnappyCodec.

Required: No. Default value: None.
encoding: The encoding format of the files to be written to HDFS. Required: No. Default value: None.
parquetSchema: The schema of the files to be written to HDFS. This parameter is required only when the fileType parameter is set to parquet. Format:
message messageTypeName {
required dataType columnName;
......;
}

The parameters are described as follows:

  • messageTypeName: the name of the MessageType object.
  • required: specifies whether the field is required or optional. We recommend that you set the parameter to optional for all fields.
  • dataType: the type of the field. Valid values: BOOLEAN, INT32, INT64, INT96, FLOAT, DOUBLE, BINARY, and FIXED_LEN_BYTE_ARRAY. Set this parameter to BINARY if the field stores strings.
Note Each line, including the last one, must end with a semicolon (;).
Example:
message m {
optional int64 id;
optional int64 date_id;
optional binary datetimestring;
optional int32 dspId;
optional int32 advertiserId;
optional int32 status;
optional int64 bidding_req_num;
optional int64 imp;
optional int64 click_num;
}
Required: No. Default value: None.
hadoopConfig: The advanced parameter settings of Hadoop, such as those related to high availability. The default resource group does not support configuring advanced Hadoop parameters related to the high availability feature. In this case, you can add a custom resource group. For more information, see Add a custom resource group. Example:
"hadoopConfig":{
"dfs.nameservices": "testDfs",
"dfs.ha.namenodes.testDfs": "namenode1,namenode2",
"dfs.namenode.rpc-address.testDfs.namenode1": "",
"dfs.namenode.rpc-address.testDfs.namenode2": "",
"dfs.client.failover.proxy.provider.testDfs": "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"
}
Required: No. Default value: None.
dataxParquetMode: The synchronization mode for Parquet files. Valid values: fields and columns. If you set this parameter to fields, HDFS Writer can write data of complex types, such as ARRAY, MAP, and STRUCT.
If the dataxParquetMode parameter is set to fields, HDFS Writer also supports HDFS over Object Storage Service (OSS). That is, HDFS uses OSS as the storage service, and HDFS Writer writes Parquet files to OSS. In this case, you can add the following OSS-related parameters in the hadoopConfig parameter:
  • fs.oss.accessKeyId: the AccessKey ID for accessing OSS.
  • fs.oss.accessKeySecret: the AccessKey secret for accessing OSS.
  • fs.oss.endpoint: the endpoint for accessing OSS.
Example:
```json
"writer": {
    "name": "hdfswriter",
    "parameter": {
        "defaultFS": "oss://test-bucket",
        "fileType": "parquet",
        "path": "/datasets/oss_demo/kpt",
        "fileName": "test",
        "writeMode": "truncate",
        "compress": "SNAPPY",
        "encoding": "UTF-8",
        "hadoopConfig": {
            "fs.oss.accessKeyId": "the-access-id",
            "fs.oss.accessKeySecret": "the-access-key",
            "fs.oss.endpoint": "oss-cn-hangzhou.aliyuncs.com"
        },
        "parquetSchema": "message test {\n    required int64 id;\n    optional binary name (UTF8);\n    optional int64 gmt_create;\n    required group map_col (MAP) {\n        repeated group key_value {\n            required binary key (UTF8);\n            required binary value (UTF8);\n        }\n    }\n    required group array_col (LIST) {\n        repeated group list {\n            required binary element (UTF8);\n        }\n    }\n    required group struct_col {\n        required int64 id;\n        required binary name (UTF8);\n    }    \n}",
        "dataxParquetMode": "fields"
    }
}
```
Required: No. Default value: columns.
haveKerberos: Specifies whether Kerberos authentication is required. If you set this parameter to true, you must also set the kerberosKeytabFilePath and kerberosPrincipal parameters. Required: No. Default value: false.
kerberosKeytabFilePath: The absolute path of the keytab file for Kerberos authentication. Required: Yes, if the haveKerberos parameter is set to true. Default value: None.
kerberosPrincipal: The Kerberos principal to which Kerberos can assign tickets. Example: ****/hadoopclient@**.***. Required: Yes, if the haveKerberos parameter is set to true. Default value: None.
Note The absolute path of the keytab file is required for Kerberos authentication. Therefore, you can configure Kerberos authentication only on a custom resource group. Example:
"haveKerberos":true,
"kerberosKeytabFilePath":"/opt/datax/**.keytab",
"kerberosPrincipal":"**/hadoopclient@**.**"

Configure HDFS Writer by using the codeless UI

Currently, the codeless user interface (UI) is not supported for HDFS Writer.

Configure HDFS Writer by using the code editor

In the following code, a node is configured to write files to HDFS. For more information about the parameters, see the preceding parameter description.
{
    "type": "job",
    "version": "2.0",// The version number.
    "steps": [
        { 
            "stepType": "stream",
            "parameter": {},
            "name": "Reader",
            "category": "reader"
        },
        {
            "stepType": "hdfs",// The reader type.
            "parameter": {
                "path": "",// The directory in HDFS to which the files are written.
                "fileName": "",// The name prefix of the files to be written to HDFS.
                "compress": "",// The compression format of the files.
                "datasource": "",// The connection name.
                "column": [
                    {
                        "name": "col1",// The name of the column.
                        "type": "string"// The data type of the column.
                    },
                    {
                        "name": "col2",
                        "type": "int"
                    },
                    {
                        "name": "col3",
                        "type": "double"
                    },
                    {
                        "name": "col4",
                        "type": "boolean"
                    },
                    {
                        "name": "col5",
                        "type": "date"
                    }
                ],
                "writeMode": "",// The write mode.
                "fieldDelimiter": ",",// The column delimiter.
                "encoding": "",// The encoding format.
                "fileType": "text"// The file format.
            },
            "name": "Writer",
            "category": "writer"
        }
    ],
    "setting": {
        "errorLimit": {
            "record": ""// The maximum number of dirty data records allowed.
        },
        "speed": {
            "concurrent": 3,// The maximum number of concurrent threads.
            "throttle": false,// Specifies whether to enable bandwidth throttling. A value of false indicates that the bandwidth is not throttled. A value of true indicates that the bandwidth is throttled. The maximum transmission rate takes effect only if you set this parameter to true.
        }
    },
    "order": {
        "hops": [
            {
                "from": "Reader",
                "to": "Writer"
            }
        ]
    }
}