This topic describes the data types and parameters that are supported by HDFS Writer and how to configure HDFS Writer by using the codeless user interface (UI) and code editor.

HDFS Writer can write text, Optimized Row Columnar (ORC), or Parquet files to a specified directory in Hadoop Distributed File System (HDFS). You can associate the columns in the files with the columns in Hive tables. Before you configure HDFS Writer, you must configure an HDFS data source. For more information, see Configure an HDFS connection.
Note HDFS Writer supports only exclusive resource groups for Data Integration, but not the shared resource group or custom resource groups for Data Integration. For more information, see Create and use an exclusive resource group for Data Integration, Use a shared resource group, and Create a custom resource group for Data Integration.

Limits

  • HDFS Writer can write only text, ORC, and Parquet files that store logical two-dimensional tables to HDFS.
  • HDFS is a distributed file system and does not have a schema. Therefore, you cannot write data only to specific columns of a file in HDFS.
  • Hive data types such as DECIMAL, BINARY, ARRAY, MAP, STRUCT, and UNION are not supported.
  • HDFS Writer can write data to only one partition in a partitioned Hive table at a time.
  • To write a text file to HDFS, make sure that the delimiter in the file is the same as the delimiter in the Hive table that you want to associate with the file. Otherwise, the columns in the file that is written to HDFS cannot be associated with the columns in the Hive table. For an illustration, see the sketch after this list.
  • You can use HDFS Writer in environments in which Hive 1.1.1 and Hadoop 2.7.1 (Java Development Kit 1.7) are installed. HDFS Writer can also write files to HDFS in test environments in which Hive 1.2.0 and Hadoop 2.5.0 or Hadoop 2.6.0 are installed.
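
The following fragment illustrates the delimiter requirement from the preceding list. It is a minimal sketch rather than a complete configuration: the file name is a placeholder, and the sketch assumes that the associated Hive table was created with a comma as its field delimiter.
```json
{
    "fileType": "text",
    "path": "/user/hive/warehouse/test.db/hello",
    "fileName": "hello",
    "fieldDelimiter": ",",// Must match the field delimiter of the associated Hive table.
    "writeMode": "append"
}
```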

How it works

HDFS Writer writes files to HDFS in the following way:
  1. Creates a temporary directory in HDFS whose name does not conflict with the names of existing directories. The name is generated based on the path parameter that you specified.

    The temporary directory is named in the format path_random suffix. For example, if the path parameter is set to /user/hive/warehouse/test.db/hello, the temporary directory may be named /user/hive/warehouse/test.db/hello_<random suffix>.

  2. Writes files that are obtained from a reader to the temporary directory.
  3. Moves the files from the temporary directory to the specified directory after all the files are written. The names of the files that you want to write to HDFS must be different from those of existing files in HDFS.
  4. Deletes the temporary directory. If HDFS Writer fails to connect to HDFS due to a network interruption, you must manually delete the temporary directory and all the files in the temporary directory.
Note To synchronize data, you must use an administrator account that has read and write permissions on the specific files.

Data types

HDFS Writer supports most Hive data types. Make sure that the data types that you use are supported.

The following table lists the Hive data types that are supported by HDFS Writer.

Note The data types of the specified columns in the file must be the same as those of the columns in the Hive table.
| Category | Hive data type |
| --- | --- |
| Integer | TINYINT, SMALLINT, INT, and BIGINT |
| Floating point | FLOAT and DOUBLE |
| String | CHAR, VARCHAR, and STRING |
| Boolean | BOOLEAN |
| Date and time | DATE and TIMESTAMP |

Parameters

Parameter Description Required Default value
defaultFS The address of the NameNode in HDFS, such as hdfs://127.0.0.1:9000. Yes No default value
fileType The format of the files that you want to write to HDFS. Valid values: text, orc, and parquet.
  • text: a text file that maps to a Hive table that is stored as text
  • orc: an ORC file that maps to a compressed Hive table
  • parquet: a standard Parquet file
Yes No default value
path The directory in HDFS to which you want to write files. HDFS Writer writes multiple files to the directory based on the configuration of parallel threads.

To associate the columns in a file with those in a Hive table, set the path parameter to the storage path of the Hive table in HDFS. For example, the storage path that is specified for the Hive data warehouse is /user/hive/warehouse/. In this case, the storage path of the hello table that is created in the test database is /user/hive/warehouse/test.db/hello.

Yes No default value
fileName The name prefix of the files that you want to write to HDFS. A random suffix is appended to the specified prefix to form the actual file name that is used by each thread. Yes No default value
column The names of the columns to which you want to write data. You cannot write data only to some columns in the Hive table.

To associate the columns in a file with those in a Hive table, configure the name and type parameters for each column. The name parameter specifies the name of the column, and the type parameter specifies the data type of the column.

You can specify the column parameter in the following format:
"column": 
[
    {
        "name": "userName",
        "type": "string"
    },
    {
        "name": "age",
        "type": "long"
    }
]
Required if the fileType parameter is set to text or orc No default value
writeMode The write mode. Valid values:
  • append: HDFS Writer writes the files based on the specified file name prefix and ensures that the actual file names do not conflict with those of existing files.
  • nonConflict: HDFS Writer returns an error if a file with the specified file name prefix exists in the destination directory.
  • truncate: HDFS Writer deletes all existing files whose names start with the specified file name prefix from the destination directory before files are written to the directory. For example, if you set fileName to abc, all existing files whose names start with abc are deleted from the destination directory.
Note Parquet files do not support the append mode. To write Parquet files, you must set the writeMode parameter to nonConflict.
Yes No default value
fieldDelimiter The column delimiter that is used in the files you want to write to HDFS. Make sure that you use the same delimiter as that in the Hive table. Otherwise, you cannot query data in the Hive table. Required if the fileType parameter is set to text or orc No default value
compress The compression format of the files that you want to write to HDFS. By default, this parameter is left empty, which indicates that the files are not compressed.

For a text file, the GZIP and BZIP2 compression formats are supported. For an ORC file, the Snappy compression format is supported. To compress an ORC file, you must install SnappyCodec. To install SnappyCodec, submit a ticket.

No No default value
encoding The encoding format of the files that you want to write to HDFS. No UTF-8
parquetSchema The schema of the Parquet files that you want to write to HDFS. This parameter is available only if the fileType parameter is set to parquet. Format:
message messageTypeName {
required dataType columnName;
......................;
}
Parameters:
  • messageTypeName: the name of the MessageType object.
  • required or optional: specifies whether the column can be left empty. required indicates that the column cannot be left empty. optional indicates that the column can be left empty. We recommend that you use optional for all columns.
  • dataType: Parquet files support various data types, such as BOOLEAN, INT32, INT64, INT96, FLOAT, DOUBLE, BINARY, and FIXED_LEN_BYTE_ARRAY. Set this parameter to BINARY if the column stores strings.
Note Each line, including the last line, must end with a semicolon (;).
Example:
message m {
optional int64 id;
optional int64 date_id;
optional binary datetimestring;
optional int32 dspId;
optional int32 advertiserId;
optional int32 status;
optional int64 bidding_req_num;
optional int64 imp;
optional int64 click_num;
}
No No default value
hadoopConfig The settings of the advanced Hadoop parameters that are related to high availability. If you use the shared resource group for Data Integration, you cannot configure these parameters. To configure them, use a custom resource group for Data Integration. For more information, see Create a custom resource group for Data Integration.
"hadoopConfig":{
"dfs.nameservices": "testDfs",
"dfs.ha.namenodes.testDfs": "namenode1,namenode2",
"dfs.namenode.rpc-address.youkuDfs.namenode1": "",
"dfs.namenode.rpc-address.youkuDfs.namenode2": "",
"dfs.client.failover.proxy.provider.testDfs": "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"
}
No No default value
dataxParquetMode The synchronization mode for Parquet files. Valid values: fields and columns. If you set this parameter to fields, HDFS Writer can write data of complex data types, such as ARRAY, MAP, and STRUCT.
If you set this parameter to fields, HDFS Writer supports HDFS over Object Storage Service (OSS). In this case, HDFS uses OSS as the storage service, and HDFS Writer writes Parquet files to OSS. You can add the following OSS-related parameters in the hadoopConfig parameter:
  • fs.oss.accessKeyId: the AccessKey ID of the account that you can use to connect to OSS
  • fs.oss.accessKeySecret: the AccessKey secret of the account that you can use to connect to OSS
  • fs.oss.endpoint: the endpoint of OSS
The following sample code provides an example of how to connect to OSS:
```json
"writer": {
    "name": "hdfswriter",
    "parameter": {
        "defaultFS": "oss://test-bucket",
        "fileType": "parquet",
        "path": "/datasets/oss_demo/kpt",
        "fileName": "test",
        "writeMode": "truncate",
        "compress": "SNAPPY",
        "encoding": "UTF-8",
        "hadoopConfig": {
            "fs.oss.accessKeyId": "the-access-id",
            "fs.oss.accessKeySecret": "the-access-key",
            "fs.oss.endpoint": "oss-cn-hangzhou.aliyuncs.com"
        },
        "parquetSchema": "message test {\n    required int64 id;\n    optional binary name (UTF8);\n    optional int64 gmt_create;\n    required group map_col (MAP) {\n        repeated group key_value {\n            required binary key (UTF8);\n            required binary value (UTF8);\n        }\n    }\n    required group array_col (LIST) {\n        repeated group list {\n            required binary element (UTF8);\n        }\n    }\n    required group struct_col {\n        required int64 id;\n        required binary name (UTF8);\n    }    \n}",
        "dataxParquetMode": "fields"
    }
}
```
No columns
haveKerberos Specifies whether Kerberos authentication is required. If you set this parameter to true, the kerberosKeytabFilePath and kerberosPrincipal parameters are required. No false
kerberosKeytabFilePath The absolute path of the keytab file for Kerberos authentication. Required if the haveKerberos parameter is set to true No default value
kerberosPrincipal The Kerberos principal, such as ****/hadoopclient@**.***. This parameter is required if the haveKerberos parameter is set to true.
The absolute path of the keytab file is required for Kerberos authentication. To use Kerberos authentication, you must configure Kerberos authentication on a custom resource group. The following code provides a configuration example:
"haveKerberos":true,
"kerberosKeytabFilePath":"/opt/datax/**.keytab",
"kerberosPrincipal":"**/hadoopclient@**.**"
Required if the haveKerberos parameter is set to true No default value
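
The following sketch combines several of the parameters in the preceding table into one writer configuration that writes a Snappy-compressed ORC file with Kerberos authentication enabled. It is only an illustration: the NameNode address, file name, keytab path, and principal are placeholders, not values from this topic.
```json
"writer": {
    "name": "hdfswriter",
    "parameter": {
        "defaultFS": "hdfs://namenode-host:9000",// Placeholder NameNode address.
        "fileType": "orc",
        "path": "/user/hive/warehouse/test.db/hello",// The storage path of the associated Hive table.
        "fileName": "hello",// Placeholder file name prefix.
        "writeMode": "nonConflict",
        "fieldDelimiter": ",",
        "compress": "SNAPPY",// Requires SnappyCodec for ORC files.
        "column": [
            {
                "name": "id",
                "type": "long"
            },
            {
                "name": "name",
                "type": "string"
            }
        ],
        "haveKerberos": true,
        "kerberosKeytabFilePath": "/opt/datax/hello.keytab",// Placeholder keytab path.
        "kerberosPrincipal": "hello/hadoopclient@EXAMPLE.COM"// Placeholder principal.
    }
}
```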

Configure HDFS Writer by using the codeless UI

This method is not supported.

Configure HDFS Writer by using the code editor

For more information about how to configure a synchronization node by using the code editor, see Create a sync node by using the code editor.

In the following code, a synchronization node is configured to write data to HDFS. For more information about the parameters, see the preceding parameter description.
```json
{
    "type": "job",
    "version": "2.0",// The version number. 
    "steps": [
        { 
            "stepType": "stream",
            "parameter": {},
            "name": "Reader",
            "category": "reader"
        },
        {
            "stepType": "hdfs",// The writer type. 
            "parameter": {
                "path": "",// The directory in HDFS to which the files are written. 
                "fileName": "",// The name prefix of the files that you want to write to HDFS. 
                "compress": "",// The compression format of the files that you want to write to HDFS. 
                "datasource": "",// The name of the data source. 
                "column": [
                    {
                        "name": "col1",// The name of a column. 
                        "type": "string"// The data type of a column. 
                    },
                    {
                        "name": "col2",
                        "type": "int"
                    },
                    {
                        "name": "col3",
                        "type": "double"
                    },
                    {
                        "name": "col4",
                        "type": "boolean"
                    },
                    {
                        "name": "col5",
                        "type": "date"
                    }
                ],
                "writeMode": "",// The write mode. 
                "fieldDelimiter": ",",// The column delimiter. 
                "encoding": "",// The encoding format. 
                "fileType": "text"// The format of the files that you want to write to HDFS. 
            },
            "name": "Writer",
            "category": "writer"
        }
    ],
    "setting": {
        "errorLimit": {
            "record": ""// The maximum number of dirty data records allowed. 
        },
        "speed": {
            "throttle":true,// Specifies whether to enable bandwidth throttling. The value false indicates that bandwidth throttling is disabled, and the value true indicates that bandwidth throttling is enabled. The mbps parameter takes effect only when the throttle parameter is set to true. 
            "concurrent":3, // The maximum number of parallel threads. 
            "mbps":"12"// The maximum transmission rate.
        }
    },
    "order": {
        "hops": [
            {
                "from": "Reader",
                "to": "Writer"
            }
        ]
    }
}
```