This topic describes the data types and parameters that are supported by HDFS Writer and how to configure HDFS Writer by using the codeless user interface (UI) and code editor.
Limits
- Alibaba Cloud Apsara File Storage for HDFS is not supported.
- HDFS Writer can write only text, ORC, and Parquet files that store logical two-dimensional tables to HDFS.
- HDFS is a distributed file system and does not have a schema. Therefore, you cannot write data to only some of the columns of a file in HDFS.
- The following Hive data types are not supported: DECIMAL, BINARY, ARRAY, MAP, STRUCT, and UNION.
- HDFS Writer can write data to only one partition in a partitioned Hive table at a time.
- To write a text file to HDFS, make sure that the delimiter in the file is the same as that in the Hive table that you want to associate with the file. This way, you can associate the columns in the file that is written to HDFS with those in the Hive table.
- You can use HDFS Writer in environments in which Hive 1.1.1 and Hadoop 2.7.1 (Java Development Kit 1.7) are installed. HDFS Writer can also write files to HDFS in test environments in which Hive 1.2.0 and Hadoop 2.5.0 or Hadoop 2.6.0 are installed.
How it works
- Creates a temporary directory that does not exist in HDFS based on the path parameter that you specified. The name of the temporary directory is in the format of path_<random suffix>.
- Writes files that are obtained from a reader to the temporary directory.
- Moves the files from the temporary directory to the specified directory after all the files are written. The names of the files that you want to write to HDFS must be different from those of existing files in HDFS.
- Deletes the temporary directory. If HDFS Writer fails to connect to HDFS due to a network interruption, you must manually delete the temporary directory and all the files in the temporary directory.
Data types
HDFS Writer supports most Hive data types. Before you use HDFS Writer, make sure that the data types of your data are supported.
The following table lists the Hive data types that are supported by HDFS Writer.
Category | Hive data type |
---|---|
Integer | TINYINT, SMALLINT, INT, and BIGINT |
Floating point | FLOAT and DOUBLE |
String | CHAR, VARCHAR, and STRING |
Boolean | BOOLEAN |
Date and time | DATE and TIMESTAMP |
Parameters
Parameter | Description | Required | Default value |
---|---|---|---|
datasource | The name of the data source. It must be the same as the name of the added data source. You can add data sources by using the code editor. | Yes | No default value |
defaultFS | The address of the NameNode in HDFS, such as hdfs://127.0.0.1:9000. If you use the shared resource group for Data Integration, you cannot configure the advanced Hadoop parameters that are related to high availability. If you want to configure these parameters, you must use a custom resource group for Data Integration. For more information, see Create a custom resource group for Data Integration. | Yes | No default value |
fileType | The format of the files that you want to write to HDFS. Valid values: text, orc, and parquet. | Yes | No default value |
path | The directory in HDFS to which you want to write files. HDFS Writer writes multiple files to the directory based on the number of parallel threads that you configure. To associate the columns in a file with those in a Hive table, set the path parameter to the storage path of the Hive table in HDFS, such as the storage path that is specified for the Hive data warehouse. | Yes | No default value |
fileName | The name prefix of the files that you want to write to HDFS. A random suffix is appended to the specified prefix to form the actual file name that is used by each thread. | Yes | No default value |
column | The names of the columns to which you want to write data. You cannot write data to only some of the columns in the Hive table. To associate the columns in a file with those in a Hive table, configure the name and type parameters for each column. The name parameter specifies the name of the column, and the type parameter specifies the data type of the column. For the format of the column parameter, see the column configuration in the code editor example in this topic. | Required if the fileType parameter is set to text or orc | No default value |
writeMode | The write mode. Valid values: append: HDFS Writer writes the files based on the specified file name prefix and ensures that the actual file names do not conflict with the names of existing files. nonConflict: HDFS Writer returns an error if a file with the specified file name prefix exists in the destination directory. Note: Parquet files do not support the append mode. To write Parquet files, you must set the writeMode parameter to nonConflict. | Yes | No default value |
fieldDelimiter | The column delimiter that is used in the files you want to write to HDFS. Make sure that you use the same delimiter as that in the Hive table. Otherwise, you cannot query data in the Hive table. | Required if the fileType parameter is set to text or orc | No default value |
compress | The compression format of the files that you want to write to HDFS. By default, this parameter is left empty, which indicates that the files are not compressed. For a text file, the GZIP and BZIP2 compression formats are supported. For an ORC file, the Snappy compression format is supported. To compress an ORC file, you must install SnappyCodec. To install SnappyCodec, submit a ticket. | No | No default value |
encoding | The encoding format of the files that you want to write to HDFS. | No | UTF-8 |
parquetSchema | The schema of the Parquet files that you want to write to HDFS. This parameter is available only if the fileType parameter is set to parquet. Note: Each line of the schema definition, including the last line, must end with a semicolon (;). For a format sketch, see the example after this table. | No | No default value |
hadoopConfig | The settings of the advanced Hadoop parameters that are related to high availability. If you use the shared resource group for Data Integration, you cannot configure the advanced Hadoop parameters that are related to high availability. If you want to configure these parameters, you must use a custom resource group for Data Integration. For more information, see Create a custom resource group for Data Integration. For an example, see the hadoopConfig sketch after this table. | No | No default value |
dataxParquetMode | The synchronization mode for Parquet files. Valid values: fields and columns. If you set this parameter to fields, HDFS Writer can write data of complex data types, such as ARRAY, MAP, and STRUCT. If you set this parameter to fields, HDFS Writer also supports HDFS over Object Storage Service (OSS). In this case, HDFS uses OSS as the storage service, and HDFS Writer writes Parquet files to OSS. You can add the required OSS-related parameters to the hadoopConfig parameter, as shown in the hadoopConfig sketch after this table. | No | columns |
haveKerberos | Specifies whether Kerberos authentication is required. If you set this parameter to true, the kerberosKeytabFilePath and kerberosPrincipal parameters are required. | No | false |
kerberosKeytabFilePath | The absolute path of the keytab file for Kerberos authentication. | Required if the haveKerberos parameter is set to true | No default value |
kerberosPrincipal | The Kerberos principal, such as ****/hadoopclient@**.***. This parameter is required if the haveKerberos parameter is set to true. The absolute path of the keytab file is required for Kerberos authentication. To use Kerberos authentication, you must configure Kerberos authentication on a custom resource group. For a configuration sketch, see the Kerberos example at the end of this topic. | No | No default value |
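The value of the parquetSchema parameter is written in the Parquet message definition syntax. The following is a minimal sketch; the message name m and the column definitions are placeholders rather than values from an actual table:
message m {
    optional int64 id;
    optional double amount;
    optional binary order_name (UTF8);
}
The hadoopConfig parameter is a JSON object of key-value pairs. The following sketch assumes a NameNode high-availability setup that uses a nameservice named testDfs and, for HDFS over OSS, the commonly used Hadoop OSS connector keys. All names, addresses, and AccessKey values are placeholders, and you must confirm the exact keys that your environment requires:
"hadoopConfig": {
    "dfs.nameservices": "testDfs",
    "dfs.ha.namenodes.testDfs": "namenode1,namenode2",
    "dfs.namenode.rpc-address.testDfs.namenode1": "",
    "dfs.namenode.rpc-address.testDfs.namenode2": "",
    "dfs.client.failover.proxy.provider.testDfs": "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider",
    "fs.oss.accessKeyId": "",
    "fs.oss.accessKeySecret": "",
    "fs.oss.endpoint": ""
}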
Configure HDFS Writer by using the codeless UI
- Configure data sources.
Configure Source and Target for the synchronization node.
Parameter | Description |
---|---|
Connection | The name of the data source to which you want to write data. This parameter is equivalent to the datasource parameter that is described in the preceding section. |
File path | The directory in HDFS to which you want to write files. This parameter is equivalent to the path parameter that is described in the preceding section. |
File type | The format of the files that you want to write to HDFS. This parameter is equivalent to the fileType parameter that is described in the preceding section. Valid values: text, orc, and parquet. |
File name | The name prefix of the files that you want to write to HDFS. This parameter is equivalent to the fileName parameter that is described in the preceding section. |
WriteMode | The write mode. This parameter is equivalent to the writeMode parameter that is described in the preceding section. Valid values: append: HDFS Writer writes the files based on the specified file name prefix and ensures that the actual file names do not conflict with the names of existing files. nonConflict: HDFS Writer returns an error if a file with the specified file name prefix exists in the destination directory. Note: Parquet files do not support the append mode. They support only the nonConflict mode. |
FieldDelimiter | The column delimiter that is used in the files you want to write to HDFS. This parameter is equivalent to the fieldDelimiter parameter that is described in the preceding section. Make sure that you use the same delimiter as that in the Hive table. Otherwise, you cannot query data in the Hive table. |
Encoding | The encoding format. This parameter is equivalent to the encoding parameter that is described in the preceding section. Default value: UTF-8. |
Kerberos authentication | Specifies whether Kerberos authentication is required. Default value: No. If you set this parameter to Yes, you must also specify the KeyTab file path and Principal Name parameters. For more information, see Configure Kerberos authentication. |
HadoopConfig | The settings of the advanced Hadoop parameters that are related to high availability. If you use the shared resource group for Data Integration, you cannot configure the advanced Hadoop parameters that are related to high availability. |
- Configure field mappings.
This operation is equivalent to setting the column parameter that is described in the preceding section. By default, Map Fields in the Same Line is used to establish mappings between fields. You can click the edit icon to edit fields. Each field occupies a row. The first and the last blank rows are included, whereas other blank rows are ignored. Make sure that the numbers of fields in the source and destination tables match.
Configure HDFS Writer by using the code editor
For more information about how to configure a synchronization node by using the code editor, see Create a synchronization node by using the code editor.
{
"type": "job",
"version": "2.0",// The version number.
"steps": [
{
"stepType": "stream",
"parameter": {},
"name": "Reader",
"category": "reader"
},
{
"stepType": "hdfs",// The writer type.
"parameter": {
"path": "",// The directory in HDFS to which the files are written.
"fileName": "",// The name prefix of the files that you want to write to HDFS.
"compress": "",// The compression format of the files that you want to write to HDFS.
"datasource": "",// The name of the data source.
"column": [
{
"name": "col1",// The name of a column.
"type": "string"// The data type of a column.
},
{
"name": "col2",
"type": "int"
},
{
"name": "col3",
"type": "double"
},
{
"name": "col4",
"type": "boolean"
},
{
"name": "col5",
"type": "date"
}
],
"writeMode": "",// The write mode.
"fieldDelimiter": ",",// The column delimiter.
"encoding": "",// The encoding format.
"fileType": "text"// The format of the files that you want to write to HDFS.
},
"name": "Writer",
"category": "writer"
}
],
"setting": {
"errorLimit": {
"record": ""// The maximum number of dirty data records allowed.
},
"speed": {
"throttle":true,// Specifies whether to enable bandwidth throttling. The value false indicates that bandwidth throttling is disabled, and the value true indicates that bandwidth throttling is enabled. The mbps parameter takes effect only when the throttle parameter is set to true.
"concurrent":3, // The maximum number of parallel threads.
"mbps":"12"// The maximum transmission rate.
}
},
"order": {
"hops": [
{
"from": "Reader",
"to": "Writer"
}
]
}
}
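To use Kerberos authentication, run the synchronization node on a custom resource group on which Kerberos authentication is configured, and add the Kerberos-related parameters to the writer configuration. The following is a minimal sketch; the keytab path and the principal are placeholders, not real values:
"parameter": {
    "defaultFS": "",// The address of the NameNode in HDFS.
    "path": "",// The directory in HDFS to which the files are written.
    "fileName": "",// The name prefix of the files that you want to write to HDFS.
    "fileType": "text",// The format of the files that you want to write to HDFS.
    "haveKerberos": true,// Specifies that Kerberos authentication is required.
    "kerberosKeytabFilePath": "/opt/keytabs/user.keytab",// Placeholder: the absolute path of the keytab file on the custom resource group.
    "kerberosPrincipal": "user/hadoopclient@EXAMPLE.COM"// Placeholder: the Kerberos principal.
}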