
Configure HDFS Writer

Last Updated: Apr 03, 2018

HDFS Writer is used to write TextFile, ORCFile, and ParquetFile to the specified path in the HDFS file system. The files can be associated with Hive tables. You must configure the data source before configuring the HDFS Writer plug‑in. For more information, see HDFS data source config.

HDFS Writer works as follows:

  1. Based on the path you specify, create a temporary directory that does not yet exist in the HDFS file system (naming rule: path_random), and write the files that have been read to this temporary directory.

  2. After all the files have been written to the temporary directory, move them to the directory you specified, making sure that no file name conflicts occur, and then delete the temporary directory. If the connection to HDFS is lost during this process, for example because of a network interruption, manually delete the temporary directory and the files already written to it.

    Note:

    For data synchronization, an admin account with read and write permissions on the relevant files is required.

    Create an admin user and home directory, specify a user group and an additional group, and grant the required permissions on the files.

The HDFS Writer currently has the following limitations:

  • It only supports the TextFile, ORCFile, and ParquetFile formats, and the file content must be a two-dimensional table in the logical sense.

  • HDFS is a file system and has no schema. Therefore, writing only some of the columns is not supported.

  • Only the following Hive data types are supported:

    • Numeric: TINYINT, SMALLINT, INT, BIGINT, FLOAT, and DOUBLE
    • String: STRING, VARCHAR, and CHAR
    • Boolean: BOOLEAN
    • Time: DATE and TIMESTAMP
  • Unsupported Hive data types are as follows:

    DECIMAL, BINARY, ARRAY, MAP, STRUCT, and UNION

  • For Hive partition tables, data can only be written to one partition at a time (see the path sketch after this list).

  • For the TextFile format, make sure that the delimiter in the files written to HDFS is identical to the one used in the table created in Hive, so that the data written to HDFS can be associated with the fields of the Hive table.

  • In the current plug-in, the Hive version is 1.1.1 and the Hadoop version is 2.7.1 (Apache builds, compatible with JDK 1.7). Data is written normally in test environments with Hadoop 2.5.0, Hadoop 2.6.0, and Hive 1.2.0. Other versions require further testing.
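For the partition limitation above, writing to a single partition means pointing the path parameter (described below) at that partition's directory. A minimal sketch, assuming a table stored at /user/hive/warehouse/test.db/hello with a hypothetical partition column pt:

    "path": "/user/hive/warehouse/test.db/hello/pt=20180403"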

Currently, HDFS Writer supports most data types in Hive. Check whether the Hive type you are using is supported.

HDFS Writer converts the data types in Hive as follows:

| Data Integration category | HDFS/Hive data type |
| --- | --- |
| long | TINYINT, SMALLINT, INT, BIGINT |
| double | FLOAT, DOUBLE |
| string | STRING, VARCHAR, CHAR |
| boolean | BOOLEAN |
| date | DATE, TIMESTAMP |
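For example, a Hive table with a VARCHAR column and a TIMESTAMP column is declared with the string and date categories. A minimal sketch of the corresponding column configuration, with hypothetical field names:

    "column": [
        {
            "name": "user_name",
            "type": "string"
        },
        {
            "name": "gmt_create",
            "type": "date"
        }
    ]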

Parameter description

  • defaultFS

    • Description: The NameNode address of the Hadoop HDFS file system. For example: hdfs://127.0.0.1:9000.

    • Required: Yes

    • Default value: None

  • fileType

    • Description: File type. Currently, only text, orc, and parquet are supported.

      • text: The format of TextFile.
      • orc: The format of ORCFile.
      • parquet: The format of ParquetFile.
    • Required: Yes

    • Default value: None

  • path

    • Description: The path in the Hadoop HDFS file system to which the files are written. HDFS Writer writes multiple files under this path, based on the configured write concurrency.

      To associate the output with a Hive table, enter the storage path of the Hive table in HDFS. For example, if the path of the Hive data warehouse is /user/hive/warehouse/ and you have created the table "hello" in the database "test", the path of the Hive table is /user/hive/warehouse/test.db/hello.

    • Required: Yes

    • Default value: None

  • fileName

    • Description: Name of the file written by HDFS Writer. A random suffix is appended to the file name to form the actual name of the file written by each thread.
    • Required: Yes

    • Default value: None

  • column

    • Description: Fields of the data to be written. Writing only part of the columns is not allowed.

      To associate the output with a Hive table, you must specify all the field names and types in the table; name specifies the field name and type specifies the field type. You can configure the column field as follows:

      1. "column":
      2. [
      3. {
      4. "name": "userName",
      5. "type": "string"
      6. },
      7. {
      8. "name": "age",
      9. "type": "long"
      10. }
      11. ]
    • Required: Yes. (If fileType is parquet, it is optional.)

    • Default value: None

  • writeMode

    • Description: How HDFS Writer handles existing data before writing:

      • append: No processing is done before writing. HDFS Writer writes the data directly using fileName, making sure that file names do not conflict.

      • nonConflict: An error is reported if a file with a prefix of fileName exists under the path.

      • Note: Parquet files only support nonConflict.

    • Required: Yes

    • Default value: None

  • fieldDelimiter

    • Description: The field delimiter used for the fields written by HDFS Writer. Make sure it is identical to the one used in the corresponding Hive table; otherwise, the data cannot be located in the Hive table.

    • Required: Yes. (If fileType is parquet, it is optional.)

    • Default value: None

  • compress

    • Description: Compression type of the HDFS files. It is left blank by default, which means no compression is performed.

      Text files support the gzip and bzip2 compression types. ORC files support SNAPPY compression (SnappyCodec is required).

    • Required: No

    • Default value: No compression

  • encoding

    • Description: Encoding of the written files.

    • Required: No

    • Default value: utf-8 (Do not change it unless it is necessary to do so.)

  • parquetSchema

    • Description: Required when the file is in parquet format. It is used to specify the structure of the target file, and takes effect only when the fileType is parquet. The format is as follows:

      message MessageType {
          Required/Optional, data type, column name;
          ......;
      }

      Configuration item description:

      • MessageType: Any supported value.

      • Required: Required or Optional. Optional is recommended.

      • Data Type: Parquet files support the following data types: boolean, int32, int64, int96, float, double, binary (select binary if the data type is string), and fixed_len_byte_array.

        Note:

        Each line of the schema, including the last one, must end with a semicolon.

      The following shows an example:

      message m {
          optional int64 id;
          optional int64 date_id;
          optional binary datetimestring;
          optional int32 dspId;
          optional int32 advertiserId;
          optional int32 status;
          optional int64 bidding_req_num;
          optional int64 imp;
          optional int64 click_num;
      }
    • Required: No

    • Default value: None
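Putting the parquet-related parameters together, the following is a minimal sketch of a writer parameter block for a parquet file. The path, file name, and schema fields are hypothetical, and the sketch assumes parquetSchema is passed as a string; writeMode is nonConflict because that is the only mode parquet files support:

    "writer": {
        "plugin": "hdfs",
        "parameter": {
            "defaultFS": "hdfs://localhost:9000",
            "fileType": "parquet",
            "path": "/user/hive/warehouse/test.db/hello",
            "fileName": "hello",
            "writeMode": "nonConflict",
            "parquetSchema": "message m { optional int64 id; optional binary name; }"
        }
    }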

Development in wizard mode

Currently, development in wizard mode is not supported.

Development in script mode

The following is a script configuration sample. For the relevant parameters, see Parameter description.

  1. "type": "job",
  2. "version": "1.0",
  3. "configuration": {
  4. "reader": {},
  5. "writer": {
  6. "plugin": "hdfs",
  7. "parameter": {
  8. "defaultFS": "hdfs://localhost:9000",
  9. "fileType": "text",
  10. "path": "/user/hive/warehouse/writerorc.db/orcfull",
  11. "fileName": "hello",
  12. "column": [
  13. {
  14. "name": "col1",
  15. "type": "string"
  16. },
  17. {
  18. "name": "col2",
  19. "type": "long"
  20. },
  21. {
  22. "name": "col3",
  23. "type": "double"
  24. },
  25. {
  26. "name": "col4",
  27. "type": "boolean"
  28. },
  29. {
  30. "name": "col5",
  31. "type": "date"
  32. }
  33. ],
  34. "writeMode": "append",
  35. "fieldDelimiter": ",",
  36. "compress": "",
  37. "encoding": "UTF-8"
  38. }
  39. }
  40. }
  41. }
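To write an ORC file instead, the following is a hedged variant of the parameter block above: fileType is switched to orc and SNAPPY compression is enabled (as noted under compress, SnappyCodec must be available in the cluster). The path and field names are reused from the sample for illustration:

    "parameter": {
        "defaultFS": "hdfs://localhost:9000",
        "fileType": "orc",
        "path": "/user/hive/warehouse/writerorc.db/orcfull",
        "fileName": "hello",
        "column": [
            {
                "name": "col1",
                "type": "string"
            },
            {
                "name": "col2",
                "type": "long"
            }
        ],
        "writeMode": "append",
        "fieldDelimiter": ",",
        "compress": "SNAPPY",
        "encoding": "UTF-8"
    }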