
Configure HDFS reader

Last Updated: Apr 12, 2018

HDFS Reader reads data stored on distributed file systems. At the underlying implementation level, HDFS Reader retrieves the file data from the Hadoop Distributed File System (HDFS), converts the data into the Data Integration transport protocol, and transfers it to the Writer.

The following describes the supported storage formats.

TextFile is the default storage format used when a Hive table is created without data compression. Essentially, TextFile stores data in HDFS as plain text, and the HDFS Reader implementation is quite similar to that of OSS Reader for Data Integration. ORCFile, whose full name is Optimized Row Columnar File, is the optimized version of RCFile and provides an efficient way to store Hive data. HDFS Reader utilizes the OrcSerde class provided by Hive to read and parse the data of ORCFile files.

Note:

For data synchronization, an admin account and read/write permissions for the files are required.

Create an admin user and home directory, specify a user group and an additional group, and grant the permissions for the files.

Supported features

Currently, HDFS Reader supports the following features.

  • Supports the TextFile, ORCFile, rcfile, sequence file, csv, and parquet file formats, and what is stored in the file must be a two-dimensional table in the logical sense.

  • Supports reading multiple types of data (represented by Strings) and supports column pruning and column constants.

  • Supports recursive reading and the wildcards "*" and "?".

  • Supports ORCFile data compression, and currently supports the SNAPPY and ZLIB compression modes.

  • Supports data compression for sequence files, and currently supports the lzo compression mode.

  • Supports concurrent reading of multiple files.

  • Supports the following compression formats for the csv type: gzip, bz2, zip, lzo, lzo_deflate, and snappy.

  • In the current plug-in, the Hive version is 1.1.1 and the Hadoop version is 2.7.1 (Apache, compatible with JDK 1.6). Data can be read normally in the testing environments of Hadoop 2.5.0, Hadoop 2.6.0, and Hive 1.2.0. For other versions, further testing is needed.

HDFS Reader does not currently support multi-threaded concurrent reading of a single file, because this would require an internal splitting algorithm for the single file.

Supported data types

RCFile

If the file type of the HDFS file being synchronized is rcfile, you must specify the data type of the columns of the Hive table in the column configuration, because the underlying storage mode of rcfile varies with the data type and HDFS Reader does not support accessing or querying the Hive metadata database. If a column type is bigint, double, or float, enter bigint, double, or float accordingly as the data type. If a column type is varchar or char, enter string instead.
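For example, a minimal column sketch for a hypothetical rcfile table whose Hive columns are declared as bigint, double, and varchar (the indexes shown are illustrative):

    "fileType": "rc",
    "column": [
        { "index": 0, "type": "bigint" },  // Hive column declared as bigint
        { "index": 1, "type": "double" },  // Hive column declared as double
        { "index": 2, "type": "string" }   // Hive column declared as varchar or char
    ]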

RCFile data types are converted into the internal types supported by Data Integration by default, as shown in the following comparison table.

Category | HDFS data type
Integer | TINYINT, SMALLINT, INT, BIGINT
Floating point | FLOAT, DOUBLE, DECIMAL
String | STRING, CHAR, VARCHAR
Date and time | DATE, TIMESTAMP
Boolean | BOOLEAN
Binary | BINARY

ParquetFile

ParquetFile data types are converted into the internal types supported by Data Integration by default, as shown in the following comparison table.

Category | HDFS data type
Integer | INT32, INT64, INT96
Floating point | FLOAT, DOUBLE
String | FIXED_LEN_BYTE_ARRAY
Date and time | DATE, TIMESTAMP
Boolean | BOOLEAN
Binary | BINARY

TextFile, ORCFile, SequenceFile

Given that the metadata of TextFile and ORCFile tables is maintained by and stored in the database that Hive itself maintains (such as MySQL), and that HDFS Reader currently does not support accessing or querying the Hive metadata database, you must specify a data type for type conversion.

TextFile, ORCFile, and SequenceFile data types are converted into the internal types supported by Data Integration by default, as shown in the following comparison table.

Category | HDFS data type
Integer | TINYINT, SMALLINT, INT, BIGINT
Floating point | FLOAT, DOUBLE
String | STRING, CHAR, VARCHAR, STRUCT, MAP, ARRAY, UNION, BINARY
Date and time | DATE, TIMESTAMP
Boolean | BOOLEAN

The following information describes those data types.

  • LONG: Represented by an integer string in the HDFS file, such as 123456789.

  • DOUBLE: Represented by a double string in the HDFS file, such as 3.1415.

  • BOOLEAN: Represented by a boolean string in the HDFS file, such as true or false (case-insensitive).

  • DATE: Represented by a date and time string in the HDFS file, such as 2014-12-31 00:00:00.

    Note:

    The TIMESTAMP data type supported by Hive can be accurate to the nanosecond, so the TIMESTAMP data stored in TextFile and ORCFile files can be in a format like "2015-08-21 22:40:47.397898389". If the converted data type is set to Date for Data Integration, the nanosecond part is truncated after conversion. If you want to retain this part, set the converted data type to String for Data Integration.
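    For example, a hedged column sketch (the index is illustrative) that reads such a TIMESTAMP column as String so that the nanosecond part is retained:

    "column": [
        { "index": 3, "type": "string" }  // e.g. "2015-08-21 22:40:47.397898389" is kept intact
    ]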

Parameter description

  • path

    • Description: The file path to be read. To read multiple files, use a wildcard to match all of them, such as /hadoop/data_201704*.

      • If a single HDFS file is specified, HDFS Reader only supports single-threaded data extraction.
      • If multiple HDFS files are specified, HDFS Reader supports multi-threaded data extraction, and the number of concurrent threads is determined by the job speed (mbps). The actual number of concurrent threads started is the smaller of the number of HDFS files to be read and the configured job speed.
      • When a wildcard is specified, HDFS Reader attempts to traverse multiple files. For example, when "/" is specified, HDFS Reader reads all the files under the "/" directory; when "/bazhen/" is specified, HDFS Reader reads all the files under the bazhen directory. Currently, HDFS Reader only supports "*" and "?" as file wildcards, and the syntax is similar to that of file wildcards on common Linux command lines.

        Note:

        Data Integration regards all the files to be read in the same synchronization job as one data table. For this reason, you must make sure that all those files conform to the same schema information and grant the read permission to Data Integration.

    • Required: Yes

    • Default value: None

    Note on reading partitions: During Hive table creation, you can specify partitions. For example, after creating the partition (day="20150820", hour="09"), two directories named /20150820 and /09 are created in the table directory on the HDFS file system, and /20150820 is the parent directory of /09. Given that partitions are organized in a directory structure, you can set the value of path in the JSON configuration to read all the data of a table for a given partition. For example, to read all the data of the table named mytable01 for the partition day=20150820, set the path as follows:

    "path": "/user/hive/warehouse/mytable01/20150820/*"

  • defaultFS

    • Description: The NameNode address of the Hadoop HDFS file system.

    • Required: Yes

    • Default value: None
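    For example (the address below is illustrative; it matches the script sample later in this article):

    "defaultFS": "hdfs://127.0.0.1:9000"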

  • fileType

    • Description: File type. Currently, only text, orc, rc, seq, csv, and parquet are supported. HDFS Reader can automatically identify files of the ORCFile, RCFile, Sequence File, TextFile, and csv types and use the appropriate reading policy for the corresponding file type. Before synchronizing data, HDFS Reader checks whether the types of all the files to be synchronized under the specified path are consistent with fileType. If they are not, the synchronization task fails.

      The list of parameter values that can be configured by fileType is as follows.

      • text: The format of TextFile.
      • orc: The format of ORCFile.
      • rc: The format of rcfile.
      • seq: The format of sequence file.
      • csv: The format of common HDFS file (logical two-dimensional table).
      • parquet: The format of common parquet file.

        Note:

        Because TextFile and ORCFile are totally different file formats, HDFS Reader parses the two file types in different ways. For this reason, the formats of the converted results vary when complex compound types supported by Hive (such as map, array, struct, and union) are converted to the String type supported by Data Integration. The following takes the map type as an example.

        • After being parsed and converted to the String type supported by Data Integration, the result of the ORCFile map type is {job=80, team=60, person=70}.

        • After being parsed and converted to the String type supported by Data Integration, the result of the TextFile map type is job:80,team:60,person:70.

        From the preceding results, the data itself remains unchanged but the representation formats differ slightly. For this reason, if the fields to be synchronized under the configured file path are of compound types in Hive, we recommend that you set a unified file type for the files.

      Recommended best practices

      • To unify the file types parsed from compound types, we recommend that you export TextFile tables as ORCFile tables on the Hive client.
      • If the file type is Parquet, the parquetSchema is required, which is used to describe the format of the Parquet file to be read.
    • Required: Yes

    • Default value: None

  • column

    • Description: The list of fields to be read, where type indicates the type of the source data, index indicates the column in which the current field is located (starting from 0), and value indicates that the current type is a constant, meaning that the data is not read from the source file but the corresponding column is generated automatically according to value.

      By default, you can read all data as the String type with the following configuration.

      "column": ["*"]

      You can also configure the column field as follows.

      {
          "type": "long",
          "index": 0    // Reads the first column of the source file as a long field
      },
      {
          "type": "string",
          "value": "alibaba"    // HDFS Reader internally generates the string "alibaba" as the value of the current field
      }

      For the specified column information, you must specify type and either index or value.

    • Required: Yes

    • Default value: None

  • fieldDelimiter

    • Description: The field delimiter used for reading. When HDFS Reader reads TextFile data, a field delimiter is required, which defaults to a comma (,) if no delimiter is specified. When HDFS Reader reads ORCFile data, no field delimiter is required; the default delimiter of Hive is \u0001. To read each entire row as a single column of the target, use a character that is not included in the row content as the delimiter, such as the invisible character \u0001 (see the sketch below). Additionally, \n cannot be used as the delimiter.

    • Required: No

    • Default: comma (,)
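    For example, a hedged sketch that reads each entire row as a single column by using the invisible character as the delimiter:

    "fieldDelimiter": "\u0001"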

  • encoding

    • Description: The encoding of the files to be read.

    • Required: No

    • Default: utf-8

  • nullFormat

    • Description: Text files cannot use a standard string to define null (a null pointer), so Data Integration provides nullFormat to define which strings can be interpreted as null. For example, when nullFormat: null is configured, if the source data is null, it is treated as a null field in Data Integration.

    • Required: No

    • Default value: None
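    For example, a minimal sketch that treats the literal string null in the source file as a null field, as described above:

    "nullFormat": "null"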

  • compress

    • Description: The file compression format when fileType is csv. Currently, gzip, bz2, zip, lzo, lzo_deflate, hadoop-snappy, and framing-snappy are supported.

      Note:

      • Two lzo compression formats are available: lzo and lzo_deflate. In actual configuration scenarios, make sure to select the appropriate one.
      • Given that no unified stream format is now available to snappy, Data Integration currently only supports the most popular two compression formats: hadoop-snappy (the snappy stream format in Hadoop) and framing-snappy (the snappy stream format recommended by Google).
      • rc represents the format of rcfile.
      • No entry is required for the orc file type.
    • Required: No

    • Default value: None
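    For example, a hedged sketch for reading gzip-compressed csv files (assuming the files under the configured path are gzip files):

    "fileType": "csv",
    "compress": "gzip"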

  • parquetSchema

    • Description: Required when the file is in Parquet format. It is used to specify the structure of the Parquet file to be read, and takes effect only when fileType is parquet. The format is as follows.

      message MessageType {
          Required, data type, column name;
          ......;
      }

      Description is shown as follows.

      • MessageType: Any supported value

      • Required: Required or Optional. Optional is recommended.

      • Data type: Parquet files support the following data types: boolean, int32, int64, int96, float, double, binary (select binary if the data type is string), and fixed_len_byte_array.

      Note that each configuration line, including the last one, must end with a semicolon.

      Configuration example is shown as follows.

      message m {
          optional int64 id;
          optional int64 date_id;
          optional binary datetimestring;
          optional int32 dspId;
          optional int32 advertiserId;
          optional int32 status;
          optional int64 bidding_req_num;
          optional int64 imp;
          optional int64 click_num;
      }
    • Required: No

    • Default value: None
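    The following is a hedged sketch of how such a schema might be embedded in the job JSON, assuming that parquetSchema is supplied as a single string value (the column names are illustrative and taken from the example above):

    "fileType": "parquet",
    "parquetSchema": "message m { optional int64 id; optional binary datetimestring; }"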

  • csvReaderConfig

    • Description: The parameter configurations for reading CSV files, of the Map type. CSV files are read by the CsvReader, which provides many configuration items; their default values are used if they are not configured.

    • Required: No

    • Default value: None

    Common configuration is shown as follows.

    1. "csvReaderConfig":{
    2. "safetySwitch": false,
    3. "skipEmptyRecords": false,
    4. "useTextQualifier": false
    5. }

    All the configuration items and their default values are listed below. You must configure the map of csvReaderConfig strictly in accordance with the following field names.

    boolean caseSensitive = true;
    char textQualifier = 34;
    boolean trimWhitespace = true;
    boolean useTextQualifier = true; // Whether to use csv escape characters
    char delimiter = 44; // The delimiter
    char recordDelimiter = 0;
    char comment = 35;
    boolean useComments = false;
    int escapeMode = 1;
    boolean safetySwitch = true; // Whether the length of each column is limited to 100,000 characters
    boolean skipEmptyRecords = true; // Whether to skip empty records
    boolean captureRawRecord = true;
  • hadoopConfig

    • Description: Certain Hadoop-related advanced parameters can be configured in hadoopConfig, such as the configuration of HA.

      1. "hadoopConfig":{
      2. "dfs.nameservices": "testDfs",
      3. "dfs.ha.namenodes.testDfs": "namenode1,namenode2",
      4. "dfs.namenode.rpc-address.youkuDfs.namenode1": "",
      5. "dfs.namenode.rpc-address.youkuDfs.namenode2": "",
      6. "dfs.client.failover.proxy.provider.testDfs": "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"
      7. }
    • Required: No

    • Default value: None

Development in wizard mode

Development in wizard mode is not supported currently.

Development in script mode

The following is a script configuration sample. For relevant parameters, see Parameter description.

  {
      "type": "job",
      "version": "1.0",
      "configuration": {
          "reader": {
              "plugin": "hdfs",
              "parameter": {
                  "path": "/user/hive/warehouse/mytable01/*",
                  "defaultFS": "hdfs://127.0.0.1:9000",
                  "column": [
                      {
                          "index": 0,
                          "type": "long"
                      },
                      {
                          "index": 1,
                          "type": "boolean"
                      },
                      {
                          "type": "string",
                          "value": "hello"
                      },
                      {
                          "index": 2,
                          "type": "double"
                      }
                  ],
                  "fileType": "orc",
                  "encoding": "UTF-8",
                  "fieldDelimiter": ","
              }
          },
          "writer": {}
      }
  }