Hive Reader allows you to read data from Hive. This topic describes how Hive Reader works, the parameters that are supported by Hive Reader, and how to configure Hive Reader.

Background information

Hive is a Hadoop-based data warehouse tool that is used to process large amounts of structured logs. Hive maps structured data files to a table and allows you to execute SQL statements to query data in the table.
Notice Hive Reader supports only exclusive resource groups for Data Integration, but not the shared resource group or custom resource groups. For more information, see Exclusive resource group for Data Integration, Use the default resource group, and Create a custom resource group for Data Integration. For more information about the Hive versions that are supported by Hive Reader, see the Supported versions section of this topic.
Essentially, Hive converts Hive query language (HQL) or SQL statements to MapReduce programs.
  • Hive stores processed data in Hadoop Distributed File System (HDFS).
  • Hive uses MapReduce programs to analyze data at the underlying layer.
  • Hive runs MapReduce programs on YARN.

How it works

Hive Reader connects to the Hive metastore database and parses your configuration to obtain the storage path, file format, and column delimiter of the HDFS file that corresponds to your Hive table. Then, Hive Reader reads data from the HDFS file.

Hive Reader connects to Hive Metastore to obtain the metadata of your Hive table. You can read data based on HDFS files or Hive Java Database Connectivity (JDBC):
  • Read data based on HDFS files

    Hive Reader connects to Hive Metastore and parses the configuration to obtain the storage path, format, and column delimiter of the HDFS file corresponding to your Hive table. Then, Hive Reader reads data from the HDFS file.

    The underlying logic of Hive Reader is the same as that of HDFS Reader. After Hive Reader reads data, it synchronizes the data from the HDFS file to the destination table by using Hive JDBC. You can configure the parameters of HDFS Reader in the parameters of Hive Reader, and Data Integration transparently passes the configured parameters to HDFS Reader (see the sketch after this list).

  • Read data based on Hive JDBC

    Hive Reader connects to HiveServer2 by using Hive JDBC to read data. In this mode, Hive Reader allows you to specify a WHERE clause to filter data and to execute SQL statements to read data.
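
A minimal sketch of the pass-through behavior in HDFS mode is shown below. The fieldDelimiter and nullFormat keys are borrowed from HDFS Reader and are included only as an assumption of how HDFS Reader parameters can be added to the Hive Reader configuration; verify the exact keys against the HDFS Reader documentation before you use them.

    {
        "stepType": "hive",
        "parameter": {
            "datasource": "hive_not_ha_****",
            "table": "part_table_1",
            "readMode": "hdfs",             // Read data based on HDFS files.
            "partition": "pt1=a,pt2=b",
            "column": ["id", "pt1", "pt2"],
            "fieldDelimiter": ",",          // Assumed HDFS Reader parameter that is passed through to HDFS Reader.
            "nullFormat": "\\N"             // Assumed HDFS Reader parameter that is passed through to HDFS Reader.
        },
        "name": "Reader",
        "category": "reader"
    }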

Parameters

datasource: The name of the connection. It must be the same as the name of a created connection. Required: Yes. Default value: N/A.
table: The name of the Hive table to be synchronized. Note that the name is case-sensitive. Required: Yes. Default value: N/A.
readMode: The mode in which data is read. Required: No. Default value: N/A.
  • If Hive Reader reads data based on HDFS files, set this parameter to "readMode":"hdfs".
  • If Hive Reader reads data based on Hive JDBC, set this parameter to "readMode":"jdbc".
partition: The partitions in the Hive table. Required: No. Default value: N/A.
  • This parameter is not required if Hive Reader reads data based on Hive JDBC.
  • If the Hive table to be read is a partitioned table, you must set this parameter. The sync node reads data only from the partitions that are specified by this parameter. You can use asterisks (*) as wildcards to specify partitions in the pt1=a,pt2=b,... format (see the snippet after this list).
  • If the Hive table is a non-partitioned table, this parameter is not required.
column: The columns to be read. Example: "column": ["id", "name"]. Required: Yes. Default value: N/A.
  • Column pruning is supported. You can select specific columns to export.
  • The column order can be changed. You can export the specified columns in an order different from the order in the table schema.
  • Partition key columns can be specified.
  • Constants are supported.
  • The column parameter must explicitly specify the columns to be synchronized and cannot be left empty.
querySql: The SQL statement that Hive Reader executes to read data if it reads data based on Hive JDBC. Required: No. Default value: N/A.
where: The WHERE clause that Hive Reader uses to filter data if it reads data based on Hive JDBC. Required: No. Default value: N/A.
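
The following snippet is an illustrative sketch of the partition and column parameters, combining a wildcard in partition with partition key columns in column. The values are placeholders based on the examples in this topic and do not form a complete job configuration.

    "parameter": {
        "datasource": "hive_not_ha_****",
        "table": "part_table_1",
        "readMode": "hdfs",
        "partition": "pt1=*,pt2=b", // The asterisk (*) matches all values of pt1.
        "column": [
            "id",  // A regular column.
            "pt1", // A partition key column.
            "pt2"  // A partition key column.
        ]
    }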

Codeless UI mode

On the DataStudio page, double-click a sync node and configure the node on the node configuration tab on the right. For more information, see Create a sync node by using the codeless UI.

Perform the following steps on the configuration tab of the sync node:
  1. Configure the connections.

    Configure the connections to the source and destination data stores for the sync node.

    Parameter Description
    Connection The datasource parameter in the preceding parameter description. Select the name of a connection that you have configured in DataWorks.
    Table The table parameter in the preceding parameter description.
    Read Hive method The readMode parameter in the preceding parameter description. Valid values: Read data based on HDFS file and Read data based on Hive Jdbc (supports data condition filtering).
    Partition Key Column The partition parameter in the preceding parameter description.
    Note This parameter is displayed only when you set the Read Hive method parameter to Read data based on HDFS file.
    Data filter The filter condition for the data to be synchronized. Filtering based on the LIMIT keyword is not supported. The SQL syntax is determined by the selected connection.
    Note This parameter is displayed only when you set the Read Hive method parameter to Read data based on Hive Jdbc (supports data condition filtering).
    Data query The querySql parameter in the preceding parameter description.
    Note This parameter is displayed only when you set the Read Hive method parameter to Read data based on Hive Jdbc (supports data condition filtering).
  2. Configure field mapping. It is equivalent to setting the column parameter in the preceding parameter description.
    Fields in the source table on the left have a one-to-one mapping with fields in the destination table on the right. You can click Add to add a field. To remove a field, move the pointer over the field and click the Remove icon.
    Parameter Description
    Map Fields with the Same Name Click Map Fields with the Same Name to establish a mapping between fields with the same name. The data types of the fields must match.
    Map Fields in the Same Line Click Map Fields in the Same Line to establish a mapping between fields in the same row. The data types of the fields must match.
    Delete All Mappings Click Delete All Mappings to remove mappings that have been established.
    Auto Layout You can sort the fields based on specified rules.
    Change Fields You can manually edit fields in the source table. Each field occupies a row. The first and the last blank rows are included, whereas other blank rows are ignored.
    Add Click Add to add a field. Take note of the following rules when you add a field (a sketch of the corresponding column configuration follows this list):
    • You can enter constants. Each constant must be enclosed in single quotation marks (' '), such as 'abc' and '123'.
    • You can use scheduling parameters, such as ${bizdate}.
    • You can enter functions that are supported by relational databases, such as now() and count(1).
    • Fields that cannot be parsed are indicated by Unidentified.
    • You can enter the name of a partition key column to synchronize the partition key column.
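    The following fragment is a sketch of how a constant and a scheduling parameter might look in the column parameter in the code editor. The single-quotation-mark syntax is an assumption based on the codeless UI convention described above; confirm it in your environment before you rely on it.

        "column": [
            "id",
            "'abc'",       // A constant, enclosed in single quotation marks (assumed syntax).
            "'${bizdate}'" // A scheduling parameter (assumed syntax).
        ]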
  3. Configure channel control policies. These options map to the setting block in the code editor examples, as shown in the sketch after the following table.
    Parameter Description
    Expected Maximum Concurrency The maximum number of concurrent threads that the sync node uses to read data from the source data store and write data to the destination data store. You can configure the concurrency for the sync node on the codeless user interface (UI).
    Bandwidth Throttling You can enable bandwidth throttling and set a maximum transmission rate to prevent a heavy read workload on the source data store. We recommend that you enable bandwidth throttling and set the maximum transmission rate to an appropriate value based on the configurations of the source data store.
    Dirty Data Records Allowed The maximum number of dirty data records that are allowed.
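    A minimal sketch of how these options appear in the setting block of a script-mode configuration (the key for the maximum transmission rate is not shown here):

        "setting": {
            "errorLimit": {
                "record": "10"    // Dirty Data Records Allowed.
            },
            "speed": {
                "concurrent": 2,  // Expected Maximum Concurrency.
                "throttle": false // Bandwidth throttling; set this to true to enable throttling.
            }
        }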

Code editor mode

For more information about how to use the code editor, see Create a sync node by using the code editor.
Note Delete the comments from the following code before you run the code.
You can read data based on HDFS files or Hive JDBC:
  • Read data based on HDFS files
    {
        "type": "job",
        "steps": [
            {
                "stepType": "hive",
                "parameter": {
                    "partition": "pt1=a,pt2=b,pt3=c", // The partitions in the Hive table.
                    "datasource": "hive_not_ha_****", // The name of the connection.
                    "column": [ // The columns to be read.
                        "id",
                        "pt2",
                        "pt1"
                    ],
                    "readMode": "hdfs", // The mode in which data is read.
                    "table": "part_table_1"
                },
                "name": "Reader",
                "category": "reader"
            },
            {
                "stepType": "hive",
                "parameter": {
                    // Configure the parameters of the writer based on the destination data store.
                },
                "name": "Writer",
                "category": "writer"
            }
        ],
        "version": "2.0",
        "order": {
            "hops": [
                {
                    "from": "Reader",
                    "to": "Writer"
                }
            ]
        },
        "setting": {
            "errorLimit": {
                "record": "" // The maximum number of dirty data records that are allowed.
            },
            "speed": {
                "concurrent": 2, // The maximum number of concurrent threads.
                "throttle": false // Specifies whether to enable bandwidth throttling.
            }
        }
    }
  • Read data based on Hive JDBC
    {
        "type": "job",
        "steps": [
            {
                "stepType": "hive",
                "parameter": {
                    "querySql": "select id,name,age from part_table_1 where pt2='B'", // The SQL statement that is executed to read data in JDBC mode.
                    "datasource": "hive_not_ha_****", // The name of the connection.
                    "column": [ // The columns to be read.
                        "id",
                        "name",
                        "age"
                    ],
                    "where": "", // The WHERE filter condition. It is left empty here because querySql already contains the filter condition.
                    "table": "part_table_1",
                    "readMode": "jdbc" // The mode in which data is read.
                },
                "name": "Reader",
                "category": "reader"
            },
            {
                "stepType": "hive",
                "parameter": {
                    // Configure the parameters of the writer based on the destination data store.
                },
                "name": "Writer",
                "category": "writer"
            }
        ],
        "version": "2.0",
        "order": {
            "hops": [
                {
                    "from": "Reader",
                    "to": "Writer"
                }
            ]
        },
        "setting": {
            "errorLimit": {
                "record": "" // The maximum number of dirty data records that are allowed.
            },
            "speed": {
                "throttle": false, // Specifies whether to enable bandwidth throttling.
                "concurrent": 2 // The maximum number of concurrent threads.
            }
        }
        }
    }
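    In the preceding example, the where parameter is left empty because querySql already contains the filter condition. If you do not use querySql, the where parameter can carry the filter condition instead. The following fragment is a sketch of that variant; combining table, column, and where in this way is an assumption based on the parameter descriptions in this topic, not a configuration copied from it.

        "parameter": {
            "datasource": "hive_not_ha_****",
            "table": "part_table_1",
            "readMode": "jdbc",
            "column": ["id", "name", "age"],
            "where": "pt2='B'" // The filter condition that is applied as a WHERE clause.
        }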

Configure the resource group for Data Integration

On the node configuration tab, click the Resource Group configuration tab in the right-side navigation pane. Configure the resource group on which the sync node is run. You can use only exclusive resource groups or custom resource groups to synchronize data from Hive. For more information, see DataWorks exclusive resources.

Supported versions

Hive Reader supports the following Hive versions:
0.8.0
0.8.1
0.9.0
0.10.0
0.11.0
0.12.0
0.13.0
0.13.1
0.14.0
1.0.0
1.0.1
1.1.0
1.1.1
1.2.0
1.2.1
1.2.2
2.0.0
2.0.1
2.1.0
2.1.1
2.2.0
2.3.0
2.3.1
2.3.2
2.3.3
2.3.4
2.3.5
2.3.6
2.3.7
3.0.0
3.1.0
3.1.1
3.1.2
0.8.1-cdh4.0.0
0.8.1-cdh4.0.1
0.9.0-cdh4.1.0
0.9.0-cdh4.1.1
0.9.0-cdh4.1.2
0.9.0-cdh4.1.3
0.9.0-cdh4.1.4
0.9.0-cdh4.1.5
0.10.0-cdh4.2.0
0.10.0-cdh4.2.1
0.10.0-cdh4.2.2
0.10.0-cdh4.3.0
0.10.0-cdh4.3.1
0.10.0-cdh4.3.2
0.10.0-cdh4.4.0
0.10.0-cdh4.5.0
0.10.0-cdh4.5.0.1
0.10.0-cdh4.5.0.2
0.10.0-cdh4.6.0
0.10.0-cdh4.7.0
0.10.0-cdh4.7.1
0.12.0-cdh5.0.0
0.12.0-cdh5.0.1
0.12.0-cdh5.0.2
0.12.0-cdh5.0.3
0.12.0-cdh5.0.4
0.12.0-cdh5.0.5
0.12.0-cdh5.0.6
0.12.0-cdh5.1.0
0.12.0-cdh5.1.2
0.12.0-cdh5.1.3
0.12.0-cdh5.1.4
0.12.0-cdh5.1.5
0.13.1-cdh5.2.0
0.13.1-cdh5.2.1
0.13.1-cdh5.2.2
0.13.1-cdh5.2.3
0.13.1-cdh5.2.4
0.13.1-cdh5.2.5
0.13.1-cdh5.2.6
0.13.1-cdh5.3.0
0.13.1-cdh5.3.1
0.13.1-cdh5.3.2
0.13.1-cdh5.3.3
0.13.1-cdh5.3.4
0.13.1-cdh5.3.5
0.13.1-cdh5.3.6
0.13.1-cdh5.3.8
0.13.1-cdh5.3.9
0.13.1-cdh5.3.10
1.1.0-cdh5.3.6
1.1.0-cdh5.4.0
1.1.0-cdh5.4.1
1.1.0-cdh5.4.2
1.1.0-cdh5.4.3
1.1.0-cdh5.4.4
1.1.0-cdh5.4.5
1.1.0-cdh5.4.7
1.1.0-cdh5.4.8
1.1.0-cdh5.4.9
1.1.0-cdh5.4.10
1.1.0-cdh5.4.11
1.1.0-cdh5.5.0
1.1.0-cdh5.5.1
1.1.0-cdh5.5.2
1.1.0-cdh5.5.4
1.1.0-cdh5.5.5
1.1.0-cdh5.5.6
1.1.0-cdh5.6.0
1.1.0-cdh5.6.1
1.1.0-cdh5.7.0
1.1.0-cdh5.7.1
1.1.0-cdh5.7.2
1.1.0-cdh5.7.3
1.1.0-cdh5.7.4
1.1.0-cdh5.7.5
1.1.0-cdh5.7.6
1.1.0-cdh5.8.0
1.1.0-cdh5.8.2
1.1.0-cdh5.8.3
1.1.0-cdh5.8.4
1.1.0-cdh5.8.5
1.1.0-cdh5.9.0
1.1.0-cdh5.9.1
1.1.0-cdh5.9.2
1.1.0-cdh5.9.3
1.1.0-cdh5.10.0
1.1.0-cdh5.10.1
1.1.0-cdh5.10.2
1.1.0-cdh5.11.0
1.1.0-cdh5.11.1
1.1.0-cdh5.11.2
1.1.0-cdh5.12.0
1.1.0-cdh5.12.1
1.1.0-cdh5.12.2
1.1.0-cdh5.13.0
1.1.0-cdh5.13.1
1.1.0-cdh5.13.2
1.1.0-cdh5.13.3
1.1.0-cdh5.14.0
1.1.0-cdh5.14.2
1.1.0-cdh5.14.4
1.1.0-cdh5.15.0
1.1.0-cdh5.16.0
1.1.0-cdh5.16.2
1.1.0-cdh5.16.99
2.1.1-cdh6.1.1
2.1.1-cdh6.2.0
2.1.1-cdh6.2.1
2.1.1-cdh6.3.0
2.1.1-cdh6.3.1
2.1.1-cdh6.3.2
2.1.1-cdh6.3.3