Hive Reader can read data from Hive. This topic describes the working principles and parameters that are supported by Hive Reader and how to configure Hive Reader by using the codeless user interface (UI) and code editor.

Background information

Hive is a Hadoop-based data warehouse tool that is used to process large amounts of structured logs. Hive maps structured data files to a table and allows you to execute SQL statements to query data in the table. Essentially, Hive is an engine that parses SQL statements. It uses MapReduce to analyze data at the underlying layer and uses Hadoop Distributed File System (HDFS) to store the processed data. Hive converts Hive Query Language (HQL) statements to MapReduce programs and runs the programs on YARN.

Hive Reader connects to Hive Metastore to obtain the metadata of your Hive table. Hive Reader can read data based on HDFS files or by using Hive Java Database Connectivity (JDBC):
  • HDFS files

    Hive Reader connects to Hive Metastore and obtains the storage path, format, and column delimiter of the HDFS file that corresponds to your Hive table. Then, Hive Reader reads data from the HDFS file.

    In this mode, the underlying logic of Hive Reader is the same as that of HDFS Reader. You can configure the parameters of HDFS Reader in the parameters of Hive Reader, and Data Integration transparently passes the configured parameters through to HDFS Reader.

  • Hive JDBC

    Hive Reader connects to HiveServer2 by using Hive JDBC to read data. Hive Reader allows you to specify WHERE clauses to filter data and execute SQL statements to read data.
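The following fragment is a minimal, hypothetical reader configuration that highlights the switch between the two modes. The data source name, table name, and filter value are placeholders; complete runnable examples appear in the code editor section of this topic.

    {
        "stepType": "hive",
        "parameter": {
            "datasource": "my_hive_source",// A placeholder data source name.
            "table": "orders",// A placeholder table name.
            "readMode": "jdbc",// Set this parameter to hdfs to read the underlying HDFS files instead.
            "where": "ds='20230101'"// A placeholder filter. WHERE-based filtering works only in JDBC mode.
        },
        "name": "Reader",
        "category": "reader"
    }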

Limits

Hive data sources do not support the shared resource group or custom resource groups for Data Integration. You must use an exclusive resource group for Data Integration.

Supported data types

Category         Hive data type
String           CHAR, VARCHAR, and STRING
Integer          TINYINT, SMALLINT, INT, INTEGER, and BIGINT
Floating point   FLOAT, DOUBLE, and DECIMAL
Date and time    TIMESTAMP and DATE
Boolean          BOOLEAN

Parameters

datasource
  The name of the data source. It must be the same as the name of the added data source.
  Required: Yes. Default value: N/A.
table
  The name of the table from which you want to read data.
  Note The name is case-sensitive.
  Required: Yes. Default value: N/A.
readMode
  The mode in which Hive Reader reads data.
  • If Hive Reader reads data based on HDFS files, set this parameter to hdfs.
  • If Hive Reader reads data by using Hive JDBC, set this parameter to jdbc.
  Note
  • If Hive Reader reads data by using Hive JDBC, you can specify a WHERE clause to filter data. If Hive Reader reads data based on HDFS files, you cannot.
  • If Hive Reader reads data by using Hive JDBC, you cannot customize the parallelism. If Hive Reader reads data based on HDFS files, you can.
  Required: No. Default value: N/A.
partition
  The partitions in the Hive table.
  • This parameter is not required if Hive Reader reads data by using Hive JDBC.
  • If the Hive table from which you want to read data is a partitioned table, you must set this parameter. Hive Reader reads data from the partitions that this parameter specifies. Set this parameter in the pt1=a,pt2=b,... format. You can also use asterisks (*) as wildcards, as shown in the sketch after this table.
  • If the Hive table from which you want to read data is a non-partitioned table, you do not need to set this parameter.
  Required: No. Default value: N/A.
column
  The columns to be read. Example: "column": ["id", "name"].
  • You can select specific columns to read.
  • You can read columns in an order that is different from the order specified in the table schema.
  • You can specify partition key columns.
  • You can configure constants.
  • This parameter must explicitly specify all the columns that you want to synchronize. It cannot be left empty.
  Required: Yes. Default value: N/A.
querySql
  If Hive Reader reads data by using Hive JDBC, you can set this parameter to an SQL statement that Hive Reader executes to read data.
  Required: No. Default value: N/A.
where
  If Hive Reader reads data by using Hive JDBC, you can specify a WHERE clause for Hive Reader to filter data.
  Required: No. Default value: N/A.
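The following fragment is a sketch of how the partition and column parameters fit together in HDFS mode, not a complete job. The data source, table, and column names are placeholders, and the single-quoted constant follows the convention described for the codeless UI below; treat its use in the code editor as an assumption.

    "parameter": {
        "datasource": "my_hive_source",// A placeholder data source name.
        "table": "part_table",// A placeholder partitioned table.
        "readMode": "hdfs",
        "partition": "pt1=*,pt2=b",// The asterisk wildcard matches all pt1 partitions.
        "column": [
            "id",// A regular column.
            "pt1",// A partition key column can also be read.
            "'const_value'"// A constant that is enclosed in single quotation marks (an assumption).
        ]
    }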

Configure Hive Reader by using the codeless UI

On the DataStudio page, double-click the sync node that you created. On the node configuration tab that appears, set the parameters for the node. For more information, see Configure a sync node by using the codeless UI.

Perform the following steps on the configuration tab of the sync node:
  1. Configure the data sources.
    Configure the source and destination for the sync node.
    Parameter Description
    Connection The name of the data source from which you want to read data. This parameter is equivalent to the datasource parameter that is described in the preceding section.
    Table This parameter is equivalent to the table parameter that is described in the preceding section.
    Read Hive method The mode in which Hive Reader reads data. This parameter is equivalent to the readMode parameter that is described in the preceding section. Valid values: Read data based on HDFS file and Read data based on Hive Jdbc (supports data condition filtering).
  2. Configure field mappings. This operation is equivalent to setting the column parameter that is described in the preceding section.
    Fields in the source on the left have a one-to-one mapping with fields in the destination on the right. You can click Add to add a field. To remove an added field, move the pointer over the field and click the Remove icon.
    Parameter Description
    Map Fields with the Same Name Click Map Fields with the Same Name to establish mappings between fields with the same name. The data types of the fields must match.
    Map Fields in the Same Line Click Map Fields in the Same Line to establish mappings between fields in the same row. The data types of the fields must match.
    Delete All Mappings Click Delete All Mappings to remove the mappings that are established.
    Auto Layout Click Auto Layout. Then, the system automatically sorts the fields based on specific rules.
    Change Fields Click the Change Fields icon. In the Change Fields dialog box, you can manually edit the fields in the source table. Each field occupies a row. The first and the last blank rows are included, whereas other blank rows are ignored.
    Add Click Add to add a field. You can add fields of the following types:
    • You can enter constants. Each constant must be enclosed in single quotation marks ('), such as 'abc' and '123'.
    • You can use scheduling parameters such as ${bizdate}.
    • You can enter functions that are supported by relational databases, such as now() and count(1).
    • If the value that you entered cannot be parsed, the value of Type for the field is Custom.
    • You can enter the name of a partition key column to synchronize the partition key column.
  3. Configure channel control policies.
    Parameter Description
    Expected Maximum Concurrency The maximum number of parallel threads that the sync node uses to read data from the source or write data to the destination. You can configure the parallelism for the sync node on the codeless UI.
    Note If Hive Reader reads data by using Hive JDBC, you cannot customize the parallelism. If Hive Reader reads data based on HDFS files, you can customize the parallelism.
    Bandwidth Throttling Specifies whether to enable bandwidth throttling. You can enable bandwidth throttling and specify a maximum transmission rate to prevent heavy read workloads on the source. We recommend that you enable bandwidth throttling and set the maximum transmission rate to an appropriate value based on the configurations of the source.
    Dirty Data Records Allowed The maximum number of dirty data records allowed.
    Distributed Execution Specifies whether to enable the distributed execution mode, which splits your node into pieces and distributes them to multiple Elastic Compute Service (ECS) instances for parallel execution. This speeds up synchronization. If you use a large number of parallel threads to run your synchronization node in distributed execution mode, excessive access requests are sent to the data sources. Therefore, before you use the distributed execution mode, evaluate the access load on the data sources. You can enable this mode only if you use an exclusive resource group for Data Integration. For more information, see Exclusive resource groups for Data Integration and Create and use an exclusive resource group for Data Integration.
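    These channel controls map to keys in the setting block of the code editor JSON, as shown in the following sketch. The values are placeholders; the complete job examples in the next section show the block in context.

    "setting": {
        "errorLimit": {
            "record": "10"// Dirty Data Records Allowed; a placeholder value.
        },
        "speed": {
            "concurrent": 2,// Expected Maximum Concurrency.
            "throttle": true,// Specifies whether bandwidth throttling is enabled.
            "mbps": "12"// The maximum transmission rate. Takes effect only if throttle is set to true.
        }
    }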

Configure Hive Reader by using the code editor

For more information about how to configure a sync node by using the code editor, see Create a sync node by using the code editor.
Note Delete the comments from the following code before you run the code:
Hive Reader can read data based on HDFS files or by using Hive JDBC.
  • HDFS files
    {
        "type": "job",
        "steps": [
            {
                "stepType": "hive",
                "parameter": {
                    "partition": "pt1=a,pt2=b,pt3=c",// The partitions in the Hive table.
                    "datasource": "hive_not_ha_****",// The name of the data source.
                    "column": [// The names of the columns from which you want to read data.
                        "id",
                        "pt2",
                        "pt1"
                    ],
                    "readMode": "hdfs",// The mode in which Hive Reader reads data.
                    "table": "part_table_1"
                },
                "name": "Reader",
                "category": "reader"
            },
            {
                "stepType": "hive",
                "parameter": {
                },
                "name": "Writer",
                "category": "writer"
            }
        ],
        "version": "2.0",
        "order": {
            "hops": [
                {
                    "from": "Reader",
                    "to": "Writer"
                }
            ]
        },
        "setting": {
            "errorLimit": {
                "record": ""// The maximum number of dirty data records allowed.
            },
            "speed": {
                "concurrent": 2,// The maximum number of parallel threads.
                "throttle": true,// Specifies whether to enable bandwidth throttling. A value of false indicates that bandwidth throttling is disabled, and a value of true indicates that bandwidth throttling is enabled. The mbps parameter takes effect only if the throttle parameter is set to true. 
                "mbps":"12"// The maximum transmission rate.
            }
        }
    }
  • Hive JDBC
    {
        "type": "job",
        "steps": [
            {
                "stepType": "hive",
                "parameter": {
                    "querySql": "select id,name,age from part_table_1 where pt2='B'",
                    "datasource": "hive_not_ha_****",// The name of the data source.
                    "column": [// The names of the columns from which you want to read data.
                        "id",
                        "name",
                        "age"
                    ],
                    "where": "",
                    "table": "part_table_1",
                    "readMode": "jdbc"// The mode in which Hive Reader reads data.
                },
                "name": "Reader",
                "category": "reader"
            },
            {
                "stepType": "hive",
                "parameter": {
                },
                "name": "Writer",
                "category": "writer"
            }
        ],
        "version": "2.0",
        "order": {
            "hops": [
                {
                    "from": "Reader",
                    "to": "Writer"
                }
            ]
        },
        "setting": {
            "errorLimit": {
                "record": ""
            },
            "speed": {
                "concurrent": 2,// The maximum number of parallel threads.
                "throttle": true,// Specifies whether to enable bandwidth throttling. A value of false indicates that bandwidth throttling is disabled, and a value of true indicates that bandwidth throttling is enabled. The mbps parameter takes effect only if the throttle parameter is set to true. 
                "mbps":"12"// The maximum transmission rate.            
                
            }
        }
    }
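In JDBC mode, the where parameter can be used instead of querySql. The following reader fragment is a sketch under the assumption that Hive Reader builds the query from the table, column, and where parameters when querySql is not set; the filter value is a placeholder.

    "parameter": {
        "datasource": "hive_not_ha_****",// The name of the data source.
        "table": "part_table_1",
        "readMode": "jdbc",
        "column": [
            "id",
            "name",
            "age"
        ],
        "where": "pt2='B'"// A placeholder WHERE clause that is used instead of querySql.
    }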

Configure the resource group for Data Integration

  1. On the configuration tab of the sync node, click the Resource Group configuration tab in the right-side navigation pane.
  2. In the Resource Group configuration panel, select the desired exclusive resource group for Data Integration from the Exclusive Resource Groups drop-down list. Exclusive resource group for Data Integration
    Note By default, you can select an exclusive resource group for Data Integration for your sync node in the Resource Group configuration panel. To ensure stable data synchronization and meet performance requirements, we recommend that you use an exclusive resource group for Data Integration. Hive data sources do not support the shared resource group or custom resource groups for Data Integration.

Hive versions supported by Hive Reader

Hive Reader supports the following Hive versions:
0.8.0
0.8.1
0.9.0
0.10.0
0.11.0
0.12.0
0.13.0
0.13.1
0.14.0
1.0.0
1.0.1
1.1.0
1.1.1
1.2.0
1.2.1
1.2.2
2.0.0
2.0.1
2.1.0
2.1.1
2.2.0
2.3.0
2.3.1
2.3.2
2.3.3
2.3.4
2.3.5
2.3.6
2.3.7
3.0.0
3.1.0
3.1.1
3.1.2
0.8.1-cdh4.0.0
0.8.1-cdh4.0.1
0.9.0-cdh4.1.0
0.9.0-cdh4.1.1
0.9.0-cdh4.1.2
0.9.0-cdh4.1.3
0.9.0-cdh4.1.4
0.9.0-cdh4.1.5
0.10.0-cdh4.2.0
0.10.0-cdh4.2.1
0.10.0-cdh4.2.2
0.10.0-cdh4.3.0
0.10.0-cdh4.3.1
0.10.0-cdh4.3.2
0.10.0-cdh4.4.0
0.10.0-cdh4.5.0
0.10.0-cdh4.5.0.1
0.10.0-cdh4.5.0.2
0.10.0-cdh4.6.0
0.10.0-cdh4.7.0
0.10.0-cdh4.7.1
0.12.0-cdh5.0.0
0.12.0-cdh5.0.1
0.12.0-cdh5.0.2
0.12.0-cdh5.0.3
0.12.0-cdh5.0.4
0.12.0-cdh5.0.5
0.12.0-cdh5.0.6
0.12.0-cdh5.1.0
0.12.0-cdh5.1.2
0.12.0-cdh5.1.3
0.12.0-cdh5.1.4
0.12.0-cdh5.1.5
0.13.1-cdh5.2.0
0.13.1-cdh5.2.1
0.13.1-cdh5.2.2
0.13.1-cdh5.2.3
0.13.1-cdh5.2.4
0.13.1-cdh5.2.5
0.13.1-cdh5.2.6
0.13.1-cdh5.3.0
0.13.1-cdh5.3.1
0.13.1-cdh5.3.2
0.13.1-cdh5.3.3
0.13.1-cdh5.3.4
0.13.1-cdh5.3.5
0.13.1-cdh5.3.6
0.13.1-cdh5.3.8
0.13.1-cdh5.3.9
0.13.1-cdh5.3.10
1.1.0-cdh5.3.6
1.1.0-cdh5.4.0
1.1.0-cdh5.4.1
1.1.0-cdh5.4.2
1.1.0-cdh5.4.3
1.1.0-cdh5.4.4
1.1.0-cdh5.4.5
1.1.0-cdh5.4.7
1.1.0-cdh5.4.8
1.1.0-cdh5.4.9
1.1.0-cdh5.4.10
1.1.0-cdh5.4.11
1.1.0-cdh5.5.0
1.1.0-cdh5.5.1
1.1.0-cdh5.5.2
1.1.0-cdh5.5.4
1.1.0-cdh5.5.5
1.1.0-cdh5.5.6
1.1.0-cdh5.6.0
1.1.0-cdh5.6.1
1.1.0-cdh5.7.0
1.1.0-cdh5.7.1
1.1.0-cdh5.7.2
1.1.0-cdh5.7.3
1.1.0-cdh5.7.4
1.1.0-cdh5.7.5
1.1.0-cdh5.7.6
1.1.0-cdh5.8.0
1.1.0-cdh5.8.2
1.1.0-cdh5.8.3
1.1.0-cdh5.8.4
1.1.0-cdh5.8.5
1.1.0-cdh5.9.0
1.1.0-cdh5.9.1
1.1.0-cdh5.9.2
1.1.0-cdh5.9.3
1.1.0-cdh5.10.0
1.1.0-cdh5.10.1
1.1.0-cdh5.10.2
1.1.0-cdh5.11.0
1.1.0-cdh5.11.1
1.1.0-cdh5.11.2
1.1.0-cdh5.12.0
1.1.0-cdh5.12.1
1.1.0-cdh5.12.2
1.1.0-cdh5.13.0
1.1.0-cdh5.13.1
1.1.0-cdh5.13.2
1.1.0-cdh5.13.3
1.1.0-cdh5.14.0
1.1.0-cdh5.14.2
1.1.0-cdh5.14.4
1.1.0-cdh5.15.0
1.1.0-cdh5.16.0
1.1.0-cdh5.16.2
1.1.0-cdh5.16.99
2.1.1-cdh6.1.1
2.1.1-cdh6.2.0
2.1.1-cdh6.2.1
2.1.1-cdh6.3.0
2.1.1-cdh6.3.1
2.1.1-cdh6.3.2
2.1.1-cdh6.3.3