HBase Reader reads data from HBase. This topic describes the data types and parameters that are supported by HBase Reader and how to configure HBase Reader by using the codeless user interface (UI) and code editor.

HBase Reader connects to a remote HBase database by using a Java client of HBase, scans and reads data based on a specific rowkey range, assembles the data into abstract datasets of the data types supported by Data Integration, and then sends the datasets to a writer.

Limits

Supported features

  • HBase Reader can read data from HBase 0.94.X, HBase 1.1.X, and HBase 2.X.
    • If you use HBase 0.94.X, set the plugin parameter to 094x.
      "reader": {
              "plugin": "094x"
          }
    • If you use HBase 1.1.X or HBase 2.X, set the plugin parameter to 11x.
      "reader": {
              "plugin": "11x"
          }
      Note: HBase 1.1.X Reader is compatible with HBase 2.0. If you have questions about using HBase Reader, submit a ticket.
  • HBase Reader supports normal and multiVersionFixedColumn modes.
    • In normal mode, HBase Reader reads only the latest version of data from an HBase table and converts the data to a two-dimensional table (wide table).
      hbase(main):017:0> scan 'users'
      ROW                                   COLUMN+CELL
      lisi                                 column=address:city, timestamp=1457101972764, value=beijing
      lisi                                 column=address:country, timestamp=1457102773908, value=china
      lisi                                 column=address:province, timestamp=1457101972736, value=beijing
      lisi                                 column=info:age, timestamp=1457101972548, value=27
      lisi                                 column=info:birthday, timestamp=1457101972604, value=1987-06-17
      lisi                                 column=info:company, timestamp=1457101972653, value=baidu
      xiaoming                             column=address:city, timestamp=1457082196082, value=hangzhou
      xiaoming                             column=address:country, timestamp=1457082195729, value=china
      xiaoming                             column=address:province, timestamp=1457082195773, value=zhejiang
      xiaoming                             column=info:age, timestamp=1457082218735, value=29
      xiaoming                             column=info:birthday, timestamp=1457082186830, value=1987-06-17
      xiaoming                             column=info:company, timestamp=1457082189826, value=alibaba
      2 row(s) in 0.0580 seconds
      HBase Reader converts the data that is read from the HBase table to the following table.
      rowKey    address:city  address:country  address:province  info:age  info:birthday  info:company
      lisi      beijing       china            beijing           27        1987-06-17     baidu
      xiaoming  hangzhou      china            zhejiang          29        1987-06-17     alibaba
    • In multiVersionFixedColumn mode, HBase Reader reads data from an HBase table and converts the data to a narrow table. The narrow table contains four columns: rowKey, family:qualifier, timestamp, and value. Before you use HBase Reader to read data, you must specify the columns from which you want to read data. HBase Reader converts each version of each cell into a separate data record.
      hbase(main):018:0> scan 'users',{VERSIONS=>5}
      ROW                                   COLUMN+CELL
      lisi                                 column=address:city, timestamp=1457101972764, value=beijing
      lisi                                 column=address:country, timestamp=1457102773908, value=china
      lisi                                 column=address:province, timestamp=1457101972736, value=beijing
      lisi                                 column=info:age, timestamp=1457101972548, value=27
      lisi                                 column=info:birthday, timestamp=1457101972604, value=1987-06-17
      lisi                                 column=info:company, timestamp=1457101972653, value=baidu
      xiaoming                             column=address:city, timestamp=1457082196082, value=hangzhou
      xiaoming                             column=address:country, timestamp=1457082195729, value=china
      xiaoming                             column=address:province, timestamp=1457082195773, value=zhejiang
      xiaoming                             column=info:age, timestamp=1457082218735, value=29
      xiaoming                             column=info:age, timestamp=1457082178630, value=24
      xiaoming                             column=info:birthday, timestamp=1457082186830, value=1987-06-17
      xiaoming                             column=info:company, timestamp=1457082189826, value=alibaba
      2 row(s) in 0.0260 seconds
      HBase Reader converts the data that is read from the HBase table to the following table.
      rowKey    family:qualifier  timestamp      value
      lisi      address:city      1457101972764  beijing
      lisi      address:country   1457102773908  china
      lisi      address:province  1457101972736  beijing
      lisi      info:age          1457101972548  27
      lisi      info:birthday     1457101972604  1987-06-17
      lisi      info:company      1457101972653  baidu
      xiaoming  address:city      1457082196082  hangzhou
      xiaoming  address:country   1457082195729  china
      xiaoming  address:province  1457082195773  zhejiang
      xiaoming  info:age          1457082218735  29
      xiaoming  info:age          1457082178630  24
      xiaoming  info:birthday     1457082186830  1987-06-17
      xiaoming  info:company      1457082189826  alibaba
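
The two conversions described above can be illustrated with a short Python sketch. It is not the HBase Reader implementation: the Cell tuple and the to_wide and to_narrow helpers are hypothetical names introduced only for illustration, and the sample cells are a subset of the scan output shown earlier. The handling of max_version assumes that the newest versions are kept first.

```python
from collections import namedtuple

# Hypothetical in-memory stand-in for one HBase cell version.
Cell = namedtuple("Cell", ["row", "column", "timestamp", "value"])

cells = [
    Cell("xiaoming", "info:age", 1457082218735, "29"),
    Cell("xiaoming", "info:age", 1457082178630, "24"),
    Cell("xiaoming", "info:company", 1457082189826, "alibaba"),
    Cell("lisi", "info:age", 1457101972548, "27"),
    Cell("lisi", "info:company", 1457101972653, "baidu"),
]


def to_wide(cells, columns):
    """normal mode: keep only the latest version of each cell, one record per rowkey."""
    latest = {}
    for cell in cells:
        key = (cell.row, cell.column)
        if key not in latest or cell.timestamp > latest[key].timestamp:
            latest[key] = cell
    records = []
    for row in sorted({cell.row for cell in cells}):
        values = [latest[(row, col)].value if (row, col) in latest else None
                  for col in columns]
        records.append([row] + values)
    return records


def to_narrow(cells, columns, max_version=-1):
    """multiVersionFixedColumn mode: one record per version of each selected cell."""
    wanted = set(columns)
    grouped = {}
    for cell in cells:
        if cell.column in wanted:
            grouped.setdefault((cell.row, cell.column), []).append(cell)
    records = []
    for key in sorted(grouped):
        # Newest version first; -1 means that all versions are kept.
        versions = sorted(grouped[key], key=lambda c: c.timestamp, reverse=True)
        if max_version != -1:
            versions = versions[:max_version]
        records.extend([c.row, c.column, c.timestamp, c.value] for c in versions)
    return records
```

Calling to_wide(cells, ["info:age", "info:company"]) yields one record per rowkey with the newest info:age value, whereas to_narrow(cells, ["info:age"]) yields both versions of xiaoming's info:age as separate records.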

Data types

The following table lists the data types that are supported by HBase Reader.
Category        Data Integration data type  HBase data type
Integer         LONG                        SHORT, INT, and LONG
Floating point  DOUBLE                      FLOAT and DOUBLE
String          STRING                      BINARY_STRING and STRING
Date and time   DATE                        DATE
Byte            BYTES                       BYTES
Boolean         BOOLEAN                     BOOLEAN
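
HBase stores every cell value as a byte array, so the declared type determines how the bytes are interpreted. The following Python sketch shows one plausible decoding, assuming that the values were written with the big-endian conventions of the HBase Bytes utility class. The decode_cell function is a hypothetical helper, not part of HBase Reader, and it omits the DATE and BINARY_STRING types.

```python
import struct

# struct format strings for big-endian fixed-width numbers, which is how
# the HBase Bytes utility serializes short, int, long, float, and double.
_NUMERIC_FORMATS = {
    "short": ">h",
    "int": ">i",
    "long": ">q",
    "float": ">f",
    "double": ">d",
}


def decode_cell(raw, hbase_type, encoding="utf-8"):
    """Decode the raw bytes of one cell according to its declared type (illustrative only)."""
    kind = hbase_type.lower()
    if kind in _NUMERIC_FORMATS:
        # struct.error is raised if the byte length does not match the type.
        return struct.unpack(_NUMERIC_FORMATS[kind], raw)[0]
    if kind == "boolean":
        # Any non-zero first byte is treated as true.
        return raw[0] != 0
    if kind == "string":
        # The encoding argument plays the role of the encoding parameter (utf-8 or gbk).
        return raw.decode(encoding)
    if kind == "bytes":
        return raw
    raise ValueError("unsupported type in this sketch: " + hbase_type)
```

For example, decode_cell(struct.pack(">q", 27), "long") interprets eight bytes as a 64-bit integer, and decode_cell(data, "string", "gbk") decodes GBK-encoded text.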

Parameters

Parameter Description Required Default value
haveKerberos Specifies whether Kerberos authentication is required. Valid values: true and false.
Note
  • If you set this parameter to true, Kerberos authentication is required, and you must configure the following parameters that are related to Kerberos authentication:
    • kerberosKeytabFilePath
    • kerberosPrincipal
    • hbaseMasterKerberosPrincipal
    • hbaseRegionserverKerberosPrincipal
    • hbaseRpcProtection
  • If you set this parameter to false, Kerberos authentication is not required, and you do not need to configure the preceding parameters.
Required: No. Default value: false.
hbaseConfig The properties of the HBase cluster, in the JSON format. The hbase.zookeeper.quorum parameter is required. It specifies the ZooKeeper address of the HBase cluster. You can also configure other properties, such as those related to the cache and batch for scan operations.
Note You must use an internal endpoint to access an ApsaraDB for HBase database.
Required: Yes. Default value: none.
mode The mode in which HBase Reader reads data from HBase. Valid values: normal and multiVersionFixedColumn. Required: Yes. Default value: none.
table The name of the HBase table from which you want to read data. The name is case-sensitive. Required: Yes. Default value: none.
encoding The encoding format that is used to convert binary data in the HBase byte[] format to strings. Valid values: utf-8 and gbk. Required: No. Default value: utf-8.
column The names of the columns from which you want to read data.
  • In normal mode:
    The name parameter specifies the name of the column from which you want to read data. Specify the column in the columnFamily:columnName format, except for the rowkey column. The type parameter specifies the source data type. The format parameter specifies the date format. The value parameter specifies the column value if the column is a constant column. When HBase Reader reads data, it does not read data from the constant column, but uses the settings of the value parameter. The following code provides an example:
    "column": 
    [
    {
      "name": "rowkey",
      "type": "string"
    },
    {
      "value": "test",
      "type": "string"
    }
    ]

    For each entry in the column parameter, you must specify the type parameter and either the name or the value parameter.

  • In multiVersionFixedColumn mode:
    The name parameter specifies the name of the column from which you want to read data. Specify the column in the columnFamily:columnName format, except for the rowkey column. The type parameter specifies the source data type. The format parameter specifies the date format. Constant columns are not supported in multiVersionFixedColumn mode. The following code provides an example:
    "column": 
    [
    {
      "name": "rowkey",
      "type": "string"
    },
    {
      "name": "info:age",
      "type": "string"
    }
    ]
Required: Yes. Default value: none.
maxVersion The number of versions that are read by HBase Reader when multiple versions are available. Valid values: -1 and integers greater than 1. The value -1 indicates that all versions are read. Required: only in multiVersionFixedColumn mode. Default value: none.
range The rowkey range based on which HBase Reader reads data.
  • startRowkey: the start rowkey.
  • endRowkey: the end rowkey.
  • isBinaryRowkey: the method that is used to convert the specified start and end rowkeys to the byte[] format. Default value: false. If you set this parameter to true, the Bytes.toBytesBinary(rowkey) method is used. If you set this parameter to false, the Bytes.toBytes(rowkey) method is used. The following code provides an example:
    "range": {
    "startRowkey": "aaa",
    "endRowkey": "ccc",
    "isBinaryRowkey":false
    }
Required: No. Default value: none.
scanCacheSize The number of rows that HBase Reader reads from the HBase table each time. Required: No. Default value: 256.
scanBatchSize The number of columns that HBase Reader reads from the HBase table each time. Required: No. Default value: 100.
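
The effect of the isBinaryRowkey setting in the range parameter can be approximated in Python as follows. This is a simplified sketch rather than the HBase implementation: to_bytes mimics Bytes.toBytes(rowkey) by encoding the string as UTF-8, and to_bytes_binary mimics Bytes.toBytesBinary(rowkey) by turning each \xNN escape into a single byte. The function names are hypothetical, and the sketch assumes that every character outside an escape sequence is plain ASCII or Latin-1.

```python
import re

# Matches a backslash, the letter x, and exactly two hexadecimal digits.
_HEX_ESCAPE = re.compile(r"\\x([0-9A-Fa-f]{2})")


def to_bytes(rowkey):
    """Approximates Bytes.toBytes(rowkey): the string is encoded as UTF-8."""
    return rowkey.encode("utf-8")


def to_bytes_binary(rowkey):
    """Approximates Bytes.toBytesBinary(rowkey): each \\xNN escape becomes one byte."""
    out = bytearray()
    pos = 0
    for match in _HEX_ESCAPE.finditer(rowkey):
        # Copy the literal characters before the escape, then the escaped byte.
        out += rowkey[pos:match.start()].encode("latin-1")
        out.append(int(match.group(1), 16))
        pos = match.end()
    out += rowkey[pos:].encode("latin-1")
    return bytes(out)


def convert_rowkey(rowkey, is_binary_rowkey=False):
    """Select the conversion in the same way as the isBinaryRowkey setting."""
    return to_bytes_binary(rowkey) if is_binary_rowkey else to_bytes(rowkey)
```

With isBinaryRowkey set to false, a start rowkey such as \x00\x01abc is treated as eleven literal characters. With the value true, the same string becomes the five bytes 00 01 61 62 63, which matters when rowkeys contain non-printable bytes.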

Configure HBase Reader by using the codeless UI

This method is not supported.

Configure HBase Reader by using the code editor

In the following code, a synchronization node is configured to read data from HBase in normal mode. For more information about how to configure a synchronization node by using the code editor, see Create a synchronization node by using the code editor.
{
    "type":"job",
    "version":"2.0",// The version number. 
    "steps":[
        {
            "stepType":"hbase",// The reader type. 
            "parameter":{
                "mode":"normal",// The mode in which HBase Reader reads data. Valid values: normal and multiVersionFixedColumn. 
                "scanCacheSize":"256",// The number of rows that HBase Reader reads from the HBase table each time. 
                "scanBatchSize":"100",// The number of columns that HBase Reader reads from the HBase table each time.  
                "hbaseVersion":"094x/11x",// The HBase version. 
                "column":[// The columns from which you want to read data. 
                    {
                        "name":"rowkey",// The name of a column. 
                        "type":"string"// The data type. 
                    },
                    {
                        "name":"columnFamilyName1:columnName1",
                        "type":"string"
                    },
                    {
                        "name":"columnFamilyName2:columnName2",
                        "format":"yyyy-MM-dd",
                        "type":"date"
                    },
                    {
                        "name":"columnFamilyName3:columnName3",
                        "type":"long"
                    }
                ],
                "range":{// The rowkey range based on which HBase Reader reads data. 
                    "endRowkey":"",// The end rowkey. 
                    "isBinaryRowkey":true,// The method that is used to convert the specified start and end rowkeys to the byte[] format. Default value: false. 
                    "startRowkey":""// The start rowkey. 
                },
                "maxVersion":"",// The number of versions that are read by HBase Reader when multiple versions are available. 
                "encoding":"UTF-8",// The encoding format. 
                "table":"",// The name of the table from which you want to read data. 
                "hbaseConfig":{// The properties of the HBase cluster, in the JSON format. 
                    "hbase.zookeeper.quorum":"hostname",
                    "hbase.rootdir":"hdfs://ip:port/database",
                    "hbase.cluster.distributed":"true"
                }
            },
            "name":"Reader",
            "category":"reader"
        },
        {
            "stepType":"stream",
            "parameter":{},
            "name":"Writer",
            "category":"writer"
        }
    ],
    "setting":{
        "errorLimit":{
            "record":"0"// The maximum number of dirty data records allowed. 
        },
        "speed":{
            "throttle":true,// Specifies whether to enable bandwidth throttling. The value false indicates that bandwidth throttling is disabled, and the value true indicates that bandwidth throttling is enabled. The mbps parameter takes effect only when the throttle parameter is set to true. 
            "concurrent":1,// The maximum number of parallel threads. 
            "mbps":"12"// The maximum transmission rate.
        }
    },
    "order":{
        "hops":[
            {
                "from":"Reader",
                "to":"Writer"
            }
        ]
    }
}
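
Before a job like the preceding one is submitted, the rules from the Parameters section can be checked locally. The following Python sketch is a hypothetical helper, not part of Data Integration: validate_reader_parameter takes the contents of the reader's parameter object as a dict and returns a list of problems. It encodes only the rules stated in this topic and assumes that haveKerberos is given either as a Boolean value or as the string "true".

```python
VALID_MODES = ("normal", "multiVersionFixedColumn")
VALID_ENCODINGS = ("utf-8", "gbk")
KERBEROS_PARAMS = (
    "kerberosKeytabFilePath",
    "kerberosPrincipal",
    "hbaseMasterKerberosPrincipal",
    "hbaseRegionserverKerberosPrincipal",
    "hbaseRpcProtection",
)


def validate_reader_parameter(param):
    """Return a list of rule violations; an empty list means no problem was found."""
    errors = []
    hbase_config = param.get("hbaseConfig") or {}
    if not hbase_config.get("hbase.zookeeper.quorum"):
        errors.append("hbaseConfig must contain hbase.zookeeper.quorum")
    if not param.get("table"):
        errors.append("table is required")
    mode = param.get("mode")
    if mode not in VALID_MODES:
        errors.append("mode must be normal or multiVersionFixedColumn")
    if str(param.get("encoding", "utf-8")).lower() not in VALID_ENCODINGS:
        errors.append("encoding must be utf-8 or gbk")

    columns = param.get("column") or []
    if not columns:
        errors.append("column is required")
    for index, col in enumerate(columns):
        if "type" not in col:
            errors.append(f"column[{index}] needs a type")
        if "name" not in col and "value" not in col:
            errors.append(f"column[{index}] needs a name or a value")
        if mode == "multiVersionFixedColumn" and "value" in col:
            errors.append(f"column[{index}] is a constant column, which "
                          "multiVersionFixedColumn mode does not support")

    if mode == "multiVersionFixedColumn":
        try:
            max_version = int(param.get("maxVersion"))
        except (TypeError, ValueError):
            errors.append("maxVersion is required in multiVersionFixedColumn mode")
        else:
            if max_version != -1 and max_version <= 1:
                errors.append("maxVersion must be -1 or an integer greater than 1")

    if param.get("haveKerberos") in (True, "true"):
        for key in KERBEROS_PARAMS:
            if not param.get(key):
                errors.append(f"{key} is required when haveKerberos is true")
    return errors
```

Applied to the parameter object of the preceding example, the function would report that table is empty, because the sample leaves the table name blank as a placeholder.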