HBase Reader allows you to read data from HBase.

HBase Reader connects to a remote HBase database through a Java client. Then, it uses the scan method to read data based on the specified rowkey range, converts the data to a dataset based on Data Integration data types, and sends the dataset to a writer.

Supported features

  • Supports HBase 0.94.x, HBase 1.1.x, and HBase 2.x.
    • If you use HBase 0.94.x, set the plugin parameter to 094x.
      "reader": {
          "plugin": "094x"
      }
    • If you use HBase 1.1.x or HBase 2.x, set the plugin parameter to 11x.
      "reader": {
          "plugin": "11x"
      }
      Note: The HBase 1.1.x plugin is currently compatible with HBase 2.0. If you encounter issues when using HBase Reader, submit a ticket.
  • Supports the normal and multiVersionFixedColumn modes.
    • In normal mode, HBase Reader reads only the latest version of each cell from an HBase table and converts the data to a wide table, that is, a two-dimensional table with one record per rowkey.
      hbase(main):017:0> scan 'users'
      ROW                                   COLUMN+CELL
      lisi                                 column=address:city, timestamp=1457101972764, value=beijing
      lisi                                 column=address:contry, timestamp=1457102773908, value=china
      lisi                                 column=address:province, timestamp=1457101972736, value=beijing
      lisi                                 column=info:age, timestamp=1457101972548, value=27
      lisi                                 column=info:birthday, timestamp=1457101972604, value=1987-06-17
      lisi                                 column=info:company, timestamp=1457101972653, value=baidu
      xiaoming                             column=address:city, timestamp=1457082196082, value=hangzhou
      xiaoming                             column=address:contry, timestamp=1457082195729, value=china
      xiaoming                             column=address:province, timestamp=1457082195773, value=zhejiang
      xiaoming                             column=info:age, timestamp=1457082218735, value=29
      xiaoming                             column=info:birthday, timestamp=1457082186830, value=1987-06-17
      xiaoming                             column=info:company, timestamp=1457082189826, value=alibaba
      2 row(s) in 0.0580 seconds
      HBase Reader converts the data read from HBase to the following table.
      rowKey    address:city  address:contry  address:province  info:age  info:birthday  info:company
      lisi      beijing       china           beijing           27        1987-06-17     baidu
      xiaoming  hangzhou      china           zhejiang          29        1987-06-17     alibaba
    • In multiVersionFixedColumn mode, HBase Reader reads data from an HBase table and converts it to a narrow table. Each data record consists of four columns: rowKey, family:qualifier, timestamp, and value. You must specify the columns from which HBase Reader reads data. HBase Reader converts each version of a cell to a separate data record.
      hbase(main):018:0> scan 'users',{VERSIONS=>5}
      ROW                                   COLUMN+CELL
      lisi                                 column=address:city, timestamp=1457101972764, value=beijing
      lisi                                 column=address:contry, timestamp=1457102773908, value=china
      lisi                                 column=address:province, timestamp=1457101972736, value=beijing
      lisi                                 column=info:age, timestamp=1457101972548, value=27
      lisi                                 column=info:birthday, timestamp=1457101972604, value=1987-06-17
      lisi                                 column=info:company, timestamp=1457101972653, value=baidu
      xiaoming                             column=address:city, timestamp=1457082196082, value=hangzhou
      xiaoming                             column=address:contry, timestamp=1457082195729, value=china
      xiaoming                             column=address:province, timestamp=1457082195773, value=zhejiang
      xiaoming                             column=info:age, timestamp=1457082218735, value=29
      xiaoming                             column=info:age, timestamp=1457082178630, value=24
      xiaoming                             column=info:birthday, timestamp=1457082186830, value=1987-06-17
      xiaoming                             column=info:company, timestamp=1457082189826, value=alibaba
      2 row(s) in 0.0260 seconds
      HBase Reader converts the data read from HBase to the following table.
      rowKey    family:qualifier  timestamp      value
      lisi      address:city      1457101972764  beijing
      lisi      address:contry    1457102773908  china
      lisi      address:province  1457101972736  beijing
      lisi      info:age          1457101972548  27
      lisi      info:birthday     1457101972604  1987-06-17
      lisi      info:company      1457101972653  baidu
      xiaoming  address:city      1457082196082  hangzhou
      xiaoming  address:contry    1457082195729  china
      xiaoming  address:province  1457082195773  zhejiang
      xiaoming  info:age          1457082218735  29
      xiaoming  info:age          1457082178630  24
      xiaoming  info:birthday     1457082186830  1987-06-17
      xiaoming  info:company      1457082189826  alibaba
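
The difference between the two modes can be illustrated with a small Python sketch. It is not HBase Reader's implementation: the function names to_wide and to_narrow are hypothetical, the cells are a subset of the scan output shown earlier, and the handling of max_version assumes that the newest versions are kept first.

```python
from collections import defaultdict

# Cells as (rowkey, "family:qualifier", timestamp, value), a subset of the scan output.
CELLS = [
    ("xiaoming", "info:age", 1457082218735, "29"),
    ("xiaoming", "info:age", 1457082178630, "24"),
    ("xiaoming", "info:company", 1457082189826, "alibaba"),
    ("lisi", "info:age", 1457101972548, "27"),
    ("lisi", "info:company", 1457101972653, "baidu"),
]


def to_wide(cells, columns):
    """normal mode: keep the newest version of each cell, one record per rowkey."""
    latest = defaultdict(dict)  # rowkey -> {column: (timestamp, value)}
    for rowkey, column, ts, value in cells:
        current = latest[rowkey].get(column)
        if current is None or ts > current[0]:
            latest[rowkey][column] = (ts, value)
    records = []
    for rowkey in sorted(latest):
        row = [rowkey]
        for column in columns:
            cell = latest[rowkey].get(column)
            row.append(cell[1] if cell else None)
        records.append(row)
    return records


def to_narrow(cells, columns, max_version=-1):
    """multiVersionFixedColumn mode: one record per version of each listed cell."""
    grouped = defaultdict(list)  # (rowkey, column) -> [(timestamp, value), ...]
    for rowkey, column, ts, value in cells:
        if column in columns:
            grouped[(rowkey, column)].append((ts, value))
    records = []
    for rowkey, column in sorted(grouped):
        versions = sorted(grouped[(rowkey, column)], reverse=True)  # newest first
        if max_version != -1:  # assumption: -1 means "all versions"
            versions = versions[:max_version]
        for ts, value in versions:
            records.append([rowkey, column, ts, value])
    return records


if __name__ == "__main__":
    for record in to_wide(CELLS, ["info:age", "info:company"]):
        print(record)
    for record in to_narrow(CELLS, ["info:age"]):
        print(record)
```

In this sketch, the older info:age value of xiaoming (24) disappears from the wide result but appears as its own record in the narrow result.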

Data types

The following table lists the data types supported by HBase Reader.
Category        Data Integration data type  HBase data type
Integer         LONG                        SHORT, INT, and LONG
Floating point  DOUBLE                      FLOAT and DOUBLE
String          STRING                      BINARY_STRING and STRING
Date and time   DATE                        DATE
Byte            BYTES                       BYTES
Boolean         BOOLEAN                     BOOLEAN
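
The mapping in the table can be written down as a lookup, which is handy when you check the type values of a column list before you run a node. The following Python sketch only transcribes the table. The helper name data_integration_type and the case-insensitive lookup are assumptions made for illustration.

```python
# HBase data type (as written in a column's "type" field) -> Data Integration
# data type, transcribed from the table above.
TYPE_MAP = {
    "short": "LONG",
    "int": "LONG",
    "long": "LONG",
    "float": "DOUBLE",
    "double": "DOUBLE",
    "binary_string": "STRING",
    "string": "STRING",
    "date": "DATE",
    "bytes": "BYTES",
    "boolean": "BOOLEAN",
}


def data_integration_type(hbase_type):
    """Return the Data Integration type for an HBase type name."""
    key = hbase_type.strip().lower()  # assumption: type names are case-insensitive
    if key not in TYPE_MAP:
        raise ValueError(f"unsupported HBase data type: {hbase_type!r}")
    return TYPE_MAP[key]


if __name__ == "__main__":
    for name in ("int", "float", "date"):
        print(name, "->", data_integration_type(name))
```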

Parameters

Parameter Description Required Default value
haveKerberos Specifies whether Kerberos authentication is required. A value of true indicates that Kerberos authentication is required.
Note
  • If the value is true, the following five Kerberos-related parameters must be specified:
    • kerberosKeytabFilePath
    • kerberosPrincipal
    • hbaseMasterKerberosPrincipal
    • hbaseRegionserverKerberosPrincipal
    • hbaseRpcProtection
  • If the value is false, Kerberos authentication is not required and you do not need to specify the preceding parameters.
No false
hbaseConfig The properties of the HBase cluster, in JSON format. The hbase.zookeeper.quorum parameter is required. It specifies the ZooKeeper ensemble servers. You can also configure other properties, such as those related to the cache and batch for scan operations.
Note You must use the internal endpoint to access an ApsaraDB for HBase database.
Yes None
mode The mode in which data is read from HBase. Valid values: normal and multiVersionFixedColumn. Yes None
table The name of the HBase table from which data is read. The name is case-sensitive. Yes None
encoding The encoding used to convert binary data stored in byte[] format to strings. Valid values: UTF-8 and GBK. No UTF-8
column The HBase columns from which data is read. This parameter is required in both normal and multiVersionFixedColumn modes.
  • In normal mode:
    The name parameter specifies the name of the column in the HBase table. The format must be columnFamily:columnName except for the rowkey. The type parameter specifies the source data type. The format parameter specifies the date format. The value parameter specifies the column value if the column is a constant column. An example is provided as follows:
    "column": 
    [
    {
      "name": "rowkey",
      "type": "string"
    },
    {
      "value": "test",
      "type": "string"
    }
    ]

    Each entry in the column parameter must specify the type parameter, together with either the name parameter or the value parameter.

  • In multiVersionFixedColumn mode:
    The name parameter specifies the name of the column in the HBase table. The format must be columnFamily:columnName except for the rowkey. The type parameter specifies the source data type. The format parameter specifies the date format. You cannot create constant columns in multiVersionFixedColumn mode. An example is provided as follows:
    "column": 
    [
    {
      "name": "rowkey",
      "type": "string"
    },
    {
      "name": "info:age",
      "type": "string"
    }
    ]
Yes None
maxVersion The number of versions that HBase Reader reads when a cell has multiple versions. Valid values: -1 and integers greater than 1. A value of -1 indicates that all versions are read. Required in multiVersionFixedColumn mode None
range The rowkey range that HBase Reader reads.
  • startRowkey: the start rowkey.
  • endRowkey: the end rowkey.
  • isBinaryRowkey: specifies how the start and end rowkeys are converted to the byte[] format. Default value: false. If the value is true, Bytes.toBytesBinary(rowkey) is called. If the value is false, Bytes.toBytes(rowkey) is called. An example is provided as follows:
    "range": {
      "startRowkey": "aaa",
      "endRowkey": "ccc",
      "isBinaryRowkey": false
    }
No None
scanCacheSize The number of rows that an HBase client reads in each remote procedure call (RPC). No 256
scanBatchSize The number of columns that an HBase client reads in each RPC. No 100
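
The rules in this table can be checked before a node is submitted. The following Python sketch is one possible pre-flight check, not part of Data Integration. The function name check_reader_parameter is hypothetical, haveKerberos is assumed to be a JSON boolean, and "one of name and value" is interpreted as exactly one of the two.

```python
KERBEROS_KEYS = (
    "kerberosKeytabFilePath",
    "kerberosPrincipal",
    "hbaseMasterKerberosPrincipal",
    "hbaseRegionserverKerberosPrincipal",
    "hbaseRpcProtection",
)
MODES = ("normal", "multiVersionFixedColumn")


def check_reader_parameter(param):
    """Return a list of rule violations for one reader "parameter" dict."""
    problems = []

    # Parameters marked "Yes" in the Required column.
    for key in ("hbaseConfig", "mode", "table", "column"):
        if key not in param:
            problems.append(f"missing required parameter: {key}")

    hbase_config = param.get("hbaseConfig")
    if isinstance(hbase_config, dict) and "hbase.zookeeper.quorum" not in hbase_config:
        problems.append("hbaseConfig must set hbase.zookeeper.quorum")

    mode = param.get("mode")
    if mode is not None and mode not in MODES:
        problems.append(f"invalid mode: {mode}")

    # haveKerberos defaults to false; true requires five more parameters.
    if param.get("haveKerberos", False):
        for key in KERBEROS_KEYS:
            if key not in param:
                problems.append(f"haveKerberos is true but {key} is missing")

    for index, col in enumerate(param.get("column", [])):
        if "type" not in col:
            problems.append(f"column[{index}] has no type")
        if mode == "multiVersionFixedColumn":
            if "value" in col:
                problems.append(f"column[{index}] is a constant column, which this mode does not allow")
            if "name" not in col:
                problems.append(f"column[{index}] has no name")
        elif ("name" in col) == ("value" in col):
            problems.append(f"column[{index}] needs exactly one of name or value")

    if mode == "multiVersionFixedColumn":
        max_version = param.get("maxVersion")
        if max_version is None:
            problems.append("maxVersion is required in multiVersionFixedColumn mode")
        elif not (isinstance(max_version, int) and (max_version == -1 or max_version > 1)):
            problems.append("maxVersion must be -1 or an integer greater than 1")

    return problems
```

An empty list means that the block passes these checks. It does not guarantee that the node runs, because the sketch does not contact HBase or validate the values themselves.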

Configure HBase Reader by using the codeless UI

Currently, the codeless user interface (UI) is not supported for HBase Reader.

Configure HBase Reader by using the code editor

The following code configures a node to read data from HBase in normal mode.

{
    "type":"job",
    "version":"2.0",// The version number.
    "steps":[
        {
            "stepType":"hbase",// The reader type.
            "parameter":{
                "mode": "normal",// The mode in which data is read from the HBase database. Valid values: normal and multiVersionFixedColumn.
                "scanCacheSize": 256,// The number of rows read by an HBase client with each RPC connection.
                "scanBatchSize": 100,// The number of columns read by an HBase client with each RPC connection.
                "hbaseVersion":"094x/11x",// The HBase version.
                "column":[// The columns to be synchronized.
                    {
                        "name":"rowkey",// The rowkey name.
                        "type":"string"// The data type.
                    },
                    {
                        "name":"columnFamilyName1:columnName1",
                        "type":"string"
                    },
                    {
                        "name":"columnFamilyName2:columnName2",
                        "format":"yyyy-MM-dd",
                        "type":"date"
                    },
                    {
                        "name":"columnFamilyName3:columnName3",
                        "type":"long"
                    }
                ],
                "range":{// The rowkey range that HBase Reader reads.
                    "endRowkey":"",// The end rowkey.
                    "isBinaryRowkey":true,// The method used to convert the specified start and end rowkeys to the byte[] format. Default value: false.
                    "startRowkey":""// The start rowkey.
                },
                "maxVersion":"",// The number of versions read by HBase Reader when multiple versions are available.
                "encoding":"UTF-8",// The encoding format.
                "table":"",// The name of the table to be synchronized.
                "hbaseConfig":{// The properties of the HBase cluster, in JSON format.
                    "hbase.zookeeper.quorum":"hostname",
                    "hbase.rootdir":"hdfs://ip:port/database",
                    "hbase.cluster.distributed":"true"
                }
            },
            "name":"Reader",
            "category":"reader"
        },
        {// The following template is used to configure Stream Writer. For more information, see the corresponding topic.
            "stepType":"stream",
            "parameter":{},
            "name":"Writer",
            "category":"writer"
        }
    ],
    "setting":{
        "errorLimit":{
            "record":"0"// The maximum number of dirty data records allowed.
        },
        "speed":{
            "throttle":false,// Specifies whether to enable bandwidth throttling. A value of false indicates that the bandwidth is not throttled. A value of true indicates that the bandwidth is throttled. The maximum transmission rate takes effect only if you set this parameter to true.
            "concurrent":1// The maximum number of concurrent threads.
        }
    },
    "order":{
        "hops":[
            {
                "from":"Reader",
                "to":"Writer"
            }
        ]
    }
}
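
The isBinaryRowkey setting in the range block decides how the startRowkey and endRowkey strings become bytes. The following Python sketch approximates the two HBase conversions named in the Parameters section, Bytes.toBytes and Bytes.toBytesBinary. It is an illustration rather than the HBase code: the function names are hypothetical, and malformed escape sequences are kept literally here, which may differ from what HBase does.

```python
import string


def to_bytes(rowkey):
    """Approximates Bytes.toBytes(String): encode the text as UTF-8, unchanged."""
    return rowkey.encode("utf-8")


def to_bytes_binary(rowkey):
    """Approximates Bytes.toBytesBinary(String): turn \\xNN escapes into single bytes."""
    out = bytearray()
    i = 0
    while i < len(rowkey):
        hex_pair = rowkey[i + 2:i + 4]
        is_escape = (
            rowkey[i] == "\\"
            and rowkey[i + 1:i + 2] == "x"
            and len(hex_pair) == 2
            and all(ch in string.hexdigits for ch in hex_pair)
        )
        if is_escape:
            out.append(int(hex_pair, 16))  # one byte from two hex digits
            i += 4
        else:
            out.append(ord(rowkey[i]) & 0xFF)  # plain character, low byte only
            i += 1
    return bytes(out)


if __name__ == "__main__":
    key = "row\\x00\\x01"  # the 11 characters r o w \ x 0 0 \ x 0 1
    print(len(to_bytes(key)), len(to_bytes_binary(key)))
```

For a plain ASCII rowkey such as aaa, both conversions yield the same bytes. The setting matters only when the rowkeys in the configuration contain \xNN escapes that stand for raw bytes.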