
Configure HBase reader

Last Updated: Apr 12, 2018

The HBase Reader plug‑in reads data from HBase. Internally, HBase Reader connects to the remote HBase service through the HBase Java client, uses Scan to read the data within the rowkey range that you specify, assembles the data into an abstract dataset using the custom data types of Data Integration, and passes the dataset to the downstream Writer for processing.

Supported features

HBase 0.94.x and HBase 1.1.x versions are supported

  • If the HBase version is 0.94.x, choose hbase094x as the Reader plug‑in, as shown in the following example.

    "reader": {
        "plugin": "hbase094x"
    }

  • If the HBase version is 1.1.x, choose hbase11x as the Reader plug‑in, as shown in the following example.

    "reader": {
        "plugin": "hbase11x"
    }

normal and multiVersionFixedColumn modes are supported

  • normal mode: Treats the HBase table as an ordinary two‑dimensional (horizontal) table and reads the latest version of each cell. For example, consider the following table.

    hbase(main):017:0> scan 'users'
    ROW COLUMN+CELL
    lisi column=address:city, timestamp=1457101972764, value=beijing
    lisi column=address:contry, timestamp=1457102773908, value=china
    lisi column=address:province, timestamp=1457101972736, value=beijing
    lisi column=info:age, timestamp=1457101972548, value=27
    lisi column=info:birthday, timestamp=1457101972604, value=1987-06-17
    lisi column=info:company, timestamp=1457101972653, value=baidu
    xiaoming column=address:city, timestamp=1457082196082, value=hangzhou
    xiaoming column=address:contry, timestamp=1457082195729, value=china
    xiaoming column=address:province, timestamp=1457082195773, value=zhejiang
    xiaoming column=info:age, timestamp=1457082218735, value=29
    xiaoming column=info:birthday, timestamp=1457082186830, value=1987-06-17
    xiaoming column=info:company, timestamp=1457082189826, value=alibaba
    2 row(s) in 0.0580 seconds

    The data read from the table is as follows.

    rowKey    address:city  address:contry  address:province  info:age  info:birthday  info:company
    lisi      beijing       china           beijing           27        1987-06-17     baidu
    xiaoming  hangzhou      china           zhejiang          29        1987-06-17     alibaba

  • multiVersionFixedColumn mode: Treats the HBase table as a vertical table. Each record read from the table has four columns: rowKey, family:qualifier, timestamp, and value. You must specify the columns to read. Each cell value becomes one record, so a cell that has multiple versions produces multiple records. For example, consider the following table.

    hbase(main):018:0> scan 'users',{VERSIONS=>5}
    ROW COLUMN+CELL
    lisi column=address:city, timestamp=1457101972764, value=beijing
    lisi column=address:contry, timestamp=1457102773908, value=china
    lisi column=address:province, timestamp=1457101972736, value=beijing
    lisi column=info:age, timestamp=1457101972548, value=27
    lisi column=info:birthday, timestamp=1457101972604, value=1987-06-17
    lisi column=info:company, timestamp=1457101972653, value=baidu
    xiaoming column=address:city, timestamp=1457082196082, value=hangzhou
    xiaoming column=address:contry, timestamp=1457082195729, value=china
    xiaoming column=address:province, timestamp=1457082195773, value=zhejiang
    xiaoming column=info:age, timestamp=1457082218735, value=29
    xiaoming column=info:age, timestamp=1457082178630, value=24
    xiaoming column=info:birthday, timestamp=1457082186830, value=1987-06-17
    xiaoming column=info:company, timestamp=1457082189826, value=alibaba
    2 row(s) in 0.0260 seconds

    The data read from the table (in four columns) is as follows.

    rowKey    family:qualifier  timestamp      value
    lisi      address:city      1457101972764  beijing
    lisi      address:contry    1457102773908  china
    lisi      address:province  1457101972736  beijing
    lisi      info:age          1457101972548  27
    lisi      info:birthday     1457101972604  1987-06-17
    lisi      info:company      1457101972653  baidu
    xiaoming  address:city      1457082196082  hangzhou
    xiaoming  address:contry    1457082195729  china
    xiaoming  address:province  1457082195773  zhejiang
    xiaoming  info:age          1457082218735  29
    xiaoming  info:age          1457082178630  24
    xiaoming  info:birthday     1457082186830  1987-06-17
    xiaoming  info:company      1457082189826  alibaba
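
The difference between the two modes can be illustrated with a small Python sketch. It is not the plug-in's implementation: the function names, the in-memory cell list, and the use of None for a missing cell are assumptions made for illustration. It only shows how the same cells become horizontal records in normal mode and four-column records in multiVersionFixedColumn mode.

```python
from collections import defaultdict

# Cells as (rowkey, "family:qualifier", timestamp, value), taken from the scan output above.
CELLS = [
    ("xiaoming", "info:age", 1457082218735, "29"),
    ("xiaoming", "info:age", 1457082178630, "24"),
    ("xiaoming", "info:company", 1457082189826, "alibaba"),
    ("lisi", "info:age", 1457101972548, "27"),
]


def read_normal(cells, columns):
    """Return one record per rowkey; each column holds its latest version, or None."""
    latest = defaultdict(dict)  # rowkey -> column -> (timestamp, value)
    for rowkey, column, ts, value in cells:
        current = latest[rowkey].get(column)
        if current is None or ts > current[0]:
            latest[rowkey][column] = (ts, value)
    records = []
    for rowkey in sorted(latest):
        row = [rowkey]
        for column in columns:
            cell = latest[rowkey].get(column)
            row.append(cell[1] if cell else None)  # None is an assumption of this sketch
        records.append(tuple(row))
    return records


def read_multi_version(cells, columns, max_version=-1):
    """Return one (rowkey, column, timestamp, value) record per cell version."""
    grouped = defaultdict(list)
    for rowkey, column, ts, value in cells:
        if column in columns:
            grouped[(rowkey, column)].append((ts, value))
    records = []
    for rowkey, column in sorted(grouped):
        versions = sorted(grouped[(rowkey, column)], reverse=True)  # newest first
        if max_version != -1:
            versions = versions[:max_version]
        records.extend((rowkey, column, ts, value) for ts, value in versions)
    return records


if __name__ == "__main__":
    print(read_normal(CELLS, ["info:age", "info:company"]))
    print(read_multi_version(CELLS, ["info:age"]))
```

In this sketch, xiaoming contributes one record in normal mode (with age 29, the newest version) and two info:age records in multiVersionFixedColumn mode.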

Configure HBase client

A required configuration item in HBase Reader is hbaseConfig. Contact your HBase PE to obtain the connection-related configuration items from hbase-site.xml, and specify them in JSON format. You can also add other HBase client settings, such as the scan cache (hbase.client.scanner.caching) and the scan batch size, to optimize interaction with the server.

Note: Currently, all HBase client settings are specified in the hbaseConfig configuration item. For example, suppose hbase-site.xml is configured as follows.

  <configuration>
      <property>
          <name>hbase.rootdir</name>
          <value>hdfs://10.101.85.161:9000/hbase</value>
      </property>
      <property>
          <name>hbase.cluster.distributed</name>
          <value>true</value>
      </property>
      <property>
          <name>hbase.zookeeper.quorum</name>
          <value>v101085161.sqa.zmf</value>
      </property>
  </configuration>

The equivalent JSON value is as follows.

  "hbaseConfig": {
      "hbase.rootdir": "hdfs://10.101.85.161:9000/hbase",
      "hbase.cluster.distributed": "true",
      "hbase.zookeeper.quorum": "v101085161.sqa.zmf"
  }
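
If the hbase-site.xml file is at hand, the conversion can be scripted rather than done manually. The following Python sketch uses only the standard library; the function name hbase_site_to_config is hypothetical, and the sketch assumes that every property has a plain name and value element, as in the sample above.

```python
import json
import xml.etree.ElementTree as ET

# Sample content copied from the hbase-site.xml example above.
HBASE_SITE = """
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://10.101.85.161:9000/hbase</value>
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>v101085161.sqa.zmf</value>
  </property>
</configuration>
"""


def hbase_site_to_config(xml_text):
    """Return a dict mapping each property name to its value (both as strings)."""
    root = ET.fromstring(xml_text.strip())
    config = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None and value is not None:
            config[name.strip()] = value.strip()
    return config


if __name__ == "__main__":
    # Prints a JSON object whose "hbaseConfig" member can be pasted into the job.
    print(json.dumps({"hbaseConfig": hbase_site_to_config(HBASE_SITE)}, indent=4))
```

All values stay strings (for example "true"), which matches the JSON shown above.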

HBase Reader supports the following HBase data types and converts them to Data Integration internal types as follows.

Data Integration internal type   HBase data type
Long                             int, short, long
Double                           float, double
String                           string, binarystring
Date                             date
Boolean                          boolean
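
As a rough illustration of the table above, the following Python sketch decodes raw cell bytes for some of the listed types. It rests on an assumption that this page does not state: that numeric and boolean values were written with the HBase Bytes utility, which stores them as fixed-width big-endian bytes. The function name decode_cell is hypothetical, and the date and binarystring types are left out because their byte layout is not described here.

```python
import struct

# Mapping taken from the type table above.
INTERNAL_TYPE = {
    "int": "Long", "short": "Long", "long": "Long",
    "float": "Double", "double": "Double",
    "string": "String", "binarystring": "String",
    "date": "Date",
    "boolean": "Boolean",
}

# struct formats for fixed-width big-endian values (assumed storage layout).
_FIXED_WIDTH = {"short": ">h", "int": ">i", "long": ">q", "float": ">f", "double": ">d"}


def decode_cell(raw, hbase_type, encoding="utf-8"):
    """Return (internal type name, decoded Python value) for one cell."""
    internal = INTERNAL_TYPE[hbase_type]
    if hbase_type in _FIXED_WIDTH:
        return internal, struct.unpack(_FIXED_WIDTH[hbase_type], raw)[0]
    if hbase_type == "boolean":
        return internal, raw[0] != 0  # any non-zero byte counts as true
    if hbase_type == "string":
        return internal, raw.decode(encoding)
    raise NotImplementedError(f"{hbase_type} is not covered by this sketch")


if __name__ == "__main__":
    print(decode_cell(struct.pack(">q", 27), "long"))
    print(decode_cell("beijing".encode("utf-8"), "string"))
```

Values that the HBase shell displays as readable text, such as value=27 in the scan output earlier, are stored as strings and would be read with type string.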

Parameter description

  • haveKerberos

    • Description: Whether the HBase cluster requires Kerberos authentication. If haveKerberos is true, the following five Kerberos parameters must also be configured: kerberosKeytabFilePath, kerberosPrincipal, hbaseMasterKerberosPrincipal, hbaseRegionserverKerberosPrincipal, and hbaseRpcProtection. If the HBase cluster does not use Kerberos authentication, these parameters are not required.

    • Required: No

    • Default value: false

  • hbaseConfig

    • Description: The settings that a client needs to connect to an HBase cluster are stored in the cluster's hbase-site.xml. Contact your HBase PE for these settings and convert them to JSON format. You can also add other HBase client settings, such as the scan cache and the scan batch size, to optimize interaction with the server.

    • Required: Yes

    • Default value: None

  • mode

    • Description: The read mode of HBase Reader. Valid values: normal and multiVersionFixedColumn.

    • Required: Yes

    • Default value: None

  • table

    • Description: Name of HBase table to be read (case‑sensitive)

    • Required: Yes

    • Default value: None

  • encoding

    • Description: The encoding (UTF-8 or GBK) used to convert HBase byte[] values stored in binary form to String.

    • Required: No

    • Default value: utf-8

  • column

    • Description: HBase field to be read. This is required in both normal and multiVersionFixedColumn modes.

      In normal mode: “name” specifies the HBase column to read. Except for rowkey, it must be in the format column family:column name. “type” specifies the type of the source data, and “format” specifies the date format. “value” defines the column as a constant: the system does not read it from HBase but generates the column from “value”. The configuration format is as follows.

      "column":
      [
          {
              "name": "rowkey",
              "type": "string"
          },
          {
              "value": "test",
              "type": "string"
          }
      ]

      In normal mode, each column entry must specify type and exactly one of name or value.

      In multiVersionFixedColumn mode: “name” specifies the HBase column to read. Except for rowkey, it must be in the format column family:column name. “type” specifies the type of the source data, and “format” specifies the date format. Constant columns are not supported in multiVersionFixedColumn mode. The configuration format is as follows.

      "column":
      [
          {
              "name": "rowkey",
              "type": "string"
          },
          {
              "name": "info:age",
              "type": "string"
          }
      ]

    • Required: Yes

    • Default value: None

  • maxVersion

    • Description: The number of versions that HBase Reader reads in multi‑version mode. The value must be -1 (read all versions) or a number greater than 1.

    • Required: This is required in multiVersionFixedColumn mode.

    • Default value: None

  • range

    • Description: Specify the range of rowkey from which HBase Reader reads data.

      • startRowkey: Specify start rowkey.
      • endRowkey: Specify end rowkey.
      • isBinaryRowkey: Specify the method for converting configured startRowkey and endRowkey to byte[]. Default is false. If it is true, Bytes.toBytesBinary(rowkey) is called for conversion. If it is false, Bytes.toBytes(rowkey) is called.

        The configuration format is as follows.

        "range": {
            "startRowkey": "aaa",
            "endRowkey": "ccc",
            "isBinaryRowkey": false
        }

    • Required: No

    • Default value: None

  • scanCacheSize

    • Description: Number of rows that the HBase client fetches from the server in each RPC.

    • Required: No

    • Default value: 256

  • scanBatchSize

    • Description: Number of columns that the HBase client fetches from the server in each RPC.

    • Required: No

    • Default value: 100
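
The rules in the parameter descriptions above can be checked before a job is submitted. The following Python sketch is one possible pre-flight check, not part of Data Integration: the function name validate_reader_parameter and the error messages are hypothetical, and only the constraints stated on this page are enforced.

```python
KERBEROS_KEYS = (
    "kerberosKeytabFilePath",
    "kerberosPrincipal",
    "hbaseMasterKerberosPrincipal",
    "hbaseRegionserverKerberosPrincipal",
    "hbaseRpcProtection",
)
MODES = ("normal", "multiVersionFixedColumn")


def validate_reader_parameter(param):
    """Return a list of problems found in a reader "parameter" dict."""
    errors = []
    # Parameters marked "Required: Yes" above.
    for key in ("hbaseConfig", "mode", "table", "column"):
        if not param.get(key):
            errors.append(f"{key} is required")
    # haveKerberos=true makes the five Kerberos parameters mandatory.
    if param.get("haveKerberos", False):
        for key in KERBEROS_KEYS:
            if not param.get(key):
                errors.append(f"{key} is required when haveKerberos is true")
    mode = param.get("mode")
    if mode and mode not in MODES:
        errors.append(f"mode must be one of {MODES}")
    # maxVersion is required in multi-version mode: -1 or a number greater than 1.
    if mode == "multiVersionFixedColumn":
        try:
            max_version = int(param.get("maxVersion"))
        except (TypeError, ValueError):
            errors.append("maxVersion is required in multiVersionFixedColumn mode")
        else:
            if max_version != -1 and max_version <= 1:
                errors.append("maxVersion must be -1 or greater than 1")
    for i, col in enumerate(param.get("column") or []):
        has_name, has_value = "name" in col, "value" in col
        if "type" not in col:
            errors.append(f"column[{i}]: type is required")
        if mode == "multiVersionFixedColumn" and has_value:
            errors.append(f"column[{i}]: constant columns are not supported in this mode")
        elif has_name == has_value:
            errors.append(f"column[{i}]: set exactly one of name or value")
        if has_name and col["name"] != "rowkey" and ":" not in col["name"]:
            errors.append(f"column[{i}]: name must be rowkey or family:qualifier")
    return errors


if __name__ == "__main__":
    sample = {
        "hbaseConfig": {"hbase.zookeeper.quorum": "v101085161.sqa.zmf"},
        "mode": "multiVersionFixedColumn",
        "maxVersion": "1",
        "table": "users",
        "column": [{"name": "rowkey", "type": "string"}],
    }
    print(validate_reader_parameter(sample))
```

An empty list means that no violation of the documented rules was found. The check does not verify that the cluster is reachable or that the table exists.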

Development in wizard mode

Development in wizard mode is not supported currently.

Development in script mode

Configure a job to extract data from HBase to the local machine (normal mode).

  {
      "type": "job",
      "traceId": "your traceId",
      "version": "1.0",
      "configuration": {
          "setting": {
              "errorLimit": {
                  "record": "0"
              },
              "speed": {
                  "mbps": "1"
              }
          },
          "transformer": [],
          "reader": {
              "plugin": "hbase094x",
              "parameter": {
                  "haveKerberos": true,
                  "kerberosKeytabFilePath": "/opt/datax/xxx.keytab",
                  "kerberosPrincipal": "xxx/hadoopclient@xxx.xxx",
                  "hbaseMasterKerberosPrincipal": "xxx",
                  "hbaseRegionserverKerberosPrincipal": "xxx",
                  "hbaseRpcProtection": "privacy",
                  "hbaseConfig": {
                      "hbase.rootdir": "hdfs://10.101.85.161:9000/hbase",
                      "hbase.cluster.distributed": "true",
                      "hbase.zookeeper.quorum": "v101085161.sqa.zmf"
                  },
                  "table": "users",
                  "encoding": "utf-8",
                  "mode": "normal",
                  "column": [
                      {
                          "name": "rowkey",
                          "type": "string"
                      },
                      {
                          "name": "info:age",
                          "type": "string"
                      },
                      {
                          "name": "info:birthday",
                          "type": "date",
                          "format": "yyyy-MM-dd"
                      },
                      {
                          "name": "info:company",
                          "type": "string"
                      },
                      {
                          "name": "address:contry",
                          "type": "string"
                      },
                      {
                          "name": "address:province",
                          "type": "string"
                      },
                      {
                          "name": "address:city",
                          "type": "string"
                      }
                  ],
                  "range": {
                      "startRowkey": "",
                      "endRowkey": "",
                      "isBinaryRowkey": true
                  }
              }
          },
          "writer": {}
      }
  }
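
The job above sets isBinaryRowkey to true. The two conversions named in the range parameter description can be approximated in Python as follows. This is a sketch of the general behavior of Bytes.toBytes(String) (UTF-8 encoding) and Bytes.toBytesBinary(String) (decoding of \xNN escapes), not a port of the HBase source. Handling of malformed escapes and of lowercase hex digits may differ from the real methods, and the function names are hypothetical.

```python
import string

_HEX_DIGITS = set(string.hexdigits)


def to_bytes(rowkey):
    """Approximate Bytes.toBytes(String): encode the text as UTF-8."""
    return rowkey.encode("utf-8")


def to_bytes_binary(rowkey):
    """Approximate Bytes.toBytesBinary(String): turn each \\xNN escape into one byte."""
    out = bytearray()
    i = 0
    while i < len(rowkey):
        escape = rowkey[i:i + 4]
        if (len(escape) == 4 and escape.startswith("\\x")
                and escape[2] in _HEX_DIGITS and escape[3] in _HEX_DIGITS):
            out.append(int(escape[2:], 16))  # two hex digits -> one byte
            i += 4
        else:
            out.append(ord(rowkey[i]) & 0xFF)  # other characters: keep the low byte
            i += 1
    return bytes(out)


if __name__ == "__main__":
    key = "row\\x00\\x01"  # the 11 characters r o w \ x 0 0 \ x 0 1
    print(len(to_bytes(key)), len(to_bytes_binary(key)))
```

For a plain ASCII rowkey such as "aaa", both functions return the same bytes. The setting only matters when the configured rowkey contains \xNN escapes for bytes that cannot be typed directly.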

Configure a job to extract data from HBase to the local machine (multiVersionFixedColumn mode).

  {
      "type": "job",
      "traceId": "your traceId",
      "version": "1.0",
      "configuration": {
          "setting": {
              "errorLimit": {
                  "record": "0"
              },
              "speed": {
                  "mbps": "1"
              }
          },
          "transformer": [],
          "reader": {
              "plugin": "hbase094x",
              "parameter": {
                  "haveKerberos": true,
                  "kerberosKeytabFilePath": "/opt/datax/xxx.keytab",
                  "kerberosPrincipal": "xxx/hadoopclient@xxx.xxx",
                  "hbaseMasterKerberosPrincipal": "xxx",
                  "hbaseRegionserverKerberosPrincipal": "xxx",
                  "hbaseRpcProtection": "xxx",
                  "hbaseConfig": {
                      "hbase.rootdir": "hdfs://10.101.85.161:9000/hbase",
                      "hbase.cluster.distributed": "true",
                      "hbase.zookeeper.quorum": "v101085161.sqa.zmf"
                  },
                  "table": "users",
                  "encoding": "utf-8",
                  "mode": "multiVersionFixedColumn",
                  "maxVersion": "-1",
                  "column": [
                      {
                          "name": "rowkey",
                          "type": "string"
                      },
                      {
                          "name": "info:age",
                          "type": "string"
                      },
                      {
                          "name": "info:birthday",
                          "type": "date",
                          "format": "yyyy-MM-dd"
                      },
                      {
                          "name": "info:company",
                          "type": "string"
                      },
                      {
                          "name": "address:contry",
                          "type": "string"
                      },
                      {
                          "name": "address:province",
                          "type": "string"
                      },
                      {
                          "name": "address:city",
                          "type": "string"
                      }
                  ],
                  "range": {
                      "startRowkey": "",
                      "endRowkey": ""
                  }
              }
          },
          "writer": {}
      }
  }