DataWorks provides HBase Reader and HBase Writer for you to read data from and write data to HBase data sources. This topic describes the capabilities of synchronizing data from or to HBase data sources.
Supported versions
HBase 0.94.x, HBase 1.1.x, HBase 2.x, and Phoenix 5.x
If you use HBase 0.94.x, set the version to 094x. For HBase Reader, use the plugin parameter; for HBase Writer, use the hbaseVersion parameter.
"reader": { "plugin": "094x" }
"writer": { "hbaseVersion": "094x" }
If you use HBase 1.1.x or HBase 2.x, set the version to 11x. For HBase Reader, use the plugin parameter; for HBase Writer, use the hbaseVersion parameter.
"reader": { "plugin": "11x" }
"writer": { "hbaseVersion": "11x" }
Note: HBase 1.1.x Reader and HBase 1.1.x Writer are compatible with HBase 2.0.
HBase11xsql Writer writes multiple data records at a time to an HBase table that is created based on Phoenix. Phoenix encodes the primary key into the rowkey. If you use an HBase API to write data to an HBase table that is created based on Phoenix, you must manually convert the data, which is time-consuming and error-prone. HBase11xsql Writer performs this conversion for you and allows you to write data to an HBase table that packs all values of a column family into a single cell.
Note: HBase11xsql Writer connects to an HBase table by using the Phoenix Java Database Connectivity (JDBC) driver and executes an UPSERT statement to write multiple data records to the table at a time. Because the writes go through Phoenix, indexes on the table are updated synchronously when HBase11xsql Writer writes data.
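The following is a minimal, hedged sketch of the writer block of a script-mode configuration for HBase11xsql Writer. The plugin name and parameter keys (hbaseConfig, table, column, batchSize, nullMode) follow the layout of the open source DataX hbase11xsqlwriter plugin and are assumptions here; the table name, column names, and ZooKeeper address are placeholders. Confirm the exact parameters in Appendix: Code and parameters.

"writer": {
    "name": "hbase11xsqlwriter",
    "parameter": {
        "batchSize": 256,
        "nullMode": "skip",
        "table": "USERS",
        "hbaseConfig": {
            "hbase.zookeeper.quorum": "zk-host1,zk-host2,zk-host3:2181"
        },
        "column": ["ID", "AGE", "BIRTHDAY", "COMPANY"]
    }
}

The column list typically begins with the primary key column of the Phoenix table, because Phoenix encodes the primary key into the rowkey.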
Limits
HBase Reader | HBase20xsql Reader | HBase11xsql Writer |
Features
HBase Reader
HBase Reader supports normal and multiVersionFixedColumn modes.
In normal mode, HBase Reader reads the latest version of data from an HBase table and converts the data into a standard two-dimensional table (wide table). The following example shows the source data in the HBase shell:
hbase(main):017:0> scan 'users'
ROW          COLUMN+CELL
 lisi        column=address:city, timestamp=1457101972764, value=beijing
 lisi        column=address:contry, timestamp=1457102773908, value=china
 lisi        column=address:province, timestamp=1457101972736, value=beijing
 lisi        column=info:age, timestamp=1457101972548, value=27
 lisi        column=info:birthday, timestamp=1457101972604, value=1987-06-17
 lisi        column=info:company, timestamp=1457101972653, value=baidu
 xiaoming    column=address:city, timestamp=1457082196082, value=hangzhou
 xiaoming    column=address:contry, timestamp=1457082195729, value=china
 xiaoming    column=address:province, timestamp=1457082195773, value=zhejiang
 xiaoming    column=info:age, timestamp=1457082218735, value=29
 xiaoming    column=info:birthday, timestamp=1457082186830, value=1987-06-17
 xiaoming    column=info:company, timestamp=1457082189826, value=alibaba
2 row(s) in 0.0580 seconds
The following table describes the data reading result.
rowKey | address:city | address:contry | address:province | info:age | info:birthday | info:company |
lisi | beijing | china | beijing | 27 | 1987-06-17 | baidu |
xiaoming | hangzhou | china | zhejiang | 29 | 1987-06-17 | alibaba |
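The following is a hedged sketch of the reader block of a script-mode configuration that reads the users table above in normal mode. The parameter names (hbaseConfig, table, encoding, mode, column, range) mirror the layout of the open source DataX hbase11xreader plugin and are assumptions here; the ZooKeeper address, column types, and rowkey range are placeholders.

"reader": {
    "name": "hbase11xreader",
    "parameter": {
        "hbaseConfig": {
            "hbase.zookeeper.quorum": "zk-host1,zk-host2,zk-host3:2181"
        },
        "table": "users",
        "encoding": "utf-8",
        "mode": "normal",
        "column": [
            { "name": "rowkey", "type": "string" },
            { "name": "address:city", "type": "string" },
            { "name": "info:age", "type": "long" },
            { "name": "info:birthday", "type": "date", "format": "yyyy-MM-dd" },
            { "name": "info:company", "type": "string" }
        ],
        "range": {
            "startRowkey": "",
            "endRowkey": "",
            "isBinaryRowkey": false
        }
    }
}

Each configured column becomes one column of the wide table, and each HBase row produces one record.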
In multiVersionFixedColumn mode, HBase Reader reads data from an HBase table and converts the data into a narrow table with four columns: rowKey, family:qualifier, timestamp, and value. Before you use HBase Reader in this mode, you must specify the columns from which you want to read data. HBase Reader converts the value of each version of each cell into a separate data record. The following example shows the source data, scanned with multiple versions:
hbase(main):018:0> scan 'users',{VERSIONS=>5}
ROW          COLUMN+CELL
 lisi        column=address:city, timestamp=1457101972764, value=beijing
 lisi        column=address:contry, timestamp=1457102773908, value=china
 lisi        column=address:province, timestamp=1457101972736, value=beijing
 lisi        column=info:age, timestamp=1457101972548, value=27
 lisi        column=info:birthday, timestamp=1457101972604, value=1987-06-17
 lisi        column=info:company, timestamp=1457101972653, value=baidu
 xiaoming    column=address:city, timestamp=1457082196082, value=hangzhou
 xiaoming    column=address:contry, timestamp=1457082195729, value=china
 xiaoming    column=address:province, timestamp=1457082195773, value=zhejiang
 xiaoming    column=info:age, timestamp=1457082218735, value=29
 xiaoming    column=info:age, timestamp=1457082178630, value=24
 xiaoming    column=info:birthday, timestamp=1457082186830, value=1987-06-17
 xiaoming    column=info:company, timestamp=1457082189826, value=alibaba
2 row(s) in 0.0260 seconds
The following table describes the data reading result, which contains four columns.
rowKey | family:qualifier | timestamp | value |
lisi | address:city | 1457101972764 | beijing |
lisi | address:contry | 1457102773908 | china |
lisi | address:province | 1457101972736 | beijing |
lisi | info:age | 1457101972548 | 27 |
lisi | info:birthday | 1457101972604 | 1987-06-17 |
lisi | info:company | 1457101972653 | baidu |
xiaoming | address:city | 1457082196082 | hangzhou |
xiaoming | address:contry | 1457082195729 | china |
xiaoming | address:province | 1457082195773 | zhejiang |
xiaoming | info:age | 1457082218735 | 29 |
xiaoming | info:age | 1457082178630 | 24 |
xiaoming | info:birthday | 1457082186830 | 1987-06-17 |
xiaoming | info:company | 1457082189826 | alibaba |
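A hedged sketch of the same reader in multiVersionFixedColumn mode follows. The maxVersion parameter limits how many versions are read per cell; the parameter names and values again follow the open source hbase11xreader layout and are assumptions for illustration only.

"reader": {
    "name": "hbase11xreader",
    "parameter": {
        "hbaseConfig": {
            "hbase.zookeeper.quorum": "zk-host1,zk-host2,zk-host3:2181"
        },
        "table": "users",
        "encoding": "utf-8",
        "mode": "multiVersionFixedColumn",
        "maxVersion": 5,
        "column": [
            { "name": "rowkey", "type": "string" },
            { "name": "info:age", "type": "string" },
            { "name": "info:company", "type": "string" }
        ]
    }
}

Every version of every configured cell is emitted as one four-column record (rowKey, family:qualifier, timestamp, value), as shown in the table above.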
HBase Writer
Multiple fields of a source table can be concatenated as a rowkey.
HBase Writer can concatenate multiple fields of a source table to generate the rowkey of an HBase table.
You can specify the version of each HBase cell.
The version of an HBase cell can be set based on one of the following:
The current time
A specific source column
A specific time
Both capabilities are shown in the configuration sketch below.
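The following hedged sketch shows how these two capabilities might look in the writer block of a script-mode configuration: rowkeyColumn concatenates the source field at index 0, a constant underscore, and the source field at index 1 into the rowkey, and versionColumn takes the cell version from the source field at index 5. The parameter names follow the layout of the open source DataX hbase11xwriter plugin and are assumptions here; the indexes, the constant, and the column names are placeholders.

"writer": {
    "name": "hbase11xwriter",
    "parameter": {
        "hbaseConfig": {
            "hbase.zookeeper.quorum": "zk-host1,zk-host2,zk-host3:2181"
        },
        "table": "users",
        "mode": "normal",
        "encoding": "utf-8",
        "rowkeyColumn": [
            { "index": 0, "type": "string" },
            { "index": -1, "type": "string", "value": "_" },
            { "index": 1, "type": "string" }
        ],
        "versionColumn": {
            "index": 5
        },
        "column": [
            { "index": 2, "name": "info:age", "type": "long" },
            { "index": 3, "name": "info:birthday", "type": "string" },
            { "index": 4, "name": "info:company", "type": "string" }
        ]
    }
}

In this layout, an index of -1 in rowkeyColumn denotes a constant value rather than a source field, and omitting versionColumn typically falls back to the current time as the cell version.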
Data type mappings
Batch data read
The following table lists the data type mappings based on which HBase Reader converts data types.
Category | Data Integration data type | HBase data type |
Integer | long | SHORT, INT, and LONG |
Floating point | double | FLOAT and DOUBLE |
String | string | BINARY_STRING and STRING |
Date and time | date | date |
Byte | bytes | bytes |
Boolean | boolean | boolean |
HBase20xsql Reader supports most Phoenix data types. Make sure that the data types of your database are supported.
The following table lists the data type mappings based on which HBase20xsql Reader converts data types.
Data Integration data type | Phoenix data type |
long | INTEGER, TINYINT, SMALLINT, and BIGINT |
double | FLOAT, DECIMAL, and DOUBLE |
string | CHAR and VARCHAR |
date | DATE, TIME, and TIMESTAMP |
bytes | BINARY and VARBINARY |
boolean | BOOLEAN |
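For reference, a hedged sketch of the reader block of a script-mode configuration for HBase20xsql Reader is shown below. It reads a Phoenix table through a Phoenix query server; the parameter names (queryServerAddress, serialization, table, column, splitKey) follow the layout of the open source DataX hbase20xsqlreader plugin and are assumptions here, and the address, table, and columns are placeholders.

"reader": {
    "name": "hbase20xsqlreader",
    "parameter": {
        "queryServerAddress": "http://phoenix-queryserver-host:8765",
        "serialization": "PROTOBUF",
        "table": "USERS",
        "column": ["ID", "AGE", "BIRTHDAY", "COMPANY"],
        "splitKey": "ID"
    }
}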
Batch data write
The following table lists the data type mappings based on which HBase Writer converts data types.
The data types of specified columns must be the same as those in an HBase table.
Data types that are not listed in the following table are not supported.
Category | Data type supported by HBase Writer |
Integer | INT, LONG, and SHORT |
Floating point | FLOAT and DOUBLE |
Boolean | BOOLEAN |
String | STRING |
Precautions
If the "tried to access method com.google.common.base.Stopwatch" error message is displayed when you perform a connectivity test, you can add the "hbaseVersion": "" field for the Configuration information parameter in the HBase data source configuration dialog box. This field is used to specify the HBase version. For example, you can add "hbaseVersion": "2.0.14".
Add a data source
Before you develop a synchronization task in DataWorks, you must add the required data source to DataWorks by following the instructions in Add and manage data sources. You can view the infotips of parameters in the DataWorks console to understand the meanings of the parameters when you add a data source.
Develop a data synchronization task
For information about the entry point and the procedure for configuring a synchronization task, see the following configuration guides.
Configure a batch synchronization task to synchronize data of a single table
For more information about the configuration procedure, see Configure a batch synchronization task by using the codeless UI and Configure a batch synchronization task by using the code editor.
If you configure a batch synchronization task by using the codeless UI, the field mappings are not displayed by default when you add an HBase data source to the task, because HBase data sources do not have a fixed schema. You must manually configure the field mappings.
If you add an HBase data source as a source, specify each source field in the Field type|Column family:Column name format.
If you add an HBase data source as a destination, specify each destination field in the Index sequence number of a source field|Field type|Column family:Column name format, and specify the rowkey in the Index sequence number of the primary key from the source|Field type format.
Note: Each field occupies a separate line. See the illustrative example below.
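For example, for the users table described earlier, the field configuration might look like the following. The column names and field types are illustrative.

Source fields (HBase as the source):
string|address:city
string|info:age

Destination fields (HBase as the destination):
0|string|address:city
1|string|info:age

rowkey:
0|string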
For information about all parameters that are configured and the code that is run when you use the code editor to configure a batch synchronization task, see Appendix: Code and parameters.
FAQ
Q: What is the appropriate number of parallel threads? Can I increase the number of parallel threads to speed up the data synchronization?
A: We recommend 5 to 10 parallel threads. During data import, the default Java virtual machine (JVM) heap size is 2 GB. Parallel synchronization requires multiple threads, but running too many threads at the same time does not speed up synchronization and may even degrade job performance because of frequent garbage collection (GC).
Q: What is the appropriate value for the batchSize parameter?
A: The default value of the batchSize parameter is 256. Set batchSize based on the amount of data in each row. In most cases, each write operation should write 2 MB to 4 MB of data, so set batchSize to the data volume of a write operation divided by the data volume of a row.
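For example, if each row is about 8 KB and you target roughly 2 MB per write operation, a reasonable value is 2,048 KB / 8 KB = 256, which matches the default. The row size and target write size in this calculation are illustrative assumptions; measure your actual row size before tuning.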
Appendix: Code and parameters
Configure a batch synchronization task by using the code editor
If you want to configure a batch synchronization task by using the code editor, you must configure the related parameters in the script based on the unified script format requirements. For more information, see Configure a batch synchronization task by using the code editor. The following information describes the parameters that you must configure for data sources when you configure a batch synchronization task by using the code editor.