
Configure HBase11xsql Writer

Last Updated: Mar 21, 2018

HBase11xsql Writer imports data in batches into SQL tables (Phoenix tables) in HBase. Because Phoenix encodes the rowkey, writing data directly through the HBase API requires manual data conversion, which is complicated and error-prone. This plug-in supports importing into SQL tables with single keys.

Under the hood, the plug-in uses the Phoenix JDBC driver to run UPSERT statements that write data to HBase.

Supported functions

The plug-in supports importing into tables that have indexes and updates all index tables synchronously.

Restrictions

  • Supports only the HBase 1.x series.

  • Supports only tables created with Phoenix, not native HBase tables.

  • Does not support importing data with timestamps.

How it works

The plug-in uses the Phoenix JDBC driver to run UPSERT statements that write data to tables in batches. Because the writes go through this upper-layer interface rather than the raw HBase API, index tables are updated synchronously as well.
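The write path described above can be sketched as follows. This is a minimal illustration, not the plug-in's actual code: it builds a parameterized Phoenix UPSERT statement and splits rows into batches, one batch per commit. The function names `build_upsert` and `batches` and the table `USER_PROFILE` are hypothetical, and the sketch stops short of opening a JDBC connection.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Sequence


def build_upsert(table: str, columns: Sequence[str]) -> str:
    """Return a parameterized Phoenix UPSERT statement for the given columns.

    Identifiers are left unquoted, so Phoenix folds them to upper case.
    """
    if not columns:
        raise ValueError("at least one column is required")
    column_list = ", ".join(columns)
    placeholders = ", ".join("?" for _ in columns)
    return f"UPSERT INTO {table} ({column_list}) VALUES ({placeholders})"


def batches(rows: Iterable[tuple], batch_size: int = 256) -> Iterator[List[tuple]]:
    """Yield lists of at most batch_size rows (256 is the documented default)."""
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, batch_size))
        if not chunk:
            return
        yield chunk


# Hypothetical table and columns, for illustration only.
sql = build_upsert("USER_PROFILE", ["ID", "NAME", "AGE"])
print(sql)  # UPSERT INTO USER_PROFILE (ID, NAME, AGE) VALUES (?, ?, ?)
```

In a real writer, each batch would be bound to the statement and committed once, which is what keeps the number of round trips to HBase low.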

Parameter description

| Parameter | Description | Required | Default value |
|---|---|---|---|
| plugin | The name of the plug-in, which must be hbase11xsql. | Yes | None |
| table | The name of the table to import into, which is case-sensitive. Phoenix table names are typically in upper case. | Yes | None |
| column | The column names, which are case-sensitive. Phoenix column names are typically in upper case. The order of columns must match the order of the columns output by Reader. You do not have to specify data types; column metadata is retrieved automatically from Phoenix. | Yes | None |
| hbaseConfig | The address of the HBase cluster. The ZooKeeper (zk) address is required, in the format ip1,ip2,ip3 (multiple IP addresses are separated with commas). The znode is optional and defaults to /hbase. | Yes | None |
| batchSize | The maximum number of rows written in one batch. | No | 256 |
| nullMode | How to handle a column whose value read from the source is null. Two methods are available. skip: skips this column, that is, does not insert it (if this column of the row already exists, it is deleted). empty: inserts an empty value, which is 0 for numeric types and the empty string for varchar. | No | skip |
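The two nullMode behaviours can be illustrated with a small sketch. It is an assumption-laden model, not the plug-in's implementation: the function `apply_null_mode` and the `NUMERIC_TYPES` set are hypothetical, and only the numeric and varchar cases named in the table are covered. Deleting an already existing column value under skip happens on the HBase side and is not modelled here.

```python
from typing import Any, Dict, Optional

# Hypothetical type labels; the real plug-in reads column metadata from Phoenix.
NUMERIC_TYPES = {"INTEGER", "BIGINT", "DOUBLE", "DECIMAL"}


def apply_null_mode(
    row: Dict[str, Optional[Any]],
    column_types: Dict[str, str],
    null_mode: str = "skip",
) -> Dict[str, Any]:
    """Return the column values that would be written for one row."""
    if null_mode not in ("skip", "empty"):
        raise ValueError("nullMode must be 'skip' or 'empty'")
    result: Dict[str, Any] = {}
    for name, value in row.items():
        if value is not None:
            result[name] = value
        elif null_mode == "empty":
            column_type = column_types[name]
            if column_type in NUMERIC_TYPES:
                result[name] = 0      # numeric types become 0
            elif column_type == "VARCHAR":
                result[name] = ""     # varchar becomes the empty string
            else:
                raise ValueError(f"type {column_type} is not covered by this sketch")
        # null_mode == "skip": the column is left out of the write
    return result


types = {"ID": "INTEGER", "NAME": "VARCHAR", "AGE": "INTEGER"}
row = {"ID": 1, "NAME": None, "AGE": None}
print(apply_null_mode(row, types, "skip"))   # {'ID': 1}
print(apply_null_mode(row, types, "empty"))  # {'ID': 1, 'NAME': '', 'AGE': 0}
```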

Development in script mode

    {
      "type": "job",
      "version": "1.0",
      "configuration": {
        "setting": {
          "errorLimit": {
            "record": "0"
          },
          "speed": {
            "mbps": "1",
            "concurrent": "1"
          }
        },
        "reader": {
          "plugin": "odps",
          "parameter": {
            "datasource": "",
            "table": "",
            "column": [],
            "partition": ""
          }
        },
        "writer": {
          "plugin": "hbase11xsql",
          "parameter": {
            "table": "Name of the target HBase table, which is case-sensitive",
            "hbaseConfig": {
              "hbase.zookeeper.quorum": "ZK server address of the target HBase cluster, which can be consulted with PE",
              "zookeeper.znode.parent": "znode of the target HBase cluster, which can be consulted with PE"
            },
            "column": [
              "columnName"
            ],
            "batchSize": 256,
            "nullMode": "skip"
          }
        }
      }
    }
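Before submitting a job, the writer parameters can be checked against the parameter description. The following is a minimal sketch under stated assumptions: `normalize_writer_parameter` is a hypothetical helper, not part of the plug-in, and the table name `USER_PROFILE` is illustrative. It enforces the required parameters and fills in the documented defaults (batchSize 256, nullMode skip, znode /hbase).

```python
import json

REQUIRED_KEYS = ("table", "column", "hbaseConfig")
DEFAULTS = {"batchSize": 256, "nullMode": "skip"}


def normalize_writer_parameter(parameter: dict) -> dict:
    """Check required keys and fill in the documented defaults."""
    missing = [key for key in REQUIRED_KEYS if not parameter.get(key)]
    if missing:
        raise ValueError(f"missing required parameter(s): {', '.join(missing)}")

    hbase_config = dict(parameter["hbaseConfig"])
    if not hbase_config.get("hbase.zookeeper.quorum"):
        raise ValueError("hbaseConfig must set hbase.zookeeper.quorum")
    # The znode is optional and defaults to /hbase.
    hbase_config.setdefault("zookeeper.znode.parent", "/hbase")

    merged = {**DEFAULTS, **parameter, "hbaseConfig": hbase_config}
    if merged["nullMode"] not in ("skip", "empty"):
        raise ValueError("nullMode must be 'skip' or 'empty'")
    return merged


# Hypothetical writer parameters, for illustration only.
raw = """
{
  "table": "USER_PROFILE",
  "column": ["ID", "NAME"],
  "hbaseConfig": {"hbase.zookeeper.quorum": "ip1,ip2,ip3"}
}
"""
parameter = normalize_writer_parameter(json.loads(raw))
print(parameter["batchSize"], parameter["nullMode"],
      parameter["hbaseConfig"]["zookeeper.znode.parent"])  # 256 skip /hbase
```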

Limits

The order of columns in Writer must be consistent with that of columns in Reader. The column order in Reader defines the arrangement of columns in each output row while that in Writer defines the column order of the received data expected by Writer. For example:

The column order in Reader is: c1, c2, c3, c4.

The column order in Writer is: x1, x2, x3, x4.

Then, column c1 output by Reader is assigned to column x1 in Writer. If the column order in Writer is x1, x2, x4, x3, then column c3 is assigned to column x4 and column c4 is assigned to column x3.
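The positional matching in this example can be reproduced with a short sketch. The helper `map_columns` is hypothetical, and rejecting lists of different lengths is an assumption made here for safety rather than documented plug-in behaviour.

```python
from typing import Dict, Sequence


def map_columns(reader_columns: Sequence[str], writer_columns: Sequence[str]) -> Dict[str, str]:
    """Pair reader and writer columns by position, not by name."""
    if len(reader_columns) != len(writer_columns):
        raise ValueError("reader and writer must list the same number of columns")
    return dict(zip(reader_columns, writer_columns))


reader = ["c1", "c2", "c3", "c4"]
print(map_columns(reader, ["x1", "x2", "x3", "x4"]))
# {'c1': 'x1', 'c2': 'x2', 'c3': 'x3', 'c4': 'x4'}
print(map_columns(reader, ["x1", "x2", "x4", "x3"]))
# {'c1': 'x1', 'c2': 'x2', 'c3': 'x4', 'c4': 'x3'}
```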

FAQs

What concurrency should I set, and can I increase it when the import is slow?

The default JVM heap size of the data import process is 2 GB, and concurrency (the number of channels) is implemented with multiple threads. Creating more threads therefore does not always improve the import speed, and too many threads may even degrade performance because of frequent garbage collection (GC). Generally, we recommend setting the concurrency (number of channels) to a value between 5 and 10.

What should the batchSize value be?

This parameter defaults to 256, but it should be adjusted according to the actual size of each row. Normally, the data volume of each write operation should be between 2 and 4 MB, so an appropriate batchSize can be calculated by dividing that data volume by the size of one row.
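As a worked example of this sizing rule, the sketch below divides a target data volume per operation by the row size. The helper `estimate_batch_size` and the row sizes are hypothetical; the 2 MB default is simply the lower end of the 2 to 4 MB range mentioned above.

```python
def estimate_batch_size(row_bytes: int, target_bytes: int = 2 * 1024 * 1024) -> int:
    """Return the number of rows that fit in one write of about target_bytes."""
    if row_bytes <= 0:
        raise ValueError("row_bytes must be positive")
    # Never go below one row per batch, even for very large rows.
    return max(1, target_bytes // row_bytes)


print(estimate_batch_size(1024))                        # 2048 rows of 1 KB in 2 MB
print(estimate_batch_size(8 * 1024, 4 * 1024 * 1024))   # 512 rows of 8 KB in 4 MB
```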
