Migrate data from HBase to Hologres - Hologres - Alibaba Cloud Documentation Center

This topic describes how to migrate HBase syntax and data to Hologres for data development.

Background information

Hologres supports row-oriented tables, which provide high-performance point queries based on primary keys. This feature is widely used in scenarios such as Flink dimension tables, ID mapping, label retrieval, and order detail queries. Compared to traditional technologies such as HBase, Hologres retains horizontal scalability and high-performance primary key point queries. It also solves the data quality management challenges caused by a schema-free design, and its fewer external dependencies improve system stability. Hologres supports common HBase scenarios, such as wide table design, multi-stream merging, and prefix scans. It provides an SDK for a smooth migration from HBase APIs. You can migrate tables and data from HBase to Hologres for an improved data development experience.

Comparison between Hologres and HBase

The following table compares the features of Hologres and HBase.

Capabilities		HBase	Hologres
Product positioning		An open-source, distributed, column-family-oriented database.	A cloud-native, distributed, real-time data warehouse.
System architecture		Storage and computing are coupled. Storage depends on the underlying Hadoop Distributed File System (HDFS). HDFS clusters require manual scale-out. HBase tables are partitioned based on region size and stored on different nodes in the cluster.	Compute nodes use a Massively Parallel Processing (MPP) architecture. Based on the disaggregation of storage and compute, computing and storage resources can scale independently. This lets you configure computing power and storage space based on different business scenarios, which makes it more flexible and cost-effective.
Polymorphic storage		Supports only row store. Data is stored as a 4-tuple: `<rowkey, column, value, timestamp>`.	Supports row store, column store, and hybrid row-column store modes. Polymorphic storage allows for flexible storage methods based on different data access modes.
Schema expressiveness		Weak schema and weak data types.	Strong schema and rich data types. A strong schema ensures development efficiency and makes it easier to troubleshoot development issues using the schema when data quality is unreliable or data interfaces are unclear.
Global sorting		Global sorting	Local sorting and clustered indexes.
Sharding strategy		Supports presharding and automatic sharding modes.	Supports hash and random sharding modes.
Batch import		Supported. Uses the BulkLoad API.	Supported. Uses the BulkLoad API (COPY).
Real-time writes		Supported. Data is queryable immediately after being written. The write throughput is limited by compaction performance.	Supported. Data is queryable immediately after being written. Supports high-throughput writes.
Real-time updates		Supported.	Supported.
SQL support		Supported through the Phoenix extension. Features are limited, and joins are not supported. SQL performance is poor because of the key-value (KV) storage model.	Highly compatible with PostgreSQL, including protocol, syntax, and ecosystem compatibility.
Storage capacity		Based on HDFS. You must maintain the cluster yourself. The cluster automatically synchronizes data to multiple replicas. Storage capacity depends on the cluster size and supports linear scaling. It uses a Log-Structured Merge-Tree (LSM-Tree) data structure and multiple compression algorithms.	Based on the Pangu/HDFS distributed file system. Storage capacity depends on the cluster size and supports linear scaling. A single table can store over 3 PB of data. Multiple storage modes and compression algorithms enhance storage capabilities.
Query and analytics capabilities		Natively supports only point queries (`GET`) and scans (`SCAN`). It provides a high number of queries per second (QPS) for point queries, but scan performance is poor. Phoenix SQL is supported through Coprocessor, but its performance is poor. It does not support complex calculations or Online Analytical Processing (OLAP) scenarios.	Provides sub-second response times for real-time queries and analytics on hundreds of millions of data records. It has powerful join capabilities, high QPS for point queries, and supports Online Analytical Processing (OLAP) scenarios.
Federated computing		Not supported.	Supports real-time and offline federated computing. It can accelerate queries on foreign tables, such as those in MaxCompute and OSS.
Backup and disaster recovery		Supported. For data, it provides standard big data triplicate replica protection.	Supported. For data, it provides standard big data triplicate replica protection.
Query language	Query language	Java API. It must be used with other frameworks, such as Apache Phoenix.	PostgreSQL. It supports complete SQL syntax without requiring other components or frameworks. It also offers richer Data Definition Language (DDL) objects and supports full join queries.
	DDL	Supported. Keywords: create, alter, drop, describe, and list. Objects: namespace, table, and column family.	Supported. Keywords: create, alter, and drop. Objects: database, table, view, schema, cast, extension, role, user, user mapping, and group.
	DML	Supported. Keywords: put, get, scan, delete, and truncate.	Supported. Keywords: select, insert, update, and delete.
	DCL	Supported. Keywords: grant, revoke, and rollback.	Supported. Keywords: grant, revoke, and rollback.
O&M		Requires self-maintenance.	Fully managed. The system automatically detects changes in cluster topology information without affecting users.
Ecosystem		HBase is compatible with the Hadoop ecosystem.	Hologres is highly compatible with the PostgreSQL ecosystem.
Scenarios		Mass storage, unstructured storage, excellent single-point query performance, and write-intensive database.	Real-time data warehouse and online data services. It connects data silos, provides real-time query and analytics for massive amounts of data, supports scalable clusters, and offers full SQL support.
Development method		Application development is complex. You must convert business analytics concepts, such as metrics, dimensions, tables, and aggregations, into key-value (KV) storage concepts. Application-layer query filtering scenarios must be translated into byte filtering operations on keys. System efficiency heavily depends on the quality of the key design. The entire system, from data entry to data analytics and queries, relies on the application layer's use of basic KV interfaces.	Application development is simple. It is table-oriented and uses standard SQL statements. It is suitable for scenarios such as complex multidimensional analysis, nested queries, and join queries. It provides Java Database Connectivity (JDBC) and Open Database Connectivity (ODBC) interfaces and is oriented toward topic modelling for data. The development model shifts from being metric-oriented and wide-table-oriented to subject-area-oriented modeling in Hologres. This reduces heterogeneous information attenuation in the data model across the collection, processing, and analysis stages. It also reduces the number of data transformation layers and improves data usage flexibility.

Syntax migration

SQL syntax and common command mapping

Hologres supports the common basic syntax of HBase. It provides SQL interfaces and the HoloClient SDK.

HBase syntax	Hologres syntax	Client command
PUT	`insert into values on conflict`	`HoloClient.put(Put(table_name, primary_key_value, data_column_values))`
DELETE	`delete from`	`HoloClient.put(Delete(table_name, primary_key_value))`
SCAN	`select from where filter=XX`	`HoloClient.scan(Scan(table_name, primary_key_value + primary_key_range, returned_columns))`
GET	`select from where PK=XX`	`HoloClient.get(Get(table_name, primary_key_value, returned_columns))`

Wide table design
A wide table is the most basic data model in HBase. In Hologres, you can use row-oriented tables. For more information about how to design row-oriented tables, see Best practices for Key/Value query scenarios.
Row-oriented tables can support up to 3,000 columns. A column-oriented table should not have more than 1,000 columns.
The TEXT type creates bitmap indexes and dictionary encoding by default. For scenarios with a very large number of columns, do not use the default settings. You must manually configure the required bitmap indexes and dictionary encoding.
Multi-stream merging
Multi-stream merging uses the partial update capability of Hologres. Multiple data streams are written to a single table at the same time. The primary key of the table is used to achieve a merge effect, which is similar to a join.
Prefix scan
A prefix scan uses the prefix sorting feature of HBase to efficiently filter data. In Hologres, you can achieve the same filtering effect by designing a distribution key and a clustering key.
If the rowkey in HBase is [hash(pk0, pk1), pk0, pk1,..., pkN], you can set the primary key in Hologres to (pk0, pk1,..., pkN), the distribution key to (pk0, pk1), and the clustering key to be the same as the primary key.
- The following SQL statement supports a prefix equals scan:
```
WHERE pk0=? AND pk1=? AND ... AND pkX=? ;
```
- The following SQL statement supports a prefix range scan:
```
WHERE pk0=? AND ... AND pkX=? AND pkX+1 BETWEEN ? AND ?;
```
- If the primary key of the table is (rowkey, ts), the query command is as follows:
```
SELECT col1, col2, ts FROM tbl WHERE rowkey = '1234567890' AND ts BETWEEN 1637637479606 AND 1640229479607 LIMIT 100;
```
  In this case, you can set the distribution key to rowkey and the clustering key to the composite key (rowkey, ts).
- If the primary key of the table is (rowkey, ts), the query command is as follows:
```
SELECT col1, col2, ts FROM tbl WHERE rowkey = '1234567890' AND ts = 1637637479606;
```
  In this case, a suitable composite key design is to set the distribution key to (rowkey, ts) and the clustering key to (rowkey, ts).

Batch update

Hologres supports multiple update methods, such as Insert, Update, and Insert on Conflict. It also supports scenarios such as row updates, column updates, and upserts. In batch update scenarios, the write and update throughput can be high. This might affect online services that are latency-sensitive. In this case, you can implement batch updates by writing data to a temporary table and then atomically replacing the original table. This is similar to the BulkLoad mode. The following example shows how to do this.

-- Assume that the online service table is t_online.

-- Create a temporary table.
BEGIN;
CREATE TABLE t_tmp
(
  ...
)
COMMIT;

-- Write data to the temporary table.
INSERT INTO t_tmp SELECT * FROM t_offline;
-- Update statistics.
vacuum t_tmp;
analyze t_tmp;

-- Atomically replace the table within a transaction.
BEGIN;
ALTER TABLE t_online rename TO t_online_outdate;
ALTER TABLE t_tmp rename TO t_online;
COMMIT;

-- Wait for 5 seconds to allow queries on the old table to finish, then delete the obsolete data.
SELECT pg_sleep(5);
DROP TABLE t_online_outdate;

OLAP
Hologres supports not only point queries on primary keys in row-oriented tables but also OLAP scenarios. For storage, you can design a hybrid row-column store mode. When you run a query, the query engine selects the optimal storage structure based on the query features. For more information, see CREATE TABLE.