Migrate data from HBase to Hologres - Hologres - Alibaba Cloud Documentation Center

This topic describes how to migrate data from an HBase database to Hologres for data development and analytics.

Background information

Hologres supports the row-oriented storage mode. In this mode, primary key-based point queries are supported for a wide range of scenarios such as Hologres dimension tables in Realtime Compute for Apache Flink, ID mapping, tag queries, and order queries. Compared with traditional technologies such as HBase, Hologres maintains horizontal scalability and primary key-based point query capabilities, resolves the data quality management issue caused by schema-free data models, and improves system stability by using minimal external dependencies. Hologres is applicable to a wide range of scenarios in which HBase is suited, such as wide tables, stream joins, and prefix scans. Hologres also allows you to call API operations of HBase by using SDKs. You can migrate tables and data from an HBase database to Hologres for a better data development experience.

Comparison between Hologres and HBase

The following table compares the features of Hologres and HBase.

Item		HBase	Hologres
Service positioning		An open source, distributed, and column-oriented database.	A cloud native, distributed, and real-time data warehouse.
System architecture		HBase uses an architecture in which storage is coupled with computing and is dependent on the underlying Hadoop Distributed File System (HDFS). HDFS clusters must be manually scaled. HBase tables are partitioned based on the sizes of regions that are stored on different nodes of HDFS clusters.	Hologres uses a massively parallel processing (MPP) architecture for compute nodes. Computing and storage resources are independently scaled. The computing capacity and storage space can be configured to suit specific business scenarios, which improves flexibility and cost-effectiveness.
Storage mode		Supports only row-oriented storage. Data is stored in the `<rowkey, column, value, timestamp>` tuple.	Supports multiple storage modes: row-oriented storage, column-oriented storage, and row-column hybrid storage. You can select a storage mode based on the access mode of your data.
Schema expression		Supports weak schemas and weak data types.	Supports strong schemas and various data types. Strong schemas can ensure efficient development and facilitate schema-based troubleshooting when data quality is not reliable and the data source is unclear.
Global sorting		Supports global sorting.	Supports partial sorting and clustered indexes.
Sharding strategy		Supports pre-sharding and automatic sharding.	Supports hash sharding and random sharding.
Batch import		Supports the BulkLoad feature.	Supports the BulkLoad feature by executing the COPY statement.
Real-time writing		Supports real-time data writing. Data can be queried immediately after it is written. The write throughput is subject to the performance of compaction operations.	Supports real-time data writing. Data can be queried immediately after it is written. The write throughput is high.
Real-time update		Supports real-time update.	Supports real-time update.
SQL features		Supports SQL features by using Phoenix extensions but does not support JOIN operations. The SQL performance is low and subject to the storage mode of key-value pairs.	Hologres is highly compatible with the PostgreSQL protocol, syntax, and ecosystem.
Storage		Uses HDFS. HDFS clusters are manually maintained and data is automatically synchronized to multiple replicas. Storage capacity is related to cluster size. HBase supports linear scaling and uses the log-structured merge-tree (LSM tree) architecture and various data compression algorithms.	Uses Apsara Distributed File System and HDFS. Storage capacity is related to cluster size. Hologres supports linear scaling and up to 3 PB data per table. Hologres uses multiple storage modes and various data compression algorithms.
Query and analytics		Supports only point queries `(GET)` and scans `(SCAN)`. The point query performance is high but the scan performance is low. HBase supports Phoenix SQL by using coprocessors but achieves low performance. HBase does not support complex computing or online analytical processing (OLAP) scenarios.	Supports response within sub-seconds to hundreds of millions of data queries. Hologres provides powerful JOIN operations and high-performance point queries, and supports OLAP scenarios.
Federated computing		Does not support federated computing.	Supports offline real-time federated computing and accelerated queries on MaxCompute and Object Storage Service (OSS) tables.
Backup and disaster recovery		Supports data backup and disaster recovery. HBase also provides the standard triplicate storage for big data.	Supports data backup and disaster recovery. Hologres also provides the standard triplicate storage for big data.
Query language	Query language	Supports Java API. HBase requires the combination of another framework such as Apache Phoenix.	Supports PostgreSQL. Hologres supports the SQL syntax without the need to combine other components or frameworks. Hologres supports a wide range of DDL objects and join queries.
	DDL	Supports the CREATE, ALTER, DROP, DESCRIBE, and LIST keywords on the namespace, table, and column family objects.	Supports the CREATE, ALTER, and DROP keywords on the database, table, view, schema, cast, extension, role, user, user mapping, and group objects.
	DML	Supports the PUT, GET, SCAN, DELETE, and TRUNCATE keywords.	Supports the SELECT, INSERT, UPDATE, and DELETE keywords.
	DCL	Supports the GRANT, REVOKE, and ROLLBACK keywords.	Supports the GRANT, REVOKE, and ROLLBACK keywords.
O&M		Resource O&M is required.	Hologres is a fully managed service. Topology changes of clusters are automatically perceived and are imperceptible to users.
Ecosystem		HBase is compatible with the Hadoop ecosystem.	Hologres is highly compatible with the PostgreSQL ecosystem.
Scenarios		Scenarios in which a large volume of unstructured data exists, high-performance point queries, and write-intensive databases.	Real-time data warehouses, online data services, scenarios where data silos exist, real-time query and analysis of a large volume of data, elastic clusters, and scenarios where SQL support is required.
Development method		Application development is complex. Business analysis metrics, dimensions, tables, and aggregations must be converted into key-value pairs for storage. Queries and filters at the application layer must be translated into byte-based filters of keys. The system efficiency depends a lot on the design quality of the keys. In diverse complex scenarios such as data import, analytics, and queries, the use of KeyValue interfaces at the application layer assumes an important role.	Application development is simple. Hologres uses standard SQL statements to develop data on a per-table basis, which is applicable in scenarios such as complex multi-dimensional analytics, nested queries, and join queries. Hologres provides the Java Database Connectivity (JDBC) and Open Database Connectivity (ODBC) interfaces for theme-based data modeling and development. Hologres switches from metric- and wide table-based development to theme-based modeling, which reduces heterogeneous information decays of data models at the collection, processing, and analysis ends, reduces data processing layers, and improves data flexibility.

Conversion of SQL syntax

Mappings between the SQL syntax and client commands

Hologres supports the basic syntax of HBase, SQL syntax, and the Hologres client SDK.

HBase syntax	Hologres syntax	Client command
PUT	`insert into values on conflict`	`HoloClient.put(Put(Table name,Primary key value,Data column value))`
DELETE	`delete from`	`HoloClient.put(Delete(Table name,Primary key value))`
SCAN	`select from where filter=XX`	`HoloClient.scan(Scan(Table name,Primary key value+Primary key range,Columns to return))`
GET	`select from where PK=XX`	`HoloClient.get(Get(Table name,Primary key value,Columns to return))`

Wide table
A wide table is the most basic data model of an HBase database. In Hologres, row-oriented tables are used. For more information about row-oriented tables, see Query key-value pairs.
Each row-oriented table can contain up to 3,000 columns, and each column-oriented table can contain up to 1,000 columns.
By default, bitmap indexes are created and dictionary encoding is enabled for data of the TEXT type. In scenarios where a large number of columns exist, we recommend that you do not use the default settings. We recommend that you manually configure bitmap indexes and dictionary encoding.
Stream join
In stream joins, the partial update feature of Hologres is used to write multiple data streams into a single table. The data streams are merged or joined based on the primary key of the table.
Prefix scan
In prefix scans, the prefix sequence feature of HBase is used to filter data in an efficient manner. In Hologres, distribution keys and clustering keys are used to achieve a similar filter effect.
In this example, the rowkey in HBase is [hash(pk0, pk1), pk0, pk1,..., pkN]. You must set the primary key and clustering key in Hologres to (pk0, pk1,..., pkN) and set the distribution key to (pk0, pk1). Keep the clustering key consistent with the primary key.
- You can execute the following SQL statement to scan data based on the exact prefixes:
```
WHERE pk0=?  AND pk1=?  AND ... AND pkX=?  ;
```
- You can execute the following SQL statement to scan data based on a prefix range:
```
WHERE pk0=?  AND ... AND pkX=?  AND pkX+1 BETWEEN ?  AND ?;
```
- If the primary key of a table is (rowkey, ts), you can execute the following statement to query table data:
```
SELECT col1, col2, ts FROM tbl WHERE rowkey = '1234567890' AND ts BETWEEN 1637637479606 AND 1640229479607 LIMIT 100;
```
  In this case, set the distribution key to rowkey and set the clustering key to (rowkey, ts).
- If the primary key of a table is (rowkey, ts), you can execute the following statement to query table data:
```
SELECT col1, col2, ts FROM tbl WHERE rowkey = '1234567890' AND ts = 1637637479606;
```
  In this case, set both the distribution key and the clustering key to (rowkey, ts).

Batch update

Hologres supports multiple update methods, including INSERT, UPDATE, and INSERT ON CONFLICT, and various scenarios such as row update, column update, and UPSERT. In batch update scenarios, the write throughput is high, which may affect online services that are sensitive to latency. In this case, we recommend that you write the updated data to a temporary table and then perform atomic replacement for batch update. This is similar to BulkLoad. Example:

-- In this example, an online service has a table named t_online.

-- Create a temporary table.
BEGIN;
CREATE TABLE t_tmp
(
  ...
)
COMMIT;

-- Write data to the temporary table.
INSERT INTO t_tmp SELECT * FROM t_offline;
-- Update statistics.
vacuum t_tmp;
analyze t_tmp;

-- Perform atomic replacement in the transaction.
BEGIN;
ALTER TABLE t_online rename TO t_online_outdate;
ALTER TABLE t_tmp rename TO t_online;
COMMIT;

-- Wait for 5 seconds for the queries on the original table to complete. Then, delete the useless data.
SELECT pg_sleep(5);
DROP TABLE t_online_outdate;

OLAP
Hologres supports OLAP scenarios in addition to primary key-based point queries on row-oriented tables. In OLAP scenarios, you can use the row-column hybrid storage mode, and the query engine selects the optimal storage structure based on the query characteristics. For more information, see CREATE TABLE.