This topic describes how to migrate data from an HBase database to Hologres for data development and analytics.

Background information

Hologres provides the row-oriented storage mode. In this mode, primary key-based point queries are supported for a wide range of scenarios such as Flink dimension tables, ID mapping, tag queries, and order queries. Compared with traditional technologies such as HBase, Hologres resolves the data quality management issue caused by schema-free data models and improves system stability by using minimal external dependencies while maintaining horizontal scalability and primary key-based point query capabilities. Hologres is similarly applicable in a wide range of scenarios in which HBase is suited, such as wide tables, stream joins, and prefix scans. Hologres also provides SDKs to use HBase API operations. You can migrate tables and data from an HBase database to Hologres for a better data development experience.

Comparison between Hologres and HBase

The following table compares the features of Hologres and HBase.
Feature HBase Hologres
Service positioning An open source, distributed, column-oriented database. A cloud native, distributed, real-time data warehouse.
System architecture HBase uses an architecture in which storage is coupled with computing and dependent on the underlying Hadoop Distributed File System (HDFS). HDFS clusters must be manually scaled. HBase tables are partitioned based on the sizes of regions that are stored on different nodes of HDFS clusters. Hologres uses a massively parallel processing (MPP) architecture for compute nodes. Computing and storage resources are independently scaled based on storage disaggregation. The computing capacity and storage space can be configured to suit specific business scenarios, which improves flexibility and cost-effectiveness.
Storage mode Supports only row-oriented storage. Data is stored in the <rowkey, column, value, timestamp> tuple. Supports multiple storage modes: row-oriented storage, column-oriented storage, and hybrid row-column storage. You can select a storage mode based on the access mode of your data.
Schema expression Uses weak schemas and weak data types. Uses strong schemas and strong data types. Strong schemas can help ensure efficient development and facilitate schema-based troubleshooting when data quality is not reliable and the data source is not known.
Global sorting Supports global sorting. Supports partial sorting and clustered indexes.
Sharding strategy Supports pre-sharding and automatic sharding. Supports hash sharding and random sharding.
Batch import Supports the BulkLoad feature. Supports the BulkLoad feature by executing the COPY statement.
Real-time writing Supports real-time data writing. Data can be queried immediately after it is written. The write throughput is subject to the performance of compaction operations. Supports real-time data writing. Data can be queried immediately after it is written. The write throughput is high.
Real-time update Supports real-time update. Supports real-time update.
SQL features Supports SQL features by using Phoenix extensions but does not support JOIN operations. The SQL performance is low and subject to the storage mode of key-value pairs. Hologres is highly compatible with the PostgreSQL protocol, syntax, and ecosystem.
Storage Uses HDFS. HDFS clusters are manually maintained and data is automatically synchronized to multiple replicas. Storage capacity is related to cluster size. HBase supports linear scaling and uses the log-structured merge-tree (LSM tree) architecture and various data compression algorithms. Uses Apsara Distributed File System and HDFS. Storage capacity is related to cluster size. Hologres supports linear scaling and up to 3 PB data per table. Hologres uses multiple storage modes and various data compression algorithms.
Query and analytics Supports only point queries (GET) and scans (SCAN). The point query performance is high but the scan performance is low. HBase supports Phoenix SQL by using coprocessors but achieves low performance. HBase does not support complex computing or online analytical processing (OLAP) scenarios. Supports response within sub-seconds to hundreds of millions of data queries. Hologres provides powerful JOIN operations and high-performance point queries, and supports OLAP scenarios.
Federated computing Does not support federated computing. Supports offline real-time federated computing and accelerated queries on MaxCompute and Object Storage Service (OSS) tables.
Backup and disaster recovery Supports data backup and disaster recovery. HBase also provides the standard triplicate storage for big data. Supports data backup and disaster recovery. Hologres also provides the standard triplicate storage for big data.
Query language Query language Supports Java API. HBase requires the combination of another framework such as Apache Phoenix. Supports PostgreSQL. Hologres supports the SQL syntax without the need to combine other components or frameworks. Hologres supports a wide range of DDL objects and join queries.
DDL Supports the CREATE, ALTER, DROP, DESCRIBE, and LIST keywords on the namespace, table, and column family objects. Supports the CREATE, ALTER, and DROP keywords on the database, table, view, schema, cast, extension, role, user, user mapping, and group objects.
DML Supports the PUT, GET, SCAN, DELETE, and TRUNCATE keywords. Supports the SELECT, INSERT, UPDATE, and DELETE keywords.
DCL Supports the GRANT, REVOKE, and ROLLBACK keywords. Supports the GRANT, REVOKE, and ROLLBACK keywords.
O&M Requires manual O&M. Hologres is a fully managed service. Topology changes of clusters are automatically perceived and are imperceptible to users.
Ecosystem HBase is compatible with the Hadoop ecosystem. Hologres is highly compatible with the PostgreSQL ecosystem.
Scenarios Scenarios in which a large volume of unstructured data exists, high-performance point queries, and write-intensive databases. Real-time data warehouses, online data services, scenarios where data silos exist, real-time query and analysis of a large volume of data, elastic clusters, and scenarios where SQL support is required.
Development method Application development is complex. Business analysis metrics, dimensions, tables, and aggregations must be converted into key-value pairs for storage. Queries and filters at the application layer must be translated into byte-based filters of keys. The system efficiency depends a lot on the design quality of the keys. In diverse complex scenarios such as data import, analytics, and queries, the use of KeyValue interfaces at the application layer assumes an important role. Application development is simple. Hologres uses standard SQL statements to develop data on a per-table basis, which is applicable in scenarios such as complex multi-dimensional analytics, nested queries, and join queries. Hologres provides the Java Database Connectivity (JDBC) and Open Database Connectivity (ODBC) interfaces for theme-based data modeling and development. Hologres switches from metric- and wide table-based development to theme-based modeling, which reduces heterogeneous information decays of data models at the collection, processing, and analysis ends, reduces data processing layers, and improves data flexibility.

Conversion of SQL syntax

  • Mappings between the SQL syntax and client commands
    Hologres supports the basic syntax of HBase, SQL syntax, and Holo Client SDKs.
    Syntax in HBase Syntax in Hologres Client command
    PUT insert into values on conflict HoloClient.put(Put(Table name,Primary key value,Data column value))
    DELETE delete from HoloClient.put(Delete(Table name,Primary key value))
    SCAN select from where filter=XX HoloClient.scan(Scan(Table name,Primary key value+Primary key range,Columns to return))
    GET select from where PK=XX HoloClient.get(Get(Table name,Primary key value,Columns to return))
  • Wide table

    Wide tables are a basic data model of HBase. In Hologres, row-oriented tables are used. For more information about row-oriented tables, see Query key-value pairs.

    Each row-oriented table can contain up to 3,000 columns, and each column-oriented table can contain up to 1,000 columns.

    By default, bitmap indexes are created and dictionary encoding is enabled for data of the TEXT type. In scenarios where a large number of columns exist, we recommend that you do not use the default settings. We recommend that you manually configure bitmap indexes and dictionary encoding.

  • Stream join

    In stream joins, the partial update feature of Hologres is used to write multiple data streams into a single table. The data streams are merged (or joined) based on the primary key of the table.

  • Prefix scan

    In prefix scans, the prefix sequence feature of HBase is used to filter data in an efficient manner. In Hologres, distribution keys and clustering keys are used to achieve a similar filter effect.

    Assume that the rowkey in HBase is [hash(pk0, pk1), pk0, pk1,..., pkN]. You must set the primary key and clustering key in Hologres to (pk0, pk1,..., pkN) and set the distribution key to (pk0, pk1).
    • You can execute the following SQL statement to scan data based on prefixes:
      where pk0=? and pk1=? and ... and pkX=? ;
    • You can execute the following SQL statement to scan data based on a prefix range:
      where pk0=? and ... and pkX=? and pkX+1 between ? and ?;
    • If the primary key of a table is (rowkey, ts), you can execute the following statement to query table data:
      select col1, col2, ts from tbl where rowkey = '1234567890' and ts between 1637637479606 and 1640229479607 limit 100;
      In this case, set the distribution key to rowkey and set the clustering key to (rowkey, ts).
    • If the primary key of a table is (rowkey, ts), you can execute the following statement to query table data:
      select col1, col2, ts from tbl where rowkey = '1234567890' and ts = 1637637479606;
      In this case, set the distribution key and the clustering key to (rowkey, ts).
  • Batch update
    Hologres supports multiple update modes including INSERT, UPDATE, and INSERT ON CONFLICT, and various scenarios such as row update, column update, and UPSERT. In batch update scenarios, the write throughput is high, which may affect online services that are sensitive to latency. In this case, we recommend that you write data to a temporary table and then perform atomic replacement for batch update. This is similar to BulkLoad. Example:
    --An online service has a table named t_online.
    
    --Create a temporary table.
    begin;
    create table t_tmp
    (
      ...
    )
    commit;
    
    --Write data to the temporary table.
    insert into t_tmp select * from t_offline;
    --Update statistical information.
    vacumm t_tmp;
    analyze t_tmp;
    
    --Perform atomic replacement in the transaction.
    begin;
    alter table t_online rename to t_online_outdate;
    alter table t_tmp rename to t_online;
    commit;
    
    --Delete useless data 5 seconds after the query on the original table is complete.
    pg_sleep(5);
    drop table t_online_outdate;
  • OLAP

    Hologres supports OLAP scenarios in addition to primary key-based point queries on row-oriented tables. In OLAP scenarios, you can use the hybrid row-column storage mode, and the query engine selects the optimal storage structure based on the query characteristics. For more information, see CREATE TABLE.