Archive incremental data to MaxCompute

Last Updated: May 19, 2022

This topic describes how to archive incremental data of ApsaraDB for HBase clusters to MaxCompute.

Prerequisites

  • Lindorm Tunnel Service (LTS) is activated.

  • An HBase data source is added.

  • A MaxCompute data source is added.

Supported versions

  • Self-managed HBase V1.x and V2.x.

  • E-MapReduce HBase.

  • ApsaraDB for HBase Standard Edition, ApsaraDB for HBase Performance-enhanced Edition (cluster mode), and ApsaraDB for Lindorm.

Limits

  • Real-time data is archived based on HBase logs. Therefore, data that is imported by using bulk loading cannot be exported.

Lifecycle of log data

  • If log data is not consumed after you enable the archiving feature, the log data is retained for 48 hours by default. After the period expires, the subscription is automatically canceled and the retained data is automatically deleted.

  • If you release an LTS instance without stopping the synchronization tasks that are created on the LTS instance, the synchronization tasks are suspended and data is not consumed.

Submit an archiving task

  1. Log on to the LTS web UI. In the left-side navigation pane, choose Data Export > Incremental Archive to MaxCompute.

  2. Click create new job. On the page that appears, select a source HBase cluster and a destination MaxCompute project, and specify the HBase tables to export. The following example archives data from the wal-test HBase table to MaxCompute in real time.

    • The columns to be archived are cf1:a, cf1:b, cf1:c, and cf1:d.

    • The mergeInterval parameter specifies the archiving interval in milliseconds. The default value is 86400000 (one day).

    • Specify the mergeStartAt parameter in the format of yyyyMMddHHmmss. The value in this example specifies 00:00, September 30, 2019 as the start time. You can specify a time in the past.

  3. View the archiving progress of tables. The real-time synchronization channel shows the latency and offset of log synchronization tasks. Table Merge shows table merging tasks. After the tables are merged, you can query the new partitioned tables in MaxCompute.

  4. Query data in MaxCompute.
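The export configuration entered in step 2 can be sketched as follows. This is an illustrative snippet, not part of LTS: the table name wal-test and the parameter values mirror the example above.

```python
import json

# Export configuration matching the example in step 2: the archived
# columns are cf1:a through cf1:d, data is merged once per day, and
# merging starts at 00:00, September 30, 2019.
table = "wal-test"
conf = {
    "cols": ["cf1:a", "cf1:b", "cf1:c", "cf1:d"],
    "mergeInterval": 86400000,          # archiving interval in milliseconds
    "mergeStartAt": "20190930000000",   # yyyyMMddHHmmss; may be in the past
}
line = f"{table} {json.dumps(conf)}"
print(line)
```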

Parameters

The following code provides examples of the export configuration format:

hbaseTable/odpsTable {"cols": ["cf1:a|string", "cf1:b|int", "cf1:c|long", "cf1:d|short","cf1:e|decimal", "cf1:f|double","cf1:g|float","cf1:h|boolean","cf1:i"], "mergeInterval": 86400000, "mergeStartAt": "20191008100547"}
hbaseTable/odpsTable {"cols": ["cf1:a", "cf1:b", "cf1:c"],  "mergeStartAt": "20191008000000"}
hbaseTable {"mergeEnabled": false} // No merge operation is performed on the tables.

Each export configuration line consists of three parts: hbaseTable, odpsTable, and tbConf.

  • hbaseTable: the source HBase table.

  • odpsTable: the name of the MaxCompute table. This parameter is optional. By default, the name of the MaxCompute table is the same as the name of the source HBase table. The name cannot contain periods (.) or hyphens (-). If you use periods (.) or hyphens (-), they are converted to underscores (_).

  • tbConf: the archiving actions of the table. The following parameters are supported:

    • cols: Specifies the columns to be exported and the data types of the columns. By default, values are converted to the HexString format. Example: "cols": ["cf1:a", "cf1:b", "cf1:c"]

    • mergeEnabled: Specifies whether to convert key-value (KV) tables to wide tables. Default value: true. Example: "mergeEnabled": false

    • mergeStartAt: The start time for table merging tasks, in the yyyyMMddHHmmss format. You can specify a time in the past. Example: "mergeStartAt": "20191008000000"

    • mergeInterval: The interval of table merging tasks, in milliseconds. The default value is 86400000 (one day), which archives data on a daily basis. Example: "mergeInterval": 86400000