Hologres V3.2 introduces data lake table mirroring to accelerate data lake queries. This feature mirrors metadata and data by synchronizing changes from external data sources to Hologres in near real-time or at set intervals. You can mirror entire tables or specific partitions of partitioned tables. You can use parameters to control the scope of the mirrored data.
Feature details
When you access a data lake table through Hologres, the system automatically triggers a metadata mirror sync task. This task periodically synchronizes the latest metadata from the source table and caches it in Hologres. The default interval is one minute. On subsequent access, Hologres automatically uses the mirrored metadata to accelerate metadata access.
To mirror the data of a data lake table, you must manually enable the feature using the ALTER EXTERNAL TABLE command. The supported capabilities are as follows:
Supports enabling mirroring for Paimon append-only tables and primary key tables with deletion vectors.
Supports mirroring for entire partitioned and non-partitioned tables, and for specific partitions.
Supports schema evolution for external tables. Supported operations include adding, deleting, reordering, and renaming columns. For unsupported operations, the system automatically falls back to reading the external table directly.
Supports automatic discovery and synchronization of data lake table partitions.
Supports setting common internal table indexes on the table mirror. This allows SQL queries to quickly hit the data, which reduces I/O consumption and achieves faster query performance with fewer computing resources. For more information, see Set table properties and indexes. The supported indexes include the following:
data_mirroring_clustering_key
data_mirroring_bitmap_columns
data_mirroring_dictionary_encoding_columns
Supports querying mirrored data using secondary instances, compute group instances, and Serverless resources.
Supports viewing the progress of mirror synchronization.
Supports viewing the latest partition of a full mirror.
Supports viewing the mirror storage usage at the instance, DB, and table levels.
Supports modifying the data mirror configuration for a specific table.
Supports disabling the data mirroring feature for a specific table.
Limitations
External tables support metadata mapping using the
External Databasemethod, but not the Foreign Table method.Supports external tables with metadata managed by Data Lake Formation (DLF).
Mirroring data from Paimon branches is not currently supported.
Syntax
ALTER EXTERNAL TABLE <ext_db.ext_schema.ext_table>
SET(
-- Enable table mirroring for the external table
data_mirroring_speed_up_enable = 'on|off',
[data_mirroring_partition_num='1~N|all',] |[data_mirroring_partition_list='pt1,pt2...',]
-- Specify the resources used for the mirror build process
[data_mirroring_guc_hg_computing_resource='[serverless | local]',]
[data_mirroring_guc_hg_experimental_serverless_computing_required_cores='<num>',]
-- Set indexes
[data_mirroring_clustering_key='[columnName{:asc]} [,...]]',]
[data_mirroring_dictionary_encoding_columns='[columnName [,...]]',]
[data_mirroring_bitmap_columns='[columnName [,...]]',]
);Parameters
Parameter | Required | Description |
data_mirroring_speed_up_enable | Yes | Specifies whether to enable table mirroring for the external table. Valid values:
|
data_mirroring_partition_num | No | Specifies the n most recent partitions to mirror. This parameter applies only to partitioned tables. Valid values:
|
data_mirroring_partition_list | No | Specifies a list of partitions to mirror. This parameter applies only to partitioned tables. |
data_mirroring_guc_hg_computing_resource | No | Specifies the resources to use for the mirror build process. The default value is Local. Serverless: Uses Serverless resources for data mirroring. Local: Uses the resources of the current instance. |
data_mirroring_guc_hg_experimental_serverless_computing_required_cores | No | Specifies the number of cores required for Serverless resources. Note This parameter is required only when data_mirroring_guc_hg_computing_resource is set to Serverless. |
Usage
Prerequisites
You have activated DLF V2.5 or later and created a Paimon catalog.
You have completed the following operations:
Purchase a Hologres instance and activate the data lake acceleration service. For more information, see OSS data lake acceleration.
Create an EXTERNAL DATABASE to map the data lake table metadata to Hologres. For more information, see CREATE EXTERNAL DATABASE.
Enable table mirroring for an external table
Non-partitioned tables
ALTER EXTERNAL TABLE <ext_db.ext_schema.ext_table>
SET(
-- Enable table mirroring for the external table
data_mirroring_speed_up_enable = 'on',
-- Specify the resources for the mirror build process
data_mirroring_guc_hg_computing_resource='local'
-- Set indexes
data_mirroring_clustering_key='c_int:asc,c_char:desc',
data_mirroring_dictionary_encoding_columns='c_int:on,c_char:auto',
data_mirroring_bitmap_columns='c_int:on'
);Partitioned tables
Specify the number of partitions
ALTER EXTERNAL TABLE <ext_db.ext_schema.ext_table> SET( -- Enable table mirroring for the external table data_mirroring_speed_up_enable = 'on', data_mirroring_partition_num='7', -- Specify the resources for the mirror build process data_mirroring_guc_hg_computing_resource='serverless' data_mirroring_guc_hg_experimental_serverless_computing_required_cores='16', -- Set indexes data_mirroring_clustering_key='c_int:asc,c_char:desc', data_mirroring_dictionary_encoding_columns='c_int:on,c_char:auto', data_mirroring_bitmap_columns='c_int:on' );Specify all partitions
ALTER EXTERNAL TABLE <ext_db.ext_schema.ext_table> SET( -- Enable table mirroring for the external table data_mirroring_speed_up_enable = 'on', data_mirroring_partition_num='all', -- Specify the resources for the mirror build process data_mirroring_guc_hg_computing_resource='local', -- Set indexes data_mirroring_clustering_key='c_int:asc,c_char:desc, data_mirroring_dictionary_encoding_columns='c_int:on,c_char:auto', data_mirroring_bitmap_columns='c_int:on' );Specify a list of partitions
ALTER EXTERNAL TABLE <ext_db.ext_schema.ext_table> SET( -- Enable table mirroring for the external table data_mirroring_speed_up_enable = 'on', data_mirroring_partition_list='pt1,pt2...', -- Specify the resources for the mirror build process data_mirroring_guc_hg_computing_resource='local', -- Set indexes data_mirroring_clustering_key='c_int:asc,c_char:desc, data_mirroring_dictionary_encoding_columns='c_int:on,c_char:auto', data_mirroring_bitmap_columns='c_int:on' );Specify the number of partitions and specific partitions to retain historical partition mirrors
ALTER EXTERNAL TABLE <ext_db.ext_schema.ext_table> SET( -- Enable table mirroring for the external table data_mirroring_speed_up_enable = 'on', data_mirroring_partition_num='7', data_mirroring_partition_list='pt1', -- Specify the resources for the mirror build process data_mirroring_guc_hg_computing_resource='local', -- Set indexes data_mirroring_clustering_key='c_int:asc,c_char:desc, data_mirroring_dictionary_encoding_columns='c_int:on,c_char:auto', data_mirroring_bitmap_columns='c_int:on' );
View the mirror build status
Run the following command to view the mirror build status.
SELECT * FROM hologres.hg_datalake_get_mirror_status();The following table describes the fields in the output.
Field | Description |
external_db_name | The External Database where the table for the mirror build task resides. |
external_schema_name | The external schema where the table for the mirror build task resides. |
external_table_name | The external table for the mirror build task. |
partition | The partition in the external table for the mirror build task. |
mirror_data_size | The size of the mirror file for the table or partition. |
mirror_start_time | The start time of the mirror build task. |
mirror_last_update_time | The last update time of the mirrored data. |
total_file_count | Total file size |
mirrored_file_count | Mirror file size. |
Modify the mirror configuration
ALTER EXTERNAL TABLE <ext_db.ext_schema.ext_table>
SET(
-- Enable table mirroring for the external table
data_mirroring_speed_up_enable = 'on',
data_mirroring_partition_num='12'
);Check if a query uses mirror files
explain analyze SELECT xx FROM <holo_ext_db.ext_schema.ext_table>;The following result is returned.
QUERY PLAN
Gather (cost=0.00..5.80 rows=1000 width=116)
[128:1 id=100002 dop=1 time=26/26/26ms rows=107995(107995/107995/107995) mem=0/0/0B open=0/0/0ms get_next=26/26/26ms * ]
-> Local Gather (cost=0.00..5.01 rows=1000 width=116)
[id=2 dop=128 time=13/0/0ms rows=107995(97688/843/0) mem=0/0/0B open=0/0/0ms get_next=13/0/0ms local_dop=0/0/0]
-> Seq Scan on sls_ads_table (cost=0.00..5.01 rows=1000 width=116)
Foreign Table Type: DLF
[id=1 split_count=128 time=13/0/0ms rows=107995(97688/843/0) mem=37/5/4KB open=0/0/0ms get_next=13/0/0ms physical_reads=154(140/77/14) scan_rows=107995(97688/8999/671)]
ADVICE:
"[node id : 1] Table default."sls_ads_table" Miss Stats! please run 'analyze default."sls_ads_table"'. "
Query id:[1002007457093288810]
QE version: 2.0
Query Queue: init_warehouse.default_queue
======================cost======================
Total cost:[150] ms
Fast statistics cost:[0] ms
Optimizer cost:[73] ms
Build execution plan cost:[10] ms
Init execution plan cost:[9] ms
Start query cost:[13] ms
- Queue cost: [0] ms
- Wait schema cost:[0] ms
- Lock query cost:[0] ms
- Create dataset reader cost:[0] ms
- Create split reader cost:[3] ms
Get result cost:[45] ms
- Get the first block cost:[1] ms
====================resource====================
Memory: total 7 MB. Worker stats: max 3 MB, avg 3 MB, min 3 MB, max memory worker id: 1889346452478863373.
CPU time: total 96 ms. Worker stats: max 82 ms, avg 48 ms, min 14 ms, max CPU time worker id: 1889346412692090879.
DAG CPU time stats: max 77 ms, avg 38 ms, min 0 ms, cnt 2, max CPU time dag id: 1.
Fragment CPU time stats: max 66 ms, avg 25 ms, min 0 ms, cnt 3, max CPU time fragment id: 2.
Ec wait time: total 139 ms. Worker stats: max 79 ms, max(max) 2 ms, avg 69 ms, min 60 ms, max ec wait time worker id: 1889346452478863373, max(max) ec wait time worker id: 1889346412692090879.
Physical read bytes: total 1 MB. Worker stats: max 1 MB, avg 0 MB, min 0 MB, max physical read bytes worker id: 1889346412692090879.
Read bytes: total 7 MB. Worker stats: max 7 MB, avg 3 MB, min 0 MB, max read bytes worker id: 1889346412692090879.
DAG instance count: total 3. Worker stats: max 2, avg 1, min 1, max DAG instance count worker id: 1889346452478863373.
Fragment instance count: total 257. Worker stats: max 129, avg 128, min 128, max fragment instance count worker id: 1889346452478863373.
Meta mirror table count: use 1 miss 0.
Data mirror file count: use 12 miss 0.An image hit is indicated by the last two metrics in the returned result: use and count.
Disable table mirroring for an external table
After you disable mirroring, the system stops synchronizing data to Hologres. Queries on the external table directly access the data lake data. The internally mirrored data files are asynchronously deleted after 30 minutes.
ALTER EXTERNAL TABLE <holo_ext_db.ext_schema.ext_table> SET(
data_mirroring_speed_up_enable = 'off'
);View the mirror configuration
Run the following SQL statement to view which tables have data mirroring enabled. SELECT * FROM hologres.hg_datalake_get_mirror_config();
SELECT * FROM hologres.hg_datalake_get_mirror_config();View the mirror progress
Run the following SQL statement to view the progress of the mirroring process.
SELECT *
FROM hologres.hg_user_datalake_mirror_cron_tasks
WHERE command::jsonb->>'external_db_name'='external_database_name'
ORDER BY start_time DESC;In the statement, external_database_name is the name of the External Database.