Data lake table mirroring, introduced in Hologres V3.2, accelerates queries on data lake tables by syncing metadata and data from external sources into Hologres in near real-time or at scheduled intervals. Mirror entire tables or specific partitions of partitioned tables, and control the scope of mirrored data with parameters.
How it works
Mirroring operates at two levels: metadata and data.
Metadata mirroring starts automatically the first time you query a data lake table. Hologres triggers a background sync task that pulls the latest metadata from the source table and caches it locally. The default sync interval is one minute. Subsequent queries use the cached metadata, eliminating repeated round-trips to the external catalog.
Data mirroring must be enabled manually using ALTER EXTERNAL TABLE. Once enabled, Hologres copies the actual table data—not just metadata—into local storage and keeps it in sync with the source. Data mirroring supports:
-
Paimon append-only tables and primary key tables with deletion vectors
-
Non-partitioned tables and partitioned tables, including specific partition subsets
-
Schema evolution: adding, deleting, reordering, and renaming columns. Unsupported schema changes fall back to reading the external table directly.
-
Automatic discovery and synchronization of new partitions
-
Common Hologres indexes (clustering key, dictionary encoding, bitmap) on the mirrored data to reduce I/O and accelerate queries
-
Querying mirrored data from secondary instances, compute group instances, and Serverless resources
-
Viewing the latest partition of a full mirror
-
Viewing the mirror storage usage at the instance, DB, and table levels
When to use data mirroring
Data mirroring provides the most benefit when:
-
Queries repeatedly scan the same data lake tables or partitions
-
Network latency to the remote storage is high or unstable
-
Query performance is bottlenecked by remote I/O, not by Hologres compute
For tables accessed only occasionally, metadata mirroring alone may be sufficient.
Limitations
-
External tables must use the External Database mapping method. The foreign table method is not supported.
-
Only external tables whose metadata is managed by Data Lake Formation (DLF) are supported.
-
Mirroring data from Apache Paimon branches is not currently supported.
Enable data mirroring
Before you enable data mirroring, ensure that you have:
-
Data Lake Formation (DLF) V2.5 or later activated, with an Apache Paimon catalog created
-
A Hologres instance with the data lake acceleration service enabled (see OSS data lake acceleration)
-
An External Database created to map data lake table metadata to Hologres (see CREATE EXTERNAL DATABASE)
Use ALTER EXTERNAL TABLE with data_mirroring_speed_up_enable = 'on' to enable mirroring on an external table. All examples below use this parameter.
Non-partitioned table
ALTER EXTERNAL TABLE <ext_db.ext_schema.ext_table>
SET(
-- Enable data mirroring
data_mirroring_speed_up_enable = 'on',
-- Specify the resources for the mirror build process
data_mirroring_guc_hg_computing_resource = 'local',
-- Set indexes
data_mirroring_clustering_key = 'c_int:asc,c_char:desc',
data_mirroring_dictionary_encoding_columns = 'c_int:on,c_char:auto',
data_mirroring_bitmap_columns = 'c_int:on'
);
Partitioned table
Choose one of the following partition scopes based on your needs.
Mirror the N most recent partitions (recommended for large tables with time-based partitions):
ALTER EXTERNAL TABLE <ext_db.ext_schema.ext_table>
SET(
data_mirroring_speed_up_enable = 'on',
data_mirroring_partition_num = '7',
-- Use Serverless resources for the mirror build process
data_mirroring_guc_hg_computing_resource = 'serverless',
data_mirroring_guc_hg_experimental_serverless_computing_required_cores = '16',
data_mirroring_clustering_key = 'c_int:asc,c_char:desc',
data_mirroring_dictionary_encoding_columns = 'c_int:on,c_char:auto',
data_mirroring_bitmap_columns = 'c_int:on'
);
Mirror all partitions:
ALTER EXTERNAL TABLE <ext_db.ext_schema.ext_table>
SET(
data_mirroring_speed_up_enable = 'on',
data_mirroring_partition_num = 'all',
data_mirroring_guc_hg_computing_resource = 'local',
data_mirroring_clustering_key = 'c_int:asc,c_char:desc',
data_mirroring_dictionary_encoding_columns = 'c_int:on,c_char:auto',
data_mirroring_bitmap_columns = 'c_int:on'
);
Mirror a specific list of partitions:
ALTER EXTERNAL TABLE <ext_db.ext_schema.ext_table>
SET(
data_mirroring_speed_up_enable = 'on',
data_mirroring_partition_list = 'pt1,pt2',
data_mirroring_guc_hg_computing_resource = 'local',
data_mirroring_clustering_key = 'c_int:asc,c_char:desc',
data_mirroring_dictionary_encoding_columns = 'c_int:on,c_char:auto',
data_mirroring_bitmap_columns = 'c_int:on'
);
Combine `partition_num` and `partition_list` to retain both recent partitions and specific historical partitions:
ALTER EXTERNAL TABLE <ext_db.ext_schema.ext_table>
SET(
data_mirroring_speed_up_enable = 'on',
data_mirroring_partition_num = '7',
data_mirroring_partition_list = 'pt1',
data_mirroring_guc_hg_computing_resource = 'local',
data_mirroring_clustering_key = 'c_int:asc,c_char:desc',
data_mirroring_dictionary_encoding_columns = 'c_int:on,c_char:auto',
data_mirroring_bitmap_columns = 'c_int:on'
);
Parameters
| Parameter | Required | Description |
|---|---|---|
data_mirroring_speed_up_enable |
Yes | Enables or disables data mirroring. Valid values: on, off. |
data_mirroring_partition_num |
No | Applies to partitioned tables. Specifies the number of most recent partitions to mirror. Valid values: all (all partitions) or an integer 1 to N (the N most recent partitions). |
data_mirroring_partition_list |
No | Applies to partitioned tables. Specifies a comma-separated list of partition names to mirror. |
data_mirroring_guc_hg_computing_resource |
No | Specifies the resources used for the mirror build process. Default: local. Valid values: local (current instance resources) or serverless (Serverless resources). |
data_mirroring_guc_hg_experimental_serverless_computing_required_cores |
No | Specifies the number of cores for the Serverless build process. Required only when data_mirroring_guc_hg_computing_resource is set to serverless. |
data_mirroring_clustering_key |
No | Sets a clustering key on the mirrored data to accelerate range queries. |
data_mirroring_dictionary_encoding_columns |
No | Sets dictionary encoding on specified columns to reduce storage and speed up queries on low-cardinality data. |
data_mirroring_bitmap_columns |
No | Sets bitmap indexes on specified columns to accelerate filter queries. |
Monitor mirroring
Use the following tools depending on what you want to check.
| Goal | Tool |
|---|---|
| See which tables have mirroring enabled | hg_datalake_get_mirror_config() |
| Check the build status and progress of active mirror tasks | hg_datalake_get_mirror_status() |
| View the history and scheduling of mirror sync tasks | hg_user_datalake_mirror_cron_tasks |
| Confirm whether a query hit the mirror | EXPLAIN ANALYZE |
View mirroring configuration
Run the following statement to list all tables that have data mirroring enabled:
SELECT * FROM hologres.hg_datalake_get_mirror_config();
Check mirror build status
Run the following statement to check the build status and file sync progress of active mirror tasks:
SELECT * FROM hologres.hg_datalake_get_mirror_status();
The output includes the following fields:
| Field | Description |
|---|---|
external_db_name |
The External Database where the mirrored table resides. |
external_schema_name |
The external schema where the mirrored table resides. |
external_table_name |
The mirrored external table. |
partition |
The partition being mirrored, if applicable. |
mirror_data_size |
The size of the mirrored data for the table or partition. |
mirror_start_time |
The start time of the mirror build task. |
mirror_last_update_time |
The last time the mirrored data was updated. |
total_file_count |
Total number of files in the source table or partition. |
mirrored_file_count |
Number of files that have been mirrored so far. |
To estimate sync progress, compare mirrored_file_count against total_file_count. When both values are equal, the mirror build is complete.
View mirror sync history
Run the following statement to view the scheduling and execution history of mirror sync tasks for a specific External Database:
SELECT *
FROM hologres.hg_user_datalake_mirror_cron_tasks
WHERE command::jsonb->>'external_db_name' = '<external_database_name>'
ORDER BY start_time DESC;
Replace <external_database_name> with the name of your External Database.
Confirm query hits the mirror
Run EXPLAIN ANALYZE on your query to check whether Hologres served the results from the mirrored data:
EXPLAIN ANALYZE SELECT <columns> FROM <holo_ext_db.ext_schema.ext_table>;
At the end of the output, look for the following two lines:
Meta mirror table count: use 1 miss 0.
Data mirror file count: use 12 miss 0.
A miss value of 0 means all data was served from the mirror. If miss is greater than 0, some data was still read from the remote data lake.
Modify the mirror configuration
To update the mirror configuration—for example, to change the number of partitions—run ALTER EXTERNAL TABLE again with the updated parameters:
ALTER EXTERNAL TABLE <ext_db.ext_schema.ext_table>
SET(
data_mirroring_speed_up_enable = 'on',
data_mirroring_partition_num = '12'
);
Disable data mirroring
To disable mirroring on a table, set data_mirroring_speed_up_enable to off:
ALTER EXTERNAL TABLE <holo_ext_db.ext_schema.ext_table>
SET(
data_mirroring_speed_up_enable = 'off'
);
After mirroring is disabled, Hologres stops syncing data and queries access the data lake directly. The locally mirrored data files are asynchronously deleted within 30 minutes.
What's next
-
Set table properties and indexes — learn more about the index types supported on mirrored data
-
OSS data lake acceleration — overview of the data lake acceleration service
-
CREATE EXTERNAL DATABASE — set up the External Database required for data mirroring