All Products
Search
Document Center

Hologres:Speed up Paimon reads

Last Updated:Mar 26, 2026

Run on-demand SQL queries against Apache Paimon data in Object Storage Service (OSS) using a Hologres shared cluster — no reserved resources required. You pay only for the data scanned by each query.

What you'll learn

In this tutorial, you'll learn how to:

  • Create an E-MapReduce (EMR) data lake cluster with Paimon and Data Lake Formation (DLF) unified metadata

  • Build a Paimon data source from TPC-H textfile data using Apache Spark

  • Connect Hologres to the Paimon data lake via a DLF foreign server

  • Run analytical queries against Paimon foreign tables in Hologres

Background

Apache Paimon is a unified lake storage format for streaming and batch processing. It supports high-throughput writes and low-latency queries, and integrates with Alibaba Cloud compute engines including Flink, Hologres, MaxCompute, and EMR Spark. For more information, see Apache Paimon.

Hologres shared clusters provide a serverless online query acceleration service designed for foreign tables in MaxCompute and OSS data lakes. Built on a cloud-native, storage-compute separation architecture, shared clusters let you query lake data on demand and pay based on the amount of data each SQL statement scans. For more information, see Overview of shared clusters.

Architecture

image.png

Choose a cluster type

Hologres offers two cluster types for querying Paimon data. Choose based on whether you need write access.

Shared cluster Dedicated instance
Read Paimon data from OSS Yes Yes
Import OSS data into Hologres internal tables No Yes
Billing Per data scanned Reserved resources
Used in this tutorial Yes (example) Same procedure
Dedicated instances use the same procedures described in this tutorial to read Paimon data.

Limitations

  • Hologres V2.1.6 or later is required to query Paimon data lakes.

  • Shared clusters support read-only access to OSS data lakes. To import OSS data into Hologres internal tables, use a dedicated instance instead.

Prerequisites

Before you begin, make sure you have:

Set up a Paimon lakehouse and query it from Hologres

This tutorial uses 10 GB of TPC-H data. The end-to-end workflow has four stages: create an EMR cluster, build the Paimon data source, connect Hologres, and run queries.

Step 1: Create an EMR data lake cluster

  1. Log on to the EMR on ECS console and create an EMR cluster. For general cluster creation steps, see Create a cluster.

  2. Set the following key parameters:

    Parameter Value
    Business scenario Select Data Lake
    Optional services Select Spark, Hive, and Paimon. Add other services as needed.
    Metadata Select DLF Unified Metadata
    DLF data catalog Select a custom catalog (for example, paimon_catalog) if you created one, or use the default catalog. If you use the default catalog, you can omit the dlf_catalog parameter when creating the foreign server later.
    Cluster storage root path Select an OSS bucket with the OSS-HDFS service enabled

Step 2: Build the Paimon data source

Build a textfile data source using EMR Hive, then use Spark to create Paimon tables and import the data.

  1. Build the data source. Follow the steps in Build data using EMR Spark, but replace the ./dbgen -vf -s 100 command with ./dbgen -vf -s 10 to generate a 10 GB dataset.

  2. Start Spark SQL with the Paimon catalog configured to use DLF as the metastore:

    spark-sql --conf spark.sql.catalog.paimon=org.apache.paimon.spark.SparkCatalog --conf spark.sql.catalog.paimon.metastore=dlf
  3. Create a Paimon database in your OSS-HDFS bucket:

    CREATE DATABASE paimon_db location 'oss://${oss-hdfs-bucket}/tpch_10G/paimon_tpch_10g/';

    Replace ${oss-hdfs-bucket} with the name of your OSS-HDFS-enabled bucket.

  4. Switch to the new database and create Paimon tables by importing from the textfile source:

    USE paimon_db;
    
    CREATE TABLE nation_paimon
    TBLPROPERTIES ('primary-key' = 'N_NATIONKEY')
    AS SELECT * FROM ${source}.nation_textfile;
    
    CREATE TABLE region_paimon
    TBLPROPERTIES ('primary-key' = 'R_REGIONKEY')
    AS SELECT * FROM ${source}.region_textfile;
    
    CREATE TABLE supplier_paimon
    TBLPROPERTIES ('primary-key' = 'S_SUPPKEY')
    AS SELECT * FROM ${source}.supplier_textfile;
    
    CREATE TABLE customer_paimon
    PARTITIONED BY (c_mktsegment)
    TBLPROPERTIES ('primary-key' = 'C_CUSTKEY')
    AS SELECT * FROM ${source}.customer_textfile;
    
    CREATE TABLE part_paimon
    PARTITIONED BY (p_brand)
    TBLPROPERTIES ('primary-key' = 'P_PARTKEY')
    AS SELECT * FROM ${source}.part_textfile;
    
    CREATE TABLE partsupp_paimon
    TBLPROPERTIES ('primary-key' = 'PS_PARTKEY,PS_SUPPKEY')
    AS SELECT * FROM ${source}.partsupp_textfile;
    
    CREATE TABLE orders_paimon
    PARTITIONED BY (o_orderdate)
    TBLPROPERTIES ('primary-key' = 'O_ORDERKEY')
    AS SELECT * FROM ${source}.orders_textfile;
    
    CREATE TABLE lineitem_paimon
    PARTITIONED BY (l_shipdate)
    TBLPROPERTIES ('primary-key' = 'L_ORDERKEY,L_LINENUMBER')
    AS SELECT * FROM ${source}.lineitem_textfile;

    Replace ${source} with the name of the Hive database where the *_textfile tables are located.

Step 3: Connect Hologres to the Paimon data lake

Create a foreign server in your Hologres shared cluster. The server uses dlf_fdw (DLF foreign data wrapper) to authenticate with DLF and locate data in OSS.

CREATE SERVER IF NOT EXISTS dlf_server FOREIGN DATA WRAPPER dlf_fdw OPTIONS
(
    dlf_catalog 'paimon_catalog',                           -- Name of your custom DLF data catalog.
                                                             -- Omit this line if you use the default catalog.
    dlf_endpoint 'dlf-share.cn-shanghai.aliyuncs.com',     -- DLF service endpoint for your region.
    oss_endpoint 'cn-shanghai.oss-dls.aliyuncs.com'        -- OSS-HDFS endpoint for your region.
);

Replace the endpoint values with those for your region. The dlf_endpoint and oss_endpoint values must correspond to the same region as your EMR cluster and OSS bucket.

Step 4: Create foreign tables and query Paimon data

  1. Import the Paimon table schema into Hologres as a foreign table:

    IMPORT FOREIGN SCHEMA paimon_db LIMIT TO
    (
      lineitem_paimon
    )
    FROM SERVER dlf_server INTO public OPTIONS (if_table_exist 'update');

    This creates a foreign table in Hologres that maps to lineitem_paimon in your Paimon data lake. No data is copied — queries read directly from OSS.

  2. Run a query against the foreign table. The following example is TPC-H Q1:

    SELECT
      l_returnflag,
      l_linestatus,
      SUM(l_quantity)                                       AS sum_qty,
      SUM(l_extendedprice)                                  AS sum_base_price,
      SUM(l_extendedprice * (1 - l_discount))               AS sum_disc_price,
      SUM(l_extendedprice * (1 - l_discount) * (1 + l_tax)) AS sum_charge,
      AVG(l_quantity)                                       AS avg_qty,
      AVG(l_extendedprice)                                  AS avg_price,
      AVG(l_discount)                                       AS avg_disc,
      COUNT(*)                                              AS count_order
    FROM lineitem_paimon
    WHERE l_shipdate <= date '1998-12-01' - interval '120' DAY
    GROUP BY l_returnflag, l_linestatus
    ORDER BY l_returnflag, l_linestatus;

    A successful query returns rows grouped by l_returnflag and l_linestatus, with aggregated quantity, price, discount, and order count columns. If you see results, your foreign table is correctly mapped to the Paimon data in OSS. For the remaining 21 TPC-H queries, see 22 TPC-H query statements.

Summary

You have set up an end-to-end serverless lakehouse on Alibaba Cloud. Along the way, you:

  • Created an EMR data lake cluster with Paimon and DLF unified metadata. For more information about cluster configuration, see Create a cluster.

  • Built Paimon tables from TPC-H textfile data using Spark SQL.

  • Connected Hologres to the Paimon data lake via a DLF foreign server and dlf_fdw.

  • Ran analytical queries against Paimon foreign tables directly in OSS — no data movement required.

What's next

  • To import OSS data into Hologres internal tables for higher-performance queries, use a dedicated instance with the same connection procedure.

  • To explore more Paimon integration options, see Apache Paimon.

  • To learn about shared cluster pricing, see Overview of shared clusters.