
Paimon-based Hologres serverless data lake solution

Last Updated: Dec 03, 2025

This topic describes the Paimon-based Hologres serverless data lake solution, which accelerates queries on Paimon data stored in Object Storage Service (OSS). Resources are used on demand and billed based on usage, so you do not need to reserve any resources. The solution provides a flexible and scalable data lake architecture that helps you better manage and use data, which supports data-driven decision-making and business innovation.

Background information

Apache Paimon is a unified lake storage format for both streaming and batch processing. It supports high-throughput writes and low-latency queries. Common compute engines on Alibaba Cloud's big data platform, such as Flink, Hologres, MaxCompute, and Spark on EMR, are well integrated with Paimon. You can use Apache Paimon to quickly build your own data lake storage on OSS and connect it to these compute engines for data lake analytics. For more information, see Apache Paimon.

Hologres shared clusters are a serverless online query acceleration service designed for foreign tables in MaxCompute and OSS data lakes. Based on the cloud-native architecture of Hologres that separates storage and computing, shared clusters accelerate the analysis of lake data in OSS. You can use resources on demand and pay based on the amount of data scanned by SQL statements. For more information, see Overview of shared clusters.

Architecture

[Architecture diagram: EMR Spark writes Paimon tables to an OSS-HDFS bucket with metadata managed in DLF, and a Hologres shared cluster queries the Paimon data through foreign tables.]

Prerequisites

  • A Hologres shared cluster instance has been purchased. For more information, see Overview of shared clusters.

Note

Hologres dedicated instances also support reading data from Paimon and other lake formats. The procedure is the same as for shared cluster instances. This topic uses a shared cluster instance as an example.

Usage notes

  • Only Hologres V2.1.6 and later support querying data from Paimon data lakes.

  • Hologres shared clusters support only reading data from OSS data lakes. They cannot import data from OSS. To import data from OSS into Hologres internal tables, you must use a dedicated instance, as shown in the sketch after this list.
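
    The following sketch shows how such an import could look on a dedicated instance. It assumes that a foreign table named lineitem_paimon has already been created on that instance (in the same way as in the procedure below); the internal table name lineitem_holo and its reduced column list are hypothetical placeholders.

    -- Runs on a Hologres dedicated instance only; shared clusters cannot execute this import.
    -- lineitem_holo and the column subset are illustrative; adapt them to your schema.
    CREATE TABLE lineitem_holo (
        l_orderkey   BIGINT,
        l_linenumber INT,
        l_quantity   DECIMAL(15, 2),
        l_shipdate   DATE
    );

    INSERT INTO lineitem_holo
    SELECT l_orderkey, l_linenumber, l_quantity, l_shipdate
    FROM lineitem_paimon;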

Procedure

  1. Purchase an Elastic MapReduce (EMR) data lake instance.

    Log on to the EMR on ECS console and create an EMR cluster. For more information, see Create a cluster. Key parameters are described below:

    • Business Scenario: Select Data Lake.

    • Optional Services: Spark, Hive, and Paimon are required. Select other services as needed.

    • Metadata: Select DLF Unified Metadata.

    • DLF Catalog:

      • To use a custom DLF data catalog, select the created catalog, for example, paimon_catalog. For more information, see Create a data catalog.

      • You can also select the default catalog. If you use the DLF default catalog, you do not need to specify the dlf_catalog parameter when you create a foreign server in the Hologres shared cluster.

    • Root Storage Directory of Cluster: Select the path of a bucket for which the OSS-HDFS service is enabled.

  2. Build the data source.

    This example uses 10 GB of TPC-H data. Use EMR Hive to build a data source in the textfile format. For more information about how to build the data source, see Build data using EMR Spark.

    Important

    When you generate the data, replace the ./dbgen -vf -s 100 command in the referenced topic with ./dbgen -vf -s 10 so that 10 GB of data is generated instead of 100 GB.

  3. Use Spark to create a Paimon table.

    1. Start the Spark SQL CLI with the Paimon catalog configured.

      spark-sql --conf spark.sql.catalog.paimon=org.apache.paimon.spark.SparkCatalog --conf spark.sql.catalog.paimon.metastore=dlf
    2. Create a database.

      -- Create a database.
      CREATE DATABASE paimon_db LOCATION 'oss://${oss-hdfs-bucket}/tpch_10G/paimon_tpch_10g/';

      ${oss-hdfs-bucket}: the name of the bucket for which the OSS-HDFS service is enabled.
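
      Optionally, you can verify that the database points to the intended OSS-HDFS path before you create tables. The following check uses a standard Spark SQL statement:

      -- Optional check: the Location field in the output should show the OSS-HDFS path specified above.
      DESC DATABASE paimon_db;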

    3. Create the Paimon tables and import the textfile data that you prepared in the Build the data source step.

      -- Switch to the database that you just created.
      USE paimon_db;
      
      -- Create a table and import data.
      
      CREATE TABLE nation_paimon 
      TBLPROPERTIES (
          'primary-key' = 'N_NATIONKEY'
      )
      AS SELECT * FROM ${source}.nation_textfile;
      
      CREATE TABLE region_paimon
      TBLPROPERTIES (
          'primary-key' = 'R_REGIONKEY'
      )
      AS SELECT * FROM ${source}.region_textfile;
      
      CREATE TABLE supplier_paimon
      TBLPROPERTIES (
          'primary-key' = 'S_SUPPKEY'
      )
      AS SELECT * FROM ${source}.supplier_textfile;
      
      CREATE TABLE customer_paimon
      PARTITIONED BY (c_mktsegment)
      TBLPROPERTIES (
          'primary-key' = 'C_CUSTKEY'
      )
      AS SELECT * FROM ${source}.customer_textfile;
      
      CREATE TABLE part_paimon
      PARTITIONED BY (p_brand)
      TBLPROPERTIES (
          'primary-key' = 'P_PARTKEY'
      )
      AS SELECT * FROM ${source}.part_textfile;
      
      CREATE TABLE partsupp_paimon
      TBLPROPERTIES (
          'primary-key' = 'PS_PARTKEY,PS_SUPPKEY'
      )
      AS SELECT * FROM ${source}.partsupp_textfile;
      
      CREATE TABLE orders_paimon
      PARTITIONED BY (o_orderdate)
      TBLPROPERTIES (
          'primary-key' = 'O_ORDERKEY'
      )
      AS SELECT * FROM ${source}.orders_textfile;
      
      CREATE TABLE lineitem_paimon
      PARTITIONED BY (l_shipdate)
      TBLPROPERTIES (
          'primary-key' = 'L_ORDERKEY,L_LINENUMBER'
      )
      AS SELECT * FROM ${source}.lineitem_textfile;

      ${source}: the name of the Hive database that contains the *_textfile tables.
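
      After the statements finish, you can optionally confirm that the Paimon tables exist and contain data. The following check uses standard Spark SQL statements; the row count in the comment is an approximation for the 10 GB (scale factor 10) TPC-H dataset.

      -- Optional check: list the Paimon tables and spot-check a row count.
      SHOW TABLES;
      SELECT COUNT(*) FROM lineitem_paimon;  -- approximately 60 million rows at scale factor 10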

  4. Create a foreign server in the Hologres shared cluster.

    Note

    When you create the EMR data lake instance:

    • If you select a custom data catalog for DLF Catalog, set the dlf_catalog parameter to the name of the custom catalog.

    • If you select the default catalog for DLF Catalog, you do not need to configure the dlf_catalog parameter. You can delete the parameter.

    The dlf_endpoint and oss_endpoint values in the following example are for the China (Shanghai) region. Replace them with the endpoints of the region where your resources reside.

    -- Create a foreign server.
    
    CREATE SERVER IF NOT EXISTS dlf_server FOREIGN DATA WRAPPER dlf_fdw OPTIONS
    (
        dlf_catalog 'paimon_catalog',
        dlf_endpoint 'dlf-share.cn-shanghai.aliyuncs.com',
        oss_endpoint 'cn-shanghai.oss-dls.aliyuncs.com'
    );
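
    If the cluster uses the DLF default catalog, you can run the same statement without the dlf_catalog option, as described in the preceding note:

    -- Variant for the DLF default catalog: the dlf_catalog option is omitted.
    CREATE SERVER IF NOT EXISTS dlf_server FOREIGN DATA WRAPPER dlf_fdw OPTIONS
    (
        dlf_endpoint 'dlf-share.cn-shanghai.aliyuncs.com',
        oss_endpoint 'cn-shanghai.oss-dls.aliyuncs.com'
    );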
  5. Create a foreign table for the Paimon table in the Hologres shared cluster.

    IMPORT FOREIGN SCHEMA paimon_db LIMIT TO
    (
      lineitem_paimon
    )
    FROM SERVER dlf_server INTO public OPTIONS (if_table_exist 'update');
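
    To run the full TPC-H workload instead of only Q1, you can import all eight Paimon tables in one statement. The syntax is the same as above, and the table names match the tables created in the Spark step:

    -- Optional: import all eight TPC-H Paimon tables at once.
    IMPORT FOREIGN SCHEMA paimon_db LIMIT TO
    (
      nation_paimon,
      region_paimon,
      supplier_paimon,
      customer_paimon,
      part_paimon,
      partsupp_paimon,
      orders_paimon,
      lineitem_paimon
    )
    FROM SERVER dlf_server INTO public OPTIONS (if_table_exist 'update');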
  6. Query data.

    The following SQL statement is an example of Query 1 (Q1):

    SELECT
      l_returnflag,
      l_linestatus,
      SUM(l_quantity) AS sum_qty,
      SUM(l_extendedprice) AS sum_base_price,
      SUM(l_extendedprice * (1 - l_discount)) AS sum_disc_price,
      SUM(l_extendedprice * (1 - l_discount) * (1 + l_tax)) AS sum_charge,
      AVG(l_quantity) AS avg_qty,
      AVG(l_extendedprice) AS avg_price,
      AVG(l_discount) AS avg_disc,
      COUNT(*) AS count_order
    FROM
      lineitem_paimon
    WHERE
      l_shipdate <= date '1998-12-01' - interval '120' DAY
    GROUP BY
      l_returnflag,
      l_linestatus
    ORDER BY
      l_returnflag,
      l_linestatus;
    Note

    For the other 21 SQL statements, see 22 TPC-H query statements.
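
    As a further example that requires only the imported lineitem_paimon foreign table, you can also run TPC-H Query 6 (Q6). The following statement follows the standard TPC-H definition, with the table name adapted to the Paimon foreign table:

    -- TPC-H Q6: only the lineitem table is required.
    SELECT
      SUM(l_extendedprice * l_discount) AS revenue
    FROM
      lineitem_paimon
    WHERE
      l_shipdate >= DATE '1994-01-01'
      AND l_shipdate < DATE '1994-01-01' + INTERVAL '1' YEAR
      AND l_discount BETWEEN 0.05 AND 0.07
      AND l_quantity < 24;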