Run on-demand SQL queries against Apache Paimon data in Object Storage Service (OSS) using a Hologres shared cluster — no reserved resources required. You pay only for the data scanned by each query.
What you'll learn
In this tutorial, you'll learn how to:
-
Create an E-MapReduce (EMR) data lake cluster with Paimon and Data Lake Formation (DLF) unified metadata
-
Build a Paimon data source from TPC-H textfile data using Apache Spark
-
Connect Hologres to the Paimon data lake via a DLF foreign server
-
Run analytical queries against Paimon foreign tables in Hologres
Background
Apache Paimon is a unified lake storage format for streaming and batch processing. It supports high-throughput writes and low-latency queries, and integrates with Alibaba Cloud compute engines including Flink, Hologres, MaxCompute, and EMR Spark. For more information, see Apache Paimon.
Hologres shared clusters provide a serverless online query acceleration service designed for foreign tables in MaxCompute and OSS data lakes. Built on a cloud-native, storage-compute separation architecture, shared clusters let you query lake data on demand and pay based on the amount of data each SQL statement scans. For more information, see Overview of shared clusters.
Architecture
Choose a cluster type
Hologres offers two cluster types for querying Paimon data. Choose based on whether you need write access.
| Shared cluster | Dedicated instance | |
|---|---|---|
| Read Paimon data from OSS | Yes | Yes |
| Import OSS data into Hologres internal tables | No | Yes |
| Billing | Per data scanned | Reserved resources |
| Used in this tutorial | Yes (example) | Same procedure |
Dedicated instances use the same procedures described in this tutorial to read Paimon data.
Limitations
-
Hologres V2.1.6 or later is required to query Paimon data lakes.
-
Shared clusters support read-only access to OSS data lakes. To import OSS data into Hologres internal tables, use a dedicated instance instead.
Prerequisites
Before you begin, make sure you have:
-
A Hologres shared cluster instance. See Purchase a Hologres instance.
-
Data Lake Formation (DLF) activated. See DLF Quick Start.
-
(Optional) A custom DLF data catalog, if you want to use a non-default catalog when creating the foreign server. See Create a data catalog.
-
OSS activated for data lake storage. See the OSS console.
-
The OSS-HDFS service enabled on your OSS bucket. See Enable the OSS-HDFS service.
Set up a Paimon lakehouse and query it from Hologres
This tutorial uses 10 GB of TPC-H data. The end-to-end workflow has four stages: create an EMR cluster, build the Paimon data source, connect Hologres, and run queries.
Step 1: Create an EMR data lake cluster
-
Log on to the EMR on ECS console and create an EMR cluster. For general cluster creation steps, see Create a cluster.
-
Set the following key parameters:
Parameter Value Business scenario Select Data Lake Optional services Select Spark, Hive, and Paimon. Add other services as needed. Metadata Select DLF Unified Metadata DLF data catalog Select a custom catalog (for example, paimon_catalog) if you created one, or use the default catalog. If you use the default catalog, you can omit thedlf_catalogparameter when creating the foreign server later.Cluster storage root path Select an OSS bucket with the OSS-HDFS service enabled
Step 2: Build the Paimon data source
Build a textfile data source using EMR Hive, then use Spark to create Paimon tables and import the data.
-
Build the data source. Follow the steps in Build data using EMR Spark, but replace the
./dbgen -vf -s 100command with./dbgen -vf -s 10to generate a 10 GB dataset. -
Start Spark SQL with the Paimon catalog configured to use DLF as the metastore:
spark-sql --conf spark.sql.catalog.paimon=org.apache.paimon.spark.SparkCatalog --conf spark.sql.catalog.paimon.metastore=dlf -
Create a Paimon database in your OSS-HDFS bucket:
CREATE DATABASE paimon_db location 'oss://${oss-hdfs-bucket}/tpch_10G/paimon_tpch_10g/';Replace
${oss-hdfs-bucket}with the name of your OSS-HDFS-enabled bucket. -
Switch to the new database and create Paimon tables by importing from the textfile source:
USE paimon_db; CREATE TABLE nation_paimon TBLPROPERTIES ('primary-key' = 'N_NATIONKEY') AS SELECT * FROM ${source}.nation_textfile; CREATE TABLE region_paimon TBLPROPERTIES ('primary-key' = 'R_REGIONKEY') AS SELECT * FROM ${source}.region_textfile; CREATE TABLE supplier_paimon TBLPROPERTIES ('primary-key' = 'S_SUPPKEY') AS SELECT * FROM ${source}.supplier_textfile; CREATE TABLE customer_paimon PARTITIONED BY (c_mktsegment) TBLPROPERTIES ('primary-key' = 'C_CUSTKEY') AS SELECT * FROM ${source}.customer_textfile; CREATE TABLE part_paimon PARTITIONED BY (p_brand) TBLPROPERTIES ('primary-key' = 'P_PARTKEY') AS SELECT * FROM ${source}.part_textfile; CREATE TABLE partsupp_paimon TBLPROPERTIES ('primary-key' = 'PS_PARTKEY,PS_SUPPKEY') AS SELECT * FROM ${source}.partsupp_textfile; CREATE TABLE orders_paimon PARTITIONED BY (o_orderdate) TBLPROPERTIES ('primary-key' = 'O_ORDERKEY') AS SELECT * FROM ${source}.orders_textfile; CREATE TABLE lineitem_paimon PARTITIONED BY (l_shipdate) TBLPROPERTIES ('primary-key' = 'L_ORDERKEY,L_LINENUMBER') AS SELECT * FROM ${source}.lineitem_textfile;Replace
${source}with the name of the Hive database where the*_textfiletables are located.
Step 3: Connect Hologres to the Paimon data lake
Create a foreign server in your Hologres shared cluster. The server uses dlf_fdw (DLF foreign data wrapper) to authenticate with DLF and locate data in OSS.
CREATE SERVER IF NOT EXISTS dlf_server FOREIGN DATA WRAPPER dlf_fdw OPTIONS
(
dlf_catalog 'paimon_catalog', -- Name of your custom DLF data catalog.
-- Omit this line if you use the default catalog.
dlf_endpoint 'dlf-share.cn-shanghai.aliyuncs.com', -- DLF service endpoint for your region.
oss_endpoint 'cn-shanghai.oss-dls.aliyuncs.com' -- OSS-HDFS endpoint for your region.
);
Replace the endpoint values with those for your region. The dlf_endpoint and oss_endpoint values must correspond to the same region as your EMR cluster and OSS bucket.
Step 4: Create foreign tables and query Paimon data
-
Import the Paimon table schema into Hologres as a foreign table:
IMPORT FOREIGN SCHEMA paimon_db LIMIT TO ( lineitem_paimon ) FROM SERVER dlf_server INTO public OPTIONS (if_table_exist 'update');This creates a foreign table in Hologres that maps to
lineitem_paimonin your Paimon data lake. No data is copied — queries read directly from OSS. -
Run a query against the foreign table. The following example is TPC-H Q1:
SELECT l_returnflag, l_linestatus, SUM(l_quantity) AS sum_qty, SUM(l_extendedprice) AS sum_base_price, SUM(l_extendedprice * (1 - l_discount)) AS sum_disc_price, SUM(l_extendedprice * (1 - l_discount) * (1 + l_tax)) AS sum_charge, AVG(l_quantity) AS avg_qty, AVG(l_extendedprice) AS avg_price, AVG(l_discount) AS avg_disc, COUNT(*) AS count_order FROM lineitem_paimon WHERE l_shipdate <= date '1998-12-01' - interval '120' DAY GROUP BY l_returnflag, l_linestatus ORDER BY l_returnflag, l_linestatus;A successful query returns rows grouped by
l_returnflagandl_linestatus, with aggregated quantity, price, discount, and order count columns. If you see results, your foreign table is correctly mapped to the Paimon data in OSS. For the remaining 21 TPC-H queries, see 22 TPC-H query statements.
Summary
You have set up an end-to-end serverless lakehouse on Alibaba Cloud. Along the way, you:
-
Created an EMR data lake cluster with Paimon and DLF unified metadata. For more information about cluster configuration, see Create a cluster.
-
Built Paimon tables from TPC-H textfile data using Spark SQL.
-
Connected Hologres to the Paimon data lake via a DLF foreign server and
dlf_fdw. -
Ran analytical queries against Paimon foreign tables directly in OSS — no data movement required.
What's next
-
To import OSS data into Hologres internal tables for higher-performance queries, use a dedicated instance with the same connection procedure.
-
To explore more Paimon integration options, see Apache Paimon.
-
To learn about shared cluster pricing, see Overview of shared clusters.