
MaxCompute:Apache Paimon external tables

Last Updated:Mar 26, 2024

MaxCompute allows you to create an Apache Paimon external table and establish a mapping between the external table and the directory of the Apache Paimon table that is stored in Object Storage Service (OSS). You can use the Apache Paimon external table to access data in the Apache Paimon table that is stored in OSS. This topic describes how to create an Apache Paimon external table and access the Apache Paimon external table by using MaxCompute.

Background information

Apache Paimon is a lake storage format that unifies streaming and batch processing and supports high-throughput writes and low-latency queries. Compute engines provided by Alibaba Cloud Realtime Compute for Apache Flink and E-MapReduce, such as Flink, Spark, Hive, and Trino, are seamlessly integrated with Apache Paimon. Apache Paimon helps you quickly build your own data lake storage service on OSS and connect the service to MaxCompute to implement data lake analytics. For more information about Apache Paimon, see Apache Paimon.

Prerequisites

  • The Alibaba Cloud account that you use to perform operations has the CreateTable permission to create MaxCompute tables. For more information about table permissions, see MaxCompute permissions.

  • A MaxCompute project is created. For more information, see Create a MaxCompute project.

  • OSS is activated. A bucket and a file directory are created. For more information, see Create a bucket.

    Note

    MaxCompute is deployed only in specific regions. To prevent cross-region data connectivity issues, we recommend that you use a bucket that resides in the same region as your MaxCompute project.

  • Fully managed Flink is activated. For more information, see Activate Realtime Compute for Apache Flink.

Precautions

  • MaxCompute can only read data from Apache Paimon external tables. To write data to an Apache Paimon table, use Alibaba Cloud Realtime Compute for Apache Flink or E-MapReduce.

  • Apache Paimon does not support MaxCompute projects for which the schema feature is enabled.

  • Apache Paimon external tables do not support the clustering attribute.

Syntax

The following sample code shows the syntax for creating an Apache Paimon external table in MaxCompute.

create external table [if not exists] <mc_oss_extable_name>
(
<col_name> <data_type>,
...
)
[partitioned by (<col_name> <data_type>, ...)]
stored by 'org.apache.paimon.hive.PaimonStorageHandler'
with serdeproperties (
    'odps.properties.rolearn'='acs:ram::xxxxxxxxxxxxx:role/aliyunodpsdefaultrole'
)
location '<oss_location>'
using 'paimon_maxcompute_connector.jar';

For more information about the syntax and the parameters in the syntax, see Create an OSS external table.
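To illustrate the optional partitioned by clause, the following hedged sketch maps a Paimon table that is partitioned by dt. The table name, columns, account ID, and OSS path are hypothetical placeholders, not values from this topic:

    create external table if not exists mc_extable_paimon_pt
    (
        id BIGINT,
        data STRING
    )
    partitioned by (dt STRING)
    stored by 'org.apache.paimon.hive.PaimonStorageHandler'
    with serdeproperties (
        'odps.properties.rolearn'='acs:ram::<your_account_id>:role/aliyunodpsdefaultrole'
    )
    location '<oss_location_of_the_partitioned_paimon_table>'
    using 'paimon_maxcompute_connector.jar';

Note that a partitioned external table does not automatically discover partitions; you can import the partitions that already exist in the OSS directory with a statement such as msck repair table mc_extable_paimon_pt add partitions;.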

Procedure

Step 1: Prepare data in Realtime Compute for Apache Flink

Create an Apache Paimon catalog and an Apache Paimon table, and insert data into the table. This section describes the procedure. If an Apache Paimon table and data are available in Realtime Compute for Apache Flink, skip this step.

  1. Log on to the Realtime Compute for Apache Flink console and create an Apache Paimon catalog. For more information, see Create an Apache Paimon catalog.

  2. Create an Apache Paimon table. For more information, see Use an Apache Paimon catalog.

    1. In the left-side navigation pane of the console of fully managed Flink, click Catalogs. In the Catalogs pane, find the Apache Paimon catalog that you created and click the default database. On the page that appears, click Create Table.


    2. On the Built-in tab of the Create Table dialog box, click Apache Paimon and click Next. In the Create Table step, enter the following table creation statement and click Confirm. In this example, an Apache Paimon table named test_tbl is created.

      CREATE TABLE `catalogname`.`default`.test_tbl (
          dt STRING,
          id BIGINT,
          data STRING,
          PRIMARY KEY (dt, id) NOT ENFORCED
      ) PARTITIONED BY (dt);
    3. On the SQL Editor page, create an SQL draft that contains the following statement, deploy the draft, and run the deployment for the draft. For more information about how to develop and run an SQL draft, see Develop an SQL draft.

      INSERT INTO `catalogname`.`default`.test_tbl VALUES ('2023-04-21', 1, 'AAA'), ('2023-04-21', 2, 'BBB'), ('2023-04-22', 1, 'CCC'), ('2023-04-22', 2, 'DDD');
      Note
      • Make sure that the engine version of the SQL draft is vvr-8.0.1-flink-1.17 or later.

      • If the SQL deployment is a bounded streaming deployment, you must go to the Deployments page, click Edit in the upper-right corner of the Parameters section of the Configuration tab, and then add execution.checkpointing.checkpoints-after-tasks-finish.enabled: true to the Other Configuration field. For example, if the INSERT INTO ... VALUES ... statement is included in an SQL deployment, the deployment is a bounded streaming deployment. For more information about how to configure the runtime parameters of a deployment, see Configure a deployment.
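The Step 1 workflow can also be expressed end to end in Flink SQL. The following sketch assumes a filesystem-metastore Apache Paimon catalog whose warehouse is an OSS path; the catalog name, bucket, and warehouse directory are hypothetical placeholders:

    -- Create an Apache Paimon catalog backed by OSS (names and paths are placeholders).
    CREATE CATALOG `catalogname` WITH (
        'type' = 'paimon',
        'metastore' = 'filesystem',
        'warehouse' = 'oss://<bucket>/<warehouse_dir>'
    );

    -- Create the partitioned table used in this topic.
    CREATE TABLE `catalogname`.`default`.test_tbl (
        dt STRING,
        id BIGINT,
        data STRING,
        PRIMARY KEY (dt, id) NOT ENFORCED
    ) PARTITIONED BY (dt);

    -- Insert the sample rows.
    INSERT INTO `catalogname`.`default`.test_tbl VALUES
        ('2023-04-21', 1, 'AAA'), ('2023-04-21', 2, 'BBB'),
        ('2023-04-22', 1, 'CCC'), ('2023-04-22', 2, 'DDD');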

Step 2: Upload the Apache Paimon plug-in to your MaxCompute project

You can use one of the following methods to upload the Apache Paimon plug-in to the MaxCompute project.

Use the MaxCompute client (odpscmd)

Access the MaxCompute project on the MaxCompute client (odpscmd) and run the following code to upload the paimon_maxcompute_connector.jar package to the MaxCompute project:

ADD JAR <path_to_paimon_maxcompute_connector.jar>;
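After you add the JAR package, you can optionally confirm that it is registered in the project. LIST RESOURCES is a standard MaxCompute command; the package name shown assumes the file name used in this topic:

    -- Show the resources registered in the current MaxCompute project.
    LIST RESOURCES;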

Use the DataWorks console

  1. Log on to the DataWorks console. In the left-side navigation pane, click Workspaces. On the Workspaces page, find the desired workspace and choose Shortcuts > Data Development in the Actions column.

  2. On the DataStudio page, click Create and choose Create Resource > JAR.

  3. In the Create Resource dialog box, configure the parameters, upload the paimon_maxcompute_connector.jar package, and then click Create. For more information about how to create a resource, see Step 1: Create a resource or upload an existing resource.


  4. After the resource is created, click the commit icon on the toolbar of the resource configuration tab to commit the resource to the development environment.

Step 3: Create an Apache Paimon external table on the MaxCompute client (odpscmd) or by using a tool that can execute MaxCompute SQL statements

Create an Apache Paimon external table on the MaxCompute client (odpscmd) or by using a tool that can execute MaxCompute SQL statements. In this example, an Apache Paimon external table named oss_extable_paimon_1 is created.

create external table oss_extable_paimon_1
(
    dt STRING,
    id BIGINT,
    data STRING
)
stored by 'org.apache.paimon.hive.PaimonStorageHandler'
with serdeproperties (
    'odps.properties.rolearn'='acs:ram::124895376754****:role/aliyunodpsdefaultrole'
)
location 'oss://oss-cn-beijing-internal.aliyuncs.com/paimon_flink/test_db_y.db/test_tbl_nopt/'
using 'paimon_maxcompute_connector.jar'
;
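Optionally, you can verify the mapping before querying the table. desc extended is a standard MaxCompute command, and the table name matches the example above:

    -- Show the schema and extended properties (storage handler, location) of the external table.
    desc extended oss_extable_paimon_1;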

Step 4: Read data from the Apache Paimon external table on the MaxCompute client (odpscmd) or by using a tool that can execute MaxCompute SQL statements

  1. Run the following commands on the MaxCompute client (odpscmd) or by using a tool that can execute MaxCompute SQL statements to enable Hive compatibility, which is required before you can read data from the Apache Paimon external table:

    SET odps.sql.common.table.planner.ext.hive.bridge = true;
    SET odps.sql.hive.compatible = true;
  2. Run the following command to query data from the Apache Paimon external table oss_extable_paimon_1:

    SELECT * FROM oss_extable_paimon_1;

    The following result is returned:

    +------------+------------+------------+
    | id         | data       | dt         |
    +------------+------------+------------+
    | 1          | AAA        | 2023-04-21 |
    | 2          | BBB        | 2023-04-21 |
    | 1          | CCC        | 2023-04-22 |
    | 2          | DDD        | 2023-04-22 |
    +------------+------------+------------+
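As with any MaxCompute table, you can also apply filters in the query. The following hedged example assumes the same session settings and sample data as above:

    -- Read only the rows whose dt value is 2023-04-21.
    SELECT id, data FROM oss_extable_paimon_1 WHERE dt = '2023-04-21';

With the sample data in this topic, this query returns the two rows inserted for 2023-04-21.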