MaxCompute lets you create Paimon external tables to map to Paimon table folders in Object Storage Service (OSS) and access the data within these folders. This topic describes how to create a Paimon external table and access its data using MaxCompute.
Background information
Apache Paimon is a lake storage format for both streaming and batch processing. It provides high-throughput writes and low-latency queries. Common compute engines, such as Spark, Hive, and Trino, are fully integrated with Paimon, and Paimon is available in services such as Realtime Compute for Apache Flink and E-MapReduce. With Apache Paimon, you can quickly build your own data lake storage service in OSS and connect it to MaxCompute for data lake analytics. MaxCompute can use metadata filtering to avoid reading unnecessary files from OSS folders during task processing.
Prerequisites
Ensure your Alibaba Cloud account, RAM user, or RAM role has the required permissions to access OSS external tables. For more information about authorization, see STS authorization for OSS.
(Optional) Prepare an OSS bucket, directories, and data files. For more information, see Create buckets, Manage directories, and Simple upload.
MaxCompute can automatically create directories in OSS when you read from or write to an external table with a single SQL statement or through a UDF. Manual directory creation is no longer required, but the legacy method of manually creating directories is still supported.
Create a MaxCompute project. For more information, see Create a MaxCompute project.
MaxCompute is deployed only in specific regions. To prevent cross-region data connectivity issues, we recommend that you use a bucket in the same region as your MaxCompute project.
The Alibaba Cloud account or RAM user has the CreateTable permission on your project. For more information about table operation permissions, see MaxCompute permissions.
Limits
MaxCompute currently supports only read operations for Paimon external tables. Write operations and automatic schema evolution for Paimon tables are not supported.
Paimon external tables do not support the `cluster` property.
Paimon external tables do not support features such as querying historical data versions.
Usage notes
When the schema in the Paimon file does not match the external table schema:
Mismatched column count: If the Paimon file contains fewer columns than the external table schema, the missing columns are filled with NULL values when the data is read. If the Paimon file contains more columns, the extra column data is discarded.
Mismatched column types: MaxCompute does not allow you to use the `INT` type to read `STRING` data from a Paimon file. You can use the `STRING` type to read `INT` data, but this practice is not recommended.
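As an illustration of the column-count rule, the following sketch assumes Paimon files that contain only the columns (id BIGINT, data STRING); the table name oss_extable_paimon_wide and the column extra_col are hypothetical:

```sql
-- Hypothetical sketch: the Paimon files contain only (id BIGINT, data STRING).
CREATE EXTERNAL TABLE oss_extable_paimon_wide
(
    id        BIGINT,
    data      STRING,
    extra_col STRING  -- absent from the Paimon files; read as NULL
)
STORED BY 'org.apache.paimon.hive.PaimonStorageHandler'
WITH serdeproperties (
    'odps.properties.rolearn'='acs:ram::<uid>:role/aliyunodpsdefaultrole'
)
LOCATION '<oss_location>';

-- id and data are read normally; every row returns NULL for extra_col.
SELECT id, data, extra_col FROM oss_extable_paimon_wide;
```

Conversely, if the Paimon files contained a column that this table does not declare, that column's data would simply be discarded on read.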
Supported data types
In the following table, a check mark indicates support and a cross indicates no support. For more information about MaxCompute data types, see Data types (Version 1.0) and Data types (Version 2.0).

| Data type | Supported |
| --- | --- |
| TINYINT | |
| SMALLINT | |
| INT | |
| BIGINT | |
| BINARY | |
| FLOAT | |
| DOUBLE | |
| DECIMAL(precision,scale) | |
| VARCHAR(n) | |
| CHAR(n) | |
| STRING | |
| DATE | |
| DATETIME | |
| TIMESTAMP | |
| TIMESTAMP_NTZ | |
| BOOLEAN | |
| ARRAY | |
| MAP | |
| STRUCT | |
| JSON | |
Create an external table
Syntax
For more information about the syntax for external tables of different formats, see OSS external tables.
CREATE EXTERNAL TABLE [IF NOT EXISTS] <mc_oss_extable_name>
(
    <col_name> <data_type>,
    ...
)
[COMMENT <table_comment>]
[PARTITIONED BY (<col_name> <data_type>, ...)]
STORED BY 'org.apache.paimon.hive.PaimonStorageHandler'
WITH serdeproperties (
    'odps.properties.rolearn'='acs:ram::<uid>:role/aliyunodpsdefaultrole'
)
LOCATION '<oss_location>';

Common parameters
For information about common parameters, see Basic syntax parameters.
Write data
For more information about the write syntax in MaxCompute, see Write syntax.
Query analysis
For more information about the SELECT syntax, see Query syntax.
For more information about optimizing query plans, see Query optimization.
Example
Step 1: Prepare data in Flink
Create a Paimon catalog and a Paimon table, and then insert data into the table. The following example shows how.
If you already have Paimon table data in OSS, you can skip this step.
Log on to the Realtime Compute for Apache Flink console and create a Paimon Filesystem catalog. For this example, the catalog is named paimoncatalog. For more information, see Create an Apache Paimon Filesystem catalog.
Create a Paimon table. For more information, see Manage Apache Paimon tables.
In the text editor on the Data Query page, enter the following statement, select the code, and then click Run.

CREATE TABLE `paimoncatalog`.`default`.test_tbl (
    id BIGINT,
    data STRING,
    dt STRING,
    PRIMARY KEY (dt, id) NOT ENFORCED
) PARTITIONED BY (dt);

Create an SQL job that contains the following statement. Then, deploy and run the job. For more information, see Develop an SQL draft.
INSERT INTO `paimoncatalog`.`default`.test_tbl VALUES
    (1, 'CCC', '2024-07-18'),
    (2, 'DDD', '2024-07-18');

Important: Make sure that the engine version of the SQL job is vvr-8.0.1-flink-1.17 or later.
If the SQL job is a bounded stream job, such as a job that executes an INSERT INTO ... VALUES ... statement, perform the following steps:
1. On the O&M page, click the target job to go to its Deployment Details tab.
2. Edit the Runtime Parameter Configuration. In the Other Configurations section, set execution.checkpointing.checkpoints-after-tasks-finish.enabled: true, and then save the configuration. For more information about how to configure runtime parameters for a job, see Configure job deployment information.
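Outside the Realtime Compute console, for example in the open-source Flink SQL client, the same option can also be set in the SQL script itself; a minimal sketch, reusing the example job above:

```sql
-- Enable checkpoints after bounded tasks finish so that the final
-- Paimon commit is triggered for a bounded INSERT ... VALUES job.
SET 'execution.checkpointing.checkpoints-after-tasks-finish.enabled' = 'true';

INSERT INTO `paimoncatalog`.`default`.test_tbl VALUES
    (1, 'CCC', '2024-07-18'),
    (2, 'DDD', '2024-07-18');
```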
Step 2: Create a Paimon external table in MaxCompute
In MaxCompute, execute the following SQL statement to create a MaxCompute Paimon external table.
CREATE EXTERNAL TABLE oss_extable_paimon_pt
(
id BIGINT,
data STRING
)
PARTITIONED BY (dt STRING )
STORED BY 'org.apache.paimon.hive.PaimonStorageHandler'
WITH serdeproperties (
'odps.properties.rolearn'='acs:ram::<uid>:role/aliyunodpsdefaultrole'
)
LOCATION 'oss://oss-cn-hangzhou-internal.aliyuncs.com/<table_path>'
;
In the preceding statement, table_path is the path of the Paimon table that you created in Flink, for example, oss-mc-test/paimon/default.db/test_tbl. To obtain the path, perform the following steps:
Log on to the Realtime Compute for Apache Flink console and click the name of your workspace.
In the navigation pane on the left, click Data Management.
On the Metadata page, click **default** in the target catalog. On the default page, click View in the target table's Actions column.
In the Table Properties section on the Table Schema Details tab, obtain the value of the path parameter. Use only the part of the path that follows oss:// as the value of table_path.
Step 3: Import partition data
If the OSS external table that you created is a partitioned table, you must also import the partition data. For more information, see Syntax for adding partition data to an OSS external table.
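If you know the partition values in advance, you can also add partitions explicitly instead of scanning the OSS directory; a minimal sketch using the partition value from the example data:

```sql
-- Add a single known partition explicitly (alternative to MSCK REPAIR).
ALTER TABLE oss_extable_paimon_pt ADD IF NOT EXISTS PARTITION (dt='2024-07-18');
```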
MSCK REPAIR TABLE oss_extable_paimon_pt ADD PARTITIONS;

Step 4: Read the Paimon external table using MaxCompute
In MaxCompute, execute the following statement to query the MaxCompute Paimon external table `oss_extable_paimon_pt`.
SET odps.sql.common.table.planner.ext.hive.bridge = true;
SET odps.sql.hive.compatible = true;
SELECT * FROM oss_extable_paimon_pt WHERE dt='2024-07-18';

The following result is returned:
+------------+------------+------------+
| id | data | dt |
+------------+------------+------------+
| 1 | CCC | 2024-07-18 |
| 2 | DDD | 2024-07-18 |
+------------+------------+------------+

Related topics
You can also create a MaxCompute Paimon external table in Flink using a custom catalog. After data is written to the table, you can query and consume the Paimon data using MaxCompute. For more information, see Create an Apache Paimon external table by using Realtime Compute for Apache Flink.