
Realtime Compute for Apache Flink: Manage Apache Paimon catalogs

Last Updated: Jan 30, 2024

After you configure an Apache Paimon catalog, you can directly access Apache Paimon tables that are stored in Alibaba Cloud Object Storage Service (OSS) buckets in the console of fully managed Flink. This topic describes how to create, view, use, and delete an Apache Paimon catalog.

Important

A phased update is performed for fully managed Flink. If you cannot find the Run button or the Catalogs menu in the console of fully managed Flink, the phased update is not complete for your workspace. In this case, contact customer service or your sales representative to perform the phased update.

Background information

Apache Paimon is a unified lake storage format that supports data processing in both streaming and batch modes. Apache Paimon provides high-throughput data writing and low-latency data queries, and is compatible with common compute engines of Alibaba Cloud E-MapReduce (EMR), such as Flink, Spark, Hive, and Trino. You can use Apache Paimon to efficiently deploy your own data lake storage service on Hadoop Distributed File System (HDFS) or Alibaba Cloud OSS, and then connect the preceding compute engines to it to perform data lake analytics. For more information, see Apache Paimon.

Apache Paimon catalogs allow you to manage Apache Paimon tables that are stored in OSS buckets. Tables that you create in an Apache Paimon catalog can also be accessed by other compute engines.

Prerequisites

Alibaba Cloud OSS is activated.

Important

You can use the OSS bucket that you specified when you activated the Realtime Compute for Apache Flink service. However, to better isolate data and prevent accidental operations, we recommend that you create and use a dedicated OSS bucket that resides in the same region as Realtime Compute for Apache Flink.

Limits

  • Only Realtime Compute for Apache Flink whose engine version is vvr-6.0.6-flink-1.15 or later supports Apache Paimon catalogs.

  • The OSS bucket that is used by an Apache Paimon catalog must reside in the same region as Realtime Compute for Apache Flink. The Alibaba Cloud account that is used to activate the Realtime Compute for Apache Flink service must have read and write permissions on the OSS bucket.

Precautions

After you execute an SQL statement to create or delete a catalog, database, or table, the changes do not immediately appear on the Catalogs page due to the cache mechanism of fully managed Flink. To view the changes, click the Refresh icon in the Catalogs pane of the Catalogs page.


Create an Apache Paimon catalog

Apache Paimon catalogs support two metadata storage types: filesystem and dlf. If you select the filesystem metadata storage type, metadata is stored only in OSS buckets. If you select the dlf metadata storage type, metadata is stored in OSS buckets and is also synchronized to Alibaba Cloud Data Lake Formation (DLF). You can select a metadata storage type based on your business requirements.

Create an Apache Paimon catalog on the UI (recommended)

  1. Go to the Catalogs page.

    1. Log on to the Realtime Compute for Apache Flink console.

    2. On the Fully Managed Flink tab, find the workspace that you want to manage and click Console in the Actions column.

    3. In the left-side navigation pane, click Catalogs.

  2. On the Catalog List page, click Create Catalog.

  3. On the Built-in Catalog tab of the Create Catalog dialog box, click Apache Paimon and click Next.

  4. Create an Apache Paimon catalog.

    1. Select filesystem or dlf from the metastore drop-down list based on your business requirements to specify the metadata storage type.


    2. Configure the catalog parameters.

      • If you select filesystem, configure the following parameters. All of the parameters are required.

        • catalog name: The name of the Apache Paimon catalog. Enter a custom name.

        • metastore: The metadata storage type. Select filesystem. Valid values:

          • filesystem: Metadata is stored only in OSS buckets.

          • dlf: Metadata is stored in OSS buckets and is also synchronized to Alibaba Cloud DLF.

        • warehouse: The data warehouse directory in OSS. The format is oss://<bucket>/<object>. Parameters in the path:

          • bucket: the name of the OSS bucket that you created.

          • object: the path in which your data is stored.

          You can view the names of your bucket and object in the OSS console.

      • If you select dlf, configure the following parameters. All of the parameters are required.

        • catalog name: The name of the Apache Paimon catalog. Enter a custom name.

        • metastore: The metadata storage type. Select dlf.

        • warehouse: The data warehouse directory in OSS. The format is oss://<bucket>/<object>. Parameters in the path:

          • bucket: the name of the OSS bucket that you created.

          • object: the path in which your data is stored.

          You can view the names of your bucket and object in the OSS console.

        • dlf.catalog.id: The ID of the DLF data directory. You can view the ID of the data directory in the DLF console.

        • dlf.catalog.accessKeyId: The AccessKey ID that is used to access the DLF service. For more information about how to obtain your AccessKey ID, see Create an AccessKey pair.

        • dlf.catalog.accessKeySecret: The AccessKey secret that is used to access the DLF service. For more information about how to obtain your AccessKey secret, see Create an AccessKey pair.

        • dlf.catalog.endpoint: The endpoint of the DLF service. For more information, see Supported regions and endpoints.

        • dlf.catalog.region: The region in which the DLF service resides. For more information, see Supported regions and endpoints.

          Note: Make sure that the value of this parameter matches the endpoint that is specified by the dlf.catalog.endpoint parameter.

    3. Click Confirm.

  5. View the catalog that you created in the Catalogs pane on the left side of the Catalog List page.

Create an Apache Paimon catalog by executing an SQL statement

  1. On the script editing page, enter a statement to create an Apache Paimon catalog.

    • If you select the filesystem metadata storage type, enter the following SQL statement:

      CREATE CATALOG <yourcatalogname> WITH (
        'type' = 'paimon',
        'metastore' = 'filesystem',
        'warehouse' = '<warehouse>'
      );

      All of the following parameters are required:

      • yourcatalogname: The name of the Apache Paimon catalog. Enter a custom name.

        Important: Remove the angle brackets (<>) when you replace the placeholder with the name of your catalog. Otherwise, an error is returned during the syntax check.

      • type: The type of the catalog. Set the value to paimon.

      • metastore: The metadata storage type. Set the value to filesystem.

      • warehouse: The data warehouse directory in OSS. The format is oss://<bucket>/<object>. Parameters in the path:

        • bucket: the name of the OSS bucket that you created.

        • object: the path in which your data is stored.

        You can view the names of your bucket and object in the OSS console.

      A filled-in example of the filesystem variant appears after this list.

    • If you select the dlf metadata storage type, enter the following SQL statement:

      CREATE CATALOG <yourcatalogname> WITH (
        'type' = 'paimon',
        'metastore' = 'dlf',
        'warehouse' = '<warehouse>',
        'dlf.catalog.id' = '<dlf.catalog.id>',
        'dlf.catalog.accessKeyId' = '<dlf.catalog.accessKeyId>',
        'dlf.catalog.accessKeySecret' = '<dlf.catalog.accessKeySecret>',
        'dlf.catalog.endpoint' = '<dlf.catalog.endpoint>',
        'dlf.catalog.region' = '<dlf.catalog.region>'
      );

      All of the following parameters are required:

      • yourcatalogname: The name of the Apache Paimon catalog. Enter a custom name.

        Important: Remove the angle brackets (<>) when you replace the placeholder with the name of your catalog. Otherwise, an error is returned during the syntax check.

      • type: The type of the catalog. Set the value to paimon.

      • metastore: The metadata storage type. Set the value to dlf.

      • warehouse: The data warehouse directory in OSS. The format is oss://<bucket>/<object>. Parameters in the path:

        • bucket: the name of the OSS bucket that you created.

        • object: the path in which your data is stored.

        You can view the names of your bucket and object in the OSS console.

      • dlf.catalog.id: The ID of the DLF data directory. You can view the ID of the data directory in the DLF console.

      • dlf.catalog.accessKeyId: The AccessKey ID that is used to access the DLF service. For more information about how to obtain your AccessKey ID, see Create an AccessKey pair.

      • dlf.catalog.accessKeySecret: The AccessKey secret that is used to access the DLF service. For more information about how to obtain your AccessKey secret, see Create an AccessKey pair.

      • dlf.catalog.endpoint: The endpoint of the DLF service. For more information, see Supported regions and endpoints.

      • dlf.catalog.region: The region in which the DLF service resides. For more information, see Supported regions and endpoints.

        Note: Make sure that the value of this parameter matches the endpoint that is specified by the dlf.catalog.endpoint parameter.
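
      For reference, the following statement shows the filesystem variant with the placeholders filled in. The catalog name and OSS path are hypothetical. Replace them with your own values.

      CREATE CATALOG my_paimon_catalog WITH (
        'type' = 'paimon',
        'metastore' = 'filesystem',
        'warehouse' = 'oss://my-bucket/paimon-warehouse' -- A hypothetical OSS bucket and path.
      );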

  2. Select the statement that creates the catalog and click Run on the left side of the code.


View an Apache Paimon catalog

After you create an Apache Paimon catalog, you can perform the following steps to view the metadata of the Apache Paimon catalog.

  1. Go to the Catalogs page.

    1. Log on to the Realtime Compute for Apache Flink console.

    2. On the Fully Managed Flink tab, find the workspace that you want to manage and click Console in the Actions column.

    3. In the left-side navigation pane, click Catalogs.

  2. On the Catalog List page, find the desired catalog and view the Name and Type columns of the catalog.

    Note

    If you want to view the databases and tables in the catalog, click View in the Actions column.

Use an Apache Paimon catalog

Create a database and a table

After you configure an Apache Paimon catalog, you can reference its tables as result tables and dimension tables in deployments. You do not need to declare the tables by using DDL statements.

In an SQL statement, you can reference a table of the Apache Paimon catalog by its fully qualified name in the ${Paimon-catalog-name}.${Paimon-db-name}.${Paimon-table-name} format. You can also execute the USE CATALOG ${Paimon-catalog-name} and USE ${Paimon-db-name} statements to declare the default catalog and database, and then reference the table by only its name in the ${Paimon-table-name} format.
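
For example, the following statements show both styles. This is a minimal sketch that reuses the hypothetical paimoncatalog catalog, test_db database, and test_tbl table from the examples in this topic.

-- Reference the table by its fully qualified name.
SELECT * FROM paimoncatalog.test_db.test_tbl;

-- Declare the default catalog and database, and then reference the table by name only.
USE CATALOG paimoncatalog;
USE test_db;
SELECT * FROM test_tbl;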

Use an Apache Paimon catalog on the UI

  1. Go to the Catalogs page.

    1. Log on to the Realtime Compute for Apache Flink console.

    2. On the Fully Managed Flink tab, find the workspace that you want to manage and click Console in the Actions column.

    3. In the left-side navigation pane, click Catalogs.

  2. On the Catalog List page, find the desired catalog and click View in the Actions column.

  3. On the page that appears, find the desired database and click View in the Actions column.

  4. On the page that appears, click Create Table.

  5. On the Built-in tab of the Create Table dialog box, click Apache Paimon and click Next.

  6. Enter the table creation statement and configure related parameters. Sample code:

    CREATE TABLE `<catalog name>`.test_db.test_tbl (
        dt STRING,
        id BIGINT,
        data STRING,
        PRIMARY KEY (dt, id) NOT ENFORCED
    ) PARTITIONED BY (dt);
  7. Click Confirm.

Use an Apache Paimon catalog by executing an SQL statement

  1. On the script editing page, enter the table creation statement.

    CREATE DATABASE paimoncatalog.test_db;
    
    CREATE TABLE paimoncatalog.test_db.test_tbl (
        dt STRING,
        id BIGINT,
        data STRING,
        PRIMARY KEY (dt, id) NOT ENFORCED
    ) PARTITIONED BY (dt);
  2. Select the table creation statement and click Run on the left side of the code.


Note

For more information about the parameters and usage of Apache Paimon tables, see Apache Paimon connector.

Read data from a catalog

SELECT * FROM `<catalog name>`.test_db.test_tbl;

Write data to a catalog

INSERT INTO `<catalog name>`.test_db.test_tbl VALUES ('2023-04-21', 1, 'AAA'), ('2023-04-21', 2, 'BBB');

Use an Apache Paimon catalog as the destination catalog of the CREATE TABLE AS statement

In Realtime Compute for Apache Flink whose engine version is vvr-6.0.7-flink-1.15 or later, you can use an Apache Paimon catalog as the destination catalog of the CREATE TABLE AS statement.

CREATE TABLE IF NOT EXISTS `<catalog name>`.`<db name>`.`<table name>`
WITH (
  'bucket' = '4' -- Specify the number of buckets for the result table.
) AS TABLE `<source table>`;

The CREATE TABLE AS statement allows you to specify physical table properties in the WITH clause. These properties are applied to the destination table when the table is created. For more information about the table properties supported by Apache Paimon catalogs, see Apache Paimon connector.

When a deployment that executes the CREATE TABLE AS statement is running, the data types of existing fields in the source table may change. For example, the precision may change from VARCHAR(10) to VARCHAR(20). Apache Paimon catalogs allow you to configure 'enableTypeNormalization' = 'true' in the WITH clause to enable the type normalization mode. In this mode, a data type change in the source table does not cause the deployment to fail, provided that the data types before and after the change are normalized to the same data type based on the type normalization rules.

The type normalization mode has the following rules:

  • The TINYINT, SMALLINT, INT, and BIGINT data types are converted into the BIGINT data type.

  • The CHAR, VARCHAR, and STRING data types are converted into the STRING data type.

  • The FLOAT and DOUBLE data types are converted into the DOUBLE data type.

  • Other data types are not normalized.

Examples:

  • When the type normalization mode is enabled, both the SMALLINT and INT data types are normalized to the BIGINT data type. If you change the SMALLINT data type to the INT data type, the normalized data types match and the change succeeds. Therefore, the deployment that executes the CREATE TABLE AS statement runs as expected.

  • When the type normalization mode is enabled, the FLOAT data type is normalized to the DOUBLE data type, whereas the BIGINT data type remains BIGINT. If you change the FLOAT data type to the BIGINT data type, the normalized data types do not match and a data type incompatibility error occurs.
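
For example, the following statement sketches how to enable the type normalization mode in a CREATE TABLE AS deployment. The catalog, database, and table names are placeholders, as in the preceding examples.

CREATE TABLE IF NOT EXISTS `<catalog name>`.`<db name>`.`<table name>`
WITH (
  'enableTypeNormalization' = 'true' -- Enable the type normalization mode.
) AS TABLE `<source table>`;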

Use an Apache Paimon catalog as the destination catalog of the CREATE DATABASE AS statement

In Realtime Compute for Apache Flink whose engine version is vvr-6.0.7-flink-1.15 or later, you can use an Apache Paimon catalog as the destination catalog of the CREATE DATABASE AS statement.

CREATE DATABASE IF NOT EXISTS `<catalog name>`.`<db name>`
WITH (
  'bucket' = '4' -- Specify the number of buckets for each result table.
) AS DATABASE `<source database>`;

The CREATE DATABASE AS statement allows you to configure physical table properties in the WITH clause for a deployment. When the deployment starts, these properties take effect on the result tables to which you want to synchronize data. For more information about the table properties supported by Apache Paimon catalogs, see Apache Paimon connector.

When you use the CREATE DATABASE AS statement, Apache Paimon catalogs also allow you to configure 'enableTypeNormalization' = 'true' in the WITH clause to enable the type normalization mode.
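
A minimal sketch with placeholder names:

CREATE DATABASE IF NOT EXISTS `<catalog name>`.`<db name>`
WITH (
  'enableTypeNormalization' = 'true' -- Enable the type normalization mode for all result tables.
) AS DATABASE `<source database>`;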

Delete an Apache Paimon catalog

Delete an Apache Paimon catalog on the UI

  1. Go to the Catalogs page.

    1. Log on to the Realtime Compute for Apache Flink console.

    2. On the Fully Managed Flink tab, find the workspace that you want to manage and click Console in the Actions column.

    3. In the left-side navigation pane, click Catalogs.

  2. On the Catalog List page, find the desired catalog and click Delete in the Actions column.

  3. In the message that appears, click Delete.

  4. View the Catalogs pane on the left side of the Catalog List page to check whether the catalog is deleted.

Delete an Apache Paimon catalog by executing an SQL statement

  1. On the script editing page, enter the following statement:

    DROP CATALOG <catalog name>;

    Note

    The engine version must be vvr-6.0.6-flink-1.15 or later.

    <catalog name> indicates the name of the Apache Paimon catalog that you want to delete.

  2. Right-click the statement that is used to delete the catalog and choose Run from the shortcut menu.

  3. View the Catalogs pane on the left side of the Catalog List page to check whether the catalog is deleted.