Apache Paimon (Paimon) provides a unified storage format for different data types. It works with Apache Flink and Apache Spark to implement a real-time lakehouse architecture that supports both streaming and batch operations. Paimon combines a lake format with the log-structured merge-tree (LSM) structure to support real-time streaming updates in a lake architecture. In Realtime Compute for Apache Flink, you can use Paimon tables to quickly build a data lake on cloud storage services, such as Object Storage Service (OSS).
Paimon provides the following capabilities:
Enhanced real-time data ingestion: Paimon works with Realtime Compute for Apache Flink to ingest data from various database systems, such as MySQL, into a data lake, with automatic synchronization of schema changes and real-time updates. Tens of millions of records can be ingested efficiently with low latency.
Unified stream and batch processing: Paimon works with Apache Flink for stream processing and with Apache Spark for batch processing. Its unified data lake storage format improves ease of use and reduces costs.
Extensive ecosystem integration: Paimon can seamlessly integrate with a variety of Alibaba Cloud compute services, such as Realtime Compute for Apache Flink, E-MapReduce (Spark, StarRocks, Hive, and Trino), and MaxCompute.
Innovative lakehouse storage: Paimon uses deletion vectors and indexes to provide minute-level latency for streaming, batch, and online analytical processing (OLAP) queries.
For more information, see Apache Paimon.
Usage
Getting started
The first time you use Paimon, we recommend that you start with the basic features. For more information, see Getting started with basic features of Apache Paimon.
Use a Paimon primary key table to update data by primary key. If you only need to import data that does not have a primary key, use a Paimon Append Only table (non-primary key table).
For information about how Paimon ensures data freshness and consistency, see Data latency and consistency.
For a step-by-step guide to building a streaming lakehouse, see Build a streaming data lakehouse using Realtime Compute for Apache Flink, Apache Paimon, and StarRocks.
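The difference between a primary key table and an Append Only table can be illustrated with a small toy model. This is only a sketch in plain Python, not the Paimon API: the class names `AppendOnlyTable` and `PrimaryKeyTable` and the sample `order_id` data are hypothetical, and the model assumes that the latest write for a key replaces the earlier one.

```python
# Toy model only: this is NOT the Paimon API. It mimics the visible behavior
# of the two table types with plain Python containers.


class AppendOnlyTable:
    """Keeps every written row; rows are never matched against each other."""

    def __init__(self):
        self.rows = []

    def write(self, row):
        self.rows.append(dict(row))

    def scan(self):
        return list(self.rows)


class PrimaryKeyTable:
    """Keeps one row per primary key; a later write replaces the earlier one."""

    def __init__(self, primary_key):
        self.primary_key = primary_key
        self.rows = {}

    def write(self, row):
        self.rows[row[self.primary_key]] = dict(row)

    def scan(self):
        return [self.rows[key] for key in sorted(self.rows)]


events = [
    {"order_id": 1, "status": "created"},
    {"order_id": 2, "status": "created"},
    {"order_id": 1, "status": "paid"},  # same key as the first event
]

append_table = AppendOnlyTable()
pk_table = PrimaryKeyTable(primary_key="order_id")
for event in events:
    append_table.write(event)
    pk_table.write(event)

print(len(append_table.scan()))  # all three events are kept
print(pk_table.scan())           # one row per order_id, latest status wins
```

In this sketch the Append Only table returns all three events, while the primary key table returns two rows, with order 1 in the `paid` state.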
Create a Paimon catalog
A Paimon catalog provides access to Paimon tables stored in external systems. It lets you manage Paimon tables in a centralized manner and can be accessed by other Alibaba Cloud services. You can use Paimon catalogs in the following ways:
Create and use a Paimon catalog. For more information, see Manage Paimon Catalogs.
Synchronize the metadata of a Paimon table to Data Lake Formation (DLF). For more information, see Create a Paimon DLF catalog.
Create a Paimon external table in MaxCompute to access the associated Paimon table. For more information, see Create a Paimon MaxCompute catalog.
Synchronize the metadata of a Paimon table to DLF and create a Paimon external table in MaxCompute. For more information, see Create a Paimon sync catalog.
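As a rough illustration of what a catalog definition looks like, the following Python sketch renders a Flink SQL CREATE CATALOG statement from a dictionary of options. The helper `render_create_catalog`, the catalog name, and the OSS path are hypothetical. The `type` and `warehouse` options are the ones used by open-source Apache Paimon with Flink; a catalog in Realtime Compute for Apache Flink may require additional options, so treat Manage Paimon Catalogs as the authoritative reference.

```python
def render_create_catalog(name, options):
    """Render a Flink SQL CREATE CATALOG statement from a dict of options."""

    def quote(value):
        # SQL string literals escape a single quote by doubling it.
        return "'" + str(value).replace("'", "''") + "'"

    # Backticks inside an identifier are escaped by doubling them.
    identifier = "`" + name.replace("`", "``") + "`"
    body = ",\n".join(
        f"  {quote(key)} = {quote(value)}" for key, value in options.items()
    )
    return f"CREATE CATALOG {identifier} WITH (\n{body}\n);"


statement = render_create_catalog(
    "my_paimon_catalog",  # hypothetical catalog name
    {
        "type": "paimon",
        "warehouse": "oss://my-bucket/warehouse",  # hypothetical OSS path
    },
)
print(statement)
```

The printed statement can be pasted into a SQL draft and adjusted to the options that your catalog type actually needs.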
Create a Paimon table
Create a Paimon table directly in a Paimon catalog. For more information, see Manage Paimon tables.
Synchronize tables from data sources, such as MySQL and Kafka, to a Paimon catalog by using the CREATE TABLE AS (CTAS) or CREATE DATABASE AS (CDAS) statement. For more information, see Create a table using the CREATE TABLE AS (CTAS) or CREATE DATABASE AS (CDAS) statement.
Write data to a Paimon table
Insert data into or update data in a Paimon table. For more information, see Write data to a Paimon table.
Join a Paimon table with other tables and apply aggregate functions. For more information, see Merge engine.
Partially or completely overwrite a Paimon table. For more information, see Use the INSERT OVERWRITE statement to overwrite data.
Delete data from a Paimon table. For more information, see Delete data using the DELETE statement.
Delete partitions from a Paimon table. For more information, see Modify the table schema.
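To give a feel for what a merge engine with aggregate functions does, here is a toy model in plain Python. It is not the Paimon API and makes several assumptions: the function `merge_rows`, the `AGGREGATORS` table, and the sample product data are hypothetical, all rows share the same fields, and fields without a configured function fall back to keeping the last value.

```python
# Toy model only, not the Paimon API: it imitates how an aggregation-style
# merge engine could fold rows that share a primary key into a single row.

AGGREGATORS = {
    "sum": lambda old, new: old + new,
    "max": lambda old, new: max(old, new),
    "last_value": lambda old, new: new,
}


def merge_rows(rows, primary_key, field_aggregators):
    """Fold rows in arrival order into one output row per primary key value."""
    merged = {}
    for row in rows:
        key = row[primary_key]
        if key not in merged:
            merged[key] = dict(row)
            continue
        current = merged[key]
        for field, value in row.items():
            if field == primary_key:
                continue
            # Assumption of this sketch: unconfigured fields keep the last value.
            func = AGGREGATORS[field_aggregators.get(field, "last_value")]
            current[field] = func(current[field], value)
    return [merged[key] for key in sorted(merged)]


rows = [
    {"product_id": 1, "sales": 10, "max_price": 5.0},
    {"product_id": 2, "sales": 3, "max_price": 9.0},
    {"product_id": 1, "sales": 7, "max_price": 4.5},
]
result = merge_rows(rows, "product_id", {"sales": "sum", "max_price": "max"})
print(result)
```

For product 1, the two input rows collapse into a single row with `sales` equal to 17 and `max_price` equal to 5.0. The Merge engine topic describes which engines and functions Paimon actually supports.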
Consume data from a Paimon table
Query or consume data from a Paimon table. For more information, see Consume data from a Paimon table. If you consume data from a primary key table in streaming mode, make sure that a changelog producer is configured for the table.
Configure the consumer offset of a Paimon table. For more information, see Configure a consumer offset.
Save the consumer offset of a Paimon table or retain expired snapshot files that are still in use. For more information, see Specify a consumer ID.
Run a batch deployment to read the historical states of a Paimon table. For more information, see Batch Time Travel.
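The relationship between snapshots, a saved consumer offset, and time travel can be sketched with a toy model. This is not the Paimon API: the class `SnapshotLog` and its methods are hypothetical, and the model simplifies real behavior, for example by letting a new consumer start at the earliest retained snapshot.

```python
# Toy model only, not the Paimon API.


class SnapshotLog:
    def __init__(self):
        self.snapshots = {}  # snapshot id -> {"delta": [...], "table": [...]}
        self.table = []      # latest table contents
        self.next_id = 1
        self.consumers = {}  # consumer id -> next snapshot id to read

    def commit(self, rows):
        """Create a snapshot holding the new rows and the full table state."""
        self.table = self.table + list(rows)
        snapshot_id = self.next_id
        self.snapshots[snapshot_id] = {
            "delta": list(rows),
            "table": list(self.table),
        }
        self.next_id += 1
        return snapshot_id

    def poll(self, consumer_id):
        """Return the rows of the next unread snapshot and save the new offset."""
        earliest = min(self.snapshots, default=self.next_id)
        position = self.consumers.get(consumer_id, earliest)
        if position not in self.snapshots:
            return []  # nothing new to read yet
        self.consumers[consumer_id] = position + 1
        return self.snapshots[position]["delta"]

    def read_as_of(self, snapshot_id):
        """Batch time travel: the table as it looked at a given snapshot."""
        return self.snapshots[snapshot_id]["table"]

    def expire(self, retain_latest):
        """Drop old snapshots, but keep those a saved consumer still needs."""
        oldest_by_count = self.next_id - retain_latest
        oldest_needed = min(self.consumers.values(), default=oldest_by_count)
        cutoff = min(oldest_by_count, oldest_needed)
        for snapshot_id in [s for s in self.snapshots if s < cutoff]:
            del self.snapshots[snapshot_id]


log = SnapshotLog()
for batch in (["a"], ["b"], ["c"]):
    log.commit(batch)

print(log.poll("etl"))        # the consumer reads snapshot 1
log.expire(retain_latest=1)   # snapshot 2 survives because "etl" has not read it
print(sorted(log.snapshots))
print(log.read_as_of(2))      # table contents as of snapshot 2
```

Without the saved offset for `etl`, the same `expire` call would keep only snapshot 3, and the consumer could no longer resume from where it stopped.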
Maintain a Paimon table
Learn how to address common issues related to Paimon. For more information, see FAQ about connectors.
Optimize the read and write performance of Paimon tables. For more information, see Performance optimization.
Query the metadata of a Paimon table, such as the partitions and the total size of files in each partition. For more information, see System tables.
Modify the schema of a table in a Paimon catalog. For more information, see Modify the table schema.
Delete a table from a Paimon catalog. For more information, see Delete a table.
Change the number of buckets for a Paimon table that uses fixed bucket mode. For more information, see Change the number of buckets in fixed bucket mode.
Clean up obsolete files in the directory of a Paimon table. For more information, see Clean up expired data.
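Why changing the number of buckets in fixed bucket mode is a maintenance task of its own can be seen in a toy model. This is not the Paimon API, and the CRC32 hash is an assumption made for the sketch: Paimon uses its own hash function and file layout. The functions `bucket_of` and `distribute` and the sample keys are hypothetical.

```python
import zlib

# Toy model only: Paimon's real hash function and file layout differ.


def bucket_of(key, num_buckets):
    """Map a key to a bucket with a deterministic hash."""
    return zlib.crc32(str(key).encode("utf-8")) % num_buckets


def distribute(keys, num_buckets):
    """Group keys by the bucket that each key hashes to."""
    buckets = {bucket: [] for bucket in range(num_buckets)}
    for key in keys:
        buckets[bucket_of(key, num_buckets)].append(key)
    return buckets


keys = [f"user-{i}" for i in range(1000)]
before = distribute(keys, 4)
after = distribute(keys, 8)

# Keys whose bucket number changes when the bucket count goes from 4 to 8.
moved = sum(1 for key in keys if bucket_of(key, 4) != bucket_of(key, 8))
print({bucket: len(members) for bucket, members in after.items()})
print(moved)
```

Because a key's bucket depends on the bucket count, a different count sends many existing keys to a different bucket, so data that is already written generally has to be reorganized rather than left in place. The topic Change the number of buckets in fixed bucket mode describes the supported procedure.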