What is Apache Paimon?

Discover Apache Paimon: the solution for real-time data processing, seamlessly integrating Flink & Spark for streaming & batch operations.

Apache Paimon is an advanced lake format that supports building a Realtime Lakehouse Architecture, effectively integrating with Apache Flink and Apache Spark for both streaming and batch processes. It utilizes a combination of lake format and LSM (log-structured merge-tree) to facilitate real-time streaming updates within lake architectures. Key features include real-time updates with high performance, large-scale append data processing, and comprehensive data lake capabilities such as ACID transactions and schema evolution. For developers interested in exploring Apache Paimon, there are quick-start guides available for both Apache Flink and Apache Spark.
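
To give a feel for the developer experience, here is a minimal quick-start sketch in PyFlink, modeled on the official quick starts. The catalog name, table, and local warehouse path are illustrative, and it assumes the paimon-flink bundle jar is on the classpath:

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

# Batch mode is enough for a first look; streaming mode works the same way.
t_env = TableEnvironment.create(EnvironmentSettings.in_batch_mode())

# Register a Paimon catalog backed by a local warehouse directory
# (an HDFS or OSS path would be used in practice).
t_env.execute_sql("""
    CREATE CATALOG paimon_catalog WITH (
        'type' = 'paimon',
        'warehouse' = 'file:/tmp/paimon'
    )
""")
t_env.execute_sql("USE CATALOG paimon_catalog")

# A simple append table; primary-key tables for updates are sketched later.
t_env.execute_sql("CREATE TABLE IF NOT EXISTS word_count (word STRING, cnt BIGINT)")
t_env.execute_sql("INSERT INTO word_count VALUES ('paimon', CAST(1 AS BIGINT))").wait()
t_env.execute_sql("SELECT * FROM word_count").print()
```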

Challenges in Streaming Computing and Storage

Building on this foundation, the challenges of streaming computing highlight why Paimon's robust capabilities are necessary:

  • First, streaming systems must sustain high data throughput and low latency to process and store data in real time.
  • Second, they must maintain data consistency across distributed systems while ensuring fault tolerance, which is especially hard in scenarios involving stateful computations.
  • Additionally, integrating batch and stream processing is a complex challenge, requiring systems to handle both seamlessly without sacrificing performance or scalability.

How Apache Paimon was Invented

These challenges necessitated a new solution and led Jinsong Li to develop Apache Paimon, a system tailored specifically to these needs. His journey through various open-source projects and the evolving stream computing landscape shaped its inception.

Apache Hive Real-time Enhancement

In the pursuit of improving real-time data processing capabilities, one approach involved enhancing Apache Hive to handle real-time data streams. This method sought to transform Apache Hive from a batch-oriented system into one that could support streaming inputs. By leveraging the batch capabilities of Apache Hive alongside streaming data, the solution aimed to reduce storage costs and increase query flexibility.

However, the main challenge was achieving low-latency processing while maintaining the consistency guarantees typically associated with batch processing. The real-time solution built on Apache Hive with the Flink Hive Sink addresses several crucial aspects of data handling (a minimal configuration sketch follows the list):

  • It writes to Apache Hive in formats such as Parquet, ORC, and CSV, with exactly-once write consistency.
  • It supports partitioned commits, which are essential for integrating streaming data into Apache Hive's traditionally offline data warehouses.
  • Benefits include near-real-time data warehousing, with consistent, minute-level latency thanks to checkpointing, and low storage costs, since mainly raw data is stored, which keeps queries flexible.
  • Query performance is the main limitation: data sits as plain columnar files on inexpensive hardware, so read speeds are slower than in a dedicated query system.
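
A minimal sketch of that mechanism, using Flink SQL from PyFlink: the schema, path, and option values are illustrative, and the filesystem connector is shown here because it exposes the same partition-commit options as the Hive sink:

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Partitioned Parquet sink: files roll with checkpoints, and a partition is
# committed (here, by writing a _SUCCESS file) so that downstream batch
# queries see complete data with minute-level latency.
t_env.execute_sql("""
    CREATE TABLE server_logs (
        log_time TIMESTAMP(3),
        message  STRING,
        dt       STRING,
        hr       STRING
    ) PARTITIONED BY (dt, hr) WITH (
        'connector' = 'filesystem',
        'path' = 'file:///tmp/warehouse/server_logs',
        'format' = 'parquet',
        'sink.partition-commit.trigger' = 'process-time',
        'sink.partition-commit.policy.kind' = 'success-file'
    )
""")
```

With the Hive connector proper, the same options go into the table's TBLPROPERTIES, and a 'metastore' commit policy makes finished partitions visible to Hive queries.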

Apache Iceberg Real-time Enhancement

Another significant attempt was the real-time enhancement of Apache Iceberg. This approach aimed to extend Apache Iceberg to handle real-time data flows more effectively, providing stronger ACID guarantees and improved metadata management. The work centered on integrating Apache Iceberg with Apache Flink so that data could be ingested into the lake in real time and read as a stream directly from Apache Iceberg, improving on traditional Apache Hive-based data warehouses. However, the solution faced significant challenges in upsert scenarios, which are crucial for Change Data Capture (CDC). It struggled with the high storage and computational costs of maintaining both full and incremental tables, was cumbersome to manage, and had difficulty efficiently processing the CDC data generated during stream computing.

Upsert Exploration with Apache Hudi

The exploration of upsert capabilities with Apache Hudi was a pivotal development in streaming computing and storage. This initiative focused on integrating Apache Hudi to support upserts (inserts and updates), which are critical for real-time data processing where changes must be captured instantly. Apache Hudi offered a way to manage frequently changing streaming data, providing efficient mechanisms for scenarios where data mutability and state consistency are necessary.

Advantages: Apache Hudi introduced an innovative way of handling upserts by using Apache Flink state to map each key to a file group, which automates the scaling process. Its Bucket Index solution further improves performance by sidestepping the major issues of the Flink state index: data is divided into multiple buckets determined by a hash function, which simplifies indexing and alleviates many of the performance problems previously encountered.
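
The core idea of a bucket index fits in a few lines. This toy sketch (not Hudi's actual hash function) shows why no per-key index state is needed:

```python
import zlib

def bucket_for(record_key: str, num_buckets: int) -> int:
    """Map a record key to a fixed bucket (file group) using a stable hash,
    so every update to the same key lands in the same bucket."""
    return zlib.crc32(record_key.encode("utf-8")) % num_buckets

# Every version of order 42 routes to the same bucket, in any engine or job,
# because the mapping is a pure function of the key and the bucket count.
assert bucket_for("order-42", 16) == bucket_for("order-42", 16)
```

Because the mapping is stateless, any writer can compute it; the price is that the bucket count must be fixed up front, which is exactly the tuning problem described below.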

Drawbacks: The primary drawback of Apache Hudi's approach is its complex system design, which often degrades performance, especially on large datasets exceeding 500 million entries. Keeping all indexes in RocksDB state also drives up storage costs and reduces efficiency. Data consistency can be compromised if other engines read or write the table, since this disrupts the state-held index. Moreover, selecting an appropriate bucket number for the Bucket Index is difficult, and a poor choice hurts performance and leads to small-file problems.

Apache Hudi's system, originally designed for batch processing with Apache Spark, struggles to fully adapt to stream processing scenarios, leading to increased system complexity and maintenance difficulties. Despite improvements in stability in recent versions, Apache Hudi's adaptability issues and intricate settings make it less user-friendly, particularly for new users navigating its multiple operational modes.

These experiences pointed to the need for a system that could handle streaming data more effectively than existing technologies: a project that integrates streaming computation with data lake storage, providing real-time updates and data management tailored to modern data architectures.

Ideal Solution for Streaming Lake Format

The ideal solution for a streaming lake format would meet four requirements:

  1. Robust Architectural Foundation: Comparable to Apache Iceberg, ensuring reliability and scalability in lake storage.
  2. Advanced Upsert Capabilities: Built on LSM structures of the kind proven in OLAP systems, streaming compute state backends, and KV systems.
  3. Streaming-First Design: Optimized integration with Apache Flink, specifically designed for streaming contexts to avoid the pitfalls of retrofitting on complex systems.
  4. Community Focus: Oriented towards community developers and users, with a long-term commitment to advancing streaming and lake technologies.

Key Features of Apache Paimon

Building on the ideal solution for a streaming lake format, the inception of Apache Paimon marks a significant stride towards realizing these aspirations. Emerging from discussions within the Apache Flink community, Apache Paimon integrates robust data management strategies to cater effectively to dynamic streaming environments.

Key Features of Apache Paimon:

  • Dynamic Table Storage: Initially conceptualized as Flink Table Store (FTS), Paimon is designed to handle real-time and near-real-time data, streamlining materialized views and querying.
  • Integration with Apache Flink: Paimon boasts deep integration with Apache Flink, enhancing its capabilities to process and manage streaming data efficiently.
  • Lakehouse Architecture: Combining the advantages of data lakes and data warehouses, it offers comprehensive storage solutions with improved query performance.
  • LSM Structures: Leverages Log-Structured Merge-tree structures for high-throughput data ingestion and updates; a short sketch of the resulting upsert behavior follows this list.
  • Production-Ready: By the release of version 0.3, Paimon had evolved into a robust system recommended for production use, providing flexibility in managing data latency and costs.
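
To make the LSM point concrete, here is a hedged sketch of a primary-key table in PyFlink (names and the warehouse path are hypothetical, reusing the catalog setup from the earlier sketch): successive writes to the same key are merged, so readers always see the latest row.

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_batch_mode())
t_env.execute_sql(
    "CREATE CATALOG paimon_catalog WITH ('type' = 'paimon', 'warehouse' = 'file:/tmp/paimon')")
t_env.execute_sql("USE CATALOG paimon_catalog")

# Primary-key table: Paimon's LSM merges entries per key, keeping the latest.
t_env.execute_sql("""
    CREATE TABLE IF NOT EXISTS orders (
        order_id BIGINT,
        status   STRING,
        PRIMARY KEY (order_id) NOT ENFORCED
    ) WITH ('bucket' = '4')
""")
t_env.execute_sql("INSERT INTO orders VALUES (CAST(1 AS BIGINT), 'CREATED')").wait()
t_env.execute_sql("INSERT INTO orders VALUES (CAST(1 AS BIGINT), 'PAID')").wait()
t_env.execute_sql("SELECT * FROM orders").print()  # one row: (1, 'PAID')
```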

The benefits of Apache Paimon include:

  • Strong Upsert Capabilities: It utilizes data lake architecture combined with LSM (Log-Structured Merge-tree) to support robust upsert operations and natural data skipping.
  • Multi-Engine Support: Originating from the Apache Flink ecosystem, Paimon supports all Flink SQL features, including Flink CDC, and is designed to integrate with multiple computing engines such as Apache Spark (see the Spark read sketch after this list).
  • Real-Time Data Lake Design: Paimon is specifically tailored for real-time data lake scenarios, significantly enhancing data freshness across the entire data pipeline and supporting rapid iteration and development.
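
As a sketch of the multi-engine point above, the same warehouse written by Flink can be read from Spark. The catalog name and path are illustrative, and the paimon-spark bundle jar is assumed to be on Spark's classpath:

```python
from pyspark.sql import SparkSession

# Attach Paimon as a Spark catalog pointing at the warehouse written by Flink.
spark = (
    SparkSession.builder
    .config("spark.sql.catalog.paimon", "org.apache.paimon.spark.SparkCatalog")
    .config("spark.sql.catalog.paimon.warehouse", "file:/tmp/paimon")
    .getOrCreate()
)

# The orders table from the earlier Flink sketch, queried from Spark SQL.
spark.sql("SELECT * FROM paimon.default.orders").show()
```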

Use Cases and Applications of Apache Paimon

Apache Paimon serves a range of use cases, particularly enhancing the functionality of streaming data architectures. One notable application is its integration in building streaming data warehouses, where it supports seamless real-time data processing and analytics. This is particularly beneficial for organizations looking to streamline their data processing frameworks and reduce the complexity typically associated with large-scale data operations.

Below are some industry case studies of Apache Paimon:

Tongcheng Travel

Tongcheng Travel, a significant player in the travel industry in China, embarked on a transformative journey with Apache Paimon to enhance their data management and processing capabilities. Initially, the company utilized Apache Hive and later Apache Kudu to manage its data warehouse needs, aiming to meet the increasing demands for real-time data processing. However, the latency and complexity in managing data storage posed challenges. In pursuit of a more efficient solution, Tongcheng Travel shifted to Apache Flink and Apache Hudi, which improved data reuse and streaming capabilities but still struggled with data synchronization and consistency.

Recognizing these challenges, Tongcheng Travel transitioned to Apache Paimon in 2023, leveraging its advanced features for real-time data processing and efficient state management. This shift enabled the company to process about 80% of its jobs with Apache Paimon, enhancing the performance of over 500 jobs and managing roughly 100 TB of data across various real-time and batch processing scenarios. The use of Apache Paimon led to significant improvements, including a 30% reduction in synchronization resources, a threefold increase in write speeds, and substantial query efficiency gains. This case study exemplifies the transformative impact of Apache Paimon in optimizing data lakehouse architectures, significantly enhancing data handling and operational efficiency in large-scale business environments.

Autohome Inc.

Autohome Inc., a leader in automotive services, has significantly advanced its big data architecture by integrating Apache Paimon into its systems. Spearheaded by Di Xingxing, the head of the big data computing platform at Autohome, the company transitioned from Apache Iceberg to Apache Paimon due to the latter's superior stream processing capabilities and efficient community interaction. This shift was motivated by the need for improved data timeliness and the ability to handle real-time data updates more effectively.

Apache Paimon's integration with Apache Flink and StarRocks at Autohome has created a robust streaming lakehouse architecture that enhances real-time computing and data analysis efficiency. This system enables Autohome to update its recommendation models and other data-driven processes from daily or hourly updates to updates within minutes, significantly reducing data latency and supporting dynamic decision-making. The use of Apache Paimon at Autohome exemplifies its utility in large-scale enterprises where data timeliness and processing efficiency are crucial.

China Unicom

China Unicom has successfully integrated Apache Paimon into its streaming lakehouse architecture, spearheaded by WANG Yunpeng of China Unicom Digital Tech. The organization initially utilized Apache Spark Streaming and later transitioned to Apache Flink to overcome challenges related to high latency and state data management in real-time processing. As the data complexities and volume escalated, the existing systems struggled to meet the dynamic needs of data integration and management. This led China Unicom to adopt Apache Paimon, which facilitated a unified approach to handling streaming and batch data, significantly reducing data redundancy and inconsistencies.

The implementation of Apache Paimon allowed China Unicom to manage a vast volume of data, supporting 700 streaming tasks and processing trillions of data points across more than 100 tables. This strategic move not only simplified their data architecture by enabling efficient real-time updates and integrations but also significantly enhanced the performance of their data operations. The architecture leverages Apache Paimon for minute-level latency requirements and complex data integration tasks, proving essential for the real-time applications crucial to China Unicom's operations.

Adopt Apache Paimon on Alibaba Cloud

Apache Paimon on Alibaba Cloud offers unique features that enhance real-time data ingestion into data lakes. This integration allows for high-throughput data writing and low-latency queries, supporting both streaming and batch data processing. Apache Paimon integrates readily with Alibaba Cloud services such as Realtime Compute for Apache Flink, as well as with big data frameworks like Apache Spark and Apache Hive. This setup facilitates the construction of data lakes on Hadoop Distributed File System (HDFS) or Object Storage Service (OSS), enhancing data lake analytics capabilities. To experience the advanced features of Apache Paimon on Alibaba Cloud, visit Realtime Compute for Apache Flink and start a free trial.

Summary

Apache Paimon is an advanced lake format designed for Realtime Lakehouse Architecture, seamlessly integrating with Apache Flink and Apache Spark for streaming and batch processes. It addresses challenges in streaming computing by offering real-time updates, high-performance data processing, and comprehensive data lake capabilities. Leveraging earlier enhancements in technologies like Apache Hive, Apache Iceberg, and Apache Hudi, Apache Paimon emerges as a production-ready solution with dynamic table storage and deep Apache Flink integration.

Through industry case studies, including Tongcheng Travel, Autohome Inc., and China Unicom, Apache Paimon demonstrates its transformative impact on optimizing data lakehouse architectures and enhancing operational efficiency. With integration on Alibaba Cloud, Apache Paimon facilitates seamless real-time data ingestion into data lakes, empowering organizations to leverage streaming data effectively.
