vivo's Lakehouse Integration Practice Based on Paimon

This article is compiled from the presentation by Xu Yu, an internet big data expert at vivo and Apache Paimon Committer, during the Flink Forward Asia 2024 Streaming Lakehouse session (Part One).

Summary: The presentation is based on vivo's real-world cases and showcases some key decisions and technical practices in building a modern data lakehouse, including component selection, architecture design, performance optimization, and data migration exploration. The content is divided into the following sections:

Component Selection and Architecture

Offline Acceleration

Unified Stream and Batch Link

Message Component Alternative

Sample Concatenation

Query Acceleration

Metadata Monitoring

Data Migration

Future Outlook

Component Selection and Architecture

Component Selection

Our technology stack uses Flink as the primary computing engine, StarRocks for federated query acceleration, and Paimon as the core storage layer covering all lakehouse scenarios. For the file format, ORC was the recommended choice in older Paimon versions. Since version 1.0, Parquet offers more robust features, including support for complex data types, so you can choose the format based on the version in use.

Lakehouse Architecture

The unified lakehouse architecture has three major applications in real-time processing scenarios: offline acceleration, link merging, and analysis optimization for traditional database data.

  • Offline Acceleration: Data is read from message middleware such as Kafka and written to Paimon either in append mode or after processing with Flink's intermediate state, achieving what would otherwise be done offline with Spark or similar frameworks. This process generates near-real-time operational data store (ODS) and data warehouse (DW) layers, which can then be queried through Flink or StarRocks, improving the speed and efficiency of data processing.
  • Link Merging: The Lambda architecture maintains separate real-time and offline processing links. The lakehouse architecture unifies these paths to reduce redundant computation and storage costs and to simplify team management and maintenance. All operations are completed within one system, which avoids the data inconsistencies that arise when data has to be backfilled separately on each link. For example, real-time backfill can be done directly on Paimon, keeping stream and batch processing consistent.
  • DB Data Analysis Optimization: For the traditional analysis path from databases to Hive, replacing the original method with Paimon and Flink can reduce data latency from days or hours to minutes. Note, however, that although very short checkpoint intervals give lower latency, they can generate numerous small files and put pressure on HDFS. A checkpoint interval of at least 5 minutes is therefore recommended to keep file management effective. In addition, Paimon's time travel feature allows historical data to be managed efficiently through regular snapshots, for example by automatically creating a snapshot daily at midnight and retaining one week of history, which optimizes storage usage.
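The trade-off between checkpoint interval and small files can be estimated with simple arithmetic. The sketch below is only an illustration: it assumes that every checkpoint commits exactly one new file per active bucket and that no compaction runs. The function name and the bucket count of 64 are hypothetical, and real file counts depend on the write pattern and compaction settings.

```python
def files_per_day(checkpoint_interval_s: int, active_buckets: int) -> int:
    """Rough upper bound on new data files written per day.

    Assumption (illustrative, not Paimon's actual behavior): each
    checkpoint commits one new file per active bucket, with no compaction.
    """
    if checkpoint_interval_s <= 0:
        raise ValueError("checkpoint interval must be positive")
    checkpoints_per_day = 86_400 // checkpoint_interval_s
    return checkpoints_per_day * active_buckets


if __name__ == "__main__":
    # Compare a 30-second, a 1-minute, and a 5-minute checkpoint interval.
    for interval in (30, 60, 300):
        print(interval, files_per_day(interval, active_buckets=64))
```

Under these assumptions, moving from a 30-second to a 5-minute interval cuts the number of newly written files by a factor of ten, which is the effect the 5-minute recommendation is aiming for.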

Offline Acceleration

Below is a detailed explanation of the benefits brought by offline acceleration. Compared to traditional data warehouses, the offline acceleration solution using a unified lakehouse architecture significantly enhances the timeliness of data processing. Specifically, this architecture achieves transformative improvements in timeliness through the following means:

  • Data Source Collection: Data sources include log-type data (such as service logs and online device data) and database data. These data are collected by sensors or other devices, continuously transmitted to service logs, and then written into Hive in an offline manner.
  • Example of Actual Production Link:

    • Traditional Data Warehouse: For instance, performing ETL processing and generating DM/DW data typically takes two hours.
    • Paimon Architecture: By adopting Paimon, since the entire link is near-real-time, processing time can be reduced from hours to minutes, usually controlled within ten minutes.
  • Link Integrity and Disaster Recovery:

    • Paimon provides excellent support for concurrent writes. Conflicts do not occur as long as two jobs do not write to the same bucket. In actual production, concurrent jobs usually write to different partitions, so there is no need to worry about conflicts, and data can be processed efficiently in parallel.
    • The data rewrite mechanism also ensures the integrity and disaster recovery capability of the link.
  • Application Scenarios:

    • For businesses requiring high timeliness, such as algorithm processing or real-time reporting, the offline acceleration solution can significantly enhance efficiency. Through this approach, data processing speed and response time can be greatly improved, better supporting business needs.
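The bucket-level conflict rule described under Link Integrity can be modeled in a few lines. This is a toy check, not Paimon's commit protocol: the `(partition, bucket)` tuples and the function name are assumptions made for illustration.

```python
from typing import Iterable, Set, Tuple

BucketKey = Tuple[str, int]  # (partition value, bucket id), hypothetical layout


def conflicting_buckets(job_a: Iterable[BucketKey],
                        job_b: Iterable[BucketKey]) -> Set[BucketKey]:
    """Return the buckets that both jobs write to.

    An empty result means the two jobs touch disjoint buckets and, under
    the rule stated above, can write concurrently without conflict.
    """
    return set(job_a) & set(job_b)


if __name__ == "__main__":
    backfill = [("dt=2024-11-01", 0), ("dt=2024-11-01", 1)]
    streaming = [("dt=2024-11-02", 0), ("dt=2024-11-02", 1)]
    print(conflicting_buckets(backfill, streaming))  # different partitions
```

Because a bucket belongs to a partition, two jobs that write to different partitions never share a bucket, which is why the common production case is conflict-free.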

Unified Stream and Batch Link

Traditional data processing architectures typically consist of two separate links: one for offline processing based on Spark and Hive, and another for real-time processing based on Kafka and Flink. Although this dual-link design ensures data accuracy and real-time capabilities, it consumes significant resources and lacks flexibility. In addition, because of how Kafka stores data, data written to Kafka is not easy to query directly, which adds complexity to its usage.

After adopting the Paimon and Flink architecture, all data processing links are unified: both real-time and offline data are written into Paimon tables and can be analyzed at any time, which improves flexibility. Furthermore, the merged link can reduce computational resource requirements by approximately 30%. With unified monitoring and comparison of indicators such as memory and CPU cores, resources can be managed and optimized more efficiently.

Message Component Alternative

In real-time scenarios, Kafka or Pulsar is typically used for data streaming and ingestion into real-time data warehouses. However, for cloud users, Kafka resources are valuable and costly, sometimes leading to resource shortages or high load issues. Paimon, as a low-cost alternative to message components, can achieve Kafka-like functionality through its Consumer mechanism. Although its latency is at the minute level, higher than Kafka's second-level delay, it is sufficient for many business scenarios.

By migrating some businesses to Paimon, redundant offline resources can be effectively utilized, enhancing storage utilization and significantly reducing computational and storage costs. In vivo's internal applications, this migration not only optimized the stability of data links but also significantly reduced overall resource costs, achieving a total cost reduction of up to 50%.
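To make the Kafka-like behavior concrete, the following toy model shows the general idea behind a consumer mechanism on a snapshot-based table: each named consumer remembers the next snapshot it should read, and snapshot expiration never removes a snapshot that a consumer still needs. The class and method names are hypothetical and only illustrate the concept. They are not Paimon's API.

```python
from typing import Dict, List


class SnapshotLog:
    """Toy model of a table whose commits are numbered snapshots."""

    def __init__(self) -> None:
        self._snapshots: Dict[int, List[str]] = {}
        self._next_id = 1
        self._consumers: Dict[str, int] = {}  # consumer id -> next snapshot to read

    def commit(self, rows: List[str]) -> int:
        """Store a batch of rows as a new snapshot and return its id."""
        sid = self._next_id
        self._snapshots[sid] = list(rows)
        self._next_id += 1
        return sid

    def poll(self, consumer_id: str) -> List[str]:
        """Return rows from every snapshot this consumer has not read yet."""
        earliest = min(self._snapshots, default=self._next_id)
        start = self._consumers.get(consumer_id, earliest)
        rows: List[str] = []
        for sid in range(start, self._next_id):
            rows.extend(self._snapshots.get(sid, []))
        self._consumers[consumer_id] = self._next_id  # persist progress
        return rows

    def expire(self, keep_last: int) -> None:
        """Drop old snapshots, but never one a consumer still has to read."""
        cutoff = self._next_id - keep_last
        if self._consumers:
            cutoff = min(cutoff, min(self._consumers.values()))
        for sid in [s for s in self._snapshots if s < cutoff]:
            del self._snapshots[sid]

    def snapshot_ids(self) -> List[int]:
        """Return the ids of the snapshots that are still retained."""
        return sorted(self._snapshots)
```

In this model a slow consumer keeps older snapshots alive, much as unconsumed messages are retained in a message queue. The cost is that data becomes visible only when a snapshot is committed, which matches the minute-level latency mentioned above.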

Sample Concatenation

In sample concatenation scenarios, both real-time and offline concatenation methods typically need to be handled. Offline concatenation involves full data distribution and insertion operations in specified partitions, which wastes computational resources and is inefficient. Real-time concatenation faces the challenge of large state management: state data may reach the TB level, posing risks to cluster stability.

Paimon's Partial Update feature enables efficient incremental updates and avoids the large-state problem. Specifically, stream A and stream B can be written directly into the same Paimon table, using lightweight hash calculations and incremental writes to achieve high-throughput writing, with the rows merged at query time. This approach not only reduces computational resource consumption but also enhances system stability and performance. Additionally, Paimon's delayed read capability can automatically synchronize dimension table data in special scenarios, ensuring data freshness.
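The merge semantics behind a partial update can be sketched in plain Python: rows that share a primary key are combined column by column, and a null value never overwrites a value that is already present. This is a minimal in-memory illustration of the idea. The function name and the `id`, `feature`, and `label` columns are hypothetical, and it does not reproduce Paimon's storage-level implementation.

```python
from typing import Any, Dict, Iterable, Optional

Row = Dict[str, Optional[Any]]


def partial_update(rows: Iterable[Row], key: str) -> Dict[Any, Row]:
    """Merge rows with the same key; non-null values overwrite, nulls do not."""
    merged: Dict[Any, Row] = {}
    for row in rows:
        current = merged.setdefault(row[key], {})
        for column, value in row.items():
            # Keep an existing value when the incoming one is null.
            if value is not None or column not in current:
                current[column] = value
    return merged


if __name__ == "__main__":
    stream_a = [{"id": 1, "feature": 0.3, "label": None}]   # features arrive first
    stream_b = [{"id": 1, "feature": None, "label": 1}]     # labels arrive later
    print(partial_update(stream_a + stream_b, key="id"))
```

Because each stream only writes the columns it owns, neither job has to hold the other stream's records in state while waiting for a match, which is how the large-state problem is avoided.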

In practical applications, this solution can reduce sample concatenation time from one to two hours to five minutes, significantly improving the effectiveness and speed of algorithm training.

Query Acceleration

In terms of query acceleration, Paimon provides significant performance improvements through federated queries and sort algorithms such as Z-order or Hilbert. For example, when different partitions or fields are queried at different times, query performance can be optimized by specifying partitions and using Procedures to merge fields. Compared to Hive, Paimon does not require deduplication and sorting across all partitions, reducing overall costs.
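As background on why Z-order helps multi-field queries, the sketch below shows the standard bit-interleaving (Morton code) construction for two non-negative integer columns. It illustrates the general technique only. The function name and the 16-bit width are assumptions, and the engine's actual implementation may differ.

```python
def z_order_key(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of x and y into a single sort key.

    Bits of x land on even positions and bits of y on odd positions, so
    rows that are close in both columns tend to be close in sort order.
    """
    if x < 0 or y < 0:
        raise ValueError("only non-negative integers are supported")
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)
        key |= ((y >> i) & 1) << (2 * i + 1)
    return key


if __name__ == "__main__":
    points = [(1, 1), (0, 0), (0, 1), (1, 0)]
    print(sorted(points, key=lambda p: z_order_key(*p)))
```

Sorting files by such a key clusters rows on both columns at once, so a filter on either column can skip more files than a sort on a single column would allow.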

In practical applications, with Paimon and the Spark or Flink engines, second-level point queries can be achieved on tables with tens of billions of records. Combined with MPP vectorized query technology, query times can be further compressed to the millisecond level. However, in high-concurrency scenarios, lower versions of Paimon (such as version 0.7) lack the Caching Catalog, leading to frequent redundant interactions with Hive Metastore (HMS), which affects query performance. Upgrading to version 0.9 or higher with the Caching Catalog allows millisecond-level responses to be maintained even with over 200 concurrent queries on tables with tens of billions of records.
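The effect of caching catalog metadata can be illustrated with a small time-to-live cache placed in front of a metastore lookup. This is a conceptual sketch, not Paimon's implementation: the class name, the `load_table` callback standing in for an HMS call, and the 600-second default TTL are all assumptions.

```python
import time
from typing import Callable, Dict, Tuple


class TtlCachingCatalog:
    """Toy catalog wrapper that caches table metadata for a limited time."""

    def __init__(self,
                 load_table: Callable[[str], dict],
                 ttl_s: float = 600.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._load_table = load_table  # stands in for a metastore round trip
        self._ttl_s = ttl_s
        self._clock = clock
        self._cache: Dict[str, Tuple[float, dict]] = {}

    def get_table(self, name: str) -> dict:
        """Return cached metadata if still fresh, otherwise reload it."""
        now = self._clock()
        hit = self._cache.get(name)
        if hit is not None and now - hit[0] < self._ttl_s:
            return hit[1]
        table = self._load_table(name)
        self._cache[name] = (now, table)
        return table
```

With such a wrapper, 200 concurrent queries against the same table trigger one metastore lookup per TTL window instead of 200, which is the kind of redundant interaction the upgrade removes.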

Additionally, Paimon supports file management after real-time data writing. Setting shorter checkpoint intervals may generate many small files. To avoid putting pressure on the HDFS cluster, Paimon regularly performs file merging, ensuring stable read and write performance.

Metadata Monitoring

In terms of lakehouse metadata monitoring, Flink tasks may disable certain table management functions, for example by setting write-only to true, to ensure efficient data writing. However, this can cause maintenance operations such as snapshot cleaning to be overlooked, resulting in slower queries and metadata inflation. To address this, a table-level metadata monitoring system can be constructed. This system automatically activates monitoring upon table creation and provides default rules. For instance, when the snapshot count exceeds 200, the system automatically triggers an alert.

The monitoring system is based on Paimon's system tables, with scheduled queries of these system tables via Flink and StarRocks engines, importing data into StarRocks' internal tables. The intelligent diagnostic system checks relevant metrics according to user configurations or system default rules. Once an alert rule is triggered, an alert message is immediately pushed, enabling users to promptly manage and maintain tables, such as performing snapshot cleaning.
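A rule check of this kind can be sketched as follows. The default threshold of 200 snapshots comes from the example above. Everything else, including the `Rule` type, the metric name `snapshot_count`, and the alert message format, is hypothetical and stands in for whatever the diagnostic system actually uses.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass(frozen=True)
class Rule:
    metric: str      # name of a metric collected from the system tables
    threshold: int   # alert when the metric is strictly greater than this


# Default rule applied to every new table (threshold taken from the text).
DEFAULT_RULES: Sequence[Rule] = (Rule("snapshot_count", 200),)


def evaluate(table: str,
             metrics: Dict[str, int],
             rules: Sequence[Rule] = DEFAULT_RULES) -> List[str]:
    """Return one alert message per rule whose threshold is exceeded."""
    alerts: List[str] = []
    for rule in rules:
        value = metrics.get(rule.metric)
        if value is not None and value > rule.threshold:
            alerts.append(f"{table}: {rule.metric}={value} exceeds {rule.threshold}")
    return alerts


if __name__ == "__main__":
    print(evaluate("ods.events", {"snapshot_count": 350}))
```

User-configured rules would simply be passed in place of `DEFAULT_RULES`, so the same check serves both the default and the customized case.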

This monitoring approach allows problems to be detected and resolved early, before they affect queries, ensuring the performance and stability of lake tables.

Data Migration

In terms of data migration, Paimon offers simple and effective tools to migrate historical data from Hive tables to Paimon tables, enabling lake table capabilities. For non-Paimon tables (such as default Hive tables), you can create Paimon tables and complete the migration using INSERT INTO or other data import tools. Paimon supports both in-place migration and A-to-B migration. The latter involves moving Hive files to a temporary directory and then constructing metadata (such as the schema, snapshots, and manifest files) to complete the process. Once migration is complete, the temporary table is renamed to the original table name, achieving a seamless migration with minimal user awareness.

This migration method is efficient: tables with tens of billions of records can be migrated within minutes, with little user perception. After migration, the relevant dependencies and Catalog injection information must be adjusted for computing engines such as Spark or Flink, which completes the task-level migration. The overall process covers both data and tasks, ultimately enabling one-click or low-perception migration of Hive tables to Paimon tables on the platform, thereby activating stream read and write capabilities and reducing computational resource consumption.

Future Outlook

Finally, let's look forward to the future together. Future work will focus on algorithm requirements in AI scenarios, particularly supporting the storage, querying, and processing capabilities of unstructured and semi-structured data in AI training and inference scenarios. We will enhance Paimon's storage and query performance for handling complex data types, such as integrated data.

Additionally, we plan to improve the customization capabilities of the Merge Engine, allowing users to flexibly configure according to their specific needs, breaking through the limitations of existing fixed functions. These improvements aim to better support various special scenarios, such as algorithm journeys, thereby creating greater business value.

With this, our sharing session comes to an end. We hope that the content provided above has brought some inspiration and help. Thank you all for your reading and support!
