
Fluss: Redefining Streaming Storage for Real-time Data Analytics and AI

Explore Apache Fluss, the revolutionary streaming storage solution bridging traditional systems and lakehouse architectures for real-time data analytics and AI.

Introduction

Welcome to an in-depth exploration of Apache Fluss (Incubating), a groundbreaking streaming storage solution designed to revolutionize real-time data analytics and AI. This blog post, inspired by Jark Wu's keynote at Flink Forward Asia Singapore 2025, introduces Fluss as the next-generation streaming storage, meticulously optimized for modern analytical and AI use cases. We will delve into how Fluss effectively bridges the gap between traditional streaming systems and cutting-edge lakehouse architectures, significantly enhancing storage capabilities for machine learning feature engineering and multi-modal AI data ingestion. Join us as we uncover the motivations behind Fluss, its key architectural advantages, and compelling real-world use cases.

The Challenges of Traditional Data Infrastructure


In today's data-driven landscape, Apache Kafka has emerged as the backbone of nearly every streaming data infrastructure. It excels in event-driven communication between microservices and high-throughput log collection. To deliver real-time insights, however, you need to process and transform the streaming data with Apache Flink, and the transformed results are typically written back to Kafka topics across multiple layers (bronze, silver, and gold), a pattern known as the medallion architecture.
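To make the pattern concrete, here is a minimal PyFlink sketch of a single hop from a bronze topic to a silver topic; the topic names, schema, and broker address are hypothetical, and each hop like this one produces yet another full copy of the data in Kafka.

```python
from pyflink.table import TableEnvironment, EnvironmentSettings

# Streaming Table API session; the Flink SQL Kafka connector must be on the classpath.
t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Bronze layer: raw click events landing in a Kafka topic (hypothetical schema).
t_env.execute_sql("""
    CREATE TABLE clicks_bronze (
        user_id BIGINT,
        url     STRING,
        ts      TIMESTAMP(3)
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'clicks_bronze',
        'properties.bootstrap.servers' = 'kafka:9092',
        'scan.startup.mode' = 'earliest-offset',
        'format' = 'json'
    )
""")

# Silver layer: cleaned events written back to another Kafka topic.
t_env.execute_sql("""
    CREATE TABLE clicks_silver (
        user_id BIGINT,
        url     STRING,
        ts      TIMESTAMP(3)
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'clicks_silver',
        'properties.bootstrap.servers' = 'kafka:9092',
        'format' = 'json'
    )
""")

# The transformation itself: every such hop materializes another copy of the data in Kafka.
t_env.execute_sql(
    "INSERT INTO clicks_silver SELECT user_id, url, ts FROM clicks_bronze WHERE url IS NOT NULL"
)
```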

But what happens when you need to perform a key-value lookup for data enrichment? Typically, you end up copying the data into a key-value store like Redis to support that use case. And what if you want to query Kafka topics for data exploration or debugging? Kafka isn’t designed to be queryable. To address this limitation, you’re forced to copy the data again, this time into an OLAP system like ClickHouse. Similarly, if you aim to build a data lakehouse for batch processing, you’ll need to copy the data yet again—this time into a format like Apache Iceberg.

As a result, you end up maintaining multiple copies of the same data across Kafka, Redis, ClickHouse, Iceberg, and potentially other systems. This proliferation of data introduces significant costs, complexity, and operational overhead. Worse still, it creates isolated data silos that are difficult to keep consistent, leading to challenges in data governance and reliability.

Consider the common scenarios where Kafka falls short for analytical needs:

  • KV Lookup for Enrichment: While Flink can join dimension tables, Kafka does not natively support Key-Value lookups. This forces users to copy Kafka data into a separate KV store like Redis, solely for enrichment purposes.
  • Querying and Data Exploration: Kafka topics are not directly queryable, hindering data exploration, debugging, or real-time dashboarding. Consequently, data must be copied again into an OLAP system or a lakehouse for querying.
  • Batch Processing and Data Lakes: To build a data lakehouse for batch processing, yet another copy of the data is often required in formats like Iceberg.

Ultimately, the Kafka topics in such setups often provide zero business value beyond intermediate storage.

They are not designed for querying, lookups, or long-term data retention, acting merely as a black box in the data pipeline. This isn't a fault of Kafka itself, but rather a misuse of its capabilities. Kafka is optimized for operational workloads and event-driven communication, not for the demands of analytical and AI applications. Its lack of built-in schema, absence of update support, and poor optimization for long-term data storage limit its utility in modern data warehouses and analytical use cases.

This realization led to the inception of the Fluss project two years ago, a streaming storage solution built from scratch specifically for analytics and AI. The goal was to address these inherent limitations and provide a unified, cost-effective, and highly performant solution for real-time data processing.

Fluss: A New Paradigm for Streaming Storage

Fluss represents a significant leap forward in streaming storage technology. At its core, Fluss is a streaming storage solution that supports sub-second level latency for both streaming reads and writes. It is architected as a columnar log streaming storage built on top of Apache Arrow. This foundation in Apache Arrow, a columnar format, imbues Fluss with powerful analytical capabilities.

Key advantages of Fluss include:

  • Strong Analytical Capability: By leveraging Apache Arrow, Fluss enables streaming column pruning and streaming partition pruning during read operations. This significantly reduces network costs by avoiding the transfer of unnecessary columns or partitions, making analytical queries highly efficient.
  • High-Performance Real-Time Updates and Lookups: Fluss supports very high-performance real-time updates and lookups, making it an ideal candidate for use as a dimension table with Flink to perform lookup joins (see the sketch after this list). This eliminates the need for separate KV stores like Redis for enrichment purposes.
  • Tiered Storage with Lakehouse Integration: One of the most crucial features of Fluss is its seamless integration with lakehouse as its tiered storage. This means Fluss can maintain hot data in its local storage for rapid access, while cold data is efficiently managed in the lakehouse for cost-effectiveness. Fluss utilizes native lakehouse formats such as Apache Paimon or Apache Iceberg, ensuring that all existing query engines like Spark, Trino, and StarRocks can directly access the cold data in the lakehouse.
  • Union Read for Unified Data Access: Fluss introduces a powerful feature called Union Read, which truly unifies data streams and the data lakehouse. Union Read intelligently combines the hot data residing in Fluss with the cold data in the lakehouse. It first reads from the batch data and then seamlessly switches to streaming data without any duplicates or missing records. Apache Flink already supports Union Read, and integration with StarRocks is actively in progress.
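To make the lookup-join capability above concrete, here is a minimal PyFlink sketch that registers a Fluss catalog, creates a primary-key table, and uses it as a dimension table. The catalog options, bucket setting, and table names are assumptions based on our reading of the Fluss Flink connector documentation and may differ between versions.

```python
from pyflink.table import TableEnvironment, EnvironmentSettings

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Register a Fluss catalog (the coordinator address is hypothetical; the 'fluss'
# catalog type assumes the Fluss Flink connector jar is on the classpath).
t_env.execute_sql("""
    CREATE CATALOG fluss_catalog WITH (
        'type' = 'fluss',
        'bootstrap.servers' = 'fluss-coordinator:9123'
    )
""")
t_env.execute_sql("USE CATALOG fluss_catalog")

# A primary-key table: Fluss keeps it updatable and serves low-latency KV lookups on the key.
t_env.execute_sql("""
    CREATE TABLE IF NOT EXISTS customers (
        customer_id BIGINT,
        name        STRING,
        city        STRING,
        PRIMARY KEY (customer_id) NOT ENFORCED
    ) WITH ('bucket.num' = '4')
""")

# A hypothetical fact stream with a processing-time attribute (datagen keeps the sketch self-contained).
t_env.execute_sql("""
    CREATE TEMPORARY TABLE orders (
        order_id    BIGINT,
        customer_id BIGINT,
        amount      DOUBLE,
        proc_time AS PROCTIME()
    ) WITH ('connector' = 'datagen')
""")

# Standard Flink lookup join, with the Fluss table acting as the dimension table.
t_env.execute_sql("""
    SELECT o.order_id, o.amount, c.city
    FROM orders AS o
    JOIN customers FOR SYSTEM_TIME AS OF o.proc_time AS c
      ON o.customer_id = c.customer_id
""")
```

Because the dimension table lives in Fluss itself, the enrichment lookup hits the same storage the stream is written to, rather than a separately maintained Redis copy.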

With Fluss, enterprises can now achieve a truly real-time streaming lakehouse, characterized by a single, unified copy of their data. This eliminates the need to maintain multiple data copies across disparate systems, drastically reducing costs and simplifying the overall data infrastructure. Fluss provides real-time streaming read and write capabilities for Flink stream analytics, offers KV lookup for dimension joins, enables Union Read for OLAP queries, and supports open lake formats for batch processing. It's important to note that the real-time streaming lakehouse facilitated by Fluss is not a new type of lakehouse; rather, it enhances existing lakehouse architectures with robust streaming capabilities, enabling seamless data sharing between data streams and the data lakehouse.

To achieve this data sharing, Fluss maintains a tiering service that continuously converts Fluss data into lakehouse formats like Iceberg and Paimon. This approach is analogous to database systems that employ multiple data layers (hot, warm, cold) with different storage media and formats, ensuring data consistency across layers. Fluss adopts a similar methodology, leveraging open lake formats as the cold layer, thereby making cold data openly accessible to the broader lakehouse ecosystem, including tools like Spark and Trino.
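As a hedged sketch of what opting a table into this tiering looks like, the snippet below reuses the catalog session from the earlier example. The 'table.datalake.enabled' property reflects our reading of the Fluss documentation and should be treated as an assumption; the cluster must also have the tiering service and a lake storage (Paimon or Iceberg) configured.

```python
# Reusing the Fluss catalog session from the lookup-join sketch above.
# With the lake tier enabled, the tiering service continuously converts this table's
# data into the configured lakehouse format (Paimon or Iceberg).
t_env.execute_sql("""
    CREATE TABLE IF NOT EXISTS click_events (
        user_id BIGINT,
        url     STRING,
        ts      TIMESTAMP(3)
    ) WITH (
        'table.datalake.enabled' = 'true'
    )
""")
```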

In essence, Fluss serves as the real-time data layer for the lakehouse, optimized for storing short-term, sub-second latency data. Conversely, the lakehouse functions as the historical data layer for streams, accommodating long-term, minutes-level latency data. When a stream needs to be read, the lakehouse provides historical data for fast catch-up. When batch analytics are performed, Fluss delivers the freshest data from the past few minutes to the lakehouse, ensuring that lakehouse analytics are truly real-time.
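From a query engine's point of view, this split is largely invisible. The sketch below shows what we understand the documented convention to be, reusing the table above; the behavior of a plain table scan and the `$lake` suffix are our recollection of the Fluss docs and should be treated as assumptions.

```python
# Union Read: scanning the table itself first serves the lake (cold) snapshot and then
# switches to the Fluss (hot) log, with no duplicates or missing records.
t_env.execute_sql("SELECT user_id, url, ts FROM click_events")

# Reading only the lake tier, e.g. for pure batch analytics over historical data.
t_env.execute_sql("SELECT COUNT(*) FROM `click_events$lake`")
```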

Fluss in Production: Scale and Real-World Use Cases

Fluss is not merely a theoretical concept; it is already running in large-scale production at Alibaba, demonstrating its robustness and efficiency in real-world scenarios. Alibaba is actively migrating its internal Kafka workloads to Fluss, particularly for real-time analytics use cases, and adoption continues to grow.

Currently, Fluss manages over 3 PB (petabytes) of data in total, with clusters handling an impressive ingest throughput of 40 GB per second. Furthermore, it supports very high-performance KV lookups, reaching up to 500,000 QPS (queries per second) on a single table, with the largest table containing over 500 billion rows. These statistics underscore Fluss's capability to handle massive data volumes and high-velocity data streams.

Let's delve into some specific use cases from Alibaba's production environment:

Use Case 1: Log Collection and Real-Time Analytics

At Taobao, China's largest online shopping platform, a vast array of logs are collected from applications and websites. These logs encompass critical data such as clickstreams, user behavior, and order streams, forming the foundation for downstream analytics and AI/machine learning initiatives. However, the Taobao team faced significant challenges when using Kafka for this purpose:

  • Data Volume and Retention Costs: Data volume kept growing year over year, so retaining logs beyond the standard three days was prohibitively expensive, even though users wanted longer retention periods.
  • High Network Costs: Log data often follows a 1-write-10-read pattern for a single topic, resulting in very high network traffic and associated costs.

By switching to Fluss, the Taobao team leveraged the shared-data capabilities of the streaming lakehouse. They can now retain long-term data in the lakehouse, reducing the data volume kept in Fluss by 30%. Because Fluss is a columnar streaming store, it also supports column pruning and partition pruning, which allowed them to drastically cut network costs by avoiding the transfer of unnecessary columns and partitions. In total, compared to their previous solution, the Taobao team achieved a 30% reduction in overall cost and a remarkable 70% reduction in read traffic.
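As a rough illustration of where that 70% saving comes from, a consumer that only needs two columns from one day's partition can say exactly that, and Fluss only ships those Arrow columns for that partition over the network; the table and column names below are hypothetical.

```python
# Streaming read with column pruning and partition pruning: only 'user_id' and 'url'
# for the selected date partition are transferred, not the full wide log row.
t_env.execute_sql("""
    SELECT user_id, url
    FROM click_log
    WHERE dt = '2025-07-01'
""")
```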

Use Case 2: Delta Join for Large-Scale Streaming Joins

Streaming join is a fundamental operation in Flink, used to enrich data by joining two streams. This operation typically requires storing all upstream data in Flink's state, which can lead to very large Flink states and associated issues. For instance, in Alibaba's search and recommendation team, they needed to join page clickstreams and order streams for attribution analysis – understanding what a user saw that led to a purchase. The clickstream and order stream data were so massive that they resulted in a 100 TB state size, making the Flink jobs unstable and causing frequent checkpoint timeouts.

To address this, Alibaba introduced Delta Join in Flink, which leverages Fluss's key functionalities: streaming read, change log read, and KV lookups. Delta Join can be conceptualized as a bidirectional lookup join. When data arrives from the left stream, it performs a KV lookup on the right table using the join key. Conversely, when data arrives from the right stream, it performs the same operation on the left table. This approach offers the same semantics as a traditional streaming join but without the need to maintain a large state within the Flink job itself.
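A hedged sketch of what such a job looks like in Flink SQL is shown below: both inputs are Fluss tables keyed on the join columns, which is the shape the Delta Join optimization targets, and under the design contributed to Flink the planner can rewrite an equi-join like this into a bidirectional Fluss lookup join instead of holding both sides in state. The table names, key layout, and exact rewrite preconditions here are assumptions.

```python
# 'page_clicks' and 'orders' are assumed to be Fluss tables whose bucket/primary keys
# match the join keys; when the Delta Join prerequisites are met, the planner can execute
# this as bidirectional KV lookups against Fluss rather than a stateful streaming join.
t_env.execute_sql("""
    SELECT c.user_id, c.item_id, c.click_time, o.order_id, o.pay_time
    FROM page_clicks AS c
    JOIN orders AS o
      ON c.user_id = o.user_id AND c.item_id = o.item_id
""")
```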

Implementing Delta Join with Fluss completely eliminated the 100 TB state size, leading to significantly more stable jobs. Checkpoint times were reduced from 90 seconds to a mere 1 second, and Flink resource usage was cut by an impressive 85%.

Beyond these immediate benefits, a major advantage of Delta Join is the decoupling of state and job logic. Changes to the Flink job no longer necessitate reprocessing the entire state, accelerating Flink job updates. Furthermore, users can directly inspect the joined data within Fluss.

The good news is that Delta Join has been open-sourced and donated to Apache Flink, and it is slated for inclusion in the upcoming Flink 2.1 release. This highlights the community's commitment to integrating and leveraging Fluss's capabilities within the broader Flink ecosystem.

The Future Roadmap of Fluss: Multimodal AI and Open Data

The future roadmap for Fluss is ambitious and exciting, focusing on expanding its capabilities to meet the evolving demands of the AI era. The key areas of development include:

  • Enhanced Streaming Lakehouse Solution: Fluss will continue to enhance its streaming lakehouse capabilities, supporting more open lake formats like Iceberg and Delta Lake. This expansion will also include broader support for various query engines, such as Spark, Trino, and StarRocks.
  • Multimodal AI Integration: A significant focus is on supporting multimodal AI, which involves real-time ingestion of diverse data types (text, images, audio, video) and their integration with Lance, an open format for AI, into the streaming lakehouse.
  • Python Client with PyArrow Integration: Recognizing the growing importance of the Python ecosystem in AI and data science, Fluss plans to release a Python client with deep integration with PyArrow. Given that Fluss uses Arrow as its underlying log format, this integration will be seamless, enabling connections to popular Python libraries like Pandas, Polars, and DuckDB (a hypothetical sketch follows this list).
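Since the Python client is still on the roadmap, the snippet below is purely hypothetical: the `fluss` module, `connect`, and `scan_arrow` names are invented for illustration. What it is meant to show is that once data arrives as Arrow record batches, the rest of the Python ecosystem is one call away.

```python
import duckdb
import pyarrow as pa

# Hypothetical future API (names invented for illustration only):
# import fluss
# conn = fluss.connect("fluss-coordinator:9123")
# clicks = conn.scan_arrow("click_events")      # -> pyarrow.Table of the hot log

# Stand-in Arrow table so the rest of the snippet runs as-is.
clicks = pa.table({"user_id": [1, 2, 1], "url": ["/a", "/b", "/a"]})

df = clicks.to_pandas()                                                         # hand off to Pandas
top = duckdb.query("SELECT url, count(*) AS n FROM clicks GROUP BY url").df()   # or DuckDB over Arrow
print(top)
```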

In the AI era, data infrastructure faces new demands and challenges, primarily centered around three key aspects: multimodal data, streaming data, and open data.

  • Multimodal Data: Unlike traditional analytics that primarily work with structured data, AI applications increasingly rely on unstructured multimodal data, including text, images, audio, and video. This makes multimodal data increasingly crucial in data and AI infrastructure.
  • Streaming Data: AI applications and agents can no longer rely on historical data alone. They require real-time streaming data to make accurate and instantaneous decisions.
  • Open Data: Interoperability is more critical than ever. Similar to analytics, AI also necessitates an open data format. Fluss believes that Lance holds significant potential to become the open format for AI, hence the deep integration plans.

The full picture of the upcoming Fluss release reveals its evolution beyond analytics into a real-time pipeline for multimodal AI.

This will involve supporting real-time ingestion and storage of multimodal data in a streaming format, and then seamlessly converting it into the Lance format. From there, users can connect to the broader Lance ecosystem, including tools like Ray and PyTorch. With the upcoming Fluss Python client, integration with the Python data science ecosystem (Pandas, Polars, etc.) will unlock a multitude of use cases, such as real-time multimodal agents, real-time AI data lakes, and real-time feature engineering.
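As a rough sketch of the Lance end of that pipeline (the Fluss-side ingestion API is not shown because it is still being designed, and the `lance` package usage below is our assumption of a typical setup), Arrow data carrying multimodal payloads can be written to a Lance dataset and then picked up by Lance-aware tools such as Ray or PyTorch data loaders.

```python
import lance
import pyarrow as pa

# Multimodal records as an Arrow table: metadata columns plus a binary column
# standing in for the raw image payloads.
records = pa.table({
    "doc_id": [1, 2],
    "caption": ["a cat on a sofa", "a dog in the park"],
    "image": [b"<raw image bytes>", b"<raw image bytes>"],
})

# Write the batch to a Lance dataset; downstream AI tooling can read it directly.
lance.write_dataset(records, "/tmp/multimodal.lance", mode="overwrite")
print(lance.dataset("/tmp/multimodal.lance").count_rows())
```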

The Open-Source Journey of Fluss

Fluss's open-source journey began with its live open-sourcing at Flink Forward Asia 2024 in Shanghai. Since then, the community has experienced continuous growth, boasting over 1,200 GitHub stars and more than 50 contributors from leading companies worldwide, including Alibaba, ByteDance, eBay, Xiaomi, and Tencent. The project has also maintained a rapid development pace, with three releases in just six months.

Half a year after its initial open-sourcing, Alibaba proudly completed the donation of Fluss to the Apache Software Foundation (ASF) at Flink Forward Asia 2025. Fluss is now an incubating project under the ASF, officially known as Apache Fluss. The new repository can be reached by simply replacing "alibaba" with "apache" in the previous GitHub URL. Joining the Apache Software Foundation is a significant milestone for the Fluss community, marking a new beginning in its open-source journey toward a more open, community-driven project with a brighter future.

Finally, a private preview of the managed service for Apache Fluss (Incubating) has been launched on Alibaba Cloud and is now available in the Singapore region. This offers an early opportunity to explore the next-generation streaming storage; users can apply to join the private preview program on Alibaba Cloud.

Conclusion

Fluss is poised to redefine real-time data analytics and AI by providing a unified, high-performance, and cost-effective streaming storage solution. Its seamless integration with lakehouse architectures, support for multimodal AI, and commitment to open data formats position it as a critical component for future data infrastructures. As Fluss continues its open-source journey under the Apache Software Foundation, it promises to empower developers and enterprises to unlock the full potential of real-time data and AI.
