
In-depth Application of Flink in Ant Group Real-time Feature Store

This article is based on the keynote speech on AI feature engineering given by ZHAO Liangxingyun, a senior technical expert of Ant Group, during Flink Forward Asia 2023.

Abstract: This article contains the following parts:

  1. Architecture of the Ant Group feature store
  2. Real-time feature computing
  3. Feature serving
  4. Feature simulation and tracking

Architecture of the Ant Group Feature Store


The Ant Group feature store is a high-performance AI data processing framework that integrates multiple computing paradigms. It can meet the requirements for low-latency feature output, high-concurrency access, and feature data consistency between online and offline stores in AI training and inference scenarios.

Ant Group built the feature store to enable algorithm engineers to be data self-sufficient. The feature store allows algorithm engineers to develop, test, create, and run features in a low-code manner without the assistance of a dedicated data engineering team.

After a feature starts to run, the feature store automatically completes high-performance real-time feature production tasks and queries, and ensures feature data consistency between offline and online stores, which is transparent to users.

Ant Group started to build a feature store in 2017. By leveraging years of risk management expertise and substantial data insights, Ant Group built its feature store 1.0, which integrates the capabilities of core data products for risk management. The feature store significantly bolstered the business risk management of Ant Group. However, between 2019 and 2020, it proved difficult to expand the feature store across all algorithm-related services of Ant Group. The core reason is that the feature store involved numerous syntaxes specific to risk management businesses: its computing paradigms, including its computing directed acyclic graphs (DAGs), data precision, and operator types, were designed for risk management. In 2020, Ant Group started to rebuild a well-architected feature store.

To date, the feature store has served many businesses of Ant Group, including search and recommendation, microcredit, international risk management, e-commerce banking, finance and insurance, and credit scoring and loyalty (Zhima Credit). The feature store contains more than 100,000 features and handles up to 2 million queries per second (QPS) for online serving and about 1 million transactions per second (TPS) on a daily basis.

To meet the requirements of all involved businesses for features, the feature store must provide the following capabilities:

  1. Quick implementation of computing paradigms: Algorithm engineers focus only on feature requirements, data specifications, and data logic. After algorithm engineers submit different data requirements for heterogeneous scenarios to the feature store, the store can quickly convert these requirements into real-time computing tasks based on the optimal logic. This way, the feature store can quickly implement any computing paradigms based on the business requirements.
  2. Large-scale feature simulation and tracking: The first stage of model training is to prepare samples. If a model is trained based on an offline production environment or newly running features, the attempt to generate a large number of training samples may fail because feature snapshots are not stored in offline stores. For real-time features, the feature store needs to quickly calculate the instantaneous values of historical query requests at historical time points. To achieve this, the feature store must be able to track and calculate real-time feature values based on historical query requests. Therefore, the feature store must support large-scale feature simulation and tracking based on unified batch and stream processing capabilities.
  3. Cold start of real-time features: Models use a wide range of real-time features related to window statistics, such as the number of transactions of a person within 30 days. Model iteration is time-consuming if feature serving is provided only after all window values are accumulated for the running real-time features. To improve model iteration efficiency, the feature store must quickly complement the window values of features and provide online serving immediately after real-time features are defined. To address this, the feature store must support cold start of real-time features.
  4. High-performance feature serving: When a model is running, it must provide a high-performance model inference service based on high-performance data input. In most cases, I/O throughput may cause a performance bottleneck when a model is running. To make the model service more efficient and accurate, the feature store must provide a set of online feature query services with high performance and low latency. Therefore, the feature store needs to have the capability of high-performance feature serving.

To meet the requirements for the above-mentioned capabilities, Ant Group proposed to build a next-generation feature engine architecture, that is, the universal feature engine (UFE)-based architecture. This architecture involves both offline and online data systems. The offline system is used to simulate and track a large number of features. The online system separates the storage of write operations and read operations. A Flink-based real-time data production system is used for write operations. This system can be used together with a large-scale simulation system to build the Skyline architecture. A self-managed SQL engine is used for read operations. This engine is used to perform efficient feature queries for model inference services. The SQL engine is mainly responsible for returning a batch of features to the model services as soon as possible.

For feature serving, a feature insight system is provided to monitor feature quality. The system can monitor the calls and usage durations of features in real time, and can also analyze the content distribution of features. It generates an alert if the content distribution of features drastically changes.
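The article does not detail how the insight system measures distribution change; as a toy illustration, a drift alert can be sketched as a total-variation comparison between a baseline and a current histogram of feature values (the function, value buckets, and threshold are all hypothetical):

```python
from collections import Counter

def distribution_shift(baseline, current, threshold=0.2):
    """Toy drift check: normalize value-bucket frequencies and alert
    when the total variation distance exceeds a threshold."""
    def normalize(values):
        counts = Counter(values)
        total = sum(counts.values())
        return {k: v / total for k, v in counts.items()}
    b, c = normalize(baseline), normalize(current)
    tv = 0.5 * sum(abs(b.get(k, 0) - c.get(k, 0)) for k in set(b) | set(c))
    return tv > threshold, tv

# A feature whose "high" bucket jumps from 10% to 50% trips the alert.
alert, dist = distribution_shift(["low"] * 90 + ["high"] * 10,
                                 ["low"] * 50 + ["high"] * 50)
```

A production system would likely use richer statistics and per-feature thresholds; this only illustrates the alert-on-drastic-change behavior the text describes.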

The unified metadata service at the bottom layer of the architecture abstracts all feature DevOps operations into interfaces. The DevOps operations include R&D, definition, creation, verification, and running of features. The feature management system provided by the feature store is implemented based on these interfaces. Enterprise users can also use these interfaces to build their platform products based on the core data capabilities of the feature store. Although the features running on the feature store are sourced from different configuration platforms, the feature store uses the same feature metadata system to ensure their metadata consistency. This ensures that technological optimizations for production or consumption take effect globally across the feature store. The feature metadata system provided by the feature store works out of the box and has been connected to multiple platforms of Ant Group. If you have sufficient resources and many personalized requirements, you can develop your own product by leveraging the data technologies provided by the feature store.

Real-time feature computing

Challenges of real-time feature computing


Maintaining high performance is the first challenge of real-time feature computing. In Ant Group, a computing task often needs to handle hundreds of thousands or even millions of TPS. It is challenging to ensure the smooth running of such a task with minimal latency and stable output.

Another challenge is that customers want to define data requirements on the feature store without taking the details of data implementation into account. However, the optimal implementation of the same data requirement varies across scenarios, because each scenario has its own needs in terms of resource conditions, data accuracy, response time (RT), and query performance. It is challenging for a real-time feature production system to quickly derive the optimal computing method for each scenario.
Let's consider two scenarios for illustrative purposes.

  1. Risk management: Long-window features account for a large proportion in risk management scenarios. For example, a long-window feature may involve the number of real-time transactions of a user within 90 days or the average number of money transfers of a user within 90 days. Long-window data can comprehensively determine the credibility of users in the risk management field. For risk management, the long-window data and recent data are often compared to check whether sudden behavior changes occur. If a risk is found, data specifications must be immediately changed, and the changes must immediately take effect. Therefore, it is not suitable for obtaining the final result of such feature data on the compute engine to provide serving even if the optimal serving performance is achieved in this case. In real-time computing, it is difficult to store all state data of an ultra-long window in the compute engine. Precomputed key-value pairs cannot be reused to fit the change of data specifications due to sudden behavior changes. If data specifications change, all the precomputed data becomes invalid. Therefore, feature serving based on detailed data or intermediate data is more suitable for risk management scenarios. In summary, the system needs to compute detailed data, hourly bills, or daily bills on the compute engine, store such data, and then aggregate all stored data during feature serving.
  2. Search and recommendation: Real-time features of short windows account for a large proportion in search and recommendation scenarios. Generally, the recent behavior of users can accurately reflect the subsequent consumption intention. However, high feature query performance is required in search and recommendation scenarios. For example, if you want to query 100 features at the same time, the average RT must be within 10 ms, and the long-tail latency cannot exceed 80 ms (RT < 80 ms for 99.99% of tasks). To achieve this, the system must compute the result directly on the compute engine and then store the result as key-value pairs for serving.

The comparison of the two scenarios reveals that similar requirements for real-time features need different optimal implementation approaches in different scenarios. A single computing paradigm and a deployment mode cannot cater to all business needs. Therefore, the feature store must be able to provide scenario-specific optimal implementation approaches to suit the data requirements of users.

Architecture of Skyline

To address the preceding challenges, Ant Group proposed the feature computing architecture Skyline. Skyline receives definitions of real-time features from various platform products through the metadata service. Each definition is expressed as a directed acyclic graph (DAG) of computing requirements. The DAG is instantiated as the optimal computing paradigm by the scenario-specific adaptor layer. For example, if a user wants to calculate the number of logons within seven days, at the adaptor layer Skyline determines whether to generate a key-value pair, or to calculate daily bills, store them, and aggregate them temporarily during a feature query. Then, the computing optimization module shared by streaming and batch tasks compiles the instantiated DAG into tasks, performs logic optimizations such as filter push-up and column pruning, and then normalizes the tasks. The results are logical execution plans that describe data processing requirements and can be shared by streaming and batch tasks. The logical execution plans then undergo independent optimizations for the batch and stream processing scenarios and are converted into physical jobs for deployment.


Skyline involves three key stages: computing inference, computing normalization, and computing deployment.
The scenario-based rule plug-in instantiates the DAG into different computing tasks based on AGG operators and the length of the time window. For example, the HOP function is used to aggregate data of a window whose length is less than one day, and the TUMBLE function is used to calculate daily bills for a window whose length is greater than one day. Secondary aggregation is then performed on the daily bills of multiple days during feature serving.
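The rule described above can be sketched as a small selection function. Note that `choose_plan`, the returned plan dictionaries, and the one-day cut-off encoding are illustrative assumptions, not Skyline's actual interface:

```python
from datetime import timedelta

def choose_plan(window, agg):
    """Hypothetical rule plug-in: windows shorter than one day are
    aggregated directly (HOP-style) into a key-value result, while
    longer windows are split into daily bills (TUMBLE-style) that are
    re-aggregated at serving time."""
    if window < timedelta(days=1):
        return {"engine": "HOP", "serving": "key-value lookup"}
    return {"engine": "TUMBLE daily bills",
            "serving": f"secondary {agg} over bills"}
```

For instance, a 1-hour count would be computed fully on the engine, while a 7-day count would be served by aggregating seven daily bills at query time.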


Skyline performs filter push-up, column pruning, and normalization (node order adjustment and link compression) on computing tasks to form a logical execution plan that consists of core skeleton nodes. Then, Skyline deploys the computing tasks. If absolute task isolation is required to prevent mutual impact between computing tasks in a scenario, the normalized logical execution plan is converted into a Flink SQL task. To maximize cluster resource utilization when computing resources are insufficient, Skyline searches all computing metadata of the current cluster for physical tasks that have the same skeleton structure as the logical execution plan. If such a physical task exists, the plan is merged into the existing physical task. Otherwise, Skyline creates a new physical task. Physical tasks are written with the stream API and can automatically load new computing policies without requiring a restart.


In Flink, the most direct optimization is to reduce the state size: a smaller state means higher task stability. This way, real-time computing tasks with huge workloads can stably output data with low latency. A large number of homogeneous sliding window features exist in the business scenarios of Ant Group. A sliding window feature is the aggregation value of a specific behavior from the current time back over a previous period. Homogeneous means that the computational logic of the tasks is the same but the window lengths differ. If the native HOP function of Flink is used for such sliding windows, computing resources grow without bound, and I/O explosion may occur when the result data is exported to the external storage system.

Therefore, the sliding window state is restructured into fixed panes. When data arrives at the window, it is merged with the data in a fixed pane whose length equals the sliding step of the window. When data is flushed out of the window, secondary aggregation is performed on the data in the panes. This significantly reduces the state size of the computing task that uses the sliding window, and homogeneous computing can be performed based on the same state. The original data flush mechanism of sliding windows is also changed: if two consecutive sliding windows produce the same data, the latter window's data is not flushed out, because only the latest window is checked during feature serving and unchanged data does not need to be rewritten.
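A minimal sketch of the pane-based rewrite, assuming a fixed pane granularity and a sum aggregation. The class, its method names, and the 60-second pane are hypothetical, not the engine's actual state layout:

```python
from collections import defaultdict

PANE = 60  # hypothetical pane length in seconds

def pane_of(ts):
    return ts // PANE

class PaneWindow:
    """Sketch: events are merged into fixed panes, and a sliding-window
    value is obtained by a secondary aggregation over the last
    `n_panes` panes, so the state holds one partial sum per pane
    instead of every raw event."""
    def __init__(self, n_panes):
        self.n_panes = n_panes
        self.panes = defaultdict(float)  # pane index -> partial sum
        self.last_emitted = None

    def add(self, ts, amount):
        # merge the incoming event into its fixed pane
        self.panes[pane_of(ts)] += amount

    def value(self, now):
        # secondary aggregation over the panes covering the window
        cur = pane_of(now)
        return sum(self.panes[p] for p in range(cur - self.n_panes + 1, cur + 1))

    def flush(self, now):
        # mirror the modified flush mechanism: skip the emit when two
        # consecutive windows would produce the same data
        v = self.value(now)
        if v == self.last_emitted:
            return None
        self.last_emitted = v
        return v
```

The design choice here is the same trade-off the text describes: a little extra work at read/flush time (summing a handful of panes) in exchange for state that grows with the number of panes rather than the number of events.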


Cold start of features leverages unified batch and stream processing of Flink. The production logic of real-time features is converted into an equivalent Flink batch SQL task. Before a streaming task is submitted, the Flink batch SQL task is submitted to supplement historical data. Then, the streaming task is reset to 00:00 to merge the data of both batch and streaming tasks.
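The batch-plus-stream merge can be illustrated with a toy count feature. `cold_start_counts` and the event tuples are hypothetical, and `cutoff` plays the role of the 00:00 reset point: the batch job owns everything before it, the streaming job everything from it onward:

```python
def cold_start_counts(batch_events, stream_events, cutoff):
    """Sketch of the cold-start merge: an equivalent batch job
    supplements historical data up to `cutoff`, and the streaming job,
    reset to the same point, contributes everything after it, so no
    event is counted twice."""
    counts = {}
    for user, ts in batch_events:
        if ts < cutoff:  # history supplied by the batch backfill
            counts[user] = counts.get(user, 0) + 1
    for user, ts in stream_events:
        if ts >= cutoff:  # live data supplied by the streaming task
            counts[user] = counts.get(user, 0) + 1
    return counts
```

The single cutoff is what makes the two halves safely mergeable: each event belongs to exactly one side.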

Feature serving


Feature serving supports feature queries for online model inference. In actual scenarios, upper-layer services have strict requirements on feature query performance. A query request contains hundreds of features whose data may be scattered in different storage systems due to the complexity of the data link. The average RT must be less than 10 ms, and the response time for 99.99% of requests must be less than 100 ms. The UFE-serving engine is tailored to achieve low RT and low long-tail latency in the case of a large number of requests and highly concurrent accesses.


The UFE-serving engine involves the following layers:

  • Presentation layer: The top layer is the presentation layer of features. SQL is used as the main feature presentation method. It allows users to define the temporary conversion and secondary processing of data after the data is queried from a storage system with SQL statements. SQL offers the following benefits:
  1. SQL is a common and easy-to-use domain-specific language (DSL).
  2. SQL is used to define the descriptions of data serving and computation. This way, computation and queries can be flexibly deduced and split for the same real-time feature based on actual scenarios.
  3. Since features are described in SQL, optimizations made to the SQL engine can be immediately applied across the global feature execution process.
  • I/O optimization layer: The underlying heterogeneous storage is shielded at the I/O optimization layer. Storage is abstracted as views, and the tables referenced in feature SQL statements are UFE views. In batch feature queries, the UFE-serving engine handles I/O extraction and data merging, and optimizes concurrency across the SQL operations.
  • I/O instance layer: The I/O instance layer is the bottom layer and is used to connect to any storage system. A new storage system can be incorporated into the feature serving system as long as a connector instance is implemented based on the connector published by the UFE-serving engine.
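The connector contract might look roughly like the following. The interface shape and the in-memory example are assumptions for illustration, not the actual connector API published by the UFE-serving engine:

```python
from abc import ABC, abstractmethod

class Connector(ABC):
    """Hypothetical shape of a UFE connector: a storage system joins
    feature serving by implementing a single read interface."""
    @abstractmethod
    def fetch(self, view, keys, columns):
        """Return {key: {column: value}} for the requested rows."""

class InMemoryConnector(Connector):
    """Toy instance backed by a nested dict, standing in for a real
    storage system such as a key-value store."""
    def __init__(self, data):
        self.data = data  # {view: {key: {column: value}}}

    def fetch(self, view, keys, columns):
        return {k: {c: self.data[view][k][c] for c in columns} for k in keys}
```

Because the engine only depends on this narrow interface, adding a new storage system does not require changes to the presentation or I/O optimization layers.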


The following section describes the I/O optimization process in a batch feature query.
The UFE-serving engine hierarchically abstracts data. The following sample code shows an example of feature-related SQL statements:

select sum(amount) as total_amount_24H
from trade_table
where gmt_occur between now()-24H and now();

In the SQL statements, trade_table specifies a view. A storage system can back multiple views, and a view can back multiple features. The UFE-serving engine builds a globally optimal I/O plan for all feature-related SQL statements in a batch feature query. During this process, the UFE-serving engine traverses all feature-related SQL statements to collect information about the columns and windows of the views, and applies an I/O classification and merging algorithm. The algorithm classifies I/O operations by view storage type; reads of the same storage type that target different columns of the same row, or different rows of the same table, are merged into one I/O operation. The scan range of a single I/O operation is also narrowed based on information such as the valid columns and window ranges collected from the SQL statements. This reduces the number of interactions between the query engine and the storage systems during a single feature serving process, and reduces the scope of each data scan. After concurrent queries are performed on the different storage systems, the engine splits the query results back into the individual features.
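The classification-and-merging step can be sketched as grouping reads by storage, view, and row key, so that columns needed by different features on the same row collapse into a single I/O operation. The tuple layout and function name are hypothetical; the real plan also narrows scan ranges using the collected window information:

```python
from collections import defaultdict

def build_io_plan(feature_reads):
    """Sketch of I/O classification and merging: each read is a
    (storage, view, key, column) tuple; reads sharing the same
    storage, view, and key become one I/O operation that fetches
    the union of the required columns."""
    merged = defaultdict(set)  # (storage, view, key) -> required columns
    for storage, view, key, column in feature_reads:
        merged[(storage, view, key)].add(column)
    return {k: sorted(cols) for k, cols in merged.items()}

# Three feature reads collapse into two I/O operations: the two
# hbase reads hit the same row, so they are merged.
reads = [
    ("hbase", "trade_table", "user_1", "amount"),
    ("hbase", "trade_table", "user_1", "gmt_occur"),
    ("redis", "login_table", "user_1", "login_cnt"),
]
plan = build_io_plan(reads)
```

After the merged fetches return, the engine would split each row's columns back out to the features that requested them.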

Through I/O merging optimization and the built-in technologies of the UFE-serving engine, such as automatic hotspot discovery and high-concurrency optimization, the long-tail rate of feature queries (the proportion of requests that exceed the latency threshold) is kept below 0.01%, and the average RT remains notably low.

Feature simulation and tracking


The value of either an online or offline real-time feature changes over the timeline. Feature simulation is used to calculate the instantaneous value of a feature at all historical time points based on a historical driving table (historical feature query traffic) and a historical information table (historical transaction events). In risk management and consumer credit scenarios, time travel computing is necessary because the impact of new features on online transactions must be fully evaluated for strategy adjustments or iterations on new models. In most cases, sample data required in these scenarios spans over half a year.


If a user writes SQL statements in their own data warehouse and the amount of data is small, the point-in-time (PIT) value can be calculated. However, if the historical driving table contains tens of billions of data records, a large amount of data is shuffled and joined. This results in serious data bloat. In this case, no native computing method of a compute engine can complete data computing within a short period of time.


The core challenge in feature simulation is to ensure the performance and stability of large-scale data computing based on PIT semantics.


The preceding flowchart shows the core process of feature simulation. The engine performs data pre-pruning based on the driving table, feature logic, and event table to remove the events in the event table that are never used. After pre-pruning, the engine splits detailed data into hourly and daily bills and adds time partitions to the detailed data for subsequent pruning. At the same time, the engine splits the driving table by time partition to build multiple simulation computing tasks that can run in parallel. Then, the engine performs secondary aggregation on the driving table and the intermediate bills to calculate the final feature result.

During secondary aggregation, the engine calculates the start time and end time of the feature window based on the data in the driving table. Using the calculated window information, the engine merges the daily bills with the hourly bills at both ends of the window, and finally merges the details at both ends of the hourly bills. This is because the output of simulation computing is accurate to milliseconds, which is consistent with the online data output. This join method outperforms a native join statement written by users because the detailed data carries time partitions from the preceding processing: when the engine reads the detailed data, it prunes a large amount of data based on the hourly and daily partitions to which the data belongs. After data splitting optimization and secondary aggregation are complete, the feature store can perform large-scale PIT computing. For example, the feature store can produce features within 24 hours for 10 billion data records generated in a 90-day window.
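The secondary aggregation over bills and edge details can be sketched for a sum feature as follows. The bill layout, function name, and second-granularity timestamps are simplifying assumptions (the real output is accurate to milliseconds and also uses hourly bills between the daily bills and the details):

```python
def pit_sum(query_ts, window, daily_bills, details, day=86400):
    """Sketch of PIT secondary aggregation: whole days inside
    [query_ts - window, query_ts) come from precomputed daily bills,
    and the partial days at both ends are filled from detailed events,
    which carry time partitions enabling heavy pruning."""
    start = query_ts - window
    first_full = -(-start // day)   # first day fully inside the window
    last_full = query_ts // day     # the day containing query_ts is partial
    # whole days: read cheap precomputed bills
    total = sum(amt for d, amt in daily_bills.items()
                if first_full <= d < last_full)
    # edge days: fall back to detailed events inside the window
    for ts, amt in details:
        d = ts // day
        if start <= ts < query_ts and not (first_full <= d < last_full):
            total += amt
    return total
```

The point of the split is that only the two edge days ever touch detailed data; everything in between is a handful of bill lookups, which is what makes 90-day windows over tens of billions of records tractable.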

Apache Flink Community