FeatureStore overview - feature management for ML - Platform For AI

FeatureStore is a centralized platform for storing, managing, and sharing feature data for machine learning and AI training, with automatic offline-online consistency.

What is FeatureStore?

FeatureStore stores and manages feature data for offline and online services. It integrates with DataHub, Flink, Hologres, Tablestore, and FeatureDB, a feature database for search and recommendation.

Applications ingest behavioral logs and real-time properties from DataHub, sync them to MaxCompute, process them with Flink, and write the results to an online store through FeatureStore. Recommendation, user growth, and risk control applications then call the FeatureStore SDK to access features in the online store.

The following diagram shows the data flow from ingestion through feature calculation and sample management to online store publication.

Key concepts

Feature entity: A named collection of feature tables. For example, a recommendation scenario typically defines two feature entities: user and item.
Feature view: A group of features with information about derived features. It maps an offline feature table to an online feature table.
Join ID: A field that links a feature view to a feature entity. Each entity has a Join ID that connects features across multiple views. For example, in a recommendation scenario, set the Join IDs to user_id and item_id, which are the primary keys of the user and item tables.
Label table: Stores model training labels, including the training target and the feature entity Join ID. In recommendation scenarios, it is typically derived from a behavior table using group by user_id/item_id/request_id.

Note
Each feature view has a primary key, or index key, used to retrieve feature data. The index key name can differ from the Join ID name.
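
The relationships between these concepts can be sketched with plain Python data classes. This is an illustrative model only, not the FeatureStore SDK: the class names, field names, and view names below are hypothetical and chosen just to show how a Join ID ties several feature views to one entity while each view keeps its own index key.

```python
from dataclasses import dataclass, field


@dataclass
class FeatureEntity:
    """Hypothetical model of a feature entity (not an SDK class)."""
    name: str
    join_id: str  # field that links feature views to this entity


@dataclass
class FeatureView:
    """Hypothetical model of a feature view registered under an entity."""
    name: str
    entity: FeatureEntity
    index_key: str  # primary key of the view; may differ from the Join ID
    features: list = field(default_factory=list)


user = FeatureEntity(name="user", join_id="user_id")
item = FeatureEntity(name="item", join_id="item_id")

# Two views under the same entity; the second uses a different index key name.
user_profile = FeatureView("user_profile", user, index_key="user_id",
                           features=["age", "city"])
user_stats = FeatureView("user_stats", user, index_key="uid",
                         features=["click_cnt_7d"])


def views_for(entity, views):
    """Return the names of the views linked to an entity through its Join ID."""
    return [v.name for v in views if v.entity.join_id == entity.join_id]


linked = views_for(user, [user_profile, user_stats])
print(linked)
```

Both views resolve to the user entity even though user_stats is keyed by uid, which mirrors the note that the index key name can differ from the Join ID name.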

Use cases

Recommendation systems and ad sorting: Centrally manage user and item features, such as browsing history, purchase records, and personas. Real-time read/write improves model performance and ad delivery accuracy.
Search engine sorting: Feature data includes keyword match, CTR, and sales volume. Train a sorting model to rank recall results from search engines such as Elasticsearch or OpenSearch. Use a TensorFlow model scoring service in EAS for personalized search results.
User growth or risk control: Manage user profiles, transaction behavior, and credit records. Combine with ML models (XGBoost, GBDT) for risk assessment.
Offline KV data synchronization to online store: Manage product and user attribute tables. Simplifies offline-to-online data synchronization scheduling.

Key features

Diverse data sources

Manage the entire process from features to models. Register and manage feature tables from multiple offline and online data sources:

Offline store: MaxCompute
Online stores: FeatureDB, Hologres, and Tablestore

Benefits of registering a feature table in FeatureStore:

Automatic synchronization: Build and sync online and offline tables to ensure data consistency.
Cost savings: Store features once and share among multiple teams to reduce resource costs.
Improved efficiency: Export training tables or import data to an online database with a single line of code.

Management of offline and real-time features

Manage offline feature views and real-time feature views. Offline features cover user and item attributes and statistics. Real-time features include new users or items written to an online store (such as Hologres) through Flink, and time-window statistics such as clicks, forwards, purchases, and conversion rate within one hour.

Real-time statistical features and user sequence features

Model feature complexity and real-time requirements grow over time. Manage real-time statistical features and user behavior sequence features computed by Flink. Define offline user sequence features, such as the item IDs a user clicked. Models often need item attribute features (SideInfo), but transmitting SideInfo over the network consumes substantial bandwidth. In EasyRec, the FeatureStore SDK caches item features to reduce inference response time and improve inference performance.
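
The idea of caching item features so that SideInfo for a user's click sequence is resolved locally can be sketched as follows. This is a minimal illustration, not the SDK's actual cache: the LRU eviction policy, the ItemFeatureCache class, and the simulated remote store are all assumptions made for the example.

```python
from collections import OrderedDict


class ItemFeatureCache:
    """Small LRU cache; a stand-in for an in-process item feature cache."""

    def __init__(self, loader, capacity=1000):
        self._loader = loader      # called only on a cache miss
        self._capacity = capacity
        self._data = OrderedDict()
        self.misses = 0

    def get(self, item_id):
        if item_id in self._data:
            self._data.move_to_end(item_id)  # mark as recently used
            return self._data[item_id]
        self.misses += 1
        value = self._loader(item_id)
        self._data[item_id] = value
        if len(self._data) > self._capacity:
            self._data.popitem(last=False)   # evict the least recently used
        return value


# Hypothetical remote online store, simulated with a dict.
REMOTE_ITEMS = {"i1": {"category": "shoes"}, "i2": {"category": "books"}}


def load_item(item_id):
    return REMOTE_ITEMS.get(item_id, {})


cache = ItemFeatureCache(load_item, capacity=2)
click_sequence = ["i1", "i2", "i1", "i1"]  # item IDs a user clicked
side_info = [cache.get(i) for i in click_sequence]
```

Only the first lookup of each distinct item reaches the simulated store; repeated items in the sequence are served from memory, which is the effect the caching is meant to achieve.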

Automatic feature association and model sample export

Manage generated samples using PAI-FeatureStore. When a model uses features from a real-time feature view, use the Create Model Feature function to generate correct samples from the real-time update information in FeatureDB. This associates real-time features without requiring a callback interface in the PAI-Rec engine.

Feature sharing

When an algorithm or BI developer creates a new set of user or item features, design a new ModelFeature to associate features required by the training dataset. Export samples for offline training and publish them to an online store using the FeatureStore SDK. Multiple models referencing the same feature view share a single online copy, simplifying iterative model optimization.

Multi-language SDKs

FeatureStore provides Go, Java, and Python SDKs for the joint solution of PAI-Rec and EasyRec Processor. Use the Java SDK to call EasyRec Processor or other model scoring engines from your own server-side engines (search, recommendation, risk control). Use the Python SDK for data analytics and modeling against online stores.

Feature generation SDK

Define features with a Python script, run it, and register the output on PAI-FeatureStore. The feature generation SDK is an independent, open-source tool based on MaxCompute SQL that simplifies feature generation. It uses day-level intermediate data, significantly reducing compute costs for calculations such as 30-day user preference statistics.
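
Why day-level intermediate data lowers the cost of a 30-day statistic can be shown with a small pure-Python sketch. It does not use the feature generation SDK or MaxCompute SQL; the function names and sample events are hypothetical. The point is that raw events are aggregated once per day, and each 30-day window then sums at most 30 small daily results instead of rescanning all raw events.

```python
from collections import defaultdict
from datetime import date, timedelta


def daily_click_counts(events):
    """Aggregate raw (day, user_id) click events into per-day counts.

    This plays the role of the day-level intermediate table: it is
    computed once per day and reused by every window that covers that day.
    """
    daily = defaultdict(lambda: defaultdict(int))
    for day, user_id in events:
        daily[day][user_id] += 1
    return daily


def window_counts(daily, end_day, days=30):
    """Sum the per-day counts over the window ending at end_day (inclusive)."""
    totals = defaultdict(int)
    for offset in range(days):
        day = end_day - timedelta(days=offset)
        for user_id, count in daily.get(day, {}).items():
            totals[user_id] += count
    return dict(totals)


events = [
    (date(2024, 1, 1), "u1"),
    (date(2024, 1, 1), "u1"),
    (date(2024, 1, 15), "u2"),
    (date(2024, 1, 30), "u1"),
    (date(2024, 1, 31), "u1"),
]
daily = daily_click_counts(events)
clicks_30d = window_counts(daily, end_day=date(2024, 1, 30), days=30)
```

Moving the window forward by one day only requires the new day's aggregate; the earlier daily results are reused unchanged.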

EasyRec recommendation engine integration

FeatureStore integrates deeply with EasyRec and TorchEasyRec for efficient feature engineering (FG) and model training. Deploy models directly to EasyRec Processor and TorchEasyRec Processor to build high-performing recommendation systems. EasyRec provides an in-memory cache for item feature tables and efficient model scoring.

The FeatureStore Cpp SDK integrated into EasyRec Processor is optimized for large-scale scenarios. Benefits:

Memory usage: The built-in FeatureStore Cpp SDK in EasyRec Processor optimizes feature storage, saving 50% of memory compared to native caching. Savings increase with more features.
Feature pull time: Offline feature views cache to memory over 5x faster than online data sources, reducing pressure on online stores. The stable offline source supports scaling to hundreds of EAS instances, each loading all features within minutes. Scaling out does not put significant pressure on the online store.
Model scoring time: Model scoring extracts features in real time from the optimized cache. FeatureStore Cpp SDK optimizations improve tp100 performance, enhance stability, and reduce timeouts.

How it works

Connect offline and online storage for unified feature data management.
Register feature tables in feature views to aggregate and map feature data.
Store and register label tables in MaxCompute through the offline data source.
Use Join IDs to associate feature views across projects and link all entity features. Combine with label tables to produce Train Set tables in MaxCompute.
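
The last step, combining a label table with feature views through Join IDs, can be sketched in plain Python. In FeatureStore this join produces a table in MaxCompute; the in-memory dictionaries, the build_train_set function, and the sample rows below are hypothetical and only illustrate the join logic.

```python
def build_train_set(label_rows, feature_views):
    """Join each label row with feature views through their Join IDs.

    feature_views is a list of (join_id, table) pairs, where table maps
    a Join ID value to a dict of features.
    """
    train_rows = []
    for label in label_rows:
        row = dict(label)
        for join_id, table in feature_views:
            # A missing key adds no feature columns here; a real left
            # join would emit NULLs for those columns instead.
            row.update(table.get(label[join_id], {}))
        train_rows.append(row)
    return train_rows


user_view = {"u1": {"age": 30}, "u2": {"age": 25}}
item_view = {"i1": {"price": 9.9}}

labels = [
    {"request_id": "r1", "user_id": "u1", "item_id": "i1", "click": 1},
    {"request_id": "r2", "user_id": "u2", "item_id": "i9", "click": 0},
]

train_set = build_train_set(
    labels, [("user_id", user_view), ("item_id", item_view)]
)
```

Each output row carries the label, both Join IDs, and whatever features the views hold for those IDs, which is the shape of a training dataset row.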

Supported regions

Available regions:

Area	Region
Asia-Pacific	China (Hangzhou), China (Shanghai), China (Beijing), China (Shenzhen), China (Hong Kong), Singapore, Indonesia (Jakarta)
Europe and America	Germany (Frankfurt), US (Silicon Valley), US (Virginia)

Get started

Create a data source. Data sources include offline stores and online stores.
Create a project. Create feature entities, feature views, and label tables to produce a model feature train set table (training dataset).
Run a data synchronization task to synchronize offline data to an online store.
After you start the synchronization task, view the task status and details in Task Hub.
To read and use online data in a Java or Go online service, join DingTalk group 34415007523 to contact technical support.