FeatureStore overview

FeatureStore is a centralized data management and sharing platform in Platform for AI (PAI). You can use FeatureStore to organize, store, and manage feature data for machine learning and AI training. FeatureStore lets you easily share features among multiple people and teams, ensures the consistency of offline and online feature data, and provides efficient access to online features.

What is FeatureStore?

FeatureStore is a feature management tool in PAI. You can use it to store and manage feature data in offline and online services.

FeatureStore integrates with Alibaba Cloud services such as DataHub, Flink, Hologres, and Tablestore. It also provides FeatureDB, a feature database built specifically for search and recommendation workloads. Applications can receive behavioral logs, item properties, and real-time user properties from DataHub and synchronize them directly to MaxCompute. The data can also be processed by Flink and written to the corresponding online store through FeatureStore. Applications for recommendation systems, user growth, and risk control can then call the FeatureStore SDK to access feature data in the online store.

The following figure shows the process of ingesting data from MaxCompute and DataHub, processing it through feature computing and model sample management, and publishing it to an online store for client applications.

Terms

Feature entity: A feature entity (FeatureEntity) is a named collection of related feature tables. For example, in a recommendation scenario, you can define two feature entities, user and item, because every feature table describes either users or items.
Feature view: A feature view (FeatureView) is a set of features. It contains information about a group of features and their derived features. A feature view is a subset of the full feature set of a feature entity. It is a mapping between an offline feature table and an online feature table.
Join Id: A Join Id is a feature table field that associates a feature view with a feature entity. Each feature entity has a Join Id. You can use the Join Id to associate features from multiple feature views.
Note
Each feature view has a primary key (index key) to retrieve its feature data. However, the index key of the feature view can have a different name from the Join Id.
For example, in a recommendation scenario, you can configure the Join Id to be user_id and item_id, which are the primary keys of the user and item tables.
Label table: A label table is the table that contains the labels for model training. It includes the model training target and the Join Id of the feature entity. In a recommendation scenario, this table is usually generated from a behavior table using operations such as GROUP BY user_id, item_id, request_id.
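
As a rough illustration of how a label table can be derived from a behavior table, the following sketch groups behavior rows by (user_id, item_id, request_id) and marks a sample as positive if any click occurred for that key. The row layout, the event values "expose" and "click", the is_click column, and the build_label_table helper are all assumptions made for this example. In practice, this step would typically be written as a GROUP BY query in MaxCompute.

```python
from collections import defaultdict


def build_label_table(behavior_rows):
    """Collapse behavior rows into one labeled row per (user_id, item_id, request_id).

    Assumption: each row is a dict with user_id, item_id, request_id, and an
    `event` field that is either "expose" or "click" (hypothetical schema).
    """
    clicked = defaultdict(bool)
    for row in behavior_rows:
        key = (row["user_id"], row["item_id"], row["request_id"])
        # The key is created on first sight; the flag turns True once a click is seen.
        clicked[key] = clicked[key] or row["event"] == "click"
    return [
        {"user_id": u, "item_id": i, "request_id": r, "is_click": int(flag)}
        for (u, i, r), flag in sorted(clicked.items())
    ]


behavior = [
    {"user_id": "u1", "item_id": "i1", "request_id": "r1", "event": "expose"},
    {"user_id": "u1", "item_id": "i1", "request_id": "r1", "event": "click"},
    {"user_id": "u1", "item_id": "i2", "request_id": "r1", "event": "expose"},
]
labels = build_label_table(behavior)
```

Each output row carries the Join Ids of both feature entities (user_id and item_id) plus the training target, which is the shape the label table definition above describes.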

Scenarios

Recommendation systems and ad sorting: You can use FeatureStore to centrally manage user and item features, including browsing history, purchase records, and user personas. Because FeatureStore supports real-time reads and writes, models can score with up-to-date features, which improves the accuracy and effectiveness of ad delivery.
Search engine sorting: Feature data in this scenario includes keyword relevance, click-through rate, and sales volume. You can use FeatureStore to train a sorting model that sorts the results recalled by search engines such as Elasticsearch or OpenSearch. The recalled results are sent to a TensorFlow model service in EAS for scoring. This provides users with more accurate and personalized search results based on their search intent and preferences.
User growth or risk control: You can use FeatureStore to manage feature data such as user personal information, transaction behavior, and credit records. You can combine this data with machine learning models, such as XGBoost and GBDT, to perform risk assessments. This improves the accuracy and efficiency of risk control.
Offline key-value (KV) data synchronization to an online store: You can use FeatureStore to manage feature data such as product attribute tables and user attribute tables. This simplifies the scheduling tasks for synchronizing offline data to an online store.

Features

Support for diverse data sources

FeatureStore manages the entire workflow from features to models. It supports multiple offline and online data sources where you can register and manage feature tables.

The following data sources are supported:

Offline store: MaxCompute
Online stores: FeatureDB, Hologres, Tablestore

After you register a feature table in FeatureStore, you gain the following advantages:

Automatic synchronization: FeatureStore automatically builds online and offline tables to ensure data consistency.
Cost savings: You can store only one copy of a feature and share it across multiple teams to reduce resource costs.
Increased efficiency: FeatureStore saves time. Complex operations, such as exporting training tables or importing data to online stores, can be completed with a single line of code.

Management of offline and real-time features

FeatureStore can manage offline feature views and real-time feature views. Offline features include attribute features and statistical features of users and items. Real-time features include new users or new items that are written directly to an online store such as Hologres through Flink. They also include features calculated over a time window, such as clicks, forwards, purchase quantity, and conversion rate within one hour.
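
To make the idea of a time-window feature concrete, the following sketch counts clicks per item over a trailing one-hour window. It is a plain Python illustration only: in the pipeline described here, this computation would run in Flink, and the HourlyClickCounter class and its methods are hypothetical names. The sketch assumes timestamps are in seconds and arrive in non-decreasing order for each item.

```python
from collections import deque


class HourlyClickCounter:
    """Counts clicks per item over a trailing one-hour window (illustrative only)."""

    WINDOW_SECONDS = 3600

    def __init__(self):
        self._events = {}  # item_id -> deque of click timestamps in seconds

    def add_click(self, item_id, ts):
        self._events.setdefault(item_id, deque()).append(ts)

    def clicks_last_hour(self, item_id, now):
        q = self._events.get(item_id, deque())
        # Drop timestamps outside the half-open window (now - 3600, now].
        while q and q[0] <= now - self.WINDOW_SECONDS:
            q.popleft()
        return len(q)


counter = HourlyClickCounter()
for ts in (0, 1800, 3500):
    counter.add_click("i1", ts)
```

The value returned by clicks_last_hour is the kind of real-time statistical feature that would be written to the online store and read at scoring time.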

Management of real-time statistical features and user behavior sequences

The complexity and real-time requirements of model features generally increase over time. Therefore, managing the real-time statistical features and user behavior sequences calculated by Flink is essential. FeatureStore defines offline user behavior sequences, such as the sequence of item IDs that a user has clicked. Item ID sequences alone are not enough, because models often also use item attribute features (SideInfo). Transmitting SideInfo online consumes a large amount of bandwidth. In EasyRec, you can use the FeatureStore SDK to cache item features. This reduces inference response time and improves inference performance.

Automatic association and model sample export

You can use PAI-FeatureStore to manage the generated samples. If a model uses features from a real-time feature view, you can use the Create Model Feature function, which automatically generates correct samples based on the real-time feature update information recorded in FeatureDB. Real-time features are associated automatically, so you do not need to deploy a callback interface in the PAI-Rec engine.
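
The text does not describe how samples are matched to recorded feature updates. One common way to build correct samples from an update log is a point-in-time lookup: for each label row, take the latest feature value that was written at or before the event time. The sketch below illustrates that general technique under this assumption. It is not the FeatureStore implementation, and the update_log layout, the clicks_1h feature name, and both helper functions are hypothetical.

```python
from bisect import bisect_right


def feature_as_of(updates, ts):
    """Return the latest value whose update time is <= ts, or None if there is none.

    `updates` is a list of (update_ts, value) pairs sorted by update_ts.
    """
    times = [t for t, _ in updates]
    idx = bisect_right(times, ts)
    return updates[idx - 1][1] if idx else None


def build_samples(label_rows, update_log):
    """Attach the point-in-time feature value to each label row."""
    samples = []
    for row in label_rows:
        history = update_log.get(row["user_id"], [])
        samples.append({**row, "clicks_1h": feature_as_of(history, row["ts"])})
    return samples


# Hypothetical update log: user_id -> [(update_ts, feature_value), ...]
update_log = {"u1": [(100, 1), (200, 3), (300, 7)]}
label_rows = [
    {"user_id": "u1", "ts": 250, "label": 1},
    {"user_id": "u1", "ts": 50, "label": 0},
]
samples = build_samples(label_rows, update_log)
```

Looking up values as of the event time keeps a sample from seeing feature values that were written after the labeled event, which is the main risk when joining real-time features into training data.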

Sharing of new and old features

When an algorithm or business intelligence (BI) developer creates a new set of user or item features, you can design a new ModelFeature to associate the new and old features required by the training dataset. You can use the FeatureStore SDK to export samples for offline training. You can also use the FeatureStore SDK to publish them to an online store for online services. If multiple models reference the same feature view, only one copy is stored online. This feature management capability is helpful for algorithm engineering, especially when adding features to iteratively optimize a model.

Multi-language SDKs

FeatureStore provides SDKs for Go, Java, and Python. These SDKs let you use FeatureStore features in the joint solution of PAI-Rec and EasyRec Processor. You can use the Java SDK to call EasyRec Processor or other model scoring engines from your own server, such as search, recommendation, or risk control engines. The Python SDK lets you access data in online stores to perform data analytics and modeling.

Feature generation SDK

Feature generation refers to defining and creating features. You can define features using a Python script, execute the script to produce the required features, and then register them on the PAI-FeatureStore platform. The SDK for feature generation is an independent, open-source tool based on MaxCompute SQL that simplifies feature generation. The implementation reuses daily intermediate data, which significantly reduces the computing resources needed for tasks such as calculating user preference statistics over 30 days of behavioral data.
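
The saving from daily intermediate data can be sketched as follows: each day's raw events are aggregated once into a small partial result, and a 30-day statistic is then the sum of 30 cached partials rather than a rescan of 30 days of raw logs. This is a conceptual Python illustration, not the feature generation SDK, which is based on MaxCompute SQL. The event schema and the daily_partial and window_stats helpers are assumptions for this example.

```python
from collections import Counter
from datetime import date, timedelta


def daily_partial(events):
    """Aggregate one day's raw events into per-(user, category) click counts."""
    return Counter((e["user_id"], e["category"]) for e in events)


def window_stats(daily_partials, end_day, days=30):
    """Sum cached daily partials over the trailing `days`-day window ending at end_day."""
    total = Counter()
    for offset in range(days):
        day = end_day - timedelta(days=offset)
        total.update(daily_partials.get(day, Counter()))
    return total


# Hypothetical cache of daily partials, keyed by day.
partials = {
    date(2024, 1, 1): daily_partial([{"user_id": "u1", "category": "shoes"}] * 2),
    date(2024, 1, 30): daily_partial([{"user_id": "u1", "category": "shoes"}]),
    date(2024, 1, 31): daily_partial([{"user_id": "u1", "category": "books"}]),
}
stats_jan30 = window_stats(partials, date(2024, 1, 30))
stats_jan31 = window_stats(partials, date(2024, 1, 31))
```

When the window moves forward by one day, only the newest day's partial needs to be computed from raw data. All other days are read from the cache.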

Automated feature engineering

FeatureStore plans to offer automated feature engineering. This feature will use machine learning to automatically discover new features and reduce the manual feature engineering workload for development teams.

Feature monitoring

FeatureStore plans to provide feature monitoring and alerting. This helps you promptly detect and resolve feature anomalies and issues, and reduces the time your team spends on troubleshooting and repairs.

Deep integration with the EasyRec recommendation engine

FeatureStore is deeply integrated with EasyRec. It supports efficient feature engineering (FG) and model training. You can deploy models directly online to the EasyRec EAS Processor. This lets you build a recommendation system and achieve excellent results in a short time. EasyRec can cache item feature tables in memory and provides efficient model scoring.

The FeatureStore C++ SDK integrated into the EasyRec processor is specifically optimized for large-scale scenarios. When you use FeatureStore:

Memory usage: The built-in FeatureStore C++ SDK in the EasyRec processor is optimized for feature storage. Compared to native memory caching, it saves 50% of memory. The savings are more significant when processing many features, which helps reduce resource consumption.
Feature pull time: You can quickly cache features in memory from offline feature views. This is more than five times faster than loading from an online data source, and it also reduces the load on the online data source. The offline data source is highly stable. Tests show that you can scale out to hundreds of EAS instances simultaneously, and each instance can load all features within a few minutes. Therefore, scaling out does not put significant pressure on the online store.
Model scoring time: Model scoring extracts features in real time from the optimized cache. With the specific optimizations of the FeatureStore C++ SDK, using FeatureStore significantly reduces TP100 latency, enhances scoring stability, and reduces timeouts.

How it works

FeatureStore provides data source capabilities and can connect to both offline and online storage products. This lets you read, write, and manage offline and online feature data in a unified way.
You can register offline and online feature tables in a feature view of FeatureStore. You can then use the feature view to aggregate and map feature data.
You can store label tables in the offline store MaxCompute and register them with FeatureStore through the offline data source. The registered FeatureStore label table maps to the actual label table data.
FeatureStore provides feature projects and feature entities. You can use the Join Id of a feature entity to associate feature views across projects. This links all features of an entity. Finally, you can combine this with a label table to produce a model feature table, called a Train Set table, and store it in MaxCompute.
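
The association step described above can be sketched as a left join of feature views onto a label table, keyed by each entity's Join Id. The in-memory tables, feature names, and the build_train_set helper below are hypothetical and exist only to show the data flow. The actual Train Set table is produced by FeatureStore and stored in MaxCompute.

```python
def build_train_set(label_rows, feature_views):
    """Left-join feature views onto label rows by each view's Join Id.

    `feature_views` is a list of (join_id, table) pairs, where `table` maps a
    Join Id value to a dict of feature values (hypothetical in-memory layout).
    """
    train_set = []
    for row in label_rows:
        sample = dict(row)
        for join_id, table in feature_views:
            # Rows without a match keep their label columns and gain no features.
            sample.update(table.get(row[join_id], {}))
        train_set.append(sample)
    return train_set


user_view = {"u1": {"age": 30}}
item_view = {"i1": {"price": 9.9}, "i2": {"price": 5.0}}
label_rows = [
    {"user_id": "u1", "item_id": "i1", "is_click": 1},
    {"user_id": "u2", "item_id": "i2", "is_click": 0},
]
train_set = build_train_set(
    label_rows, [("user_id", user_view), ("item_id", item_view)]
)
```

Here user_id and item_id play the role of the Join Ids for the user and item feature entities, and each output row combines the label with the features from every associated view.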

Regions and zones

FeatureStore is available in the following regions:

China (Beijing), China (Shanghai), China (Hangzhou), China (Shenzhen), China (Hong Kong), Singapore, US (Silicon Valley), and US (Virginia).

Procedure

Create a data source. Data sources include offline and online stores.
Create a project. In the project, configure feature entities, feature views, and label tables to produce a model feature table (the Train Set table, which serves as the training dataset).
Create a data synchronization task to synchronize offline data to an online store.
After you start the task to synchronize offline data to an online store, you can view the task status and details in the Task Hub.
To read and use FeatureStore online data in a Java or Go online engine (online service), join the DingTalk group (34415007523) to contact technical support.