This topic describes how to use the FeatureStore software development kit (SDK) to manage features in a recommendation system without using other Alibaba Cloud products.
Background information
A recommendation system suggests personalized content or products to users based on their interests and preferences. A crucial step in a recommendation system is extracting and configuring features for users and items. This document demonstrates how to build a recommendation system using FeatureStore and manage feature data with different versions of the FeatureStore SDK.
For more information about FeatureStore, see FeatureStore overview.
If you have any questions during configuration or use, you can join our DingTalk group (ID: 34415007523) for technical support.
Prerequisites
Before you begin, complete the following preparations.
Prepare the following dependent products:
Platform for AI (PAI)
MaxCompute
FeatureDB
DataWorks
1. Prepare data
Synchronize data tables
In a typical recommendation scenario, you need to prepare three data tables: a user feature table, an item feature table, and a label table.
For practice purposes, we have prepared sample user, item, and label tables in the pai_online_project MaxCompute project. The user and item tables each contain approximately 100,000 data entries per partition and occupy about 70 MB of storage in MaxCompute. The label table contains approximately 450,000 data entries per partition and occupies about 5 MB of storage in MaxCompute.
Run SQL commands in DataWorks to synchronize the user, item, and label tables from the pai_online_project project to your own MaxCompute project. Follow these steps:
Log on to the DataWorks console.
In the navigation pane on the left, click Data Development and O&M > Data Development.
Select the DataWorks workspace that you created and click Go to Data Studio.
Hover over Create, and choose Create Node > MaxCompute > MaxCompute SQL. In the dialog box that appears, configure the node parameters.
Parameter
Suggested value
Engine Instance
Select the MaxCompute engine you created.
Node Type
MaxCompute SQL
Path
Business Flow/Workflow/MaxCompute
Name
Enter a custom name.
Click OK.
In the new node area, run the following SQL commands to synchronize the user, item, and label tables from the pai_online_project project to your own MaxCompute project. For Resource Group, select the exclusive resource group that you created.
Synchronize the user table: rec_sln_demo_user_table_preprocess_all_feature_v1 (Click for details)
Synchronize the item table: rec_sln_demo_item_table_preprocess_all_feature_v1 (Click for details)
Synchronize the label table: rec_sln_demo_label_table (Click for details)
After you complete these steps, you can view the user table rec_sln_demo_user_table_preprocess_all_feature_v1, the item table rec_sln_demo_item_table_preprocess_all_feature_v1, and the label table rec_sln_demo_label_table in your workspace. These three tables are used as examples in the following operations.
Configure data sources
FeatureStore typically requires two data sources: an offline store (MaxCompute) and an online store (FeatureDB, Hologres, or TableStore). This topic uses MaxCompute and FeatureDB as examples.
Log on to the PAI console. In the navigation pane on the left, click Data Preparation > FeatureStore.
Select a workspace and click Enter FeatureStore.
Configure the MaxCompute data source.
On the Data Source tab, click Create Data Source. In the dialog box that appears, configure the parameters for the MaxCompute data source.
Parameter
Suggested value
Type
MaxCompute
Name
Enter a custom name.
MaxCompute Project Name
Select the MaxCompute project you created.
Copy the authorization statement. Then, click Go Now to navigate to DataWorks. Run the statement in DataWorks to grant the required permissions.
Note: The authorization operation requires your account to have admin permissions. For more information, see Manage user permissions using commands or Manage user permissions in the console.
After the configuration is complete, click Submit.
Configure the FeatureDB data source.
If you have already created a FeatureDB data source, you can skip this step.
On the Data Source tab, click Create Data Source. In the dialog box that appears, configure the parameters for the FeatureDB data source.
Parameter
Suggested value
Type
FeatureDB (If this is your first time using it, follow the on-screen instructions to activate FeatureDB)
Name
Custom names are not supported. The default value is feature_db.
Username
Set a username.
Password
Set a password.
High-speed VPC Connection (Optional)
After successful configuration, you can use the FeatureStore SDK in a VPC to directly access FeatureDB through a PrivateLink connection. This improves data read and write performance and reduces access latency.
VPC
Select the VPC where your online FeatureStore service is located.
Zone and vSwitch
Select a zone and a vSwitch. Make sure to select the vSwitch in the zone where your online service machine is located. We recommend selecting vSwitches in at least two zones to ensure high availability and stability for your business.
After the configuration is complete, click Submit.
2. Create and register a FeatureStore project
You can create and register a FeatureStore project using either the console or an SDK. Because the SDK is required for subsequent operations, such as exporting a training set and synchronizing data, you must install the FeatureStore Python SDK even if you use the console for the initial setup.
Method 1: Use the console
Create a FeatureStore project.
Log on to the PAI console. In the navigation pane on the left, click Data Preparation > FeatureStore.
Select a workspace and click Enter FeatureStore.
Click Create Project. In the dialog box that appears, configure the project parameters.
Parameter
Suggested value
Name
Enter a custom name. This topic uses fs_demo as an example.
Description
Enter a custom description.
Offline Store
Select the MaxCompute data source you created.
Online Store
Select the FeatureDB data source you created.
Click Submit to create the FeatureStore project.
Create a feature entity.
On the FeatureStore Project List page, click the project name to open the project details page.
On the Feature Entity tab, click Create Feature Entity. In the dialog box that appears, configure the parameters for the user feature entity.
Parameter
Suggested value
Feature Entity Name
Enter a custom name. This topic uses user as an example.
Join Id
user_id
Click Submit.
Click Create Feature Entity. In the dialog box that appears, configure the parameters for the item feature entity.
Parameter
Suggested value
Feature Entity Name
Enter a custom name. This topic uses item as an example.
Join Id
item_id
Click Submit to create the feature entity.
Create a feature view.
On the Feature View tab of the project details page, click Create Feature View. In the dialog box that appears, configure the parameters for the user feature view.
Parameter
Suggested value
View Name
Enter a custom name. This topic uses user_table_preprocess_all_feature_v1 as an example.
Type
Offline
Write Method
Use Offline Table
Data Source
Select the MaxCompute data source you created.
Feature Table
Select the user table you prepared: rec_sln_demo_user_table_preprocess_all_feature_v1.
Feature Fields
Select the user_id primary key.
Sync To Online Feature Table
Yes
Feature Entity
user
Feature Lifecycle (seconds)
Keep the default value.
Click Submit.
Click Create Feature View. In the dialog box that appears, configure the item feature view.
Parameter
Suggested value
View Name
Enter a custom name. This topic uses item_table_preprocess_all_feature_v1 as an example.
Type
Offline
Write Method
Use Offline Table
Data Source
Select the MaxCompute data source you created.
Feature Table
Select the item table you prepared: rec_sln_demo_item_table_preprocess_all_feature_v1.
Feature Fields
Select the item_id primary key.
Sync To Online Feature Table
Yes
Feature Entity
item
Feature Lifecycle (seconds)
Keep the default value.
After the configuration is complete, click Submit to create the feature view.
Create a label table.
On the Label Table tab of the project details page, click Create Label Table. In the dialog box that appears, configure the label table information.
Parameter
Suggested value
Data Source
Select the MaxCompute data source you created.
Table Name
Select the label table you prepared: rec_sln_demo_label_table.
Click Submit.
Create a model feature.
On the Model Feature tab of the project details page, click Create Model Feature. In the dialog box that appears, configure the model feature parameters.
Parameter
Suggested value
Model Feature Name
Enter a custom name. This topic uses fs_rank_v1 as an example.
Select Features
Select the user and item feature views you created.
Label Table Name
Select the label table you created: rec_sln_demo_label_table.
Click Submit to create the model feature.
On the model feature list page, click Details to the right of your model.
In the Model Feature Details dialog box that appears, on the Basic Information tab, you can view the Exported Table Name, which is fs_demo_fs_rank_v1_trainning_set. This table is used for subsequent feature production and model training.
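The exported table name appears to follow a fixed pattern. The sketch below is an assumption inferred from the single example name above (including the product's "trainning" spelling); the helper function is ours, not a FeatureStore API:

```python
def exported_table_name(project: str, model: str) -> str:
    # Assumed pattern: <project>_<model>_trainning_set, matching the
    # fs_demo_fs_rank_v1_trainning_set example above (spelling as
    # produced by the product).
    return f"{project}_{model}_trainning_set"

print(exported_table_name("fs_demo", "fs_rank_v1"))
# fs_demo_fs_rank_v1_trainning_set
```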
Install the FeatureStore Python SDK. For more information, see Use FeatureStore to manage features in a recommendation system.
Method 2: Use the FeatureStore Python SDK
Log on to the DataWorks console.
In the navigation pane on the left, click Resource Groups.
On the Exclusive Resource Groups tab, find the resource group whose Purpose is Data Scheduling. Click the icon next to its schedule resource and select O&M Assistant.
Click Create Command. In the dialog box that appears, configure the command parameters.
Parameter
Suggested value
Command Name
Enter a custom name. This topic uses install as an example.
Command Type
Manual Input (pip Command Cannot Be Used To Install Third-party Packages)
Command Content
/home/tops/bin/pip3 install -i https://pypi.tuna.tsinghua.edu.cn/simple https://feature-store-py.oss-cn-beijing.aliyuncs.com/package/feature_store_py-2.0.2-py3-none-any.whl
Timeout
Set a custom time.
Click Create to create the command.
Click Run Command. In the dialog box that appears, click Run.
You can click Refresh to view the latest execution status. When the status changes to Success, the installation is complete.
For detailed steps on using the SDK, see DSW Gallery.
3. Run the data synchronization node
Before you publish your service, you need to run the data synchronization node on a regular basis to synchronize data from the offline store to the online store. The online service then reads data from the online store in real time. This example demonstrates how to configure a recurring synchronization task for the user and item feature tables.
Log on to the DataWorks console.
In the navigation pane on the left, click Data Development and O&M > Data Development.
Select the DataWorks workspace that you created and click Enter Data Development.
Configure recurring synchronization for the user table.
Hover over New, and choose New Node > MaxCompute > PyODPS 3.
Copy the following content into the script to complete the recurring synchronization of user_table_preprocess_all_feature_v1.
from feature_store_py.fs_client import FeatureStoreClient
import datetime
from feature_store_py.fs_datasource import MaxComputeDataSource
import sys
from odps.accounts import StsAccount

cur_day = args['dt']
print('cur_day = ', cur_day)

access_key_id = o.account.access_id
access_key_secret = o.account.secret_access_key
sts_token = None
endpoint = 'paifeaturestore-vpc.cn-beijing.aliyuncs.com'
if isinstance(o.account, StsAccount):
    sts_token = o.account.sts_token
fs = FeatureStoreClient(access_key_id=access_key_id, access_key_secret=access_key_secret, security_token=sts_token, endpoint=endpoint)

cur_project_name = 'fs_demo'
project = fs.get_project(cur_project_name)

feature_view_name = 'user_table_preprocess_all_feature_v1'
batch_feature_view = project.get_feature_view(feature_view_name)
task = batch_feature_view.publish_table(partitions={'ds': cur_day}, mode='Overwrite', offline_to_online=True)
task.wait()
task.print_summary()

In the right-side navigation pane, click Scheduling Configuration. In the dialog box that appears, configure the scheduling parameters.
Parameter
Suggested value
Scheduling Parameters
Parameter Name
dt
Parameter Value
$[yyyymmdd-1]
Resource Properties
Scheduling Resource Group
Select the exclusive resource group you created.
Scheduling Dependencies
Select the user table you created.
After you configure and test the node, save and submit the node configuration.
Perform a data backfill operation. For more information, see Synchronize data tables.
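The $[yyyymmdd-1] scheduling parameter resolves to the day before the scheduled run date, formatted as yyyymmdd. A minimal stdlib sketch of that substitution (the helper name is ours, not a DataWorks API):

```python
from datetime import datetime, timedelta

def resolve_dt(bizdate: str, offset_days: int = -1) -> str:
    # Mimic $[yyyymmdd-1]: shift a yyyymmdd-formatted date by offset_days
    d = datetime.strptime(bizdate, "%Y%m%d") + timedelta(days=offset_days)
    return d.strftime("%Y%m%d")

print(resolve_dt("20240301"))  # 20240229 (2024 is a leap year)
```

The value produced here is what arrives in the script as args['dt'] when the node runs.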
Configure recurring synchronization for the item table.
Hover over New, and choose New Node > MaxCompute > PyODPS 3. In the dialog box that appears, configure the node parameters.
Click Confirm.
Copy the following content into the script.
Routine synchronization for item_table_preprocess_all_feature_v1 (Click for details)
In the right-side navigation pane, click Scheduling Configuration. In the dialog box that appears, configure the scheduling parameters.
Parameter
Suggested value
Scheduling Parameters
Parameter Name
dt
Parameter Value
$[yyyymmdd-1]
Resource Properties
Scheduling Resource Group
Select the exclusive resource group you created.
Scheduling Dependencies
Select the item table you created.
After you configure and test the node, save and submit the node configuration.
Perform a data backfill operation. For more information, see Synchronize data tables.
After the synchronization is complete, you can view the latest synchronized feature data in the online store (FeatureDB in this example).
4. Export the training set script
Log on to the DataWorks console.
In the navigation pane on the left, click Data Development and O&M > Data Development.
Select the DataWorks workspace that you created and click Enter Data Development.
Hover over New, and choose New Node > MaxCompute > PyODPS 3. In the dialog box that appears, configure the node parameters.
Parameter
Suggested value
Engine Instance
Select the MaxCompute engine you created.
Node Type
PyODPS 3
Path
Business Flow/Workflow/MaxCompute
Name
Enter a custom name.
Click Confirm.
Copy the following content into the script.
from feature_store_py.fs_client import FeatureStoreClient
from feature_store_py.fs_project import FeatureStoreProject
from feature_store_py.fs_datasource import LabelInput, MaxComputeDataSource, TrainingSetOutput
from feature_store_py.fs_features import FeatureSelector
from feature_store_py.fs_config import LabelInputConfig, PartitionConfig, FeatureViewConfig
from feature_store_py.fs_config import TrainSetOutputConfig, EASDeployConfig
import datetime
import sys
from odps.accounts import StsAccount

cur_day = args['dt']
print('cur_day = ', cur_day)
offset = datetime.timedelta(days=-1)
pre_day = (datetime.datetime.strptime(cur_day, "%Y%m%d") + offset).strftime('%Y%m%d')
print('pre_day = ', pre_day)

access_key_id = o.account.access_id
access_key_secret = o.account.secret_access_key
sts_token = None
endpoint = 'paifeaturestore-vpc.cn-beijing.aliyuncs.com'
if isinstance(o.account, StsAccount):
    sts_token = o.account.sts_token
fs = FeatureStoreClient(access_key_id=access_key_id, access_key_secret=access_key_secret, security_token=sts_token, endpoint=endpoint)

cur_project_name = 'fs_demo'
project = fs.get_project(cur_project_name)

label_partitions = PartitionConfig(name='ds', value=cur_day)
label_input_config = LabelInputConfig(partition_config=label_partitions)

user_partitions = PartitionConfig(name='ds', value=pre_day)
feature_view_user_config = FeatureViewConfig(name='user_table_preprocess_all_feature_v1', partition_config=user_partitions)
item_partitions = PartitionConfig(name='ds', value=pre_day)
feature_view_item_config = FeatureViewConfig(name='item_table_preprocess_all_feature_v1', partition_config=item_partitions)
feature_view_config_list = [feature_view_user_config, feature_view_item_config]

train_set_partitions = PartitionConfig(name='ds', value=cur_day)
train_set_output_config = TrainSetOutputConfig(partition_config=train_set_partitions)

model_name = 'fs_rank_v1'
cur_model = project.get_model(model_name)
task = cur_model.export_train_set(label_input_config, feature_view_config_list, train_set_output_config)
task.wait()
print("task_summary = ", task.task_summary)

In the right-side navigation pane, click Scheduling Configuration. In the dialog box that appears, configure the scheduling parameters.
Parameter
Suggested value
Scheduling Parameters
Parameter Name
dt
Parameter Value
$[yyyymmdd-1]
Resource Properties
Scheduling Resource Group
Select the exclusive resource group you created.
Scheduling Dependencies
Select the user and item tables you created.
After you configure and test the node, save and submit the node configuration.
Perform a data backfill operation. For more information, see Synchronize data tables.
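The export script above joins labels from partition cur_day with features from partition pre_day. The reason is point-in-time correctness: features must predate the label so that same-day information does not leak into training. A toy, dictionary-based sketch of that join (all data values are invented for illustration):

```python
from datetime import datetime, timedelta

cur_day = "20240302"
pre_day = (datetime.strptime(cur_day, "%Y%m%d") - timedelta(days=1)).strftime("%Y%m%d")

labels = {("u1", "i1"): 1}                  # label table, partition ds=cur_day
user_features = {"u1": {"city": "Hefei"}}   # user view, partition ds=pre_day
item_features = {"i1": {"category": "14"}}  # item view, partition ds=pre_day

# One training row per label, enriched with the previous day's features
train_rows = [
    {"user_id": u, "item_id": i, "label": y,
     **user_features[u], **item_features[i]}
    for (u, i), y in labels.items()
]
print(pre_day, train_rows[0])
```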
5. Install and use the SDK
Python SDK
For more information, see Use the FeatureStore Python SDK to build a recommendation system.
Go SDK
The FeatureStore Go SDK is open source. For more information, see aliyun-pai-featurestore-go-sdk.
Installation
Run the following command to install the FeatureStore Go SDK.
go get github.com/aliyun/aliyun-pai-featurestore-go-sdk/v2
Usage
Use the following code to initialize the client.
accessId := os.Getenv("AccessId")
accessKey := os.Getenv("AccessKey")
regionId := "cn-hangzhou"
projectName := "fs_test_ots"

client, err := NewFeatureStoreClient(regionId, accessId, accessKey, projectName)

Note: The SDK connects directly to the online data source. Therefore, the client must run in a VPC environment. For example, Hologres requires connections to originate from a specified VPC.
Retrieve feature data from a feature view.
// get project by name
project, err := client.GetProject("fs_test_ots")
if err != nil {
    // t.Fatal(err)
}
// get featureview by name
user_feature_view := project.GetFeatureView("user_fea")
if user_feature_view == nil {
    // t.Fatal("feature view not exist")
}
// get online features
features, err := user_feature_view.GetOnlineFeatures([]interface{}{"100043186", "100060369"}, []string{"*"}, nil)

Here, []string{"*"} means to retrieve all features in the feature view. You can also specify the names of the features that you want to retrieve.
The following is a sample of the returned data:

[
    {
        "city":"Hefei",
        "follow_cnt":1,
        "gender":"male",
        "user_id":"100043186"
    },
    {
        "city":"",
        "follow_cnt":5,
        "gender":"male",
        "user_id":"100060369"
    }
]

Retrieve feature data from a model feature.
A model feature can be associated with multiple feature entities. You can provide multiple join IDs to retrieve the corresponding features together.
This example uses three join IDs: join_id, user_id, and item_id. When you fetch features, you must provide a value for each join ID.

// get project by name
project, err := client.GetProject("fs_test_ots")
if err != nil {
    // t.Fatal(err)
}
// get ModelFeature by name
model_feature := project.GetModelFeature("rank")
if model_feature == nil {
    // t.Fatal("model feature not exist")
}
// get online features
features, err := model_feature.GetOnlineFeatures(map[string][]interface{}{"user_id": {"100000676", "100004208"}, "item_id": {"238038872", "264025480"}})

The following is a sample of the returned data:
[
    {
        "age":26,
        "author":100015828,
        "category":"14",
        "city":"Shenyang",
        "duration":63,
        "gender":"male",
        "item_id":"238038872",
        "user_id":"100000676"
    },
    {
        "age":23,
        "author":100015828,
        "category":"15",
        "city":"Xi'an",
        "duration":22,
        "gender":"male",
        "item_id":"264025480",
        "user_id":"100004208"
    }
]

You can also specify a feature entity to retrieve all of its corresponding features.
The following is a sample of the returned data:
[
    {
        "age":26,
        "city":"Shenyang",
        "gender":"male",
        "user_id":"100000676"
    },
    {
        "age":23,
        "city":"Xi'an",
        "gender":"male",
        "user_id":"100004208"
    }
]
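The []string{"*"} selector used above behaves like a field filter applied to each returned row. A rough Python model of that semantics (a toy helper of our own, not part of any FeatureStore SDK):

```python
def select_features(row, selectors):
    # "*" keeps every field; otherwise keep only the named fields
    if selectors == ["*"]:
        return dict(row)
    return {name: row[name] for name in selectors}

row = {"city": "Hefei", "follow_cnt": 1, "gender": "male", "user_id": "100043186"}
print(select_features(row, ["city", "gender"]))  # {'city': 'Hefei', 'gender': 'male'}
```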
Java SDK
The FeatureStore Java SDK is open source. For more information, see aliyun-pai-featurestore-java-sdk.
This section uses a Hologres data source as an example.
Run the following code to load environment variables and initialize the service.
public static String accessId = "";
public static String accessKey = "";
// Configure the host based on the region
public static String host = "";
// Get the accessId and accessKey from the configured local environment variables
static {
    accessId = System.getenv("ACCESS_KEY_ID");
    accessKey = System.getenv("ACCESS_KEY_SECRET");
}

Create a configuration class that configures the regionId, accessId, accessKey, and project name.

Configuration cf = new Configuration("cn-hangzhou", Constants.accessId, Constants.accessKey, "ele28");
cf.setDomain(Constants.host); // Note: The default environment is a VPC environment

Initialize the client.
ApiClient apiClient = new ApiClient(cf);
// FS client
FeatureStoreClient featureStoreClient = new FeatureStoreClient(apiClient);

Retrieve the project. This example uses a project named ele28.
Project project = featureStoreClient.getProject("ele28");
if (project == null) {
    throw new RuntimeException("Project not found");
}

Retrieve the feature view of the project. This example uses a feature view named mc_test.
FeatureView featureView = project.getFeatureView("mc_test");
if (featureView == null) {
    throw new RuntimeException("FeatureView not found");
}

Retrieve online feature data based on the feature view.
Map<String, String> m1 = new HashMap<>();
m1.put("gender", "gender1"); // Create an alias

To retrieve online features, String[]{"*"} retrieves all properties. You can also specify certain properties to retrieve only a subset of the information.

FeatureResult featureResult1 = featureView.getOnlineFeatures(new String[]{"100017768", "100027781", "100072534"}, new String[]{"*"}, m1);

Output the feature information.

while (featureResult1.next()) {
    System.out.println("---------------");
    // Feature name
    for (String m : featureResult1.getFeatureFields()) {
        System.out.print(String.format("%s=%s(%s) ", m, featureResult1.getObject(m), featureResult1.getType(m)));
    }
    System.out.println("---------------");
}

The following data is returned.

---------------
user_id='100017768'(FS_INT64) age='28'(FS_INT64) city='Dongguan'(FS_STRING) item_cnt='1'(FS_INT64) follow_cnt='1'(FS_INT64) follower_cnt='0'(FS_INT64) register_time='1697202320'(FS_INT64) tags='1,2'(FS_STRING) gender1='female'(FS_STRING)
---------------
user_id='100027781'(FS_INT64) age='28'(FS_INT64) city='null'(FS_STRING) item_cnt='0'(FS_INT64) follow_cnt='0'(FS_INT64) follower_cnt='2'(FS_INT64) register_time='1697641608'(FS_INT64) tags='0'(FS_STRING) gender1='female'(FS_STRING)
---------------

Retrieve the model.
Model model = project.getModelFeature("model_t1");
if (model == null) {
    throw new RuntimeException("Model not found");
}

Retrieve data from the model feature.
This model feature uses two join IDs: user_id and item_id. The number of values that you provide must match the number of join IDs. This example provides one value for each join ID.
Map<String, List<String>> m2 = new HashMap<>();
m2.put("user_id", Arrays.asList("101683057"));
m2.put("item_id", Arrays.asList("203665415"));

Retrieve all feature data related to the user feature entity from the model feature.
FeatureResult featureResult2 = model.getOnlineFeaturesWithEntity(m2, "user");

The following data is returned.
---------------
user_id='101683057' age='28' city='Shenzhen' follower_cnt='234' follow_cnt='0' gender='male' item_cnt='0' register_time='1696407642' tags='2' item_id='203665415' author='132920407' category='14' click_count='0' duration='18.0' praise_count='10' pub_time='1698218997' title='#Idiom Story'
---------------
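The rule that the number of values must match across join IDs can be sketched as a small validation helper. This is a hypothetical, stdlib-only illustration (the real SDKs raise their own errors):

```python
def rows_from_join_ids(join_ids, values):
    # Every join id must be present, and all value lists must have equal
    # length, so positions can be zipped into per-row key tuples.
    if set(values) != set(join_ids):
        raise ValueError("missing or extra join id")
    lengths = {len(v) for v in values.values()}
    if len(lengths) != 1:
        raise ValueError("value lists must have equal length")
    return list(zip(*(values[k] for k in join_ids)))

print(rows_from_join_ids(["user_id", "item_id"],
                         {"user_id": ["101683057"], "item_id": ["203665415"]}))
# [('101683057', '203665415')]
```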
C++ SDK
The FeatureStore C++ SDK is currently integrated into the EasyRec Processor. It is specifically optimized for feature extraction, cache management, and read operations to provide a high-performance, low-latency solution for large-scale recommendation scenarios.
Memory usage: When dealing with complex and large-scale feature data, memory consumption is significantly reduced. The memory savings are more significant under high feature loads.
Feature pull time: Instead of pulling features from online stores (such as FeatureDB and Hologres), the SDK pulls them directly from MaxCompute into the Elastic Algorithm Service (EAS) cache. This significantly shortens feature loading time. Additionally, MaxCompute provides better stability and extensibility, which reduces the load on online stores during scale-out operations.
Model scoring time: Using this SDK improves the TP100 performance metric for model scoring, stabilizes response times, and significantly reduces timeout requests. These improvements enhance the overall reliability and user experience of the recommendation service.
References
FeatureStore can also be used with other cloud products to build a recommendation system. For more information, see Use FeatureStore to manage features in a recommendation system.