Best practices for customizing PAI-Rec recommendation algorithms - Artificial Intelligence Recommendation

This topic uses a public dataset to help you get started with PAI-Rec. You can follow the steps to configure key features, such as feature engineering, recall, and fine-grained ranking for custom recommendation algorithms. You can then generate the code and deploy it to the corresponding workflow in DataWorks.

Prerequisites

Before you begin, complete the following prerequisites:

Activate PAI. For more information, see Activate PAI and create a default workspace.
Create a virtual private cloud (VPC) and a vSwitch. For more information, see Create a VPC with an IPv4 CIDR block.
Activate PAI-FeatureStore. For more information, see the Prerequisites section of Create a data source. You do not need to activate Hologres. Instead, select FeatureDB as the data source. For more information, see Create an online store: FeatureDB.
Activate MaxCompute and create a MaxCompute project named project_mc. For more information, see Activate MaxCompute and Create a MaxCompute project.
Create an Object Storage Service (OSS) bucket. For more information, see Create a bucket.
Activate DataWorks and perform the following operations:
- Create a DataWorks workspace. For more information, see Create a workspace.
- Purchase a Serverless resource group for DataWorks. For more information, see Use a Serverless resource group. The resource group is used to synchronize data for PAI-FeatureStore and run eascmd commands to create and update PAI-EAS services.
- Configure DataWorks data sources:
  - Create and attach an OSS data source. For more information, see Data Source Management.
  - Create and attach a MaxCompute data source. For more information, see Attach a MaxCompute computing resource.
Create a FeatureStore project and feature entities. If you use a dedicated resource group for DataWorks, you must also install the FeatureStore Python software development kit (SDK). You can skip the SDK installation if you use a Serverless resource group. For more information, see II. Create and register a FeatureStore and Install the FeatureStore Python SDK.
Activate Flink. For more information, see Activate Realtime Compute for Apache Flink. Note: Set Storage Type to OSS bucket, not Fully Managed Storage. Ensure that the OSS bucket for Flink is the same as the one configured for PAI-Rec. Flink is used to record real-time user behavioral data and calculate real-time user features.
If you choose EasyRec (TensorFlow), the model is trained on MaxCompute by default.
If you choose TorchEasyRec (PyTorch), the model is trained on PAI-DLC by default. To download MaxCompute data on PAI-DLC, you must activate Data Transmission Service. For more information, see Purchase and use a dedicated resource group for Data Transmission Service.

1. Create a PAI-Rec instance and initialize the service

Log on to the Personalized Recommendation Platform home page and click Buy Now.

On the PAI-Rec instance purchase page, configure the following key parameters and click Buy Now.

Parameter	Description
Region And Zone	The region where your cloud service is deployed.
Service Type	Select Premium Edition for this solution. Note: Compared to the Standard Edition, the Premium Edition adds data diagnostics and custom recommendation solution features.

Log on to the PAI-Rec console. In the top menu bar, select a region.
In the navigation pane on the left, choose Instance List. Click the instance name to go to the instance details page.

In the Operation Guide section, click Init. You are redirected to the System Configurations > End-to-End Service page. Click Edit, configure the resources as described in the following table, and then click Done.

Resource configuration

Parameter	Description
Modeling
PAI Workspace	Enter the default PAI workspace that you created.
DataWorks Workspace	Enter the automatically generated DataWorks workspace.
MaxCompute Project (Workspace)	Enter the MaxCompute project that you created.
OSS Bucket	Select the OSS bucket that you created.
Engine
Real-time Recall Engine	For Use PAI-FeatureStore, select Yes.
Real-time Feature Query	For Use PAI-FeatureStore, select Yes.

In the navigation pane on the left, choose System Configurations > Permission Management. On the Access Service tab, check the authorization status of each cloud product to ensure that access is granted.

2. Clone the public dataset

1. Synchronize data tables

You can provide input data for this solution in two ways:

Clone data for a fixed time window from the pai_online_project project. This method does not support routine task scheduling.
Use a Python script to generate data. You can run a task in DataWorks to generate data for a specific period.

To schedule daily data generation and model training, use the second method. You must deploy the specified Python code to generate the required data. For more information, see the Generate data using code tab.

Synchronize data for a fixed time window

PAI-Rec provides three common tables for recommendation algorithms in the publicly accessible pai_online_project project:

User table: pai_online_project.rec_sln_demo_user_table
Item table: pai_online_project.rec_sln_demo_item_table
Behavior table: pai_online_project.rec_sln_demo_behavior_table

The subsequent operations in this solution are based on these three tables. The data is randomly generated and simulated and has no real business meaning. Therefore, metrics such as Area Under the Curve (AUC) obtained from training will be low. You must run SQL commands in DataWorks to synchronize the table data from the pai_online_project project to your DataWorks project, such as DataWorks_a. The procedure is as follows:

Log on to the DataWorks console. In the top menu bar, select a region.
In the navigation pane on the left, click Data Development And O&M > Data Development.
Select the DataWorks workspace that you created and click Go To Data Development.

Hover over Create and choose Create Node > MaxCompute > ODPS SQL. Configure the parameters as described in the following table and click Confirm.

Resource configuration

Parameter	Description
Engine Instance	Select the attached MaxCompute data source.
Node Type	Select the node type ODPS SQL.
Path	Select the path where the current node is located. For example, `Business Flow/Workflow/MaxCompute`.
Name	Enter a custom name, such as Data.

In the new node, copy and run the following code to synchronize the user, item, and behavior tables from the pai_online_project project to your MaxCompute project, such as project_mc. The code reads the partitions from 100 days before bizdate through bizdate, so before you run it you must configure the scheduling parameters that supply the ${bizdate_100} and ${bizdate} variables. Typically, you can set bizdate to the day before the current date. Then run the code once to copy the data from the public pai_online_project project to your project:

CREATE TABLE IF NOT EXISTS rec_sln_demo_user_table_v1(
 user_id BIGINT COMMENT 'Unique user ID',
 gender STRING COMMENT 'Gender',
 age BIGINT COMMENT 'Age',
 city STRING COMMENT 'City',
 item_cnt BIGINT COMMENT 'Number of created items',
 follow_cnt BIGINT COMMENT 'Number of follows',
 follower_cnt BIGINT COMMENT 'Number of followers',
 register_time BIGINT COMMENT 'Registration time',
 tags STRING COMMENT 'User tags'
) PARTITIONED BY (ds STRING) STORED AS ALIORC;

INSERT OVERWRITE TABLE rec_sln_demo_user_table_v1 PARTITION(ds)
SELECT *
FROM pai_online_project.rec_sln_demo_user_table
WHERE ds >= "${bizdate_100}" and ds <= "${bizdate}";

CREATE TABLE IF NOT EXISTS rec_sln_demo_item_table_v1(
 item_id BIGINT COMMENT 'Item ID',
 duration DOUBLE COMMENT 'Video duration',
 title STRING COMMENT 'Title',
 category STRING COMMENT 'Primary tag',
 author BIGINT COMMENT 'Author',
 click_count BIGINT COMMENT 'Total clicks',
 praise_count BIGINT COMMENT 'Total likes',
 pub_time BIGINT COMMENT 'Publication time'
) PARTITIONED BY (ds STRING) STORED AS ALIORC;

INSERT OVERWRITE TABLE rec_sln_demo_item_table_v1 PARTITION(ds)
SELECT *
FROM pai_online_project.rec_sln_demo_item_table
WHERE ds >= "${bizdate_100}" and ds <= "${bizdate}";

CREATE TABLE IF NOT EXISTS rec_sln_demo_behavior_table_v1(
 request_id STRING COMMENT 'Instrumentation ID/Request ID',
 user_id STRING COMMENT 'Unique user ID',
 exp_id STRING COMMENT 'Experiment ID',
 page STRING COMMENT 'Page',
 net_type STRING COMMENT 'Network type',
 event_time BIGINT COMMENT 'Behavior time',
 item_id STRING COMMENT 'Item ID',
 event STRING COMMENT 'Behavior type',
 playtime DOUBLE COMMENT 'Playback/Read duration'
) PARTITIONED BY (ds STRING) STORED AS ALIORC;


INSERT OVERWRITE TABLE rec_sln_demo_behavior_table_v1 PARTITION(ds)
SELECT *
FROM pai_online_project.rec_sln_demo_behavior_table
WHERE ds >= "${bizdate_100}" and ds <= "${bizdate}";
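
The relationship between the two variables can be sketched in Python. This is only an illustration, assuming the partitions use the yyyymmdd format; in DataWorks the values come from the scheduling parameters, and the helper name `bizdate_window` is hypothetical.

```python
from datetime import date, timedelta
from typing import Tuple


def bizdate_window(today: date, lookback_days: int = 100) -> Tuple[str, str]:
    """Return (bizdate_100, bizdate) as yyyymmdd strings.

    Assumption: bizdate is the day before `today`, and bizdate_100 is
    `lookback_days` days before bizdate.
    """
    bizdate = today - timedelta(days=1)
    start = bizdate - timedelta(days=lookback_days)
    return start.strftime("%Y%m%d"), bizdate.strftime("%Y%m%d")


if __name__ == "__main__":
    bizdate_100, bizdate = bizdate_window(date(2024, 8, 16))
    # These two strings bound the ds range used in the WHERE clauses above.
    print(bizdate_100, bizdate)
```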

Generate data using code

Using data from a fixed time window does not support routine task scheduling. To schedule tasks, you must deploy specific Python code to generate the required data. The procedure is as follows:

In the DataWorks console, create a PyODPS 3 node. For more information, see Create and manage MaxCompute nodes.
Download create_data.py and paste the file content into the PyODPS 3 node.
In the right-side pane, click Scheduling Configurations, configure the parameters, and then click the Save and Submit icons in the upper-right corner.
- Configure scheduling parameters:
  - Note the variable replacements:
    Replace $user_table_name with rec_sln_demo_user_table.
    Replace $item_table_name with rec_sln_demo_item_table.
    Replace $behavior_table_name with rec_sln_demo_behavior_table.
- Configure scheduling dependencies.
Go to the Operation Center and choose Periodic Task O&M > Periodic Tasks.
In the Actions column of the target task, choose Backfill Data > Current And Descendant Nodes.
In the Backfill Data panel, set the data timestamp and click Submit And Go.
A good backfill time range is 60 days. We recommend that you set the data timestamp to Scheduled Task Date - 60 to ensure data integrity.

2. Configure dependency nodes

To ensure smooth code generation and deployment, add three SQL code nodes to your DataWorks project in advance. Configure the scheduling dependencies of these nodes to the root node of the workspace. After you complete all the settings, publish the nodes. The procedure is as follows:

Hover over Create and choose Create Node > General > Virtual Node. Create three virtual nodes as described in the following table and click Confirm.

Resource configuration

Parameter	Description	Example
Node Type	Select the node type.	Virtual Node
Path	Select the path where the current node is located.	Workflow/Workflow/General
Name	Enter the names of the synchronized data tables.	rec_sln_demo_user_table_v1 rec_sln_demo_item_table_v1 rec_sln_demo_behavior_table_v1

For each of the three nodes, set the node content to `select 1;`, and then click Scheduling Configurations in the right-side pane to complete the following configurations:
- In the Time Property section, set Rerun Property to Rerun When Succeeded Or Failed.
- In the Scheduling Dependencies > Upstream Dependencies section, enter the DataWorks workspace name, select the node with the _root suffix, and click Add.
  Configure all three virtual nodes.
Click the icon in front of the virtual node to submit it.

3. Register data

To configure feature engineering, recall, and ranking algorithms in the custom recommendation solution, you must first register the three tables that you synchronized to your DataWorks project. The procedure is as follows:

Log on to the PAI-Rec console. In the top menu bar, select a region.
In the navigation pane on the left, choose Instance List. Click the instance name to go to the instance details page.

In the navigation pane on the left, choose Custom Recommendation Solution > Data Registration. On the MaxCompute Table tab, click Add Data Table. Add one user table, one item table, and one behavior table as described in the following table, and then click Start Import.

Parameter	Description	Example default solution
MaxCompute project	Select the MaxCompute project that you created.	project_mc
MaxCompute table	Select the data tables that you synchronized to the DataWorks workspace.	User table: rec_sln_demo_user_table_v1 Item table: rec_sln_demo_item_table_v1 Behavior table: rec_sln_demo_behavior_table_v1
Data table name	Enter a custom name.	User Table Item Table Behavior Table

4. Create a recommendation scenario

Before you configure a recommendation task, you must create a recommendation scenario. For information about the basic concepts of recommendation scenarios and the meaning of traffic IDs, see Terms.

In the navigation pane on the left, choose Recommendation Scenarios. Click Create Scenario, create a recommendation scenario as described in the following table, and then click OK.

Resource configuration

Parameter	Description	Example: Default Solution
Scenario Name	Enter a custom name.	HomePage
Scenario Description	A detailed description of the scenario.	None

5. Create and configure an algorithm solution

To configure a complete real-world scenario, we recommend the following recall and fine-grained ranking configurations.

Global hot recall: Ranks the top k items based on statistics from log data.
Global hot fallback recall: Uses Redis as a fallback to prevent the recommendation API from returning empty data.
Grouped hot recall: Recalls items by categories, such as city and gender, to help improve the accuracy of popular item recommendations.
etrec u2i recall: Based on the etrec collaborative filtering algorithm.
swing u2i recall (optional): Based on the Swing algorithm.
Cold-start recall (optional): Uses the DropoutNet algorithm for cold-start recall.
Fine-grained ranking: You can choose MultiTower for single-objective ranking or DBMTL for multi-objective ranking.

Vector recall and PDN recall algorithms are typically enabled only after the basic recall strategies listed above are in place. Vector recall requires a vector recall engine. We do not configure vector recall in this example because FeatureDB does not support it.

This topic is designed to guide you through the configuration and deployment process. Therefore, in the recall configuration stage, we configure only global hot recall and the u2i recall strategy of RECommender (eTREC, a collaborative filtering implementation). For the ranking configuration, we select fine-grained ranking to optimize the experience. The procedure is as follows:

In the navigation pane on the left, choose Custom Recommendation Solution > Solution Configuration. Select the scenario that you created, click Create Recommendation Solution, create a solution as described in the following table, and then click Save And Configure Algorithm Solution.

Keep the default values for parameters that are not described. For more information, see Data Table Configuration.

Resource configuration

Parameter	Description
Solution Name	Enter a custom name.
Scenario Name	Select the recommendation scenario that you created.
Offline Store	Select the MaxCompute project associated with the recommendation scenario.
DataWorks Workspace	Select the DataWorks workspace associated with the recommendation scenario.
Workflow Name	This is the name of the workflow created in DataWorks when you deploy the recommendation solution script. You can enter a custom name, such as Flow.
StorageAPI configuration	For regions in the Chinese mainland, such as Beijing and Shanghai, you can select "StorageAPI", which is a pay-as-you-go Data Transmission Service. For regions outside the Chinese mainland, such as China (Hong Kong), Singapore, and Frankfurt, you must first purchase and use a dedicated resource group for Data Transmission Service. If a pay-as-you-go option is not available, you must purchase a subscription Data Transmission Service, and then refresh the page and select the name of the subscription Data Transmission Service. In that case, also add a parameter to the TorchEasyRec training task of PAI-DLC in DataWorks, in a format similar to: -odps_data_quota_name ot_xxxx_p#ot_yyyy.
slim_mode	If the DataWorks edition you purchased has a size limit on the code packages imported by Migration Assistant, you can use this feature and manually upload the code packages that exceed the size limit. For this solution, select No.
OSS Bucket	Select the OSS bucket associated with the recommendation scenario.
Project	Select the FeatureStore project that you created. For the online store, select FeatureDB.
User Entity	Select the user feature entity `user` corresponding to the FeatureStore project.
Item Entity	Select the item feature entity `item` corresponding to the FeatureStore project.

At the Data Table Configuration node, click Add to the right of the target data table. Configure the Behavior Log Table, User Table, and Item Table as described in the following tables. Set the partition, event, feature, and timestamp fields, and then click Next.

Keep the default values for parameters that are not described. For more information, see Data Table Configuration.

Behavior log table resource configuration

When you configure the behavior log table, make adjustments based on the actual data content. In this topic, the behavior log contains core information, such as request ID, unique user ID, the page where the behavior occurred, behavior timestamp, and behavior category. If the table contains richer data dimensions, we recommend that you classify this information by user and item and configure it as user information or item information for subsequent feature engineering.

Parameter	Description	Example
Behavior Table Name	Select the registered behavior table.	rec_sln_demo_behavior_table_v1
Time Partition	The partition field of the behavior table.	ds yyyymmdd
Behavior information configuration
Request ID	The ID that marks each recommendation request in the log, typically a program-generated UUID. This is optional.	request_id
Behavior Event	The field that records the behavior event in the log.	event
Behavior Event Enumeration Values	The enumeration values included in the behavior event, such as impression, click, add-to-cart, or purchase.	expr,click,praise
Behavior Value	Represents the depth of the behavior, such as transaction price or viewing duration.	playtime
Behavior Timestamp	The time when the log was generated, as a UNIX timestamp accurate to the second.	event_time
Timestamp Format	Used with the behavior timestamp.	unixtime
Behavior Scenario	The scenario field where the log occurred, such as home page, search page, or product details page.	page
Scenario Enumeration Values	Indicates which scenario data is used. You can calculate statistics for features by scenario in subsequent feature engineering.	home,detail
User information configuration
User ID	The user ID identifier in the behavior table.	user_id
User Categorical Features	User categorical features in the behavior table, such as network, operating platform, or gender.	net_type
Item information configuration
Item ID	The item ID identifier in the behavior table.	item_id

User table resource configuration

Parameter	Description	Example
User Table Name	Select the registered user table.	rec_sln_demo_user_table_v1
Time Partition	The time partition field of the user table.	ds yyyymmdd
User information configuration
User ID	The user ID field in the user table.	user_id
Registration Timestamp	The time when the user registered.	register_time
Timestamp Format	Used with the registration timestamp.	unixtime
Categorical Features	Categorical fields in the user table, such as gender, age group, or city.	gender, city
Numerical Features	Numerical fields in the user table, such as number of works or points.	age, item_cnt, follow_cnt, follower_cnt
Tag Feature	The name of the tag feature field.	tags

Item table resource configuration

Parameter	Description	Example
Item Table Name	Select the registered item table.	rec_sln_demo_item_table_v1
Time Partition	The time partition field of the item table.	ds yyyymmdd
Item information configuration
Item ID	The item ID field in the item table.	item_id
Author ID	The author of the item.	author
Listing Timestamp	The name of the item listing timestamp field.	pub_time
Timestamp Format	Used with the listing timestamp.	unixtime
Categorical Features	Categorical fields in the item table, such as category.	category
Numerical Features	Numerical fields in the item table, such as price, total sales, or number of likes.	click_count, praise_count

At the Feature Configuration node, configure the parameters as described in the following table, click Generate Features, set the feature version, and then click Next.

After you click Generate Features, various statistical features are derived for users and items. In this solution, we do not edit the derived features and keep the default settings. You can edit the derived features as needed. For more information, see Feature Configuration.

Resource configuration

Parameter	Description	Example
Common Statistics Period	This configuration is used for batch feature generation. To avoid generating too many features, this solution sets the statistics period to 3, 7, and 15 days to calculate statistics for users and items in the last 3, 7, and 15 days, respectively. If the number of user behaviors is small, you can try setting it to 21 days.	3,7,15
Key Behaviors	Select the configured behavior events. We recommend adding them in the order of expr (impression), click, and praise.	expr, click, praise
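
To make the statistics periods concrete, the following Python sketch counts each key behavior per user over the last 3, 7, and 15 days. It only illustrates the idea; the generated feature code is not shown in this topic, so the feature names (such as `click_cnt_7d`), the record layout, and the choice to include bizdate in each window are all assumptions.

```python
from collections import Counter
from datetime import date, timedelta


def windowed_counts(logs, bizdate, periods=(3, 7, 15),
                    events=("expr", "click", "praise")):
    """Count key behaviors per user over the last N days ending at bizdate.

    logs: iterable of (user_id, event, event_date) tuples (hypothetical layout).
    Returns {user_id: Counter({"<event>_cnt_<N>d": count})}.
    """
    features = {}
    for days in periods:
        start = bizdate - timedelta(days=days - 1)  # window includes bizdate
        for user_id, event, day in logs:
            if event in events and start <= day <= bizdate:
                name = f"{event}_cnt_{days}d"
                features.setdefault(user_id, Counter())[name] += 1
    return features


if __name__ == "__main__":
    demo_logs = [
        ("u1", "click", date(2024, 8, 15)),
        ("u1", "click", date(2024, 8, 10)),
        ("u1", "expr", date(2024, 8, 2)),
        ("u1", "praise", date(2024, 7, 1)),   # older than 15 days, ignored
        ("u2", "expr", date(2024, 8, 14)),
    ]
    for user, counts in windowed_counts(demo_logs, date(2024, 8, 15)).items():
        print(user, dict(counts))
```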

At the Recall Configuration node, click Add to the right of the target category, configure the parameters, click Confirm, and then click Next.

The following sections describe multiple recall configuration methods. To quickly guide you through the deployment process, you can configure only Global hot recall and etrec u2i recall. Other methods, such as vector recall and collaborative metric recall, are for reference only.

Resource configuration

Global hot recall

Global hot recall generates a ranking of popular items (`top_n` represents the number of items in the ranking) based on click event statistics. If you want to modify the scoring formula for popularity or the access event, you can do so after you generate the relevant code and deploy it to the DataWorks platform.

The scoring formula is click_uv*click_uv/(expr+adj_factor)*exp(-item_publish_days/fresh_decay_denom), where:

click_uv: For the same click-through rate (CTR), a higher number of clicks indicates greater popularity.
click_uv/(expr+adj_factor): The smoothed CTR, where click_uv is the number of unique users who clicked and expr is the number of impressions. The adjustment factor adj_factor prevents the denominator from being zero and corrects the CTR when the number of impressions is low. With only a few impressions, the raw CTR (click_uv/expr) is unreliable and can be close to 1. Adding adj_factor pulls it away from 1 and closer to the true CTR.
exp(-item_publish_days/fresh_decay_denom): Penalizes items that were published earlier. item_publish_days is the number of days from the publication date to the current date.
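
The scoring formula above can be written out as a short Python sketch. The formula itself comes from this topic; the default values of `adj_factor` and `fresh_decay_denom`, the function names, and the sample statistics are assumptions chosen only for illustration.

```python
import math


def hot_score(click_uv, expr, item_publish_days,
              adj_factor=10.0, fresh_decay_denom=20.0):
    """click_uv * click_uv / (expr + adj_factor) * exp(-days / fresh_decay_denom)."""
    smoothed_ctr = click_uv / (expr + adj_factor)
    freshness = math.exp(-item_publish_days / fresh_decay_denom)
    return click_uv * smoothed_ctr * freshness


def top_n_items(stats, top_n):
    """stats: {item_id: (click_uv, expr, item_publish_days)}. Returns the top_n item IDs."""
    scored = sorted(stats, key=lambda item_id: hot_score(*stats[item_id]), reverse=True)
    return scored[:top_n]


if __name__ == "__main__":
    demo_stats = {
        "a": (10, 90, 0),    # many clicks, fresh
        "b": (10, 90, 20),   # same clicks, published 20 days ago
        "c": (2, 2, 0),      # raw CTR of 1.0 but very few impressions
    }
    for item_id, s in demo_stats.items():
        print(item_id, round(hot_score(*s), 4))
    print(top_n_items(demo_stats, top_n=2))
```

In this example, item c has a raw CTR of 1.0 but ranks last, because the smoothing term and the extra click_uv factor favor items with more evidence.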

etrec u2i recall

etrec is an item-based collaborative filtering algorithm. For more information, see Collaborative filtering etrec.

Parameter	Description
Training Days	The number of days of behavior logs used for training. The default is 30 days. You can increase or decrease this value based on the log volume.
Recall Count	The final number of user-to-item pairs generated offline.
U2ITrigger	Items with which the user has interacted. For example, items that the user has clicked, favorited, or purchased. This generally does not include items with only impressions.
Behavior Time Window	The number of days of behavior data to collect. The default is 15, which means the last 15 days.
Behavior Time Attenuation Coefficient	A value between 0 and 1. A larger value indicates that past behaviors decay more rapidly, and their weight in constructing the trigger_item is smaller.
Trigger Selection Count	The number of item IDs to take for each user to perform a Cartesian product with the i2i data generated by etrec. We recommend a value between 10 and 50. If the number of triggers is too large, it will result in too many candidate items for recall.
U2i Behavior Weight	The weight assigned to each behavior event. The impression event should either not be set or be set to a weight of 0. We recommend not setting the impression event, which means that user impression data is skipped.
I2I Model Settings	The parameter settings for etrec. For more information, see Collaborative filtering etrec. We recommend not setting the number of related item selections too high.
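
The way these parameters interact can be sketched in Python. This is a simplified illustration, not the generated etrec code: the decay form `(1 - coefficient) ** days_ago`, the exclusion of items the user already interacted with, and all function and variable names are assumptions.

```python
from collections import defaultdict


def select_triggers(behaviors, weights, decay, trigger_count):
    """Pick the top trigger items for one user.

    behaviors: list of (item_id, event, days_ago).
    weights: event -> weight. Events without a weight (for example,
    impressions) are skipped.
    decay: value between 0 and 1; a larger value makes old behaviors fade faster
    (assumed form: weight * (1 - decay) ** days_ago).
    """
    scores = defaultdict(float)
    for item_id, event, days_ago in behaviors:
        weight = weights.get(event, 0.0)
        if weight > 0:
            scores[item_id] += weight * (1.0 - decay) ** days_ago
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:trigger_count]


def u2i_recall(triggers, i2i, recall_count):
    """Join trigger items with i2i neighbors and keep the best candidates.

    i2i: item_id -> list of (similar_item_id, similarity).
    """
    seen = {item for item, _ in triggers}
    candidates = defaultdict(float)
    for trigger, trigger_score in triggers:
        for item, similarity in i2i.get(trigger, []):
            if item not in seen:  # assumption: do not re-recommend triggers
                candidates[item] += trigger_score * similarity
    ranked = sorted(candidates.items(), key=lambda kv: kv[1], reverse=True)
    return [item for item, _ in ranked[:recall_count]]


if __name__ == "__main__":
    behaviors = [("i1", "click", 0), ("i2", "click", 10),
                 ("i3", "expr", 0), ("i1", "praise", 1)]
    triggers = select_triggers(behaviors, {"click": 1.0, "praise": 2.0},
                               decay=0.1, trigger_count=2)
    i2i = {"i1": [("i4", 0.5), ("i5", 0.2)], "i2": [("i5", 0.9), ("i1", 0.3)]}
    print(triggers)
    print(u2i_recall(triggers, i2i, recall_count=2))
```

The sketch also shows why a large Trigger Selection Count inflates the candidate set: every extra trigger adds all of its i2i neighbors to the join.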

Grouped hot recall

You can set up rankings based on attributes, such as city and gender, to provide initial personalized recall. For example, you can use a combination of gender and the bucket number of a numerical value as the group.
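
A minimal Python sketch of grouping by gender and a bucketed numerical value follows. It assumes age as the numerical feature and uses plain click counts as the popularity measure; the bucket boundaries and function names are hypothetical.

```python
from collections import Counter, defaultdict


def age_bucket(age, boundaries=(18, 25, 35, 50)):
    """Return the bucket number for age (boundaries are hypothetical)."""
    return sum(age >= b for b in boundaries)


def grouped_hot(clicks, users, top_n):
    """Build one hot list per (gender, age bucket) group.

    clicks: iterable of (user_id, item_id).
    users: user_id -> (gender, age).
    """
    per_group = defaultdict(Counter)
    for user_id, item_id in clicks:
        gender, age = users[user_id]
        per_group[(gender, age_bucket(age))][item_id] += 1
    return {group: [item for item, _ in counts.most_common(top_n)]
            for group, counts in per_group.items()}


if __name__ == "__main__":
    users = {1: ("male", 20), 2: ("male", 22), 3: ("female", 40)}
    clicks = [(1, "a"), (2, "a"), (2, "b"), (3, "c")]
    print(grouped_hot(clicks, users, top_n=1))
```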

swing u2i recall

Swing is a method for calculating item relevance, measuring item similarity based on the User-Item-User principle.
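
The User-Item-User idea can be illustrated with a simplified Python sketch of the commonly published Swing score: for two items, each pair of users who clicked both contributes 1 / (alpha + number of items the two users share). This is not the PAI implementation; user weighting is omitted, and the `alpha` value and names are assumptions.

```python
from collections import defaultdict
from itertools import combinations


def swing_similarity(user_items, alpha=1.0):
    """Compute simplified Swing scores.

    user_items: user_id -> set of clicked item_ids.
    Returns {(item_i, item_j): score} with item_i < item_j.
    """
    item_users = defaultdict(set)
    for user, items in user_items.items():
        for item in items:
            item_users[item].add(user)

    sim = defaultdict(float)
    for i, j in combinations(sorted(item_users), 2):
        common_users = sorted(item_users[i] & item_users[j])
        # Each pair of users who both clicked i and j forms a "swing".
        for u, v in combinations(common_users, 2):
            overlap = len(user_items[u] & user_items[v])
            sim[(i, j)] += 1.0 / (alpha + overlap)
    return dict(sim)


if __name__ == "__main__":
    demo = {"u1": {"a", "b"}, "u2": {"a", "b", "c"}, "u3": {"b", "c"}}
    for pair, score in swing_similarity(demo).items():
        print(pair, round(score, 4))
```

User pairs that share many items contribute less, which reduces the influence of very active users on the similarity.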

Vector recall

Two vector recall methods are provided: DSSM and MIND. Configure the recall target as follows:

Recall target name: Generally refers to whether an item was clicked. Set this to is_click.
Recall target selection: Set this to max(if(event='click', 1, 0)).
You can use the following code for execution:
```
-- Label each (request, user, item) group with 1 if it contains at least one click event.
select max(if(event='click',1,0)) is_click ,...
from ${behavior_table}
where dt between ${bizdate_start} and ${bizdate_end}
group by req_id,user_id,item
```
Where:
- ${behavior_table}: The behavior table.
- ${bizdate_start}: The start date of the behavior time window.
- ${bizdate_end}: The end date of the behavior time window.
- event: The event field in the ${behavior_table} table. Select a value based on the specific field.
- is_click: The target name.
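
The effect of the aggregation can be reproduced on a few rows in Python. This sketch only mirrors the `max(if(event='click',1,0))` logic for illustration; the row layout and function name are hypothetical.

```python
def build_click_labels(rows):
    """Return {(req_id, user_id, item): is_click}.

    rows: iterable of (req_id, user_id, item, event).
    A group is labeled 1 if any of its rows is a click event, else 0.
    """
    labels = {}
    for req_id, user_id, item, event in rows:
        key = (req_id, user_id, item)
        labels[key] = max(labels.get(key, 0), 1 if event == "click" else 0)
    return labels


if __name__ == "__main__":
    demo_rows = [
        ("r1", "u1", "i1", "expr"),
        ("r1", "u1", "i1", "click"),
        ("r1", "u1", "i2", "expr"),
    ]
    print(build_click_labels(demo_rows))
```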
The formulas for dimension calculation are as follows:
```
EMB_SQRT4_STEP8: ((8 + Pow(count, 0.25)) / 8) * 8
EMB_SQRT4_STEP4: ((4 + Pow(count, 0.25)) / 4) * 4
EMB_LN_STEP8:    ((8 + Log(count + 1)) / 8) * 8
EMB_LN_STEP4:    ((4 + Log(count + 1)) / 4) * 4
```
Here, count is the number of feature enumeration values. Use the Log function when the number of feature values is large.
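
The following Python sketch evaluates the four formulas for a given count. It assumes that the division is an integer (floor) division, so that the result is a multiple of the step, and that Log is the natural logarithm; neither detail is stated in this topic, and the function name is hypothetical.

```python
import math

# Rule name -> (growth function, step). Names follow the formulas above.
RULES = {
    "EMB_SQRT4_STEP8": (lambda count: count ** 0.25, 8),
    "EMB_SQRT4_STEP4": (lambda count: count ** 0.25, 4),
    "EMB_LN_STEP8": (lambda count: math.log(count + 1), 8),
    "EMB_LN_STEP4": (lambda count: math.log(count + 1), 4),
}


def embedding_dim(count, rule):
    """Return the embedding dimension for a feature with `count` enumeration values."""
    growth, step = RULES[rule]
    # Assumption: "/" in the formulas is integer division.
    return int((step + growth(count)) // step) * step


if __name__ == "__main__":
    for count in (100, 10_000, 100_000_000):
        print(count, {rule: embedding_dim(count, rule) for rule in RULES})
```

Under these assumptions, the logarithmic rules grow much more slowly than the fourth-root rules as count increases, which is consistent with using the Log function for features with many values.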

Cold-start recall

Cold-start recall uses DropoutNet. Similar to the DSSM dual-tower recall model, DropoutNet is divided into a user tower and an item tower. It is a recall model suitable for head users and items, as well as for long-tail and even brand-new users and items.

Global hot fallback recall

Global hot fallback recall is similar to global hot recall. Its main purpose is to ensure that a sufficient candidate set can be recalled if the global hot recall engine fails. Therefore, it is stored in Redis, and this output has only one row of data.
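
The fallback idea can be sketched in Python as follows. This is not how the PAI-Rec engine is implemented; the primary recall is modeled as a plain callable, the single Redis row is modeled as a list of item IDs, and all names are hypothetical.

```python
def recall_with_fallback(primary_recall, fallback_items, user_id, size):
    """Return `size` candidates, topping up from the fallback list if needed.

    primary_recall: callable (user_id, size) -> list of item IDs; may raise.
    fallback_items: the single precomputed hot list (stand-in for the Redis row).
    """
    try:
        items = list(primary_recall(user_id, size))
    except Exception:
        items = []  # primary recall failed; rely on the fallback list only
    if len(items) < size:
        seen = set(items)
        items += [item for item in fallback_items if item not in seen]
    return items[:size]


if __name__ == "__main__":
    def broken_recall(user_id, size):
        raise RuntimeError("recall engine unavailable")

    hot_row = ["a", "b", "c"]
    print(recall_with_fallback(broken_recall, hot_row, "u1", 2))
    print(recall_with_fallback(lambda u, s: ["x"], hot_row, "u1", 2))
```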

Collaborative metric learning i2i recall

Collaborative metric learning (CML) i2i recall calculates the similarity between items based on session click data.

At the Ranking Configuration node, click Add to the right of Fine-grained Ranking, configure the parameters as described in the following table, click Confirm, and then click Next.
Resource configuration
The platform provides multiple ranking models. For more information, see Ranking Models. The following section describes how to set the ranking parameters for the DBMTL multi-objective ranking model.
Click Add next to Refined Ranking Target Settings (labels) to add the following two labels:
- Target 1
- Target 2 (Note: The 'l' in 'ln' is a lowercase L)
At the Generate Script node, click Generate Deployment Script.
Important: After the script is successfully generated, the system generates an OSS address. This OSS path stores all the files to be deployed. You can save this address locally to manually deploy the script later.
After the script is generated, click OK in the dialog box. You are redirected to the Custom Recommendation Solution > Deployment Records page.
If the generation fails, view the run logs, analyze and resolve the specific error, and then generate the script again.

6. Deploy the recommendation solution

After the script is generated, you can deploy it to DataWorks in one of two ways.

Method 1: Deploy through the Personalized Recommendation Platform

Click Go To Deploy to the right of the target solution.
On the Deployment Preview page, in the File Diff section, select the files to deploy. Because this is the first deployment, click Select All and then click Deploy To DataWorks.
The page automatically returns to the Deployment Records page, which shows that the script deployment is in progress.
Wait a moment, then refresh the list and check the deployment status.
- If the deployment fails, click View Log in the Actions column, analyze and resolve the specific error, and then regenerate and deploy the script.
- When the Deployment Status changes to Success, the script is successfully deployed. You can go to the Data Development page of the DataWorks workspace configured for this solution to view the deployed code. For more information, see Data development process guide.
View the task data backfill process.
1. On the Custom Recommendation Solution > Deployment Records page, click Details in the Actions column of the successfully deployed recommendation solution.
2. On the Deployment Preview page, click View Task Data Backfill Process to understand the backfill process and related instructions to ensure data integrity.
3. Ensure that the user table, item table, and user behavior table partitions contain data for the last n days, where n is the sum of the training time window and the maximum feature time window. If you use the demo data from this topic, synchronize the latest data partitions. If you generate data using a Python script, backfill the data in the DataWorks Operation Center to produce the latest data partitions.
4. Click Create Deployment Task. Under the Backfill Task List, click Start Tasks Sequentially. Ensure that all tasks run successfully. If a task fails, click Details to view the log information, analyze and resolve the error, and then rerun the task. After a successful rerun, click Continue in the upper-left corner of the page until all tasks are successful.
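
The partition requirement in step 3 can be checked with a small Python sketch. It assumes yyyymmdd partition names and counts the latest partition as day 1 of the n days; the window lengths in the example (30 training days, 15 feature days) and the function names are illustrative only.

```python
from datetime import date, timedelta


def required_partitions(latest, training_days, max_feature_days):
    """List the n = training_days + max_feature_days partitions ending at `latest`."""
    n = training_days + max_feature_days
    return [(latest - timedelta(days=k)).strftime("%Y%m%d") for k in range(n)]


def missing_partitions(existing, latest, training_days, max_feature_days):
    """Return the required partitions that are absent from `existing`."""
    have = set(existing)
    return [p for p in required_partitions(latest, training_days, max_feature_days)
            if p not in have]


if __name__ == "__main__":
    needed = required_partitions(date(2024, 8, 15), training_days=30, max_feature_days=15)
    print(len(needed), needed[0], needed[-1])
    # Pretend the table only holds the 10 most recent partitions.
    print(len(missing_partitions(needed[:10], date(2024, 8, 15), 30, 15)))
```

Running the same check against the user, item, and behavior tables before you start the backfill tasks helps surface missing partitions early.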

Method 2: Deploy using Migration Assistant

After the script is successfully generated, you can also go to the DataWorks console and manually deploy the script using the Migration Assistant feature. The key parameters are described below. For other operations, see Create and view a DataWorks import task.

Import Name: Set this as prompted in the console.
Upload Method: Select OSS File, enter the OSS Link, and click Verify.
The deployment file is stored at the OSS address generated in Step 5, such as oss://examplebucket/algoconfig/plan/1723717372/package.zip. You can log on to the OSS console to obtain the URL of the corresponding file.

7. Freeze nodes

This topic uses demo data. After the data backfill is complete, freeze the tasks in the Operation Center (the three nodes from Step 2.2) to prevent them from being scheduled and run daily.

Go to the DataWorks Operation Center. Choose Periodic Task O&M > Periodic Tasks. Search for the name of the node that you created, such as rec_sln_demo_user_table_v1. Select the target node (Workspace.Node Name) and choose Pause (Freeze).