PAI Flow lets you build and schedule end-to-end machine learning workflows using a visual drag-and-drop interface. It offers the same capabilities as Designer in Platform for AI (PAI) and supports periodic scheduling.
Limitations
Product limits
- PAI Flow is supported only in DataWorks Workspace (New Version).
- PAI Flow currently supports only the Source/Target and RAG Data Processing node types.
- PAI Flow supports only Serverless resource groups as the scheduling resource group.
Region limits
PAI Flow is available in the following regions: China (Hangzhou), China (Shanghai), China (Beijing), China (Ulanqab), China (Shenzhen), China (Hong Kong), Singapore, Indonesia (Jakarta), Japan (Tokyo), Germany (Frankfurt), US (Silicon Valley), and US (Virginia).
Prerequisites
Before you begin, make sure that you have:
- A DataWorks DataStudio (New) workspace provisioned.
- A Platform for AI workspace provisioned:
  - New workspace: When you create a workspace, select Create An AI Workspace With The Same Name. This automatically creates a PAI workspace with the same name.
  - Existing workspace: Enable the Schedule PAI Algorithm Tasks feature for an existing workspace in the Management Center. This automatically creates a PAI workspace with the same name as the DataWorks workspace.
Create a PAI Flow node
- Go to the DataStudio page: Log in to the DataWorks console. In the top navigation bar, select the target region. In the left-side navigation pane, choose Data Development and O&M > Data Development. Select the target workspace from the drop-down list and click Go to Data Development.
- In the DataStudio project folder, click the icon and select New Node > Algorithm > PAI Flow. A new PAI Flow node is created, and the PAI Flow orchestration page opens.
Develop a PAI Flow node
PAI Flow provides a set of visual modeling nodes for designing workflows and a canvas for connecting them into a pipeline.
- From the left panel, select a node, drag it onto the canvas, and connect the nodes to design the flow.
- Click a node to configure it in the right-side pane. The following nodes are available:
Note: File paths support variables. For example: https://examplebucket.oss-cn-hangzhou.aliyuncs.com/${variable}/example.csv. Use scheduling parameters as variables so that each run of the recurring schedule reads from or writes to a different storage path.

| Node type | Node | Description |
| --- | --- | --- |
| Source/Destination | Read Table | Reads data from a MaxCompute table. By default, reads from the current project. |
| Source/Destination | Read OSS Data | Reads a file or folder from a path in an Object Storage Service (OSS) bucket. |
| Source/Destination | Read CSV File | Reads CSV file data from OSS, HTTP, or Hadoop Distributed File System (HDFS). |
| Source/Destination | Write to Table | Writes input data to MaxCompute. |
| Retrieval-augmented generation (RAG) Data Processing | RAG Text Parsing and Splitting | Reads and parses text files (HTML, PDF, Markdown, and plain text) from the input directory. Generates consecutive text blocks up to the specified block size and saves them to the output path in JSONline format. |
| RAG Data Processing | RAG Embedding Generation | Loads all parsed and split document files (JSONline format) from the specified directory and uses an embedding model to generate text embeddings. |
| RAG Data Processing | RAG Knowledge Base Index Synchronization | Synchronizes input data to the destination knowledge base index. |
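To picture what the RAG Text Parsing and Splitting node produces, here is a minimal sketch of fixed-size chunking written to JSONline format. The splitting logic and the field names (`id`, `text`) are illustrative assumptions, not the node's actual implementation, which parses HTML, PDF, and Markdown and may split on richer boundaries:

```python
import json

def split_text(text: str, block_size: int = 500) -> list[str]:
    """Split text into consecutive blocks of at most block_size characters.

    Illustrative only: the real node parses structured formats and
    applies its own block-size semantics.
    """
    return [text[i:i + block_size] for i in range(0, len(text), block_size)]

def write_jsonl(chunks: list[str], path: str) -> None:
    # JSONline format: one JSON object per line. Field names are assumptions.
    with open(path, "w", encoding="utf-8") as f:
        for idx, chunk in enumerate(chunks):
            f.write(json.dumps({"id": idx, "text": chunk}, ensure_ascii=False) + "\n")

chunks = split_text("some long document text ... " * 100, block_size=200)
write_jsonl(chunks, "chunks.jsonl")
```

The resulting JSONline file is the kind of input the RAG Embedding Generation node then loads to compute text embeddings.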
- After designing the flow, open Scheduling Configuration in the right-hand toolbar of the orchestration page to set the schedule for the node.
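Conceptually, each scheduled run resolves the scheduling parameters embedded in file paths before reading or writing, so successive runs target different storage paths. A minimal sketch of that substitution, assuming a parameter named `bizdate` (the resolver below is an illustration, not DataWorks' actual mechanism):

```python
import re

def resolve_path(template: str, params: dict[str, str]) -> str:
    """Replace ${name} placeholders in a storage path with scheduling
    parameter values. Unknown placeholders are left untouched."""
    return re.sub(
        r"\$\{(\w+)\}",
        lambda m: params.get(m.group(1), m.group(0)),
        template,
    )

# Example: a daily run where the scheduler supplies bizdate=20240101
# (parameter name and value are assumptions for illustration).
path = resolve_path(
    "https://examplebucket.oss-cn-hangzhou.aliyuncs.com/${bizdate}/example.csv",
    {"bizdate": "20240101"},
)
# path -> "https://examplebucket.oss-cn-hangzhou.aliyuncs.com/20240101/example.csv"
```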
Publish a PAI Flow node
After testing the PAI Flow node and configuring its scheduling settings, commit and publish the node so that it runs periodically.
- In the toolbar, click Save.
- Click the icon in the toolbar to open the publishing panel. For details, see Publish tasks.
- Click Publish to Production.
What's next
After publishing, click Go to O\&M in the publishing panel. You are redirected to the Recurring Tasks page, where you can monitor the scheduling and run status of the node.
In the directed acyclic graph (DAG), internal tasks are visible only after you open the PAI Flow node.