PAI Flow provides end-to-end machine learning process development capabilities. It offers the same workflow functionality as Designer, the visual modeling tool in Platform for AI (PAI), and supports recurring workflow scheduling.
Limits
Product limits:
PAI Flow only supports DataWorks Workspaces (new version).
PAI Flow currently only supports Source/Target and RAG Data Processing nodes.
PAI Flow only supports Serverless resource groups.
Region limits: Supported regions include China (Hangzhou), China (Shanghai), China (Beijing), China (Ulanqab), China (Shenzhen), China (Hong Kong), Singapore, Indonesia (Jakarta), Japan (Tokyo), Germany (Frankfurt), US (Silicon Valley), and US (Virginia).
Prerequisites
You have created a DataWorks Data Studio (new version) workspace and a Platform for AI (PAI) workspace.
When you create a workspace, select Create An AI Workspace With The Same Name. The system will automatically create a PAI workspace with the same name as the DataWorks workspace and bind them together.
For an existing workspace, enable Schedule PAI Algorithm Tasks in the Management Hub. This operation creates a PAI workspace with the same name as the DataWorks workspace.
Create a PAI Flow
Go to the DataStudio page.
Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose . On the page that appears, select the desired workspace from the drop-down list and click Go to Data Development.
In the project directory module of DataStudio, click the icon and select the menu option to create a PAI Flow. The PAI Flow orchestration page opens, where you can create nodes.
Develop a PAI Flow
PAI Flow supports various visual modeling nodes. You can design workflows and develop nodes based on different node types.
In PAI Flow, select the required nodes from the left side and drag them to the canvas. Design the workflow by manually connecting the nodes.
After completing the workflow design, click a node to configure it in the right panel.
| Node type | Node | Node description |
| --- | --- | --- |
| Source/Target | Read Table | Reads data from MaxCompute tables. By default, the component reads table data from the current project. |
| Source/Target | | Reads files or folders from an Object Storage Service (OSS) bucket path. |
| Source/Target | | Reads CSV files from OSS, HTTP, and HDFS. |
| Source/Target | | Writes input data to MaxCompute. |
| RAG Data Processing | RAG Text Parsing and Chunking | Reads and parses text files (HTML, PDF, Markdown, Text, and so on) from the input directory, generates continuous text blocks no larger than the specified block size, and saves them in JSON Lines format to the specified output path. |
| RAG Data Processing | RAG Vector Generation | Loads all parsed and chunked document files (JSON Lines format) from the specified directory, then uses an embedding model to generate text vectors. |
| RAG Data Processing | RAG Knowledge Base Index Synchronization | Synchronizes input data to the target knowledge base index. |
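The behavior described for RAG Text Parsing and Chunking (continuous text blocks no larger than a block size, written as one JSON object per line) can be pictured with a minimal Python sketch. This is an illustration only, not the node's actual implementation: it assumes the block size is counted in characters, and the names chunk_text, to_jsonl, chunk_id, and text are hypothetical.

```python
import json


def chunk_text(text: str, chunk_size: int) -> list[str]:
    """Split text into consecutive blocks of at most chunk_size characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def to_jsonl(chunks: list[str]) -> str:
    """Serialize chunks as JSON Lines: one JSON object per line.

    The field names chunk_id and text are assumptions made for this sketch.
    """
    return "\n".join(
        json.dumps({"chunk_id": i, "text": chunk}, ensure_ascii=False)
        for i, chunk in enumerate(chunks)
    )


SAMPLE = "PAI Flow parses documents into chunks."
chunks = chunk_text(SAMPLE, chunk_size=16)
jsonl = to_jsonl(chunks)
print(jsonl)
```

Because the blocks are consecutive and non-overlapping, concatenating them reproduces the original text. A real chunker may instead split on sentence or token boundaries.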
Note: When configuring file paths, you can include variables in the path, for example: https://examplebucket.oss-cn-hangzhou.aliyuncs.com/${variable}/example.csv. You can assign scheduling parameters to these variables so that recurring runs read from or write to different storage paths.
After you finish developing the nodes, configure scheduling settings for the PAI Flow in the right toolbar of the orchestration page so that it runs on a recurring schedule after it is published to the production environment.
Note: When configuring scheduling settings, you can select only Serverless resource groups as the scheduling resource group.
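The effect of placing a variable in a file path, as described in the note on path variables above, can be sketched in Python. This only imitates the substitution locally; in PAI Flow the value comes from the scheduling parameter you assign to the variable. The helper resolve_path and the sample date values are hypothetical.

```python
from string import Template

# Example path from the note above; ${variable} is the placeholder.
PATH_TEMPLATE = (
    "https://examplebucket.oss-cn-hangzhou.aliyuncs.com/${variable}/example.csv"
)


def resolve_path(template: str, **params: str) -> str:
    """Replace ${name} placeholders; raises KeyError if a value is missing."""
    return Template(template).substitute(**params)


# Hypothetical values, standing in for what a scheduling parameter
# might supply on two consecutive recurring runs.
for run_date in ("20240101", "20240102"):
    print(resolve_path(PATH_TEMPLATE, variable=run_date))
```

Each run resolves to a different folder, which is how a single node configuration can read from or write to a separate storage path per scheduled instance.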
Publish a PAI Flow
After you debug the PAI Flow and configure its scheduling, submit and publish the PAI Flow. The nodes run periodically according to the scheduling configuration only after the PAI Flow is published.
Click the Save button in the top toolbar to save the PAI Flow.
After saving, click the publish button in the top toolbar to open the publish panel. Click Start Publishing To Production, and the task is published according to the publishing check process.
More operations
After PAI Flow is successfully published, you can click the Go To O&M button in the publish panel to navigate to the Recurring tasks page to view the scheduling and running status of PAI Flow.
In the DAG, you can view the internal tasks of a PAI Flow only after you open the PAI Flow.