PAI Flow lets you build and schedule end-to-end machine learning workflows using a visual drag-and-drop interface. It offers the same capabilities as Designer in Platform for AI (PAI) and supports periodic scheduling.
Limitations
Product limits
- PAI Flow is supported only in DataWorks Workspace (New Version).
- PAI Flow currently supports only the Source/Target and RAG Data Processing node types.
- PAI Flow supports only Serverless resource groups as the scheduling resource group.
Region limits
PAI Flow is available in the following regions: China (Hangzhou), China (Shanghai), China (Beijing), China (Ulanqab), China (Shenzhen), China (Hong Kong), Singapore, Indonesia (Jakarta), Japan (Tokyo), Germany (Frankfurt), US (Silicon Valley), and US (Virginia).
Prerequisites
Before you begin, make sure that you have:
- A DataWorks DataStudio (New) workspace provisioned.
- A Platform for AI workspace provisioned:
  - New workspace: When you create a workspace, select Create An AI Workspace With The Same Name. This automatically creates a PAI workspace with the same name.
  - Existing workspace: Enable the Schedule PAI Algorithm Tasks feature for an existing workspace in the Management Center. This automatically creates a PAI workspace with the same name as the DataWorks workspace.
Create a PAI Flow node
- Go to the DataStudio page: Log in to the DataWorks console. In the top navigation bar, select the target region. In the left-side navigation pane, choose Data Development and O&M > Data Development. Select the target workspace from the drop-down list and click Go to Data Development.
- In the DataStudio project folder, click the icon and select New Node > Algorithm > PAI Flow. A new PAI Flow node is created, and the PAI Flow orchestration page opens.
Develop a PAI Flow node
PAI Flow provides a set of visual modeling nodes for designing workflows and a canvas for connecting them into a pipeline.
- From the left panel, select a node, drag it onto the canvas, and connect the nodes to design the flow.
- Click a node to configure it in the right-side pane. The following nodes are available:
Note: File paths support variables. For example: https://examplebucket.oss-cn-hangzhou.aliyuncs.com/${variable}/example.csv. Use scheduling parameters as variables so that each run of the recurring schedule reads from or writes to a different storage path.

| Node type | Node | Description |
| --- | --- | --- |
| Source/Destination | Read Table | Reads data from a MaxCompute table. By default, reads from the current project. |
| Source/Destination | Read OSS Data | Reads a file or folder from a path in an Object Storage Service (OSS) bucket. |
| Source/Destination | Read CSV File | Reads CSV file data from OSS, HTTP, or Hadoop Distributed File System (HDFS). |
| Source/Destination | Write to Table | Writes input data to MaxCompute. |
| Retrieval-augmented generation (RAG) Data Processing | RAG Text Parsing and Splitting | Reads and parses text files (HTML, PDF, Markdown, and plain text) from the input directory. Generates consecutive text blocks up to the specified block size and saves them to the output path in JSONline format. |
| RAG Data Processing | RAG Embedding Generation | Loads all parsed and split document files (JSONline format) from the specified directory and uses an embedding model to generate text embeddings. |
| RAG Data Processing | RAG Knowledge Base Index Synchronization | Synchronizes input data to the destination knowledge base index. |
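To picture what the RAG Text Parsing and Splitting node produces, here is a minimal sketch of fixed-size chunking written to JSONline format. The splitting logic and the field names (`id`, `text`) are illustrative assumptions, not the node's actual implementation, which parses HTML, PDF, and Markdown and may split on richer boundaries:

```python
import json

def split_text(text: str, block_size: int = 500) -> list[str]:
    """Split text into consecutive blocks of at most block_size characters.

    Illustrative only: the real node parses structured formats and
    applies its own block-size semantics.
    """
    return [text[i:i + block_size] for i in range(0, len(text), block_size)]

def write_jsonl(chunks: list[str], path: str) -> None:
    # JSONline format: one JSON object per line. Field names are assumptions.
    with open(path, "w", encoding="utf-8") as f:
        for idx, chunk in enumerate(chunks):
            f.write(json.dumps({"id": idx, "text": chunk}, ensure_ascii=False) + "\n")

chunks = split_text("some long document text ... " * 100, block_size=200)
write_jsonl(chunks, "chunks.jsonl")
```

The resulting JSONline file is the kind of input the RAG Embedding Generation node then loads to compute text embeddings.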
- After designing the flow, open Scheduling Configuration in the right-hand toolbar of the orchestration page to set the schedule for the node.
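Conceptually, each scheduled run resolves the scheduling parameters embedded in file paths before reading or writing, so successive runs target different storage paths. A minimal sketch of that substitution, assuming a parameter named `bizdate` (the resolver below is an illustration, not DataWorks' actual mechanism):

```python
import re

def resolve_path(template: str, params: dict[str, str]) -> str:
    """Replace ${name} placeholders in a storage path with scheduling
    parameter values. Unknown placeholders are left untouched."""
    return re.sub(
        r"\$\{(\w+)\}",
        lambda m: params.get(m.group(1), m.group(0)),
        template,
    )

# Example: a daily run where the scheduler supplies bizdate=20240101
# (parameter name and value are assumptions for illustration).
path = resolve_path(
    "https://examplebucket.oss-cn-hangzhou.aliyuncs.com/${bizdate}/example.csv",
    {"bizdate": "20240101"},
)
# path -> "https://examplebucket.oss-cn-hangzhou.aliyuncs.com/20240101/example.csv"
```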
Publish a PAI Flow node
After testing the PAI Flow node and configuring its scheduling settings, commit and publish the node so that it runs periodically.
- In the toolbar, click Save.
- Click the icon in the toolbar to open the publishing panel. For details, see Publish tasks.
- Click Publish to Production.
What's next
After publishing, click Go to O\&M in the publishing panel. You are redirected to the Recurring Tasks page, where you can monitor the scheduling and run status of the node.
In the directed acyclic graph (DAG), internal tasks are visible only after you open the PAI Flow node.