A workflow is an orderly process that consists of a series of jobs. Jobs in a workflow depend on each other and run in a specific order. If you want to run tasks at specific points in time, you can create a workflow and configure tasks and scheduling policies in the workflow. This topic describes how to create and run a workflow.
Prerequisites
You have created a workspace. For more information, see Manage workspaces.
You have developed and published jobs.
Create a workflow
Go to the Workflows page.
Log on to the E-MapReduce console.
In the left navigation bar, open the Spark page.
On the Spark page, click the name of the target workspace.
On the EMR Serverless Spark page, click Workflows in the left navigation bar.
On the Workflows page, click Create Workflow.
In the Create Workflow panel, enter the required information and click Next.
Parameter
Description
Workflow Name
The name of the workflow. The name must be unique in a workspace.
Resource Queue
The default resource queue for the workflow.
Note: The resource queue specified for a workflow node can override the default resource queue.
Other Settings
Scheduling Type
The mode in which the workflow is run in the production environment. Valid values:
None (Manual): The workflow is manually run. This is the default value.
Scheduler: The workflow runs based on the settings of the scheduler. The workflow can be scheduled to run by minute, hour, or day.
If you set Scheduling Type to Scheduler, you must also configure the Scheduling Cycle and Scheduling Start Time parameters.
Scheduling Cycle
The scheduling cycle of the workflow. This parameter determines the scheduling frequency of the workflow in the production environment. DataWorks generates instances for the node based on the scheduling frequency and the number of scheduling cycles of the node. The node is run as an instance. This parameter is required only when Scheduling Type is set to Scheduler.
Valid values:
Days: Nodes run once a day at the specified point in time.
Hours: Nodes run at the specified interval of N hours during the specified period of time every day.
Minutes: Nodes run at the specified interval of N minutes during the specified period of time every day.
Scheduling Start Time
The date and time when the workflow is scheduled to run. The default value is the current time. This parameter is required only when Scheduler is selected.
Important: After you create a workflow whose Scheduling Type is Scheduler, you must turn on the Scheduling Status switch on the Workflows page. Otherwise, the workflow is not triggered at the specified effective time.
Number Of Retries
The number of retries after a workflow node fails to run. By default, no retry is performed.
Note: The number of retries specified for a workflow node can override the value of this parameter.
Failure Notification
The email address to which a notification is sent after the workflow fails to run.
Tags
The tags that are used to identify the workflow. You can specify the key and value of each tag.
Add a node in the workflow.
On the Edit Workflow page, click Add Node at the bottom.
In the Add Node panel that appears, configure the node parameters.
Parameter
Description
Source File Path
The job path that corresponds to the node. The job in the path must be published.
Node Type
The type of the node. By default, the system infers the type of the node based on the job in the corresponding path.
Node Name
The name of the node. The system automatically enters a node name based on the value of Source File Path. You can also specify a name based on your business requirements.
Upstream Node
The upstream node of the current node. The upstream node must be a node that is created in the current workflow.
You do not need to specify an upstream node for the first node in the workflow.
Number Of Retries
The number of retries after the node fails to run. If you do not set this parameter, the number of retries defined for the workflow is used. By default, no retry is performed.
Timeout (seconds)
The timeout period for a single run of the node. By default, no limit is imposed.
Status Subscription
The email address to which a notification is sent when the node is in the specified state.
Tags
The tags of the node. By default, the workflow_name and task_name tags are provided for each node.
Resource Queue
The resource queue that is used to run the node. By default, the resource queue that you specify for the workflow is used. You can configure a resource queue for the node to override the resource queue that you specified for the workflow.
Important: After you specify a resource queue for the workflow node, the specified resource queue prevails even if you modify the resource queue configured for the workflow.
Note: If you use an SQL job, you can configure the parameters in the Task Configuration section based on your business requirements. By default, task parameters are inherited from the task template. You can modify the task template to adjust the default values. For more information about the parameters, see Manage configurations.
Click Save.
After the initial node is configured, you can click Add Node at the bottom of the page to add more nodes.
Deploy the workflow.
Click Deploy Workflow in the upper-right corner.
In the Deploy dialog box, enter deployment information and click OK.
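The way Upstream Node, Resource Queue, and Number Of Retries interact can be sketched in a few lines of Python. This is only an illustrative model, not the product's implementation or API: the `Node` class, the `run_plan` function, and the node and queue names are hypothetical. It assumes that a node runs after all of its upstream nodes and that a node-level setting, when present, overrides the workflow-level default.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from typing import Optional


@dataclass
class Node:
    """Hypothetical stand-in for a workflow node (not a product API)."""
    name: str
    upstream: list = field(default_factory=list)  # names of upstream nodes
    resource_queue: Optional[str] = None  # None: use the workflow queue
    retries: Optional[int] = None  # None: use the workflow retries


def run_plan(nodes, workflow_queue, workflow_retries=0):
    """Return (name, queue, retries) tuples in an order that respects upstream nodes."""
    by_name = {n.name: n for n in nodes}
    # TopologicalSorter takes {node: predecessors}; static_order() yields
    # every predecessor before the nodes that depend on it.
    order = TopologicalSorter({n.name: n.upstream for n in nodes}).static_order()
    plan = []
    for name in order:
        node = by_name[name]
        # A node-level value overrides the workflow-level default.
        queue = node.resource_queue if node.resource_queue is not None else workflow_queue
        retries = node.retries if node.retries is not None else workflow_retries
        plan.append((name, queue, retries))
    return plan


# Hypothetical three-node workflow: extract -> load -> report.
nodes = [
    Node("load", upstream=["extract"], resource_queue="etl_queue"),
    Node("extract"),  # first node: no upstream node
    Node("report", upstream=["load"], retries=2),
]
for step in run_plan(nodes, workflow_queue="default_queue"):
    print(step)
```

In this sketch, `extract` and `report` fall back to the workflow queue, while `load` keeps `etl_queue` regardless of the workflow default, which mirrors the override behavior described for the Resource Queue parameter.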
Run a workflow
Each time a workflow runs, a workflow instance is generated on the Workflow Instance List tab of the workflow details page.
Debug run
When you edit a workflow, you can debug the workflow of the latest version.
On the Edit Workflow page, click Debug Run.
In the Debug Run dialog box, select a resource queue for the development environment and click Run.
System scheduling
If you select Scheduler for Scheduling Type when you create a workflow and turn on the Scheduling Status switch after the workflow is created, the workflow is triggered to run at the specified effective time.
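To make the Hours and Minutes scheduling cycles more concrete, the following Python sketch lists the times at which a workflow would be triggered on a given day. It is a simplified model under stated assumptions: the function name `daily_run_times` is hypothetical, and treating the end of the daily period as inclusive is an assumption rather than documented behavior.

```python
from datetime import date, datetime, time, timedelta


def daily_run_times(day, start, end, interval):
    """List run times on `day` from `start` to `end`, one every `interval`.

    Assumption: the end of the daily period is inclusive.
    """
    if interval <= timedelta(0):
        raise ValueError("interval must be positive")
    current = datetime.combine(day, start)
    last = datetime.combine(day, end)
    runs = []
    while current <= last:
        runs.append(current)
        current += interval
    return runs


# Hours cycle: every 6 hours between 00:00 and 23:59.
hourly = daily_run_times(date(2024, 1, 1), time(0, 0), time(23, 59), timedelta(hours=6))
print([t.strftime("%H:%M") for t in hourly])  # ['00:00', '06:00', '12:00', '18:00']

# Minutes cycle: every 30 minutes between 09:00 and 10:00.
minutely = daily_run_times(date(2024, 1, 1), time(9, 0), time(10, 0), timedelta(minutes=30))
print([t.strftime("%H:%M") for t in minutely])  # ['09:00', '09:30', '10:00']
```

A Days cycle corresponds to a single run per day, which this model produces when `start` and `end` are the same time.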
Trigger run
On the Workflows page, click the name of the target workflow, and then click Run in the upper-right corner. Select a scheduling method to trigger the current workflow to run.
Manual Run (default): The workflow runs immediately when you trigger it, regardless of the scheduling settings.
Backfill: Runs the workflow for a historical time period. This method is typically used to rerun workflows that did not run or that failed. If you use the backfill method, you must configure the following parameters:
Parameter
Description
Business Cycle
The system generates corresponding workflow instances based on the time range you select.
You can also select a time range later than the current time. In this case, the backfill workflow instance automatically starts to run when the current time passes the selected time.
Backfill workflow instances are generated and executed only when the workflow's scheduled time falls within the selected business cycle.
If time variables exist in the workflow (for example, ${ds}), the system automatically replaces these variables with the time of the selected business cycle.
Resource Queue
By default, the resource queue configured for the workflow is used. You can select another available queue in the production environment from the drop-down list.
Remarks
You can enter descriptive information for the backfill workflow to facilitate subsequent management and troubleshooting.
More Settings
Failure Notification: The email addresses that receive an alert when a backfill workflow fails.
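The Business Cycle rules can be illustrated with a short Python sketch. It models a workflow that is scheduled once a day, generates an instance only when the scheduled time falls within the selected range, and replaces `${ds}` with the date of that instance. The function name `backfill_instances`, the sample SQL text, and the `%Y-%m-%d` format used for `${ds}` are assumptions made for illustration and are not taken from the product.

```python
from datetime import datetime, time, timedelta


def backfill_instances(cycle_start, cycle_end, run_at, command_template):
    """Build one (scheduled_time, command) pair per day whose scheduled
    time falls within [cycle_start, cycle_end]."""
    instances = []
    day = cycle_start.date()
    while day <= cycle_end.date():
        scheduled = datetime.combine(day, run_at)
        # Only scheduled times inside the business cycle produce an instance.
        if cycle_start <= scheduled <= cycle_end:
            ds = scheduled.strftime("%Y-%m-%d")  # assumed format for ${ds}
            instances.append((scheduled, command_template.replace("${ds}", ds)))
        day += timedelta(days=1)
    return instances


# Daily schedule at 02:00; business cycle from Jan 1 03:00 to Jan 3 01:00.
result = backfill_instances(
    datetime(2024, 1, 1, 3, 0),
    datetime(2024, 1, 3, 1, 0),
    time(2, 0),
    "SELECT * FROM t WHERE dt='${ds}'",  # hypothetical job text
)
for scheduled, command in result:
    print(scheduled, command)
```

In this example only the run scheduled for 02:00 on January 2 falls within the selected range, so a single instance is generated.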
View running status
You can view the running status of all workflow instances and nodes of a workflow in the Workflow Running Status and Workflow Node Running Status columns of the target workflow.
Status of workflow runs
Status: Description
Blue: Running
Green: Succeeded
Red: Failed
Purple: Pending
Status of workflow nodes
Status: Description
Blue: Running
Green: Succeeded
Red: Failed
Yellow: Retrying
Purple: Pending
References
For more information about concepts related to job orchestration, see Terms.
For more information about viewing workflow instances, node instances, and other information, see Manage workflow instances and node instances.