In a data development project of E-MapReduce (EMR), you can define a group of dependent
jobs and create a workflow that runs the jobs in the order determined by their dependencies.
An EMR workflow can be represented as a directed acyclic graph (DAG), which allows big
data jobs that do not depend on each other to run in parallel. You can schedule workflows
and view their status in the EMR console.
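The following minimal sketch illustrates how a DAG of dependent jobs determines the run order and which jobs can run in parallel. It is not EMR code: the job names and the `execution_waves` helper are hypothetical, and only the Python standard library (`graphlib`, Python 3.9 or later) is used.

```python
from graphlib import TopologicalSorter

# Hypothetical job names. Each key maps to the set of jobs it depends on.
dependencies = {
    "clean_logs": {"ingest_logs"},
    "clean_orders": {"ingest_orders"},
    "join_tables": {"clean_logs", "clean_orders"},
    "publish_report": {"join_tables"},
}


def execution_waves(deps):
    """Group jobs into waves. Jobs in the same wave do not depend on each other."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the graph contains a cycle
    waves = []
    while sorter.is_active():
        # All jobs whose dependencies have finished are ready at the same time.
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


for number, wave in enumerate(execution_waves(dependencies), start=1):
    print(number, wave)
```

With the sample data, the two ingest jobs form the first wave and can run in parallel, while `publish_report` runs last because it depends on every other job.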
Create a workflow
Perform the following steps to create a workflow:
- Go to the Data Platform tab.
  - Log on to the Alibaba Cloud EMR console by using your Alibaba Cloud account.
  - In the top navigation bar, select the region where your cluster resides and select a resource group based on your business requirements.
  - Click the Data Platform tab.
  - In the Projects section of the page that appears, find your project and click Workflows in the Actions column.
- Create a workflow.
  - In the Workflows pane on the left side of the page that appears, right-click the folder in which you want to create the workflow and select Create Workflow.
  - In the Create Workflow dialog box, specify Workflow Name, Description, Select Resource Group, and Target Cluster.
    Valid values of the Target Cluster parameter:
    - Select Existing Cluster: When the workflow is executed, the jobs run on the cluster that you selected.
    - Create Cluster from Template: When the workflow is executed, the jobs run on a temporary cluster that is created by using the cluster template that you selected. When the workflow ends, the cluster is automatically released. For more information, see Create a cluster template.
    Note: Only the clusters that are associated with the project are displayed in the Select Existing Cluster drop-down list. Before you can select a different cluster, you must disassociate the existing clusters from the project. For more information, see Manage projects.
  - Click OK.
After the workflow is created, you can edit and configure the workflow.
Edit a workflow
- Drag different types of job nodes to the canvas for editing a workflow.
After you drag a node of a specific type to the canvas, you can configure the parameters
that are described in the following table in the Edit Node panel.
| Parameter | Description |
| --- | --- |
| Associated Job | Select a job of the same type as the job node from the Associated Job drop-down list. |
| Customize Job Configuration | You can customize job configurations based on your business requirements. If you turn on this switch, you can change the value of the Target Cluster parameter. If you turn off this switch, the jobs that are associated with the job node run on the cluster that you selected when you created the workflow. By default, the Customize Job Configuration switch is turned off. |
- Associate job nodes.
On the canvas, drag a line from a job node to associate this job node with other job
nodes based on the dependencies between the jobs. Arrows indicate the direction of
the workflow.
- Configure controller nodes to complete the design of the workflow.
Drag the END node from the Controller Node section to the canvas. Then, associate the
START node, job nodes, and END node to complete the design of the workflow. You can click
Auto Adjust in the upper-right corner to adjust the layout of the job nodes in the workflow.
When you edit a workflow, you can click Lock in the upper-right corner to lock the workflow.
After the workflow is locked, only you can edit or run it. Other members in the project
can edit the workflow only after it is unlocked.
Note: Only the RAM user that locked the workflow and the Alibaba Cloud account can unlock
the workflow.
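The locking rules described above can be summarized in a short sketch. This is an illustration of the stated rules only, not the EMR implementation: the `WorkflowLock` class, the user names, and the `ACCOUNT_OWNER` identifier are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical identifier that stands for the Alibaba Cloud account.
ACCOUNT_OWNER = "alibaba-cloud-account"


@dataclass
class WorkflowLock:
    """Tracks which user, if any, currently holds the lock on a workflow."""

    locked_by: Optional[str] = None

    def lock(self, user: str) -> None:
        if self.locked_by is not None and self.locked_by != user:
            raise PermissionError(f"workflow is already locked by {self.locked_by}")
        self.locked_by = user

    def can_edit_or_run(self, user: str) -> bool:
        # While the workflow is locked, only the user who locked it can edit or run it.
        return self.locked_by is None or self.locked_by == user

    def unlock(self, user: str) -> None:
        if self.locked_by is None:
            return
        # Only the locking user and the Alibaba Cloud account can unlock.
        if user not in (self.locked_by, ACCOUNT_OWNER):
            raise PermissionError(f"{user} cannot unlock this workflow")
        self.locked_by = None


lock = WorkflowLock()
lock.lock("ram-user-a")
print(lock.can_edit_or_run("ram-user-b"))  # another project member is blocked
lock.unlock(ACCOUNT_OWNER)
print(lock.can_edit_or_run("ram-user-b"))  # editable again after unlocking
```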
Configure workflow scheduling
You can enable the workflow scheduling feature and configure scheduling-related parameters.
The workflow then runs periodically based on the parameter settings, and its jobs
are delivered to the specified cluster for running. Perform the following steps to configure
the parameters on the Basic Attributes, Scheduling Settings, and Alert Settings tabs
in the Workflow Scheduling panel:
- Go to the Data Platform tab.
  - Log on to the Alibaba Cloud EMR console by using your Alibaba Cloud account.
  - In the top navigation bar, select the region where your cluster resides and select a resource group based on your business requirements.
  - Click the Data Platform tab.
  - In the Projects section of the page that appears, find your project and click Workflows in the Actions column.
- On the workflow design page, click Configure.
- On the Basic Attributes tab of the Workflow Scheduling panel, modify the workflow description, resource group, and the cluster used to run the jobs in the workflow based on your business requirements.
- After the basic attributes are modified, click the Scheduling Settings tab and configure the parameters related to workflow scheduling.
| Category | Parameter | Description |
| --- | --- | --- |
| | Scheduling Status | Valid values: Start and Stop. Start: Start workflow scheduling. After you select Start for Scheduling Status, Scheduling appears in the upper-right corner of the workflow editing canvas, which indicates that the workflow is being scheduled. Stop: Stop workflow scheduling. |
| Time-based Scheduling | Start Time | The time when workflow scheduling starts. |
| Time-based Scheduling | End Time | The time when workflow scheduling ends. This parameter is optional. |
| Time-based Scheduling | Recurrence | The cycle of workflow scheduling. |
| Time-based Scheduling | CRON Expression | The CRON expression that is used to specify the cycle of workflow scheduling. |
| Dependency-based Scheduling | Project | The project to which the dependent workflow of the current workflow belongs. This parameter is optional. |
| Dependency-based Scheduling | Dependent Workflow | The dependent workflow of the current workflow. The current workflow is executed only after the dependent workflow ends. This parameter is optional. |
- Click the Alert Settings tab and configure the alert parameters.
| Parameter | Description |
| --- | --- |
| Execution Failed | Specifies whether to send a notification to an alert contact group or a DingTalk alert group if the workflow fails. |
| Actions on Failures | Specifies whether to send a notification to an alert contact group or a DingTalk alert group if a job node in the workflow fails to run. |
| Executed | Specifies whether to send a notification to an alert contact group or a DingTalk alert group if the workflow succeeds. |
| Action on Startup Timeout | Specifies whether to send a notification to an alert contact group or a DingTalk alert group if a job node in the workflow does not start within 30 minutes after it is delivered to a cluster. |
| Node execution timed out | Specifies whether to send a notification to an alert contact group or a DingTalk alert group if the running duration of a job node exceeds the expected maximum running duration in the job configuration. |
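The two timeout conditions in the alert settings can be expressed as simple time comparisons. The following sketch only illustrates those conditions. The `timeout_alerts` function, its parameters, and the alert labels are hypothetical. The 30-minute threshold comes from the Action on Startup Timeout description, and the expected maximum running duration stands for the value in the job configuration.

```python
from datetime import datetime, timedelta
from typing import List, Optional

# Threshold taken from the Action on Startup Timeout description.
STARTUP_TIMEOUT = timedelta(minutes=30)


def timeout_alerts(
    delivered_at: datetime,
    started_at: Optional[datetime],
    now: datetime,
    expected_max_runtime: Optional[timedelta],
) -> List[str]:
    """Return the hypothetical alert labels that apply to one job node."""
    alerts = []
    # Time spent waiting between delivery to the cluster and the actual start.
    waited = (started_at or now) - delivered_at
    if waited > STARTUP_TIMEOUT:
        alerts.append("startup_timeout")
    # Running duration compared with the expected maximum, if one is configured.
    if (
        started_at is not None
        and expected_max_runtime is not None
        and now - started_at > expected_max_runtime
    ):
        alerts.append("execution_timeout")
    return alerts


delivered = datetime(2024, 1, 1, 8, 0)
print(timeout_alerts(delivered, None, delivered + timedelta(minutes=31), None))
```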
Run a workflow
You can specify the business time of a workflow. Time variables in the jobs of the workflow
are calculated based on the specified business time. By specifying a business time, you can
rerun the workflow instances of a specific period of time, either a single workflow instance
or multiple workflow instances at a time. If no time variables are
configured for your jobs, you can select Execute.
- Go to the Data Platform tab.
  - Log on to the Alibaba Cloud EMR console by using your Alibaba Cloud account.
  - In the top navigation bar, select the region where your cluster resides and select a resource group based on your business requirements.
  - Click the Data Platform tab.
  - In the Projects section of the page that appears, find your project and click Workflows in the Actions column.
- Run your workflow.
  - On the page that appears, select a workflow and click Run in the upper-right corner.
  - In the Run Workflow dialog box, configure the runtime parameters.
    You can select a running mode based on your business requirements. The following table describes the supported running modes: Execute and Run Periodically.
    | Mode | Description |
    | --- | --- |
    | Execute | Immediately runs a workflow. You can use the specified time as the business time of the workflow. Time-related variables are calculated based on the business time. |
    | Run Periodically | Runs multiple workflow instances at a time. The trigger time of specific scheduling rules is used as the business time of the workflow instances, and time-related variables are calculated based on the business time. A single run supports a maximum of 100 points in time. |
    If you select Run Periodically for Mode, configure the following parameters:
    - Start Time: the time when workflow scheduling starts.
    - End Time: the time when workflow scheduling ends. This parameter is optional.
    - Recurrence: the cycle of workflow scheduling.
    - CRON Expression: the CRON expression that is used to specify the cycle of workflow scheduling.
    - Skip Successful Nodes: specifies whether to skip successful workflow instances. You can determine whether to turn on this switch based on your business requirements. After you turn on the Skip Successful Nodes switch, if the workflow instance that runs at a specific business time is successful, the system skips that workflow instance and continues to run the workflow instances that failed at other business times.
  - Click OK.
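The Run Periodically settings can be pictured as generating one business time per trigger and then optionally dropping the business times that already succeeded. The following sketch is an illustration under stated assumptions, not the scheduler that EMR uses: it models Recurrence as a fixed interval instead of a CRON expression, requires an end time, treats the end time as inclusive, and raises an error when more than 100 points are requested. The function names are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Iterable, List, Set

# The documentation states that at most 100 points in time are supported at a time.
MAX_POINTS = 100


def business_times(start: datetime, end: datetime, every: timedelta) -> List[datetime]:
    """List the trigger times between start and end (inclusive, by assumption)."""
    if every <= timedelta(0):
        raise ValueError("the recurrence interval must be positive")
    points = []
    current = start
    while current <= end:
        points.append(current)
        if len(points) > MAX_POINTS:
            # Rejecting the request is an assumption; the source only states the limit.
            raise ValueError(f"more than {MAX_POINTS} points in time requested")
        current += every
    return points


def instances_to_run(
    points: Iterable[datetime], succeeded: Set[datetime], skip_successful: bool
) -> List[datetime]:
    """Drop business times that already succeeded when the switch is turned on."""
    if not skip_successful:
        return list(points)
    return [point for point in points if point not in succeeded]


days = business_times(datetime(2024, 1, 1), datetime(2024, 1, 5), timedelta(days=1))
already_ok = {datetime(2024, 1, 2)}
for point in instances_to_run(days, already_ok, skip_successful=True):
    print(point.date())
```

In this example, five daily business times are generated and the one for January 2 is skipped because it is marked as successful.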
View the running details about a workflow
After you run a workflow, you can perform the following steps to view the running
details about the workflow:
- Click the Records tab in the lower part of the workflow design page.
You can view the status of a workflow instance.
- Find your workflow instance and click Details in the Action column to go to the Scheduling Center tab.
You can view the details about the workflow instance. You can also pause, resume,
stop, or rerun the workflow instance. For more information, see Scheduling center.
| Operation | Description |
| --- | --- |
| Details | Views the details and status of the workflow instance. |
| Stop Workflow | Stops all running job nodes of the workflow instance. |
| Pause Workflow | If you click this button, the running job nodes continue running, but the subsequent job nodes in the workflow do not start. |
| Resume Workflow | Resumes the workflow instance if it has been paused. |
| Rerun Workflow Instance | Reruns the workflow instance if it has been terminated. After you click Rerun Workflow Instance, you can determine whether to rerun only the failed job nodes or rerun all job nodes from the START node. |
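The difference between Pause Workflow and Stop Workflow is easiest to see on a snapshot of job node states. The following sketch is illustrative only: the status names (`SUCCEEDED`, `RUNNING`, `PENDING`, `HELD`, `STOPPED`), the node names, and the three functions are hypothetical and are not taken from the EMR console.

```python
from typing import Dict

Statuses = Dict[str, str]


def pause(statuses: Statuses) -> Statuses:
    """Running nodes keep running; nodes that have not started are held back."""
    return {node: "HELD" if state == "PENDING" else state for node, state in statuses.items()}


def resume(statuses: Statuses) -> Statuses:
    """Held nodes become eligible to start again."""
    return {node: "PENDING" if state == "HELD" else state for node, state in statuses.items()}


def stop(statuses: Statuses) -> Statuses:
    """All running nodes are stopped.

    The source does not describe what happens to nodes that have not started,
    so this sketch leaves them untouched.
    """
    return {node: "STOPPED" if state == "RUNNING" else state for node, state in statuses.items()}


snapshot = {"extract": "SUCCEEDED", "transform": "RUNNING", "load": "PENDING"}
print(pause(snapshot))
print(stop(snapshot))
```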
Operations that you can perform on workflows
In the Workflows pane, you can right-click a workflow and perform the operations that are described
in the following table.
| Operation | Description |
| --- | --- |
| Clone Workflow | Clones a workflow with the same design in the same folder. Note: The settings of the scheduling parameters for the original workflow cannot be cloned. |
| Rename Workflow | Renames a workflow. |
| Delete Workflow | Deletes a workflow. You cannot delete a running workflow. |