In a data development project of E-MapReduce (EMR), you can define a group of dependent jobs and create a workflow that runs the jobs in sequence based on their dependencies. An EMR workflow can be represented as a directed acyclic graph (DAG), which also allows big data jobs that do not depend on each other to run in parallel. You can schedule workflows and view their status in the EMR console.
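
The DAG idea can be illustrated outside the console. The following is a minimal sketch, not EMR code: it uses the Python standard library `graphlib` module to group jobs into waves, where the jobs in one wave have no unmet dependencies and could therefore run in parallel. The job names are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical job names; each key maps to the set of jobs it depends on.
dependencies = {
    "clean_logs": {"ingest_logs"},
    "clean_orders": {"ingest_orders"},
    "join_report": {"clean_logs", "clean_orders"},
}


def execution_waves(deps):
    """Group jobs into waves; jobs in the same wave have no unmet dependencies."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the graph contains a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # jobs whose dependencies are all done
        waves.append(ready)
        sorter.done(*ready)  # mark the whole wave as finished
    return waves


waves = execution_waves(dependencies)
for index, wave in enumerate(waves, start=1):
    print(f"wave {index}: {wave}")
```

In this sketch, the two ingest jobs form the first wave, the two clean jobs form the second, and the report job runs last.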

Prerequisites

  • A project is created. For more information, see Manage projects.
  • Jobs are edited. For more information, see Edit jobs.

Create a workflow

Perform the following steps to create a workflow:

  1. Go to the Data Platform tab.
    1. Log on to the Alibaba Cloud EMR console by using your Alibaba Cloud account.
    2. In the top navigation bar, select the region where your cluster resides and select a resource group based on your business requirements.
    3. Click the Data Platform tab.
  2. In the Projects section of the page that appears, find your project and click Workflows in the Actions column.
  3. Create a workflow.
    1. In the Workflows pane on the left side of the page that appears, right-click the folder on which you want to perform operations and select Create Workflow.
    2. In the Create Workflow dialog box, configure the Workflow Name, Description, Select Resource Group, and Target Cluster parameters.
      Valid values of the Target Cluster parameter:
      • Select Existing Cluster: When the workflow is executed, the jobs run on the cluster that you select.
        Note Only the clusters that are associated with the project are displayed in the Select Existing Cluster drop-down list. Before you can select a different cluster, you must disassociate the existing clusters from the project. For more information, see Manage projects.
      • Create Cluster from Template: When the workflow is executed, the jobs run on a temporary cluster that is created from the cluster template that you select. When the workflow ends, the cluster is automatically released. For more information, see Create a cluster template.
    3. Click OK.
      After the workflow is created, you can edit and configure the workflow.

Edit a workflow

  1. Drag different types of job nodes to the canvas for editing a workflow.
    After you drag a node of a specific type to the canvas, you can configure the following parameters in the Edit Node panel:
    • Associated Job: Select a job of the same type as the job node from the Associated Job drop-down list.
    • Customize Job Configuration: You can customize job configurations based on your business requirements. By default, the Customize Job Configuration switch is turned off.
      • If you turn on this switch, you can change the value of the Target Cluster parameter.
      • If you turn off this switch, the jobs that are associated with the job node run on the cluster that you select when you create a workflow.
  2. Associate job nodes.
    On the canvas, drag a line from a job node to associate this job node with other job nodes based on the dependencies between the jobs. Arrows indicate the direction of the workflow.
  3. Configure controller nodes to complete the design of the workflow.
    Drag the END node from the Controller Node section to the canvas. Then, associate the START node, job nodes, and END node to complete the design of the workflow. You can click Auto Adjust in the upper-right corner to adjust the layout of the job nodes in the workflow.
    When you edit a workflow, you can click Lock in the upper-right corner to lock the workflow. This way, only you can edit or run the workflow. Other members in the project can edit the workflow only after the workflow is unlocked.
    Note Only the RAM user that locks the workflow and the Alibaba Cloud account can unlock the workflow.
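
The locking rules above can be summarized as a small permission check. This is an illustrative sketch only, not how the EMR console implements locking. The `WorkflowLock` class, the `ACCOUNT` identifier, and the user names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical identifier for the Alibaba Cloud account that owns the project.
ACCOUNT = "alibaba-cloud-account"


@dataclass
class WorkflowLock:
    """Tracks which RAM user, if any, currently holds the workflow lock."""

    locked_by: Optional[str] = None

    def lock(self, user: str) -> None:
        if self.locked_by is not None:
            raise PermissionError(f"workflow is already locked by {self.locked_by}")
        self.locked_by = user

    def can_edit(self, user: str) -> bool:
        # An unlocked workflow is editable by any project member;
        # a locked workflow is editable only by the user who locked it.
        return self.locked_by is None or self.locked_by == user

    def unlock(self, user: str) -> None:
        # Only the locking RAM user or the Alibaba Cloud account may unlock.
        if self.locked_by is None:
            return
        if user not in (self.locked_by, ACCOUNT):
            raise PermissionError(f"{user} cannot unlock this workflow")
        self.locked_by = None


state = WorkflowLock()
state.lock("ram-user-a")
print(state.can_edit("ram-user-a"))  # the locking user can still edit
print(state.can_edit("ram-user-b"))  # other project members are blocked
state.unlock(ACCOUNT)                # the Alibaba Cloud account releases the lock
print(state.can_edit("ram-user-b"))  # editable again after the unlock
```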

Configure workflow scheduling

You can enable the workflow scheduling feature and configure scheduling-related parameters. After scheduling is enabled, the workflow runs periodically based on the parameter settings, and its jobs are delivered to the specified cluster to run. Perform the following steps to configure the parameters on the Basic Attributes, Scheduling Settings, and Alert Settings tabs in the Workflow Scheduling panel:

  1. Go to the Data Platform tab.
    1. Log on to the Alibaba Cloud EMR console by using your Alibaba Cloud account.
    2. In the top navigation bar, select the region where your cluster resides and select a resource group based on your business requirements.
    3. Click the Data Platform tab.
  2. In the Projects section of the page that appears, find your project and click Workflows in the Actions column.
  3. On the workflow design page, click Configure.
  4. On the Basic Attributes tab of the Workflow Scheduling panel, modify the workflow description, resource group, and the cluster used to run the jobs in the workflow based on your business requirements.
  5. After the basic attributes are modified, click the Scheduling Settings tab and configure the parameters related to workflow scheduling.
    • Scheduling Status: Valid values:
      • Start: Start workflow scheduling. After you select Start for Scheduling Status, Scheduling appears in the upper-right corner of the workflow editing canvas, which indicates that the workflow is being scheduled.
      • Stop: Stop workflow scheduling.
    • Time-based Scheduling
      • Start Time: The time when workflow scheduling starts.
      • End Time: The time when workflow scheduling ends. This parameter is optional.
      • Recurrence: The cycle of workflow scheduling.
      • CRON Expression: The CRON expression that is used to specify the cycle of workflow scheduling.
    • Dependency-based Scheduling
      • Project: The project to which the dependent workflow of the current workflow belongs. This parameter is optional.
      • Dependent Workflow: The dependent workflow of the current workflow. The current workflow is executed only after the dependent workflow ends. This parameter is optional.
  6. Click the Alert Settings tab and configure the alert parameters.
    Each of the following parameters specifies whether to send a notification to an alert contact group or a DingTalk alert group when the described event occurs:
    • Execution Failed: The workflow fails.
    • Actions on Failures: A job node in the workflow fails to run.
    • Executed: The workflow succeeds.
    • Action on Startup Timeout: A job node in the workflow does not start within 30 minutes after it is delivered to a cluster.
    • Node execution timed out: The running duration of a job node exceeds the expected maximum running duration in the job configuration.
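
The two timeout conditions in the alert settings can be expressed as simple time comparisons. The sketch below is an illustration under stated assumptions, not the EMR alerting logic: the function name, its parameters, and the alert labels are hypothetical. Only the 30-minute startup threshold comes from the description above.

```python
from datetime import datetime, timedelta
from typing import List, Optional

# From the Action on Startup Timeout description: 30 minutes after delivery.
STARTUP_TIMEOUT = timedelta(minutes=30)


def timeout_alerts(
    delivered_at: datetime,
    started_at: Optional[datetime],
    now: datetime,
    expected_max_duration: timedelta,
) -> List[str]:
    """Return the (hypothetical) timeout alert labels that apply to a job node."""
    alerts = []
    if started_at is None:
        # The node was delivered to the cluster but has not started yet.
        if now - delivered_at > STARTUP_TIMEOUT:
            alerts.append("startup_timeout")
    elif now - started_at > expected_max_duration:
        # The node is assumed to be still running and has exceeded its limit.
        alerts.append("execution_timeout")
    return alerts


delivered = datetime(2024, 1, 1, 8, 0)
one_hour = timedelta(hours=1)
print(timeout_alerts(delivered, None, delivered + timedelta(minutes=31), one_hour))
print(timeout_alerts(delivered, delivered + timedelta(minutes=5),
                     delivered + timedelta(hours=2), one_hour))
```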

Run a workflow

When you run a workflow, you can specify its business time. Time variables in the jobs of the workflow are calculated based on the specified business time, so you can use the business time to rerun workflow instances for a specific period of time. You can rerun a single workflow instance or multiple workflow instances at a time. If no time variables are configured for your jobs, you can select Execute.

  1. Go to the Data Platform tab.
    1. Log on to the Alibaba Cloud EMR console by using your Alibaba Cloud account.
    2. In the top navigation bar, select the region where your cluster resides and select a resource group based on your business requirements.
    3. Click the Data Platform tab.
  2. In the Projects section of the page that appears, find your project and click Workflows in the Actions column.
  3. Run your workflow.
    1. On the page that appears, select a workflow and click Run in the upper-right corner.
    2. In the Run Workflow dialog box, configure the runtime parameters.
      You can select a running mode based on your business requirements. The following running modes are supported:
      • Execute: Immediately runs the workflow. You can use the specified time as the business time of the workflow. Time-related variables are calculated based on the business time.
      • Run Periodically: Runs multiple workflow instances at a time. The trigger time of specific scheduling rules is used as the business time of each workflow instance, and time-related variables are calculated based on the business time. A maximum of 100 points in time are supported at a time. If you select Run Periodically for Mode, configure the following parameters:
        • Start Time: the time when workflow scheduling starts.
        • End Time: the time when workflow scheduling ends. This parameter is optional.
        • Recurrence: the cycle of workflow scheduling.
        • CRON Expression: the CRON expression that is used to specify the cycle of workflow scheduling.
        • Skip Successful Nodes: specifies whether to skip successful workflow instances. You can determine whether to turn on this switch based on your business requirements. After you turn on the Skip Successful Nodes switch, the system skips the workflow instances that were successful at their business time and runs only the workflow instances that failed at other business times.
    3. Click OK.
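
To make the Run Periodically mode more concrete, the following sketch enumerates business times between a start time and an end time and then filters out the ones that already succeeded. It is an illustration, not EMR code. It assumes a fixed-interval recurrence instead of a CRON expression, treats the end time as inclusive, and raises an error when more than 100 points in time would be generated. The source states only the limit of 100, not how the console enforces it. All function names are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Iterable, List, Set

MAX_POINTS = 100  # "A maximum of 100 points in time are supported at a time."


def business_times(start: datetime, end: datetime, interval: timedelta) -> List[datetime]:
    """List trigger times from start to end (inclusive) at a fixed interval."""
    if interval <= timedelta(0):
        raise ValueError("interval must be positive")
    points = []
    current = start
    while current <= end:
        if len(points) == MAX_POINTS:
            # Assumption: reject the request instead of silently truncating it.
            raise ValueError(f"more than {MAX_POINTS} points in time requested")
        points.append(current)
        current += interval
    return points


def instances_to_run(
    points: Iterable[datetime], succeeded: Set[datetime], skip_successful: bool
) -> List[datetime]:
    """Drop business times that already succeeded when the skip switch is on."""
    if not skip_successful:
        return list(points)
    return [point for point in points if point not in succeeded]


points = business_times(datetime(2024, 1, 1), datetime(2024, 1, 5), timedelta(days=1))
already_ok = {datetime(2024, 1, 2)}
for point in instances_to_run(points, already_ok, skip_successful=True):
    print(point.date())
```

With the switch turned on, the instance for January 2 is skipped and the other four business times are run again.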

View the running details about a workflow

After you run a workflow, you can perform the following steps to view the running details about the workflow:

  1. Click the Records tab in the lower part of the workflow design page.
    You can view the status of a workflow instance.
  2. Find your workflow instance and click Details in the Action column to go to the Scheduling Center tab.

    You can view the details about the workflow instance. You can also pause, resume, stop, or rerun the workflow instance. For more information, see Scheduling center.

    • Details: View the details and status of the workflow instance.
    • Stop Workflow: Stops all running job nodes of the workflow instance.
    • Pause Workflow: The running job nodes continue to run, but the subsequent job nodes in the workflow do not start.
    • Resume Workflow: Resumes the workflow instance if it has been paused.
    • Rerun Workflow Instance: Reruns the workflow instance if it has been terminated. After you click Rerun Workflow Instance, you can choose to rerun only the failed job nodes or to rerun all job nodes from the START node.
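
The difference between pausing and stopping, and between the two rerun options, can be sketched as operations on a map of node states. This is a conceptual illustration only. The state labels ("pending", "running", "held", and so on) and the function names are hypothetical, and the sketch does not describe what happens to nodes downstream of a failed node because the source does not state it.

```python
from typing import Dict, List


def pause(states: Dict[str, str]) -> Dict[str, str]:
    """Running nodes keep running; nodes that have not started are held back."""
    return {node: ("held" if state == "pending" else state) for node, state in states.items()}


def stop(states: Dict[str, str]) -> Dict[str, str]:
    """All running nodes are stopped; other states are left unchanged here."""
    return {node: ("stopped" if state == "running" else state) for node, state in states.items()}


def nodes_to_rerun(states: Dict[str, str], failed_only: bool) -> List[str]:
    """Select either only the failed nodes or every node from the START node."""
    if failed_only:
        return [node for node, state in states.items() if state == "failed"]
    return list(states)


# Hypothetical instance with one finished, one running, and one waiting node.
instance = {"extract": "succeeded", "transform": "running", "load": "pending"}
print(pause(instance))
print(stop(instance))

finished = {"extract": "succeeded", "transform": "failed", "load": "failed"}
print(nodes_to_rerun(finished, failed_only=True))
print(nodes_to_rerun(finished, failed_only=False))
```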

Operations that you can perform on workflows

In the Workflows pane, you can right-click a workflow and perform the following operations:
  • Clone Workflow: Clones a workflow with the same design in the same folder.
    Note The settings of the scheduling parameters for the original workflow cannot be cloned.
  • Rename Workflow: Renames a workflow.
  • Delete Workflow: Deletes a workflow. You cannot delete a running workflow.