EMR Workflow lets you build data processing pipelines by connecting task nodes into a directed acyclic graph (DAG). Each pipeline runs as a workflow, and each execution produces a workflow instance you can monitor and rerun independently.
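To make the DAG idea concrete, the following minimal sketch uses Python's standard `graphlib` module to derive a valid run order from node dependencies. The node names are hypothetical, and this only illustrates the concept. It is not how EMR Workflow itself schedules nodes.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical node names. Each key maps to the set of nodes it depends on.
dependencies = {
    "load_raw": set(),
    "clean": {"load_raw"},
    "aggregate": {"clean"},
    "report": {"aggregate", "clean"},
}


def execution_order(deps):
    """Return one valid run order; raises CycleError if deps contain a cycle."""
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
```

A cycle such as `{"a": {"b"}, "b": {"a"}}` raises `CycleError`, which is why dependencies in a workflow must stay acyclic.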
Prerequisites
Before you begin, ensure that you have:
-
A project created in EMR Studio. For more information, see Create a project.
Create a workflow
-
Log on to the E-MapReduce (EMR) console.
-
In the left-side navigation pane, choose EMR Studio > Workflow.
-
Click the Project tab, then click the name of a project.
-
In the left-side navigation pane of the project details page, choose Workflow > Workflow Definition.
-
On the Workflow Definition page, click Create Workflow.
-
On the Create Workflow page, drag a node type onto the canvas. In this example, drag HIVECLI to the canvas.
In the Current node settings dialog box, configure the parameters and click Confirm. For more information about HIVECLI, see HIVECLI. For other node types, see Node types.
-
(Optional) Configure dependencies between nodes.
EMR Workflow lets you define custom dependencies between the nodes in a workflow.
-
To connect two nodes: hover over the icon on the right side of a node, then drag the connection line to another node.
-
To delete a dependency or node: click the connection line or node, then click the icon in the upper-right corner of the canvas.
-
Save the workflow.
-
Click Save in the upper-right corner of the canvas.
-
In the Basic Information dialog box, configure the following parameters and click Confirm.
| Parameter | Description |
|---|---|
| Workflow Name | The name of the workflow. |
| Description | A description of the workflow. |
| Timeout Alert | Disabled by default. If enabled, specify a timeout period. An alert is triggered when a node's execution time exceeds the timeout period. |
| Process execute type | How multiple instances of the same workflow run relative to each other. parallel: instances run at the same time. Serial wait: instances run one after another. |
| Global Variables | Variables that apply to all nodes in the workflow. |
Workflow states and available operations
A workflow is in either the Online or Offline state. The state determines which operations are available. The following table maps each operation to the state in which it can be performed.
| Operation | Online | Offline | Description |
|---|---|---|---|
| Edit | No | Yes | Edit the workflow definition. |
| Start | Yes | No | Run the workflow manually. See Run a workflow. |
| Timing | Yes | No | Configure a Cron-based schedule. After you save the schedule, the scheduled workflow entry is in the Offline state and must be brought online on the Cron manage page. See Schedule a workflow. |
| Online | No | Yes | Change the workflow state from Offline to Online. |
| Offline | Yes | No | Change the workflow state from Online to Offline. |
| Copy Workflow | Yes | Yes | Generate a new workflow by copying this one. |
| Cron manage | Yes | Yes | View, edit, or change the state of scheduled workflow entries. |
| Delete | No | Yes | Delete the workflow. Only the creator can delete a workflow. |
| Tree View | Yes | Yes | View node types and statuses in a tree structure. |
| Export | Yes | Yes | Export the workflow as a JSON file. |
| Version Info | Yes | Yes | View version information for the workflow. |
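The state rules can also be expressed as a small lookup, which may help if you script checks around the console. This is an illustrative sketch only: the names `AVAILABLE` and `can_perform` are hypothetical, and it assumes that Edit and Delete require the Offline state while Start and Timing require the Online state.

```python
# Operations that the table lists as available in both states.
ALWAYS = {"Copy Workflow", "Cron manage", "Tree View", "Export", "Version Info"}

# Assumption: Edit and Delete are Offline-only; Start and Timing are Online-only.
AVAILABLE = {
    "Online": {"Start", "Timing", "Offline"} | ALWAYS,
    "Offline": {"Edit", "Online", "Delete"} | ALWAYS,
}


def can_perform(state, operation):
    """Return True if the operation is available in the given workflow state."""
    return operation in AVAILABLE[state]
```

For example, `can_perform("Offline", "Start")` is `False`, which matches the need to bring a workflow online before running it.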
Run a workflow
Each run generates a workflow instance, which appears on the Workflow Instance page.
-
On the Workflow Definition page, find the workflow and click the (Online) icon in the Operation column to bring the workflow online.
-
Click the (Start) icon in the Operation column.
-
In the dialog box that appears, configure the following parameters and click Confirm.
| Parameter | Description |
|---|---|
| Failure Strategy | What happens to concurrent nodes when one node fails. Continue: other nodes keep running. End: downstream nodes of the failed node are terminated. |
| Notification Strategy | When to send a notification after the workflow ends. Valid values: None, Success, Failure, All. |
| Workflow Priority | The priority of the workflow run. Default: MEDIUM. Valid values: HIGHEST, HIGH, MEDIUM, LOW, LOWEST. |
| Execution Cluster | The cluster to run the workflow on. Select a cluster associated on the Cluster Manage page of the Security tab. |
| Alarm Group | The alarm group for notifications. Select a group configured on the Alarm Group Manage page of the Security tab. |
| Complement Data | Whether to generate backfill data for a past time range. See the Complement Data parameters section below. |
| Startup Parameter | A startup parameter and value. Defines or overwrites a global variable when this workflow instance starts. |
| Whether Dry-Run | When enabled, the workflow performs a dry run and records a success log without executing actual tasks. |
-
In the left-side navigation pane of the project details page, choose Workflow > Workflow Instance to view the status of the workflow instance.
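The Startup Parameter behavior (define a new variable or overwrite a global variable for one instance) can be pictured as a dictionary merge. The sketch below is a mental model under that assumption, with a hypothetical function name and sample values. It does not call any EMR API.

```python
def effective_variables(global_variables, startup_parameters):
    """Return the variables one instance would see: startup values win on conflict."""
    merged = dict(global_variables)    # copy so the workflow-level values stay unchanged
    merged.update(startup_parameters)  # startup parameters define or overwrite entries
    return merged


# Hypothetical values for illustration only.
globals_ = {"dt": "2024-01-01", "env": "prod"}
startup = {"dt": "2024-02-01", "region": "cn-hangzhou"}
result = effective_variables(globals_, startup)
```

Here `dt` takes the startup value, `env` keeps the global value, and `region` is newly defined for this instance only.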
Complement Data parameters
Complement Data generates backfill data for workflow runs within a specified past time range. Select Whether it is a complement process? to enable it, then configure the following parameters.
| Parameter | Description |
|---|---|
| Mode of dependent | Whether to generate backfill data for workflows that depend on the current workflow. Close (default): dependent workflows are not backfilled. Open: dependent workflows are also backfilled, provided the current workflow is Online and has scheduling configured. |
| Mode of execution | How backfill data is generated across the specified date range. Serial execution: workflow instances are generated one per day in chronological order. Parallel execution: workflow instances are generated for all days simultaneously. In parallel mode, set Custom Parallelism to limit the maximum number of concurrent instances. |
| Scheduling Date | The past time range for which to generate backfill data. |
Match Mode of execution to the workflow's Process execute type: use Parallel execution if the workflow is set to parallel, and Serial execution if it is set to Serial wait.
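To see how the two execution modes differ, the following sketch groups the dates of a backfill range into batches of instances that could run together. It is a simplification with several assumptions: the range is treated as inclusive of both endpoints, one instance is generated per day, and Custom Parallelism is modeled as a fixed batch size. The function names are hypothetical.

```python
from datetime import date, timedelta


def backfill_dates(start, end):
    """List every day from start to end, inclusive (assumed range semantics)."""
    days = (end - start).days
    return [start + timedelta(days=i) for i in range(days + 1)]


def plan_backfill(start, end, mode="serial", parallelism=None):
    """Group dates into batches; instances in the same batch may run concurrently."""
    dates = backfill_dates(start, end)
    if mode == "serial":
        size = 1                          # one instance at a time, in date order
    elif mode == "parallel":
        size = parallelism or len(dates)  # no cap means all days at once
    else:
        raise ValueError(f"unknown mode: {mode}")
    size = max(size, 1)
    return [dates[i:i + size] for i in range(0, len(dates), size)]


serial_plan = plan_backfill(date(2024, 1, 1), date(2024, 1, 3))
capped_plan = plan_backfill(date(2024, 1, 1), date(2024, 1, 5), "parallel", 2)
```

With a five-day range and a parallelism of 2, the capped plan contains batches of 2, 2, and 1 days, while the serial plan always contains one day per batch.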
Import a workflow
Import a workflow from a JSON file previously exported from EMR Workflow.
-
On the Workflow Definition page, click Import Workflow.
-
In the Upload dialog box, click Upload and select the exported JSON file.
-
Click Confirm.
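Before you upload a file, it can be useful to confirm that it is still well-formed JSON, for example after it was edited by hand or moved between systems. The sketch below only checks that the text parses. It makes no assumption about the structure of the exported file, and the function names are hypothetical.

```python
import json
from pathlib import Path


def parse_export(text):
    """Parse exported workflow text; raises ValueError if it is not valid JSON."""
    try:
        return json.loads(text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc


def load_export(path):
    """Read a file from disk and parse it with parse_export."""
    return parse_export(Path(path).read_text(encoding="utf-8"))
```

Calling `load_export` on the exported file either returns the parsed content or raises `ValueError` with the parser's message, so a damaged file is caught before the upload step.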
Schedule a workflow
Configure a Cron-based schedule to run the workflow automatically at a set interval.
-
On the Workflow Definition page, find the workflow and click the (Timing) icon in the Operation column. The Timing operation is available only for Online workflows.
-
Configure the following parameters and click Confirm.
| Parameter | Description |
|---|---|
| Start and stop time | The time range during which the workflow is scheduled to run. No scheduled instances are generated outside this range. |
| Timing | The scheduling interval (Cron expression) at which the workflow runs. |
| Execution Cluster | The cluster to use for scheduled runs. |
-
Activate the scheduled workflow.
After saving the scheduling settings, the scheduled workflow entry is in the Offline state. To make it take effect:
-
On the Workflow Definition page, click the (Cron manage) icon in the Operation column.
-
On the Cron manage page, find the scheduled workflow entry and click the (Online) icon in the Operation column.
The scheduled workflow is now active and will run according to the configured interval.
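The effect of Start and stop time can be sketched as a filter over candidate trigger times: only triggers inside the window produce instances. The example below assumes both boundaries are inclusive, which the console may handle differently, and it uses precomputed hypothetical trigger times rather than evaluating a Cron expression.

```python
from datetime import datetime


def in_schedule_window(fire_time, start, stop):
    """Return True if fire_time falls inside the window (inclusive bounds assumed)."""
    return start <= fire_time <= stop


def filter_fire_times(fire_times, start, stop):
    """Keep only the trigger times that would generate scheduled instances."""
    return [t for t in fire_times if in_schedule_window(t, start, stop)]


# Hypothetical daily triggers at 02:00 and a window covering 2-3 January.
candidates = [datetime(2024, 1, d, 2, 0) for d in range(1, 5)]
window_start = datetime(2024, 1, 2, 0, 0)
window_stop = datetime(2024, 1, 3, 23, 59)
kept = filter_fire_times(candidates, window_start, window_stop)
```

In this example the triggers on 1 January and 4 January fall outside the window, so only the two middle days remain.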