E-MapReduce: Manage workflows

Last Updated: Mar 26, 2026

When you need to run multiple Spark jobs in a specific order (for example, a data ingestion job followed by a transformation job and then a reporting job), use workflows to define the dependency chain and automate execution. This topic explains how to create, run, and monitor workflows in EMR Serverless Spark.

Key concepts:

  • Workflow: A pipeline of jobs linked by dependency relationships and run according to a schedule or on demand.

  • Node: A single job in the workflow. Nodes are connected by upstream/downstream relationships to define execution order.

  • Workflow run: A single execution of the workflow. Each run is recorded and viewable in the Workflow Runs tab.
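
To illustrate how upstream/downstream relationships determine execution order, the following Python sketch models a workflow as a mapping from each node to its upstream nodes and derives one valid run order. The node names (ingest, transform, report) are hypothetical. The sketch only illustrates the concept and does not describe how EMR Serverless Spark schedules nodes internally.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical workflow: each key is a node, each value is the set of
# upstream nodes that must complete before the key node runs.
workflow = {
    "ingest": set(),          # first node: no upstream node
    "transform": {"ingest"},
    "report": {"transform"},
}


def execution_order(nodes):
    """Return one order in which every node runs after its upstream nodes."""
    return list(TopologicalSorter(nodes).static_order())


order = execution_order(workflow)
print(order)
```

A cycle in the mapping (for example, two nodes that list each other as upstream) makes `static_order()` raise `graphlib.CycleError`, which reflects why a dependency chain must not loop back on itself.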

Prerequisites

Before you begin, ensure that you have:

Create a workflow

  1. Go to the Workflows page.

    1. Log on to the E-MapReduce (EMR) console.

    2. In the left-side navigation pane, choose EMR Serverless > Spark.

    3. On the Spark page, click the name of the workspace you want to use.

    4. In the left-side navigation pane of the EMR Serverless Spark page, choose Operation Center > Workflows.

  2. On the Workflows tab, click Create Workflow.

  3. In the Create Workflow panel, configure the parameters and click Next.

    | Parameter | Description |
    | --- | --- |
    | Name | The workflow name. Must be unique within a workspace. |
    | Resource Queue | The default resource queue for the workflow. Node-level resource queues override this setting. |
    | Other Settings > Scheduling Type | How the workflow runs in the production environment. Valid values: None (Manual) (manually triggered, default) and Scheduler (automatic, by minute, hour, or day). See the scheduling types table below. |
    | Retries After Failure | The number of times to retry a failed node. Default: no retry. Node-level retry settings override this value. |
    | Failure Notification | The email address to notify when the workflow fails. |
    | Tags | Key-value pairs to identify the workflow. |

    | Scheduling type | Behavior | Additional required parameters |
    | --- | --- | --- |
    | None (Manual) (default) | Trigger runs manually. | None |
    | Scheduler | Run automatically by minute, hour, or day. | Scheduling Time and Scheduling Started At |

    When Scheduling Type is Scheduler, configure:

    • Scheduling Time: The run frequency. Days runs once per day at a fixed time. Hours runs every N hours within a daily window. Minutes runs every N minutes within a daily window.

    • Scheduling Started At: The date and time when scheduled runs begin. Defaults to the current time.

    Important

    After the workflow is created, turn on the Scheduling Status switch for the workflow on the Workflows tab. Without this, the workflow does not run at the scheduled time.

  4. Add nodes to the workflow. Nodes represent jobs in the pipeline. Connect them through upstream/downstream relationships to define execution order.

    1. In the lower part of the canvas, click Add Node.

    2. In the Add Node panel, configure the parameters.

      | Parameter | Description |
      | --- | --- |
      | Source File Path | The path of the job to run at this node. The job must be published. |
      | Node Type | Inferred automatically from the job at the specified path. |
      | Node Name | Auto-filled from Source File Path. Customize as needed. |
      | Upstream Node | The node that must complete before this node runs. Must be a node in the current workflow. Leave blank for the first node. |
      | Number of Retries | Defaults to the workflow-level retry count. No retry by default. |
      | Timeout (Seconds) | The maximum run time for a single node run. Default: no limit. |
      | Subscription | The email address to notify when the node reaches a specified state. |
      | Tags | The node tags. Each node includes workflow_name and task_name tags by default. |
      | Resource Queue | The resource queue for this node. Defaults to the workflow resource queue. Once set at the node level, this setting persists even if you later change the workflow-level resource queue. |

      Note

      For SQL jobs, configure additional parameters in the Task Configuration section. Default values match the job-level configuration. See Manage default configurations.

    3. Click Save. Repeat to add more nodes.

  5. Publish the workflow.

    1. In the upper-right corner, click Publish Workflow.

    2. In the Publish dialog box, enter remarks and click OK.
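
The precedence between workflow-level and node-level settings described in the steps above can be sketched in Python. This is a minimal model under stated assumptions, not an EMR API: the class names, field names, and queue names are all hypothetical, and `None` is used here to mean "not set at the node level".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WorkflowSettings:
    resource_queue: str
    retries: int = 0  # default: no retry


@dataclass
class NodeSettings:
    name: str
    upstream: Optional[str] = None          # None for the first node
    resource_queue: Optional[str] = None    # None: inherit from workflow
    retries: Optional[int] = None           # None: inherit from workflow
    timeout_seconds: Optional[int] = None   # None: no limit


def effective_settings(workflow, node):
    """Resolve the settings a node runs with; node-level values win."""
    queue = node.resource_queue
    if queue is None:
        queue = workflow.resource_queue
    retries = node.retries
    if retries is None:
        retries = workflow.retries
    return {
        "resource_queue": queue,
        "retries": retries,
        "timeout_seconds": node.timeout_seconds,
    }


# Hypothetical workflow with two nodes.
wf = WorkflowSettings(resource_queue="prod_queue", retries=2)
ingest = NodeSettings(name="ingest")
transform = NodeSettings(
    name="transform", upstream="ingest", resource_queue="etl_queue", retries=0
)

print(effective_settings(wf, ingest))
print(effective_settings(wf, transform))
```

In this model, changing `wf.resource_queue` later affects only `ingest`, because `transform` carries its own node-level queue.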

Run a workflow

Each workflow run produces a run record. View run history on the Workflow Runs tab of the workflow details page.

Debug a workflow

Debug the latest version of a workflow before running it in production.

  1. On the Workflows tab, find the workflow and click Edit in the Actions column. On the page that appears, click Debug next to the workflow name.

  2. In the Debug dialog box, select a development environment resource queue and click Run.

Run on a schedule

When Scheduling Type is set to Scheduler and the Scheduling Status switch is on, the workflow runs automatically at the configured time.
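
The two conditions above, and the idea of running every N hours within a daily window, can be sketched in Python as follows. The function names and the sample schedule are hypothetical, and treating the end of the window as inclusive is an assumption rather than documented behavior.

```python
from datetime import date, datetime, time, timedelta


def runs_automatically(scheduling_type, scheduling_status_on):
    """A workflow runs on schedule only if both conditions hold."""
    return scheduling_type == "Scheduler" and scheduling_status_on


def run_times_in_window(day, window_start, window_end, interval):
    """List run times on `day` from window_start to window_end (inclusive)."""
    if interval <= timedelta(0):
        raise ValueError("interval must be positive")
    current = datetime.combine(day, window_start)
    end = datetime.combine(day, window_end)
    times = []
    while current <= end:
        times.append(current)
        current += interval
    return times


# Hypothetical schedule: every 6 hours between 00:00 and 23:59.
times = run_times_in_window(
    date(2026, 3, 26), time(0, 0), time(23, 59), timedelta(hours=6)
)
labels = [t.strftime("%H:%M") for t in times]
print(labels)  # ['00:00', '06:00', '12:00', '18:00']
```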

Run manually

  1. On the Workflows tab, click the workflow name.

  2. In the upper-right corner, click Run.

  3. In the Run dialog box, set the Scheduling Method and click OK.

Scheduling Method values:

| Value | When to use | Behavior |
| --- | --- | --- |
| Manually Run (default) | Run the workflow now, regardless of schedule. | Starts immediately. |
| Backfill | Reprocess data for a historical time range, for example, when a scheduled run was missed or a job was fixed and needs to rerun over past data. | Generates runs for each scheduling interval within the specified range. |

When you select Backfill, configure the following parameters:

| Parameter | Description |
| --- | --- |
| Cycle | The historical time range. A run is generated for each scheduling interval that falls within this range. The range can be earlier than the current time. Time variables such as ${ds} are automatically replaced with the corresponding cycle time. |
| Resource Queue | Defaults to the workflow's configured resource queue. Select a different production queue if needed. |
| Remarks | A description to help you manage and troubleshoot the run. |
| More > Failure Notification | The email address to notify if backfilling fails. |
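
As a rough illustration of how a backfill expands a Cycle range into one run per scheduling interval, the following Python sketch assumes a daily schedule and substitutes ${ds} with each cycle date. The SQL text, the table name, and the yyyy-MM-dd rendering of ${ds} are assumptions made for this example. Check the time variable reference for the actual format.

```python
from datetime import date, timedelta


def backfill_cycles(start, end):
    """Return one cycle date per day in the inclusive range [start, end]."""
    days = (end - start).days
    return [start + timedelta(days=offset) for offset in range(days + 1)]


def render(template, cycle):
    """Replace ${ds} with the cycle date (assumed yyyy-MM-dd format)."""
    return template.replace("${ds}", cycle.isoformat())


# Hypothetical SQL used by a node in the workflow.
sql = "SELECT * FROM events WHERE dt = '${ds}'"

cycles = backfill_cycles(date(2026, 3, 1), date(2026, 3, 3))
for cycle in cycles:
    print(render(sql, cycle))
```

Each printed statement corresponds to one generated run, so a three-day range on a daily schedule yields three runs in this model.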

Check workflow run status

The Workflow Runs Status column shows the status of each workflow run. The Workflow Node Runs Status column shows the status of individual nodes within a run. For details about run records and node-level run logs, see Manage workflow runs and workflow node runs.

Workflow run status

| Color | Status |
| --- | --- |
| Blue | Running |
| Green | Succeeded |
| Red | Failed |
| Purple | Pending |

Workflow node status

| Color | Status |
| --- | --- |
| Blue | Running |
| Green | Succeeded |
| Red | Failed |
| Yellow | Retrying |
| Purple | Pending |
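
The following Python sketch shows one way the node statuses above could relate to the retry count. The transition order (Pending, Running, then Retrying after each failed attempt that still has retries left) is an assumption made for illustration, and the function name is hypothetical. The sketch is not derived from the service's actual state machine.

```python
def node_status_history(attempt_results, retries):
    """Return the statuses a node passes through in this simplified model.

    attempt_results: iterable of booleans, True if that attempt succeeds.
    retries: number of retries allowed after the first failed attempt.
    """
    history = ["Pending", "Running"]
    results = iter(attempt_results)
    for attempt in range(retries + 1):
        # A missing result is treated as a failed attempt.
        if next(results, False):
            history.append("Succeeded")
            return history
        if attempt < retries:
            history.append("Retrying")
    history.append("Failed")
    return history


print(node_status_history([False, True], retries=1))
print(node_status_history([False], retries=0))
```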
