All Products
Search
Document Center

DataWorks:Create an EMR Shell node

Last Updated:Mar 26, 2026

EMR Shell nodes let you run Shell scripts against your EMR cluster from DataWorks, covering data processing, Hadoop component calls (YARN, Spark, Sqoop), and file management. This topic walks you through creating, developing, scheduling, and deploying an EMR Shell node.

Prerequisites

Before you begin, make sure you have:

  • An Alibaba Cloud EMR cluster registered to DataWorks. See DataStudio (old version): Associate an EMR computing resource

  • A serverless resource group purchased and configured (associated with a workspace and with network connectivity set up). See Use serverless resource groups

  • A workflow created in DataStudio. All node development in DataStudio is organized under workflows. See Create a workflow

  • (If developing as a RAM user) The RAM user added to the DataWorks workspace as a member with the Develop or Workspace Administrator role assigned. The Workspace Administrator role has more permissions than necessary. Exercise caution when you assign the Workspace Administrator role. See Add workspace members and assign roles to them

  • (Optional) A custom image prepared, if you need to customize the component environment—for example, to replace Spark JAR packages or include specific libraries. Build your custom image from the official dataworks_emr_base_task_pod image. See Create a custom image and use it in DataStudio

  • (If running Python scripts on a resource group) A third-party package installed:

    • Serverless resource group: use the image management feature. See Custom images

    • Exclusive resource group for scheduling: use the O&M Assistant feature. See O&M Assistant

Limitations

  • EMR Shell nodes run on DataWorks resource groups for scheduling, not inside EMR clusters. Use a serverless resource group (recommended) or an exclusive resource group for scheduling. If you need to use a DataStudio image, use a serverless resource group.

  • EMR Shell nodes cannot run Python scripts. To run Python scripts, use Shell nodes instead.

  • To reference an EMR resource in a node, upload the resource to DataWorks first. See Create and use an EMR JAR resource.

  • Do not add comments in your EMR Shell node code.

  • When using spark-submit, set deploy-mode to cluster, not client.

  • To manage metadata for a DataLake or custom cluster in DataWorks, configure EMR-HOOK in the cluster first. Without EMR-HOOK, metadata is not displayed in real time, audit logs are not generated, data lineage is not tracked, and EMR governance tasks cannot run. See Use the Hive extension feature to record data lineage and historical access information.

Step 1: Create an EMR Shell node

  1. Log on to the DataWorks console. In the top navigation bar, select a region. In the left-side navigation pane, choose Data Development and O&M > Data Development. Select a workspace from the drop-down list and click Go to Data Development.

  2. Find the workflow you want to work in. Right-click the workflow name, then choose Create Node > EMR > EMR Shell. Alternatively, hover over the Create icon and choose Create Node > EMR > EMR Shell.

  3. In the Create Node dialog box, configure Name, Engine Instance, Node Type, and Path. Click Confirm. The configuration tab of the EMR Shell node opens.

    The node name can contain only letters, digits, underscores (_), and periods (.).

Step 2: Develop an EMR Shell task

EMR Shell nodes support two ways to reference external resources. Choose the method that fits your setup:

Method 1: Upload a JAR resource (recommended)Method 2: Reference an OSS resource
Use whenYour EMR cluster is a DataLake clusterYou need to reference JAR dependencies or scripts already stored in OSS
How it worksUpload from your local machine to DataStudio, then insert a resource referencePoint directly to an OSS path using the ossref:// scheme

Method 1: Upload and reference an EMR JAR resource

DataWorks allows you to upload a resource from your on-premises machine to DataStudio before you can reference the resource. If the EMR cluster that you want to use is a DataLake cluster, you can perform the following steps to reference an EMR JAR resource. If the EMR Shell node depends on large amounts of resources, the resources cannot be uploaded by using the DataWorks console. In this case, you can store the resources in Hadoop Distributed File System (HDFS) and reference the resources in the code of the EMR Shell node.

  1. Create an EMR JAR resource. See Create and use an EMR resource. In this example, the JAR package is stored in the emr/jars directory. The first time you use an EMR JAR resource, click Authorize to grant DataWorks access, then click Upload.

    Create a JAR resource

  2. Open the EMR Shell node. In the EMR folder under Resource, right-click the resource you want to reference and select Insert Resource Path. A ##@resource_reference{""} line appears in the editor, confirming the resource is referenced.

    Reference a resource

  3. Write your task code. Replace the placeholder values with your actual resource name, bucket, and path:

    ##@resource_reference{"onaliyun_mr_wordcount-1.0-SNAPSHOT.jar"}
    onaliyun_mr_wordcount-1.0-SNAPSHOT.jar cn.apache.hadoop.onaliyun.examples.EmrWordCount oss://onaliyun-bucket-2/emr/datas/wordcount02/inputs oss://onaliyun-bucket-2/emr/datas/wordcount02/outputs

Method 2: Reference an OSS resource

Use the OSS REF method to point your node code directly at an Object Storage Service (OSS) object. When the task runs, DataWorks automatically downloads the referenced resource. This is commonly used when JAR dependencies or scripts are already stored in OSS.

Reference format:

ossref://{endpoint}/{bucket}/{object}
ParameterDescription
endpointThe OSS endpoint. If left blank, only OSS buckets in the same region as the EMR cluster can be referenced.
bucketThe OSS bucket name. View all buckets in your account from the OSS console.
objectThe file name or path within the bucket.

Example: reference a shell script stored in OSS

  1. Upload emr_shell_test.sh to your OSS bucket. Sample content:

    #!/bin/sh
    echo "Hello, DataWorks!"
  2. Reference the script in the EMR Shell node:

    sh ossref://oss-cn-shanghai.aliyuncs.com/test-oss-of-dataworks/emr_shell_test.sh

    In this example: oss-cn-shanghai.aliyuncs.com is the endpoint, test-oss-of-dataworks is the bucket, and emr_shell_test.sh is the object.

  3. Run the task. Expected output:

    >>> [2024-10-24 15:46:01][INFO   ][CommandExecutor       ]: Process ready to execute. command: sh ./emr_shell_test.sh
    >>> [2024-10-24 15:46:01][INFO   ][CommandExecutor       ]: Command state update to RUNNING
    >>> [2024-10-24 15:46:01][INFO   ][CommandExecutor       ]: Process start to execute...
    Process Output>>> Hello, DataWorks!

Use scheduling parameters

Define variables in your node code using ${Variable} syntax. Assign values to those variables in the Scheduling Parameter section of the Properties tab. DataWorks substitutes the values at runtime.

DD=`date`;
echo "hello world, $DD"
echo ${var};

Supported commands (EMR DataLake clusters)

All commands run on the DataWorks resource group, not inside the EMR cluster directly. The following commands are supported when your cluster is a DataLake cluster:

CategoryCommands
ShellCommands under /usr/bin and /bin, such as ls and echo
YARNhadoop, hdfs, yarn
Sparkspark-submit
Sqoopsqoop-export, sqoop-import, sqoop-import-all-tables
To use Sqoop, add your resource group's IP address to the IP address whitelist of the ApsaraDB RDS instance that stores EMR cluster metadata.

Run the task

  1. In the toolbar, click the 高级运行 (Run with Parameters) icon. In the Parameters dialog box, select a resource group from the Resource Group Name drop-down list and click Run.

    To access a computing resource over the Internet or a virtual private cloud (VPC), use the resource group that is connected to that computing resource. See Network connectivity solutions.
  2. Click the 保存 icon in the top toolbar to save the node.

  3. (Optional) Perform smoke testing on the node in the development environment before committing. See Perform smoke testing.

Step 3: Configure scheduling properties

To have DataWorks run the task on a schedule, click Properties in the right-side navigation pane and configure the scheduling settings.

Before committing the task, configure the Rerun and Parent Nodes parameters on the Properties tab.

If you need to customize the component environment, you can create a custom image based on the official image dataworks_emr_base_task_pod and use the custom image in DataStudio. For example, you can replace Spark JAR packages or include specific libraries, files, or JAR packages when creating the custom image.

For full details on scheduling options, see Overview.

Step 4: Deploy the task

  1. Click the 保存 icon to save the task.

  2. Click the 提交 icon to commit the task. In the Submit dialog box, fill in Change description and decide whether to enable code review.

    If you enable code review, the committed code is deployed only after it passes review. See Code review.
  3. If your workspace is in standard mode, deploy the task to the production environment after committing. Click Deploy in the upper-right corner of the configuration tab. See Deploy nodes.

What's next