DataWorks: EMR Shell node

Last Updated: Mar 25, 2026

Use EMR Shell nodes in DataWorks to write and run Shell scripts for E-MapReduce (EMR) clusters, for example to process data, call Hadoop components, and manage files.

Prerequisites

Before you begin, make sure that you have:

  • An Alibaba Cloud EMR cluster bound to DataWorks. For details, see Data Studio: Associate an EMR computing resource.

  • (Optional) A custom image based on the official dataworks_emr_base_task_pod image, configured in Data Studio. Use a custom image to replace Spark JAR packages or to include specific libraries, files, or JAR packages. For details, see Create a custom image and Use the image in Data Studio.

  • (Optional) If you are a Resource Access Management (RAM) user, confirm that you have been added to the workspace and assigned the Developer or Workspace Administrator role. For details, see Add members to a workspace.

Limitations

  • Execution environment: EMR Shell nodes run on DataWorks resource groups for scheduling, not in EMR clusters. Only a serverless resource group (recommended) or an exclusive resource group for scheduling is supported. To use a custom image in Data Studio, use a serverless resource group.

  • Resource references: Because nodes run on DataWorks resource groups and cannot directly read resource information from EMR, upload any resources you need to DataWorks first. For details, see Resource Management. Alternatively, store large resources in the Hadoop Distributed File System (HDFS) and reference them directly in your code.

  • Python files: EMR Shell nodes do not support running Python files. To run Python files, use Shell nodes instead.

  • Code comments: Comments are not supported in EMR Shell node code.

  • Metadata management: For DataLake or custom clusters, configure EMR-HOOK in the cluster before managing metadata in DataWorks. Without EMR-HOOK, metadata is not displayed in real time, audit logs are not generated, data lineage is not displayed, and EMR-related administration tasks cannot be performed. For configuration steps, see Configure EMR-HOOK for Hive.

  • spark-submit deploy mode: For tasks submitted using spark-submit, set deploy-mode to cluster instead of client (see the sketch after this list).

  • Sqoop components: When using Sqoop components, add the IP address of the resource group to the RDS whitelist.
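
For reference, a spark-submit call in an EMR Shell node that satisfies the cluster deploy mode requirement might look like the following sketch. The JAR name, main class, and OSS paths are placeholders rather than values from this topic:

spark-submit --master yarn --deploy-mode cluster --class com.example.WordCount ./spark-demo-1.0.jar oss://your-bucket/inputs oss://your-bucket/outputs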

Develop an EMR Shell node

Step 1: Write the Shell code

Choose one of the following methods to reference external resources in your Shell code.

Method 1: Reference an EMR JAR resource (DataLake clusters)

Use this method when working with a DataLake cluster. You upload the resource in Data Studio, and DataWorks injects it into the node at runtime.

  1. Create an EMR JAR resource.

    1. Follow the steps in Resource Management to create a new JAR resource. Store the JAR package (prepared in Prepare initial data and a JAR package) in the emr/jars directory.

    2. Click Upload to upload the JAR resource.

    3. Set Storage Path, Data Source, and Resource Group, then click Save.


  2. Reference the resource in the EMR Shell node:

    1. Open the EMR Shell node to go to the code editor.

    2. In Resource Management in the left navigation pane, right-click the resource (for example, onaliyun_mr_wordcount-1.0-SNAPSHOT.jar) and select Reference Resources. A reference statement similar to the following is added to the code editor:

    ##@resource_reference{"onaliyun_mr_wordcount-1.0-SNAPSHOT.jar"}
    hadoop jar onaliyun_mr_wordcount-1.0-SNAPSHOT.jar cn.apache.hadoop.onaliyun.examples.EmrWordCount oss://onaliyun-bucket-2/emr/datas/wordcount02/inputs oss://onaliyun-bucket-2/emr/datas/wordcount02/outputs

    Replace the resource package, bucket name, and paths with your actual values.

Method 2: Reference an OSS resource directly

Use this method to load a file from Object Storage Service (OSS) into the node's runtime environment, without first uploading it to DataWorks. This is useful for JAR dependencies or shell scripts that are already stored in OSS.

Reference the resource using an OSS REF statement:

ossref://{endpoint}/{bucket}/{object}
  • endpoint (optional): The OSS endpoint. If left blank, only OSS buckets in the same region as the EMR cluster are accessible.

  • bucket (required): The name of the OSS bucket that stores the object. Bucket names are unique. Log on to the OSS console to view all buckets under the current account.

  • object (required): The file name or path of the object within the bucket.

Example

Upload emr_shell_test.sh to an OSS bucket. The file contains:

#!/bin/sh
echo "Hello, DataWorks!"

Reference the file in the EMR Shell node:

sh ossref://oss-cn-shanghai.aliyuncs.com/test-oss-of-dataworks/emr_shell_test.sh

In this example, oss-cn-shanghai.aliyuncs.com is the endpoint, test-oss-of-dataworks is the bucket name, and emr_shell_test.sh is the object name.

The node output is:

...
>>> [2024-10-24 15:46:01][INFO   ][CommandExecutor       ]: Process ready to execute. command: sh ./emr_shell_test.sh
>>> [2024-10-24 15:46:01][INFO   ][CommandExecutor       ]: Command state update to RUNNING
>>> [2024-10-24 15:46:01][INFO   ][CommandExecutor       ]: Process start to execute...
Process Output>>> Hello, DataWorks!
...

Step 2: Configure scheduling parameters (optional)

Define variables in the Shell code using the ${variable_name} format, then assign values on the Schedule tab under Scheduling Parameters. The values are passed dynamically each time the node runs.

DD=`date`;
echo "hello world, $DD"
echo ${var};
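
For example, if var is mapped to the expression $[yyyymmdd-1] under Scheduling Parameters, the resolved business date can drive date-partitioned work. Both the parameter mapping and the HDFS path below are assumptions for illustration only:

echo "running for business date ${var}"
hadoop fs -ls /tmp/dataworks_demo/dt=${var}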

For the full syntax and expression options, see Sources and expressions of scheduling parameters.

Supported commands for DataLake clusters

For DataLake clusters, the following commands are available in addition to your custom scripts:

  • Shell commands from /usr/bin and /bin (for example, ls, echo)

  • Hadoop components: hadoop, hdfs, yarn

  • Spark components: spark-submit

  • Sqoop components: sqoop-export, sqoop-import, sqoop-import-all-tables
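
As a minimal sketch, the following lines combine several of these commands in one EMR Shell node; the directory path is a placeholder:

hdfs dfs -mkdir -p /tmp/emr_shell_demo
hadoop fs -ls /tmp/emr_shell_demo
yarn application -list -appStates RUNNING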

Step 3: Run the Shell task

  1. On the Run Configuration tab, set Compute Resource and Resource Group.

    • CUs For Scheduling: Adjust the compute units (CUs) based on the task's resource needs. The default is 0.25 CUs.

    • Image: Select a custom image if your task requires a customized component environment.

    • Network access: If your task needs to reach a data source over the public network or through a VPC, use a resource group that has passed the connectivity test with that data source. For details, see Network connectivity solutions.

  2. Click Run in the toolbar.

Step 4: Configure scheduling and deploy

  1. To run the node on a schedule, configure its scheduling properties. For details, see Configure scheduling properties for a node.

  2. Deploy the node. For details, see Deploy nodes and workflows.

  3. After deployment, monitor the node's run status in Operation Center. For details, see Get started with Operation Center.

What's next