Use EMR Shell nodes in DataWorks to write and run Shell scripts on E-MapReduce (EMR) clusters — for data processing, calling Hadoop components, and managing files.
Prerequisites
Before you begin, make sure that you have:
An Alibaba Cloud EMR cluster bound to DataWorks. For details, see Data Studio: Associate an EMR computing resource.
(Optional) A custom image based on the official dataworks_emr_base_task_pod image, configured in Data Studio. Use a custom image to replace Spark JAR packages or to include specific libraries, files, or JAR packages. For details, see Create a custom image and Use the image in Data Studio.
(Optional) If you are a Resource Access Management (RAM) user, confirm that you have been added to the workspace and assigned the Developer or Workspace Administrator role. For details, see Add members to a workspace.
Limitations
Execution environment: EMR Shell nodes run on DataWorks resource groups for scheduling, not in EMR clusters. Only a serverless resource group (recommended) or an exclusive resource group for scheduling is supported. To use a custom image in Data Studio, use a serverless resource group.
Resource references: Because nodes run on DataWorks resource groups and cannot directly read resource information from EMR, upload any resources you need to DataWorks first. For details, see Resource Management. Alternatively, store large resources in the Hadoop Distributed File System (HDFS) and reference them directly in your code.
Python files: EMR Shell nodes do not support running Python files. To run Python files, use Shell nodes instead.
Code comments: Comments are not supported in EMR Shell node code.
Metadata management: For DataLake or custom clusters, configure EMR-HOOK in the cluster before managing metadata in DataWorks. Without EMR-HOOK, metadata is not displayed in real time, audit logs are not generated, data lineage is not displayed, and EMR-related administration tasks cannot be performed. For configuration steps, see Configure EMR-HOOK for Hive.
spark-submit deploy mode: For tasks submitted using spark-submit, set deploy-mode to cluster instead of client.
Sqoop components: When using Sqoop components, add the IP address of the resource group to the RDS whitelist.
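For example, the deploy-mode limitation above applies to a submission like the following. This is a sketch only; the class name, JAR file, and OSS paths are hypothetical placeholders, not values from this guide:

```shell
# Sketch: submitting a Spark task from an EMR Shell node.
# --deploy-mode must be cluster, not client.
# All names and paths below are hypothetical placeholders.
spark-submit \
  --deploy-mode cluster \
  --class com.example.WordCount \
  ./example-spark-job.jar \
  oss://example-bucket/input oss://example-bucket/output
```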
Develop an EMR Shell node
Step 1: Write the Shell code
Choose one of the following methods to reference external resources in your Shell code.
Method 1: Reference an EMR JAR resource (DataLake clusters)
Use this method when working with a DataLake cluster. DataWorks uploads the resource to Data Studio and injects it into the node at runtime.
Create an EMR JAR resource.
Follow the steps in Resource Management to create a new JAR resource. Store the JAR package (prepared in Prepare initial data and a JAR package) in the emr/jars directory. Click Upload to upload the JAR resource.
Set Storage Path, Data Source, and Resource Group, then click Save.

Reference the resource in the EMR Shell node:
Open the EMR Shell node to go to the code editor.
In Resource Management in the left navigation pane, right-click the resource (for example, onaliyun_mr_wordcount-1.0-SNAPSHOT.jar) and select Reference Resources. A reference statement is automatically added to the code editor:
```shell
##@resource_reference{"onaliyun_mr_wordcount-1.0-SNAPSHOT.jar"}
onaliyun_mr_wordcount-1.0-SNAPSHOT.jar cn.apache.hadoop.onaliyun.examples.EmrWordCount oss://onaliyun-bucket-2/emr/datas/wordcount02/inputs oss://onaliyun-bucket-2/emr/datas/wordcount02/outputs
```
Replace the resource package, bucket name, and paths with your actual values.
Method 2: Reference an OSS resource directly
Use this method to load a file from Object Storage Service (OSS) at runtime to the local machine, without uploading it to DataWorks. This is useful for JAR dependencies or shell scripts already stored in OSS.
Reference the resource using an OSS REF statement:
```shell
ossref://{endpoint}/{bucket}/{object}
```

| Parameter | Description | Required |
|---|---|---|
endpoint | The OSS endpoint. If left blank, only OSS buckets in the same region as the EMR cluster are accessible. | No |
bucket | The name of the OSS bucket that stores the object. Bucket names are unique. Log on to the OSS console to view all buckets under the current account. | Yes |
object | The file name or path of the object within the bucket. | Yes |
Example
Upload emr_shell_test.sh to an OSS bucket. The file contains:
```shell
#!/bin/sh
echo "Hello, DataWorks!"
```

Reference the file in the EMR Shell node:

```shell
sh ossref://oss-cn-shanghai.aliyuncs.com/test-oss-of-dataworks/emr_shell_test.sh
```

In this example, oss-cn-shanghai.aliyuncs.com is the endpoint, test-oss-of-dataworks is the bucket name, and emr_shell_test.sh is the object name.
The node output is:
```
...
>>> [2024-10-24 15:46:01][INFO ][CommandExecutor ]: Process ready to execute. command: sh ./emr_shell_test.sh
>>> [2024-10-24 15:46:01][INFO ][CommandExecutor ]: Command state update to RUNNING
>>> [2024-10-24 15:46:01][INFO ][CommandExecutor ]: Process start to execute...
Process Output>>> Hello, DataWorks!
...
```

Step 2: Configure scheduling parameters (optional)
Define variables in the Shell code using the ${variable_name} format, then assign values on the Schedule tab under Scheduling Parameters. The values are passed dynamically each time the node runs.
```shell
DD=`date`;
echo "hello world, $DD"
echo ${var};
```

For the full syntax and expression options, see Sources and expressions of scheduling parameters.
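Outside DataWorks, the scheduler is not present to substitute ${var}, so a local dry run can simulate the replacement before deploying. This is a debugging sketch only; the value 20240101 stands in for whatever the scheduling parameter would resolve to:

```shell
#!/bin/sh
# Simulate DataWorks scheduling-parameter substitution locally.
# DataWorks replaces ${var} textually before the script runs; here sed
# performs the same replacement so the snippet can be dry-run on a
# development machine. The value 20240101 is a made-up example.
cat > /tmp/node_script.sh <<'EOF'
echo "hello world, ${var}"
EOF
sed 's/\${var}/20240101/' /tmp/node_script.sh | sh
```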
Supported commands for DataLake clusters
For DataLake clusters, the following commands are available in addition to your custom scripts:
Shell commands from /usr/bin and /bin (for example, ls and echo)
YARN components: hadoop, hdfs, yarn
Spark components: spark-submit
Sqoop components: sqoop-export, sqoop-import, sqoop-import-all-tables
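A node script for a DataLake cluster might combine these command families as follows. This is a sketch only; the HDFS paths, RDS host, database, and table names are hypothetical placeholders:

```shell
#!/bin/sh
# Sketch of a DataLake-cluster EMR Shell node script.
# All paths, hosts, and table names are hypothetical placeholders.

# Plain shell commands from /bin and /usr/bin:
echo "task started at $(date)"

# HDFS/YARN components:
hdfs dfs -mkdir -p /tmp/example_output
hadoop fs -ls /tmp/example_output

# Sqoop component (the resource group IP must be in the RDS whitelist):
sqoop-import --connect jdbc:mysql://rds-host:3306/example_db \
  --username example_user --table example_table \
  --target-dir /tmp/example_output/example_table
```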
Step 3: Run the Shell task
On the Run Configuration tab, set Compute Resource and Resource Group.
CUs For Scheduling: Adjust the compute units (CUs) based on the task's resource needs. The default is 0.25 CUs.
Image: Select a custom image if your task requires a customized component environment.
Network access: If your task needs to reach a data source over the public network or through a VPC, use a resource group that has passed the connectivity test with that data source. For details, see Network connectivity solutions.
Click Run in the toolbar.
Step 4: Configure scheduling and deploy
To run the node on a schedule, configure its scheduling properties. For details, see Configure scheduling properties for a node.
Deploy the node. For details, see Deploy nodes and workflows.
After deployment, monitor the node's run status in Operation Center. For details, see Get started with Operation Center.
What's next
Run Python scripts on Shell nodes — Run Python 2 or Python 3 scripts using Shell node python commands.
Use ossutil on Shell nodes — Use the ossutil tool to manage OSS resources from Shell nodes.