DataWorks:EMR Shell node

Last Updated: Jun 22, 2026

You can create EMR Shell nodes in DataWorks to run custom Shell scripts for advanced operations, such as data processing, calling Hadoop components, and file management. This topic shows you how to configure and use EMR Shell nodes to edit and run Shell tasks.

Prerequisites

  • Before you start node development, if you need a custom component environment, create a custom image based on the official dataworks_emr_base_task_pod image and then use it in Data Studio. For more information, see Create a custom image and Use images in data development.

    For example, you can replace Spark JAR packages or add dependencies on specific libraries, files, or JAR packages when you create a custom image.

  • Create an Alibaba Cloud EMR cluster and register it with DataWorks. For more information, see New Data Studio: Attach an EMR compute resource.

  • (Optional, for RAM users) Add the Resource Access Management (RAM) user that develops tasks to the workspace, and assign it the Development or Workspace Administrator role. The Workspace Administrator role has extensive permissions, so grant it with caution. For more information, see Add workspace members.

    If you are using a root account, skip this step.

Limitations

  • EMR Shell nodes can run only on a serverless resource group (recommended) or an exclusive resource group for scheduling. Using a custom image for data development requires a serverless resource group.

  • To manage metadata for a DataLake cluster or a custom cluster in DataWorks, you must first configure EMR-HOOK on the cluster. For more information, see Configure EMR-HOOK for Hive.

    Note

    If EMR-HOOK is not configured on the cluster, you cannot view metadata in real time, generate audit logs, display data lineage, or perform EMR-related governance tasks in DataWorks.

  • For tasks submitted with spark-submit, set --deploy-mode to cluster. Client mode is not recommended.

  • EMR Shell nodes run on DataWorks scheduling resource groups, not on EMR clusters. Some EMR component commands are available, but the node cannot directly read resources that are stored on the EMR cluster. You must therefore upload any resources that the node needs to DataWorks first. For more information, see Resource Management.

  • EMR Shell nodes do not support running Python files. Use a Shell node to run them.
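
The spark-submit limitation above can be sketched as follows. This is a minimal sketch, not a definitive invocation: the class name com.example.WordCount and the JAR name wordcount-example.jar are hypothetical placeholders, and the RUN_SUBMIT switch is an illustrative convention, not a DataWorks setting. The script only prints the command unless RUN_SUBMIT=1 is set and spark-submit is on PATH.

```shell
# Hypothetical placeholders: replace with your own class and JAR.
APP_CLASS="com.example.WordCount"
APP_JAR="wordcount-example.jar"

# Build the argument list with cluster deploy mode, as the limitation recommends.
set -- --master yarn --deploy-mode cluster --class "$APP_CLASS" "$APP_JAR"

# Keep a printable copy of the command for logging or a dry run.
SPARK_CMD="spark-submit $*"
echo "$SPARK_CMD"

# Submit only when explicitly requested and spark-submit is available.
if [ "${RUN_SUBMIT:-0}" = "1" ] && command -v spark-submit >/dev/null 2>&1; then
  spark-submit "$@"
fi
```

Printing the command first makes it easy to confirm in the run log that cluster mode was actually requested.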

Procedure

  1. In the EMR Shell node editor, perform the following operations.

    Develop Shell code

    You can choose an option based on your specific requirements:

    Option 1: Upload and reference EMR JAR

    You can upload resources from your local machine to Data Studio and then reference them. If you are using a DataLake cluster, follow these steps to reference an EMR JAR resource. If a resource that the EMR Shell node depends on is too large to be uploaded through the DataWorks UI, you can store the resource in HDFS and reference it in your code.

    1. Create an EMR JAR resource.

      1. Create the resource as described in Resource Management. Store the JAR package generated in "Prepare initial data and a JAR package" in the JAR resource directory emr/jars, and then click Upload.

      2. Select a Storage Path, Data Sources, and Resource Group.

      3. Click Save.

      For Storage Path, select HDFS.

    2. Reference the EMR JAR resource.

      1. Open the created EMR Shell node and go to the code editor page.

      2. In the left-side navigation pane, find the resource you want to reference under Resource Management (for example, onaliyun_mr_wordcount-1.0-SNAPSHOT.jar). Right-click the resource and select Insert Resource Path.

      3. After you insert the resource path, a success message appears on the code editing page of the EMR Shell node. This indicates that the resource is referenced. Then run the following command. The resource package, bucket name, and paths in the command are examples. Replace them with your actual values.

      ##@resource_reference{"onaliyun_mr_wordcount-1.0-SNAPSHOT.jar"}
      onaliyun_mr_wordcount-1.0-SNAPSHOT.jar cn.apache.hadoop.onaliyun.examples.EmrWordCount oss://onaliyun-bucket-2/emr/datas/wordcount02/inputs oss://onaliyun-bucket-2/emr/datas/wordcount02/outputs
      Note

      Comments are not supported when you edit code in an EMR Shell node.

    Option 2: Directly reference OSS resource

    You can directly reference OSS resources by using OSS REF. When you run the EMR node, DataWorks automatically loads the OSS resources from your code to the local machine.

    ossref://{endpoint}/{bucket}/{object}
    • endpoint: The public access endpoint for OSS. If you leave this parameter empty, the OSS bucket must be in the same region as the EMR cluster.

    • bucket: The name of the bucket, which is a container that stores objects in OSS. Each bucket has a unique name. You can log on to the OSS console to view all buckets under the current account.

    • object: The specific object, such as a file name or path, stored in the bucket.
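
The three parts described above can be assembled as in the following sketch. build_ossref is a hypothetical helper written for illustration, not a DataWorks command, and the sample values are illustrative. The sketch assumes that all three parts are provided and does not cover the empty-endpoint case.

```shell
# Compose an ossref URI from endpoint, bucket, and object.
# build_ossref is a hypothetical helper, not part of DataWorks.
build_ossref() {
  endpoint=$1
  bucket=$2
  object=$3
  # Strip one leading slash from the object key so the URI has no empty path segment.
  object=${object#/}
  printf 'ossref://%s/%s/%s\n' "$endpoint" "$bucket" "$object"
}

# Sample values for illustration only.
build_ossref "oss-cn-shanghai.aliyuncs.com" "test-oss-of-dataworks" "emr_shell_test.sh"
```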

    Example

    1. Upload a sample file to an OSS bucket. In this example, we use the emr_shell_test.sh file with the following content:

      #!/bin/sh
      echo "Hello, DataWorks!"
    2. Directly reference the OSS resource in the EMR Shell node.

      sh ossref://oss-cn-shanghai.aliyuncs.com/test-oss-of-dataworks/emr_shell_test.sh
      Note

      In this command, oss-cn-shanghai.aliyuncs.com is the endpoint, test-oss-of-dataworks is the bucket name, and emr_shell_test.sh is the object file name.

      The command returns the following output from the emr_shell_test.sh file:

      ...
      >>> [2024-10-24 15:46:01][INFO   ][CommandExecutor       ]: Process ready to execute. command: sh ./emr_shell_test.sh
      >>> [2024-10-24 15:46:01][INFO   ][CommandExecutor       ]: Command state update to RUNNING
      >>> [2024-10-24 15:46:01][INFO   ][CommandExecutor       ]: Process start to execute...
      Process Output>>> Hello, DataWorks!
      ...

    Configure EMR Shell scheduling parameters

    When developing task code in the Shell editor, you can define variables using the ${variable_name} format. Then, assign a value to the variable in the Scheduling Parameters section under Scheduling Settings on the right side of the node editor. This allows you to pass parameters dynamically during scheduled runs. For more information about scheduling parameters, see Sources and expressions of scheduling parameters. The following code provides an example:

    DD=$(date)
    echo "hello world, $DD"
    ## Can be used with scheduling parameters
    echo "${var}"
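
A common use of a scheduling parameter is to build a date-partitioned path. The following is a minimal sketch under stated assumptions: bizdate is a hypothetical parameter name that you would define in the Scheduling Parameters section, and /emr/datas/logs is a placeholder path. The first assignment only stands in for the scheduled value during a local dry run and would be dropped in a node.

```shell
# Local stand-in for the value that a scheduling parameter would supply.
# "bizdate" is a hypothetical parameter name chosen for this sketch.
bizdate="20240101"

# In a node, ${bizdate} would be filled in from the Scheduling Parameters section.
partition_path="/emr/datas/logs/dt=${bizdate}"
echo "Processing partition: ${partition_path}"
```
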
    Note

    If you are using a DataLake cluster, the following command-line tools are also supported.

    • Shell commands: Shell commands in /usr/bin and /bin, such as ls and echo.

    • YARN components: hadoop, hdfs, and yarn.

    • Spark components: spark-submit.

    • Sqoop components: sqoop-export, sqoop-import, sqoop-import-all-tables, and more.

    If you use these components to access an RDS instance, you must add the resource group's IP address to the RDS allowlist.
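
The command-line tools listed above can be called directly from the script. The following sketch wraps each call in run_or_print, a hypothetical helper that falls back to printing the command when the tool is not on PATH, so the script can be checked outside a resource group. The path /emr/datas is a placeholder.

```shell
# run_or_print is a hypothetical helper: it runs a tool when it is on PATH
# and otherwise prints the command so the script can be dry-run anywhere.
run_or_print() {
  if command -v "$1" >/dev/null 2>&1; then
    "$@"
  else
    echo "[dry-run] $*"
  fi
}

# List a directory in HDFS (placeholder path).
run_or_print hdfs dfs -ls /emr/datas
# List YARN applications that are currently running.
run_or_print yarn application -list -appStates RUNNING
```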

    Run the Shell task

    1. In the Run Configuration section, configure the Compute Resource and Resource Group.

      Note
      • You can also configure CUs for Scheduling based on the resources that the task requires. The default value is 0.25 CUs.

      • You can select an Image based on your task requirements.

      • To access a data source in a public network or a VPC environment, you must use a scheduling resource group that can connect to the data source. For more information, see Network connectivity solutions.

    2. In the toolbar, click Run to run the Shell task.

  2. To run the node task periodically, configure its scheduling properties. For more information, see Configure scheduling properties for a node.

    Note

    If you need to customize the component environment, you can create a custom image based on the official image dataworks_emr_base_task_pod and use the image in Data Studio.

    For example, when you create a custom image, you can replace Spark JAR packages or depend on specific libraries, files, or JAR packages.

  3. After configuring the node, deploy it. For more information, see Deploy nodes and workflows.

  4. After deploying the task, you can view its running status in Operation Center. For more information, see Get started with Operation Center.

FAQ

Q: A hosts mapping has been configured on the legacy scheduling resource group, but the EMR Shell node still reports a resolution failure. How can I resolve this issue?

A: You must use the configured resource group to reinitialize the EMR cluster so that the EMR Shell script can recognize the newly added hosts mapping. Go to the compute resources list page, click Initialize Resources, and then click Re-initialize in the dialog box. Make sure that the initialization succeeds.
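
When you diagnose a resolution failure like this one, it can help to check from a Shell script whether a host name resolves in the runtime environment. The following is a minimal sketch, assuming that the getent utility is available in the image (this topic does not state that it is). check_host is a hypothetical helper, and emr-header-1.cluster-example is a placeholder host name.

```shell
# check_host is a hypothetical helper for verifying name resolution.
# It prints "resolved", "unresolved", or "unknown" (when getent is unavailable).
check_host() {
  if ! command -v getent >/dev/null 2>&1; then
    echo "unknown"
  elif getent hosts "$1" >/dev/null 2>&1; then
    echo "resolved"
  else
    echo "unresolved"
  fi
}

# Placeholder host name; use a host name from your own cluster.
check_host "emr-header-1.cluster-example"
```

If the result is still unresolved after the hosts mapping is configured, reinitializing the cluster as described in the answer is the step to try next.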
