DataWorks: Create an EMR Shell node

Last Updated: Jun 20, 2026

You can create an EMR Shell node in DataWorks to run custom Shell scripts for advanced operations, such as data processing, calling Hadoop components, and managing files. This topic describes how to configure, develop, and run Shell tasks in EMR Shell nodes.

Prerequisites

  • To customize the component environment before you start node development, you can create a custom image based on the official image dataworks_emr_base_task_pod and use the image in DataStudio.

    For example, you can replace a Spark JAR package or depend on specific libraries, files, or JAR packages when you create a custom image.

  • An EMR cluster is registered with DataWorks. For more information, see DataStudio (old version): Bind an EMR compute resource.

  • (Optional, for RAM users) The RAM user for task development must be a member of the workspace and have the Development or Workspace Administrator role. The Workspace Administrator role has extensive permissions, so grant it with caution. For more information, see Add members to a workspace and assign roles to them.

  • A serverless resource group is purchased and configured. The configuration includes binding the resource group to a workspace and setting up the network. For more information, see Use a serverless resource group.

  • A workflow is created in DataStudio.

    In DataStudio, all node development is organized into workflows. Therefore, you must create a workflow before creating a node. For more information, see Create a workflow.

  • If you run a Python script on a DataWorks resource group and your code requires third-party packages, you must prepare the package environment on the resource group. The method varies based on the type of resource group you use:

    • serverless resource group (recommended): Install third-party packages using image management. For more information, see Custom images.

    • exclusive resource group for scheduling: Install third-party packages using O&M Assistant. For more information, see O&M Assistant.

Limitations

  • You can run this type of task only on a serverless resource group (recommended) or an exclusive resource group for scheduling. If you need to use an image in data development, you must use a serverless resource group.

  • To manage metadata for a DataLake or custom cluster in DataWorks, you must configure EMR-HOOK on the cluster. Without EMR-HOOK, DataWorks cannot display real-time metadata, generate audit logs, show data lineage, or perform EMR-related governance tasks. For more information about how to configure EMR-HOOK, see Configure EMR-HOOK for Hive.

  • For tasks submitted using spark-submit, we recommend using cluster mode for the deploy-mode parameter instead of client mode.

  • EMR Shell nodes run on DataWorks scheduling resource groups, not on EMR clusters. You can use some EMR component commands, but you cannot directly read resource information from EMR. To reference a resource, you must upload it to DataWorks first. For more information, see Upload EMR resources.

  • EMR Shell nodes do not support running Python files. Use a Shell node instead.
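The spark-submit recommendation above can be illustrated with a minimal sketch. It only assembles and prints a command line that uses cluster deploy mode; it does not submit anything. The class name, JAR name, and input and output paths are hypothetical placeholders, and --master yarn is an assumption about the target cluster. Because this topic notes that comments are not supported in EMR Shell node code, remove the comment lines before you paste any part of it into a node.

```shell
#!/bin/sh
# Assemble a spark-submit command line that uses cluster deploy mode.
# Arguments: <main class> <application JAR> [application arguments...]
build_spark_submit_cmd() {
  main_class="$1"
  app_jar="$2"
  shift 2
  # --deploy-mode cluster starts the Spark driver inside the cluster
  # instead of on the machine that submits the job.
  echo "spark-submit --master yarn --deploy-mode cluster --class $main_class $app_jar $*"
}

# Hypothetical class, JAR, and paths, used for illustration only.
cmd="$(build_spark_submit_cmd com.example.WordCount my_app.jar /input /output)"
echo "$cmd"
```

Printing the command first makes it easy to confirm that the deploy mode is set as intended before the line is used in a real task.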

Step 1: Create an EMR Shell node

  1. Log on to the DataWorks console. In the target region, click Data Development and O&M > Data Development in the left-side navigation pane. Select a workspace from the drop-down list and click Go to Data Development.

  2. Create an EMR Shell node.

    1. Right-click the target workflow and choose Create Node > EMR > EMR Shell.

      Note

      Alternatively, you can hover over Create and choose Create Node > EMR > EMR Shell.

    2. In the Create Node dialog box, enter a Name and select an Engine Instance, Node Type, and Path. Click OK to go to the EMR Shell node editor.

      Note

      The node name can contain uppercase letters, lowercase letters, Chinese characters, digits, underscores (_), and periods (.).

Step 2: Develop an EMR Shell task

On the EMR Shell node editor page, double-click the node that you created. Then, choose one of the following options based on your scenario:

Option 1: Upload and reference EMR JAR resources

You can upload a resource from your local computer to DataStudio and then reference it. If you use a DataLake cluster, you can follow these steps to reference an EMR JAR resource. If an EMR Shell node depends on a large resource, you cannot upload the resource on the DataWorks page. In this case, you can store the resource in HDFS and reference it in your code.

  1. Create an EMR JAR resource.

    For more information, see Create and use EMR resources. In this example, the JAR package generated in the topic "Prepare initial data and a JAR package" is stored in the emr/jars directory. The first time you use this feature, click Authorize, and then click Click Upload to upload the JAR resource.

  2. Reference the EMR JAR resource.

    1. Open the created EMR Shell node and go to the code editor.

    2. Under the EMR > Resources node, find the target resource (for example, onaliyun_mr_wordcount-1.0-SNAPSHOT.jar), right-click it, and select Insert Resource Path.

    3. Referencing the resource adds a statement in the ##@resource_reference{""} format to the code editor of the EMR Shell node. This statement indicates that the resource is referenced. Then, run the following commands. The resource package, bucket name, and path information in the following commands are for demonstration only. Replace them with your actual information.

      ##@resource_reference{"onaliyun_mr_wordcount-1.0-SNAPSHOT.jar"}
      onaliyun_mr_wordcount-1.0-SNAPSHOT.jar cn.apache.hadoop.onaliyun.examples.EmrWordCount oss://onaliyun-bucket-2/emr/datas/wordcount02/inputs oss://onaliyun-bucket-2/emr/datas/wordcount02/outputs

      Note

      You cannot add comments to the code of an EMR Shell node.

Option 2: Reference OSS resources

An EMR Shell node can directly reference OSS resources by using the OSS REF feature. When you run the node, DataWorks automatically loads the OSS resources that are referenced in the code to your local environment. This method is commonly used in scenarios where an EMR task needs to run with JAR dependencies or depends on scripts. Use the following format to reference an OSS resource:

ossref://{endpoint}/{bucket}/{object}

  • endpoint: the domain name used to access OSS. If you leave the endpoint parameter empty, you can access only the OSS resources that are in the same region as the EMR cluster.

  • bucket: a container that stores objects in OSS. Each bucket has a unique name. You can log on to the OSS console to view all buckets in your account.

  • object: a specific object, such as a file name or path, that is stored in a bucket.
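The format above can be sketched as a small helper that joins the three parts into a reference string. The function name build_ossref is hypothetical and is not part of DataWorks. The sketch requires all three parts because the exact form of a reference with an empty endpoint is not shown in this topic.

```shell
#!/bin/sh
# Compose an ossref reference from its three parts: endpoint, bucket, and object.
build_ossref() {
  endpoint="$1"
  bucket="$2"
  object="$3"
  # Reject missing parts rather than guessing what an incomplete reference looks like.
  if [ -z "$endpoint" ] || [ -z "$bucket" ] || [ -z "$object" ]; then
    echo "usage: build_ossref <endpoint> <bucket> <object>" >&2
    return 1
  fi
  printf 'ossref://%s/%s/%s\n' "$endpoint" "$bucket" "$object"
}

# Values taken from the example in this topic.
build_ossref "oss-cn-shanghai.aliyuncs.com" "test-oss-of-dataworks" "emr_shell_test.sh"
```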

Example

  1. Upload a sample file to an OSS bucket. This topic uses emr_shell_test.sh as an example. The following code provides the sample content of the file:

    #!/bin/sh
    echo "Hello, DataWorks!"

  2. Directly reference the OSS resource in the EMR Shell node.

    sh ossref://oss-cn-shanghai.aliyuncs.com/test-oss-of-dataworks/emr_shell_test.sh

    Note

    In this example, oss-cn-shanghai.aliyuncs.com is the endpoint, test-oss-of-dataworks is the bucket name, and emr_shell_test.sh is the object name.

    The output of the emr_shell_test.sh file is as follows:

    ...
    >>> [2024-10-24 15:46:01][INFO   ][CommandExecutor       ]: Process ready to execute. command: sh ./emr_shell_test.sh
    >>> [2024-10-24 15:46:01][INFO   ][CommandExecutor       ]: Command state update to RUNNING
    >>> [2024-10-24 15:46:01][INFO   ][CommandExecutor       ]: Process start to execute...
    Process Output>>> Hello, DataWorks!
    ...

Configure scheduling parameters

Define variables in your code using the ${variable_name} format. You can then assign values to these variables in the Scheduling > Scheduling Parameter section in the right-side pane to pass dynamic parameters at runtime. For more information, see Supported formats for scheduling parameters. The following code provides an example:

DD=`date`;
echo "hello world, $DD"
## You can use this with scheduling parameters.
echo ${var};
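
As a further illustration, the following minimal sketch uses a scheduling parameter to derive a date-based directory name. The first assignment is only a stand-in so that the script can be tried outside DataWorks. The value 20240620 and the /tmp/emr_shell_demo path are hypothetical. Like the sample above, it writes the scheduling parameter as ${var} and the script's own variable without braces. Because this topic notes that comments are not supported in EMR Shell node code, remove the comment lines before you use it in a node.

```shell
#!/bin/sh
# Stand-in for the value assigned under Scheduling > Scheduling Parameter.
# Delete this line when the script runs in a node.
var="20240620"

# Derive a date-based output directory from the parameter (hypothetical layout).
output_dir="/tmp/emr_shell_demo/dt=${var}"
echo "Writing results to $output_dir"
```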
Note

If you use a DataLake cluster, the following command-line tools are also supported.

  • Shell commands: Shell commands in the /usr/bin and /bin directories, such as ls and echo.

  • Yarn components: hadoop, hdfs, and yarn.

  • Spark component: spark-submit.

  • Sqoop components: sqoop-export, sqoop-import, and sqoop-import-all-tables.

To use Sqoop components with ApsaraDB RDS, you must add the IP address of the resource group to the ApsaraDB RDS whitelist.
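
The following minimal sketch shows how two of the command-line tools listed above might be called. The helper run_if_available is hypothetical. It checks that a tool is on PATH before running it, so the script reports a skipped command instead of failing where the tool is not installed. Whether these calls succeed depends on your cluster type and network setup. Because this topic notes that comments are not supported in EMR Shell node code, remove the comment lines before you use it in a node.

```shell
#!/bin/sh
# Run a command only if its executable is on PATH; otherwise report and continue.
run_if_available() {
  if command -v "$1" >/dev/null 2>&1; then
    "$@"
  else
    echo "skip: $1 not found on PATH"
  fi
}

# List the root directory of the default file system.
run_if_available hdfs dfs -ls /

# List the applications known to the YARN ResourceManager.
run_if_available yarn application -list
```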

Run an EMR Shell task

  1. In the toolbar, click the Run with Parameters icon. In the Parameter dialog box, select the created scheduling resource group and click Run.

    Note
    • To access compute resources over the internet or a VPC, you must use a scheduling resource group that has passed the connectivity test for the compute resource. For more information, see Network connectivity solutions.

    • If you need to change the resource group for subsequent task runs, you can click the Run with Parameters icon and select a different scheduling resource group.

  2. Click the Save icon to save the Shell script.

  3. (Optional) Perform smoke testing.

    If you want to perform smoke testing in the development environment, you can run it before or after you submit the node. For more information, see Perform smoke testing.

Step 3: Configure scheduling properties

If you want the system to periodically run a task on the node, you can click Properties in the right-side navigation pane on the configuration tab of the node to configure task scheduling properties based on your business requirements. For more information, see Overview.

Note
  • You must configure the Rerun attribute and Parent Nodes properties for the node before you can submit it.

  • If you need to customize the component environment, you can create a custom image based on the official image dataworks_emr_base_task_pod and use it in DataStudio.

    For example, you can replace Spark JAR packages or include specific libraries, files, or JAR packages when you create a custom image.

Step 4: Deploy the task

After a task on a node is configured, you must commit and deploy the task. After you commit and deploy the task, the system runs the task on a regular basis based on scheduling configurations.

  1. Click the Save icon in the top toolbar to save the task.

  2. Click the Submit icon in the top toolbar to commit the task.

    In the Submit dialog box, configure the Change description parameter. Then, determine whether to review task code after you commit the task based on your business requirements.

    Note
    • You must configure the Rerun and Parent Nodes parameters on the Properties tab before you commit the task.

    • You can use the code review feature to ensure the code quality of tasks and prevent task execution errors caused by invalid task code. If you enable the code review feature, the task code that is committed can be deployed only after the task code passes the code review. For more information, see Code review.

If you use a workspace in standard mode, you must deploy the task to the production environment after you commit it. To deploy the task, click Deploy in the upper-right corner of the configuration tab of the node. For more information, see Deploy nodes.

More operations

After you commit and deploy the task, the task is periodically run based on the scheduling configurations. You can click Operation Center in the upper-right corner of the configuration tab of the corresponding node to go to Operation Center and view the scheduling status of the task. For more information, see Manage scheduled tasks.

Related topics

To learn how to use Python 2 or Python 3 commands to run a Python script in an EMR Shell node, see Run a Python script in a Shell-type node.

To learn how to use ossutil in an EMR Shell node, see Use ossutil to access OSS in a Shell-type node.