
DataWorks:Create an EMR Shell node

Last Updated: Jul 08, 2025

You can create E-MapReduce (EMR) Shell nodes in DataWorks to meet specific business requirements. In an EMR Shell node, you can specify a custom Shell script and run it to perform tasks such as processing data, calling Hadoop components, and managing files. This topic describes how to configure and use an EMR Shell node in DataWorks to specify and run a Shell script.

Requirements

  • Before you start node development, if you need to customize the component environment, you can create a custom image based on the official image dataworks_emr_base_task_pod and use the image in Data Development.

    For example, when you create a custom image, you can replace the Spark JAR package or add the libraries, files, or JAR packages that your tasks depend on.

  • An EMR cluster is created and registered to DataWorks. For more information, see Old version of Data Development: Bind EMR computing resources.

  • (Optional, required only if you develop tasks as a Resource Access Management (RAM) user) The RAM user is added to the corresponding workspace and is assigned the Developer role or the Workspace Administrator role, which has more permissions and must be used with caution. For more information about how to add members, see Add workspace members.

  • A Serverless resource group is purchased and configured, including binding the workspace and configuring the network. For more information, see Add and use a Serverless resource group.

  • A workflow is created in DataStudio.

    Development operations in DataStudio are performed based on workflows for each compute engine. Therefore, you must create a workflow before you create a node. For more information, see Create a workflow.

  • A third-party package is installed based on the type of resource group that you use. The package must be installed if you run a Python script that references third-party packages on a DataWorks resource group:

    • Serverless resource group (recommended): Install third-party packages through image management. For more information, see Custom image.

    • Exclusive resource group for scheduling: Install third-party packages through O&M Assistant. For more information, see O&M Assistant.

Limits

  • Only Serverless resource groups (recommended) or exclusive resource groups for scheduling can be used to run this type of task. If you need to use an image in Data Development, you must use a Serverless resource group.

  • If you want to manage metadata for a DataLake or custom cluster in DataWorks, you must configure EMR-HOOK in the cluster first. If you do not configure EMR-HOOK in the cluster, metadata cannot be displayed in real time, audit logs cannot be generated, and data lineages cannot be displayed in DataWorks. In addition, the related EMR governance tasks cannot be run. For more information about configuring EMR-HOOK, see Configure EMR-HOOK for Hive.

  • If you commit a node by using spark-submit, we recommend that you set deploy-mode to cluster rather than client. A minimal sketch is provided after this list.

  • EMR Shell nodes are run on the resource groups for scheduling of DataWorks rather than in EMR clusters. You can run specific commands supported by EMR components but cannot directly read the information about EMR resources. If you want to reference an EMR resource in an EMR Shell node, you must upload the resource to DataWorks first. For more information, see Upload EMR resources.

  • EMR Shell nodes cannot run Python files. To run Python files, use Shell nodes instead.
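
A minimal sketch of a spark-submit command with deploy-mode set to cluster follows. The JAR name, main class, and OSS paths are hypothetical placeholders that you must replace with your own values.

# Submit the Spark job in cluster mode (recommended) instead of client mode.
# The JAR name, main class, and OSS paths are hypothetical placeholders.
spark-submit --master yarn --deploy-mode cluster \
  --class com.example.SparkWordCount \
  ./spark_examples-1.0.jar \
  oss://your-bucket/inputs oss://your-bucket/outputs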

1. Create an EMR Shell node

  1. Go to the DataStudio page.

    Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose Data Development and O&M > Data Development. On the page that appears, select the desired workspace from the drop-down list and click Go to Data Development.

  2. Create an EMR Shell node.

    1. Right-click the target workflow and select Create Node > EMR > EMR Shell.

      Note

      You can also hover over Create and select Create Node > EMR > EMR Shell.

    2. In the Create Node dialog box, enter a Name, and select an Engine Instance, Node Type, and Path. Click OK to go to the EMR Shell node configuration page.

      Note

      The node name can contain only letters, digits, underscores (_), and periods (.).

2. Develop an EMR Shell task

Double-click the created EMR Shell node to go to its configuration tab. Then, develop the task by using one of the following methods based on your business requirements:

Method 1: Upload a resource before referencing an EMR JAR resource

DataWorks allows you to upload a resource from your on-premises machine to DataStudio and then reference the resource. If you use an EMR data lake cluster, you can perform the following steps to reference an EMR JAR resource. If the EMR Shell node depends on a large number of resources, the resources cannot be uploaded by using the DataWorks console. In this case, you can store the resources in Hadoop Distributed File System (HDFS) and reference them in the code of the EMR Shell node, as shown in the sketch at the end of this method.

  1. Create an EMR JAR resource.

    Create an EMR JAR resource. For more information, see Create and use EMR resources. In this example, the JAR package generated in Prepare Initial Data and JAR Resource Package is stored in the emr/jars directory for JAR resources. The first time you use this feature, select One-click Authorization. Then, click Click To Upload to upload the JAR resource.

  2. Reference the JAR package.

    1. Open the created EMR Shell node and stay on the code editing page.

    2. Under the EMR > Resources node, find the resource that you want to reference (in this example, onaliyun_mr_wordcount-1.0-SNAPSHOT.jar), right-click the resource, and then select Reference Resource.

    3. If the code editing page of the EMR Shell node displays a statement in the ##@resource_reference{""} format, the resource is referenced. Then, run the following code. You must replace the resource package name, bucket name, and directory in the code with the actual information.

      ##@resource_reference{"onaliyun_mr_wordcount-1.0-SNAPSHOT.jar"}
      onaliyun_mr_wordcount-1.0-SNAPSHOT.jar cn.apache.hadoop.onaliyun.examples.EmrWordCount oss://onaliyun-bucket-2/emr/datas/wordcount02/inputs oss://onaliyun-bucket-2/emr/datas/wordcount02/outputs
      Note

      You cannot add comments when you write code for the EMR Shell node.
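
If the resources of the node are stored in HDFS instead, the following lines are a minimal sketch of how you can download a JAR package to the working directory of the node and run it. The HDFS path is a hypothetical placeholder, and the JAR package, main class, and OSS paths reuse the values from the preceding example.

# Minimal sketch: download a JAR package from HDFS to the working directory and run it.
# The HDFS path is a hypothetical placeholder.
hdfs dfs -get hdfs:///user/emr/jars/onaliyun_mr_wordcount-1.0-SNAPSHOT.jar ./
hadoop jar ./onaliyun_mr_wordcount-1.0-SNAPSHOT.jar cn.apache.hadoop.onaliyun.examples.EmrWordCount \
  oss://onaliyun-bucket-2/emr/datas/wordcount02/inputs \
  oss://onaliyun-bucket-2/emr/datas/wordcount02/outputs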

Method 2: Reference an OSS resource

The current node can directly reference Object Storage Service (OSS) resources in the ossref format. When the EMR node is run, DataWorks automatically downloads the OSS resources that are referenced in the code to the local environment for use. This method is commonly used in scenarios in which EMR tasks depend on JAR packages or scripts. Reference format:

ossref://{endpoint}/{bucket}/{object}
  • endpoint: the endpoint of OSS. If the endpoint parameter is left empty, only a resource in an OSS bucket that resides in the same region as the current EMR cluster can be referenced.

  • bucket: the container that OSS uses to store objects. Each bucket has a unique name. You can log on to the OSS console to view all buckets that belong to the current account.

  • object: a specific object, such as a file name or path, that is stored in a bucket.

Example

  1. Upload an object to the desired OSS bucket. This example uses emr_shell_test.sh with the following content:

    #!/bin/sh
    echo "Hello, DataWorks!"
  2. Reference the OSS resource in the EMR Shell node.

    sh ossref://oss-cn-shanghai.aliyuncs.com/test-oss-of-dataworks/emr_shell_test.sh
    Note

    oss-cn-shanghai.aliyuncs.com is the endpoint, test-oss-of-dataworks is the bucket name, and emr_shell_test.sh is the object file name.

    The running result is as follows, showing the output of the emr_shell_test.sh file:

    ...
    >>> [2024-10-24 15:46:01][INFO   ][CommandExecutor       ]: Process ready to execute. command: sh ./emr_shell_test.sh
    >>> [2024-10-24 15:46:01][INFO   ][CommandExecutor       ]: Command state update to RUNNING
    >>> [2024-10-24 15:46:01][INFO   ][CommandExecutor       ]: Process start to execute...
    Process Output>>> Hello, DataWorks!
    ...

Configure EMR Shell scheduling parameters

Develop the task code in the node editing area. You can define variables in the code in the ${variable_name} format and assign values to the variables in the Scheduling Configuration > Scheduling Parameters section in the right-side navigation pane of the node editing page. This way, the values are dynamically passed to the variables when the task is scheduled. For more information about scheduling parameters, see Supported formats of scheduling parameters. Example:

DD=`date`;
echo "hello world, $DD"
## Scheduling parameters are supported.
echo ${var};
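For example, if you assign the value $[yyyymmdd] to the var variable in the Scheduling Parameters section, ${var} in the preceding code is replaced with the data timestamp in the yyyymmdd format when the task instance is run. This assignment is only an illustration; use the scheduling parameter expressions that fit your scenario.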
Note

If you use an EMR data lake cluster, the following command lines are also supported:

  • Shell commands: Shell commands in /usr/bin and /bin, such as ls and echo.

  • YARN: hadoop, hdfs, and yarn.

  • Spark: spark-submit.

  • Sqoop: sqoop-export, sqoop-import, and sqoop-import-all-tables.

To use the Sqoop component, you must add the IP address of the resource group that is used by the component to the IP address whitelist of ApsaraDB RDS.
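
The following lines are a minimal sketch of such commands in an EMR Shell node. The HDFS directory, ApsaraDB RDS connection string, credentials, and table name are hypothetical placeholders.

# Basic Shell commands in /usr/bin and /bin.
echo "start EMR Shell task"
ls /tmp

# Hadoop and HDFS commands. The directory is a hypothetical placeholder.
hadoop fs -ls /user/hive/warehouse
hdfs dfs -mkdir -p /tmp/emr_shell_demo

# Sqoop import. The connection string, credentials, and table name are hypothetical placeholders.
# The IP address of the resource group must be added to the ApsaraDB RDS IP address whitelist.
sqoop-import --connect jdbc:mysql://rm-xxxx.mysql.rds.aliyuncs.com:3306/testdb \
  --username test_user --password '******' --table orders \
  --target-dir /tmp/emr_shell_demo/orders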

Execute an EMR Shell task

  1. Click the Run with Parameters icon in the toolbar, select the created scheduling resource group in the Parameters dialog box, and then click Run.

    Note
    • To access computing resources over the Internet or in a virtual private cloud (VPC), use a scheduling resource group that has passed the connectivity test with the computing resources. For more information, see Network connectivity solutions.

    • If you want to change the resource group that is used for subsequent runs of the task, click the Run with Parameters icon again and select the desired scheduling resource group.

  2. Click the Save icon to save the Shell script that you wrote.

  3. Optional. Perform smoke testing.

    If you want to perform smoke testing on the node in the development environment, you can do so when you commit the node or after the node is committed. For more information, see Perform smoke testing.

3. Configure node scheduling

If you want the system to periodically run a task on the node, you can click Properties in the right-side navigation pane on the configuration tab of the node to configure task scheduling properties based on your business requirements. For more information, see Overview.

Note
  • You must configure the Rerun Property and Upstream Node for the node before you can commit the node.

  • If you need to customize the component environment, you can create a custom image based on the official image dataworks_emr_base_task_pod and use the image in Data Development.

    For example, when you create a custom image, you can replace the Spark JAR package or add the libraries, files, or JAR packages that your tasks depend on.

4. Deploy the node task

After a task on a node is configured, you must commit and deploy the task. After you commit and deploy the task, the system runs the task on a regular basis based on scheduling configurations.

  1. Click the Save icon in the top toolbar to save the task.

  2. Click the Submit icon in the top toolbar to commit the task.

    In the Submit dialog box, configure the Change description parameter. Then, determine whether to review task code after you commit the task based on your business requirements.

    Note
    • You must configure the Rerun and Parent Nodes parameters on the Properties tab before you commit the task.

    • You can use the code review feature to ensure the code quality of tasks and prevent task execution errors caused by invalid task code. If you enable the code review feature, the task code that is committed can be deployed only after the task code passes the code review. For more information, see Code review.

If you use a workspace in standard mode, you must deploy the task in the production environment after you commit the task. To deploy a task on a node, click Deploy in the upper-right corner of the configuration tab of the node. For more information, see Deploy nodes.

What to do next

After you commit and deploy the task, the task is periodically run based on the scheduling configurations. You can click Operation Center in the upper-right corner of the configuration tab of the corresponding node to go to Operation Center and view the scheduling status of the task. For more information, see View and manage auto triggered tasks.

References

To learn how to use Python 2 or Python 3 commands to run Python scripts in EMR Shell nodes, see Run Python scripts in Shell nodes.

To learn how to use the ossutil tool in EMR Shell nodes, see Use the ossutil tool in Shell nodes.