DataWorks: EMR Shell node

Last Updated: Mar 14, 2025

You can create E-MapReduce (EMR) Shell nodes in DataWorks to meet specific business requirements. You can write custom Shell scripts and run them to perform tasks such as data processing, Hadoop component invocation, and file management. This topic describes how to configure and use an EMR Shell node in DataWorks to write and run a Shell script.

Prerequisites

  • An EMR Shell node is created.

Limits

  • This type of node can be run only on a serverless resource group or an exclusive resource group for scheduling. We recommend that you use a serverless resource group.

  • If you want to manage metadata for a DataLake or custom cluster in DataWorks, you must configure EMR-HOOK in the cluster first. For more information about how to configure EMR-HOOK, see Use the Hive extension feature to record data lineage and historical access information.

    Note

    If you do not configure EMR-HOOK in your cluster, metadata cannot be displayed in real time, audit logs cannot be generated, and data lineages cannot be displayed in DataWorks. EMR governance tasks also cannot be run.

  • If you commit a node by using the spark-submit command, we recommend that you set deploy-mode to cluster rather than client.

  • EMR Shell nodes are run on DataWorks resource groups for scheduling rather than in EMR clusters. You can run the commands supported by EMR components but cannot directly read information about EMR resources. If you want to reference an EMR resource in an EMR Shell node, you must first upload the resource to DataWorks. For more information, see Resource management.

Procedure

  1. On the configuration tab of the EMR Shell node, perform the following operations:

    Develop Shell code

    You can use one of the following methods based on your business requirements:

    Method 1: Upload and reference an EMR JAR resource

    DataWorks allows you to upload a resource from your on-premises machine to Data Studio and then reference the resource. If the EMR cluster that you want to use is a DataLake cluster, you can perform the following steps to reference an EMR JAR resource. If the EMR Shell node depends on a large number of resources, the resources cannot be uploaded by using the DataWorks console. In this case, you can store the resources in Hadoop Distributed File System (HDFS) and reference them in the code of the EMR Shell node, as shown in the sketch below.
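
    The HDFS-based approach can look like the following minimal sketch. The HDFS path and file name are hypothetical and assume that the dependency was already uploaded to the HDFS of the cluster:

      # Pull the dependency from HDFS into the working directory of the node.
      hadoop fs -get hdfs:///emr/jars/onaliyun_mr_wordcount-1.0-SNAPSHOT.jar ./
      # The local copy can then be passed to commands such as hadoop jar.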

    1. Create an EMR JAR resource.

      1. For more information about how to create an EMR JAR resource, see Resource management. In this example, the JAR package that is generated in the Prepare initial data and a JAR resource package section is stored in the emr/jars directory. Click Upload to upload the JAR package.

      2. Configure the Storage Path, Data Source, and Resource Group parameters.

      3. Click Save.


    2. Reference the EMR JAR resource.

      1. Open the EMR Shell node. The configuration tab of the node appears.

      2. Find the resource that you want to reference in the RESOURCE MANAGEMENT: ALL pane in the left-side navigation pane of the Data Studio page, right-click the resource name, and then select Reference Resources. In this example, the resource is onaliyun_mr_wordcount-1.0-SNAPSHOT.jar.

      3. If information in the ##@resource_reference{""} format appears on the configuration tab of the EMR Shell node, the resource is referenced. Then, run the following code. You must replace the placeholder information in the code, such as the resource package name, bucket name, and path, with actual values.

      ##@resource_reference{"onaliyun_mr_wordcount-1.0-SNAPSHOT.jar"}
      hadoop jar onaliyun_mr_wordcount-1.0-SNAPSHOT.jar cn.apache.hadoop.onaliyun.examples.EmrWordCount oss://onaliyun-bucket-2/emr/datas/wordcount02/inputs oss://onaliyun-bucket-2/emr/datas/wordcount02/outputs
      Note

      You cannot add comments when you write code for the EMR Shell node.

    Method 2: Reference an OSS resource

    The current node can reference an Object Storage Service (OSS) resource by using the OSS REF method. When you run the node, DataWorks automatically loads the OSS resource specified in the node code. This method is commonly used in scenarios in which JAR dependencies are required in EMR tasks or EMR tasks need to depend on scripts. Reference format:

    ossref://{endpoint}/{bucket}/{object}
    • endpoint: the endpoint of OSS. If the endpoint parameter is left empty, only a resource in an OSS bucket that resides in the same region as the current EMR cluster can be referenced.

    • bucket: a container that is used to store objects in OSS. Each bucket has a unique name. You can log on to the OSS console to view all buckets within the current logon account.

    • object: the name or path of an object stored in the bucket.

    Example

    1. Upload an object to the desired OSS bucket. In this example, emr_shell_test.sh is used. Sample content:

      #!/bin/sh
      echo "Hello, DataWorks!"
    2. Reference the OSS resource in the EMR Shell node.

      sh ossref://oss-cn-shanghai.aliyuncs.com/test-oss-of-dataworks/emr_shell_test.sh
      Note

      oss-cn-shanghai.aliyuncs.com is the endpoint of OSS, test-oss-of-dataworks is the name of the bucket, and emr_shell_test.sh is the name of the object.

      Output of emr_shell_test.sh:

      ...
      >>> [2024-10-24 15:46:01][INFO   ][CommandExecutor       ]: Process ready to execute. command: sh ./emr_shell_test.sh
      >>> [2024-10-24 15:46:01][INFO   ][CommandExecutor       ]: Command state update to RUNNING
      >>> [2024-10-24 15:46:01][INFO   ][CommandExecutor       ]: Process start to execute...
      Process Output>>> Hello, DataWorks!
      ...
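
    The ossref format can also be used for JAR dependencies. The following is a minimal sketch that assumes DataWorks resolves the ossref path to a local copy of the JAR before the command runs. The bucket, JAR name, main class, and data paths are hypothetical, and deploy-mode is set to cluster as recommended in the Limits section:

      spark-submit --master yarn --deploy-mode cluster \
        --class com.example.WordCount \
        ossref://oss-cn-shanghai.aliyuncs.com/test-oss-of-dataworks/wordcount-1.0.jar \
        oss://test-oss-of-dataworks/inputs oss://test-oss-of-dataworks/outputs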

    Configure scheduling parameters for the EMR Shell node

    In the code editor, develop the node code. You can define variables in the ${Variable} format in the node code and assign values to the variables by configuring scheduling parameters in the Scheduling Parameters section of the Properties tab in the right-side navigation pane of the configuration tab. This way, the values of the scheduling parameters dynamically replace the variables in the node code when the node is scheduled to run. Sample code:

    DD=`date`;
    echo "hello world, $DD"
    ## Scheduling parameters are supported.
    echo ${var};
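
    For the ${var} variable in the sample code, you can add an assignment such as the following in the Scheduling Parameters section. $bizdate is a built-in DataWorks scheduling parameter that resolves to the data timestamp in the yyyymmdd format:

    var=$bizdate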
    Note

    If you use an EMR DataLake cluster, the following command lines are also supported:

    • Shell commands: Shell commands under /usr/bin and /bin, such as ls and echo.

    • YARN: hadoop, hdfs, and yarn.

    • Spark: spark-submit.

    • Sqoop: sqoop-export, sqoop-import, and sqoop-import-all-tables.

    To use the Sqoop service, you must add the IP address of your resource group to the IP address whitelist of the ApsaraDB RDS instance that is used to store the metadata of the EMR cluster.
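
    A minimal sqoop-import sketch follows. The connection string, credentials, table name, and target directory are hypothetical placeholders:

    sqoop-import \
      --connect jdbc:mysql://rm-xxxx.mysql.rds.aliyuncs.com:3306/testdb \
      --username test_user \
      --password '******' \
      --table employees \
      --target-dir /tmp/sqoop/employees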

    Run the EMR Shell node

    1. On the Debugging Configurations tab in the right-side navigation pane of the configuration tab of the node, configure the Computing Resource parameter in the Computing Resource section and configure the Resource Group parameter in the DataWorks Configurations section.

      Note
      • You can also configure the CUs For Computing parameter based on the resources required for task execution. The default value of this parameter is 0.25.

      • You can configure the Image parameter based on your business requirements.

      • If you want to access a data source over the Internet or a virtual private cloud (VPC), you must use the resource group for scheduling that is connected to the data source. For more information, see Network connectivity solutions.

    2. In the top toolbar of the configuration tab of the node, click Run to run the node.

  2. If you want to run a task on the node on a regular basis, configure the scheduling information based on your business requirements.

  3. After you configure the node, deploy the node. For more information, see Node/workflow release.

  4. After you deploy the node, view the status of the node in Operation Center. For more information, see Getting started with Operation Center.

References

For information about how to run Python scripts on EMR Shell nodes by using Python 2 or Python 3 commands, see Use a Shell node to run Python scripts.

For information about how to use ossutil in EMR Shell nodes, see Use ossutil in Shell nodes.