You can create EMR Shell nodes in DataWorks to run custom Shell scripts for advanced operations, such as data processing, calling Hadoop components, and file management. This topic shows you how to configure and use EMR Shell nodes to edit and run Shell tasks.
Prerequisites
-
Before you start node development, if you need a custom component environment, create a custom image based on the official dataworks_emr_base_task_pod image and then use it in Data Studio. For more information, see Create a custom image and Use images in data development.
For example, you can replace Spark JAR packages or add dependencies on specific libraries, files, or JAR packages when you create a custom image.
-
Create an Alibaba Cloud EMR cluster and register it with DataWorks. For more information, see New Data Studio: Attach an EMR compute resource.
-
(Optional, for RAM users) Add the Resource Access Management (RAM) user that is used for task development to the workspace and assign it the Development or Workspace Administrator role. The Workspace Administrator role includes extensive permissions, so grant it with caution. For more information, see Add workspace members.
If you are using a root account, skip this step.
Limitations
-
EMR Shell nodes can run only on a serverless resource group (recommended) or an exclusive resource group for scheduling. Using a custom image for data development requires a serverless resource group.
-
To manage metadata for a DataLake cluster or a custom cluster in DataWorks, you must first configure EMR-HOOK on the cluster. For more information, see Configure EMR-HOOK for Hive.
Note: If EMR-HOOK is not configured on the cluster, you cannot view metadata in real time, generate audit logs, display data lineage, or perform EMR-related governance tasks in DataWorks.
-
For tasks submitted with spark-submit, set deploy-mode to cluster. The client mode is not recommended.
-
EMR Shell nodes run on DataWorks scheduling resource groups, not on EMR clusters. You can use some EMR component commands, but you cannot directly read the resource status from the EMR cluster. Therefore, you must first upload any required resources to DataWorks. For more information, see Resource Management.
-
EMR Shell nodes do not support running Python files. Use a Shell node to run them.
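As a minimal sketch of the spark-submit limitation listed above, the following script assembles a submission command that sets --deploy-mode to cluster and prints it instead of running it. The class name, JAR name, and input path are hypothetical placeholders, and --master yarn is an assumption about how the cluster runs Spark. The comments are for readers of this topic and may need to be removed before you paste the code into an EMR Shell node.

```shell
# Hypothetical placeholders: replace with your own class, JAR, and arguments.
MAIN_CLASS="com.example.WordCount"
APP_JAR="wordcount-example.jar"
INPUT_PATH="oss://example-bucket/emr/inputs"

# Assumption: the EMR cluster runs Spark on YARN.
SUBMIT_CMD="spark-submit --master yarn --deploy-mode cluster --class ${MAIN_CLASS} ${APP_JAR} ${INPUT_PATH}"

# Print the command for review; this sketch does not submit anything.
echo "${SUBMIT_CMD}"
```

To submit for real, run the assembled command in the node instead of echoing it.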
Procedure
-
In the EMR Shell node editor, perform the following operations.
Develop Shell code
You can choose an option based on your specific requirements:
Option 1: Upload and reference EMR JAR
You can upload resources from your local machine to Data Studio and then reference them. If you are using a DataLake cluster, follow these steps to reference an EMR JAR resource. If a resource that the EMR Shell node depends on is too large to be uploaded through the DataWorks UI, you can store the resource in HDFS and reference it in your code.
-
Create an EMR JAR resource.
-
For more information, see Resource Management. Store the JAR package generated in Prepare initial data and a JAR package in the JAR resource directory emr/jars. Then click the Click Upload button.
-
Select a Storage Path, Data Sources, and Resource Group.
-
Click Save.
For Storage Path, select HDFS.
-
Reference the EMR JAR resource.
-
Open the created EMR Shell node and go to the code editor page.
-
In the left-side navigation pane, find the resource you want to reference under Resource Management (for example, onaliyun_mr_wordcount-1.0-SNAPSHOT.jar). Right-click the resource and select Insert Resource Path.
-
After you insert the resource path, a success message appears on the code editing page of the EMR Shell node, which indicates that the resource is referenced. Then run the following command. The resource package, bucket name, and paths in the command are examples. Replace them with your actual values.
##@resource_reference{"onaliyun_mr_wordcount-1.0-SNAPSHOT.jar"}
onaliyun_mr_wordcount-1.0-SNAPSHOT.jar cn.apache.hadoop.onaliyun.examples.EmrWordCount oss://onaliyun-bucket-2/emr/datas/wordcount02/inputs oss://onaliyun-bucket-2/emr/datas/wordcount02/outputs
Note: Comments are not supported when you edit code in an EMR Shell node.
-
Option 2: Directly reference OSS resource
You can directly reference OSS resources by using OSS REF. When you run the EMR node, DataWorks automatically loads the OSS resources referenced in your code to the local machine. An OSS REF path uses the following format:
ossref://{endpoint}/{bucket}/{object}
-
endpoint: The public access endpoint for OSS. If you leave this parameter empty, the OSS bucket must be in the same region as the EMR cluster.
-
bucket: The name of the bucket, which is a container that stores objects in OSS. Each bucket has a unique name. You can log on to the OSS console to view all buckets under the current account.
-
object: The specific object, such as a file name or path, stored in the bucket.
Example
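The following minimal sketch shows how the three parts described above combine into an OSS REF path. The endpoint, bucket, and object values are hypothetical placeholders. The script only prints the resulting path and does not access OSS.

```shell
# Hypothetical placeholder values; replace them with your own.
OSS_ENDPOINT="oss-cn-shanghai.aliyuncs.com"
OSS_BUCKET="example-bucket"
OSS_OBJECT="emr/jars/wordcount-example.jar"

# Compose the path in the ossref://{endpoint}/{bucket}/{object} format.
OSSREF_PATH="ossref://${OSS_ENDPOINT}/${OSS_BUCKET}/${OSS_OBJECT}"
echo "${OSSREF_PATH}"
```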
Configure EMR Shell scheduling parameters
When developing task code in the Shell editor, you can define variables using the ${variable_name} format. Then, assign a value to the variable in the Scheduling Parameters section under Scheduling Settings on the right side of the node editor. This allows you to pass parameters dynamically during scheduled runs. For more information about scheduling parameters, see Sources and expressions of scheduling parameters. The following code provides an example:
DD=`date`;
echo "hello world, $DD"
## Can be used with scheduling parameters
echo ${var};
Note: If you are using a DataLake cluster, the following command-line tools are also supported.
-
Shell commands: Shell commands in /usr/bin and /bin, such as ls and echo.
-
YARN components: hadoop, hdfs, and yarn.
-
Spark components: spark-submit.
-
Sqoop components: sqoop-export, sqoop-import, sqoop-import-all-tables, and more.
If you use these components to access an RDS instance, you must add the resource group's IP address to the RDS allowlist.
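The scheduling-parameter pattern described above can be sketched as follows. The variable name var comes from the earlier example. The stand-in value 20240101 and the output directory are hypothetical and exist only so that the sketch also runs outside DataWorks. The comments are for readers of this topic and may need to be removed before you paste the code into an EMR Shell node.

```shell
# Stand-in value so the sketch runs outside DataWorks (hypothetical date).
# In a node, drop this line and assign var under Scheduling Parameters instead.
var="20240101"

# Hypothetical output directory that uses the parameter as a partition suffix.
OUTPUT_DIR="/tmp/emr_shell_demo/dt=${var}"
echo "Writing results to ${OUTPUT_DIR}"
```

Keeping the parameter in a single variable means that each scheduled run only needs a different value in Scheduling Parameters, and the script body stays unchanged.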
Run the Shell task
-
In the Run Configuration section, configure the Compute Resource and Resource Group.
Note:
-
You can also configure CUs for Scheduling based on the resources required for task execution. The default CU is 0.25.
-
You can select an Image based on your task requirements.
-
To access a data source in a public network or a VPC environment, you must use a scheduling resource group that can connect to the data source. For more information, see Network connectivity solutions.
-
In the toolbar, click Run to run the Shell task.
-
To run the node task periodically, configure its scheduling properties. For more information, see Configure scheduling properties for a node.
Note: If you need to customize the component environment, you can create a custom image based on the official image dataworks_emr_base_task_pod and use the image in Data Development.
For example, when you create a custom image, you can replace Spark JAR packages or depend on specific libraries, files, or JAR packages.
-
After configuring the node, deploy it. For more information, see Deploy nodes and workflows.
-
After deploying the task, you can view its running status in Operation Center. For more information, see Get started with Operation Center.
FAQ
Q: A hosts mapping has been configured on the legacy scheduling resource group, but the EMR Shell node still reports a resolution failure. How can I resolve this issue?
A: Reinitialize the EMR cluster with the resource group on which the hosts mapping is configured so that the EMR Shell script can recognize the newly added mapping. Go to the compute resources list page, click Initialize Resources, and then click Re-initialize in the dialog box. Confirm that the initialization succeeds.
Related documentation
-
For information about using Python 2 or Python 3 commands to run Python scripts in an EMR Shell node, see Run Python scripts on Shell-type nodes.
-
For information about using the ossutil tool in an EMR Shell node, see Use ossutil on Shell-type nodes.