You can create E-MapReduce (EMR) Shell nodes in DataWorks to meet specific business requirements. You can write custom Shell scripts and run them to perform tasks such as data processing, Hadoop component invocation, and file management. This topic describes how to configure and use an EMR Shell node in DataWorks to specify and run a Shell script.
Prerequisites
An EMR Shell node is created.
Limits
This type of node can be run only on a serverless resource group or an exclusive resource group for scheduling. We recommend that you use a serverless resource group.
If you want to manage metadata for a DataLake or custom cluster in DataWorks, you must configure EMR-HOOK in the cluster first. For more information about how to configure EMR-HOOK, see Use the Hive extension feature to record data lineage and historical access information.
Note: If you do not configure EMR-HOOK in your cluster, metadata cannot be displayed in real time, audit logs cannot be generated, and data lineage cannot be displayed in DataWorks. EMR governance tasks also cannot be run.
If you commit a node by using spark-submit, we recommend that you set deploy-mode to cluster rather than client (see the sketch after this list).
EMR Shell nodes run on DataWorks resource groups for scheduling rather than in EMR clusters. You can run specific commands supported by EMR components, but you cannot directly read information about EMR resources. If you want to reference an EMR resource in an EMR Shell node, you must upload the resource to DataWorks first. For more information, see Resource management.
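The following is a minimal sketch of a spark-submit invocation in cluster mode. The class name, JAR name, and OSS paths are hypothetical placeholders, not values from this topic:
# Minimal sketch: submit a Spark job from an EMR Shell node in cluster mode.
# The class, JAR, and OSS paths below are hypothetical placeholders.
spark-submit --master yarn --deploy-mode cluster \
  --class com.example.WordCount \
  ./wordcount-1.0.jar oss://example-bucket/inputs oss://example-bucket/outputs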
Procedure
On the configuration tab of the EMR Shell node, perform the following operations:
Develop Shell code
You can use one of the following methods based on your business requirements:
Method 1: Upload and reference an EMR JAR resource
DataWorks allows you to upload a resource from your on-premises machine to Data Studio and then reference the resource. If the EMR cluster that you want to use is a DataLake cluster, you can perform the following steps to reference an EMR JAR resource. If the EMR Shell node depends on a large number of resources, the resources cannot be uploaded by using the DataWorks console. In this case, you can store the resources in Hadoop Distributed File System (HDFS) and reference them in the code of the EMR Shell node.
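A minimal sketch of the HDFS-based approach follows. The HDFS path is a hypothetical placeholder; the JAR name, class, and OSS paths reuse the sample values that appear later in this topic:
# Minimal sketch: fetch a large dependency from HDFS at run time instead of
# uploading it through the DataWorks console. The HDFS path is a placeholder.
hdfs dfs -get hdfs:///tmp/deps/onaliyun_mr_wordcount-1.0-SNAPSHOT.jar ./
# Run the fetched JAR. The class and OSS paths reuse the sample values from this topic.
hadoop jar ./onaliyun_mr_wordcount-1.0-SNAPSHOT.jar cn.apache.hadoop.onaliyun.examples.EmrWordCount oss://onaliyun-bucket-2/emr/datas/wordcount02/inputs oss://onaliyun-bucket-2/emr/datas/wordcount02/outputs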
Create an EMR JAR resource.
For more information about how to create an EMR JAR resource, see Resource management. In this example, the JAR package that is generated in the Prepare initial data and a JAR resource package section is stored in the emr/jars directory.
Click Upload to upload the JAR package.
Configure the Storage Path, Data Source, and Resource Group parameters.
Click Save.
Reference the EMR JAR resource.
Open the EMR Shell node. The configuration tab of the node appears.
Find the resource that you want to reference in the RESOURCE MANAGEMENT: ALL pane in the left-side navigation pane of the Data Studio page, right-click the resource name, and then select Reference Resources. In this example, the resource is onaliyun_mr_wordcount-1.0-SNAPSHOT.jar.
If information in the ##@resource_reference{""} format appears on the configuration tab of the EMR Shell node, the resource is referenced. Then, run the following code. Replace the placeholder values, such as the resource package name, bucket name, and path, with the actual values.
##@resource_reference{"onaliyun_mr_wordcount-1.0-SNAPSHOT.jar"}
onaliyun_mr_wordcount-1.0-SNAPSHOT.jar cn.apache.hadoop.onaliyun.examples.EmrWordCount oss://onaliyun-bucket-2/emr/datas/wordcount02/inputs oss://onaliyun-bucket-2/emr/datas/wordcount02/outputs
Note: You cannot add comments when you write code for the EMR Shell node.
Method 2: Reference an OSS resource
The current node can reference an Object Storage Service (OSS) resource by using the OSS REF method. When you run the node, DataWorks automatically loads the OSS resource specified in the node code. This method is commonly used when EMR tasks require JAR dependencies or depend on scripts. Reference format:
ossref://{endpoint}/{bucket}/{object}
endpoint: the endpoint of OSS. If the endpoint parameter is left empty, only a resource in an OSS bucket that resides in the same region as the current EMR cluster can be referenced.
bucket: a container that is used to store objects in OSS. Each bucket has a unique name. You can log on to the OSS console to view all buckets within the current logon account.
object: a file name or path that is stored in a bucket.
Example
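A minimal sketch, assuming a hypothetical bucket examplebucket and a hypothetical script object emr/scripts/demo.sh; as described above, DataWorks loads the referenced OSS object when the node runs:
# Minimal sketch: run a Shell script stored in OSS.
# The bucket name and object path are hypothetical placeholders.
sh ossref://examplebucket/emr/scripts/demo.sh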
Configure scheduling parameters for the EMR Shell node
In the code editor, develop the node code. You can define variables in the ${Variable} format in the node code and assign scheduling parameters to the variables in the Scheduling Parameters section of the Properties tab in the right-side navigation pane of the configuration tab. This way, the values of the scheduling parameters dynamically replace the variables in the node code when the node is scheduled to run. Sample code:
DD=`date`;
echo "hello world, $DD"
## Scheduling parameters are supported.
echo ${var};
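For example, if you assign the variable var the value $bizdate in the Scheduling Parameters section (an illustrative assignment, not one specified in this topic), echo ${var} prints the data timestamp of the node when the node runs.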
Note: If you use an EMR DataLake cluster, the following command lines are also supported:
Shell commands: commands under /usr/bin and /bin, such as ls and echo.
YARN: hadoop, hdfs, and yarn.
Spark: spark-submit.
Sqoop: sqoop-export, sqoop-import, and sqoop-import-all-tables.
To use the Sqoop service, you must add the IP address of your resource group to the IP address whitelist of the ApsaraDB RDS instance that is used to store the metadata of the EMR cluster.
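For illustration, a few commands of each supported type; a sketch rather than an exhaustive list:
ls /tmp                 # a Shell command under /bin or /usr/bin
hdfs dfs -ls /          # a YARN (HDFS) command
spark-submit --version  # a Spark command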
Run the EMR Shell node
On the Debugging Configurations tab in the right-side navigation pane of the configuration tab of the node, configure the Computing Resource parameter in the Computing Resource section and configure the Resource Group parameter in the DataWorks Configurations section.
Note: You can also configure the CUs For Computing parameter based on the resources required for task execution. The default value of this parameter is 0.25.
You can configure the Image parameter based on your business requirements.
If you want to access a data source over the Internet or a virtual private cloud (VPC), you must use the resource group for scheduling that is connected to the data source. For more information, see Network connectivity solutions.
In the top toolbar of the configuration tab of the node, click Run to run the node.
If you want to run a task on the node on a regular basis, configure the scheduling information based on your business requirements.
After you configure the node, deploy the node. For more information, see Node/workflow release.
After you deploy the node, view the status of the node in Operation Center. For more information, see Getting started with Operation Center.
References
For information about how to run Python scripts on EMR Shell nodes by using Python 2 or Python 3 commands, see Use a Shell node to run Python scripts.
For information about how to use ossutil in EMR Shell nodes, see Use ossutil in Shell nodes.