You can create an EMR Shell node in DataWorks to run custom Shell scripts for advanced operations, such as data processing, calling Hadoop components, and managing files. This topic describes how to configure, develop, and run Shell tasks in EMR Shell nodes.
Prerequisites
-
To customize the component environment before you start node development, you can create a custom image based on the official image dataworks_emr_base_task_pod and use the image in DataStudio. For example, you can replace a Spark JAR package or depend on specific libraries, files, or JAR packages when you create a custom image.
-
An EMR cluster is registered with DataWorks. For more information, see DataStudio (old version): Bind an EMR compute resource.
-
(Optional, for RAM users) The RAM user for task development must be a member of the workspace and have the Development or Workspace Administrator role. The Workspace Administrator role has extensive permissions, so grant it with caution. For more information, see Add members to a workspace and assign roles to them.
-
A serverless resource group is purchased and configured. The configuration includes binding the resource group to a workspace and setting up the network. For more information, see Use a serverless resource group.
-
A workflow is created in DataStudio.
In DataStudio, all node development is organized into workflows. Therefore, you must create a workflow before creating a node. For more information, see Create a workflow.
-
If you run a Python script on a DataWorks resource group and your code requires third-party packages, you must prepare the package environment on the resource group. The method varies based on the type of resource group you use:
-
serverless resource group (recommended): Install third-party packages using image management. For more information, see Custom images.
-
exclusive resource group for scheduling: Install third-party packages using O&M Assistant. For more information, see O&M Assistant.
-
Limitations
-
You can run this type of task only on a serverless resource group (recommended) or an exclusive resource group for scheduling. If you need to use an image in data development, you must use a serverless resource group.
-
To manage metadata for a DataLake or custom cluster in DataWorks, you must configure EMR-HOOK on the cluster. Without EMR-HOOK, DataWorks cannot display real-time metadata, generate audit logs, show data lineage, or perform EMR-related governance tasks. For more information about how to configure EMR-HOOK, see Configure EMR-HOOK for Hive.
-
For tasks submitted using spark-submit, we recommend using cluster mode for the deploy-mode parameter instead of client mode.
-
EMR Shell nodes run on DataWorks scheduling resource groups, not on EMR clusters. You can use some EMR component commands, but you cannot directly read resource information from EMR. To reference a resource, you must upload it to DataWorks first. For more information, see Upload EMR resources.
-
EMR Shell nodes do not support running Python files. Use a Shell node instead.
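To illustrate the spark-submit recommendation in the list above, the following is a minimal sketch of a submission that sets deploy-mode to cluster. The JAR name and main class are hypothetical placeholders, and --master yarn is an assumption about how the EMR cluster schedules Spark jobs. The script only prints the command by default so that it can be reviewed before anything is submitted.

```shell
#!/bin/sh
# Hypothetical application details; replace them with your own.
APP_JAR="my-spark-app.jar"
MAIN_CLASS="com.example.MySparkApp"

# run_spark prepends its arguments to the spark-submit command line.
# Passing "echo" gives a dry run that only prints the command.
run_spark() {
  "$@" spark-submit --master yarn --deploy-mode cluster \
    --class "$MAIN_CLASS" "$APP_JAR"
}

# Dry run: print the command instead of submitting the job.
run_spark echo
```

To actually submit the job, call run_spark without arguments in place of the dry run.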
Step 1: Create an EMR Shell node
Log on to the DataWorks console. In the target region, click in the left-side navigation pane. Select a workspace from the drop-down list and click Go to Data Development.
-
Create an EMR Shell node.
-
Right-click the target workflow and choose .
Note: Alternatively, you can hover over Create and choose .
-
In the Create Node dialog box, enter a Name and select an Engine Instance, Node Type, and Path. Click OK to go to the EMR Shell node editor.
Note: The node name can contain uppercase letters, lowercase letters, Chinese characters, digits, underscores (_), and periods (.).
-
Step 2: Develop an EMR Shell task
On the EMR Shell node editor page, double-click the node that you created. Then, choose one of the following options based on your scenario:
-
(Recommended) Upload resources from your local computer to DataStudio and then reference them in your code. For more information, see Option 1: Upload and reference EMR JAR resources.
-
Reference OSS resources using the OSS REF feature. For more information, see Option 2: Reference OSS resources.
Option 1: Upload and reference EMR JAR resources
You can upload a resource from your local computer to DataStudio and then reference it. If you use a DataLake cluster, you can follow these steps to reference an EMR JAR resource. If an EMR Shell node depends on a large resource, you cannot upload the resource on the DataWorks page. In this case, you can store the resource in HDFS and reference it in your code.
-
Create an EMR JAR resource.
For more information, see Create and use EMR resources. In this example, the JAR package generated in the topic Prepare initial data and a JAR package is stored in the emr/jars directory. The first time you use this feature, click Authorize, and then click Click Upload to upload the JAR resource.
-
Reference the EMR JAR resource.
-
Open the created EMR Shell node and go to the code editor.
-
Under the node, find the target resource (for example, onaliyun_mr_wordcount-1.0-SNAPSHOT.jar), right-click it, and select Insert Resource Path.
-
Referencing the resource adds a statement in the ##@resource_reference{""} format to the code editor of the EMR Shell node. This statement indicates that the resource is referenced. Then, run the following commands. The resource package, bucket name, and path information in the following commands are for demonstration only. Replace them with your actual information.
##@resource_reference{"onaliyun_mr_wordcount-1.0-SNAPSHOT.jar"}
onaliyun_mr_wordcount-1.0-SNAPSHOT.jar cn.apache.hadoop.onaliyun.examples.EmrWordCount oss://onaliyun-bucket-2/emr/datas/wordcount02/inputs oss://onaliyun-bucket-2/emr/datas/wordcount02/outputs
Note: You cannot add comments to the code of an EMR Shell node.
-
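For the large-resource case mentioned at the start of Option 1, one possible approach is to copy the JAR from HDFS to the working directory and then run it. The sketch below assumes that the hdfs and hadoop commands are available to the node, and the HDFS path is a hypothetical placeholder. The class name and OSS paths reuse the demonstration values above. The comments are for explanation only. Because comments are not supported in EMR Shell node code, remove them before you paste the script into a node.

```shell
#!/bin/sh
# Hypothetical HDFS location of the large JAR; replace it with your own.
HDFS_JAR="hdfs:///tmp/jars/onaliyun_mr_wordcount-1.0-SNAPSHOT.jar"
# Local copy in the current working directory, named after the HDFS file.
LOCAL_JAR="./$(basename "$HDFS_JAR")"
MAIN_CLASS="cn.apache.hadoop.onaliyun.examples.EmrWordCount"
INPUT="oss://onaliyun-bucket-2/emr/datas/wordcount02/inputs"
OUTPUT="oss://onaliyun-bucket-2/emr/datas/wordcount02/outputs"

if command -v hdfs >/dev/null 2>&1 && command -v hadoop >/dev/null 2>&1; then
  # Download the JAR, then run its main class with the input and output paths.
  hdfs dfs -get "$HDFS_JAR" "$LOCAL_JAR" &&
    hadoop jar "$LOCAL_JAR" "$MAIN_CLASS" "$INPUT" "$OUTPUT"
else
  # Outside an EMR-enabled environment, only report what would be done.
  echo "hdfs or hadoop not found; would fetch $HDFS_JAR and run $MAIN_CLASS"
fi
```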
Option 2: Reference OSS resources
Configure scheduling parameters
Define variables in your code using the ${variable_name} format. You can then assign values to these variables in the Scheduling > Scheduling Parameter section in the right-side pane to pass dynamic parameters at runtime. For more information, see Supported formats for scheduling parameters. The following code provides an example:
DD=`date`;
echo "hello world, $DD"
## You can use this with scheduling parameters.
echo ${var};
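Because the value of a scheduling parameter is supplied at run time, it can help to validate it before using it. The following sketch assumes that var is assigned a date in yyyymmdd format in the Scheduling Parameter section. That assignment and the helper function is_yyyymmdd are illustrative and not part of DataWorks. When the script runs without an assigned value, var is empty and the script reports that instead of continuing.

```shell
#!/bin/sh
# is_yyyymmdd succeeds only if its argument is exactly eight digits.
is_yyyymmdd() {
  case "$1" in
    [0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]) return 0 ;;
    *) return 1 ;;
  esac
}

# var is the scheduling parameter; it is empty if no value was assigned.
BIZDATE="${var}"

if is_yyyymmdd "$BIZDATE"; then
  echo "processing data for $BIZDATE"
else
  echo "var is not a yyyymmdd value: '$BIZDATE'"
fi
```

The check only verifies the shape of the value (eight digits), not that it is a valid calendar date.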
If you use a DataLake cluster, the following command-line tools are also supported.
-
Shell commands: Shell commands in the /usr/bin and /bin directories, such as ls and echo.
-
Yarn components: hadoop, hdfs, and yarn.
-
Spark component: spark-submit.
-
Sqoop components: sqoop-export, sqoop-import, and sqoop-import-all-tables.
To use Sqoop components with ApsaraDB RDS, you must add the IP address of the resource group to the ApsaraDB RDS whitelist.
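Before relying on the tools listed above, a task can check which of them are actually on the PATH of the environment where it runs. This is a minimal sketch that uses the standard command -v lookup. The helper name tool_available is illustrative.

```shell
#!/bin/sh
# tool_available succeeds if the named command can be found by the shell.
tool_available() {
  command -v "$1" >/dev/null 2>&1
}

# Report the status of each command-line tool named in the list above.
for tool in hadoop hdfs yarn spark-submit sqoop-export sqoop-import sqoop-import-all-tables; do
  if tool_available "$tool"; then
    echo "found: $tool"
  else
    echo "missing: $tool"
  fi
done
```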
Run an EMR Shell task
-
In the toolbar, click the icon. In the Parameter dialog box, select the created scheduling resource group and click Running.
Note:
-
To access compute resources over the internet or a VPC, you must use a scheduling resource group that passed the connectivity test for the compute resource. For more information, see Network connectivity solutions.
-
If you need to change the resource group for subsequent task runs, you can click the Run with Parameters icon and select a different scheduling resource group.
-
-
Click the icon to save the Shell script.
-
(Optional) Perform smoke testing.
If you want to perform smoke testing in the development environment, you can run it before or after you submit the node. For more information, see Perform smoke testing.
Step 3: Configure scheduling properties
If you want the system to periodically run a task on the node, you can click Properties in the right-side navigation pane on the configuration tab of the node to configure task scheduling properties based on your business requirements. For more information, see Overview.
-
You must configure the Rerun attribute and Parent Nodes properties for the node before you can submit it.
-
If you need to customize the component environment, you can create a custom image based on the official image dataworks_emr_base_task_pod and use it in DataStudio. For example, you can replace Spark JAR packages or include specific libraries, files, or JAR packages when you create a custom image.
Step 4: Deploy the task
After a task on a node is configured, you must commit and deploy the task. After you commit and deploy the task, the system runs the task on a regular basis based on scheduling configurations.
-
Click the icon in the top toolbar to save the task.
-
Click the icon in the top toolbar to commit the task. In the Submit dialog box, configure the Change description parameter. Then, determine whether to review task code after you commit the task based on your business requirements.
Note:
-
You must configure the Rerun and Parent Nodes parameters on the Properties tab before you commit the task.
-
You can use the code review feature to ensure the code quality of tasks and prevent task execution errors caused by invalid task code. If you enable the code review feature, the task code that is committed can be deployed only after the task code passes the code review. For more information, see Code review.
-
If you use a workspace in standard mode, you must deploy the task in the production environment after you commit the task. To deploy a task on a node, click Deploy in the upper-right corner of the configuration tab of the node. For more information, see Deploy nodes.
More operations
After you commit and deploy the task, the task is periodically run based on the scheduling configurations. You can click Operation Center in the upper-right corner of the configuration tab of the corresponding node to go to Operation Center and view the scheduling status of the task. For more information, see Manage scheduled tasks.
Related topics
To learn how to use Python 2 or Python 3 commands to run a Python script in an EMR Shell node, see Run a Python script in a Shell-type node.
To learn how to use the ossutil tool in an EMR Shell node, see Use ossutil to access OSS in a Shell-type node.