Create an EMR Spark Streaming node

EMR Spark Streaming nodes process high-throughput, real-time data streams and provide fault tolerance to help you recover quickly from data stream errors. This topic describes how to create an EMR Spark Streaming node and develop data tasks.

Prerequisites

An Alibaba Cloud EMR cluster is created and registered to DataWorks. For more information, see DataStudio (legacy): Register an EMR compute resource.
(Required if you use a RAM user to develop tasks) The RAM user is added to the DataWorks workspace as a member and is assigned the Develop or Workspace Administrator role. The Workspace Administrator role has more permissions than necessary. Exercise caution when you assign the Workspace Administrator role. For more information about how to add a member, see Add members to a workspace.
A serverless resource group is purchased and configured. The configurations include association with a workspace and network configuration. For more information, see Create and use a serverless resource group.
A workflow is created in DataStudio.
In DataStudio, development for every type of compute engine is organized in workflows, so you must create a workflow before you create a node. For more information, see Create a workflow.

Limits

This type of task runs only on a serverless resource group (recommended) or an exclusive resource group for scheduling.
You cannot create EMR Spark Streaming nodes for task development in EMR on ACK Spark clusters.

Step 1: Create an EMR Spark Streaming node

Go to the DataStudio page.
Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose Data Development and O&M > Data Development. On the page that appears, select the desired workspace from the drop-down list and click Go to Data Development.
Create an EMR Spark Streaming node.
1. Right-click the target workflow and choose Create Node > EMR > EMR Spark Streaming.
  Note
  Alternatively, you can hover over Create and choose Create Node > EMR > EMR Spark Streaming.
2. In the Create Node dialog box, enter a Name and select an engine instance, Node Type, and Path. Click OK. This opens the configuration page for the EMR Spark Streaming node.
  Note
  A node name can contain uppercase letters, lowercase letters, Chinese characters, digits, underscores (_), and periods (.).

Step 2: Develop an EMR Spark Streaming task

On the EMR Spark Streaming node configuration page, double-click the node you created to open the task development page.

Create and reference an EMR JAR resource

If you use a DataLake cluster, follow these steps to reference an EMR JAR resource.

Note

If an EMR Spark Streaming node depends on a large resource that cannot be uploaded in DataWorks, you can store the resource in HDFS and reference it in your code. For example:

spark-submit --master yarn --deploy-mode cluster --name SparkPi --driver-memory 4G --driver-cores 1 --num-executors 5 --executor-memory 4G --executor-cores 1 --class org.apache.spark.examples.JavaSparkPi hdfs:///tmp/jars/spark-examples_2.11-2.4.8.jar 100

Create an EMR JAR resource. For more information, see Create and use EMR resources. The first time you use this feature, you must first click Authorize.
Reference the EMR JAR resource.
1. Open the EMR Spark Streaming node and navigate to the code editor.
2. Under the EMR > Resource node, find the desired resource, right-click it, and select Insert Resource Path.
3. After you select the resource, a statement in the ##@resource_reference{""} format appears in the node editor. This statement references the resource. Then, enter your spark-submit command. The resource package, bucket name, and path in the command are for demonstration purposes only. Replace them with your actual values.
```
##@resource_reference{"examples-1.2.0-shaded.jar"}
--master yarn-cluster --executor-cores 2 --executor-memory 2g --driver-memory 1g --num-executors 2 --class com.aliyun.emr.example.spark.streaming.JavaLoghubWordCount examples-1.2.0-shaded.jar <logService-project> <logService-store> <group> <endpoint> <access-key-id> <access-key-secret>
```
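The two-line layout above (a resource reference statement followed by the submit arguments) can be produced for any uploaded JAR. The following is a minimal local sketch, not a DataWorks feature: `render_node_code` is a hypothetical helper name, only a subset of the submit options is included, and the printed text is meant to be pasted into the node editor. The output itself contains no comments, because the node editor does not accept them.

```shell
#!/bin/sh
# Hypothetical helper: prints node code that references an uploaded JAR.
# $1 = JAR resource name, $2 = main class, remaining arguments = application arguments.
render_node_code() {
  jar=$1
  main_class=$2
  shift 2
  # First line: the resource reference statement in the documented format.
  printf '##@resource_reference{"%s"}\n' "$jar"
  # Second line: submit options, then the JAR and its arguments.
  printf '%s %s %s %s' '--master yarn-cluster' '--class' "$main_class" "$jar"
  for arg in "$@"; do
    printf ' %s' "$arg"
  done
  printf '\n'
}

# Example call with the demonstration values used in this topic.
render_node_code examples-1.2.0-shaded.jar \
  com.aliyun.emr.example.spark.streaming.JavaLoghubWordCount \
  '<logService-project>' '<logService-store>' '<group>' '<endpoint>'
```

Keeping the JAR name in a single variable ensures that the reference statement and the command always name the same resource.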

Develop the spark-submit code

In the EMR Spark Streaming node editor, enter the spark-submit command for your job. For example:

spark-submit --master yarn-cluster --executor-cores 2 --executor-memory 2g --driver-memory 1g --num-executors 2 --class com.aliyun.emr.example.spark.streaming.JavaLoghubWordCount examples-1.2.0-shaded.jar <logService-project> <logService-store> <group> <endpoint> <access-key-id> <access-key-secret>

Note

In this example, the resource uploaded to DataWorks is examples-1.2.0-shaded.jar.
Replace access-key-id and access-key-secret with the AccessKey ID and AccessKey secret of your Alibaba Cloud account. To obtain them, log on to the DataWorks console, hover over your profile picture in the upper-right corner, and go to the AccessKey Management page.
Comments are not supported when you edit code for an EMR Spark Streaming node.
If multiple EMR compute resources are associated with your workspace in DataStudio, select the one that meets your business requirements. If only one resource is associated, no selection is required.
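Filling in the AccessKey placeholders by hand is easy to get wrong. The following is a minimal sketch, intended to be run locally rather than in the node editor, that substitutes the two placeholders from environment variables and stops if either one is missing. The variable names `ACCESS_KEY_ID` and `ACCESS_KEY_SECRET` and the helpers `replace_literal` and `fill_access_keys` are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical helper: replaces the first literal occurrence of $2 in $1 with $3.
replace_literal() {
  text=$1
  case $text in
    *"$2"*) printf '%s%s%s' "${text%%"$2"*}" "$3" "${text#*"$2"}" ;;
    *) printf '%s' "$text" ;;
  esac
}

# Hypothetical helper: fills <access-key-id> and <access-key-secret> in a command
# template from the ACCESS_KEY_ID and ACCESS_KEY_SECRET environment variables.
fill_access_keys() {
  if [ -z "${ACCESS_KEY_ID:-}" ] || [ -z "${ACCESS_KEY_SECRET:-}" ]; then
    echo 'ACCESS_KEY_ID and ACCESS_KEY_SECRET must be set' >&2
    return 1
  fi
  step=$(replace_literal "$1" '<access-key-id>' "$ACCESS_KEY_ID")
  replace_literal "$step" '<access-key-secret>' "$ACCESS_KEY_SECRET"
  printf '\n'
}

# Placeholder values for demonstration only; never store real keys in a script.
ACCESS_KEY_ID=example-id
ACCESS_KEY_SECRET=example-secret
fill_access_keys 'examples-1.2.0-shaded.jar <endpoint> <access-key-id> <access-key-secret>'
```

The substitution uses shell parameter expansion instead of `sed`, so characters that are special in regular expressions are inserted literally.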

(Optional) Configure Advanced Settings

You can configure specific properties in the Advanced Settings section of the node. For more information about the properties, see Spark Configuration. The following table describes the available advanced parameters.

DataLake: EMR on ECS

Parameter	Description
queue	The scheduling queue for the job. The default queue is default. For more information about EMR YARN, see Basic queue configurations.
priority	The priority of the job. The default value is 1.
Others	You can add custom SparkConf parameters in this section. When you submit the code, DataWorks automatically adds these parameters to the command. Example: `"spark.driver.memory" : "2g"`. Note: To enable Ranger for access control, add the `spark.hadoop.fs.oss.authorization.method=ranger` configuration in Set global Spark parameters. For more information about how to configure parameters, see Set global Spark parameters.
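For orientation, a custom SparkConf entry such as `"spark.driver.memory" : "2g"` corresponds to a `--conf key=value` option on a spark-submit command line. The sketch below prints that form for a list of entries. How DataWorks injects the parameters internally is not described in this topic, so treat the mapping as an assumption; `to_conf_flags` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: turns key=value pairs into spark-submit --conf options.
# Assumption: each custom SparkConf entry is equivalent to one --conf option.
to_conf_flags() {
  flags=''
  for pair in "$@"; do
    case $pair in
      ?*=?*) flags="$flags --conf $pair" ;;
      *) echo "skipping malformed entry: $pair" >&2 ;;
    esac
  done
  # Drop the leading space before printing.
  printf '%s\n' "${flags# }"
}

to_conf_flags spark.driver.memory=2g spark.executor.cores=2
```

Entries without both a key and a value are reported on standard error and left out of the result.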

Run the task

In the toolbar, click the run icon. In the Parameter dialog box, select the scheduling resource group that you created and click Running.
Note
- To access compute resources over a public network or in a VPC, you must use a scheduling resource group that can connect to the compute resources. For more information, see Network connectivity solutions.
- If you need to change the resource group for subsequent runs, click the Run with Parameters icon and select the resource group that you want to use.
Click the Save icon to save the code.
(Optional) Perform smoke testing.
If you want to perform smoke testing in the development environment, run the test before or after you commit the node. For more information, see Perform smoke testing.

Step 3: Configure scheduling properties

To run the task on the node periodically, click Properties in the right-side navigation pane of the node's configuration tab and configure the scheduling properties based on your business requirements. For more information, see Overview.

Note

You must configure the Rerun and Parent Nodes parameters on the Properties tab before you commit the task.

Step 4: Deploy the task

After you configure a task on a node, you must commit and deploy the task. The system then runs the task periodically based on the scheduling configurations.

Click the Save icon in the top toolbar to save the task.
Click the Submit icon in the top toolbar to commit the task.

In the Submit dialog box, configure the Change description parameter. Then, decide based on your business requirements whether the task code must be reviewed after you commit the task.
Note
- You must configure the Rerun and Parent Nodes parameters on the Properties tab before you commit the task.
- You can use the code review feature to ensure the code quality of tasks and prevent task execution errors caused by invalid task code. If you enable the code review feature, the task code that is committed can be deployed only after the task code passes the code review. For more information, see Code review.

If you use a workspace in standard mode, you must deploy the task in the production environment after you commit the task. To deploy a task on a node, click Deploy in the upper-right corner of the configuration tab of the node. For more information, see Deploying tasks.

More operations

After you commit and deploy the task, the task is periodically run based on the scheduling configurations. You can click Operation Center in the upper-right corner of the configuration tab of the corresponding node to go to Operation Center and view the scheduling status of the task. For more information, see Manage auto triggered tasks.