E-MapReduce (EMR) Spark Streaming nodes can be used to process streaming data with
high throughput. This type of node supports fault tolerance, which allows you to
restore a data stream on which an error occurs. This topic describes how to create
an EMR Spark Streaming node and use the node to develop data.
Limits
- An EMR Spark Streaming node can run only on an exclusive resource group for scheduling.
- If the exclusive resource group for scheduling and the EMR cluster that you use were created
before June 10, 2021, you must submit a ticket to upgrade the resource group and the EMR cluster.
Create an EMR Spark Streaming node and use the node to develop data
- Go to the DataStudio page.
- Log on to the DataWorks console.
- In the left-side navigation pane, click Workspaces.
- In the top navigation bar, select the region where the required workspace resides,
find the workspace, and then click Data Analytics.
- Create a workflow.
If you have a workflow, skip this step.
- Move the pointer over the create icon and select Workflow.
- In the Create Workflow dialog box, set the Workflow Name parameter.
- Click Create.
- Create an EMR Spark Streaming node.
- On the DataStudio page, move the pointer over the create icon and choose the option
for creating an EMR Spark Streaming node. Alternatively, find the desired workflow,
right-click the workflow name, and then choose the same option.
- Click Commit. The configuration tab of the EMR Spark Streaming node appears.
- Use the EMR Spark Streaming node to develop data.
- Select the EMR compute engine instance.
On the configuration tab of the EMR Spark Streaming node, select the EMR compute engine instance.
- Write code for the EMR Spark Streaming node.
On the configuration tab of the EMR Spark Streaming node, write code for the node.
Sample code:
--master yarn-cluster --executor-cores 2 --executor-memory 2g --driver-memory 1g --num-executors 2 --class com.aliyun.emr.example.spark.streaming.JavaLoghubWordCount /tmp/examples-1.2.0-shaded.jar <logService-project> <logService-store> <group> <endpoint> <access-key-id> <access-key-secret>
The system automatically adds the
spark-submit
command after the node is created. The following command is what is finally submitted
to the compute engine instance:
spark-submit --master yarn-cluster --executor-cores 2 --executor-memory 2g --driver-memory 1g --num-executors 2 --class com.aliyun.emr.example.spark.streaming.JavaLoghubWordCount /tmp/examples-1.2.0-shaded.jar <logService-project> <logService-store> <group> <endpoint> <access-key-id> <access-key-secret>
/tmp/examples-1.2.0-shaded.jar
is the path of the JAR package that is generated from the node code. For more information
about the code for the node, see Consume data in real time.
Note The JAR package can be stored in the following locations:
- The master node of the EMR cluster.
- Object Storage Service (OSS). We recommend that you store the JAR package in OSS.
For more information about how to store the JAR package in OSS, see Operations in the OSS console.
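As a rough illustration of the upload step, the following shell sketch prints the ossutil command that would copy the JAR package to OSS. The bucket name and path are placeholders, not real values, and ossutil must be installed and configured with your credentials before you run the command for real:

```shell
#!/bin/sh
# Placeholder values -- substitute your own local path and OSS bucket.
JAR=/tmp/examples-1.2.0-shaded.jar
DEST=oss://my-bucket/emr-jars/

# Print the upload command instead of executing it, because ossutil
# requires credentials that are not configured in this sketch.
UPLOAD_CMD="ossutil cp $JAR $DEST"
echo "$UPLOAD_CMD"
```

Remove the dry-run echo and run the ossutil command directly once ossutil is configured.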
- You must replace
access-key-id
and
access-key-secret
with the AccessKey ID and AccessKey secret of your Alibaba Cloud account. To obtain
the AccessKey ID and AccessKey secret, log on to the DataWorks console, move the pointer over the profile picture in the upper-right corner, and then select
AccessKey Management.
For more information about Spark Streaming parameters, see
Spark documentation.
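To make the relationship between the node parameters and the final command concrete, the following shell sketch assembles the spark-submit command the way described above. All values are placeholders for illustration (the Log Service project, logstore, group, endpoint, and AccessKey pair are not real); reading the AccessKey pair from environment variables is a suggested practice, not a requirement of the node:

```shell
#!/bin/sh
# Placeholder credentials -- in practice, export these from a secure
# source rather than hardcoding them in the node code.
ACCESS_KEY_ID="your-access-key-id"
ACCESS_KEY_SECRET="your-access-key-secret"

# The parameters written in the node configuration tab.
NODE_PARAMS="--master yarn-cluster --executor-cores 2 --executor-memory 2g --driver-memory 1g --num-executors 2"
NODE_PARAMS="$NODE_PARAMS --class com.aliyun.emr.example.spark.streaming.JavaLoghubWordCount /tmp/examples-1.2.0-shaded.jar"
# Placeholder Log Service project, logstore, consumer group, and endpoint.
NODE_PARAMS="$NODE_PARAMS my-project my-logstore my-group cn-hangzhou.log.aliyuncs.com $ACCESS_KEY_ID $ACCESS_KEY_SECRET"

# The system prepends spark-submit to the node parameters to build
# the command that is finally submitted to the compute engine instance.
FINAL_CMD="spark-submit $NODE_PARAMS"
echo "$FINAL_CMD"
```

The only transformation the system applies is the spark-submit prefix; everything after it is taken verbatim from the node configuration.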
- Configure a resource group for scheduling.
- Click the
icon in the top toolbar. In the Parameters dialog box, select the desired resource
group for scheduling.
- Click OK.
- Save and run the EMR Spark Streaming node.
In the top toolbar, click the save icon to save the EMR Spark Streaming node, and then
click the run icon to run the node.
- Configure properties for the EMR Spark Streaming node.
If you want the system to periodically run the EMR Spark Streaming node, you can click
Properties in the right-side navigation pane to configure properties for the node based on your
business requirements.
- Commit and deploy the EMR Spark Streaming node.
- Click the
icon in the top toolbar to save the node.
- Click the
icon in the top toolbar to commit the node.
- In the Commit Node dialog box, enter your comments in the Change description field.
- Click OK.
If you use a workspace in standard mode, you must deploy the node in the production
environment after you commit the node. Click
Deploy in the upper-right corner. For more information, see
Deploy nodes.
- View the EMR Spark Streaming node.
- Click Operation Center in the upper-right corner of the DataStudio page to go to Operation Center.
- View the EMR Spark Streaming node that is running. For more information, see Manage real-time computing nodes.