E-MapReduce (EMR) Spark Streaming nodes process high-throughput, real-time streaming data. They include a fault tolerance mechanism to enable the quick recovery of failed data streams. This topic describes how to create and use an EMR Spark Streaming node for data development.
Prerequisites
You have created an Alibaba Cloud EMR cluster and bound it to DataWorks. For more information, see Data Studio: Associate an EMR computing resource.
(Optional) If you are a Resource Access Management (RAM) user, ensure that you have been added to the workspace for task development and have been assigned the Developer or Workspace Administrator role. The Workspace Administrator role has extensive permissions. Grant this role with caution. For more information about adding members, see Add members to a workspace.
If you use an Alibaba Cloud account, you can skip this step.
Limitations
This type of task can run only on serverless resource groups (recommended) or exclusive resource groups for scheduling.
You cannot create and use EMR Spark Streaming nodes for data development on EMR on ACK Spark clusters.
Procedure
On the EMR Spark Streaming node edit page, perform the following steps:
Create and reference an EMR JAR resource
If you use a DataLake cluster, follow these steps to reference an EMR JAR resource.
Note: If a resource required by the EMR Spark Streaming node is too large, you cannot upload it from the DataWorks page. Instead, store the resource in Hadoop Distributed File System (HDFS) and reference it in your code. The following code provides an example.
spark-submit --master yarn --deploy-mode cluster --name SparkPi --driver-memory 4G --driver-cores 1 --num-executors 5 --executor-memory 4G --executor-cores 1 --class org.apache.spark.examples.JavaSparkPi hdfs:///tmp/jars/spark-examples_2.11-2.4.8.jar 100
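If the JAR package is not yet in HDFS, you can stage it there with standard HDFS shell commands before referencing it, as in the following sketch. The local file name and the hdfs:///tmp/jars/ target directory are taken from the preceding example and are illustrative; adjust them to your environment.
# Create the target directory in HDFS (the path is illustrative).
hdfs dfs -mkdir -p /tmp/jars
# Upload the local JAR package to HDFS.
hdfs dfs -put spark-examples_2.11-2.4.8.jar /tmp/jars/
# Verify that the file is available at the path used in the spark-submit command.
hdfs dfs -ls /tmp/jars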
Create an EMR JAR resource. For more information, see Resource Management.
The generated JAR package is stored in the emr/jars directory. Click the Click To Upload button to upload the JAR package. Set the Storage Path, Data Source, and Resource Group.
Click Save.
Reference the EMR JAR resource.
Open the EMR Spark Streaming node that you created. This opens the code editor.
In the navigation pane on the left, under Resource Management, find the resource that you want to reference. Right-click the resource and select Reference Resource.
After you select the reference, a success message appears in the code editor of the EMR Spark Streaming node. Then, run the following command. Replace the example resource package, bucket name, and path with your actual information.
##@resource_reference{"examples-1.2.0-shaded.jar"}
--master yarn-cluster --executor-cores 2 --executor-memory 2g --driver-memory 1g --num-executors 2 --class com.aliyun.emr.example.spark.streaming.JavaLoghubWordCount examples-1.2.0-shaded.jar <logService-project> <logService-store> <group> <endpoint> <access-key-id> <access-key-secret>
Develop SQL code
In the code editor for the EMR Spark Streaming node, enter the code for the job. The following code provides an example.
spark-submit --master yarn-cluster --executor-cores 2 --executor-memory 2g --driver-memory 1g --num-executors 2 --class com.aliyun.emr.example.spark.streaming.JavaLoghubWordCount examples-1.2.0-shaded.jar <logService-project> <logService-store> <group> <endpoint> <access-key-id> <access-key-secret>
Note:
In this example, the resource uploaded to DataWorks is examples-1.2.0-shaded.jar.
Replace access-key-id and access-key-secret with the AccessKey ID and AccessKey secret of your Alibaba Cloud account. To obtain them, log on to the DataWorks console, move the pointer over your profile picture in the upper-right corner, and select AccessKey Management.
Comments are not supported in the code editor for EMR Spark Streaming nodes.
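For reference, the following shows what the command might look like after you replace the placeholders. All values here are hypothetical examples; use the Log Service project, Logstore, consumer group, endpoint, and AccessKey pair of your own environment, and do not place real AccessKey values in shared code.
spark-submit --master yarn-cluster --executor-cores 2 --executor-memory 2g --driver-memory 1g --num-executors 2 --class com.aliyun.emr.example.spark.streaming.JavaLoghubWordCount examples-1.2.0-shaded.jar my-sls-project my-logstore my-consumer-group cn-hangzhou.log.aliyuncs.com <your-access-key-id> <your-access-key-secret>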
(Optional) Configure advanced parameters
In the Scheduling Configuration section on the right side of the node page, you can configure the advanced parameters described in the following table.
Note: The available advanced parameters vary based on the EMR cluster type, as shown in the following table. For more information about open-source Spark properties, see Spark Configuration. You can configure these parameters in the Scheduling Configuration section on the right side of the page.
DataLake cluster: EMR on ECS
Advanced parameter
Configuration description
FLOW_SKIP_SQL_ANALYZE
The method to execute SQL statements. Valid values:
true: Execute multiple SQL statements at a time.
false (default): Execute one SQL statement at a time.
Note: This parameter is supported only for test runs in the data development environment.
queue
The scheduling queue to which jobs are submitted. The default value is default. For more information about EMR YARN, see Basic queue configuration.
priority
The priority. The default value is 1.
Other
You can add custom SparkConf parameters in the advanced configuration section. DataWorks automatically adds these parameters to the command when you submit the code. Example:
"spark.driver.memory" : "2g".NoteTo enable Ranger access control, add the
spark.hadoop.fs.oss.authorization.method=rangerconfiguration in Set global Spark parameters. This ensures that Ranger access control is effective.For more information about parameter configurations, see Set global Spark parameters.
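To illustrate how these advanced parameters relate to the underlying submission, the following sketch shows the equivalent open-source spark-submit options: the queue parameter corresponds to the --queue option of a YARN submission, and a custom SparkConf entry such as spark.driver.memory can be expressed with --conf. This is only an illustration of the equivalent command line; DataWorks assembles the actual command for you when the node runs.
# Illustrative equivalent of the advanced parameters above (not the command DataWorks generates verbatim).
spark-submit --master yarn-cluster --queue default --conf spark.driver.memory=2g --executor-cores 2 --executor-memory 2g --num-executors 2 --class com.aliyun.emr.example.spark.streaming.JavaLoghubWordCount examples-1.2.0-shaded.jar <logService-project> <logService-store> <group> <endpoint> <access-key-id> <access-key-secret>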
Execute the SQL task
In the Computing Resource section of Run Configuration, you can configure the Computing Resource and Resource Group.
Note: You can also set Schedule CU based on the task's resource requirements. The default value is 0.25.
To access a data source over the public network or a VPC, you must use a scheduling resource group that has passed the connectivity test with the data source. For more information, see Network connectivity solutions.
In the toolbar's parameter dialog box, select the data source that you created and click Run to execute the SQL task.
To run the node task on a schedule, you can configure its scheduling properties. For more information, see Node scheduling configuration.
After you configure the node, you must publish it. For more information, see Publish a node or workflow.
After the task is published, you can view the status of the auto triggered task in the Operation Center. For more information, see Get started with Operation Center.