DataWorks: EMR Spark Streaming node

Last Updated: Mar 14, 2025

E-MapReduce (EMR) Spark Streaming nodes process streaming data with high throughput. This type of node is fault-tolerant and can help you recover data streams on which errors occur. This topic describes how to use an EMR Spark Streaming node to develop a streaming task.

Prerequisites

  • An EMR cluster is created and the cluster is registered to DataWorks. For more information, see Register an EMR cluster to DataWorks.

  • (Required if you use a RAM user to develop tasks) The RAM user is added to your DataWorks workspace as a member and is assigned the Development or Workspace Manager role. The Workspace Manager role has more permissions than necessary. Exercise caution when you assign the Workspace Manager role. For more information about how to add a member, see Add workspace members and assign roles to them.

    Note

    If you use an Alibaba Cloud account, you can skip this operation.

  • A workspace directory is created. For more information, see Workspace directories.

  • An EMR Spark Streaming node is created.

Limits

  • This type of node can be run only on a serverless resource group or an exclusive resource group for scheduling. We recommend that you use a serverless resource group.

  • EMR Spark Streaming nodes that you create in DataWorks cannot run tasks on Spark clusters created on the EMR on ACK page.

Procedure

  1. On the configuration tab of the EMR Spark Streaming node, perform the following operations:

    Create and reference an EMR JAR resource

    If you use an EMR DataLake cluster, you can perform the following steps to reference an EMR JAR resource:

    1. Prepare an EMR JAR package.

    2. Create an EMR JAR resource.

      1. Create an EMR JAR resource in the RESOURCE MANAGEMENT: ALL pane of the Data Studio page. For more information, see Resource management.

      2. On the configuration tab of the resource, click Upload to upload the JAR package.

      3. Configure the Storage Path, Data Source, and Resource Group parameters.

      4. Click Save.

    3. Reference the EMR JAR resource.

      1. Open the EMR Spark Streaming node. The configuration tab of the node appears.

      2. Find the resource that you want to reference in the RESOURCE MANAGEMENT: ALL pane in the left-side navigation pane of the Data Studio page, right-click the resource name, and then select Reference Resources.

      3. If a statement in the ##@resource_reference{""} format appears on the configuration tab of the EMR Spark Streaming node, the resource is referenced. Then, add the following code below the statement and replace the sample values with your actual information, such as the resource package name, bucket name, and path.

        ##@resource_reference{"examples-1.2.0-shaded.jar"}
        --master yarn-cluster --executor-cores 2 --executor-memory 2g --driver-memory 1g --num-executors 2 --class com.aliyun.emr.example.spark.streaming.JavaLoghubWordCount examples-1.2.0-shaded.jar <logService-project> <logService-store> <group> <endpoint> <access-key-id> <access-key-secret>
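In the sample above, the JAR name appears twice: once in the ##@resource_reference line and once as the application JAR in the argument line. The following shell sketch prints both lines from a single JAR name so that they stay in sync. The function name render_node_code and its argument order are hypothetical and are not part of DataWorks. The sketch assumes that the two names should match, as they do in the sample.

```shell
# Hypothetical helper (not a DataWorks tool): prints node code for a referenced JAR.
# $1 = resource (JAR) name, $2 = main class, remaining arguments = program arguments.
render_node_code() {
  local jar="$1" main_class="$2"
  shift 2
  # First line: the resource reference statement, in the format shown in the sample.
  printf '##@resource_reference{"%s"}\n' "$jar"
  # Second line: the submit options from the sample, followed by the same JAR name.
  printf '%s\n' "--master yarn-cluster --executor-cores 2 --executor-memory 2g --driver-memory 1g --num-executors 2 --class $main_class $jar $*"
}
```

For example, calling render_node_code with examples-1.2.0-shaded.jar, the class name from the sample, and your six program arguments prints text that you can paste onto the configuration tab.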

    Develop the node code

    On the configuration tab of the EMR Spark Streaming node, write code for the node. Sample code:

    spark-submit --master yarn-cluster --executor-cores 2 --executor-memory 2g --driver-memory 1g --num-executors 2 --class com.aliyun.emr.example.spark.streaming.JavaLoghubWordCount examples-1.2.0-shaded.jar <logService-project> <logService-store> <group> <endpoint> <access-key-id> <access-key-secret>

    Note

    • In this example, the examples-1.2.0-shaded.jar package is uploaded in the DataWorks console.

    • You must replace access-key-id and access-key-secret with the AccessKey ID and AccessKey secret of your Alibaba Cloud account. To obtain the AccessKey ID and AccessKey secret, you can log on to the DataWorks console, move the pointer over the profile picture in the upper-right corner, and then select AccessKey.

    • You cannot add comments when you write code for the EMR Spark Streaming node.
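Because the node code cannot contain comments, it can help to assemble the command outside the node and paste only the resulting single line. The following shell sketch builds the sample command from six values and refuses to print it if a value still looks like an unreplaced <placeholder>. The function name build_submit_command is hypothetical and is not part of DataWorks or EMR.

```shell
# Hypothetical helper (not a DataWorks tool): builds the one-line sample command.
# Arguments, in order: Simple Log Service project, Logstore, consumer group,
# endpoint, AccessKey ID, AccessKey secret.
build_submit_command() {
  local arg
  if [ "$#" -ne 6 ]; then
    echo "expected 6 arguments, got $#" >&2
    return 1
  fi
  for arg in "$@"; do
    # Reject values that still look like <placeholder> text from the sample.
    case "$arg" in
      "<"*">") echo "unreplaced placeholder: $arg" >&2; return 1 ;;
    esac
  done
  # "$*" joins the six values with single spaces (default IFS).
  printf '%s\n' "spark-submit --master yarn-cluster --executor-cores 2 --executor-memory 2g --driver-memory 1g --num-executors 2 --class com.aliyun.emr.example.spark.streaming.JavaLoghubWordCount examples-1.2.0-shaded.jar $*"
}
```

For example, build_submit_command my-project my-logstore my-group my-endpoint "$ACCESS_KEY_ID" "$ACCESS_KEY_SECRET" prints a comment-free command line. All of these argument values and variable names are illustrative.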

    (Optional) Configure advanced parameters

    You can configure specific parameters in the EMR Node Parameters section of the Real-time Configurations tab. For more information about how to configure the parameters, see Spark Configuration. The following list describes the advanced parameters that can be configured for a DataLake cluster created on the EMR on ECS page. Each entry gives the advanced parameter, followed by its description.

    • FLOW_SKIP_SQL_ANALYZE: The manner in which SQL statements are executed. Valid values:

      • true: Multiple SQL statements are executed at a time.

      • false (default): Only one SQL statement is executed at a time.

      Note

      This parameter is available only for testing in the development environment of a DataWorks workspace.

    • queue: The scheduling queue to which jobs are committed. Default value: default. For information about EMR YARN, see YARN schedulers.

    • priority: The priority. Default value: 1.

    • Others: You can also add a SparkConf parameter in advanced settings for the EMR Spark Streaming node. When you commit the code for the EMR Spark Streaming node in DataWorks, DataWorks automatically adds the custom parameter to the command. Example: "spark.driver.memory" : "2g".
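To illustrate what adding a custom SparkConf parameter to the command can look like, the following shell sketch converts a key-value pair into the standard spark-submit --conf key=value option. How DataWorks rewrites the command internally is not described here, so treat the result as an assumption. The function name to_conf_flag is hypothetical.

```shell
# Hypothetical helper: formats one SparkConf pair as a spark-submit option.
# $1 = SparkConf key, $2 = value.
to_conf_flag() {
  printf '%s %s=%s\n' "--conf" "$1" "$2"
}

# The example pair from the advanced settings above.
to_conf_flag spark.driver.memory 2g   # prints: --conf spark.driver.memory=2g
```

Under this assumption, setting "spark.driver.memory" : "2g" in the advanced settings has the same effect as passing --conf spark.driver.memory=2g to spark-submit.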

    Run the node code

    1. On the Debugging Configurations tab, configure the Computing Resource parameter in the Computing Resource section and configure the Resource Group parameter in the DataWorks Configurations section.

      Note

      • You can also configure the CUs For Computing parameter based on the resources required for task execution. The default value of this parameter is 0.25.

      • If you want to access a data source over the Internet or a virtual private cloud (VPC), you must use the resource group for scheduling that is connected to the data source. For more information, see Network connectivity solutions.

    2. In the top toolbar of the configuration tab of the node, click Run to run the node code.

  2. If you want to run a task on the node on a regular basis, configure the scheduling information based on your business requirements.

  3. After you configure the node, deploy the node. For more information, see Node/workflow release.

  4. After you deploy the node, view the status of the node in Operation Center. For more information, see Getting started with Operation Center.