
DataWorks:Create an EMR Spark Streaming node

Last Updated:Oct 31, 2023

E-MapReduce (EMR) Spark Streaming nodes can be used to process streaming data with high throughput. This type of node supports fault tolerance and can help you restore data streams on which errors occur. This topic describes how to create and use an EMR Spark Streaming node to develop data.

Prerequisites

All preparations required for creating a node in EMR and DataWorks are complete. The required preparations vary based on the type of your EMR cluster.

Limits

  • This type of node can be run only on an exclusive resource group for scheduling.

  • If the exclusive resource group for scheduling and the EMR cluster that you use are created before June 10, 2021, you must upgrade the resource group and the EMR cluster. To upgrade the resource group and EMR cluster, you must submit a ticket.

  • Spark clusters that are created on the EMR on ACK page are not supported.

Procedure

  1. Go to the DataStudio page.

    1. Log on to the DataWorks console.

    2. In the left-side navigation pane, click Workspaces.

    3. In the top navigation bar, select the region where your workspace resides. Find your workspace and click DataStudio in the Actions column.

  2. Create a workflow.

    If you have an existing workflow, skip this step.

    1. Move the pointer over the Create icon and select Create Workflow.

    2. In the Create Workflow dialog box, configure the Workflow Name parameter.

    3. Click Create.

  3. Create an EMR Spark Streaming node.

    1. Move the pointer over the Create icon and choose Create Node > EMR > EMR Spark Streaming.

      Alternatively, you can find the desired workflow, right-click the name of the workflow, and then choose Create Node > EMR > EMR Spark Streaming.

    2. In the Create Node dialog box, configure the Name, Engine Instance, Node Type, and Path parameters.

      Note

      The node name must be 1 to 128 characters in length and can contain only letters, digits, underscores (_), and periods (.).
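      The naming rule above can be checked locally before you create the node. A minimal sketch, assuming a POSIX shell; the helper function name is hypothetical and not part of DataWorks:

      ```shell
      # Hypothetical helper: returns success if the name satisfies the documented
      # rule (1 to 128 characters; letters, digits, underscores, and periods only).
      is_valid_node_name() {
        printf '%s' "$1" | grep -Eq '^[A-Za-z0-9_.]{1,128}$'
      }

      is_valid_node_name "emr_spark_streaming.demo" && echo "valid"
      is_valid_node_name "bad name!" || echo "invalid"
      ```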

    3. Click Confirm. The configuration tab of the EMR Spark Streaming node appears.

  4. Create and reference an EMR JAR resource.

    If you use an EMR DataLake cluster, you can perform the following steps to reference an EMR JAR resource:

    Note

    If an EMR Spark Streaming node depends on large amounts of resources, the resources cannot be uploaded by using the DataWorks console. In this case, you can store the resources in Hadoop Distributed File System (HDFS) and then reference the resources in the code of the EMR Spark Streaming node. Sample code:

    spark-submit --master yarn \
    --deploy-mode cluster \
    --name SparkPi \
    --driver-memory 4G \
    --driver-cores 1 \
    --num-executors 5 \
    --executor-memory 4G \
    --executor-cores 1 \
    --class org.apache.spark.examples.JavaSparkPi \
    hdfs:///tmp/jars/spark-examples_2.11-2.4.8.jar 100
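    Before the node code can reference a JAR in HDFS, the JAR must be uploaded there. A minimal sketch, assuming you run it on a host with the HDFS client configured for your cluster; the local path and HDFS directory are illustrative:

    ```shell
    # Illustrative paths; replace with your build output and HDFS directory.
    LOCAL_JAR=target/spark-examples_2.11-2.4.8.jar
    HDFS_DIR=/tmp/jars

    # Create the directory in HDFS and upload the JAR.
    hdfs dfs -mkdir -p "$HDFS_DIR"
    hdfs dfs -put -f "$LOCAL_JAR" "$HDFS_DIR/"

    # Verify the upload.
    hdfs dfs -ls "$HDFS_DIR"
    ```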
    1. Create an EMR JAR resource. For more information, see Create and use an EMR JAR resource. The first time you use an EMR JAR resource, click Authorize to authorize DataWorks to access the EMR JAR resource.

    2. Reference the EMR JAR resource.

      1. Open the EMR Spark Streaming node. The configuration tab of the node appears.

      2. In the EMR folder under Resource, find the resource that you want to reference, right-click the resource name, and then select Insert Resource Path.

      3. If a clause in the ##@resource_reference{""} format appears on the configuration tab of the EMR Spark Streaming node, the resource is referenced. Then, run the following code. Replace the placeholder information in the code, such as the resource package name, bucket name, and directory, with the actual values.

        ##@resource_reference{"examples-1.2.0-shaded.jar"}
        --master yarn-cluster --executor-cores 2 --executor-memory 2g --driver-memory 1g --num-executors 2 --class com.aliyun.emr.example.spark.streaming.JavaLoghubWordCount examples-1.2.0-shaded.jar <logService-project> <logService-store> <group> <endpoint> <access-key-id> <access-key-secret>
  5. Use the EMR Spark Streaming node to develop data.

    1. Select the EMR compute engine instance.

      On the configuration tab of the EMR Spark Streaming node, select the EMR compute engine instance.

    2. Write code for the node.

      On the configuration tab of the EMR Spark Streaming node, write code for the node. Sample code:

      spark-submit --master yarn-cluster --executor-cores 2 --executor-memory 2g --driver-memory 1g --num-executors 2 --class com.aliyun.emr.example.spark.streaming.JavaLoghubWordCount examples-1.2.0-shaded.jar <logService-project> <logService-store> <group> <endpoint> <access-key-id> <access-key-secret>
      Note
      • In this example, the examples-1.2.0-shaded.jar JAR package is uploaded in the DataWorks console.

      • You must replace access-key-id and access-key-secret with the AccessKey ID and AccessKey secret of your Alibaba Cloud account. To obtain the AccessKey ID and AccessKey secret, you can log on to the DataWorks console, move the pointer over the profile picture in the upper-right corner, and then select AccessKey Management.

      • You cannot add comments when you write code for the EMR Spark Streaming node.
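      As a reading aid, the placeholders in the sample command map to Simple Log Service connection details. The following sketch substitutes made-up values so the final shape of the command is visible; none of the values are real, and the AccessKey placeholders are deliberately left unfilled:

      ```shell
      # Made-up values for illustration only.
      LOG_PROJECT="my-sls-project"             # replaces <logService-project>
      LOG_STORE="my-logstore"                  # replaces <logService-store>
      GROUP="consumer-group-1"                 # replaces <group>
      ENDPOINT="cn-hangzhou.log.aliyuncs.com"  # replaces <endpoint>

      # AccessKey pair left as placeholders on purpose; never hard-code real
      # credentials in examples or shared code.
      CMD="spark-submit --master yarn-cluster --executor-cores 2 --executor-memory 2g --driver-memory 1g --num-executors 2 --class com.aliyun.emr.example.spark.streaming.JavaLoghubWordCount examples-1.2.0-shaded.jar $LOG_PROJECT $LOG_STORE $GROUP $ENDPOINT <ACCESS_KEY_ID> <ACCESS_KEY_SECRET>"
      echo "$CMD"
      ```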

    3. Configure a resource group for scheduling.

      • Click the Run with Parameters icon in the top toolbar. In the Parameters dialog box, select the desired resource group for scheduling.

      • Click Run.

    4. Save and run the node.

      In the top toolbar, click the Save icon to save the node, and then click the Run icon to run the node.

  6. Configure the parameters on the Advanced Settings tab.

    If you use an EMR DataLake cluster, you can configure the following advanced parameters:

    • "queue": the scheduling queue to which jobs are submitted. Default value: default. For information about EMR YARN, see YARN schedulers.

    • "priority": the priority. Default value: 1.

    Note
    • You can also add a SparkConf parameter on the Advanced Settings tab for the EMR Spark Streaming node. When you commit the code of the node in DataWorks, DataWorks adds the custom parameter to the command. For example, you can add a custom parameter whose key is spark.driver.memory and whose value is 2g.

    • For more information about how to configure the parameters of the node, see Spark Configuration.
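    Putting the parameters above together, the Advanced Settings entries for a DataLake cluster might look like the following. The key-value layout shown here is an assumption for illustration; only the queue, priority, and SparkConf-style keys are taken from the description above:

    ```json
    {
      "queue": "default",
      "priority": 1,
      "spark.driver.memory": "2g"
    }
    ```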

  7. Configure scheduling properties for the EMR Spark Streaming node.

    If you want the system to periodically run the EMR Spark Streaming node, you can click Properties in the right-side navigation pane on the configuration tab of the node to configure scheduling properties based on your business requirements.

  8. Commit and deploy the node.

    1. Click the Save icon in the top toolbar to save the node.

    2. Click the Submit icon in the top toolbar to commit the node.

    3. In the Commit Node dialog box, configure the Change description parameter.

    4. Click OK.

    If you use a workspace in standard mode, you must deploy the node in the production environment after you commit the node. On the left side of the top navigation bar, click Deploy. For more information, see Deploy nodes.

  9. View the EMR Spark Streaming node.

    1. Click Operation Center in the upper-right corner of the configuration tab of the node to go to Operation Center.

    2. View the EMR Spark Streaming node that is running. For more information, see Manage real-time compute nodes.