E-MapReduce (EMR) Spark Streaming nodes can be used to process streaming data with high throughput. These nodes support fault tolerance, which allows a data stream to be restored after an error occurs. This topic describes how to create an EMR Spark Streaming node and use the node to develop data.

Prerequisites

  • An Alibaba Cloud EMR cluster is created. The inbound rules of the security group to which the cluster belongs include the following rules:
    • Action: Allow
    • Protocol type: Custom TCP
    • Port range: 8898/8898
    • Authorization object: 100.104.0.0/16
  • The EMR cluster is associated with your DataWorks workspace as a compute engine instance. The EMR folder is displayed on the DataStudio page only after an EMR cluster is associated with your workspace as a compute engine instance on the Workspace Management page. For more information, see Configure a workspace.
  • An exclusive resource group for scheduling is created. For more information, see Create and use an exclusive resource group for scheduling.
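
The inbound rule listed in the prerequisites can be sketched as a CLI call, assuming you manage the security group with the Alibaba Cloud CLI (aliyun) and its ECS AuthorizeSecurityGroup operation; the security group ID and region below are hypothetical placeholders that you must replace with your own values:

```shell
# Hypothetical sketch: add the inbound rule from the prerequisites by using
# the Alibaba Cloud CLI (aliyun). The security group ID and region are
# placeholder values -- replace them with your own before running.
SECURITY_GROUP_ID="sg-xxxxxxxxxxxxxxxx"   # placeholder security group ID
REGION_ID="cn-hangzhou"                   # placeholder region ID

# Assemble the call and echo it first so you can review it before running.
CMD="aliyun ecs AuthorizeSecurityGroup \
  --RegionId ${REGION_ID} \
  --SecurityGroupId ${SECURITY_GROUP_ID} \
  --IpProtocol tcp \
  --PortRange 8898/8898 \
  --SourceCidrIp 100.104.0.0/16 \
  --Policy accept"
echo "${CMD}"
# Run ${CMD} instead of echoing it once the parameters look correct.
```
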

Limits

  • An EMR Spark Streaming node can run only on an exclusive resource group for scheduling.
  • If the exclusive resource group for scheduling and EMR cluster that you use are created before June 10, 2021, you must submit a ticket to upgrade the resource group and the EMR cluster.

Create an EMR Spark Streaming node and use the node to develop data

  1. Go to the DataStudio page.
    1. Log on to the DataWorks console.
    2. In the left-side navigation pane, click Workspaces.
    3. In the top navigation bar, select the region where the required workspace resides, find the workspace, and then click Data Analytics.
  2. Create a workflow.
    If you have a workflow, skip this step.
    1. Move the pointer over the Create icon and select Workflow.
    2. In the Create Workflow dialog box, set the Workflow Name parameter.
    3. Click Create.
  3. Create an EMR Spark Streaming node.
    1. On the DataStudio page, move the pointer over the Create icon and choose EMR > EMR Spark Streaming.
      Alternatively, you can find the desired workflow, right-click the workflow name, and then choose Create > EMR > EMR Spark Streaming.
    2. In the Create Node dialog box, set the Name parameter and click Commit. The configuration tab of the EMR Spark Streaming node appears.
  4. Use the EMR Spark Streaming node to develop data.
    1. Select the EMR compute engine instance.
      On the configuration tab of the EMR Spark Streaming node, select the EMR compute engine instance.
    2. Write code for the EMR Spark Streaming node.
      On the configuration tab of the EMR Spark Streaming node, write code for the node.
      Sample code:
      --master yarn-cluster --executor-cores 2 --executor-memory 2g --driver-memory 1g --num-executors 2 --class com.aliyun.emr.example.spark.streaming.JavaLoghubWordCount /tmp/examples-1.2.0-shaded.jar <logService-project> <logService-store> <group> <endpoint> <access-key-id> <access-key-secret>
      After the node is created, the system automatically adds the spark-submit command prefix. The following code is the command that is actually submitted to the compute engine instance:
      spark-submit --master yarn-cluster --executor-cores 2 --executor-memory 2g --driver-memory 1g --num-executors 2 --class com.aliyun.emr.example.spark.streaming.JavaLoghubWordCount /tmp/examples-1.2.0-shaded.jar <logService-project> <logService-store> <group> <endpoint> <access-key-id> <access-key-secret>
      • /tmp/examples-1.2.0-shaded.jar is the path of the JAR package that is compiled from the node code. For more information about the code for the node, see Consume data in real time.
        Note The JAR package can be stored in one of the following locations:
        • The master node of the EMR cluster.
        • Object Storage Service (OSS). We recommend that you store the JAR package in OSS. For more information, see Operations in the OSS console.
      • You must replace access-key-id and access-key-secret with the AccessKey ID and AccessKey secret of your Alibaba Cloud account. To obtain the AccessKey ID and AccessKey secret, you can log on to the DataWorks console, move the pointer over the profile picture in the upper-right corner, and then select AccessKey Management.
      For more information about Spark Streaming parameters, see the Spark documentation.
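      The placeholder replacement described above can be sketched in a shell snippet, assuming you keep the AccessKey pair in environment variables instead of hardcoding it; all variable names here are illustrative, not DataWorks conventions, and the Log Service values are hypothetical placeholders:

```shell
# Hypothetical sketch: fill the <...> placeholders of the sample command from
# variables so the AccessKey pair is not written into the node code directly.
# All names below are illustrative placeholders -- replace them with your own.
export ALIBABA_CLOUD_ACCESS_KEY_ID="your-access-key-id"          # placeholder
export ALIBABA_CLOUD_ACCESS_KEY_SECRET="your-access-key-secret"  # placeholder

LOG_PROJECT="my-project"                  # placeholder Log Service project
LOG_STORE="my-logstore"                   # placeholder Logstore
CONSUMER_GROUP="my-group"                 # placeholder consumer group
ENDPOINT="cn-hangzhou.log.aliyuncs.com"   # placeholder Log Service endpoint

# Assemble the same command that the sample code produces, with the
# placeholders filled in from the variables above.
CMD="spark-submit --master yarn-cluster \
  --executor-cores 2 --executor-memory 2g \
  --driver-memory 1g --num-executors 2 \
  --class com.aliyun.emr.example.spark.streaming.JavaLoghubWordCount \
  /tmp/examples-1.2.0-shaded.jar \
  ${LOG_PROJECT} ${LOG_STORE} ${CONSUMER_GROUP} ${ENDPOINT} \
  ${ALIBABA_CLOUD_ACCESS_KEY_ID} ${ALIBABA_CLOUD_ACCESS_KEY_SECRET}"
echo "${CMD}"   # review the assembled command before running it
```

      Reading the credentials from the environment keeps them out of the node code and out of version history, which matters because the node content is stored and committed in DataWorks.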
    3. Configure a resource group for scheduling.
      1. Click the Run with Parameters icon in the top toolbar. In the Parameters dialog box, select the desired resource group for scheduling.
      2. Click OK.
    4. Save and run the EMR Spark Streaming node.
      In the top toolbar, click the Save icon to save the EMR Spark Streaming node, and then click the Run icon to run the node.
  5. Configure properties for the EMR Spark Streaming node.
    If you want the system to periodically run the EMR Spark Streaming node, you can click Properties in the right-side navigation pane to configure properties for the node based on your business requirements.
    • Configure basic properties for the EMR Spark Streaming node. For more information, see Configure basic properties.
    • Configure time properties for the EMR Spark Streaming node.

      You can select a mode to start the node and a mode to rerun the node. For more information, see Configure time properties.

    • Configure resource properties for the EMR Spark Streaming node. For more information, see Configure the resource group.
  6. Commit and deploy the EMR Spark Streaming node.
    1. Click the Save icon in the top toolbar to save the node.
    2. Click the Submit icon in the top toolbar to commit the node.
    3. In the Commit Node dialog box, enter your comments in the Change description field.
    4. Click OK.
    If you use a workspace in standard mode, you must deploy the node in the production environment after you commit the node. Click Deploy in the upper-right corner. For more information, see Deploy nodes.
  7. View the EMR Spark Streaming node.
    1. Click Operation Center in the upper-right corner of the DataStudio page to go to Operation Center.
    2. View the EMR Spark Streaming node that is running. For more information, see Manage real-time computing nodes.