E-MapReduce (EMR) Spark Streaming nodes can be used to process streaming data with high throughput. This type of node supports fault tolerance, which allows data streams to be recovered quickly after errors occur. This topic describes how to create an EMR Spark Streaming node and use the node to develop data.

Prerequisites

  • An Alibaba Cloud EMR cluster is created. The inbound rules of the security group to which the cluster belongs include the following rules:
    • Action: Allow
    • Protocol type: Custom TCP
    • Port range: 8898/8898
    • Authorization object: 100.104.0.0/16
  • An EMR compute engine instance is associated with your workspace. The EMR folder is displayed on the DataStudio page only after you associate an EMR compute engine instance with your workspace on the Workspace Management page. For more information, see Configure a workspace.
  • An exclusive resource group for scheduling is created. For more information, see Create and use an exclusive resource group for scheduling.

  • The hadoop.http.authentication.simple.anonymous.allowed parameter in the Hadoop Distributed File System (HDFS) configuration of the EMR cluster is set to true, and the HDFS and YARN services are restarted.
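
    You normally set this parameter in the EMR console rather than by editing configuration files. For reference, the equivalent Hadoop configuration fragment would look like the following (a sketch; the exact configuration file depends on your cluster version):

```xml
<!-- Allow anonymous access to Hadoop HTTP endpoints under simple authentication. -->
<property>
  <name>hadoop.http.authentication.simple.anonymous.allowed</name>
  <value>true</value>
</property>
```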

Limits

  • An EMR Spark Streaming node can be run only on an exclusive resource group for scheduling.
  • If the exclusive resource group for scheduling and the EMR cluster that you use were created before June 10, 2021, you must submit a ticket to upgrade the resource group and the EMR cluster.

Create an EMR Spark Streaming node and use the node to develop data

  1. Go to the DataStudio page.
    1. Log on to the DataWorks console.
    2. In the left-side navigation pane, click Workspaces.
    3. In the top navigation bar, select the region where your workspace resides, find the workspace, and then click Data Analytics in the Actions column.
  2. Create a workflow.
    If you have a workflow, skip this step.
    1. Move the pointer over the Create icon and select Workflow.
    2. In the Create Workflow dialog box, set the Workflow Name parameter.
    3. Click Create.
  3. Create an EMR Spark Streaming node.
    1. On the DataStudio page, move the pointer over the Create icon and choose EMR > EMR Spark Streaming.
      Alternatively, you can find the desired workflow, right-click the workflow name, and then choose Create > EMR > EMR Spark Streaming.
    2. In the Create Node dialog box, set the Node Name, Node Type, and Location parameters.
      Note The node name must be 1 to 128 characters in length and can contain letters, digits, underscores (_), and periods (.).
    3. Click Commit. Then, the configuration tab of the EMR Spark Streaming node appears.
  4. Use the EMR Spark Streaming node to develop data.
    1. Select the EMR compute engine instance.
      On the configuration tab of the EMR Spark Streaming node, select the EMR compute engine instance.
    2. Write code for the EMR Spark Streaming node.
      On the configuration tab of the EMR Spark Streaming node, write code for the node. The following code provides an example:
      spark-submit --master yarn-cluster \
        --executor-cores 2 \
        --executor-memory 2g \
        --driver-memory 1g \
        --num-executors 2 \
        --class com.aliyun.emr.example.spark.streaming.JavaLoghubWordCount \
        /tmp/examples-1.2.0-shaded.jar \
        <logService-project> <logService-store> <group> <endpoint> <access-key-id> <access-key-secret>
      • The spark-submit command prefix is automatically generated when the node is created.
      • /tmp/examples-1.2.0-shaded.jar is the path of the JAR package that is generated from the node code. For more information about the code for the node, see Consume data in real time.
        Note The JAR package can be stored in the master node of the EMR cluster or in Object Storage Service (OSS). We recommend that you store the JAR package in OSS. For more information about how to store the JAR package in OSS, see Operations in the OSS console.
      • You must replace <access-key-id> and <access-key-secret> with the AccessKey ID and AccessKey secret of your Alibaba Cloud account. To obtain them, log on to the DataWorks console, move the pointer over the profile picture in the upper-right corner, and then select AccessKey Management.
      For more information about Spark Streaming parameters, see Spark documentation.
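
      The example class name suggests a streaming word count over Log Service data. Setting Spark aside, the per-micro-batch aggregation that such a job performs can be sketched in plain Python (illustrative only; the actual job uses the Spark Streaming API and the class shown in the command above):

```python
from collections import Counter

def count_words_in_batch(lines):
    """Count word occurrences in one micro-batch of log lines.

    This mirrors the per-batch aggregation step of a streaming word
    count; a real Spark Streaming job applies the same logic to each
    micro-batch through map/reduce transformations.
    """
    counter = Counter()
    for line in lines:
        counter.update(line.split())
    return dict(counter)

# One simulated micro-batch of log lines.
batch = ["error timeout", "error retry", "ok"]
print(count_words_in_batch(batch))  # {'error': 2, 'timeout': 1, 'retry': 1, 'ok': 1}
```

      In the real job, Spark Streaming repeats this aggregation for every micro-batch interval over the data it reads from Log Service.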
    3. Configure a resource group for scheduling.
      • Click the Run with Parameters icon in the top toolbar. In the Parameters dialog box, select the desired resource group for scheduling.
      • Click OK.
    4. Save and run the EMR Spark Streaming node.
      In the top toolbar, click the Save icon to save the EMR Spark Streaming node and click the Run icon to run the EMR Spark Streaming node.
  5. Click Advanced Settings in the right-side navigation pane. On the Advanced Settings tab, configure the parameters based on your business requirements.
    • "USE_GATEWAY":true: If you set this parameter to true, the EMR Spark Streaming node is automatically committed to the master node of an EMR gateway cluster.
    • "SPARK_CONF": "--conf spark.driver.memory=2g --conf xxx=xxx": the parameters for running Spark jobs. You can configure multiple parameters in the --conf xxx=xxx format.
    • "queue": the scheduling queue to which jobs are committed. Default value: default.
    • "priority": the priority. Default value: 1.
    • "FLOW_SKIP_SQL_ANALYZE": specifies how SQL statements are executed. The value false indicates that only one SQL statement is executed at a time, and the value true indicates that multiple SQL statements are executed at a time.
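
    Taken together, the parameters above form a JSON-style key-value map on the Advanced Settings tab. An illustrative combination follows (the values are examples based on the defaults listed above; the exact format shown in the console may differ):

```json
{
  "USE_GATEWAY": true,
  "SPARK_CONF": "--conf spark.driver.memory=2g",
  "queue": "default",
  "priority": 1,
  "FLOW_SKIP_SQL_ANALYZE": "false"
}
```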
  6. Configure properties for the EMR Spark Streaming node.
    If you want the system to periodically run the EMR Spark Streaming node, you can click Properties in the right-side navigation pane to configure properties for the node based on your business requirements.
    • Configure basic properties for the EMR Spark Streaming node. For more information, see Basic properties.
    • Configure time properties for the EMR Spark Streaming node.

      You can select a mode to start the node and a mode to rerun the node. For more information, see Configure time properties.

    • Configure resource properties for the EMR Spark Streaming node. For more information, see Configure the resource group.
  7. Commit and deploy the EMR Spark Streaming node.
    1. Click the Save icon in the top toolbar to save the node.
    2. Click the Submit icon in the top toolbar to commit the node.
    3. In the Commit Node dialog box, enter your comments in the Change description field.
    4. Click OK.
    If you use a workspace in standard mode, you must deploy the node in the production environment after you commit the node. Click Deploy in the upper-right corner. For more information, see Deploy nodes.
  8. View the EMR Spark Streaming node.
    1. Click Operation Center in the upper-right corner of the DataStudio page to go to Operation Center.
    2. View the EMR Spark Streaming node that is running. For more information, see Manage real-time computing nodes.