E-MapReduce (EMR) Spark Streaming nodes can be used to process streaming data with high throughput. This type of node supports fault tolerance and can help you recover data streams in which errors occur. This topic describes how to create and use an EMR Spark Streaming node to develop data.

Prerequisites

The preparations for creating a node are complete for EMR and DataWorks. The required preparations vary based on the type of your EMR cluster.

Limits

  • EMR Spark Streaming nodes can be run only on an exclusive resource group for scheduling.
  • DataWorks no longer allows you to associate an EMR Hadoop cluster with a DataWorks workspace. However, the EMR Hadoop clusters that are associated with your DataWorks workspace can still be used.
  • If the exclusive resource group for scheduling and the EMR cluster that you use are created before June 10, 2022, you must upgrade the resource group and the EMR cluster. To upgrade the resource group and EMR cluster, you must submit a ticket.

Procedure

  1. Go to the DataStudio page.
    1. Log on to the DataWorks console.
    2. In the left-side navigation pane, click Workspaces.
    3. In the top navigation bar, select the region in which the workspace in which you want to create an EMR Spark Streaming node resides. Find the workspace and click DataStudio in the Actions column.
  2. Create a workflow.
    If you have an existing workflow, skip this step.
    1. Move the pointer over the Create icon and select Create Workflow.
    2. In the Create Workflow dialog box, configure the Workflow Name parameter.
    3. Click Create.
  3. Create an EMR Spark Streaming node.
    1. Move the pointer over the Create icon and choose Create Node > EMR > EMR Spark Streaming.
      You can also find the desired workflow, right-click the workflow, and then choose Create Node > EMR > EMR Spark Streaming.
    2. In the Create Node dialog box, configure the Name, Engine Instance, Node Type, and Path parameters.
      Note The node name must be 1 to 128 characters in length and can contain letters, digits, underscores (_), and periods (.).
    3. Click Commit. Then, the configuration tab of the EMR Spark Streaming node appears.
  4. Create and reference an EMR JAR resource.
    If you use an EMR data lake cluster, you can perform the following steps to reference an EMR JAR resource:
    Note If an EMR Spark Streaming node depends on large amounts of resources, the resources cannot be uploaded by using the DataWorks console. In this case, you can store the resources in HDFS and then reference the resources in the code of the EMR Spark Streaming node. Sample code:
    spark-submit --master yarn \
    --deploy-mode cluster \
    --name SparkPi \
    --driver-memory 4G \
    --driver-cores 1 \
    --num-executors 5 \
    --executor-memory 4G \
    --executor-cores 1 \
    --class org.apache.spark.examples.JavaSparkPi \
    hdfs:///tmp/jars/spark-examples_2.11-2.4.8.jar 100
    1. For more information about how to create an EMR JAR resource, see Create and use an EMR JAR resource. The first time you use DataWorks to access Object Storage Service (OSS), click Authorize to the right of OSS to authorize DataWorks and EMR to access OSS.
    2. Reference the EMR JAR resource.
      1. Open the EMR Spark Streaming node. The configuration tab of the node appears.
      2. Find the resource that you want to reference under Resource in the EMR folder, right-click the resource name, and then select Insert Resource Path.
      3. If a statement in the ##@resource_reference{""} format appears on the configuration tab of the EMR Spark Streaming node, the resource is referenced. Then, run the following code. You must replace the placeholder values, such as the resource package name, bucket name, and directory, with actual values.
        ##@resource_reference{"examples-1.2.0-shaded.jar"}
        --master yarn-cluster --executor-cores 2 --executor-memory 2g --driver-memory 1g --num-executors 2 --class com.aliyun.emr.example.spark.streaming.JavaLoghubWordCount examples-1.2.0-shaded.jar <logService-project> <logService-store> <group> <endpoint> <access-key-id> <access-key-secret>
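    The note above suggests storing large resources in HDFS and referencing them from the node code, but does not show the upload step. The following is a minimal sketch of staging a JAR in HDFS; the paths and JAR name follow the HDFS sample above, and echo keeps it a dry run (remove echo on a host where the Hadoop client is installed):

```shell
# Sketch: stage a local JAR in HDFS so that the node's spark-submit command can
# reference it. Paths and the JAR name mirror the sample above; adjust them to
# your environment. "echo" makes this a dry run that only prints the commands.
JAR=spark-examples_2.11-2.4.8.jar
DEST=/tmp/jars

echo hadoop fs -mkdir -p "$DEST"           # create the target directory if missing
echo hadoop fs -put -f "./$JAR" "$DEST/"   # upload, overwriting any earlier copy
echo hadoop fs -ls "$DEST"                 # verify that the JAR is in place
```

    After the upload, the JAR can be referenced in the node code by its full path, such as hdfs:///tmp/jars/spark-examples_2.11-2.4.8.jar.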
  5. Use the EMR Spark Streaming node to develop data.
    1. Select the EMR compute engine instance.
      On the configuration tab of the EMR Spark Streaming node, select the EMR compute engine instance.
    2. Write code for the EMR Spark Streaming node.
      On the configuration tab of the EMR Spark Streaming node, write code for the node. Sample code:
      spark-submit --master yarn-cluster --executor-cores 2 --executor-memory 2g --driver-memory 1g --num-executors 2 --class com.aliyun.emr.example.spark.streaming.JavaLoghubWordCount examples-1.2.0-shaded.jar <logService-project> <logService-store> <group> <endpoint> <access-key-id> <access-key-secret>
      Note
      • In this example, the examples-1.2.0-shaded.jar JAR package is uploaded in the DataWorks console.
      • You must replace access-key-id and access-key-secret with the AccessKey ID and AccessKey secret of your Alibaba Cloud account. To obtain the AccessKey ID and AccessKey secret, you can log on to the DataWorks console, move the pointer over the profile picture in the upper-right corner, and then select AccessKey Management.
      • You cannot add comments when you write code for the EMR Spark Streaming node.
    3. Configure a resource group for scheduling.
      • Click the Run with Parameters icon in the top toolbar. In the Parameters dialog box, select the desired resource group for scheduling.
      • Click OK.
    4. Save and run the EMR Spark Streaming node.
      In the top toolbar, click the Save icon to save the EMR Spark Streaming node. Then, click the Run icon to run the node.
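    The sample command in this step contains several <...> placeholders. A minimal shell sketch of filling them in one place before pasting the result into the node (all values below are dummies for illustration; the endpoint and AccessKey pair are deliberately left as placeholders, and real keys should never be hardcoded):

```shell
# Sketch: assemble the node code from variables so that each <...> placeholder
# in the sample above is filled in one place. The project, store, and group
# values are dummies; replace them with your own Log Service settings.
PROJECT="my-sls-project"
STORE="my-logstore"
GROUP="my-consumer-group"
ENDPOINT="<endpoint>"
AK_ID="<access-key-id>"
AK_SECRET="<access-key-secret>"

CMD="spark-submit --master yarn-cluster --executor-cores 2 --executor-memory 2g \
--driver-memory 1g --num-executors 2 \
--class com.aliyun.emr.example.spark.streaming.JavaLoghubWordCount \
examples-1.2.0-shaded.jar $PROJECT $STORE $GROUP $ENDPOINT $AK_ID $AK_SECRET"

echo "$CMD"   # inspect the final command before pasting it into the node
```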
  6. Configure the parameters on the Advanced Settings tab.
    If you use an EMR data lake cluster, you can configure the following advanced parameters:
    • "queue": the scheduling queue to which jobs are committed. Default value: default.
    • "priority": the priority. Default value: 1.
    Note
    • You can also add a SparkConf parameter on the Advanced Settings tab for the EMR Spark Streaming node. When you commit the code for the EMR Spark Streaming node in DataWorks, DataWorks adds the custom parameter to the command. For example, you can add a custom parameter whose key is spark.driver.memory and value is 2g.
    • Spark nodes can submit jobs to YARN only in cluster or local mode. Spark 2.x in cluster mode supports metadata lineage.
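    The note above states that a custom SparkConf parameter on the Advanced Settings tab is added to the submitted command. A minimal sketch of that mapping, using the spark.driver.memory example from the note (the base command here is a shortened, hypothetical stand-in, not the real command DataWorks builds):

```shell
# Sketch: how a SparkConf entry (key/value) on the Advanced Settings tab maps to
# a --conf flag on the spark-submit command. The key/value pair is the example
# from the note above; the base command is a hypothetical stand-in.
KEY="spark.driver.memory"
VALUE="2g"
BASE="spark-submit --master yarn-cluster --class com.example.Main app.jar"  # hypothetical
FULL="$BASE --conf ${KEY}=${VALUE}"

echo "$FULL"   # the flag Spark reads at startup: --conf spark.driver.memory=2g
```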
  7. Configure scheduling properties for the EMR Spark Streaming node.
    If you want the system to periodically run the EMR Spark Streaming node, you can click Properties in the right-side navigation pane to configure properties for the node based on your business requirements.
    • Configure basic properties for the EMR Spark Streaming node. For more information, see Configure basic properties.
    • Configure the scheduling cycle, rerun properties, and scheduling dependencies of the EMR Spark Streaming node. For more information, see Configure time properties and Configure same-cycle scheduling dependencies.
      Note Before you commit the EMR Spark Streaming node, you must configure the Rerun and Parent Nodes parameters on the Properties tab.
    • Configure resource properties for the EMR Spark Streaming node. For more information, see Configure a resource group. If the EMR Spark Streaming node that you created is an auto triggered node and you want the node to access the Internet or a virtual private cloud (VPC), you must select the resource group for scheduling that is connected to the node. For more information, see Select a network connectivity solution.
  8. Commit and deploy the EMR Spark Streaming node.
    1. Click the Save icon in the top toolbar to save the node.
    2. Click the Submit icon in the top toolbar to commit the node.
    3. In the Commit Node dialog box, configure the Change description parameter.
    4. Click OK.
    If you use a workspace in standard mode, you must deploy the node in the production environment after you commit the node. On the left side of the top navigation bar, click Deploy. For more information, see Deploy nodes.
  9. View the EMR Spark Streaming node.
    1. Click Operation Center in the upper-right corner of the configuration tab of the node to go to Operation Center.
    2. View the EMR Spark Streaming node that is running. For more information, see Manage real-time computing nodes.