You can create an E-MapReduce (EMR) Spark Shell node and run the node by using the code editor. This topic describes how to create an EMR Spark Shell node and use the node to develop data.

Prerequisites

  • An Alibaba Cloud EMR cluster is created. The inbound rules of the security group to which the cluster belongs include the following rule:
    • Action: Allow
    • Protocol type: Custom TCP
    • Port range: 8898/8898
    • Authorization object: 100.104.0.0/16
  • An EMR compute engine instance is associated with the desired workspace. The EMR folder is displayed only after you associate an EMR compute engine instance with the workspace on the Workspace Management page. For more information, see Configure a workspace.
  • If you integrate Hive with Ranger in EMR, you must modify whitelist configurations and restart Hive before you develop EMR nodes in DataWorks. Otherwise, the error message Cannot modify spark.yarn.queue at runtime or Cannot modify SKYNET_BIZDATE at runtime is returned when you run EMR nodes.
    1. You can modify the whitelist configurations by using custom parameters in EMR. You can append key-value pairs to the value of a custom parameter. In this example, the custom parameter for Hive components is used. The following code provides an example:
      hive.security.authorization.sqlstd.confwhitelist.append=tez.*|spark.*|mapred.*|mapreduce.*|ALISA.*|SKYNET.*
      Note In the code, ALISA.* and SKYNET.* are configurations in DataWorks.
    2. After the whitelist configurations are modified, you must restart the Hive service to make the configurations take effect. For more information, see Restart a service.
  • An exclusive resource group for scheduling is created, and the resource group is associated with the virtual private cloud (VPC) where the EMR cluster resides. For more information, see Create and use an exclusive resource group for scheduling.
    Note You can use only exclusive resource groups for scheduling to run EMR Spark Shell nodes.

Create an EMR Spark Shell node and use the node to develop data

  1. Go to the DataStudio page.
    1. Log on to the DataWorks console.
    2. In the left-side navigation pane, click Workspaces.
    3. In the top navigation bar, select the region where your workspace resides, find the workspace, and then click Data Analytics in the Actions column.
  2. Create a workflow.
    If you have a workflow, skip this step.
    1. Move the pointer over the Create icon and select Workflow.
    2. In the Create Workflow dialog box, set the Workflow Name parameter.
    3. Click Create.
  3. Create an EMR Spark Shell node.
    1. On the DataStudio page, move the pointer over the Create icon and choose EMR > EMR Spark Shell.
      Alternatively, you can find the desired workflow, right-click the workflow name, and then choose Create > EMR > EMR Spark Shell.
    2. In the Create Node dialog box, set the Node Name, Node Type, and Location parameters.
      Note The node name must be 1 to 128 characters in length and can contain letters, digits, underscores (_), and periods (.).
    3. Click Commit. Then, the configuration tab of the EMR Spark Shell node appears.
  4. Use the EMR Spark Shell node to develop data.
    The following code provides an example:
    // Estimate Pi by sampling 100 random points and counting those that fall inside the unit circle.
    val count = sc.parallelize(1 to 100).filter { _ =>
      val x = math.random
      val y = math.random
      x*x + y*y < 1
    }.count()
    println(s"Pi is roughly ${4.0 * count / 100}")
    
    // ${var} references a scheduling parameter; DataWorks replaces it with the assigned value before the code runs.
    println("${var}")

    You can add scheduling parameters to the code. For more information about the scheduling parameters, see Configure scheduling parameters.
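    For example, the following sketch shows one way a scheduling parameter can be referenced in the node code. The parameter name var and the assignment var=$[yyyymmdd] are assumptions for illustration; define the actual parameter and assign its value on the Properties tab.
    // Assumption: a scheduling parameter named var is defined for this node, for example var=$[yyyymmdd].
    // DataWorks replaces ${var} with the assigned value before the code is executed.
    val bizdate = "${var}"
    println(s"Value passed by the scheduler: $bizdate")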

    If you want to change the values that are assigned to the parameters in the code, click Run with Parameters in the top toolbar. For more information about value assignment for the scheduling parameters, see Scheduling parameters.

    For more information about how to configure a Spark Shell job, see Configure a Spark Shell job.
  5. Click Advanced Settings in the right-side navigation pane. On the Advanced Settings tab, configure the following parameters based on your business requirements, as shown in the sketch after this list.
    • "USE_GATEWAY":true: If you set this parameter to true, the EMR Presto node is automatically committed to the master node of an EMR gateway cluster.
    • "SPARK_CONF": "--conf spark.driver.memory=2g --conf xxx=xxx": the parameters for running Spark jobs. You can configure multiple parameters in the --conf xxx=xxx format.
    • "queue": the scheduling queue to which jobs are committed. Default value: default.
    • "vcores": the number of CPU cores. Default value:1.
    • "memory": the memory that is allocated to the launcher, in MB. Default value: 2048.
    • "priority": the priority. Default value: 1.
    • "FLOW_SKIP_SQL_ANALYZE": specifies how SQL statements are executed. The value false indicates that only one SQL statement is executed at a time, and the value true indicates that multiple SQL statements are executed at a time.
  6. Configure properties for the EMR Spark Shell node.
    If you want the system to periodically run the EMR Spark Shell node, you can click Properties in the right-side navigation pane to configure properties for the node based on your business requirements.
  7. Commit and deploy the EMR Spark Shell node.
    1. Click the Save icon in the top toolbar to save the node.
    2. Click the Submit icon in the top toolbar to commit the node.
    3. In the Commit Node dialog box, enter your comments in the Change description field.
    4. Click OK.
    If you use a workspace in standard mode, you must deploy the node in the production environment after you commit the node. Click Deploy in the upper-right corner. For more information, see Deploy nodes.
  8. View the EMR Spark Shell node.
    1. On the editing tab of the EMR Spark Shell node, click Operation Center in the upper-right corner to go to Operation Center.
    2. View the scheduled EMR Spark Shell node. For more information, see View auto triggered nodes.