This topic describes how to create an E-MapReduce (EMR) Spark SQL node. EMR Spark SQL nodes allow you to use the distributed SQL query engine of Spark to process structured data, which helps improve the efficiency of your jobs.

Prerequisites

  • An Alibaba Cloud EMR cluster is created, and an inbound rule that contains the following content is added to the security group to which the cluster belongs.
    • Action: Allow
    • Protocol type: Custom TCP
    • Port range: 8898/8898
    • Authorization object: 100.104.0.0/16
  • An EMR compute engine instance is associated with your workspace. The EMR folder is displayed only after you associate an EMR compute engine instance with the workspace on the Workspace Management page. For more information, see Configure a workspace.
  • If you integrate Hive with Ranger in EMR, you must modify whitelist configurations and restart Hive before you develop EMR Spark SQL nodes in DataWorks. Otherwise, the error message Cannot modify spark.yarn.queue at runtime or Cannot modify SKYNET_BIZDATE at runtime is returned when you run EMR Spark SQL nodes. The sketch after this list shows the kind of statements that trigger these errors.
    1. You can modify the whitelist configurations by using custom parameters in EMR. You can append key-value pairs to the value of a custom parameter. In this example, the custom parameter for Hive components is used. The following code provides an example:
      hive.security.authorization.sqlstd.confwhitelist.append=tez.*|spark.*|mapred.*|mapreduce.*|ALISA.*|SKYNET.*
      Note In the preceding code, ALISA.* and SKYNET.* are specific to DataWorks.
    2. After the whitelist configurations are modified, you must restart the Hive service to make the configurations take effect. For more information, see Restart a service.
  • An exclusive resource group for scheduling is created, and the resource group is associated with the virtual private cloud (VPC) where the EMR cluster resides. For more information, see Create and use an exclusive resource group for scheduling.
    Note You can use only exclusive resource groups for scheduling to run EMR Spark SQL nodes.
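  For context on the whitelist requirement above: DataWorks modifies scheduling-related properties such as spark.yarn.queue and SKYNET_BIZDATE at run time. If SQL standard authorization is enabled and these properties are not whitelisted, the SET statements are rejected. The following is a minimal sketch of such statements; the property values are placeholders for illustration only:
    SET spark.yarn.queue=default;    -- rejected with "Cannot modify spark.yarn.queue at runtime" if spark.* is not whitelisted
    SET SKYNET_BIZDATE=20240601;     -- rejected with "Cannot modify SKYNET_BIZDATE at runtime" if SKYNET.* is not whitelisted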

Procedure

  1. Go to the DataStudio page.
    1. Log on to the DataWorks console.
    2. In the left-side navigation pane, click Workspaces.
    3. In the top navigation bar, select the region where the workspace in which you want to create the EMR Spark SQL node resides. Find the workspace and click DataStudio in the Actions column.
  2. Create a workflow.
    If you have an existing workflow, skip this step.
    1. Move the pointer over the Create icon and select Create Workflow.
    2. In the Create Workflow dialog box, configure the Workflow Name parameter.
    3. Click Create.
  3. Create an EMR Spark SQL node.
    1. Move the pointer over the Create icon and choose Create Node > EMR > EMR Spark SQL.
      Alternatively, you can find the desired workflow, right-click the workflow, and then choose Create Node > EMR > EMR Spark SQL.
    2. In the Create Node dialog box, configure the Name, Engine Instance, Node Type, and Path parameters.
      Note The node name must be 1 to 128 characters in length and can contain letters, digits, underscores (_), and periods (.).
    3. Click Commit. Then, the configuration tab of the EMR Spark SQL node appears.
  4. Use the EMR Spark SQL node to develop data.
    1. Select a resource group.
      In the top toolbar, click the Run with Parameters icon. In the Parameters dialog box, select the exclusive resource group for scheduling that you created.
      Note
      • If you want to access a data source over the Internet or a VPC, you must use the resource group for scheduling that is connected to the data source. For more information, see Select a network connectivity solution.
      • If you want to change the resource group in subsequent operations, you can change the resource group in the Parameters dialog box.
    2. Write code for the EMR Spark SQL node.
      In the SQL editor, enter the code for the EMR Spark SQL node. Example:
      show tables;
      CREATE TABLE IF NOT EXISTS userinfo_new_${var} (
        ip STRING COMMENT 'IP address',
        uid STRING COMMENT 'User ID'
      )
      PARTITIONED BY (
        dt STRING
      ); -- You can assign a specific scheduling parameter to the var variable.
      Note
      • The size of the SQL statement for the node cannot exceed 130 KB.
      • If you use an EMR Spark SQL node to query data, a maximum of 10,000 data records can be returned and the total size of the returned data records cannot exceed 10 MB.
      • If multiple EMR compute engine instances are associated with the current workspace, you must select one of the compute engine instances based on your business requirements. If only one EMR compute engine instance is associated with the current workspace, you do not need to select a compute engine instance.
      If you want to change the scheduling parameter that is assigned to the variable in the code, click Run with Parameters in the top toolbar. For more information about the assignment logic of scheduling parameters, see What are the differences in the value assignment logic of scheduling parameters among the Run, Run with Parameters, and Perform Smoke Testing in Development Environment modes?.
      Note For more information about how to use scheduling parameters, see Overview of scheduling parameters.
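      For example, assuming you assign the scheduling parameter var=$[yyyymmdd] to the var variable (a hypothetical assignment for illustration), the variable is replaced when the node runs. For a data timestamp of June 1, 2024, the CREATE TABLE statement resolves roughly as follows:
      -- Resolved form of the statement above, assuming var=$[yyyymmdd] and a
      -- data timestamp of June 1, 2024. The actual suffix depends on the value you assign.
      CREATE TABLE IF NOT EXISTS userinfo_new_20240601 (
        ip STRING COMMENT 'IP address',
        uid STRING COMMENT 'User ID'
      )
      PARTITIONED BY (
        dt STRING
      );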
    3. Save and execute the SQL statements.
      In the top toolbar, click the Save icon to save the SQL statements. Then, click the Run icon to execute the SQL statements.
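      After the node runs, you can optionally write a test row into a partition and preview the result with a bounded query. The following statements are a sketch only: the IP address, user ID, and partition value are placeholders, and query results are limited to 10,000 rows and 10 MB as noted above.
      -- Write one placeholder row into a partition of the table created above,
      -- then preview a bounded number of rows.
      INSERT INTO userinfo_new_${var} PARTITION (dt = '20240601')
      VALUES ('192.168.0.1', 'user_001');

      SELECT ip, uid, dt
      FROM userinfo_new_${var}
      WHERE dt = '20240601'
      LIMIT 100;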
  5. Configure the parameters on the Advanced Settings tab.
    The following describes the advanced parameters that you can configure for each type of EMR cluster. A combined example of these settings for a data lake cluster appears at the end of this step.
    Cluster type: Data lake cluster
    Advanced parameters:
    • "queue": the scheduling queue to which jobs are committed. Default value: default.
    • "priority": the priority. Default value: 1.
    • "FLOW_SKIP_SQL_ANALYZE": specifies how SQL statements are executed. Valid values:
      • true: indicates that multiple SQL statements are executed at a time.
      • false: indicates that only one SQL statement is executed at a time.
    • "ENABLE_SPARKSQL_JDBC": specifies whether SQL statements are committed by using JDBC. Valid values:
      • true: indicates that the SQL statements are committed by using JDBC. If the EMR cluster does not have the Kyuubi service, the SQL statements are committed to Spark Thrift Server. If the EMR cluster has the Kyuubi service, the SQL statements are committed to Kyuubi by using JDBC. If the parameter is set to true, the data lineage feature is not supported.
      • false: indicates that the SQL statements are committed by using the spark-submit command. In this case, the data lineage feature is supported.
        Note
        • If you run the spark-submit command to commit the SQL statements, temporary files and directories are generated in the /tmp directory of the HDFS service of the EMR cluster by default. Make sure that you have the read and write permissions on the directory.
        • If you run the spark-submit command to commit the SQL statements, you can add a SparkConf parameter on the Advanced Settings tab. After you commit the code, DataWorks adds the custom parameter to the command. Example: "spark.driver.memory" : "2g".
      Default value: false.
    Cluster type: Hadoop cluster
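    For reference, the following is a sketch of how the advanced parameters for a data lake cluster might be combined on the Advanced Settings tab. The values are illustrative only, and the last entry is a custom SparkConf parameter that takes effect only when the statements are committed by using the spark-submit command:
      "queue": "default",
      "priority": "1",
      "FLOW_SKIP_SQL_ANALYZE": "false",
      "ENABLE_SPARKSQL_JDBC": "false",
      "spark.driver.memory": "2g"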
  6. Commit and deploy the EMR Spark SQL node.
    1. Click the Save icon in the top toolbar to save the node.
    2. Click the Submit icon in the top toolbar to commit the node.
    3. In the Commit Node dialog box, configure the Change description parameter.
    4. Click OK.
    If you use a workspace in standard mode, you must deploy the node in the production environment after you commit the node. On the left side of the top navigation bar, click Deploy. For more information, see Deploy nodes.
  7. View the EMR Spark SQL node.
    1. Click Operation Center in the upper-right corner of the DataStudio page to go to Operation Center.
    2. View the scheduled EMR Spark SQL node. For more information, see View and manage auto triggered nodes.