This topic describes how to create an E-MapReduce (EMR) Spark SQL node. EMR Spark SQL nodes allow you to use the distributed SQL query engine of Spark to process structured data, which improves job efficiency.

Prerequisites

The required preparations for EMR and DataWorks are complete before you create a node. The preparations vary based on the type of your EMR cluster, such as a data lake cluster or a Hadoop cluster.

Limits

  • EMR Spark SQL nodes can be run only on an exclusive resource group for scheduling.
  • DataWorks no longer allows you to associate an EMR Hadoop cluster with a DataWorks workspace. However, EMR Hadoop clusters that are already associated with your workspace can still be used.

Procedure

  1. Go to the DataStudio page.
    1. Log on to the DataWorks console.
    2. In the left-side navigation pane, click Workspaces.
    3. In the top navigation bar, select the region in which your workspace resides. Find the workspace in which you want to create an EMR Spark SQL node and click DataStudio in the Actions column.
  2. Create a workflow.
    If you have an existing workflow, skip this step.
    1. Move the pointer over the Create icon and select Create Workflow.
    2. In the Create Workflow dialog box, configure the Workflow Name parameter.
    3. Click Create.
  3. Create an EMR Spark SQL node.
    1. Move the pointer over the Create icon and choose Create Node > EMR > EMR Spark SQL.
      Alternatively, you can find the desired workflow, right-click the workflow, and then choose Create Node > EMR > EMR Spark SQL.
    2. In the Create Node dialog box, configure the Name, Engine Instance, Node Type, and Path parameters.
      Note The node name must be 1 to 128 characters in length and can contain letters, digits, underscores (_), and periods (.).
    3. Click Commit. Then, the configuration tab of the EMR Spark SQL node appears.
  4. Use the EMR Spark SQL node to develop data.
    1. Select a resource group.
      In the top toolbar, click the Run with Parameters icon. In the Parameters dialog box, select the created resource group for scheduling.
      Note
      • If you want to access a data source over the Internet or a VPC, you must use the resource group for scheduling that is connected to the data source. For more information, see Establish a network connection between a resource group and a data source.
      • If you want to change the resource group in subsequent operations, you can change the resource group in the Parameters dialog box.
    2. Write code for the EMR Spark SQL node.
      In the SQL editor, enter the code for the EMR Spark SQL node. Example:
      show tables;
      CREATE TABLE IF NOT EXISTS userinfo_new_${var} (
          ip STRING COMMENT 'IP address',
          uid STRING COMMENT 'User ID'
      ) PARTITIONED BY (
          dt STRING
      ); -- You can assign a specific scheduling parameter to the var variable.
      Note
      • The size of the SQL statement for the node cannot exceed 130 KB.
      • If you use an EMR Spark SQL node to query data, a maximum of 10,000 data records can be returned and the total size of the returned data records cannot exceed 10 MB.
      • If multiple EMR compute engine instances are associated with the current workspace, you must select one of the compute engine instances based on your business requirements. If only one EMR compute engine instance is associated with the current workspace, you do not need to select a compute engine instance.
      If you want to change the scheduling parameter that is assigned to the variable in the code, click Run with Parameters in the top toolbar. For more information about the assignment logic of scheduling parameters, see What are the differences in the value assignment logic of scheduling parameters among the Run, Run with Parameters, and Perform Smoke Testing in Development Environment modes?.
      Note For more information about how to use scheduling parameters, see Overview of scheduling parameters.
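      For example, suppose that you assign var=$[yyyymmdd] in the Parameters dialog box. This assignment is only an illustration; $[yyyymmdd] is a scheduling parameter expression that resolves to a date in the yyyymmdd format. For an instance whose resolved date is 20251114, the sample statement is executed as follows:
      -- Assumed assignment in the Parameters dialog box: var=$[yyyymmdd]
      -- The table name in the preceding example then resolves to a date-suffixed name:
      CREATE TABLE IF NOT EXISTS userinfo_new_20251114 (
          ip STRING COMMENT 'IP address',
          uid STRING COMMENT 'User ID'
      ) PARTITIONED BY (
          dt STRING
      );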
    3. Save and execute the SQL statements.
      In the top toolbar, click the Save icon to save the SQL statements. Then, click the Run icon to execute the SQL statements.
  5. Configure the parameters on the Advanced Settings tab.
    The following table describes the advanced parameters that you can configure for different types of EMR clusters.
    Cluster type Advanced parameter
    Data lake cluster
    • "queue": the scheduling queue to which jobs are committed. Default value: default.
    • "priority": the priority. Default value: 1.
    • "FLOW_SKIP_SQL_ANALYZE": specifies how SQL statements are executed. Valid values:
      • true: indicates that multiple SQL statements are executed at a time.
      • false: indicates that only one SQL statement is executed at a time.
      Note This parameter is available only for testing in the data development environment of a DataWorks workspace.
    • "ENABLE_SPARKSQL_JDBC": specifies whether SQL statements are committed by using JDBC. Valid values:
      • true: indicates that SQL statements are committed by using JDBC. If the EMR cluster does not have the Kyuubi service, the statements are committed to Spark Thrift Server; if the cluster has the Kyuubi service, the statements are committed to Kyuubi by using JDBC. If this parameter is set to true, the data lineage feature is supported.
      • false: indicates that SQL statements are committed by using the spark-submit command. In this case, the data lineage feature is not supported.
        Note
        • If you run the spark-submit command to commit the SQL statements, temporary files and directories are generated in the /tmp directory of the HDFS service of the EMR cluster by default. Make sure that you have the read and write permissions on the directory.
        • If you run the spark-submit command to commit the SQL statements, you can add a SparkConf parameter on the Advanced Settings tab. After you commit the code, DataWorks adds the custom parameter to the command. Example: "spark.driver.memory" : "2g".
      Default value: false.
    Hadoop cluster
    • "queue": the scheduling queue to which jobs are committed. Default value: default.
    • "vcores": the number of vCPUs. Default value: 1.
    • "memory": the memory that is allocated to the launcher. Unit: MB. Default value: 2048.
    • "priority": the priority. Default value: 1.
    • "FLOW_SKIP_SQL_ANALYZE": the manner in which SQL statements are executed. Valid values:
      • true: Multiple SQL statements are executed at a time.
      • false: Only one SQL statement is executed at a time.
    • "USE_GATEWAY": specifies whether a gateway cluster is used to commit jobs on the current node. Valid values:
      • true: A gateway cluster is used to commit jobs.
      • false: No gateway cluster is used to commit jobs. Jobs are automatically committed to the master node.
      Note If the EMR cluster to which the node belongs is not associated with a gateway cluster but the USE_GATEWAY parameter is set to true, jobs may fail to be committed.
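    For reference, the following sketch shows how these parameters might be entered on the Advanced Settings tab. Each entry is a key-value pair on its own line, and all values are illustrative:
    "queue": "default"
    "priority": "1"
    "FLOW_SKIP_SQL_ANALYZE": "false"
    "ENABLE_SPARKSQL_JDBC": "false"
    "spark.driver.memory": "2g"
    The "spark.driver.memory" entry is a custom SparkConf parameter. As noted above, such entries take effect only when jobs are committed by using the spark-submit command.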
  6. Configure scheduling properties for the EMR Spark SQL node.
    If you want the system to periodically run the node, click Properties in the right-side navigation pane and configure the scheduling properties based on your business requirements.
  7. Commit and deploy the EMR Spark SQL node.
    1. Click the Save icon in the top toolbar to save the node.
    2. Click the Submit icon in the top toolbar to commit the node.
    3. In the Commit Node dialog box, configure the Change description parameter.
    4. Click OK.
    If you use a workspace in standard mode, you must deploy the node in the production environment after you commit the node. On the left side of the top navigation bar, click Deploy. For more information, see Deploy nodes.
  8. View the EMR Spark SQL node.
    1. Click Operation Center in the upper-right corner of the DataStudio page to go to Operation Center.
    2. View the scheduled EMR Spark SQL node. For more information, see View and manage auto triggered nodes.