DataWorks: Create an EMR Spark SQL node

Last Updated: Nov 29, 2023

This topic describes how to create an E-MapReduce (EMR) Spark SQL node. EMR Spark SQL nodes allow you to use the distributed SQL query engine to process structured data, which helps improve job efficiency.

Prerequisites

  • A workflow is created in DataStudio.

    Development operations in different types of compute engines are performed based on workflows in DataStudio. Therefore, before you create a node, you must create a workflow. For more information, see Create a workflow.

  • An Alibaba Cloud EMR cluster is created and registered to DataWorks.

    Before you create an EMR node and use the node to develop EMR tasks in DataWorks, you must register the desired EMR cluster to a DataWorks workspace. For more information, see Register an EMR cluster to DataWorks.

  • (Required if you use a RAM user to develop tasks) The RAM user is added to the DataWorks workspace as a member and is assigned the Development or Workspace Administrator role. The Workspace Administrator role has more permissions than necessary. Exercise caution when you assign the Workspace Administrator role. For more information about how to add a member, see Add workspace members and assign roles to them.

  • An exclusive resource group for scheduling is purchased and configured. The configurations include association with a workspace and network configuration. For more information, see Create and use an exclusive resource group for scheduling.

Limits

  • This type of node can be run only on an exclusive resource group for scheduling.

  • If you want to manage metadata for a DataLake or custom cluster in DataWorks, you must configure EMR-HOOK in the cluster first. If you do not configure EMR-HOOK in the desired cluster, metadata cannot be displayed in real time, audit logs cannot be generated, and data lineages cannot be displayed in DataWorks. In addition, EMR governance tasks cannot be run. For more information about how to configure EMR-HOOK, see Use the Spark SQL extension feature to record data lineage and historical access information.

Procedure

  1. Go to the DataStudio page.

    Log on to the DataWorks console. In the left-side navigation pane, choose Data Modeling and Development > DataStudio. On the page that appears, select the desired workspace from the drop-down list and click Go to DataStudio.

  2. Create an EMR Spark SQL node.

    1. Move the pointer over the Create icon and choose Create Node > EMR > EMR Spark SQL.

      Alternatively, you can find the desired workflow, right-click the name of the workflow, and then choose Create Node > EMR > EMR Spark SQL.

    2. In the Create Node dialog box, configure the Name, Engine Instance, Node Type, and Path parameters.

    3. Click Confirm. The configuration tab of the EMR Spark SQL node appears.

  3. Use the EMR Spark SQL node to develop the task.

    1. Select a resource group.

      In the top toolbar, click the Run with Parameters icon. In the Parameters dialog box, select the created resource group for scheduling.

      Note
      • If you want to access a data source over the Internet or a VPC, you must use the resource group for scheduling that is connected to the data source. For more information, see Establish a network connection between a resource group and a data source.

      • If you want to change the resource group in subsequent operations, you can also click the Run with Parameters icon to change the resource group in the Parameters dialog box.

    2. Write code for the node.

      In the SQL editor, enter the code for the node. Example:

      show tables;
      CREATE TABLE IF NOT EXISTS userinfo_new_${var} (
        ip  STRING COMMENT 'IP address',
        uid STRING COMMENT 'User ID'
      )
      PARTITIONED BY (
        dt STRING
      );
      -- You can assign a specific scheduling parameter to the var variable.

      Note
      • The size of an SQL statement for the node cannot exceed 130 KB.

      • If you use an EMR Spark SQL node to query data, a maximum of 10,000 data records can be returned and the total size of the returned data records cannot exceed 10 MB.

      • If multiple EMR compute engines are associated with your workspace, you must select a compute engine based on your business requirements. If only one EMR compute engine is associated with your workspace, you do not need to select a compute engine.

      If you want to change the scheduling parameter that is assigned to the variable in the code, click Run with Parameters in the top toolbar. For more information about the value assignment logic of scheduling parameters, see What are the differences in the value assignment logic of scheduling parameters among the Run, Run with Parameters, and Perform Smoke Testing in Development Environment modes?

      Note

      For more information about how to use scheduling parameters, see Supported formats of scheduling parameters.
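
      For example, suppose that the var variable is assigned the scheduling parameter $[yyyymmdd] in the Parameters dialog box. The following statements show one possible way to write data into a partition of the table that is created in the preceding example. This is a minimal sketch: the userinfo_source table and its columns are hypothetical and are used only for illustration.

      -- At run time, ${var} is replaced with the data timestamp, for example, 20231129.
      INSERT OVERWRITE TABLE userinfo_new_${var} PARTITION (dt = '${var}')
      SELECT ip, uid
      FROM userinfo_source   -- Hypothetical source table that contains the ip and uid columns.
      WHERE dt = '${var}';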

    3. Save and execute the SQL statements.

      In the top toolbar, click the Save icon to save the SQL statements. Then, click the Run icon to execute the SQL statements.

  4. Configure the parameters on the Advanced Settings tab.

    The following content describes the advanced parameters that you can configure for different types of EMR clusters. For more information about how to configure the parameters of the node, see Spark Configuration.

    DataLake cluster or custom cluster (EMR on ECS)

    • "queue": the scheduling queue to which jobs are committed. Default value: default. For information about EMR YARN, see YARN schedulers.

    • "priority": the priority. Default value: 1.

    • "FLOW_SKIP_SQL_ANALYZE": the manner in which SQL statements are executed. Valid values:

      • true: Multiple SQL statements are executed at a time.

      • false: Only one SQL statement is executed at a time.

      Note

      This parameter is available only for testing in the development environment of a DataWorks workspace.

    • "ENABLE_SPARKSQL_JDBC": specifies whether SQL statements are committed by using Java Database Connectivity (JDBC). Valid values:

      • true: The SQL statements are committed by using JDBC. If the EMR cluster does not have the Kyuubi service, the SQL statements are committed to Spark Thrift Server. If the EMR cluster has the Kyuubi service, the SQL statements are committed to Kyuubi by using JDBC. You can configure custom Spark parameters.

        If the parameter is set to true, the metadata lineage feature is supported. However, if the SQL statements are committed to Spark Thrift Server, you cannot view the output information of the metadata of a node.

      • false: The SQL statements are committed by using the spark-submit command. In this case, both Spark 2 and Spark 3 support the metadata lineage feature and allow you to view the output information of a node. You can also configure custom Spark parameters.

        Note
        • If you run the spark-submit command to commit the SQL statements, temporary files and directories are generated in the /tmp directory of the HDFS service of the EMR cluster by default. Make sure that you have the read and write permissions on the directory.

        • If you run the spark-submit command to commit the SQL statements, you can add a SparkConf parameter on the Advanced Settings tab. After you commit the code, DataWorks adds the custom parameter to the command. Example: "spark.driver.memory" : "2g".

      Default value: false.

    Hadoop cluster (EMR on ECS)

    • "queue": the scheduling queue to which jobs are committed. Default value: default. For information about EMR YARN, see YARN schedulers.

    • "priority": the priority. Default value: 1.

    • "FLOW_SKIP_SQL_ANALYZE": the manner in which SQL statements are executed. Valid values:

      • true: Multiple SQL statements are executed at a time.

      • false: Only one SQL statement is executed at a time.

    • "USE_GATEWAY": specifies whether to use a gateway cluster to commit jobs on the current node. Valid values:

      • true: Jobs are committed by using a gateway cluster.

      • false: Jobs are not committed by using a gateway cluster. Jobs are automatically committed to the master node.

    Note

    If the EMR cluster to which the node belongs is not associated with a gateway cluster but the USE_GATEWAY parameter is set to true, jobs may fail to be committed.

    Spark cluster (EMR on ACK)

    "FLOW_SKIP_SQL_ANALYZE": the manner in which SQL statements are executed. Valid values:

    • true: Multiple SQL statements are executed at a time.

    • false: Only one SQL statement is executed at a time.

    You can configure custom Spark parameters. However, you must use complete parameter names. Example: "spark.executor.instances":20. In addition, you can configure global Spark parameters. For more information, see Configure global Spark parameters.
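
    For reference, the following snippet shows what a combination of advanced parameters might look like for a DataLake cluster on which SQL statements are committed by using the spark-submit command. This is an illustrative sketch only; the keys follow the formats described above, and the values must be adjusted based on your cluster type and business requirements.

    "queue" : "default"
    "priority" : 1
    "FLOW_SKIP_SQL_ANALYZE" : "false"
    "ENABLE_SPARKSQL_JDBC" : "false"
    "spark.driver.memory" : "2g"
    "spark.executor.instances" : 20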

  5. Configure scheduling properties.

    If you want the system to periodically run the node, you can click Properties in the right-side navigation pane on the configuration tab of the node to configure properties for the node based on your business requirements.

  6. Commit and deploy the node.

    1. Click the Save icon in the top toolbar to save the node.

    2. Click the Submit icon in the top toolbar to commit the node.

    3. In the Commit Node dialog box, configure the Change description parameter.

    4. Click OK.

    If you use a workspace in standard mode, you must deploy the node in the production environment after you commit the node. On the left side of the top navigation bar, click Deploy. For more information, see Deploy nodes.

  7. View the node.

    1. Click Operation Center in the upper-right corner of the configuration tab of the node to go to Operation Center in the production environment.

    2. View the scheduled node. For more information, see View and manage auto triggered nodes.

    To view more information about the node, click Operation Center in the top navigation bar of the DataStudio page. For more information, see Overview.