DataWorks: EMR Spark SQL node

Last Updated: Feb 05, 2026

You can create an E-MapReduce (EMR) Spark SQL node to process structured data using a distributed SQL query engine, improving job execution efficiency.

Prerequisites

  • To customize the component environment for a node, create a custom image based on the official dataworks_emr_base_task_pod image and use that image in Data Development. For more information, see Custom images.

    For example, you can replace a Spark JAR package or add dependencies on specific libraries, files, or JAR packages when you create a custom image.

  • You have created an Alibaba Cloud E-MapReduce (EMR) cluster and registered it with DataWorks. For more information, see Data Studio: Associate an EMR computing resource.

  • (Required only for RAM users) Add the Resource Access Management (RAM) user responsible for task development to the workspace and assign the user the Developer or Workspace Administrator role. The Workspace Administrator role has extensive permissions, so grant it with caution. For more information about adding members, see Add members to a workspace.

    If you are using an Alibaba Cloud account, you can skip this step.
  • If your task requires a specific environment, you can use the DataWorks custom image feature to build an image with the necessary components. For more information, see Custom images.

Limitations

  • This task type runs only on a Serverless resource group (recommended) or an exclusive resource group for scheduling. If you use an image in Data Development, you must use a Serverless resource group.

  • To manage metadata for DataLake or custom clusters in DataWorks, you must configure EMR-HOOK on the cluster. For more information, see Configure EMR-HOOK for Spark SQL.

    Note

    If EMR-HOOK is not configured on the cluster, you cannot view metadata in real time, generate audit logs, display data lineage, or perform EMR-related data governance tasks in DataWorks.

  • EMR on ACK Spark clusters do not support viewing data lineage, while EMR Serverless Spark clusters do.

  • DataLake and custom clusters support visual function registration, but EMR on ACK and EMR Serverless Spark clusters do not.

Notes

If you enabled Ranger access control for Spark in the EMR cluster that is bound to the current workspace:

  • This feature is available by default when you run Spark tasks that use the default image.

  • To run Spark tasks that use a custom image, you must submit a ticket to upgrade the image to support this feature.

Procedure

  1. In the EMR Spark SQL node editor, complete the following steps.

    Develop SQL code

    Write your task code in the SQL editor. You can define variables by using the ${variable_name} format in your code. Assign a value to the variable in the Schedule section under Parameters on the right side of the node editor page to enable dynamic parameter passing for scheduled jobs. For more information about scheduling parameters, see Sources and expressions of scheduling parameters. The following is an example.

    SHOW TABLES;
    -- Defines a variable named var by using ${var}. If you assign ${yyyymmdd} to this variable, you can create a table with the business date as a suffix.
    CREATE TABLE IF NOT EXISTS userinfo_new_${var} (
        ip STRING COMMENT 'IP address',
        uid STRING COMMENT 'User ID'
    ) PARTITIONED BY (
        dt STRING
    ); -- Can be used with scheduling parameters.
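
    For example, to make ${var} resolve to the business date at run time, assign a value to the parameter in the scheduling configuration on the right side of the node editor. A minimal sketch, assuming the name=value assignment style:

      var=${yyyymmdd}

    With this assignment, a run whose business date is 20240101 creates the table userinfo_new_20240101.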

    (Optional) Configure advanced parameters

    You can configure the advanced parameters listed in the following tables in the EMR Node Parameters or DataWorks Parameters section on the right side of the page.

    Note
    • The available advanced parameters vary based on the EMR cluster type, as shown in the following tables.

    • Additional open-source Spark properties can be configured in Spark Parameters under EMR Node Parameters on the right side of the page.

    DataLake and custom clusters: EMR on ECS

    Parameter

    Description

    queue

    The scheduling queue for job submission. The default queue is default. For more information about EMR YARN, see Basic queue configuration.

    priority

    The job priority. The default value is 1.

    FLOW_SKIP_SQL_ANALYZE

    Specifies how SQL statements are executed. Valid values:

    • true: Executes multiple SQL statements at a time.

    • false (default): Executes one SQL statement at a time.

    Note

    This parameter is applicable only to test runs in the Data Development environment.

    ENABLE_SPARKSQL_JDBC

    Specifies the method for submitting SQL code. Valid values:

    • true: Submits SQL code by using JDBC (Java Database Connectivity). If the EMR cluster does not have a Kyuubi service, the SQL code is submitted to the Spark Thrift-Server. If the EMR cluster has a Kyuubi service, the SQL code is submitted to Kyuubi by using JDBC. This mode supports custom Spark parameters.

      Both methods support metadata and data lineage. However, when submitting jobs to the Thrift-Server, the output information of the corresponding node is not included in the metadata.

    • false (default): Submits SQL code by using the spark-submit cluster mode. In this mode, both Spark 2 and Spark 3 support metadata, data lineage, and output information. You can also customize Spark parameters.

      Note
      • By default, the spark-submit cluster mode creates temporary files and directories in the /tmp directory of the EMR cluster HDFS. You must have read and write permissions for this directory.

      • When you use the spark-submit cluster mode, you can add custom SparkConf parameters in the advanced configuration section. DataWorks automatically adds these parameters to the command when it submits the code, for example, "spark.driver.memory" : "2g".

    DATAWORKS_SESSION_DISABLE

    Specifies the JDBC connection behavior for test runs in the Data Development environment. Valid values:

    • true: Creates a new JDBC connection for each SQL statement execution.

    • false (default): Reuses the same JDBC connection when you run different SQL statements within a single node.

    Note

    If this parameter is set to false, the YARN applicationId is not printed in the logs. To print the YARN applicationId, set this parameter to true.

    Others

    Custom Spark Configuration parameters. You can add Spark-specific properties.

    Use the following format: spark.eventLog.enabled : false. DataWorks automatically appends the parameters to the code submitted to the EMR cluster in the --conf key=value format. For more information about parameter configuration, see Configure global Spark parameters.

    Note
    • DataWorks allows you to set global Spark parameters at the workspace level. These parameters are applied to all DataWorks modules. You can specify whether these global parameters take priority over module-specific Spark parameters.

    • To enable Ranger permission control, add the spark.hadoop.fs.oss.authorization.method=ranger configuration in the Spark Parameters section to ensure that permission control takes effect.
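
    For example, a sketch that reuses the property names already mentioned in this table (actual property names and values depend on your job):

      spark.eventLog.enabled : false
      spark.driver.memory : 2g

    With these entries, DataWorks appends the properties to the submitted job in a form equivalent to --conf spark.eventLog.enabled=false --conf spark.driver.memory=2g.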

    EMR Serverless Spark

    For information about how to set the relevant parameters, see Set parameters for submitting a Spark job.

    Parameter

    Description

    FLOW_SKIP_SQL_ANALYZE

    Specifies how SQL statements are executed. Valid values:

    • true: Executes multiple SQL statements at a time.

    • false (default): Executes one SQL statement at a time.

    Note

    This parameter is applicable only to test runs in the Data Development environment.

    DATAWORKS_SESSION_DISABLE

    Specifies the job submission method. When you run a job in Data Development, the job is submitted to SQL Compute by default. You can use this parameter to specify whether to submit the job to SQL Compute or a resource queue.

    • true: The job is submitted to a resource queue. By default, it uses the default queue specified when the compute resources were bound. If you set DATAWORKS_SESSION_DISABLE to true, you can also configure the SERVERLESS_QUEUE_NAME parameter to specify the queue for job submission during development and execution.

    • false (default): The job is submitted to SQL Compute.

      Note

      This parameter takes effect only when you run jobs in Data Development. It does not take effect for scheduled runs.

    SERVERLESS_RELEASE_VERSION

    Specifies the Spark engine version. By default, the job uses the default engine version configured for the cluster in the Compute Engines section of the Management Center. Use this parameter to specify a different engine version for a specific job.

    Note

    The SERVERLESS_RELEASE_VERSION parameter in the advanced settings takes effect only when the SQL Compute (session) specified for the registered cluster is not started in the EMR Serverless Spark console.

    SERVERLESS_QUEUE_NAME

    Specifies the resource queue for job submission. By default, jobs are sent to the default resource queue configured for the cluster in the Cluster Management section of the Management Center. If you have resource isolation and management requirements, you can add queues and use this parameter to select a different queue. For more information, see Manage resource queues.

    Configuration methods:

    • When you run a job in Data Development: You must set DATAWORKS_SESSION_DISABLE to true to submit the job to a queue. The SERVERLESS_QUEUE_NAME parameter takes effect only in this scenario.

    • When you run a scheduled job from the Operation Center: The job is always submitted to a queue and cannot be submitted to SQL Compute.

    Note

    The SERVERLESS_QUEUE_NAME parameter in the advanced settings takes effect only when the SQL Compute (session) specified for the registered cluster is not started in the EMR Serverless Spark console.

    SERVERLESS_SQL_COMPUTE

    Specifies the SQL Compute (SQL session). By default, the default SQL Compute instance configured for the cluster in the Compute Engines section of the Management Center is used. If you need to set different SQL sessions for different jobs, you can configure this parameter. For more information about how to create and manage SQL sessions, see Manage SQL sessions.

    Others

    Custom Spark Configuration parameters. You can add Spark-specific properties.

    Use the following format: spark.eventLog.enabled : false. DataWorks automatically appends the parameters to the code submitted to the EMR cluster in the --conf key=value format.

    Note

    DataWorks allows you to set global Spark parameters at the workspace level. These parameters are applied to all DataWorks modules. You can specify whether these global parameters take priority over module-specific Spark parameters. For more information about how to set global Spark parameters, see Configure global Spark parameters.
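
    For example, to submit a test run in Data Development to a specific resource queue instead of SQL Compute, you might combine the parameters above as follows (dev_queue is a hypothetical queue name; replace it with a queue that exists in your cluster):

      DATAWORKS_SESSION_DISABLE : true
      SERVERLESS_QUEUE_NAME : dev_queue

    Scheduled runs from the Operation Center are always submitted to a queue, so this combination affects only runs in Data Development.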

    EMR on ACK Spark

    Parameter

    Description

    FLOW_SKIP_SQL_ANALYZE

    Specifies how SQL statements are executed. Valid values:

    • true: Executes multiple SQL statements at a time.

    • false (default): Executes one SQL statement at a time.

    Note

    This parameter is applicable only to test runs in the Data Development environment.

    Others

    Custom Spark Configuration parameters. You can add Spark-specific properties.

    Use the following format: spark.eventLog.enabled : false. DataWorks automatically appends the parameters to the code submitted to the EMR cluster in the --conf key=value format.

    Note

    DataWorks allows you to set global Spark parameters at the workspace level. These parameters are applied to all DataWorks modules. You can specify whether these global parameters take priority over module-specific Spark parameters. For more information about how to set global Spark parameters, see Configure global Spark parameters.

    EMR on ECS Hadoop

    Parameter

    Description

    queue

    The scheduling queue for job submission. The default queue is default. For more information about EMR YARN, see Basic queue configuration.

    priority

    The job priority. The default value is 1.

    FLOW_SKIP_SQL_ANALYZE

    Specifies how SQL statements are executed. Valid values:

    • true: Executes multiple SQL statements at a time.

    • false (default): Executes one SQL statement at a time.

    Note

    This parameter is applicable only to test runs in the Data Development environment.

    USE_GATEWAY

    Specifies whether to submit jobs for this node through a gateway cluster. Valid values:

    • true: Submits jobs through a gateway cluster.

    • false (default): Submits jobs to the header node instead of a gateway cluster.

    Note

    If you set this parameter to true but the cluster is not associated with a gateway cluster, the EMR job submission fails.

    Others

    Custom Spark Configuration parameters. You can add Spark-specific properties.

    Use the following format: spark.eventLog.enabled : false. DataWorks automatically appends the parameters to the code submitted to the EMR cluster in the --conf key=value format. For more information about parameter configuration, see Configure global Spark parameters.

    Note
    • DataWorks allows you to set global Spark parameters at the workspace level. These parameters are applied to all DataWorks modules. You can specify whether these global parameters take priority over module-specific Spark parameters.

    • To enable Ranger permission control, add the spark.hadoop.fs.oss.authorization.method=ranger configuration in the Spark Parameters section to ensure that permission control takes effect.
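
    For example, to route jobs for this node through a gateway cluster and submit them to a non-default YARN queue, you might set the advanced parameters as follows (queue_a is a hypothetical queue name that must already exist in your YARN configuration):

      queue : queue_a
      USE_GATEWAY : true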

    Run SQL job

    1. In the Run Configuration section, configure Compute Engines and resource groups.

      Note
      • You can also configure the scheduling CUs based on the job's resource requirements. The default value is 0.25.

      • To access a data source in a public network or a VPC, you must use a scheduling resource group that has established connectivity with the data source. For more information, see Network connectivity solutions.

    2. In the parameter dialog box on the toolbar, select the appropriate data source and click Run to execute the SQL job.

  2. To run the task on a recurring basis, configure its schedule based on your business requirements. For more information about the configuration, see Node scheduling configuration.

    Note

    To customize the component environment, create a custom image based on the official dataworks_emr_base_task_pod image and use the image in Data Development. For more information, see Custom images.

    For example, you can replace a Spark JAR package or add dependencies on specific libraries, files, or JAR packages when you create the custom image.

  3. After configuring the task, you must deploy the node. For more information, see Node and workflow deployment.

  4. After the task is deployed, you can view its status in Operation Center. For more information, see Getting started with Operation Center.

FAQ

  • Spark-submit failure with Kerberos

  • When running an EMR Spark SQL task in Data Development that submits jobs to SQL Compute, ensure the SQL Compute instance is in the Running state. Otherwise, the job will fail. To view the status of an SQL Compute instance, see Manage SQL sessions.