DataWorks: EMR Spark SQL node

Last Updated: Jun 17, 2026

You can create an E-MapReduce (EMR) Spark SQL node to process structured data using a distributed SQL query engine, improving job execution efficiency.

Prerequisites

  • (Optional) If your task needs a custom component environment, create a custom image based on the official dataworks_emr_base_task_pod image before you start node development, and then use it in Data Studio. For more information, see Create a custom image and Use images in data development.

    For example, you can replace Spark JAR packages or add dependencies on specific libraries, files, or JAR packages when you create a custom image.

  • Create an Alibaba Cloud EMR cluster and register it with DataWorks. For more information, see New Data Studio: Attach an EMR compute resource.

  • (Optional, for RAM users) The Resource Access Management (RAM) user for task development must be added to the workspace and assigned the Development or Workspace Administrator role. The Workspace Administrator role includes extensive permissions, so grant it with caution. For more information, see Add workspace members.

    If you are using a root account, skip this step.
  • You can use a custom image to build a specific development environment for your task.

Limitations

  • EMR Spark SQL nodes can run only on a serverless resource group (recommended) or an exclusive resource group for scheduling. Using a custom image for data development requires a serverless resource group.

  • To manage metadata for a DataLake or custom cluster in DataWorks, you must first configure EMR-HOOK on the cluster. For more information, see Configure EMR-HOOK for Spark SQL.

    Note

    Without EMR-HOOK configured on the cluster, you cannot view real-time metadata, generate audit logs, view data lineage, or perform EMR-related governance tasks in DataWorks.

  • EMR on ACK Spark clusters do not support viewing data lineage. EMR Serverless Spark clusters support viewing data lineage.

  • Function registration through the UI is supported on DataLake and custom clusters, but not on EMR on ACK Spark or EMR Serverless Spark clusters.

Notes

If you have enabled Ranger access control for Spark in the EMR cluster associated with the current workspace:

  • Spark tasks that use the default image automatically support this feature.

  • To run a Spark task with a custom image, submit a ticket to technical support to request an image upgrade.

Procedure

  1. On the EMR Spark SQL node configuration tab, perform the following development operations.

    Develop SQL code

    In the SQL editor, write your task code. Define variables using the ${variable_name} format and assign their values in the Scheduling Settings > Scheduling Parameters section on the right side of the node configuration tab to dynamically pass parameters to scheduled tasks. For more information about how to use scheduling parameters, see Sources and expressions of scheduling parameters. The following code provides an example.

    SHOW TABLES;
    -- Define a variable named var by using ${var}. If you set this variable to ${yyyymmdd}, you can create a table with the business date as a suffix.
    CREATE TABLE IF NOT EXISTS userinfo_new_${var} (
      ip STRING COMMENT 'IP address',
      uid STRING COMMENT 'User ID'
    ) PARTITIONED BY (
      dt STRING
    ); -- Can be used with scheduling parameters.
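To make the effect of the ${var} placeholder concrete, the following Python sketch imitates the substitution locally. It is only an illustration: DataWorks performs the real replacement when the node runs, and the helper name render_sql and the sample date are assumptions made for this example.

```python
from datetime import date
from string import Template

# The CREATE TABLE statement from the example, with the ${var} placeholder kept.
SQL_TEMPLATE = Template(
    "CREATE TABLE IF NOT EXISTS userinfo_new_${var} (\n"
    "  ip STRING COMMENT 'IP address',\n"
    "  uid STRING COMMENT 'User ID'\n"
    ") PARTITIONED BY (dt STRING);"
)


def render_sql(business_date: date) -> str:
    """Hypothetical helper: format the date as yyyymmdd and substitute it for var."""
    return SQL_TEMPLATE.substitute(var=business_date.strftime("%Y%m%d"))


rendered = render_sql(date(2024, 1, 15))
print(rendered.splitlines()[0])
# CREATE TABLE IF NOT EXISTS userinfo_new_20240115 (
```

Each run with a different business date therefore targets a differently named table, which is what the date suffix in the example is meant to achieve.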

    (Optional) Configure EMR node parameters

    In the EMR Node Parameters section of the Run Configuration pane, configure the following parameters:

    • Spark parameter: Built-in Spark properties. For more information, see open source Spark configuration. You can load a Spark configuration template from Serverless Spark instead of entering values manually, which simplifies configuration and keeps settings consistent.

    • DataWorks parameters: The advanced parameters that you can configure vary by EMR cluster type. The following tables describe the parameters for each cluster type.

    DataLake cluster/Custom cluster: EMR on ECS

    Advanced parameter

    Description

    queue

    The scheduling queue for job submission. The default value is default. For more information about EMR YARN, see Basic queue configurations.

    priority

    The priority. Default value: 1.

    FLOW_SKIP_SQL_ANALYZE

    The SQL statement execution mode. Valid values:

    • true: Multiple SQL statements are executed at a time.

    • false (default): A single SQL statement is executed at a time.

    Note

    This parameter is supported only for test runs in the data development environment.

    ENABLE_SPARKSQL_JDBC

    The method used to submit SQL code. Valid values:

    • true: SQL code is submitted through JDBC. If the EMR cluster does not have the Kyuubi service, SQL code is submitted to Spark Thrift-Server. If the EMR cluster has the Kyuubi service, SQL code is submitted to Kyuubi through JDBC, and custom Spark parameters are supported.

      Both methods support metadata lineage. However, tasks submitted to Thrift-Server lack output information for the corresponding node tasks in the metadata.

    • false (default): SQL code is submitted using the Spark-submit cluster method. In this mode, both Spark2 and Spark3 support metadata lineage and output information. Custom Spark parameters are also supported.

      Note
      • The Spark-submit cluster method creates temporary files and directories in the /tmp directory on the HDFS of the EMR cluster by default. Make sure that the directory has read and write permissions.

      • When you use the Spark-submit cluster method, you can directly append custom SparkConf parameters in the advanced configurations. DataWorks automatically adds the new parameters to the command when you submit the code. Example: "spark.driver.memory" : "2g".

    DATAWORKS_SESSION_DISABLE

    This parameter is applicable to test runs in the development environment. Valid values:

    • true: A new JDBC connection is created each time an SQL statement is run.

    • false (default): The same JDBC connection is reused when different SQL statements are run within a single node.

    Note

    When this parameter is set to false, the Hive yarn applicationId is not printed. To print the yarn applicationId, set this parameter to true.

    Others

    Custom Spark Configuration parameters. Add Spark-specific property parameters.

    Configuration format: "spark.eventLog.enabled" : false, "spark.eventLog.memory" : "12g". DataWorks automatically appends the parameters to the code submitted to the EMR cluster in the format: --conf key=value. For more information about parameter configurations, see Configure global Spark parameters.

    Note
    • DataWorks allows you to configure global Spark parameters at the workspace level for each DataWorks module. You can specify whether the priority of the global Spark parameters is higher than that of Spark parameters configured within a specific module.

    • For Ranger access control to take effect, add spark.hadoop.fs.oss.authorization.method=ranger to the global Spark parameter configurations.
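The mapping from the "key" : value configuration format to --conf key=value arguments, described for the Others parameter, can be sketched as follows. This is a rough illustration under stated assumptions, not the DataWorks implementation: the function name to_conf_args is hypothetical, and the sketch assumes the fragment becomes valid JSON once it is wrapped in braces.

```python
import json


def to_conf_args(fragment: str) -> list[str]:
    """Hypothetical helper: turn '"key" : value, ...' into ['--conf', 'key=value', ...]."""
    # Wrapping the fragment in braces turns it into a JSON object (an assumption).
    pairs = json.loads("{" + fragment + "}")
    args = []
    for key, value in pairs.items():
        # JSON booleans load as Python bools; render them as lowercase true/false.
        text = str(value).lower() if isinstance(value, bool) else str(value)
        args += ["--conf", f"{key}={text}"]
    return args


print(to_conf_args('"spark.eventLog.enabled" : false, "spark.driver.memory" : "2g"'))
# ['--conf', 'spark.eventLog.enabled=false', '--conf', 'spark.driver.memory=2g']
```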

    EMR Serverless Spark cluster

    For information about related parameter settings, see Spark task parameter settings.

    Advanced parameter

    Description

    FLOW_SKIP_SQL_ANALYZE

    The SQL statement execution mode. Valid values:

    • true: Multiple SQL statements are executed at a time.

    • false (default): A single SQL statement is executed at a time.

    Note

    This parameter is supported only for test runs in the data development environment.

    DATAWORKS_SESSION_DISABLE

    The task submission method. During data development, tasks are submitted to SQL Compute for execution by default. You can use this parameter to specify whether the task is executed through SQL Compute or submitted to a queue for execution.

    • true: The task is submitted to a queue for execution. By default, the default queue specified when the compute resource was associated is used. When DATAWORKS_SESSION_DISABLE is set to true, you can configure the SERVERLESS_QUEUE_NAME parameter to specify the queue to which tasks are submitted during data development.

    • false (default): The task is submitted to SQL Compute for execution.

      Note

      This parameter takes effect only during data development execution, not during scheduled runs.

    SERVERLESS_RELEASE_VERSION

    The Spark engine version. By default, the Default Engine Version configured in the cluster settings under Compute Resource in Management Center is used. To set a different engine version for a specific task, configure this parameter.

    Note

    The SERVERLESS_RELEASE_VERSION parameter in the advanced configurations takes effect only when the SQL Compute (session) specified for the registered cluster is in a stopped state in the EMR Serverless Spark console.

    SERVERLESS_QUEUE_NAME

    The resource queue to which tasks are submitted. When a task is configured to be submitted to a queue for execution, the Default Resource Queue configured in the cluster settings under Clusters in Management Center is used by default. If you have resource isolation and management requirements, you can add queues. For more information, see Manage resource queues.

    Configuration methods:

    • Specify the resource queue for task submission by configuring node parameters.

    • Specify the resource queue for task submission by configuring global Spark parameters.

    Note
    • The SERVERLESS_QUEUE_NAME parameter in the advanced configurations takes effect only when the SQL Compute (session) specified for the registered cluster is in a stopped state in the EMR Serverless Spark console.

    • During data development execution: You must first set DATAWORKS_SESSION_DISABLE to true so that tasks are submitted to a queue for execution. Only then does the SERVERLESS_QUEUE_NAME parameter take effect for specifying the task queue.

    • During Operation Center scheduled execution: Tasks are forcibly submitted to a queue for execution and cannot be submitted to SQL Compute.

    SERVERLESS_SQL_COMPUTE

    The SQL Compute (SQL session) to use. By default, the Default SQL Compute configured in the cluster settings under Compute Resource in Management Center is used. To set a different SQL session for a specific task, configure this parameter. For information about how to create and manage SQL sessions, see Manage SQL Compute sessions.

    Others

    Custom Spark Configuration parameters. Add Spark-specific property parameters.

    Configuration format: "spark.eventLog.enabled": false. DataWorks automatically appends the parameters to the code submitted to the EMR cluster in the format: --conf key=value.

    Note

    DataWorks allows you to configure global Spark parameters at the workspace level for each DataWorks module. You can specify whether the priority of the global Spark parameters is higher than that of Spark parameters configured within a specific module. For more information about configuring global Spark parameters, see Configure global Spark parameters.
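The submission rules that the table above gives for DATAWORKS_SESSION_DISABLE, SERVERLESS_QUEUE_NAME, and SERVERLESS_SQL_COMPUTE can be summarized in a small decision function. The sketch below is a simplified model for reading the table, not DataWorks code: the function name, its arguments, and the sample queue and session names are hypothetical, and it does not model the condition that some parameters take effect only while the SQL Compute session is stopped.

```python
def resolve_target(
    scheduled_run: bool,
    params: dict,
    default_queue: str,
    default_sql_compute: str,
) -> tuple[str, str]:
    """Hypothetical model: return ("queue", name) or ("sql_compute", name)."""
    flag = str(params.get("DATAWORKS_SESSION_DISABLE", "false")).lower()
    session_disabled = flag == "true"
    # Scheduled runs always go to a queue; development runs do so only
    # when DATAWORKS_SESSION_DISABLE is true.
    if scheduled_run or session_disabled:
        return "queue", params.get("SERVERLESS_QUEUE_NAME", default_queue)
    return "sql_compute", params.get("SERVERLESS_SQL_COMPUTE", default_sql_compute)


# Development run with defaults: executed through SQL Compute.
print(resolve_target(False, {}, "root_queue", "default_session"))
# ('sql_compute', 'default_session')

# Development run with the session disabled and an explicit queue.
dev_params = {"DATAWORKS_SESSION_DISABLE": "true", "SERVERLESS_QUEUE_NAME": "etl_queue"}
print(resolve_target(False, dev_params, "root_queue", "default_session"))
# ('queue', 'etl_queue')
```

In this model, setting SERVERLESS_QUEUE_NAME alone has no effect on a development run, which mirrors the requirement to set DATAWORKS_SESSION_DISABLE to true first.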

    Spark cluster: EMR on ACK

    Advanced parameter

    Description

    FLOW_SKIP_SQL_ANALYZE

    The SQL statement execution mode. Valid values:

    • true: Multiple SQL statements are executed at a time.

    • false (default): A single SQL statement is executed at a time.

    Note

    This parameter is supported only for test runs in the data development environment.

    Others

    Custom Spark Configuration parameters. Add Spark-specific property parameters.

    Configuration format: "spark.eventLog.enabled": false. DataWorks automatically appends the parameters to the code submitted to the EMR cluster in the format: --conf key=value.

    Note

    DataWorks allows you to configure global Spark parameters at the workspace level for each DataWorks module. You can specify whether the priority of the global Spark parameters is higher than that of Spark parameters configured within a specific module. For more information about configuring global Spark parameters, see Configure global Spark parameters.

    Hadoop cluster: EMR on ECS

    Advanced parameter

    Description

    queue

    The scheduling queue for job submission. The default value is default. For more information about EMR YARN, see Basic queue configurations.

    priority

    The priority. Default value: 1.

    FLOW_SKIP_SQL_ANALYZE

    The SQL statement execution mode. Valid values:

    • true: Multiple SQL statements are executed at a time.

    • false (default): A single SQL statement is executed at a time.

    Note

    This parameter is supported only for test runs in the data development environment.

    USE_GATEWAY

    Specifies whether to submit jobs through a Gateway cluster. Valid values:

    • true: Jobs are submitted through a Gateway cluster.

    • false (default): Jobs are not submitted through a Gateway cluster. Instead, jobs are submitted to the header node by default.

    Note

    If the cluster of the current node is not associated with a Gateway cluster, manually setting this parameter to true causes subsequent EMR job submissions to fail.

    Others

    Custom Spark Configuration parameters. Add Spark-specific property parameters.

    Configuration format: "spark.eventLog.enabled": false. DataWorks automatically appends the parameters to the code submitted to the EMR cluster in the format: --conf key=value. For more information about parameter configurations, see Configure global Spark parameters.

    Note
    • DataWorks allows you to configure global Spark parameters at the workspace level for each DataWorks module. You can specify whether the priority of the global Spark parameters is higher than that of Spark parameters configured within a specific module.

    • For Ranger access control to take effect, add spark.hadoop.fs.oss.authorization.method=ranger to the global Spark parameter configurations.
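As a compact view of the defaults in the Hadoop cluster table above, the following sketch overlays user-supplied values on those defaults and checks USE_GATEWAY before submission. The helper effective_params and its has_gateway argument are hypothetical. The table only states that job submission fails when USE_GATEWAY is true without an associated Gateway cluster; raising an error early is a design choice of this sketch.

```python
# Default values as listed in the Hadoop cluster table.
HADOOP_DEFAULTS = {
    "queue": "default",
    "priority": 1,
    "FLOW_SKIP_SQL_ANALYZE": False,
    "USE_GATEWAY": False,
}


def effective_params(overrides: dict, has_gateway: bool) -> dict:
    """Hypothetical helper: merge overrides into the defaults and pre-check USE_GATEWAY."""
    params = {**HADOOP_DEFAULTS, **overrides}
    if params["USE_GATEWAY"] and not has_gateway:
        # Submission would fail later in this situation, so fail early here.
        raise ValueError("USE_GATEWAY is true but no Gateway cluster is associated")
    return params


print(effective_params({"queue": "etl"}, has_gateway=False))
# {'queue': 'etl', 'priority': 1, 'FLOW_SKIP_SQL_ANALYZE': False, 'USE_GATEWAY': False}
```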

    Run an SQL task

    1. In Run Configuration, select and configure the Compute Resource and Resource Group.

      Note
      • You can also configure CUs for Scheduling based on the resources required for task execution. The default CU value is 0.25.

      • To access data sources over the Internet or in a VPC, use a resource group for scheduling that has passed the connectivity test with the data source. For more information, see Configure network connectivity.

    2. In the toolbar parameter dialog, select the corresponding data source and click Run to run the SQL task.

    (Optional) Configure assignment parameters

    To pass the query results of this node to downstream nodes, go to the Node Context Parameters section in the Scheduling Settings pane on the right side, and click Add Assignment Parameter. The system automatically adds an output parameter named outputs. The value of this parameter is the query result of the last line of code in this node. For more information, see Use node context parameters.

  2. To run the node task periodically, configure scheduling information based on your business requirements. For more information, see Configure schedule settings.

    Note

    If you need to customize the component environment, you can create a custom image based on the official image dataworks_emr_base_task_pod and use the image in Data Development.

    For example, when you create a custom image, you can replace Spark JAR packages or depend on specific libraries, files, or JAR packages.

  3. After the node task is configured, you must deploy the node. For more information, see Deploy a node.

  4. After the task is deployed, you can view the running status of scheduled tasks in Operation Center. For more information, see View scheduled tasks.

FAQ