
DataWorks: EMR Kyuubi node

Last Updated: Feb 08, 2026

Apache Kyuubi is a distributed, multi-tenant gateway that provides SQL and other query services for data lake engines, including Spark, Flink, and Trino. The EMR Kyuubi node in DataWorks lets you develop Kyuubi tasks, schedule them periodically, and integrate them with other jobs. This topic describes how to configure and use an EMR Kyuubi node for data development.

Prerequisites

  • You have created an Alibaba Cloud EMR cluster and bound it to DataWorks. For more information, see Data Studio: Associate an EMR computing resource.

  • (Optional) If you are a Resource Access Management (RAM) user, ensure that you have been added to the workspace for task development and have been assigned the Developer or Workspace Administrator role. The Workspace Administrator role has extensive permissions. Grant this role with caution. For more information about adding members, see Add members to a workspace.

    If you use an Alibaba Cloud account, you can skip this step.

Limitations

You can run this type of task only on a Serverless resource group (recommended) or an exclusive resource group for scheduling.

Procedure

  1. On the editing page for the EMR Kyuubi node, develop the task.

    Develop SQL code

    In the SQL editing area, develop the task code. You can define variables in the code using the ${variable_name} format. On the right side of the node editing page, you can assign a value to the variable in the Scheduling Parameters field of the Scheduling section. This lets you dynamically pass parameters to the code in scheduling scenarios. For more information about the supported formats for scheduling parameters, see Supported formats for scheduling parameters. The following is an example.

    SHOW TABLES;
    SELECT * FROM kyuubi040702 WHERE age >= '${a}'; -- This statement can be used with scheduling parameters.
    Note

    The maximum size of an SQL statement is 130 KB.
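
    For example, to replace the ${a} variable in the preceding code at run time, you can enter an assignment in the Scheduling Parameters field. The following line is a sketch that assumes the date-based expression format described in Supported formats for scheduling parameters:

    a=$[yyyymmdd]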

    (Optional) Configure advanced parameters

    In the Scheduling section on the right side of the node editing page, you can configure the following node-specific parameters on the EMR Node Parameters > DataWorks Parameters tab.

    Note

    You can configure additional open-source Spark properties under EMR Node Parameters > Spark Parameters in the Scheduling panel on the right.
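
    For example, you can pass standard open-source Spark properties to size the driver and executors. The following values are illustrative only, not tuning recommendations:

    spark.executor.cores=2
    spark.executor.memory=4g
    spark.driver.memory=2g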

    The following advanced parameters are supported:

    • queue

      The scheduling queue to which the job is submitted. Default value: default.

      Note

      If an EMR cluster is registered to a DataWorks workspace and a workspace-level YARN resource queue is configured, the following rules determine which scheduling queue is used when a Kyuubi task runs:

      • If Prioritize Global Configuration is set to Yes, the scheduling queue that was configured during EMR cluster registration is used.

      • If Prioritize Global Configuration is not configured, the scheduling queue that is configured for the EMR Kyuubi node is used.

      For more information about EMR YARN queues, see Basic queue configuration. For more information about queue configuration during EMR cluster registration, see Set a global YARN resource queue.

    • priority

      The priority of the job. Default value: 1.

    • FLOW_SKIP_SQL_ANALYZE

      Specifies how SQL statements are executed. Valid values:

      • true: Multiple SQL statements are executed at a time.

      • false (default): One SQL statement is executed at a time.

      Note

      This parameter takes effect only for test runs in the data development environment.

    • DATAWORKS_SESSION_DISABLE

      Specifies whether to reuse the JDBC connection during test runs in the development environment. Valid values:

      • true: A new Java Database Connectivity (JDBC) connection is created each time an SQL statement is run.

      • false (default): The same JDBC connection is reused when different SQL statements are run on the same node.

      Note

      If this parameter is set to false, the YARN applicationId is not printed. To print the applicationId, set this parameter to true.
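
    As a minimal sketch of what FLOW_SKIP_SQL_ANALYZE controls, assume a node whose code contains the following two statements. The kyuubi_tmp table name is hypothetical and used only for illustration:

    CREATE TABLE IF NOT EXISTS kyuubi_tmp AS SELECT * FROM kyuubi040702 WHERE age >= '${a}'; -- Statement 1
    SELECT COUNT(*) FROM kyuubi_tmp; -- Statement 2

    If FLOW_SKIP_SQL_ANALYZE is set to true, both statements are submitted for execution at a time. With the default value false, each statement is executed separately, one at a time.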

    Execute the SQL task

    1. In the Run Configuration section of the Debug Configuration panel, configure the Computing Resource and DataWorks Resource Group settings.

      Note
      • You can also specify the number of compute units (CUs) based on the resource requirements of the task. The default value is 0.25 CUs.

      • If you want to access a data source over the internet or through a VPC, you must use a scheduling resource group that is connected to the data source. For more information, see Network Connectivity Solutions.

    2. In the parameter dialog box in the toolbar, select your data source, and then click Run to execute the SQL job.

  2. To run a task on the node periodically, configure its scheduling properties. For more information, see Node Scheduling.

  3. After you configure the node, deploy it. For more information, see Node/workflow deployment.

  4. After the node is deployed, you can view its status in Operation Center. For more information, see Getting started with Operation Center.

FAQ

  • Q: The node fails to run and a connection timeout error is reported. What should I do?

    A: Check the network connectivity between the resource group and the cluster. If the network is connected, go to the computing resource list page and initialize the computing resource. In the dialog box that appears, click Re-initialize and verify that the initialization succeeds.
