
DataWorks:EMR Kyuubi node

Last Updated:Mar 26, 2026

Apache Kyuubi is a distributed, multi-tenant gateway that provides SQL query services for data lake engines such as Spark, Flink, and Trino. The EMR Kyuubi node in DataWorks lets you develop Kyuubi tasks, schedule them periodically, and integrate them with other jobs. This topic describes how to configure and use an EMR Kyuubi node for data development.

Prerequisites

Before you begin, ensure that you have:

  • An Alibaba Cloud EMR cluster associated with DataWorks. For details, see Data Studio: Associate an EMR computing resource.

  • (Optional) If you are a Resource Access Management (RAM) user, confirm that you have been added to the workspace and assigned the Developer or Workspace Administrator role.

    The Workspace Administrator role has extensive permissions. Grant it with caution. For instructions, see Add members to a workspace. Alibaba Cloud account users can skip this step.

Limitations

EMR Kyuubi tasks can only run on a Serverless resource group (recommended) or an exclusive resource group for scheduling. Other resource group types are not supported.

Develop and run an EMR Kyuubi node

Step 1: Write SQL code

In the SQL editing area on the node editing page, write your task code.

To pass dynamic values at scheduling time, define variables in the ${variable_name} format, then assign values to them in the Scheduling Parameters field under the Scheduling section on the right panel. For supported parameter formats, including built-in scheduling variables such as $bizdate and $cyctime, see Supported formats for scheduling parameters.

SHOW TABLES;
SELECT * FROM kyuubi040702 WHERE age >= '${a}'; -- Use scheduling parameters to pass dynamic values.
Note The maximum size of a single SQL statement is 130 KB.
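Conceptually, before the SQL is submitted, the scheduler replaces each ${variable_name} placeholder with the value assigned in Scheduling Parameters. The following Python sketch illustrates that substitution; the resolver function and the resolved value are illustrative, not DataWorks internals:

```python
import re

def resolve_placeholders(sql: str, params: dict) -> str:
    """Replace each ${name} placeholder with its scheduled value."""
    return re.sub(r"\$\{(\w+)\}", lambda m: str(params[m.group(1)]), sql)

sql = "SELECT * FROM kyuubi040702 WHERE age >= '${a}';"
# At run time, DataWorks supplies the value, e.g. a=$bizdate resolved
# to the business date 20250326 (hypothetical value).
print(resolve_placeholders(sql, {"a": "20250326"}))
# SELECT * FROM kyuubi040702 WHERE age >= '20250326';
```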

By default, each SQL statement in the node runs one at a time. To run multiple statements in a single batch instead, set FLOW_SKIP_SQL_ANALYZE to true in the advanced parameters (see Step 2).
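The difference between the two modes can be pictured as splitting the script on semicolons and submitting each piece separately, versus submitting the whole script at once. A minimal Python sketch of that idea (the splitting is deliberately naive and ignores semicolons inside string literals and comments, which a real SQL parser must handle):

```python
def split_statements(script: str, flow_skip_sql_analyze: bool) -> list:
    """Return the units that would be submitted to the engine."""
    if flow_skip_sql_analyze:
        return [script]  # true: the whole script runs as one batch
    # false: each statement is submitted one at a time
    return [s.strip() + ";" for s in script.split(";") if s.strip()]

script = "SHOW TABLES;\nSELECT 1;"
print(split_statements(script, False))  # two separate submissions
print(split_statements(script, True))   # one batch submission
```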

Step 2: Configure advanced parameters (optional)

Under Scheduling > EMR Node Parameters > DataWorks Parameters on the right panel, configure the following parameters as needed.

Note To configure open-source Spark properties, go to EMR Node Parameters > Spark Parameters in the same panel. For available properties, see Apache Spark configuration.
  • queue (default: default): The YARN resource queue to which the job is submitted.

  • priority (default: 1): The job priority.

  • FLOW_SKIP_SQL_ANALYZE (default: false): Controls how multiple SQL statements are executed. Set to true to run all statements in a single batch. Set to false to run them one at a time. Applies to test runs in the Data Development environment only.

  • DATAWORKS_SESSION_DISABLE (default: false): Controls Java Database Connectivity (JDBC) connection behavior during development testing. Set to true to create a new JDBC connection for each SQL statement (this also prints the Hive YARN applicationId). Set to false to reuse the same connection across statements on the same node.
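The effect of DATAWORKS_SESSION_DISABLE can be sketched as follows. The connection class below is a stand-in for illustration only, not the actual DataWorks JDBC client:

```python
class FakeConnection:
    """Stand-in for a JDBC connection; each instance gets a unique id."""
    _counter = 0

    def __init__(self):
        FakeConnection._counter += 1
        self.conn_id = FakeConnection._counter

def run_statements(statements, session_disable: bool):
    """Return the connection id used for each statement."""
    shared = None
    used = []
    for _ in statements:
        if session_disable:
            conn = FakeConnection()  # true: fresh connection per statement
        else:
            if shared is None:
                shared = FakeConnection()  # false: one connection, reused
            conn = shared
        used.append(conn.conn_id)
    return used

print(run_statements(["SHOW TABLES;", "SELECT 1;"], session_disable=False))  # same id twice
print(run_statements(["SHOW TABLES;", "SELECT 1;"], session_disable=True))   # two distinct ids
```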

Queue selection rules

If a workspace-level YARN Resource Queue is configured when the EMR cluster is registered:

  • If Prioritize Global Configuration is set to Yes, the scheduling queue from EMR cluster registration is used.

  • If Prioritize Global Configuration is not configured, the queue configured on the EMR Kyuubi node is used.

For more information, see Basic queue configuration and Set a global YARN resource queue.
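The resolution order above can be expressed as a small helper function. The argument names and queue values are illustrative, not actual DataWorks configuration fields:

```python
def resolve_queue(prioritize_global: bool,
                  global_queue: str,
                  node_queue: str = "default") -> str:
    """Pick the YARN queue following the workspace-level rules."""
    if prioritize_global and global_queue:
        return global_queue  # global configuration wins when prioritized
    return node_queue        # otherwise the node-level queue applies

print(resolve_queue(True, "prod_queue", "default"))    # prod_queue
print(resolve_queue(False, "prod_queue", "dev_queue")) # dev_queue
```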

Step 3: Run the task

  1. In the Debug Configuration > Run Configuration section, set the Computing Resource and DataWorks Resource Group:

    • The default compute unit (CU) is 0.25. Adjust it based on your task's resource requirements.

    • If you need to access a data source over the internet or through a VPC, select a scheduling resource group that is connected to that data source. For details, see Network connectivity solutions.

  2. In the parameter dialog box that opens from the toolbar, select your data source, and then click Run.

Step 4: Configure scheduling

To run the node on a recurring schedule, configure its scheduling properties. For details, see Node scheduling.

Step 5: Deploy the node

After configuration is complete, deploy the node. For details, see Node/workflow deployment.

Step 6: Monitor in Operation Center

After deployment, track the node's run status and history in Operation Center. For details, see Getting started with Operation Center.

FAQ

The node fails with a connection timeout error. What should I do?

This error typically means the resource group cannot reach the EMR cluster. To resolve it, go to the computing resource list page and initialize the resource. In the dialog box that appears, click Re-initialize and verify that initialization completes successfully before retrying the task.
