Apache Kyuubi is a distributed, multi-tenant gateway that provides query services, such as SQL, for data lake query engines including Spark, Flink, and Trino. The EMR Kyuubi node in DataWorks lets you develop Kyuubi tasks, schedule them periodically, and integrate them with other jobs. This topic describes how to configure and use an EMR Kyuubi node for data development.
Prerequisites
Before you begin, ensure that you have:
- An Alibaba Cloud EMR cluster associated with DataWorks. For details, see Data Studio: Associate an EMR computing resource.
- (Optional) If you are a Resource Access Management (RAM) user, confirm that you have been added to the workspace and assigned the Developer or Workspace Administrator role.
  The Workspace Administrator role has extensive permissions. Grant it with caution. For instructions, see Add members to a workspace. Alibaba Cloud account users can skip this step.
Limitations
EMR Kyuubi tasks can only run on a Serverless resource group (recommended) or an exclusive resource group for scheduling. Other resource group types are not supported.
Develop and run an EMR Kyuubi node
Step 1: Write SQL code
In the SQL editing area on the node editing page, write your task code.
To pass dynamic values at scheduling time, define variables in the ${variable_name} format, then assign values to them in the Scheduling Parameters field under the Scheduling section on the right panel. For supported parameter formats, including built-in scheduling variables such as $bizdate and $cyctime, see Supported formats for scheduling parameters.
SHOW TABLES;
SELECT * FROM kyuubi040702 WHERE age >= '${a}'; -- Use scheduling parameters to pass dynamic values.
By default, each SQL statement in the node runs one at a time. To run multiple statements in a single batch instead, set FLOW_SKIP_SQL_ANALYZE to true in the advanced parameters (see Step 2).
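The `${variable_name}` substitution described in this step can be pictured as plain text templating. The following Python sketch is only an illustration of that idea, not how DataWorks implements it: `render_sql` is a hypothetical helper, and the value `18` assigned to `a` is made up for the example.

```python
import re
from typing import Mapping

# Matches placeholders written as ${variable_name}.
_PLACEHOLDER = re.compile(r"\$\{(\w+)\}")


def render_sql(sql: str, params: Mapping[str, str]) -> str:
    """Replace every ${name} in sql with params[name] (illustrative only)."""

    def _substitute(match: "re.Match[str]") -> str:
        name = match.group(1)
        if name not in params:
            # Assumption: an unassigned variable is treated as an error here.
            raise KeyError(f"no value assigned for scheduling parameter '{name}'")
        return params[name]

    return _PLACEHOLDER.sub(_substitute, sql)


if __name__ == "__main__":
    statement = "SELECT * FROM kyuubi040702 WHERE age >= '${a}';"
    print(render_sql(statement, {"a": "18"}))
    # prints: SELECT * FROM kyuubi040702 WHERE age >= '18';
```

Raising an error for an unassigned variable is a choice made for this sketch. Check a test run in DataWorks to see how the platform handles a variable that has no value.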
Step 2: Configure advanced parameters (optional)
Under Scheduling > EMR Node Parameters > DataWorks Parameters on the right panel, configure the following parameters as needed.
| Parameter | Default | Description |
|---|---|---|
| queue | default | The YARN resource queue to which the job is submitted. |
| priority | 1 | The job priority. |
| FLOW_SKIP_SQL_ANALYZE | false | Controls how multiple SQL statements are executed. Set to true to run all statements in a single batch. Set to false to run them one at a time. Applies to test runs in the Data Development environment only. |
| DATAWORKS_SESSION_DISABLE | false | Controls Java Database Connectivity (JDBC) connection behavior during development testing. Set to true to create a new JDBC connection for each SQL statement (also prints the Hive YARN applicationId). Set to false to reuse the same connection across statements on the same node. |
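To make the effect of FLOW_SKIP_SQL_ANALYZE concrete, the following Python sketch models the two modes as "one submission per statement" versus "one submission for the whole script". It is a conceptual illustration under stated assumptions, not DataWorks code: `plan_submissions` is a hypothetical name, and the naive split on `;` assumes that no semicolons appear inside string literals or comments.

```python
from typing import List


def plan_submissions(script: str, skip_sql_analyze: bool = False) -> List[str]:
    """Return the units of work that would be submitted (illustrative only)."""
    # Naive split: assumes ';' only ever terminates a statement.
    statements = [part.strip() + ";" for part in script.split(";") if part.strip()]
    if skip_sql_analyze:
        # true: all statements go out together as a single batch.
        return ["\n".join(statements)] if statements else []
    # false (default): each statement is submitted one at a time.
    return statements


if __name__ == "__main__":
    script = "SHOW TABLES;\nSELECT 1;"
    print(len(plan_submissions(script)))        # prints: 2
    print(len(plan_submissions(script, True)))  # prints: 1
```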
Queue selection rules
If a workspace-level YARN Resource Queue is configured when the EMR cluster is registered:
- If Prioritize Global Configuration is set to Yes, the scheduling queue from EMR cluster registration is used.
- If Prioritize Global Configuration is not configured, the queue configured on the EMR Kyuubi node is used.
For more information, see Basic queue configuration and Set a global YARN resource queue.
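The queue selection rules above can be summarized as a small decision function. This Python sketch is an illustration only: `resolve_queue` and its argument names are hypothetical. It also assumes that the node-level queue applies when no workspace-level queue is configured, and that any setting other than Yes behaves like "not configured". The source does not state either case explicitly.

```python
from typing import Optional


def resolve_queue(
    node_queue: str = "default",
    global_queue: Optional[str] = None,
    prioritize_global: Optional[bool] = None,
) -> str:
    """Pick the YARN queue for a run, following the rules in this section."""
    # Rule 1: a workspace-level queue exists and Prioritize Global
    # Configuration is Yes, so the queue from cluster registration wins.
    if global_queue and prioritize_global is True:
        return global_queue
    # Rule 2: otherwise the queue set on the EMR Kyuubi node is used
    # (the parameter table lists "default" as its default value).
    return node_queue


if __name__ == "__main__":
    print(resolve_queue("etl", "shared", True))  # prints: shared
    print(resolve_queue("etl", "shared", None))  # prints: etl
```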
Step 3: Run the task
- In the Debug Configuration > Run Configuration section, set the Computing Resource and DataWorks Resource Group:
  - The default compute unit (CU) is 0.25. Adjust it based on your task's resource requirements.
  - If you need to access a data source over the internet or through a VPC, select a scheduling resource group that is connected to that data source. For details, see Network connectivity solutions.
- In the toolbar's parameter dialog box, select your data source and click Run.
Step 4: Configure scheduling
To run the node on a recurring schedule, configure its scheduling properties. For details, see Node scheduling.
Step 5: Deploy the node
After configuration is complete, deploy the node. For details, see Node/workflow deployment.
Step 6: Monitor in Operation Center
After deployment, track the node's run status and history in Operation Center. For details, see Getting started with Operation Center.
FAQ
The node fails with a connection timeout error. What should I do?
This error typically means the resource group cannot reach the EMR cluster. To resolve it, go to the computing resource list page and initialize the resource. In the dialog box that appears, click Re-initialize and verify that initialization completes successfully before retrying the task.
