Apache Kyuubi is a distributed, multi-tenant gateway that lets you run SQL queries against data lake engines—Spark, Flink, and Trino. EMR Kyuubi nodes in DataWorks bring Kyuubi into DataStudio so you can author, test, and schedule Kyuubi tasks alongside other node types in a unified workflow.
Prerequisites
Before you begin, ensure that you have:
An Alibaba Cloud E-MapReduce (EMR) cluster created and registered to DataWorks. For more information, see DataStudio (old version): Associate an EMR computing resource.
A serverless resource group purchased and configured with workspace association and network settings. EMR Kyuubi nodes can only run on a serverless resource group or an exclusive resource group for scheduling; a serverless resource group is the recommended option. For more information, see Create and use a serverless resource group.
A workflow created in DataStudio. All node development in DataStudio is organized around workflows, so you must create one before adding a node. For more information, see Create a workflow.
(Required if using a RAM user) The RAM user added to the workspace as a member with the Develop or Workspace Administrator role assigned. The Workspace Administrator role grants broader permissions than most tasks require—assign it only when necessary. For more information, see Add workspace members and assign roles to them.
Step 1: Create an EMR Kyuubi node
Go to the DataStudio page. Log on to the DataWorks console. In the top navigation bar, select the region. In the left-side navigation pane, choose Data Development and O&M > Data Development. Select the workspace from the drop-down list and click Go to Data Development.
Create an EMR Kyuubi node.
Find the target workflow, right-click the workflow name, and choose Create Node > EMR > EMR Kyuubi.
Note: Alternatively, hover over the Create icon and choose Create Node > EMR > EMR Kyuubi.
In the Create Node dialog box, configure Name, Engine Instance, Node Type, and Path, then click Confirm. The configuration tab of the EMR Kyuubi node opens.
Note: The node name can contain only letters, digits, underscores (_), and periods (.).
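As an illustration, the naming rule above can be expressed as a regular expression. The helper below is hypothetical and not part of DataWorks; it simply encodes the allowed character set:

```python
import re

# Hypothetical validator for the node-name rule above:
# only letters, digits, underscores (_), and periods (.) are allowed.
NODE_NAME_PATTERN = re.compile(r"^[A-Za-z0-9_.]+$")

def is_valid_node_name(name: str) -> bool:
    """Return True if the name uses only the allowed characters."""
    return bool(NODE_NAME_PATTERN.match(name))

print(is_valid_node_name("kyuubi_task.daily"))  # True
print(is_valid_node_name("kyuubi task"))        # False: space not allowed
```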
Step 2: Develop an EMR Kyuubi task
Write SQL code
In the SQL editor, write the node code. Define variables using ${Variable} syntax, then bind scheduling parameter values to those variables in the Scheduling Parameter section of the Properties tab. When the node runs on schedule, DataWorks substitutes the scheduling parameter values into the code automatically. For details on supported formats, see Supported formats of scheduling parameters.
Sample code:
show tables;
select * from kyuubi040702 where age >= '${a}'; -- Assign a scheduling parameter to variable a.
The SQL code for a single task cannot exceed 130 KB.
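To make the substitution behavior concrete, the sketch below mimics how a ${Variable} placeholder could be replaced with a scheduling parameter value before execution, and checks the 130 KB size limit. The replacement function is an illustration only, not the actual DataWorks implementation, which runs server-side:

```python
import re

def substitute_params(sql: str, params: dict) -> str:
    """Replace ${name} placeholders with scheduling parameter values.
    Illustrative only; DataWorks performs this substitution itself."""
    return re.sub(r"\$\{(\w+)\}", lambda m: params[m.group(1)], sql)

sql = "select * from kyuubi040702 where age >= '${a}';"
rendered = substitute_params(sql, {"a": "18"})
print(rendered)  # select * from kyuubi040702 where age >= '18';

# The SQL code for a single task cannot exceed 130 KB.
assert len(rendered.encode("utf-8")) <= 130 * 1024
```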
If multiple EMR computing resources are associated with your workspace, select one from the data source selector. If only one is associated, no selection is required.
Configure advanced parameters (optional)
On the Advanced Settings tab, set any of the following parameters. For the full list of supported Spark properties, see Spark Configuration.
| Parameter | Default | Description |
|---|---|---|
| queue | default | The YARN scheduling queue for submitted jobs. If a workspace-level YARN queue is configured when you register the EMR cluster, the following precedence rules apply: if Whether global configuration takes precedence is set to Yes, the queue from cluster registration is used; otherwise, the queue set on the EMR Kyuubi node is used. See YARN schedulers and Configure a global YARN queue. |
| priority | 1 | The scheduling priority of the job. |
| FLOW_SKIP_SQL_ANALYZE (development environment only) | false | Controls how SQL statements execute. Set to true to run multiple SQL statements at a time; set to false to run one statement at a time. |
| DATAWORKS_SESSION_DISABLE (development environment only) | false | Controls JDBC connection reuse. Set to true to open a new JDBC connection for each SQL statement; set to false to reuse the same connection across statements in the same node. When set to false, the YARN applicationId for the node is not displayed in the logs; set this parameter to true if you need it. |
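The queue precedence rule in the table can be summarized as a small decision function. This is a hypothetical sketch of the rule, not an actual DataWorks API, assuming both the registration-time queue and the node-level queue are known:

```python
def effective_queue(global_takes_precedence: bool,
                    cluster_registration_queue: str,
                    node_queue: str = "default") -> str:
    """Pick the YARN queue a job is submitted to, following the
    precedence rule above (illustrative only)."""
    if global_takes_precedence:
        # "Whether global configuration takes precedence" is Yes:
        # the queue from EMR cluster registration is used.
        return cluster_registration_queue
    # Otherwise the queue set on the EMR Kyuubi node is used.
    return node_queue

print(effective_queue(True, "prod_queue", "adhoc"))   # prod_queue
print(effective_queue(False, "prod_queue", "adhoc"))  # adhoc
```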
Run the task
In the toolbar, click the Run icon. In the Parameters dialog box, select a resource group from the Resource Group Name drop-down list and click Run.
Note: To access a computing resource over the Internet or a virtual private cloud (VPC), use the resource group for scheduling connected to that resource. For more information, see Network connectivity solutions.
To change the resource group later, click the Run with Parameters icon again and select a different group in the Parameters dialog box.
Click the Save icon in the toolbar to save the SQL code.
(Optional) Run smoke testing on the node in the development environment before or after committing. For more information, see Perform smoke testing.
Step 3: Configure scheduling properties
To schedule the task to run periodically, click Properties in the right-side navigation pane and configure the scheduling settings.
Configure the Rerun and Parent Nodes parameters before committing the task.
For a full reference on scheduling options, see Overview.
Step 4: Deploy the task
After configuring the node, commit and deploy the task so DataWorks runs it on the defined schedule.
Click the Save icon in the toolbar to save the task.
Click the Commit icon in the toolbar to commit the task. In the Submit dialog box, fill in the Change description field and choose whether to require a code review.
Note: Configure the Rerun and Parent Nodes parameters on the Properties tab before committing.
When code review is enabled, the committed code can be deployed only after it passes review. For more information, see Code review.
(Standard mode workspaces only) Click Deploy in the upper-right corner of the node configuration tab to deploy the task to the production environment. For more information, see Deploy nodes.
What's next
After the task is committed and deployed, it runs automatically on the configured schedule. To monitor execution status, click Operation Center in the upper-right corner of the node configuration tab. For more information, see View and manage auto triggered tasks.