A CDH Impala node lets you write and run Impala SQL scripts in DataWorks. It offers faster query performance than CDH Hive. Use this guide to configure and run a CDH Impala node end to end.
Prerequisites
Before you begin, ensure that you have:
- An Alibaba Cloud CDH cluster bound to a DataWorks workspace. For details, see Data Studio: Associate a CDH computing resource.
  Important: The Impala component must be installed on your CDH cluster, and its connection information must be configured when you bind the cluster.
- (Optional, RAM users only) The RAM user added to the workspace with the Developer or Workspace Administrator role. Grant the Workspace Administrator role with caution, because it carries extensive permissions. For details, see Add members to a workspace. Root account users can skip this step.
- A Hive data source configured in DataWorks with a successful connectivity test. For details, see Data Source Management.
Create a node
For instructions, see Create a node.
Develop a node
Write your task code in the SQL editor. To pass dynamic values at runtime, define variables in your SQL using the ${VariableName} format. Then assign values to each variable in Scheduling Configuration > Scheduling Parameters on the right side of the node editor. DataWorks substitutes those values when the node runs. For more information, see Sources and expressions of scheduling parameters.
Example:
SHOW TABLES;
SELECT * FROM userinfo;
-- ${var} is replaced at run time with the value assigned in Scheduling Parameters.
SELECT '${var}';
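A common use of this mechanism is filtering on a date partition. The following is a minimal sketch, not a definitive pattern: the columns user_id, user_name, and dt are hypothetical, and it assumes a variable named bizdate has been assigned in Scheduling Configuration > Scheduling Parameters (for example with the built-in expression $bizdate, assumed here to resolve to a date string in yyyymmdd format).

```sql
-- Hypothetical columns on the userinfo table, for illustration only.
-- Assumes the scheduling parameter bizdate is assigned in
-- Scheduling Configuration > Scheduling Parameters, e.g. bizdate=$bizdate.
SELECT user_id,
       user_name
FROM userinfo
WHERE dt = '${bizdate}';  -- DataWorks substitutes ${bizdate} before the query runs
```

Because the substitution is textual, keep the quotes around ${bizdate} when the column holds a string, and omit them when it holds a number.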
Debug a node
- In Run Configuration > Compute Resource, configure the following:

  | Field | What to select |
  | --- | --- |
  | Compute Resource | The CDH cluster you registered in DataWorks. |
  | Resource Group | A Scheduling Resource Group with a successful connection to your data source. For details, see Network connectivity solutions. |

- On the toolbar at the top of the node editor, click Run.
What's next
- Node scheduling configuration: To run the node automatically on a recurring schedule, configure Time Property and related scheduling properties in the Scheduling Configuration panel on the right side of the page.
- Publish a node: To move the node to the production environment, click the publish icon. Only nodes published to the production environment are scheduled.
- Getting started with Operation Center: After publishing, monitor scheduled runs in Operation Center.