If you have a Cloudera Distribution for Hadoop (CDH) cluster, you can use CDH Hive nodes in DataWorks to run Hive tasks, such as data query jobs and batch data processing. This topic describes how to configure and use CDH Hive nodes.
Prerequisites
An Alibaba Cloud CDH cluster is created and attached to a DataWorks workspace. For more information, see Data Development (New): Attach a CDH computing resource.
Important: The Hive component must be installed on the CDH cluster, and the Hive connection information must be configured when you attach the cluster.
(Optional) If you develop tasks as a RAM user, the RAM user must be added to the corresponding workspace and assigned the Developer or Workspace Administrator role. The Workspace Administrator role has extensive permissions; grant it with caution. For more information about adding members, see Add members to a workspace.
Note: If you use an Alibaba Cloud account, you can skip this step.
A Hive data source is configured in DataWorks and passes the connectivity test. For more information, see Data Source Management.
Limits
You can run this type of task on Serverless resource groups (recommended) or legacy exclusive resource groups.
Create a node
For more information, see Create a node.
Develop the node
In the SQL editing area, develop the code for the node. In your code, use the ${variable_name} format to define a variable. Then, assign a value to the variable under Scheduling Parameters in the Scheduling Configurations section on the right side of the node editing page. This lets you pass parameters to the code dynamically in scheduling scenarios. For more information, see Supported formats for scheduling parameters. The following is an example.
SHOW TABLES;
SELECT * FROM userinfo;
-- You can use this with scheduling parameters.
SELECT '${var}';
Test the node
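Before the node runs, DataWorks replaces each ${variable_name} placeholder with the value assigned to it. As a sketch, assume the Scheduling Parameters section contains the hypothetical assignment var=$[yyyymmdd-1], which resolves to the previous day's date in yyyymmdd format. A run on June 2, 2024 would then execute the second statement below instead of the first:
-- Code as written in the editor:
SELECT '${var}';
-- Statement actually executed after parameter substitution:
SELECT '20240601';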
In the Computing Resources section of Debug Configuration, you can configure the Computing Resource and Resource Group.
Set Computing Resource to the name of the CDH cluster that you registered in DataWorks.
Set Resource Group to the scheduling resource group that passed the connectivity test with the data source. For more information, see Network connectivity solutions.
Click Run Job on the toolbar at the top of the node editing page.
What to do next
Node scheduling: To run a node in a project folder on a recurring schedule, set a Scheduling Policy and configure the scheduling properties in the Scheduling Configuration section on the right side of the node editing page.
Publish a node: If the node needs to run in the production environment, click the publish icon to publish it. A node in the project folder runs on a schedule only after it is published to the production environment.
Node O&M: After the node is published, you can view the status of the auto triggered task in Operation Center. For more information, see Get started with Operation Center.