In DataWorks, a CDH (Cloudera's Distribution Including Apache Hadoop) Presto node lets you use the Presto distributed SQL query engine to analyze real-time data, which extends the data analysis capabilities of your CDH environment. This topic describes how to create and use a CDH Presto node.
Prerequisites
-
A workflow is created in DataStudio.
In DataStudio, development tasks are organized into workflows. You must create a workflow before you can create a node. For more information, see Create a workflow.
-
A CDH cluster is created and registered with your DataWorks workspace.
You must register your CDH cluster with a DataWorks workspace before creating CDH nodes and tasks. For more information, see Bind a CDH compute resource in the old version of DataStudio.
-
(Optional) If you are using a RAM user, the user must be added to the workspace and assigned the Development or Workspace Administrator role. The Workspace Administrator role has extensive permissions, so assign it with caution. For more information on adding members, see Add members to a workspace.
-
A serverless resource group is purchased and configured. The configuration includes binding the resource group to your workspace and setting up the network. For more information, see Use a serverless resource group.
Limitations
You can run this type of task on a serverless resource group or an old-version exclusive resource group for scheduling. We recommend using a serverless resource group.
Step 1: Create a CDH Presto node
Log on to the DataWorks console. In the target region, click in the left-side navigation pane. Select a workspace from the drop-down list and click Go to Data Development.
-
Right-click a workflow and choose .
Note: Alternatively, you can hover over the New button at the top and follow the on-screen instructions to create a CDH node.
-
In the Create Node dialog box, enter a Name for the node and click OK. You can then develop and configure the task in the new node.
Step 2: Develop a Presto task
Double-click the name of the created node to open its configuration tab, and then perform the following operations to develop a task.
(Optional) Select a CDH compute engine instance
If multiple CDH clusters are registered with your workspace, select the target cluster instance from the CDH engine instance drop-down list at the top of the page, such as CDH production + test environment. If only one CDH cluster is registered, you do not need to make a selection. To access an endpoint that is restricted by a whitelist, use a pay-as-you-go resource group for scheduling.
Simple SQL code development example
In the SQL editor, enter code for the node. Example:
show tables;
select * from userinfo;
Develop SQL code: Use scheduling parameters
DataWorks provides scheduling parameters, which let you dynamically pass values to your code. Define variables in your code in the ${variable_name} format and assign values to them under Scheduling Settings > Parameter. For more information about the supported formats of scheduling parameters, see Supported formats of scheduling parameters.
select '${var}'; -- You can assign a specific scheduling parameter to the var variable.
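The effect of a scheduling parameter on the SQL text can be pictured as a plain placeholder substitution. The following Python sketch is only an illustration, not how DataWorks implements it: the render_sql helper, the fixed scheduled date, and the choice to assign var the previous day's date in yyyymmdd form are all assumptions made for the example.

```python
import re
from datetime import date, timedelta

# Matches placeholders written in the ${variable_name} format.
PLACEHOLDER = re.compile(r"\$\{(\w+)\}")


def render_sql(sql: str, params: dict) -> str:
    """Hypothetical helper: replace each ${name} with its assigned value."""
    def lookup(match):
        name = match.group(1)
        if name not in params:
            raise KeyError(f"no value assigned to scheduling parameter '{name}'")
        return params[name]

    return PLACEHOLDER.sub(lookup, sql)


# Assumed assignment: var holds the day before the scheduled run, as yyyymmdd.
scheduled_day = date(2024, 5, 2)
params = {"var": (scheduled_day - timedelta(days=1)).strftime("%Y%m%d")}

rendered = render_sql("select '${var}';", params)
print(rendered)  # select '20240501';
```

Raising an error for an unassigned placeholder is a design choice of this sketch; it makes a missing assignment visible instead of sending a literal ${var} to the query engine.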
Step 3: Configure task scheduling
If you need to run the task on a recurring schedule, click Scheduling in the right-side pane to configure its scheduling properties:
-
Configure the basic scheduling properties. For more information, see Configure basic properties.
-
Configure the scheduling cycle, rerun properties, and dependencies. For more information, see Configure time properties and Configure same-cycle scheduling dependencies.
Note: You must configure the Rerun property and specify the Parent Nodes before you commit the node.
-
Configure resource properties. For more information, see Configure resource properties. If the task needs to access the public internet or a VPC, you must select a scheduling resource group with the necessary network connectivity. For more information, see Network connectivity solutions.
Step 4: Debug the code
-
(Optional) Select a runtime resource group and assign values to custom parameters.
-
In the toolbar, click the icon that opens the Parameter dialog box. In the dialog box, select the resource group to use for debugging.
-
If your task code uses scheduling parameters, assign their values here for debugging. For more information about the value assignment logic, see What is the difference in value assignment logic between Run, Advanced Run, and development-environment smoke testing?.
-
Save and run the SQL statements.
In the toolbar, click the Save icon to save the SQL statements, and then click the Run icon to run the task.
-
(Optional) Perform smoke testing.
You can run smoke testing in the development environment either when you commit the node or after you commit it. For more information, see Perform smoke testing.
What to do next
-
Commit and deploy the task.
-
In the toolbar, click the Save icon to save the node.
-
In the toolbar, click the Commit icon to commit the task.
-
In the Commit Node dialog box, enter a Change Description.
-
Click Confirm.
In a standard mode workspace, you must deploy the task to the production environment after you commit it. In the top menu bar, click Deploy. For more information, see Deploy tasks.
-
View scheduled tasks.
-
In the upper-right corner of the editor, click O&M Personnel to open the production environment's Operation Center.
-
View the scheduled tasks that are running. For more information, see Manage scheduled tasks.
To view more details about scheduled tasks, click Operation Center in the top menu bar. For more information, see Operation Center overview.