Build a CDH Presto Node to Run Distributed SQL Queries

A CDH (Cloudera's Distribution Including Apache Hadoop) Presto node lets you run distributed SQL queries against real-time data in your CDH environment directly from DataWorks DataStudio. Follow the steps below to create the node, write Presto SQL, configure scheduling, and debug and deploy your task.

Prerequisites

Before you begin, ensure that you have:

A workflow created in DataStudio. See Create a workflow.
A CDH cluster registered to your DataWorks workspace. See Register a CDH or CDP cluster to DataWorks.
A serverless resource group that is purchased, associated with your workspace, and configured with network access. See Create and use a serverless resource group.
(RAM users only) The RAM user added to the workspace with the Development role. The Workspace Administrator role also works but grants broader permissions than needed, so assign it with caution. See Add workspace members and assign roles to them.

Limitations

CDH Presto tasks run on serverless resource groups or old-version exclusive resource groups. We recommend that you run tasks on serverless resource groups.

Step 1: Create a CDH Presto node

Go to the DataStudio page. Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose Data Development and O&M > Data Development. Select your workspace from the drop-down list and click Go to Data Development.
On the DataStudio page, find the desired workflow, right-click the workflow name, and choose Create Node > CDH > CDH Presto.
Alternatively, move the pointer over the Create icon at the top of the Scheduled Workflow pane and create a CDH Presto node as prompted.
In the Create Node dialog box, configure the Name parameter and click Confirm.

Step 2: Develop a Presto task

Double-click the node name to open its configuration tab, then perform the following operations.

Select a CDH compute engine instance (optional)

If multiple CDH clusters are registered to your workspace, select one from the Engine Instance CDH drop-down list. If only one CDH cluster is registered, skip this step.

Write SQL code

In the SQL editor, enter your Presto SQL statements. Example:

-- List the tables in the current schema.
show tables;

-- Query all rows from the userinfo table.
select * from userinfo;

Use scheduling parameters

DataWorks scheduling parameters let you substitute dynamic values into task code at run time. Define variables in your SQL using the ${Variable} format, then assign values in the Scheduling Parameter section of the Properties tab.

select '${var}'; -- Assign a scheduling parameter to the var variable.

For supported formats, see Supported formats of scheduling parameters.
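
The substitution described above can be pictured with a short Python sketch. This is only an illustration of the placeholder idea, not how DataWorks implements it. The render function, the task_sql string, and the value 20240101 are all hypothetical, and Python's string.Template is used because it happens to accept the same ${name} syntax.

```python
from string import Template

# Hypothetical task code, as it might appear in the SQL editor.
task_sql = "select '${var}';"

# Hypothetical assignment, standing in for the value set in the
# Scheduling Parameter section of the Properties tab.
assigned = {"var": "20240101"}


def render(sql: str, params: dict) -> str:
    """Replace each ${name} placeholder in sql with its assigned value."""
    # Template.substitute raises KeyError when a placeholder has no value.
    return Template(sql).substitute(params)


rendered = render(task_sql, assigned)
print(rendered)  # select '20240101';
```

In this sketch, a placeholder without an assigned value raises KeyError. How DataWorks treats an unassigned variable is not covered here.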

Step 3: Configure task scheduling properties

Click Properties in the right-side navigation pane to configure how and when the task runs.

Configuration area	What to set	Reference
Basic properties	Basic task settings	Configure basic properties
Scheduling cycle and rerun	Run frequency, rerun policy, and parent node dependencies	Configure time properties
Scheduling dependencies	Same-cycle dependencies between nodes	Configure same-cycle scheduling dependencies
Resource properties	Resource group assignment for scheduling	Configure the resource property

Important

Configure Rerun and Parent Nodes on the Properties tab before you commit the task.

If the node needs access to the internet or a virtual private cloud (VPC), select a resource group for scheduling that is connected to that network. See Network connectivity solutions.

Step 4: Debug task code

(Optional) Select a resource group and assign values to custom parameters.
- Click the Run with Parameters icon in the top toolbar. In the Parameters dialog box, select the resource group to use for debugging.
- If your task code uses scheduling parameters, assign values to those variables for the debug run. See Differences in value assignment logic among Run, Run with Parameters, and Perform Smoke Testing modes.
Save and run the SQL statements. Click the Save icon to save, then click the Run icon to run.
(Optional) Perform smoke testing. You can run a smoke test on the task in the development environment when or after you commit the task. See Perform smoke testing.

What's next

Commit and deploy the task:

Click the Save icon to save the task.
Click the Submit icon to commit the task.
In the Submit dialog box, fill in the Change description field and click Confirm.
If your workspace is in standard mode, deploy the task to the production environment: click Deploy in the top navigation bar of DataStudio. See Deploy tasks.

View and monitor the task:

Click Operation Center in the upper-right corner of the node configuration tab to go to Operation Center in the production environment.
View your scheduled task. See View and manage auto triggered tasks.

To view more details about the task, click Operation Center in the top navigation bar of the DataStudio page. See Overview.