ClickHouse SQL lets you run distributed SQL queries and process structured data for more efficient job execution. In DataWorks, you can use a ClickHouse SQL node to develop and periodically schedule ClickHouse SQL tasks, and integrate them with other jobs. This topic describes how to develop tasks using a ClickHouse SQL node.
Prerequisites
The EMR engine can be either the DataLake or Hadoop type. Required preparations vary by engine type. Complete the necessary steps for your setup in both EMR and DataWorks.
DataLake: For more information, see Configure an EMR data lake cluster and Configure DataWorks.
A ClickHouse Data Source is created and bound to your workspace.
You must create a ClickHouse Data Source in DataWorks and bind it to Data Studio. This allows you to access ClickHouse data for development. For more information, see Associate a ClickHouse computing resource.
A ClickHouse SQL node is created. For more information, see Create a node for a scheduled workflow.
Procedure
On the ClickHouse SQL node edit page, follow these steps:
Develop SQL code
In the SQL editor, write the code for your task. You can define variables in the code using the ${variable_name} format, and then assign values to these variables in Schedule > Scheduling parameters on the right side of the node configuration page. This enables dynamic parameter passing for scheduled tasks. For more information about how to use scheduling parameters, see Sources and expressions of scheduling parameters. The following code provides an example:
CREATE DATABASE IF NOT EXISTS ck_test; CREATE TABLE IF NOT EXISTS ck_test.first_table ( `product_code` String, `package_name` String ) ENGINE = MergeTree ORDER BY package_name SETTINGS index_granularity = 8192; INSERT INTO ck_test.first_table (product_code, package_name) VALUES ('1', ${var}); SELECT * FROM ck_test.first_table;NoteIn this example, set ${var} to
1.Run the SQL task
In the Run Configuration pane, configure Compute Resource and Resource Group.
For Compute Resource, select your registered CDH cluster.
For Resource Group, select a scheduling resource group that passes the Data Source connectivity test. For more information, see Network connectivity solutions.
On the toolbar, click the Select Data Source drop-down list. In the dialog box that appears, select the ClickHouse Data Source that you created, and then click Run.
To run the task periodically, configure its scheduling settings. For more information, see Node scheduling configuration.
After you configure the node, publish it. For more information, see Node and workflow deployment.
After the node is published, you can view the run status of the scheduled task in the Operation Center. For more information, see Getting started with Operation Center.