ClickHouse SQL lets you run distributed SQL queries on structured data to improve job efficiency. In DataWorks, you can use a ClickHouse SQL node to develop tasks, schedule them to run periodically, and integrate them with other jobs. This topic describes the main workflow for developing a task using a ClickHouse SQL node.
Prerequisites
-
EMR engine types include DataLake and Hadoop. Each engine type requires different preparations before you create a node. Based on your scenario, complete the required preparations in EMR and DataWorks.
-
DataLake: For more information, see Configure a DataLake cluster and Configure DataWorks.
-
Hadoop: For more information, see Prepare a Hadoop cluster for development.
-
-
You have created a ClickHouse data source and bound it to a workspace.
To access data from ClickHouse for development, you must create a ClickHouse data source in DataWorks and bind it to Data Studio. For more information, see Bind a ClickHouse compute resource.
-
You have created a ClickHouse SQL node. For more information, see Create a scheduled workflow node.
Procedure
-
On the ClickHouse SQL node editing page, follow these steps:
Develop SQL code
In the SQL editor, write your task code. You can define variables in your code using the ${variable_name} format. Assign values to these variables under Scheduling Parameters in the Scheduling Settings pane on the right. This enables dynamic parameter passing for scheduled job runs. For more information about scheduling parameters, see Sources and expressions for scheduling parameters. Example:
CREATE DATABASE IF NOT EXISTS ck_test; CREATE TABLE IF NOT EXISTS ck_test.first_table ( `product_code` String, `package_name` String ) ENGINE = MergeTree ORDER BY package_name SETTINGS index_granularity = 8192; INSERT INTO ck_test.first_table (product_code, package_name) VALUES ('1', ${var}); SELECT * FROM ck_test.first_table;NoteIn this example, the
${var}parameter can be set to1.Run SQL task
-
In the Run Configuration pane, configure the Compute Resource and Resource Group.
-
For Computing Resource, select the name of the CDH cluster that you registered in DataWorks.
-
For Resource Group, select a scheduling resource group that has a confirmed connection to the data source. For more information, see Network connectivity solutions.
-
-
On the toolbar, click the Select a data source drop-down menu. In the dialog box, select the ClickHouse data source that you created, and click Run to run the SQL task.
-
-
To run the node periodically, configure its scheduling properties based on your business requirements. For configuration details, see Node scheduling configuration.
-
After configuring the node, deploy it. For more information, see Node and workflow deployment.
-
After you deploy the task, view its run status in Operation Center. For more information, see Get started with Operation Center.