Flink SQL Batch nodes let you run standard SQL against bounded datasets directly in DataWorks. They are designed for batch workloads—ETL pipelines, data cleansing, and aggregation jobs—that process a finite dataset and terminate when complete. Unlike streaming nodes, batch nodes optimize for throughput over latency, making them the right choice when all input data is available before the job starts.
Prerequisites
Before you begin, make sure you have:
- A DataWorks workspace with Realtime Compute for Apache Flink computing resources associated in Management Center. For details, see Associate a computing resource.
- A Flink SQL Batch node created in your workspace. For details, see Node development.
Step 1: Write SQL code
Open the configuration tab of the Flink SQL Batch node and write your task code in the SQL editor.
Using variables in your code
Define variables in the ${Variable name} format anywhere in your SQL. Then, in the Properties tab, go to the Scheduling Parameters section and assign values to those variables. When the task runs, DataWorks substitutes the scheduling parameter values into your code automatically.
For details on available expressions, see Sources and expressions of scheduling parameters.
Example
The following SQL creates a source table using the datagen connector and inserts rows into a blackhole result table. The ${var} placeholder is replaced at runtime by a scheduling parameter.
-- Create a source table using the datagen connector.
CREATE TEMPORARY TABLE datagen_source_${var}(
name VARCHAR
) WITH (
'connector' = 'datagen',
'number-of-rows' = '1000'
);
-- Create a result table using the blackhole connector.
CREATE TEMPORARY TABLE blackhole_sink_${var}(
name VARCHAR
) WITH (
'connector' = 'blackhole'
);
-- Insert data from the source table into the result table.
INSERT INTO blackhole_sink_${var}
SELECT
name
FROM datagen_source_${var};
To process daily incremental data, set the bizdate parameter to $[yyyymmdd]. DataWorks replaces this expression with the business date each time the task runs.
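As a concrete sketch of this pattern, the following SQL filters a partitioned table by the business date. The table and column names (orders, daily_summary, ds) are hypothetical placeholders, not part of the example above; only the ${bizdate} substitution mechanism is what the document describes.

```sql
-- Assumes a scheduling parameter named bizdate whose value is $[yyyymmdd].
-- At runtime, DataWorks replaces ${bizdate} with the business date of the instance.
-- The orders, daily_summary, and ds identifiers are hypothetical.
INSERT INTO daily_summary
SELECT
    name,
    COUNT(*) AS cnt
FROM orders
WHERE ds = '${bizdate}'
GROUP BY name;
```

Because the substitution is plain text replacement, the same ${bizdate} expression can also appear in table names or other identifiers, as the datagen example above shows with ${var}.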
Step 2: Configure the task
In the Properties tab, configure the resource and scheduling settings for the task.
Configure Flink resource information
Set the following parameters in the Flink Resource Information section. For a full reference, see Configure job deployment information.
| Parameter | Description |
|---|---|
| Flink Cluster | The Realtime Compute for Apache Flink workspace associated with your DataWorks workspace in Management Center. |
| Flink Engine Version | The engine version for the job. Select an engine version based on your business requirements. |
| Resource Group For Scheduling | The serverless resource group connected to the Realtime Compute for Apache Flink workspace. Use separate resource groups to isolate workloads with different priority levels—for example, keep production jobs in a dedicated group so they are not affected by ad-hoc or low-priority tasks. |
| Job Manager CPU | CPU cores for the JobManager, which handles task scheduling and coordination. Minimum: 0.5 cores. Recommended: 1 core. Maximum: 16 cores. |
| Job Manager Memory | Memory for the JobManager. Recommended range: 2–64 GiB. Set higher values for complex jobs with many tasks. |
| Task Manager CPU | CPU cores per TaskManager, which executes your SQL operators. Minimum: 0.5 cores. Recommended: 1 core per TaskManager. Maximum: 16 cores. |
| Task Manager Memory | Memory per TaskManager. Recommended range: 2–64 GiB. Increase this value if your job processes large data volumes or uses memory-intensive operators. |
| Parallelism | The number of tasks that run in parallel. Configure based on workspace resources and deployment characteristics. |
| Maximum Number Of Slots | The total number of slots across all TaskManagers. Each slot runs one task or operator in parallel. |
| Slots For Each TaskManager | The number of slots in each TaskManager. Increasing this value lets a single TaskManager handle more parallel tasks, which reduces the total number of TaskManagers needed. |
(Optional) Configure scheduling parameters
1. In the Properties tab, go to the Scheduling Parameters section.
2. Click Add Parameter.
3. Enter the Parameter Name and Parameter Value.

The parameters you define here are substituted into your SQL code at runtime wherever you used the ${Variable name} syntax.
(Optional) Configure Flink runtime parameters
In the Properties tab, go to the Flink Runtime Parameters section and enter any additional job configuration in YAML format. These parameters must be compatible with Ververica Platform (VVP) configuration syntax. Do not use semicolons (;) as line breaks.
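A minimal sketch of what this section might contain, using standard Flink configuration keys. These specific keys are illustrative assumptions; verify them against the documentation for your Flink engine version before use.

```yaml
# Hypothetical example of Flink runtime parameters in YAML format.
# One key-value pair per line; do not terminate lines with semicolons.
table.exec.sink.not-null-enforcer: DROP
pipeline.operator-chaining: 'true'
```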
To run this node on a schedule, also configure the Scheduling Policies, Scheduling Time, Scheduling Dependencies, and Node Output Parameters sections. For details, see Node scheduling configuration.
After you finish configuring the task, click Save.
Step 3: Deploy and monitor the task
1. Commit and deploy the configured node. For details, see Node and workflow deployment.
2. After deployment, click Perform O&M under Prod Online to open Operation Center. For more information, see Getting started with Operation Center.
3. In Operation Center, check the running status of your task.