DataWorks provides the MaxCompute Script node, which supports development in script mode based on the MaxCompute 2.0 SQL engine. This topic describes how to use the MaxCompute Script node.
Background
The MaxCompute SQL engine supports script mode, which combines multiple SQL statements into a single script that is compiled and executed as one unit. This is useful for complex queries, such as nested subqueries or operations that require multiple steps. By submitting the entire script at once, you generate a unified execution plan, and the job is queued and run only once. This approach uses resources more efficiently. For more information, see SQL script mode. In DataWorks, you can use the MaxCompute Script node to create and schedule script-based tasks.
Use case
Script mode is ideal for rewriting deeply nested subqueries or splitting complex logic into multiple statements.
For example, in script mode, you can assign a constant to a variable and then use a SELECT * FROM variable statement to convert it into a scalar for use in calculations with other columns. The constant can also be stored in a single-row table. See the following example. For the conversion syntax, see Subquery (SUBQUERY).
@a := SELECT 10; -- Assign the constant 10 to @a. Alternatively, assign a value from a single-row table t1, such as SELECT col1 FROM t1.
@b := SELECT key,VALUE+(SELECT * FROM @a) FROM t2 WHERE key >10000; -- Calculate a new value by using the value column from table t2 and the value from @a.
SELECT * FROM @b;
Avoid using a single script with table variables to combine data sources that become ready at significantly different times (for example, one source is ready at 01:00 and another at 07:00). The job will not start until all data is available, which can cause significant delays.
Prerequisites
-
A MaxCompute compute resource is attached to the DataWorks workspace.
-
Ensure the required tables are created in MaxCompute and populated with sample data. For details, see Table creation statements and sample data.
Syntax
A complete MaxCompute script follows this order: SET statement > DDL statement > DML statement. Each type can have zero or more statements, but you cannot interleave them. Statements that start with an at sign (@) declare variables to pass results between statements. The syntax is as follows:
--SET statements
SET odps.sql.type.system.odps2=true;
[SET odps.stage.reducer.num=***;]
[...]
--DDL statements
CREATE TABLE table1 xxx;
[CREATE TABLE table2 xxx;]
[...]
--DML statements
@var1 := SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table3
[WHERE where_condition];
@var2 := SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table4
[WHERE where_condition];
@var3 := SELECT [ALL | DISTINCT] var1.select_expr, var2.select_expr, ...
FROM @var1 JOIN @var2 ON ...;
INSERT OVERWRITE|INTO TABLE [PARTITION (partcol1=val1, partcol2=val2 ...)]
SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM @var3;
[@var4 := SELECT [ALL | DISTINCT] var1.select_expr, var.select_expr, ... FROM @var1
UNION ALL | UNION
SELECT [ALL | DISTINCT] var1.select_expr, var.select_expr, ... FROM @var2;
CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name
AS
SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM var4;]Limitations
The following limits apply to the MaxCompute Script node:
-
Script mode supports SET statements, DML statements, and some DDL statements. DDL statements that display results to the screen, such as
DESCandSHOW, are not supported. -
A script can contain at most one statement that displays results to the screen, such as a standalone
SELECTstatement. Otherwise, an error is returned. We recommend that you do not run screen-displaySELECTstatements in scripts. -
A script can contain at most one
CREATE TABLE ASstatement, and it must be the last statement. We recommend that you write table creation statements andINSERTstatements separately. -
In script mode, if any statement fails, all statements in the script fail.
-
In script mode, a job is generated for data processing only after all input data is ready.
-
In script mode, an error is returned if a table is read after being written to.
Incorrect example:
INSERT OVERWRITE TABLE src2 SELECT * FROM src WHERE key > 0; @a := SELECT * FROM src2; SELECT * FROM @a;Correct example:
To avoid read-after-write issues, you can make the following modification.
@a := SELECT * FROM src WHERE key > 0; INSERT OVERWRITE TABLE src2 SELECT * FROM @a; SELECT * FROM @a;
Create a MaxCompute Script node
For information about how to create a node, see Create a MaxCompute Script node.
Develop a MaxCompute Script node
On the MaxCompute Script node editing page, you can select the following script examples based on your business needs for development.
Basic script development
In MaxCompute script mode, the SQL compilation process is straightforward. The following simple example shows how to use the MaxCompute Script node.
CREATE TABLE IF NOT EXISTS dest(key string , VALUE bigint) ;
CREATE TABLE IF NOT EXISTS dest2(key string,VALUE bigint ) ;
@a := SELECT * FROM src WHERE VALUE >0;
@b := SELECT * FROM src2 WHERE key IS NOT NULL;
@c := SELECT * FROM src3 WHERE VALUE IS NOT NULL;
@d := SELECT a.key,b.value FROM @a LEFT OUTER JOIN @b ON a.key=b.key AND b.value>0;
@e := SELECT a.key,c.value FROM @a INNER JOIN @c ON a.key=c.key;
@f := SELECT * FROM @d UNION SELECT * FROM @e UNION SELECT * FROM @a;
INSERT OVERWRITE TABLE dest SELECT * FROM @f;
@g := SELECT e.key,c.value FROM @e JOIN @c ON e.key=c.key;
INSERT OVERWRITE TABLE dest2 SELECT * FROM @g;
Advanced script development
MaxCompute script mode supports IF statements, which allow a program to automatically select execution logic based on conditions. The IF syntax in MaxCompute is categorized by Condition type: BOOLEAN expressions and BOOLEAN Scalar SubQueries.
-
IF condition is a BOOLEAN expression
This type of
IF ELSEstatement determines which branch to execute at compile time. Example:--Data processing SET odps.sql.allow.fullscan=true; SET odps.optimizer.cbo.rule.filter.black=LM; @date := '${var}'; @row TABLE(key int,VALUE bigint); --Declare the variable row. Its type is Table and its schema is string. IF ( cast(@date AS bigint) % 2 == 0 ) BEGIN @row := SELECT key,VALUE FROM src1; END ELSE BEGIN @row := SELECT key,VALUE FROM src2; END INSERT OVERWRITE TABLE dest1 PARTITION(p='${var}') SELECT key,VALUE FROM @row;NoteThe code defines a variable named var. You must assign a value to the var variable in the scheduling parameter settings.
-
IF condition is a BOOLEAN Scalar SubQuery
This type of
IF ELSEstatement cannot determine which branch to execute at compile time. Instead, it is determined at runtime. Therefore, multiple jobs must be submitted. Example:@i bigint; @t TABLE(id bigint, VALUE bigint); IF ((SELECT count(*) FROM src WHERE a = '5') > 1) BEGIN @i := 1; @t := SELECT @i, @i*2; END ELSE BEGIN @i := 2; @t := SELECT @i, @i*2; END SELECT id, VALUE FROM @t; -
Embedded UDF development
You can also embed Java or Python code in SQL scripts to develop embedded UDFs in MaxCompute script mode. For more information, see Embedded UDF.
Debug a MaxCompute Script node
-
Configure the relevant parameters in Run Configuration on the right side of the node editing page.
Parameter
Description
Compute resource
Select the MaxCompute compute resource that you have associated.
Compute quota
Select a compute quota that you created to provide the computing resources (CPU and memory) required by your compute jobs.
If no compute quota is available, click Create Compute Quota in the drop-down list, and then create a compute quota in the MaxCompute console.
Resource group
Select a scheduling resource group that has passed the connectivity test with the compute resource. For more information, see Network connectivity.
-
Select the MaxCompute data source that you created in the parameter dialog on the toolbar, and then click Run to run the MaxCompute Script task.
Next steps
-
Configure schedule settings: If you want the node in the project directory to be periodically scheduled, configure Scheduling Policy and the related scheduling properties in Scheduling Settings on the right side of the node.
-
Deploy a node: If you want to deploy the task to the production environment, click the
icon to initiate the deployment process. Nodes in the project directory are periodically scheduled only after they are deployed to the production environment.
Table creation statements and sample data
Example:
--Create tables
CREATE TABLE IF NOT EXISTS src(key string ,VALUE BIGINT);
INSERT INTO src VALUES ('1',11) ;
INSERT INTO src VALUES ('1',11) ;
CREATE TABLE IF NOT EXISTS src2(key string ,VALUE BIGINT);
INSERT INTO src2 VALUES ('1',22);
INSERT INTO src2 VALUES ('2',22);
CREATE TABLE IF NOT EXISTS src3(key string ,VALUE BIGINT);
INSERT INTO src3 VALUES ('1',33);
INSERT INTO src3 VALUES ('3',33);