DataWorks provides MaxCompute Script nodes that support script development using the MaxCompute V2.0 SQL engine. This topic describes how to use a MaxCompute Script node.
Background information
The MaxCompute SQL engine supports a script mode that lets you combine multiple SQL statements into a single script for compilation and execution. This mode is ideal for complex queries, such as those that involve nested subqueries or multi-step operations. When you submit the entire script at once, a unified execution plan is generated. The job is then queued and run only once, resulting in more efficient resource usage. For more information, see SQL in script mode.
Scenarios
The script mode is suitable for multilayer nested subqueries that require rewriting, or for complex scripts that must be split into multiple statements.
For example, in script mode, you can assign a constant value to a variable. Then, you can convert a SELECT * FROM variable statement into a scalar and use it in calculations with other columns. The constant value can also be stored in a single-row table. The following code provides an example. For information about the conversion syntax, see Subqueries (SUBQUERY).
@a := SELECT 10; -- Assign the constant 10 to @a. You can also assign a value from the single-row table t1 by running SELECT col1 FROM t1.
@b := SELECT key,VALUE+(SELECT * FROM @a) FROM t2 WHERE key >10000; -- Perform a calculation using the value from the value column in table t2 and the value of @a.
SELECT * FROM @b;If data from multiple data sources becomes available at very different times, for example, one at 01:00 and another at 07:00, it is not suitable to use table variables to combine statements into a large SQL script.
Prerequisites
A MaxCompute computing resource is attached to the DataWorks workspace.
The required tables have been created in MaxCompute, and sample data has been added to the tables. For more information, see Table creation statements and sample test data.
Syntax
A complete MaxCompute script follows this structure: SET statements > DDL statements > DML statements. Each section can contain zero or more statements, but you cannot mix statement types. Statements that start with an at sign (@) indicate a variable. The syntax is as follows:
-- SET statements
SET odps.sql.type.system.odps2=true;
[SET odps.stage.reducer.num=***;]
[...]
-- DDL statements
CREATE TABLE table1 xxx;
[CREATE TABLE table2 xxx;]
[...]
-- DML statements
@var1 := SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table3
[WHERE where_condition];
@var2 := SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table4
[WHERE where_condition];
@var3 := SELECT [ALL | DISTINCT] var1.select_expr, var2.select_expr, ...
FROM @var1 JOIN @var2 ON ...;
INSERT OVERWRITE|INTO TABLE [PARTITION (partcol1=val1, partcol2=val2 ...)]
SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM @var3;
[@var4 := SELECT [ALL | DISTINCT] var1.select_expr, var.select_expr, ... FROM @var1
UNION ALL | UNION
SELECT [ALL | DISTINCT] var1.select_expr, var.select_expr, ... FROM @var2;
CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name
AS
SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM var4;]Limits
The following limits apply to MaxCompute Script nodes:
Script mode supports SET statements, DML statements, and some DDL statements. DDL statements that display results on the screen, such as
DESCandSHOW, are not supported.A script can contain only one statement that displays results on the screen, such as a standalone
SELECTstatement. If you include more than one, an error occurs. Do not runSELECTstatements that display results in a script.A script can contain a maximum of one
CREATE TABLE ASstatement. This statement must be the last one in the script. You must writeINSERTstatements and CREATE TABLE statements separately.In script mode, the entire script fails if one of its statements fails.
In script mode, a job is generated to process data only after all input data is ready.
In script mode, an error occurs if you write data to a table and then read data from the same table within the same script.
Incorrect example:
INSERT OVERWRITE TABLE src2 SELECT * FROM src WHERE key > 0; @a := SELECT * FROM src2; SELECT * FROM @a;Correct example:
To avoid this error, modify the statements as follows.
@a := SELECT * FROM src WHERE key > 0; INSERT OVERWRITE TABLE src2 SELECT * FROM @a; SELECT * FROM @a;
Create a MaxCompute Script node
For more information about how to create a node, see Create a MaxCompute Script node.
Develop a MaxCompute Script node
On the editor page for the MaxCompute Script node, you can use the sample script code below for development as needed.
Basic script development
In MaxCompute script mode, the SQL compilation process is straightforward. The following example shows how to use a MaxCompute Script node.
CREATE TABLE IF NOT EXISTS dest(key string , VALUE bigint) ;
CREATE TABLE IF NOT EXISTS dest2(key string,VALUE bigint ) ;
@a := SELECT * FROM src WHERE VALUE >0;
@b := SELECT * FROM src2 WHERE key IS NOT NULL;
@c := SELECT * FROM src3 WHERE VALUE IS NOT NULL;
@d := SELECT a.key,b.value FROM @a LEFT OUTER JOIN @b ON a.key=b.key AND b.value>0;
@e := SELECT a.key,c.value FROM @a INNER JOIN @c ON a.key=c.key;
@f := SELECT * FROM @d UNION SELECT * FROM @e UNION SELECT * FROM @a;
INSERT OVERWRITE TABLE dest SELECT * FROM @f;
@g := SELECT e.key,c.value FROM @e JOIN @c ON e.key=c.key;
INSERT OVERWRITE TABLE dest2 SELECT * FROM @g;Advanced script development
MaxCompute script mode supports IF statements. An IF statement allows the program to automatically select the execution logic based on a condition. The MaxCompute IF syntax is divided into two types based on the condition: BOOLEAN expression and BOOLEAN scalar subquery.
The IF condition is a BOOLEAN expression
This type of
IF ELSEstatement determines which branch to execute during the compilation stage. The following code provides an example:-- Data processing SET odps.sql.allow.fullscan=true; SET odps.optimizer.cbo.rule.filter.black=LM; @date := '${var}'; @row TABLE(key int,VALUE bigint); -- Declare the variable row. The variable type is Table, and the schema is string. IF ( cast(@date AS bigint) % 2 == 0 ) BEGIN @row := SELECT key,VALUE FROM src1; END ELSE BEGIN @row := SELECT key,VALUE FROM src2; END INSERT OVERWRITE TABLE dest1 PARTITION(p='${var}') SELECT key,VALUE FROM @row;NoteA variable named var is defined in the code. You must assign a value to the var variable in the scheduling parameter settings.
The IF condition is a BOOLEAN scalar subquery
This type of
IF ELSEstatement cannot determine which branch to execute during the compilation stage. The decision is made at runtime. Therefore, you must submit multiple jobs. The following code provides an example:@i bigint; @t TABLE(id bigint, VALUE bigint); IF ((SELECT count(*) FROM src WHERE a = '5') > 1) BEGIN @i := 1; @t := SELECT @i, @i*2; END ELSE BEGIN @i := 2; @t := SELECT @i, @i*2; END SELECT id, VALUE FROM @t;Embedded user-defined function (UDF) development
You can also use MaxCompute script mode to develop embedded user-defined functions (UDFs) by embedding Java or Python code into an SQL script. For more information, see Embedded UDFs.
Test a MaxCompute Script node
On the node editor page, configure the parameters in the Debugging Configurations section on the right.
Parameter
Description
Computing resource
Select the attached MaxCompute computing resource.
Computing quota
Select a computing quota (Quota) that you created. The quota provides computing resources, such as CPU and memory, for the job.
If no computing quota is available, click Create Computing Quota in the drop-down list to create and configure a quota in the MaxCompute console.
Resource group
Select the scheduling resource group that passed the connectivity test with the computing resource. For more information, see Network connectivity solutions.
In the parameter dialog box on the toolbar, select the MaxCompute data source that you created, and click Run to execute the MaxCompute Script task.
What to do next
Node scheduling: If the node in the project folder must be run on a schedule, set a Scheduling and configure the scheduling properties in the Scheduling Policies section on the right side of the node.
Node deployment: If the task must be run in the production environment, click the
icon to start the deployment process. This process publishes the task to the production environment. Nodes in a project folder run on a schedule only after they are published to the production environment.
Table creation statements and sample test data
For example:
-- Create tables.
CREATE TABLE IF NOT EXISTS src(key string ,VALUE BIGINT);
INSERT INTO src VALUES ('1',11) ;
INSERT INTO src VALUES ('1',11) ;
CREATE TABLE IF NOT EXISTS src2(key string ,VALUE BIGINT);
INSERT INTO src2 VALUES ('1',22);
INSERT INTO src2 VALUES ('2',22);
CREATE TABLE IF NOT EXISTS src3(key string ,VALUE BIGINT);
INSERT INTO src3 VALUES ('1',33);
INSERT INTO src3 VALUES ('3',33);