MaxCompute Script node - DataWorks - Alibaba Cloud Documentation Center

DataWorks provides MaxCompute Script nodes that support script development using the MaxCompute V2.0 SQL engine. This topic describes how to use a MaxCompute Script node.

Background information

The MaxCompute SQL engine supports a script mode that lets you combine multiple SQL statements into a single script for compilation and execution. This mode is ideal for complex queries, such as those that involve nested subqueries or multi-step operations. When you submit the entire script at once, a unified execution plan is generated. The job is then queued and run only once, resulting in more efficient resource usage. For more information, see SQL in script mode.

Scenarios

The script mode is suitable for multilayer nested subqueries that require rewriting, or for complex scripts that must be split into multiple statements.

For example, in script mode, you can assign a constant value to a variable. Then, you can convert a SELECT * FROM variable statement into a scalar and use it in calculations with other columns. The constant value can also be stored in a single-row table. The following code provides an example. For information about the conversion syntax, see Subqueries (SUBQUERY).

@a := SELECT 10; -- Assign the constant 10 to @a. You can also assign a value from the single-row table t1 by running SELECT col1 FROM t1.
@b := SELECT key,VALUE+(SELECT * FROM @a) FROM t2 WHERE key >10000; -- Perform a calculation using the value from the value column in table t2 and the value of @a.
SELECT * FROM @b;

Note

If data from multiple data sources becomes available at very different times, for example, one at 01:00 and another at 07:00, it is not suitable to use table variables to combine statements into a large SQL script.

Prerequisites

A MaxCompute computing resource is attached to the DataWorks workspace.
The required tables have been created in MaxCompute, and sample data has been added to the tables. For more information, see Table creation statements and sample test data.

Syntax

A complete MaxCompute script follows this structure: SET statements > DDL statements > DML statements. Each section can contain zero or more statements, but you cannot mix statement types. Statements that start with an at sign (@) indicate a variable. The syntax is as follows:

-- SET statements
SET odps.sql.type.system.odps2=true;
[SET odps.stage.reducer.num=***;]
[...]
-- DDL statements
CREATE TABLE table1 xxx;
[CREATE TABLE table2 xxx;]
[...]
-- DML statements
@var1 := SELECT [ALL | DISTINCT] select_expr, select_expr, ...
    FROM table3
    [WHERE where_condition];
@var2 := SELECT [ALL | DISTINCT] select_expr, select_expr, ...
    FROM table4
    [WHERE where_condition];
@var3 := SELECT [ALL | DISTINCT] var1.select_expr, var2.select_expr, ...
    FROM @var1 JOIN @var2 ON ...;
INSERT OVERWRITE|INTO TABLE [PARTITION (partcol1=val1, partcol2=val2 ...)]
    SELECT [ALL | DISTINCT] select_expr, select_expr, ...
    FROM @var3;
[@var4 := SELECT [ALL | DISTINCT] var1.select_expr, var.select_expr, ... FROM @var1
    UNION ALL | UNION
    SELECT [ALL | DISTINCT] var1.select_expr, var.select_expr, ... FROM @var2;
CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name
    AS
    SELECT [ALL | DISTINCT] select_expr, select_expr, ...
    FROM var4;]

Limits

The following limits apply to MaxCompute Script nodes:

Script mode supports SET statements, DML statements, and some DDL statements. DDL statements that display results on the screen, such as DESC and SHOW, are not supported.
A script can contain only one statement that displays results on the screen, such as a standalone SELECT statement. If you include more than one, an error occurs. Do not run SELECT statements that display results in a script.
A script can contain a maximum of one CREATE TABLE AS statement. This statement must be the last one in the script. You must write INSERT statements and CREATE TABLE statements separately.
In script mode, the entire script fails if one of its statements fails.
In script mode, a job is generated to process data only after all input data is ready.
In script mode, an error occurs if you write data to a table and then read data from the same table within the same script.
Incorrect example:
```
INSERT OVERWRITE TABLE src2 SELECT * FROM src WHERE key > 0;
@a := SELECT * FROM src2;
SELECT * FROM @a;
```
Correct example:
To avoid this error, modify the statements as follows.
```
@a := SELECT * FROM src WHERE key > 0;
INSERT OVERWRITE TABLE src2 SELECT * FROM @a;
SELECT * FROM @a;
```

Create a MaxCompute Script node

For more information about how to create a node, see Create a MaxCompute Script node.

Develop a MaxCompute Script node

On the editor page for the MaxCompute Script node, you can use the sample script code below for development as needed.

Basic script development

In MaxCompute script mode, the SQL compilation process is straightforward. The following example shows how to use a MaxCompute Script node.

CREATE TABLE IF NOT EXISTS dest(key string , VALUE bigint) ;
CREATE TABLE IF NOT EXISTS dest2(key string,VALUE bigint ) ;
@a := SELECT * FROM src WHERE VALUE >0;
@b := SELECT * FROM src2 WHERE key IS NOT NULL;
@c := SELECT * FROM src3 WHERE VALUE IS NOT NULL;
@d := SELECT a.key,b.value FROM @a LEFT OUTER JOIN @b ON a.key=b.key AND b.value>0;
@e := SELECT a.key,c.value FROM @a INNER JOIN @c ON a.key=c.key;
@f := SELECT * FROM @d UNION SELECT * FROM @e UNION SELECT * FROM @a;
INSERT OVERWRITE TABLE dest  SELECT * FROM @f;
@g := SELECT e.key,c.value  FROM @e JOIN @c ON e.key=c.key;
INSERT OVERWRITE TABLE dest2 SELECT * FROM @g;

Advanced script development

MaxCompute script mode supports IF statements. An IF statement allows the program to automatically select the execution logic based on a condition. The MaxCompute IF syntax is divided into two types based on the condition: BOOLEAN expression and BOOLEAN scalar subquery.

The IF condition is a BOOLEAN expression

This type of IF ELSE statement determines which branch to execute during the compilation stage. The following code provides an example:

-- Data processing
SET odps.sql.allow.fullscan=true;
SET odps.optimizer.cbo.rule.filter.black=LM; 
@date := '${var}';
@row  TABLE(key int,VALUE bigint); -- Declare the variable row. The variable type is Table, and the schema is string. 
IF ( cast(@date  AS bigint) % 2 == 0 ) BEGIN 
@row  := SELECT key,VALUE FROM src1; 
END ELSE BEGIN
@row  := SELECT key,VALUE FROM src2; 
END
INSERT OVERWRITE TABLE dest1 PARTITION(p='${var}')  SELECT key,VALUE FROM @row;

Note

A variable named var is defined in the code. You must assign a value to the var variable in the scheduling parameter settings.

The IF condition is a BOOLEAN scalar subquery
This type of IF ELSE statement cannot determine which branch to execute during the compilation stage. The decision is made at runtime. Therefore, you must submit multiple jobs. The following code provides an example:
```
@i bigint;
@t TABLE(id bigint, VALUE bigint);
IF ((SELECT count(*) FROM src WHERE a = '5') > 1) BEGIN
@i := 1;
@t := SELECT @i, @i*2;
END ELSE
BEGIN
@i := 2;
@t := SELECT @i, @i*2;
END
SELECT id, VALUE FROM @t; 
```
Embedded user-defined function (UDF) development
You can also use MaxCompute script mode to develop embedded user-defined functions (UDFs) by embedding Java or Python code into an SQL script. For more information, see Embedded UDFs.

Test a MaxCompute Script node

On the node editor page, configure the parameters in the Run Configuration section on the right.

Parameter	Description
Computing resource	Select the attached MaxCompute computing resource.
Computing quota	Select a computing quota (Quota) that you created. The quota provides computing resources, such as CPU and memory, for the job. If no computing quota is available, click Create Computing Quota in the drop-down list to create and configure a quota in the MaxCompute console.
Resource group	Select the scheduling resource group that passed the connectivity test with the computing resource. For more information, see Network connectivity solutions.

In the parameter dialog box on the toolbar, select the MaxCompute data source that you created, and click Run to execute the MaxCompute Script task.

What to do next

Node scheduling configuration: If the node in the project folder must be run on a schedule, set a Scheduling and configure the scheduling properties in the Scheduling Policies section on the right side of the node.
Node deployment: If the task must be run in the production environment, click the icon to start the deployment process. This process publishes the task to the production environment. Nodes in a project folder run on a schedule only after they are published to the production environment.

Table creation statements and sample test data

For example:

-- Create tables.
CREATE TABLE IF NOT EXISTS src(key string ,VALUE BIGINT);
INSERT INTO src VALUES ('1',11) ;
INSERT INTO src VALUES ('1',11) ;
CREATE TABLE IF NOT EXISTS  src2(key string ,VALUE BIGINT);
INSERT INTO src2  VALUES ('1',22);
INSERT INTO src2  VALUES ('2',22);
CREATE TABLE IF NOT EXISTS  src3(key string ,VALUE BIGINT);
INSERT INTO src3 VALUES ('1',33);
INSERT INTO src3 VALUES ('3',33);