ODPS Script nodes let you develop MaxCompute tasks in script mode using the MaxCompute V2.0 SQL engine. In script mode, an entire SQL script file — including multiple statements — is compiled as a single unit and submitted to MaxCompute as one execution plan. All statements run in one queue, which lets MaxCompute optimize across statements and fully utilize available resources.
Prerequisites
Before you begin, make sure you have:
- An ODPS Script node created in DataWorks. For details, see Create and manage ODPS nodes.
When to use script mode
Use script mode when you need to:
- Rewrite a single statement with nested subqueries into multiple simpler statements that share intermediate results.
- Split a complex script into multiple statements connected by variables.
Script mode is not suitable when data from multiple sources has large time gaps between availability. For example, if source A is ready at 01:00 and source B is ready at 07:00, table variables cannot efficiently package these statements into one script.
For more background, see SQL in script mode.
Script syntax
A MaxCompute script can contain SET statements, DDL statements, and DML statements, in that order. Each type can appear zero or more times, but statements of different types cannot be interleaved.
The general structure is:
-- SET statements
set odps.sql.type.system.odps2=true;
[set odps.stage.reducer.num=***;]
[...]
-- DDL statements
create table table1 xxx;
[create table table2 xxx;]
[...]
-- DML statements
@var1 := SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table3
[WHERE where_condition];
@var2 := SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table4
[WHERE where_condition];
@var3 := SELECT [ALL | DISTINCT] var1.select_expr, var2.select_expr, ...
FROM @var1 join @var2 on ...;
INSERT OVERWRITE|INTO TABLE [PARTITION (partcol1=val1, partcol2=val2 ...)]
SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM @var3;
[@var4 := SELECT [ALL | DISTINCT] var1.select_expr, var2.select_expr, ... FROM @var1
UNION ALL | UNION
SELECT [ALL | DISTINCT] var1.select_expr, var2.select_expr, ... FROM @var2;
CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name
AS
SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM @var4;]
Variable syntax
Assign query results to a variable using the := operator:
@<variable_name> := <SELECT statement>;
| Element | Description |
|---|---|
| `@<variable_name>` | Variable name prefixed with `@`. Reference the variable in subsequent statements as `@<variable_name>`. |
| `:=` | Assignment operator. Assigns the SELECT result to the variable as a table result set. |
| `<SELECT statement>` | Any valid SELECT statement. |
To assign a constant to a variable and use it as a scalar value, combine SELECT <constant> with a subsequent SELECT * FROM @variable. For example:
@a := SELECT 10; -- Assign the constant 10 to @a. You can also use SELECT col1 FROM t1 to assign a value from a single-row table.
@b := SELECT key, value + (SELECT * FROM @a) FROM t2 WHERE key > 10000;
SELECT * FROM @b;
For more information about this syntax, see Subqueries.
Limitations
The following limitations apply when you write ODPS Script node code.
Supported statement types
ODPS Script nodes support SET statements, DML statements, and a subset of DDL statements. DDL statements that display results — such as DESC and SHOW — are not supported.
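For example, a script that mixes a result-displaying DDL statement into script-mode code fails to compile. A hypothetical sketch, assuming source table `src` and destination table `dest` exist:

```sql
DESC src; -- Not supported in script mode: DESC is a result-displaying DDL statement.
@a := SELECT * FROM src WHERE key > 0;
INSERT OVERWRITE TABLE dest SELECT * FROM @a;
```

Run DESC and SHOW statements separately, outside the script, instead.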
Result-displaying statements
Write at most one result-displaying statement (such as a SELECT statement) per script. Writing multiple result-displaying statements causes an error.
Error:
SELECT * FROM src1;
SELECT * FROM src2; -- Second result-displaying statement causes an error.
Fix: Keep only one result-displaying SELECT, or avoid using result-displaying SELECT statements in production scripts altogether.
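One possible shape of the fix, assuming a hypothetical destination table `dest2` is acceptable for the second result set:

```sql
SELECT * FROM src1;                              -- The single result-displaying statement.
INSERT OVERWRITE TABLE dest2 SELECT * FROM src2; -- Persist the second result instead of displaying it.
```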
CREATE TABLE AS
Write at most one CREATE TABLE AS statement per script, and place it as the last statement. To keep scripts maintainable, write CREATE TABLE statements and INSERT statements in separate sections.
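A minimal sketch of valid placement, assuming hypothetical tables `src`, `dest`, and `summary`:

```sql
@a := SELECT key, value FROM src WHERE value > 0;
INSERT OVERWRITE TABLE dest SELECT * FROM @a;
-- The single CREATE TABLE AS statement comes last in the script.
CREATE TABLE IF NOT EXISTS summary AS
SELECT key, count(*) AS cnt FROM @a GROUP BY key;
```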
Script failure behavior
If any statement in a script fails, the entire script fails.
Job generation
MaxCompute generates a job for the script only after all input data is ready.
Write-then-read in the same script
Writing data to a table and then reading from it in the same script causes an error.
Error:
insert overwrite table src2 select * from src where key > 0;
@a := select * from src2; -- Reading from the table written above causes an error.
select * from @a;
Fix: Use a variable to hold the intermediate result instead of writing and reading the same table:
@a := select * from src where key > 0;
insert overwrite table src2 select * from @a;
select * from @a;
Basic example
The following script joins multiple source tables and writes results to two destination tables. Intermediate results are stored in variables — you do not need to manage statement execution order manually.
create table if not exists dest(key string, value bigint);
create table if not exists dest2(key string, value bigint);
@a := select * from src where value > 0;
@b := select * from src2 where key is not null;
@c := select * from src3 where value is not null;
@d := select a.key, b.value from @a left outer join @b on a.key = b.key and b.value > 0;
@e := select a.key, c.value from @a inner join @c on a.key = c.key;
@f := select * from @d union select * from @e union select * from @a;
insert overwrite table dest select * from @f;
@g := select e.key, c.value from @e join @c on e.key = c.key;
insert overwrite table dest2 select * from @g;
This example uses src, src2, and src3 as source tables. To create them with sample data:
create table if not exists src(key string, value BIGINT);
insert into src values ('1', 11);
insert into src values ('1', 11);
create table if not exists src2(key string, value BIGINT);
insert into src2 values ('1', 22);
insert into src2 values ('2', 22);
create table if not exists src3(key string, value BIGINT);
insert into src3 values ('1', 33);
insert into src3 values ('3', 33);
After the script runs, dest contains the union of all filtered and joined rows, and dest2 contains the rows from the inner join of @e and @c.
Advanced examples
IF statement with a BOOLEAN expression
When the IF condition is a BOOLEAN-type expression, the branch is determined at code compilation time. MaxCompute evaluates the condition during compilation and selects the branch before job submission.
set odps.sql.allow.fullscan=true;
set odps.optimizer.cbo.rule.filter.black=LM;
@date := '${var}';
@row TABLE(key int, value bigint); -- Declare a table variable with schema (key int, value bigint).
IF (cast(@date as bigint) % 2 == 0) BEGIN
@row := SELECT key, value from src1;
END ELSE BEGIN
@row := SELECT key, value from src2;
END
INSERT OVERWRITE TABLE dest1 partition(p='${var}') SELECT key, value FROM @row;
The var variable used in the script must be assigned a value in the Scheduling Parameter section of the Properties tab.
When var is 2 (an even number), MaxCompute selects the src1 branch at compile time and inserts rows from src1 into dest1. When var is 3 (an odd number), it inserts rows from src2 instead.
IF statement with a BOOLEAN-type scalar subquery
When the IF condition is a BOOLEAN-type scalar subquery, the branch is determined at job runtime rather than at compilation time. MaxCompute must submit multiple jobs to evaluate the condition.
@i bigint;
@t table(id bigint, value bigint);
IF ((SELECT count(*) FROM src WHERE a = '5') > 1) BEGIN
@i := 1;
@t := select @i, @i*2;
END ELSE
BEGIN
@i := 2;
@t := select @i, @i*2;
END
select id, value from @t;
If src contains more than one row where a = '5', the script sets @i to 1 and @t to (1, 2). Otherwise, it sets @i to 2 and @t to (2, 4).
Code-embedded UDFs
Script mode supports embedding Java or Python code directly into an SQL script to implement user-defined functions (UDFs). For details, see Code-embedded UDFs.
What to do next
After completing node development, take the following steps to put the task into production.
| Step | Description |
|---|---|
| Configure scheduling | Set up rerun behavior, scheduling cycles, and task dependencies so DataWorks runs the task on a schedule. See Overview. |
| Debug the node | Run the code in DataWorks to verify your logic before deploying. If the output does not match your expectations, fix the code and re-run. See Debugging procedure. |
| Deploy the node | Deploy the node to activate periodic scheduling based on the properties you configured. After deployment, DataWorks schedules the task automatically. See Deploy nodes. |