ODPS SQL nodes let you run MaxCompute SQL tasks on a schedule in DataWorks and chain them with other node types in the same workflow. MaxCompute SQL handles terabyte-scale data processing in distributed, non-real-time scenarios using SQL-like syntax.
Prerequisites
Before you begin, ensure that you have:
-
Created an ODPS SQL node. For more information, see Create and manage ODPS nodes.
Limits
| Constraint | Limit |
|---|---|
| Node code size | 128 KB |
| SQL statements per node | 200 |
| Query result rows displayed | 10,000 |
| Query result data returned | 10 MB |
Comment restrictions:
-
Do not add comments to a SET or USE statement in isolation.
-
Do not use semicolons (
;) inside comments. -
Do not add a comment at the end of a complete SQL statement. A semicolon marks the end of a statement.
SET and USE statement restrictions:
-
Do not use a SET or USE statement alone in an ODPS SQL node. Execute them together with other SQL statements.
-
Only SELECT or WITH statements return formatted result sets.
How SET and USE statements execute
SET and USE statements behave differently depending on the environment.
Sample node code:
SET a=b;
CREATE TABLE name1(id STRING);
SET c=d;
CREATE TABLE name2(id STRING);
The following table shows how these statements are grouped and executed in each environment:
| SQL statement | DataStudio | Scheduling environment |
|---|---|---|
| First | SET a=b; SET c=d; CREATE TABLE name1(id STRING); |
SET a=b; CREATE TABLE name1(id STRING); |
| Second | SET a=b; SET c=d; CREATE TABLE name2(id STRING); |
SET c=d; CREATE TABLE name2(id STRING); |
-
DataStudio: All SET and USE statements in the node are combined and run before any other SQL statement.
-
Scheduling environment: All statements run sequentially, in the order they appear.
Write SQL statements
MaxCompute SQL supports DDL, DML, and DQL statements along with MaxCompute-specific syntax. Common SQL statements such as SELECT, INSERT, UPDATE, and DELETE are supported. For a full syntax reference, see Overview of MaxCompute SQL.
Create a table
Use CREATE TABLE to create non-partitioned, partitioned, external, or clustered tables. For full syntax, see CREATE TABLE.
-- Create a partitioned table.
CREATE TABLE IF NOT EXISTS students
(id BIGINT, name STRING, age BIGINT, birth DATE)
PARTITIONED BY (gender STRING);
Insert data
Use INSERT INTO or INSERT OVERWRITE to load data into a table. For syntax details, see INSERT INTO and INSERT OVERWRITE.
INSERT INTO students PARTITION(gender='boy') VALUES (1, 'John', 15, DATE '2008-05-15');
INSERT INTO students PARTITION(gender='boy') VALUES (2, 'Jack', 17, DATE '2006-07-20');
INSERT INTO students PARTITION(gender='girl') VALUES (3, 'Alice', 20, DATE '2003-04-20');
INSERT INTO students PARTITION(gender='girl') VALUES (4, 'Lily', 21, DATE '2002-01-08');
INSERT INTO students PARTITION(gender='boy') VALUES (5, 'Bob', 17, DATE '2006-09-12');
INSERT INTO can produce duplicate rows. Use INSERT OVERWRITE to replace existing partition data instead.
Query data
Use SELECT for nested queries, sorting, and grouped queries. For syntax details, see SELECT syntax.
-- Enable full table scan for this session.
SET odps.sql.allow.fullscan=true;
-- Query all students in the 'boy' partition, sorted by ID.
SELECT * FROM students WHERE gender='boy' ORDER BY id;
RAM users do not have permission to query MaxCompute tables in the production environment by default. To grant access, request the required permissions in Security Center in the DataWorks console. For details, see Manage permissions on data in a MaxCompute compute engine instance and Manage user permissions by using commands.
Use scheduling parameters
To pass dynamic values into node code at runtime, define variables in the ${Variable name} format, then assign values to them in the Scheduling Parameter section of the Properties tab. For supported formats, see Supported formats of scheduling parameters.
Use MaxCompute V2.0 data types
If your SQL uses functions that rely on MaxCompute V2.0 data types, add SET odps.sql.type.system.odps2=true; before those statements and submit both together. For details, see MaxCompute V2.0 data type edition.
Use SQL functions
MaxCompute supports built-in functions and user-defined functions (UDFs) for data processing and analysis.
Built-in functions
Call built-in functions directly without any setup. The following example uses the dateadd function to add one month to each value in the birth column.
-- Enable full table scan for this session.
SET odps.sql.allow.fullscan=true;
SELECT id, name, age, birth, dateadd(birth, 1, 'mm') AS birth_dateadd
FROM students;
For a full list of built-in functions, see Overview of built-in functions.
User-defined functions (UDFs)
To use a UDF, write the function code, upload it as a resource, and register the function. For step-by-step instructions, see Create and use a MaxCompute UDF.
View running results and logs
After running a query, results appear in the workbook. A maximum of 10,000 rows and 10 MB of result data can be returned.
Sample output — students in the 'boy' partition, sorted by ID:
+----+------+-----+------------+--------+
| id | name | age | birth | gender |
+----+------+-----+------------+--------+
| 1 | John | 15 | 2008-05-15 | boy |
| 2 | Jack | 17 | 2006-07-20 | boy |
| 5 | Bob | 17 | 2006-09-12 | boy |
+----+------+-----+------------+--------+
Sample output — birth column with `dateadd` applied:
+----+-------+-----+------------+---------------+
| id | name | age | birth | birth_dateadd |
+----+-------+-----+------------+---------------+
| 4 | Lily | 21 | 2002-01-08 | 2002-02-08 |
| 3 | Alice | 20 | 2003-04-20 | 2003-05-20 |
| 2 | Jack | 17 | 2006-07-20 | 2006-08-20 |
| 1 | John | 15 | 2008-05-15 | 2008-06-15 |
| 5 | Bob | 17 | 2006-09-12 | 2006-10-12 |
+----+-------+-----+------------+---------------+
On the Result tab, sort results by column or download them. For details, see Process query results.
To view operation logs, click the Runtime Log tab. Logs open in LogView. For details, see Use LogView V2.0 to view job information.
Due to an International Organization for Standardization (ISO) adjustment to the UTC+8 time zone, timestamps for years between 1900 and 1928 differ from the actual time by 352 seconds, and timestamps before 1900 differ by 9 seconds.
What's next
Explore more advanced use cases with ODPS SQL nodes: