The MaxCompute SQL node in DataWorks lets you periodically schedule and integrate MaxCompute SQL tasks with other node types. MaxCompute SQL tasks use a familiar SQL-like syntax, suitable for processing massive (terabyte-scale) data in scenarios without high real-time requirements. This topic describes how to develop MaxCompute SQL node tasks in DataWorks.
Introduction
MaxCompute SQL is used to process and query data in MaxCompute. It supports common SQL operations, such as SELECT, INSERT, UPDATE, and DELETE, along with specific MaxCompute syntax and functions. You can then use SQL-like statements to query and process data without writing complex logic. For more information about the SQL syntax, see SQL overview.
Prerequisites
A MaxCompute computing resource is bound to the DataWorks workspace.
(Optional) If you use a RAM user, make sure that the user is added to the workspace and granted the Development or Workspace Admin role. Grant the Workspace Admin role with caution, because it includes extensive permissions. For more information about how to add a member to a workspace, see Add members to a workspace.
Note: If you are using an Alibaba Cloud account, you can skip this step.
Limitations
The following limitations apply when you use a MaxCompute SQL node to develop SQL tasks:
| Category | Description |
| --- | --- |
| Comments | Only single-line comments that start with `--` are supported. For more information, see MaxCompute SQL comments. |
| SQL submission | Statements that contain only SET or USE must be submitted together with another SQL statement. |
| SQL development | The SQL code cannot exceed 128 KB in size, and the number of SQL statements cannot exceed 200. |
| Query results | Only SELECT or WITH statements can output a formatted result set. If you reach the limits on query results, you can download the query results to your local computer for viewing. |
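As a hedged illustration of the SQL submission limit above, a statement block that contains only SET cannot be submitted on its own; it must be paired with another SQL statement. The example below reuses the students table that is created later in this topic:

```sql
-- A submission that contains only the SET line would be rejected,
-- so the flag is submitted together with a query.
SET odps.sql.allow.fullscan=true;
SELECT COUNT(*) FROM students WHERE gender='boy';
```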
Usage notes
Ensure the account that runs the MaxCompute SQL task has the required permissions on the corresponding MaxCompute project. For more information, see DataWorks On MaxCompute permission control and MaxCompute permissions.
MaxCompute SQL task execution consumes quota resources. If a task runs for a long time, go to the MaxCompute console to monitor resource consumption and ensure sufficient quota is available. For more information, see Compute Resources-Quota Management.
Enclose special parameters, such as an OSS address, in double quotation marks (""). Otherwise, parsing may fail and cause the task to fail.
The execution order of statements that contain keywords, such as SET and USE, varies based on the DataWorks environment in which they are run. For more information, see Appendix: SQL execution order in different environments.
In rare cases, such as an abnormal server power-off or a primary/secondary failover, DataWorks may fail to terminate the associated MaxCompute task processes. If this happens, go to the associated MaxCompute project and terminate the jobs manually.
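As a sketch of the quoting note above, the following statement encloses an OSS address in double quotation marks. The bucket, path, and endpoint are hypothetical, and the example assumes that the project is allowed to read OSS external storage:

```sql
-- Hypothetical external table over OSS data. Note the double quotation
-- marks around the OSS address.
CREATE EXTERNAL TABLE IF NOT EXISTS oss_raw_log (line STRING)
STORED AS TEXTFILE
LOCATION "oss://oss-cn-hangzhou-internal.aliyuncs.com/my-bucket/log/";
```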
Create a MaxCompute SQL node
For more information about how to create a MaxCompute SQL node, see Create a MaxCompute SQL node.
Develop a MaxCompute SQL node
On the node editing page, perform the following development tasks.
Develop SQL code
DataWorks provides scheduling parameters to dynamically pass values to variables in your code. In a MaxCompute SQL node, you can define variables in the ${variable_name} format and assign a value to the variable in the Scheduling Parameters section on the Schedule tab. For more information about the supported formats of scheduling parameters, see Sources and expressions for scheduling parameters. MaxCompute SQL commands are similar to standard SQL syntax. They support Data Definition Language (DDL), Data Manipulation Language (DML), and Data Query Language (DQL) statements, as well as specific MaxCompute syntax. For more information about the syntax and examples, see SQL overview.
The following sections provide examples for different scenarios:
Note:
- If you use a new data type in a MaxCompute 2.0 extension function, you must add `SET odps.sql.type.system.odps2=true;` before the SQL statement of the function and submit them together for execution. This command enables the new data types for the session. For more information about 2.0 data types, see Data type editions.
- The execution order of MaxCompute SQL statements varies between Data Studio and Operation Center. For more information, see Appendix: SQL execution order in different environments.
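As a minimal sketch of using a scheduling parameter in node code, assume that a variable named bizdate is assigned the expression $[yyyymmdd-1] in the Scheduling Parameters section; the table names ods_log and ods_log_src are hypothetical:

```sql
-- Write yesterday's data into a daily partition. ${bizdate} is replaced by
-- the scheduling system at run time (for example, with 20240101).
INSERT OVERWRITE TABLE ods_log PARTITION (pt='${bizdate}')
SELECT * FROM ods_log_src WHERE log_date='${bizdate}';
```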
Create a table
You can use the CREATE TABLE statement to create non-partitioned tables, partitioned tables, external tables, and clustered tables. For more information, see CREATE TABLE. The following code provides an example:
-- Create a partitioned table named students.
CREATE TABLE IF NOT EXISTS students
( id BIGINT,
name STRING,
age BIGINT,
birth DATE)
PARTITIONED BY (gender STRING);

Insert data
Use the INSERT INTO or INSERT OVERWRITE statement to add or replace data in a destination table. For more information, see Insert or overwrite data (INSERT INTO | INSERT OVERWRITE).
We recommend that you use INSERT OVERWRITE instead of INSERT INTO to prevent unexpected data duplication. For more information, see Insert or overwrite data.
The following code provides an example:
-- Insert data.
INSERT OVERWRITE students PARTITION(gender='boy') VALUES (1,'ZhangSan',15,DATE '2008-05-15') ;The INSERT statement can trigger the Compare DDL Fields feature, which validates the fields in your SELECT clause against the destination table's schema.
This feature is not supported in scenarios where a MaxCompute project has the three-layer schema model enabled but the tenant-level model is not enabled.
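Beyond static partitions, the following is a hedged sketch of a dynamic partition insert, where the partition value is taken from the query output instead of being fixed in the statement; the source table students_src is hypothetical:

```sql
-- The gender partition column is filled from the last column
-- of the SELECT list for each row.
INSERT OVERWRITE TABLE students PARTITION (gender)
SELECT id, name, age, birth, gender FROM students_src;
```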

Query data
You can use the SELECT statement to perform operations such as nested queries, group queries, and sorting. For more information, see SELECT syntax. The following code provides an example:
-- (Optional) Enable a full table scan at the project level. This operation requires high privileges.
-- SETPROJECT odps.sql.allow.fullscan=true;
-- Enable a full table scan at the session level. This is valid only for the current session.
SET odps.sql.allow.fullscan=true;
-- Query information about all male students and sort the results by ID in ascending order.
SELECT * FROM students WHERE gender='boy' ORDER BY id;

By default, a RAM user cannot query data from production tables. To do so, you must apply for permissions in Security Center. For more information about the preset data permissions and access control for MaxCompute in DataWorks, see Details about data permission control for MaxCompute. For more information about how to grant permissions using commands in MaxCompute, see Manage user permissions by using commands.
Use SQL functions
MaxCompute supports built-in functions and user-defined functions (UDFs) for data development and analysis. You can create and use SQL functions based on your business requirements. For more information about built-in functions, see Built-in functions overview. For more information about UDFs, see MaxCompute UDF overview. The following examples describe how to use SQL functions.
Built-in function: MaxCompute provides built-in functions that you can call directly. Based on the preceding examples of creating tables, inserting data, and querying data, the following sample command uses the dateadd function to change the values in the birth column based on a specified unit and interval:

-- Enable a full table scan at the session level. This is valid only for the current session.
SET odps.sql.allow.fullscan=true;
SELECT id, name, age, birth, dateadd(birth, 1, 'mm') AS birth_dateadd FROM students;

User-defined function (UDF): To use a UDF, you must write the function code, upload the code as a resource, and register the function. For more information, see Resource Management.
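The UDF workflow above can be sketched as follows. The resource name my_udf.jar, the class com.example.Lower, and the function name my_lower are all hypothetical, and the sketch assumes that the JAR has already been uploaded as a project resource:

```sql
-- Register a Java UDF from an uploaded JAR resource, then call it
-- like a built-in function.
CREATE FUNCTION my_lower AS 'com.example.Lower' USING 'my_udf.jar';
SELECT my_lower(name) FROM students;
```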
Debug a MaxCompute SQL node
On the right side of the node editing page, configure the parameters in the Run Configuration pane.
Parameter
Description
Computing resource
Select the MaxCompute computing resource that you have bound.
Computing quota
Select a computing quota that you created. The quota provides the computing resources such as CPU and memory that are required for the computing job.
If no computing quota is available, click Create Computing Quota in the drop-down list to go to the MaxCompute console and configure a quota.
Resource group
Select a scheduling resource group that has passed the connectivity test with the computing resource. For more information, see Network connection solutions.
In the parameter dialog box on the toolbar, select the created MaxCompute data source and click Run to execute the MaxCompute SQL task.
View results
Execution results are displayed in a workbook within DataWorks. From there, you can open the results or copy and paste the content into a local spreadsheet.
Note: Due to an adjustment in the China time zone information released by the International Organization for Standardization, you may encounter a time difference for specific periods when you run related SQL statements in DataWorks. The difference is 5 minutes and 52 seconds for dates and times from 1900 to 1928, and 9 seconds for dates and times before 1900.
View running logs: On the running results tab, click the LogView link. For more information, see Use Logview 2.0 to view job running information.
Sort results: On the results page, click the drop-down arrow for a field, select an ascending or descending order from the Sort list, and then click OK.
Next steps
Configure scheduling properties for a node: To run a node periodically, configure its scheduling properties, including Scheduling Policies, in the Schedule pane.
Node publishing: To deploy a task to the production environment, click the icon to start the publishing process. Nodes in the project directory are periodically scheduled only after they are published to the production environment.
Appendix: SQL execution order in different environments
The execution order of statements that contain keywords, such as SET and USE, in a MaxCompute SQL node varies based on the DataWorks environment in which they are run.
In Data Studio: All keyword statements, such as SET and USE, in the current task code are merged and executed before all other SQL statements.
In the scheduling environment: The statements are executed in the order they are written.
Assume that the following code is defined in the node.
SET a=b;
CREATE TABLE name1(id string);
SET c=d;
CREATE TABLE name2(id string);

The execution order in different environments is as follows:

| SQL statement | Data Studio | Scheduling O&M |
| --- | --- | --- |
| First SQL statement | SET a=b; SET c=d; CREATE TABLE name1(id string); | SET a=b; CREATE TABLE name1(id string); |
| Second SQL statement | SET a=b; SET c=d; CREATE TABLE name2(id string); | SET c=d; CREATE TABLE name2(id string); |
Related documents
For more examples of MaxCompute SQL tasks, see the following topics: