DataWorks provides ODPS SQL nodes that allow you to schedule MaxCompute SQL tasks regularly, integrate them with other node types, and schedule them together. MaxCompute SQL tasks can process terabytes of data in distributed scenarios that do not require real-time processing using SQL-like syntax. This topic describes the precautions and guidance for developing MaxCompute SQL tasks in DataWorks.
Prerequisites
An ODPS SQL node is created. For more information, see Create and manage MaxCompute nodes.
Background information
MaxCompute SQL is used to process and query data in MaxCompute. MaxCompute SQL supports common SQL statements, such as SELECT, INSERT, UPDATE, and DELETE statements, along with specific MaxCompute syntaxes and functions. You can use MaxCompute SQL to write SQL-like statements to query and process data without writing complex data processing logic. For more information about SQL syntax, see Overview of MaxCompute SQL.
Limits
The following table describes the limits on the development of MaxCompute SQL tasks in DataWorks.
Item | Description |
Use of comments | You cannot separately add comments to a SET or USE statement. |
You cannot use semicolons (;) in comments. | |
You cannot add comments to the end of a complete statement. If a semicolon (;) is added to the end of an SQL statement, the SQL statement is considered complete. | |
Execution of SQL statements | You cannot separately use a SET or USE statement in the code of a MaxCompute SQL node. They must be executed with other SQL statements. |
SQL development | The code of an ODPS SQL node cannot exceed 128 KB in size or contain more than 200 SQL statements. |
Query result | You can use only statements that start with SELECT or WITH to return formatted result sets. A maximum of 10,000 rows of result data can be displayed, and a maximum of 10 MB result data can be returned. Note If your query results exceed 10,000 rows, you can download the query results to your local computer for viewing:
|
Precautions
You must ensure that the account used to run your MaxCompute SQL task has permissions on the related MaxCompute project. For more information, see Usage notes for development of MaxCompute tasks in DataWorks and MaxCompute permissions.
A computing resource quota is required to run a MaxCompute SQL task. If your MaxCompute SQL task runs for a long period of time, you can log on to the MaxCompute console to view the resource consumption of the quota. You must ensure that sufficient resources are available for running the MaxCompute SQL task. For more information, see Manage quotas for computing resources in the MaxCompute console.
If special parameters, such as a parameter used to specify an OSS endpoint, are involved when you develop a task based on an ODPS SQL node, you must enclose the parameters in double quotation marks ("). Otherwise, a parsing error occurs and the task on the ODPS SQL node fails.
In some extreme cases (such as unexpected server power outages or primary/secondary failover), DataWorks may not be able to completely terminate the related MaxCompute task processes. In this case, go to the corresponding MaxCompute project to terminate the job.
The sequence in which the SET and USE statements are executed in the code of an ODPS SQL node varies based on the environment in DataWorks.
DataStudio: The SET and USE statements are combined in the node code and are executed before you execute other SQL statements.
Scheduling environment: All statements are executed in sequence.
Sample code that is defined in a MaxCompute SQL node:
SET a=b; CREATE TABLE name1(id string); SET c=d; CREATE TABLE name2(id string);
The execution order for different environments is as follows:
SQL statement
DataStudio
Scheduling environment
First SQL statement
SET a=b; SET c=d; CREATE TABLE name1(id string);
SET a=b; CREATE TABLE name1(id string);
Second SQL statement
SET a=b; SET c=d; CREATE TABLE name2(id string);
SET c=d; CREATE TABLE name2(id string);
Simple code editing example
SQL statements
The MaxCompute SQL syntax is similar to standard SQL syntax. DDL, DML, and DQL statements and MaxCompute-specific statements are supported. For information about the syntax requirements and usage examples of each SQL statement, see Overview of MaxCompute SQL. The following example shows how to write and execute SQL statements.
If new data types are used for the additional functions of MaxCompute V2.0, you must add
SET odps.sql.type.system.odps2=true;
before the SQL statements that use the functions, and commit and execute the SET statement together with the SQL statements. For more information about data types in MaxCompute V2.0, see MaxCompute V2.0 data type edition.DataWorks provides scheduling parameters. To enable parameters to be dynamically passed in the code of an ODPS SQL node, you can define variables in the node code in the
${Variable name}
format, and assign parameters to the variables as values in the Scheduling Parameter section of the Parameters tab. For information about the supported formats of scheduling parameters, see Supported formats of scheduling parameters.
Create a table
You can execute the
CREATE TABLE
statement to create a non-partitioned table, a partitioned table, an external table, or a clustered table. For more information, see CREATE TABLE. Sample SQL statement:-- Create a partitioned table named test1 CREATE TABLE if NOT EXISTS students ( id BIGINT, name STRING, age BIGINT, birth DATE) partitioned BY (gender STRING);
Insert data
You can execute the
INSERT INTO
orINSERT OVERWRITE
statement to insert data into a destination table. For more information, see Insert data into or overwrite data in a table or a static partition (INSERT INTO and INSERT OVERWRITE). Sample SQL statements:-- Insert data INSERT INTO students PARTITION(gender='boy') VALUES (1,'John',15,DATE '2008-05-15') ; INSERT INTO students PARTITION(gender='boy') VALUES (2,'Jack',17,DATE '2006-07-20') ; INSERT INTO students PARTITION(gender='girl') VALUES (3,'Alice',20,DATE '2003-04-20') ; INSERT INTO students PARTITION(gender='girl') VALUES (4,'Lily',21,DATE '2002-01-08') ; INSERT INTO students PARTITION(gender='boy') VALUES (5,'Bob',17,DATE '2006-09-12') ;
ImportantThe
INSERT INTO
statement may result in unexpected data duplication. We recommend that you execute theINSERT OVERWRITE
statement, instead of the INSERT INTO statement. For more information, see Insert data into or overwrite data in a table or a static partition (INSERT INTO and INSERT OVERWRITE).Query data
You can execute
SELECT
statements to perform operations, such as nested queries, sorting, and queries by group. For more information, see SELECT syntax. Sample SQL statements:-- Enable the full table scan feature, which is valid only for the current session. SET odps.sql.allow.fullscan=true; -- Query the information about all boys and sort the information by ID in ascending order. SELECT * FROM students WHERE gender='boy' ORDER BY id;
NoteBy default, RAM users do not have the required permissions to query MaxCompute tables in the production environment. If you want to query MaxCompute tables in the production environment as a RAM user, you must request the required permissions for the RAM user in Security Center in the DataWorks console. For information about the permissions of built-in workspace-level roles on MaxCompute data and how to manage the permissions, see Manage permissions on data in a MaxCompute compute engine instance. For information about how to grant permissions to users by running commands, see Manage user permissions using commands.
SQL functions
MaxCompute lets you use built-in functions and user-defined functions (UDFs) for data development and analysis. For information about built-in functions, see Overview of built-in functions. For information about UDFs, see Overview of UDFs. The following example shows how to use SQL functions.
Built-in functions
MaxCompute provides many built-in functions. You can directly call these built-in functions. You can use the
dateadd
function to change data in thebirth
column by specified unit and specified interval based on the example in the preceding section. Sample SQL statements:-- Enable the full table scan feature, which is valid only for the current session. SET odps.sql.allow.fullscan=true; SELECT id, name, age, birth, dateadd(birth,1,'mm') AS birth_dateadd FROM students;
UDFs
To use a UDF, you must write the code, upload the code as a resource, and register the function. For more information, see Create and use a MaxCompute UDF.
Running results and returned data
Running results are displayed in a workbook. You can perform operations on the running results in DataWorks, view or manage the running results in the workbook, or copy the running results to an Excel file on your local computer. For more information, see Debugging procedure.
NoteDue to the adjustment made by the International Organization for Standardization (ISO) on the UTC+8 time zone, differences exist between the actual time and the output time when you execute related SQL statements in DataWorks. In a year from 1900 to 1928, the time difference is 352 seconds. Before the year of 1900, the time difference is 9 seconds.
Operation logs: You can click the Runtime Log tab to view the operation logs in LogView. For more information, see Use LogView V2.0 to view job information.
Returned data:
Query the information about all boys and sort the information by ID in ascending order.
+------------+------------+------------+------------+------------+ | id | name | age | birth | gender | +------------+------------+------------+------------+------------+ | 1 | John | 15 | 2008-05-15 | boy | | 2 | Jack | 17 | 2006-07-20 | boy | | 5 | Bob | 17 | 2006-09-12 | boy | +------------+------------+------------+------------+------------+
Change data in the birth column by specified unit and specified interval.
+------------+------------+------------+------------+---------------+ | id | name | age | birth | birth_dateadd | +------------+------------+------------+------------+---------------+ | 4 | Lily | 21 | 2002-01-08 | 2002-02-08 | | 3 | Alice | 20 | 2003-04-20 | 2003-05-20 | | 2 | Jack | 17 | 2006-07-20 | 2006-08-20 | | 1 | John | 15 | 2008-05-15 | 2008-06-15 | | 5 | Bob | 17 | 2006-09-12 | 2006-10-12 | +------------+------------+------------+------------+---------------+
On the Result tab, you can sort result data by specific field and perform operations such as download on the result data. For more information, see Process query results.
Advanced code editing examples
The following example demonstrates a more complex ODPS SQL task:
FAQ
Q: Why does my MaxCompute SQL task run for a long period of time?
A: A computing resource quota is required to run a MaxCompute SQL task. If your MaxCompute SQL task runs for a long period of time, you can log on to the MaxCompute console to view the resource consumption of the quota. You must ensure that sufficient resources are available for running the MaxCompute SQL task. For more information, see Manage quotas for computing resources in the MaxCompute console.
Q: When I run a MaxCompute SQL task, the error message "You have No privilege 'odps:xxxx' on xxxx" is displayed. What do I do?
A: You must ensure that the account used to run the MaxCompute SQL task has permissions on the related MaxCompute project. For more information, see MaxCompute permissions.