ODPS SQL nodes use an SQL-like syntax and are suitable for distributed scenarios in which terabytes of data are processed and real-time processing is not required.

Prerequisites

A MaxCompute compute engine is bound to the workspace where you want to create an ODPS SQL node. The MaxCompute service is available in a workspace only after you bind a MaxCompute compute engine to the workspace on the Workspace Management page. For more information, see Configure a workspace.

Background information

Generally, it takes a long time from the preparation to the commitment of an ODPS SQL job. Therefore, ODPS SQL nodes are not suitable for services that need to process thousands to tens of thousands of transactions. ODPS SQL nodes are designed for online analytical processing (OLAP) of large amounts of data.

ODPS SQL nodes have the following limits:
  • You cannot use SET statements, USE statements, or SQL alias statements independently in the code of an ODPS SQL node. They must be executed together with other SQL statements. For example, you can use a SET statement together with a CREATE TABLE statement.
    set a=b;
    create table name(id string);
  • You cannot add comments to statements that contain keywords, including SET statements, USE statements, and SQL alias statements, in the code of an ODPS SQL node. For example, the following comment is not allowed:
    create table name(id string);
    set a=b; // Comment.
    create table name1(id string);
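    Comments on statements that do not contain these keywords are allowed. For example, the following comment, written in the MaxCompute SQL comment syntax (--), is valid:
    create table name(id string); -- Comment.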
  • The running of an ODPS SQL node during workflow development and the scheduled running of an ODPS SQL node have the following differences:
    • Running during workflow development: combines all the statements that contain keywords, including SET statements, USE statements, and SQL alias statements, in the node code and executes them before each of the other SQL statements.
    • Scheduled running: executes all SQL statements in sequence.
    set a=b;
    create table name1(id string);
    set c=d;
    create table name2(id string);
    The two running modes handle the preceding SQL statements differently:
    • First SQL statement
      • Running during workflow development:
          set a=b;
          set c=d;
          create table name1(id string);
      • Scheduled running:
          set a=b;
          create table name1(id string);
    • Second SQL statement
      • Running during workflow development:
          set a=b;
          set c=d;
          create table name2(id string);
      • Scheduled running:
          set c=d;
          create table name2(id string);
  • You must specify a scheduling parameter in the key=value format. Do not add spaces before or after the equal sign (=). Examples:
    time = {yyyymmdd hh:mm:ss} // Incorrect format: spaces appear around the equal sign.
    a =b // Incorrect format: a space appears before the equal sign.
  • If you use keywords such as bizdate and date as scheduling parameters, you must specify their values in the yyyymmdd format. If you want to use other time formats, do not use these keywords as parameter names. Example:
    bizdate=201908 // Incorrect format: the value is not in the yyyymmdd format.
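    By contrast, a value that contains a full eight-digit date conforms to the yyyymmdd requirement. The date below is only an illustration:
    bizdate=20190808 // Correct format.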
  • You can only use statements that start with SELECT, READ, or WITH to query the result data for a node during workflow development. Otherwise, no results are returned.
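    For example, the following query starts with WITH, so it returns result data during workflow development. The alias t and the column id are only illustrative:
    with t as (select 1 as id)
    select id from t;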
  • Separate multiple SQL statements with semicolons (;) and place each statement on a separate line.
    • Incorrect example
      create table1;create table2
    • Correct example
      create table1;
      create table2;
  • If extension functions in MaxCompute V2.0 use new data types, you must add set odps.sql.type.system.odps2=true; before the SQL statements that use the functions, and commit and execute the code together with the SQL statements.
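    For example, the following code enables the MaxCompute V2.0 data types before creating a table that uses TINYINT, one of the new data types. The table name odps2_test is only an illustration:
    set odps.sql.type.system.odps2=true;
    create table if not exists odps2_test (c tinyint);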

Procedure

  1. Go to the DataStudio page.
    1. Log on to the DataWorks console.
    2. In the left-side navigation pane, click Workspaces.
    3. In the top navigation bar, select the region where your workspace resides, find the workspace, and then click Data Analytics in the Actions column.
  2. On the Data Development tab, move the pointer over the Create icon and choose MaxCompute > ODPS SQL.
    Alternatively, you can click a workflow in the Business process section, right-click MaxCompute, and then choose New > ODPS SQL.
  3. In the Create Node dialog box, set the Node Name and Location parameters.
    Note The node name must be 1 to 128 characters in length and can contain letters, digits, underscores (_), and periods (.).
  4. Click Commit.
  5. Write and run the code of the ODPS SQL node.
    After the node is created, write the code of the ODPS SQL node. The code must conform to the syntax. For more information about the SQL syntax, see MaxCompute SQL overview.
    Note Due to the adjustment made by the International Organization for Standardization (ISO) to the UTC+8 time zone, a difference exists between the actual time and the output time when you execute related SQL statements in DataWorks. For dates from 1900 to 1928, the time difference is 352 seconds. For dates before 1900, the time difference is 9 seconds.
    You cannot use SET statements independently in the node code. The SET statements must be executed together with other SQL statements. For example, you can use a SET statement together with a SELECT statement.
    set odps.sql.allow.fullscan=true;
    select 1;

    For more information about the SET statements, see SET operations.

    The following example creates a table, inserts data into the table, and queries data in the table:

    1. Create a table named test1.
      CREATE TABLE IF NOT EXISTS test1 
      ( id BIGINT COMMENT '' ,
        name STRING COMMENT '' ,
        age BIGINT COMMENT '' ,
        sex STRING COMMENT '');
    2. Insert data into the table.
      INSERT INTO test1 VALUES (1,'Zhang San',43,'Male');
      INSERT INTO test1 VALUES (1,'Li Si',32,'Male');
      INSERT INTO test1 VALUES (1,'Chen Xia',27,'Female');
      INSERT INTO test1 VALUES (1,'Wang Wu',24,'Male');
      INSERT INTO test1 VALUES (1,'Ma Jing',35,'Female');
      INSERT INTO test1 VALUES (1,'Zhao Qian',22,'Female');
      INSERT INTO test1 VALUES (1,'Zhou Zhuang',55,'Male');
    3. Query data in the table.
      select * from test1;
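      During workflow development, you can also run aggregate queries on the table. The following statement, which is only an illustration, counts the rows for each value of the sex column:
      select sex, count(*) as cnt from test1 group by sex;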
    4. After you enter the preceding SQL statements in the code editor, click the Run icon in the toolbar. DataWorks executes your SQL statements from top to bottom and displays logs.
      Note
      • If multiple MaxCompute compute engines are bound to the current workspace, select one MaxCompute compute engine before you click the Run icon.
      • If the MaxCompute compute engine that you selected uses the custom resource group in pay-as-you-go mode, click the Cost Estimation icon in the toolbar to estimate the cost. The actual expenses incurred for running this node are subject to the bill.
      The INSERT INTO statement may result in unexpected data duplication. DataWorks does not re-execute the INSERT INTO statement itself, but it may rerun the node that contains the statement. We recommend that you avoid using the INSERT INTO statement. When DataWorks executes an INSERT INTO statement, the following information appears in the logs:
      The INSERT INTO statement in SQL may cause repeated data insertion. Although SQL-level retries have been revoked for the INSERT INTO statement, task-level retries may still happen. We recommend that you avoid the use of the INSERT INTO statement. 
      If you continue to use INSERT INTO statements, we deem that you are aware of the associated risks and are willing to take the consequences of potential data duplication.
    5. After the SQL statements are executed, click the Save icon in the toolbar to save the SQL code.
    6. View the query result.
      DataWorks displays the query result in a workbook.
      You can view or manage the query result in the workbook, or copy it to a local Excel file.
      You can perform the following actions on the query result:
      • Hide Column: Select one or more columns and click Hide Column in the lower part to hide the selected columns.
      • Copy Row: Select one or more rows and click Copy Row in the lower part to copy the selected rows.
      • Copy Column: Select one or more columns and click Copy Column in the lower part to copy the selected columns.
      • Copy Selected: Select one or more cells in the workbook and click Copy Selected in the lower part to copy the selected cells.
      • Data analysis: Click Data analysis in the lower part to go to the workbook editing page.
      • Search: Click Search in the lower part to search for data in the workbook. After you click the button, a search box appears in the upper-right corner of the Results tab.
      • Download: Download the result data in the GBK or UTF-8 encoding format.
  6. On the node configuration tab, click the Scheduling configuration tab in the right-side navigation pane. On the Scheduling configuration tab, set the scheduling properties for the node. For more information, see Basic properties.
  7. Commit the node.
    Notice You must set the Rerun attribute and Dependent upstream node parameters before you can commit the node.
    1. Click the Submit icon in the toolbar.
    2. In the Submit New version dialog box, set the Change description parameter and select I confirm to proceed with the commit operation.
      If an alert appears, indicating that the input and output that you set do not match with those detected in code lineage analysis, you can choose to ignore the alert or click Cancel and then modify the dependencies.
    3. Click OK.
    In a workspace in standard mode, you must click Publish in the upper-right corner after you commit the ODPS SQL node.
  8. Test the node. For more information, see View auto triggered nodes.