All Products
Search
Document Center

MaxCompute:CREATE TABLE

Last Updated:Jul 10, 2024

Creates a non-partitioned table, a partitioned table, an external table, or a clustered table.

Limits

  • A partitioned table can have a maximum of six levels of partitions. For example, if a table uses date columns as partition key columns, the six levels of the partitions are year/month/week/day/hour/minute.

  • By default, a table can have a maximum of 60,000 partitions. You can adjust the maximum number of partitions in a table based on your business requirements.

For more information about the limits on tables, see MaxCompute SQL limits.

Syntax

-- Create a table. 
 create [external] table [if not exists] <table_name>
 [(<col_name> <data_type> [not null] [default <default_value>] [comment <col_comment>], ...)]
 [comment <table_comment>]
 [partitioned by (<col_name> <data_type> [comment <col_comment>], ...)]
 -- Configure the shuffle and sort properties of a clustered table that you want to create. 
 [clustered by | range clustered by (<col_name> [, <col_name>, ...]) [sorted by (<col_name> [asc | desc] [, <col_name> [asc | desc] ...])] into <number_of_buckets> buckets] 
 -- Used only for external tables. 
 [stored by StorageHandler] 
 -- Used only for external tables. 
 [with serdeproperties (options)] 
 -- Used only for external tables. 
 [location <osslocation>] 
 -- Set the table to a transactional table. You can later modify or delete the data of the transactional table. Transactional tables have specific limits. Create a transactional table base on your business requirements. 
 [tblproperties("transactional"="true")]   
 [lifecycle <days>];

-- Create a table based on an existing table and replicate data from the existing table to the new table. Partition properties are not replicated. 
create table [if not exists] <table_name> [lifecycle <days>] as <select_statement>;

-- Create a table that has the same schema as an existing table. Data in the existing table is not replicated. 
create table [if not exists] <table_name> like <existing_table_name> [lifecycle <days>];

Parameters

Parameter

Required

Description

external

Optional.

This parameter specifies that the table you want to create is an external table.

if not exists

Optional.

If you create a table that has the same name as an existing table but you do not specify the if not exists parameter, an error is returned. If you create a table that has the same name as an existing table and you specify the if not exists parameter, a success message is returned even if the schema of the existing table is different from the schema of the table that you created. The metadata of the existing table remains unchanged.

table_name

Required.

The name of the table that you want to create. The name must be 1 to 128 bytes in length, and can contain letters, digits, and underscores (_). The name must start with a letter. The name is not case-sensitive. If the value of this parameter does not meet the requirements, an error is returned.

col_name

Optional.

The name of a table column. The name must be 1 to 128 bytes in length, and can contain letters, digits, and underscores (_). The name must start with a letter. The name is not case-sensitive. If the value of this parameter does not meet the requirements, an error is returned.

col_comment

Optional.

The comment of a column. The comment must be a valid string that is 1 to 1,024 bytes in length. If the value of this parameter does not meet the requirements, an error is returned.

data_type

Optional.

The data type of a column. The following data types are supported: BIGINT, DOUBLE, BOOLEAN, DATETIME, DECIMAL, and STRING. For more information about data types, see Data type editions.

not null

Optional.

If you specify this parameter for a column, the values of the column cannot be null. For more information about how to modify the parameter, see Change the non-nullable property of a non-partition key column in a table.

default_value

Optional.

The default value of the specified column. If a column is not specified in an INSERT operation, the default value is used for the column.

table_comment

Optional.

The comment of a table. The comment must be a valid string that is 1 to 1,024 bytes in length. If the value of this parameter does not meet the requirements, an error is returned.

partitioned by (<col_name> <data_type> [comment <col_comment>], ...

Optional.

The partition fields of a partitioned table.

  • col_name: the name of a partition key column. The name must be 1 to 128 bytes in length, and can contain letters, digits, and underscores (_). The name must start with a letter. The name is not case-sensitive. If the value of this parameter does not meet the requirements, an error is returned.

  • data_type: the data type of a partition key column. In the MaxCompute V1.0 data type edition, partition key columns must be of the STRING type. In the MaxCompute V2.0 data type edition, partition key columns can be of the TINYINT, SMALLINT, INT, BIGINT, VARCHAR, or STRING type. For more information, see Data type editions. If you use a partition field to partition a table, a full table scan is not required when you add partitions, update partition data, or read partition data. This improves the efficiency of data processing.

  • col_comment: the comment of a partition key column. The comment must be a valid string that is 1 to 1,024 bytes in length. If the value of this parameter does not meet the requirements, an error is returned.

Note

The value of a partition key column cannot contain double-byte characters, such as Chinese characters. It must start with a letter and can contain letters, digits, and supported special characters. It must be 1 to 128 bytes in length. The following special characters are supported: spaces, colons (:), underscores (_), dollar signs ($), number signs (#), periods (.), exclamation points (!), and at signs (@). The behavior of other characters is not defined, such as escaped characters \t, \n, and /.

clustered by | range clustered by (<col_name> [, <col_name>, ...]) [sorted by (<col_name> [asc | desc] [, <col_name> [asc | desc] ...])] into <number_of_buckets> buckets

Optional.

The shuffle and sort properties of the clustered table that you want to create.

Clustered tables are classified into hash-clustered tables and range-clustered tables.

  • Hash-clustered tables

    • CLUSTERED BY: the hash key. MaxCompute performs a hash operation on specified columns and distributes data to each bucket based on the hash values. To prevent data skew and hot spots and to improve the efficiency of concurrent executions, we recommend that you specify columns with a large value range and a small number of duplicate key values in CLUSTERED BY. To optimize the performance of JOIN operations, we recommend that you select commonly used join or aggregate keys. Join and aggregate keys are similar to primary keys in conventional databases.

    • SORTED BY: the sequence of fields in a bucket. To improve performance, we recommend that you keep the configuration of the SORTED BY clause consistent with that of the CLUSTERED BY clause. After you specify fields in the SORTED BY clause, MaxCompute automatically generates indexes, which can be used to accelerate data queries.

    • number_of_buckets: the number of hash buckets. This parameter is required and the value of this parameter varies based on the amount of data. By default, MaxCompute supports a maximum of 1,111 reducers. This means that MaxCompute supports a maximum of 1,111 hash buckets. You can execute the set odps.stage.reducer.num =<Number of concurrent reducers statement to increase the maximum number of hash buckets. However, the maximum number of hash buckets cannot exceed 4,000. Otherwise, performance may be affected.

      To maintain optimal performance, we recommend that you take note of the following rules when you specify the number of hash buckets:

      • Keep the size of each hash bucket around 500 MB. For example, if you want to add 1,000 hash buckets to a partition whose size is 500 GB, the size of each hash bucket is 500 MB on average. If a table contains a large amount of data, you can increase the size of each hash bucket from 500 MB to a size in the range of 2 GB to 3 GB. You can also execute the set odps.stage.reducer.num =<Number of concurrent reducers>; statement to set the maximum number of hash buckets to a value greater than 1,111.

      • To optimize the performance of JOIN operations, we recommend that you do not configure the shuffle and sort properties for hash-clustered tables. The number of hash buckets of a table must be a multiple of the number of hash buckets of the other table. For example, one table has 256 hash buckets and the other table has 512 hash buckets. We recommend that you set the number of hash buckets to 2n, such as 512, 1024, 2048, and 4096. This way, MaxCompute can automatically split and merge hash buckets. To improve execution efficiency, you can also skip the step of configuring the shuffle and sort properties for hash-clustered tables.

  • Range-clustered tables

    • RANGE CLUSTERED BY: the range-clustered columns. MaxCompute performs the bucket operation on the specified columns and distributes data to each bucket based on the bucket ID.

    • SORTED BY: the sequence of fields in a bucket. You can use this parameter in the same way as you use it for a hash-clustered table.

    • number_of_buckets: the number of hash buckets. Compared with hash-clustered tables, range-clustered tables have no limits on the number of buckets when data is evenly distributed. If you do not specify the number of buckets in a range-clustered table, MaxCompute automatically determines the optimal number based on the amount of data.

    • If JOIN and AGGREGATE operations are performed on range-clustered tables and the join key or group key is the range-clustered key or the prefix of the range-clustered key, you can manage flags to disable shuffling. This improves execution efficiency. You can set odps.optimizer.enable.range.partial.repartitioning to true to enable shuffling. By default, this parameter is set to false, which indicates that shuffling is disabled.

      Note
      • Clustered tables help optimize the following aspects:

        • Bucket pruning

        • Aggregation

        • Storage

      • Limits on clustered tables:

        • The INSERT INTO statement is not supported. You can insert data only by using the INSERT OVERWRITE statement.

        • The data that is imported by using Tunnel statements is not arranged in order. Therefore, you cannot import data into a range-clustered table by using Tunnel statements.

stored by StorageHandler

Optional.

This parameter specifies StorageHandler based on the data format of the external table.

with serdeproperties (options)

Optional.

The parameters related to the authorization, compression, and character parsing of the external table.

osslocation

Optional.

The Object Storage Service (OSS) bucket where the data of the external table is stored. For more information, see Create an OSS external table.

tblproperties("transactional"="true")

Optional.

This parameter is used to set the table to a transactional table. You can later perform the UPDATE or DELETE operation on the transactional table to update or delete data by rows. For more information, see UPDATE and DELETE.

Take note of the following limits when you use a transactional table:

  • MaxCompute allows you to set a table to a transactional table only when you create the table. If you execute the ALTER TABLE statement to change an existing table to a transactional table, an error is returned.

    alter table not_txn_tbl set tblproperties("transactional"="true");
    -- The following error is returned: 
    FAILED: Catalog Service Failed, ErrorCode: 151, Error Message: Set transactional is not supported
  • When you create a clustered table or an external table, you cannot set the clustered table or external table to a transactional table.

  • You cannot convert between transactional tables and MaxCompute internal tables, external tables, or clustered tables.

  • Jobs from other systems, such as Spark on MaxCompute, Machine Learning Platform for AI, and Graph, cannot access transactional tables.

  • CLONE TABLE and MERGE PARTITION operations are not supported.

  • Before you execute the UPDATE, DELETE, or INSERT OVERWRITE statement for important data in transactional tables, you must execute the SELECT and INSERT statements to back up the data to other tables.

lifecycle

Optional.

The lifecycle of the table. The value must be a positive integer. Unit: days.

  • Non-partitioned tables: If data in a non-partitioned table remains unchanged for the number of days specified by days after the last data update, MaxCompute executes a statement, such as DROP TABLE, to reclaim the table.

  • Partitioned tables: MaxCompute determines whether to reclaim a partition based on the value of LastModifiedTime. Unlike non-partitioned tables, a partitioned table is not deleted even if all of its partitions have been reclaimed. You can configure lifecycles for tables, but not for partitions.

Note
  • You can execute the create table [if not exists] <table_name> [lifecycle <days>] as <select_statement>; statement to create a table and replicate data to the table. However, partition properties and the lifecycle attribute of the source table are not replicated to the created table. The partition key columns of the source table are considered common columns in the created table. You can also configure the lifecycle parameter to reclaim the table.

  • You can execute the create table [if not exists] <table_name> like <existing_table_name> [lifecycle <days>]; statement to create a table that has the same schema as the source table. However, tables that are created by using this statement do not replicate data or replicate the lifecycle attribute of the source table. You can configure the lifecycle parameter to reclaim the table.

Examples:

  • Example 1: Create a non-partitioned table named test1.

    create table test1 (key STRING);
  • Example 2: Create a partitioned table named sale_detail.

    create table if not exists sale_detail(
     shop_name     STRING,
     customer_id   STRING,
     total_price   DOUBLE)
    partitioned by (sale_date STRING, region STRING); 
  • Example 3: Create a table named sale_detail_ctas1, replicate data from the sale_detail table to the sale_detail_ctas1 table, and then configure the lifecycle for the sale_detail_ctas1 table.

    create table sale_detail_ctas1 lifecycle 10 as select * from sale_detail;

    You can execute the desc extended sale_detail_ctas1; statement to view table details, such as the schema and lifecycle of a table.

    The sale_detail table is a partitioned table, but the sale_detail_ctas1 table that is created by using create table ... as select_statement ... does not replicate partition properties. The partition key columns of the source table are considered common columns in the created table. The sale_detail_ctas1 is a non-partitioned table that has five columns.

  • Example 4: Create a table named sale_detail_ctas2 and use constants as column values in the SELECT clause.

    -- Column names are specified. 
    create table sale_detail_ctas2
    as
    select shop_name, customer_id, total_price, '2013' as sale_date, 'China' as region
    from sale_detail;
    -- Column names are not specified. 
    create table sale_detail_ctas3
    as
    select shop_name, customer_id, total_price, '2013', 'China' 
    from sale_detail;
    Note

    If you use constants as column values in the SELECT clause, we recommend that you specify column names. In this example, the names of the fourth and fifth columns in the sale_detail_ctas3 table contain suffixes that are similar to _c4 and _c5.

  • Example 5: Create a table named sale_detail_like that uses the same schema as the sale_detail table and configure the lifecycle for the sale_detail_like table.

    create table sale_detail_like like sale_detail lifecycle 10;

    You can execute the desc extended sale_detail_like; statement to view table details, such as the schema and lifecycle of a table.

    The schema of the sale_detail_like table is the same as that of the sale_detail table. The two tables have the same properties, such as column names, column comments, and table comments, aside from the lifecycle. However, data in the sale_detail table is not replicated to the sale_detail_like table.

  • Example 6: Create a table named test_newtype that uses new data types.

    set odps.sql.type.system.odps2=true;
    CREATE TABLE test_newtype (
        c1 TINYINT
        ,c2 SMALLINT
        ,c3 INT
        ,c4 BIGINT
        ,c5 FLOAT
        ,c6 DOUBLE
        ,c7 DECIMAL
        ,c8 BINARY
        ,c9 TIMESTAMP
        ,c10 ARRAY<MAP<BIGINT,BIGINT>>
        ,c11 MAP<STRING,ARRAY<BIGINT>>
        ,c12 STRUCT<s1:STRING,s2:BIGINT>
        ,c13 VARCHAR(20))
    LIFECYCLE 1
    ;
  • Example 7: Create a hash-clustered table named t1. This table is a non-partitioned table.

    create table t1 (a STRING, b STRING, c BIGINT) clustered by (c) sorted by (c) into 1024 buckets; 
  • Example 8: Create a hash-clustered table named t2. This table is a partitioned table.

    create table t2 (a STRING, b STRING, c BIGINT) partitioned by (dt STRING) clustered by (c) sorted by (c) into 1024 buckets; 
  • Example 9: Create a range-clustered table named t3. This table is a non-partitioned table.

    create table t3 (a STRING, b STRING, c BIGINT) range clustered by (c) sorted by (c) into 1024 buckets;
  • Example 10: Create a range-clustered table named t4. This table is a partitioned table.

    create table t4 (a STRING, b STRING, c BIGINT) partitioned by (dt STRING) range clustered by (c) sorted by (c); 
  • Example 11: Create a transactional table named t5. This table is a non-partitioned table.

    create table t5(id bigint) tblproperties("transactional"="true");
  • Example 12: Create a transactional table t6. This table is a partitioned table.

    create table if not exists t6(id bigint) partitioned by(ds string) tblproperties ("transactional"="true");

Related statements

  • ALTER TABLE: Alters information about a table.

  • TRUNCATE: Clears data in a specified table.

  • DROP TABLE: Drops a partitioned table or a non-partitioned table.

  • DESC TABLE/VIEW: Views the information about MaxCompute internal tables, views, materialized views, external tables, clustered tables, or transactional tables.

  • SHOW: Queries SQL DDL statements that are used to list all tables and views in a project or list all partitions in a table.