
MaxCompute:OSS external tables

Last Updated: Oct 12, 2025

In MaxCompute, you can create an Object Storage Service (OSS) external table that maps to an OSS directory. This lets you access unstructured data from files in the OSS directory or write data from a MaxCompute project to the OSS directory. This topic describes the syntax and parameters for creating OSS external tables and reading data from or writing data to them.

Prerequisites

  • Your Alibaba Cloud account, RAM user, or RAM role must have the required permissions to access OSS external tables. For more information about authorization, see STS authorization for OSS.

  • (Optional) You have created an OSS bucket, OSS directories, and OSS data files. For more information, see Create buckets, Manage directories, and Simple upload.

    MaxCompute can automatically create directories in OSS. You can use a single SQL statement to perform read and write operations that involve both external tables and UDFs. Manual directory creation is no longer required, but the legacy method is still supported.
  • You have created a MaxCompute project. For more information, see Create a MaxCompute project.

    MaxCompute is deployed only in specific regions. To prevent cross-region data connectivity issues, use a bucket in the same region as your MaxCompute project.
  • Your Alibaba Cloud account or RAM user has the CreateTable permission on your project. For more information about table operation permissions, see MaxCompute permissions.

Notes

  • OSS external tables only record mappings to OSS directories. Deleting an OSS external table does not delete the data files in the mapped OSS directory.

  • If OSS data files are archived objects, you must first restore the objects.

  • Use the classic network endpoint for OSS. MaxCompute does not guarantee network connectivity for public endpoints.

Limits

  • The cluster property is not supported for OSS external tables.

  • A single file cannot exceed 3 GB. If a file is larger than 3 GB, you must split it.

Entry points

You can create OSS external tables and read data from or write data to them using either of the following methods:

  • MaxCompute SQL

  • Visualization tools

Create an OSS external table

OSS external tables can be partitioned or non-partitioned. The type of external table that you create depends on how the data files are stored in OSS. If data files are stored in partitioned directories, create a partitioned external table. Otherwise, create a non-partitioned external table.

Notes

The CREATE EXTERNAL TABLE statement requires different parameters depending on the data file format. To avoid errors during read or write operations, ensure you use the correct syntax and parameters for your specific business needs.

Syntax

  • Scenario: Create an external table using the built-in text data parser


    CREATE EXTERNAL TABLE [IF NOT EXISTS] <mc_oss_extable_name>
    (
      <col_name> <data_type>,
      ...
    )
    [COMMENT <table_comment>]
    [PARTITIONED BY (<col_name> <data_type>, ...)]
    STORED BY '<StorageHandler>'
    WITH SERDEPROPERTIES (
      ['<property_name>'='<property_value>',...]
    )
    LOCATION '<oss_location>';

    Supported data file formats for reading data from or writing data to OSS:

    • CSV

    • TSV

    • CSV or TSV files compressed in GZIP, SNAPPY, or LZO format
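    For reference, a minimal sketch of a table that reads GZIP-compressed CSV files with the built-in text parser. The table name and directory reuse the sample data from the appendix, and the odps.text.option.gzip.* property names are assumptions based on the built-in text parser documentation:

    ```sql
    CREATE EXTERNAL TABLE IF NOT EXISTS mc_oss_csv_gz_extable
    (
      vehicleId INT,
      recordId  INT
    )
    STORED BY 'com.aliyun.odps.CsvStorageHandler'  -- built-in CSV parser
    WITH SERDEPROPERTIES (
      -- Read and write GZIP-compressed CSV files (assumed property names).
      'odps.text.option.gzip.input.enabled'='true',
      'odps.text.option.gzip.output.enabled'='true'
    )
    LOCATION 'oss://oss-cn-hangzhou-internal.aliyuncs.com/oss-mc-test/Demo1/';
    ```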

  • Scenario: Create an external table using the built-in open source data parser


    CREATE EXTERNAL TABLE [IF NOT EXISTS] <mc_oss_extable_name>
    (
      <col_name> <data_type>,
      ...
    )
    [COMMENT <table_comment>]
    [PARTITIONED BY (<col_name> <data_type>, ...)]
    [ROW FORMAT SERDE '<serde_class>'
      [WITH SERDEPROPERTIES (
        ['<property_name>'='<property_value>',...])
      ]
    ]
    STORED AS <file_format>
    LOCATION '<oss_location>'
    [USING '<resource_name>']
    [TBLPROPERTIES ('<tbproperty_name>'='<tbproperty_value>',...)];

    Supported data file formats for reading data from or writing data to OSS:

    • PARQUET

    • TEXTFILE

    • ORC

    • RCFILE

    • AVRO

    • JSON

    • SEQUENCEFILE

    • Hudi (Only supports reading Hudi data generated by Data Lake Formation (DLF).)

    • PARQUET files compressed in ZSTD, SNAPPY, or GZIP format

    • ORC files compressed in SNAPPY or ZLIB format

    • TEXTFILE files compressed in GZIP, SNAPPY, or LZO format
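    As a sketch, an external table for PARQUET data created with the open source parser needs only STORED AS and a location. The table name and directory are placeholders:

    ```sql
    CREATE EXTERNAL TABLE IF NOT EXISTS mc_oss_parquet_extable
    (
      vehicleId INT,
      recordId  INT
    )
    STORED AS PARQUET
    LOCATION 'oss://oss-cn-hangzhou-internal.aliyuncs.com/oss-mc-test/Demo3/';
    ```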

  • Scenario: Create an external table using a custom parser


    CREATE EXTERNAL TABLE [IF NOT EXISTS] <mc_oss_extable_name>
    (
      <col_name> <data_type>,
      ...
    )
    [COMMENT <table_comment>]
    [PARTITIONED BY (<col_name> <data_type>, ...)]
    STORED BY '<StorageHandler>'
    WITH SERDEPROPERTIES (
      ['<property_name>'='<property_value>',...]
    )
    LOCATION '<oss_location>'
    USING '<jar_name>';

    Supported data file formats for reading data from or writing data to OSS: data files in formats other than the preceding ones.

Parameters

The following table describes the common parameters for external tables of all formats. For parameters that are unique to a specific format, see the documentation for that format.

  • Basic syntax parameters

    • mc_oss_extable_name (required)

      The name of the OSS external table that you want to create.

      Table names are not case-sensitive. When you query external tables, you do not need to distinguish between uppercase and lowercase letters. Forced case conversions are not supported.

    • col_name (required)

      The name of a column in the OSS external table.

      When you read data from OSS, the schema of the created OSS external table must be the same as the schema of the OSS data files. Otherwise, you cannot read the OSS data.

    • data_type (required)

      The data type of a column in the OSS external table.

      When you read data from OSS, the data types of columns in the OSS external table must be the same as the data types of columns in the OSS data files. Otherwise, you cannot read the OSS data.

    • table_comment (optional)

      The comment of the table. The comment must be a valid string of up to 1,024 bytes. Otherwise, an error is returned.

    • partitioned by (col_name data_type, ...) (optional)

      If data files in OSS are stored in partitioned directories, this parameter is required to create a partitioned table.

      • col_name: the name of a partition key column.

      • data_type: the data type of a partition key column.

    • '<(tb)property_name>'='<(tb)property_value>' (required)

      The extended properties of the OSS external table. For more information about the properties, see the documentation for the specific format.

    • oss_location (required)

      The OSS path where the data files are stored. By default, MaxCompute reads all data files in this path.

      Format: oss://<oss_endpoint>/<Bucket name>/<OSS directory name>/.

      • oss_endpoint: the OSS endpoint. Use the classic network endpoint provided by OSS, which is the endpoint that contains -internal.

        Example: oss://oss-cn-beijing-internal.aliyuncs.com/xxx.

        For more information about OSS classic network endpoints, see OSS Regions and Endpoints.

        We recommend that the OSS region where data files are stored be the same as the region where your MaxCompute project is located. If they are in different regions, data connectivity issues may occur.

        You can also omit the endpoint. If you do not specify an endpoint, the system uses the endpoint of the region where the current project is located by default. This method is not recommended: if the file storage is cross-region, data connectivity issues may occur.

      • Bucket name: the name of the OSS bucket.

        Example: oss://oss-cn-beijing-internal.aliyuncs.com/your_bucket/path/.

        For more information about bucket names, see List buckets.

      • Directory name: the name of the OSS directory. Do not specify a file name after the directory.

        Example: oss://oss-cn-beijing-internal.aliyuncs.com/oss-mc-test/Demo1/.

        Incorrect examples:

        -- HTTP connections are not supported.
        http://oss-cn-shanghai-internal.aliyuncs.com/oss-mc-test/Demo1/
        -- HTTPS connections are not supported.
        https://oss-cn-shanghai-internal.aliyuncs.com/oss-mc-test/Demo1/
        -- The endpoint is invalid.
        oss://oss-cn-shanghai-internal.aliyuncs.com/Demo1
        -- Do not specify a file name.
        oss://oss-cn-shanghai-internal.aliyuncs.com/oss-mc-test/Demo1/vehicle.csv

      • Permission specification (RAM role):

        • Explicitly specify a role (recommended): create a custom role, attach an access policy to it, and use its ARN. For more information, see STS authorization for OSS.

        • Use the default role (not recommended): use the ARN of the aliyunodpsdefaultrole role.
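    As a sketch of explicitly specifying a custom RAM role through the table properties (the role name custom_oss_role is hypothetical; the bucket and directory reuse the appendix sample data):

    ```sql
    CREATE EXTERNAL TABLE IF NOT EXISTS mc_oss_extable_sts
    (
      id   INT,
      name STRING
    )
    STORED BY 'com.aliyun.odps.CsvStorageHandler'
    WITH SERDEPROPERTIES (
      -- ARN of a custom RAM role that is granted access to OSS (hypothetical name).
      'odps.properties.rolearn'='acs:ram::<uid>:role/custom_oss_role'
    )
    LOCATION 'oss://oss-cn-hangzhou-internal.aliyuncs.com/oss-mc-test/Demo1/';
    ```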

  • WITH serdeproperties property: odps.properties.rolearn

    • Scenario: add this property if you use Security Token Service (STS) authorization.

    • Value: the Alibaba Cloud Resource Name (ARN) of a RAM role that is granted permissions to access OSS. You can obtain the ARN from the role details in the RAM console. Example: acs:ram::xxxxxx:role/aliyunodpsdefaultrole.

    • Default value:

      • If the owners of MaxCompute and OSS are the same Alibaba Cloud account:

        • If you do not specify odps.properties.rolearn in the statement that is used to create a table, the ARN of the aliyunodpsdefaultrole role is used by default.

        • If you want to use the ARN of a custom role, you must create the custom role in advance. For more information, see STS authorization for OSS (Method 2).

      • If the owners of MaxCompute and OSS are different Alibaba Cloud accounts, you must specify the ARN of a custom role. For more information about how to create a custom role, see STS authorization for OSS (Method 3).

Syntax for adding partition data to an OSS external table

If you created a partitioned OSS external table, you must add partition data to it using one of the following methods:

  • Method 1 (recommended): Automatically parse the OSS directory to identify partitions and add the partition information to the OSS external table.

    This method is suitable for scenarios where you want to add all missing historical partitions at once. MaxCompute automatically adds partitions to the OSS external table based on the partition directory that you specified when you created the OSS external table. You do not need to add partitions one by one based on partition key column names and values.

    MSCK REPAIR TABLE <mc_oss_extable_name> ADD PARTITIONS [WITH PROPERTIES (key:VALUE, key:VALUE ...)];
    This method is not suitable for processing incremental data, especially when the OSS directory contains many partitions, such as more than 1,000. If the number of new partitions is much smaller than the number of existing ones, frequent use of the msck command causes repeated scans of the OSS directory and frequent metadata updates, which reduces the command's execution efficiency. For scenarios where you need to update incremental partitions, we recommend that you use Method 2.
  • Method 2: Manually run commands to add partition information to the OSS external table.

    This method is suitable for scenarios where historical partitions are already created and you need to add new partitions frequently. Before you run a data write task, you must create partitions. You do not need to refresh the partitions when new data is written to OSS. The external table can then read the latest data directly from the OSS directory.

    ALTER TABLE <mc_oss_extable_name>
      ADD PARTITION (<col_name>=<col_value>)
      [ADD PARTITION (<col_name>=<col_value>) ...]
      [LOCATION URL];

    The col_name and col_value values must match the names of the directories where the partition data files are stored.

    For example, assume the OSS directory structure for partition data files is as shown in the following figure. col_name corresponds to direction, and col_value corresponds to N, NE, S, SW, W. Each add partition statement corresponds to one subdirectory. To add multiple OSS subdirectories, you must run multiple add partition statements.

    (Figure: partition directory structure)
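    For the directory layout above, the partitions could be added one subdirectory at a time, as in this sketch (the table name is a placeholder):

    ```sql
    ALTER TABLE mc_oss_extable_pt ADD PARTITION (direction = 'N');
    ALTER TABLE mc_oss_extable_pt ADD PARTITION (direction = 'NE');
    -- Repeat for the remaining subdirectories: S, SW, W.
    ```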

Read OSS data

Notes

  • After you create an OSS external table, you can use it to read data from OSS. For more information about the OSS data file formats that are supported by MaxCompute and the syntax for creating OSS external tables, see Syntax.

  • If an SQL statement involves complex data types, you must add the set odps.sql.type.system.odps2=true; command before the SQL statement and submit them together. For more information about data types, see Data Type Version Guide.

  • For OSS external tables that are mapped to open source data, you must run set odps.sql.hive.compatible=true; at the session level before you read data from OSS. Otherwise, an error is returned.

  • OSS provides limited outbound bandwidth. If the data read and write traffic exceeds the bandwidth limit of an instance within a short period, the read and write speed of OSS external tables is affected. For more information about OSS bandwidth, see Limits and performance metrics.

Syntax

<select_statement> FROM <from_statement>;
  • select_statement: the SELECT clause that is used to query data from the source table.

  • from_statement: the FROM clause that specifies the data source, such as an external table name.

Non-partitioned data

After you create a non-partitioned OSS external table, you can use one of the following methods to read data from OSS:

  • Method 1 (recommended): Import data from OSS into a MaxCompute internal table, and then query the internal table.

    This method is suitable for scenarios where you must repeatedly compute data or have high requirements for computing performance. You can create an internal table in your MaxCompute project that has the same schema as the OSS external table, import data from OSS to the internal table, and then execute complex queries. This method lets you take full advantage of the MaxCompute optimization mechanism for internal storage to improve computing performance. Example:

    CREATE TABLE <table_internal> LIKE <mc_oss_extable_name>;
    INSERT OVERWRITE TABLE <table_internal> SELECT * FROM <mc_oss_extable_name>;
  • Method 2: Directly read data from OSS in the same way you read data from a MaxCompute internal table.

    This method is suitable for scenarios where you do not have high requirements for computing performance. Unlike reading data from an internal table, each read operation retrieves data directly from OSS.

Partitioned data

When you query an OSS external table, MaxCompute performs a full table scan on all data in the OSS directory, including data in subdirectories. If the data volume is large, a full directory scan causes unnecessary I/O consumption and increases data processing time. You can use one of the following methods to resolve this issue:

  • Method 1 (Recommended): Store data on OSS using either the standard partition path or a custom partition path format.

    When you create an OSS external table, you must specify the partition and oss_location information in the creation statement. We recommend that you store OSS data in standard partition directories.

  • Method 2: Use multiple data storage paths.

    Create multiple OSS external tables to read data from different paths. Each OSS external table maps to a subset of the OSS data. We do not recommend this method because it is cumbersome and results in poor data management.

Standard partition directory format

oss://<oss_endpoint>/<Bucket name>/<Directory name>/<partitionKey1=value1>/<partitionKey2=value2>/...

Example: A company stores log files that are generated every day in CSV format in OSS and processes the data on a daily basis using MaxCompute. You can specify the following standard partition directories for OSS data:

oss://oss-odps-test/log_data/year=2016/month=06/day=01/logfile
oss://oss-odps-test/log_data/year=2016/month=06/day=02/logfile
oss://oss-odps-test/log_data/year=2016/month=07/day=10/logfile
oss://oss-odps-test/log_data/year=2016/month=08/day=08/logfile
...
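For the log directories above, a hedged sketch of the matching partitioned table and a one-shot partition discovery (the table name log_table_external is reused from the custom-partition example later in this topic; the role ARN is a placeholder):

```sql
-- Partition key columns mirror the year=/month=/day= directory levels.
CREATE EXTERNAL TABLE IF NOT EXISTS log_table_external
(
  content STRING
)
PARTITIONED BY (year STRING, month STRING, day STRING)
STORED BY 'com.aliyun.odps.CsvStorageHandler'
WITH SERDEPROPERTIES (
  'odps.properties.rolearn'='acs:ram::<uid>:role/aliyunodpsdefaultrole'
)
LOCATION 'oss://oss-cn-hangzhou-internal.aliyuncs.com/oss-odps-test/log_data/';

-- Discover and add all existing year=/month=/day= partitions in one pass.
MSCK REPAIR TABLE log_table_external ADD PARTITIONS;
```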

Custom partition directory format

In a custom partition directory format, only partition column values are included, not partition column names. Example:

oss://oss-odps-test/log_data_customized/2016/06/01/logfile
oss://oss-odps-test/log_data_customized/2016/06/02/logfile
oss://oss-odps-test/log_data_customized/2016/07/10/logfile
oss://oss-odps-test/log_data_customized/2016/08/08/logfile
...

Data in OSS is stored by partition, but the path is not in the standard partition directory format. MaxCompute lets you bind subdirectories to different partitions to access the data within them.

Solution: After you create an OSS external table, run the alter table ... add partition ... location ... command to specify subdirectories and bind them to partitions. Example:

ALTER TABLE log_table_external ADD PARTITION (year = '2016', month = '06', day = '01')
location 'oss://oss-cn-hangzhou-internal.aliyuncs.com/bucket-name/oss-odps-test/log_data_customized/2016/06/01/';
ALTER TABLE log_table_external ADD PARTITION (year = '2016', month = '06', day = '02')
location 'oss://oss-cn-hangzhou-internal.aliyuncs.com/bucket-name/oss-odps-test/log_data_customized/2016/06/02/';
ALTER TABLE log_table_external ADD PARTITION (year = '2016', month = '07', day = '10')
location 'oss://oss-cn-hangzhou-internal.aliyuncs.com/bucket-name/oss-odps-test/log_data_customized/2016/07/10/';
ALTER TABLE log_table_external ADD PARTITION (year = '2016', month = '08', day = '08')
location 'oss://oss-cn-hangzhou-internal.aliyuncs.com/bucket-name/oss-odps-test/log_data_customized/2016/08/08/';

Query optimization

  • Dynamic statistics collection

    Because data is stored in an external data lake and lacks pre-computed statistics, the query optimizer adopts a conservative strategy, which can result in low query efficiency. The dynamic statistics collection feature allows the optimizer to temporarily collect table statistics during query execution to identify small tables. The optimizer can then use this information to actively use hash joins, optimize join orders, reduce data shuffling, and shorten execution pipelines to optimize queries.

    SET odps.meta.exttable.stats.onlinecollect=true;
    SELECT * FROM <tablename>;
  • Optimizing External Table Splits

    You can optimize query efficiency by setting the split size parameter, which adjusts the amount of data processed by a single concurrent task. The split size setting has the following effects:

    • If the table contains a large amount of data and the split size is too small, too many splits are created. This results in a high degree of parallelism, and instances spend most of their time queuing for resources.

    • If the table contains a small amount of data and the split size is too large, too few splits are created. This results in insufficient concurrency, and most resources remain idle.

    SET odps.stage.mapper.split.size=<value>;
    SELECT * FROM <tablename>;

Write data to OSS

Just as you can read data from OSS, MaxCompute lets you write data from internal tables, or results computed from external tables, to OSS.

Syntax

INSERT {INTO|OVERWRITE} TABLE <table_name> PARTITION (<ptcol_name>[, <ptcol_name> ...]) 
<select_statement> FROM <from_statement>;

  • table_name (required): the name of the external table to which you want to write data.

  • select_statement (required): the SELECT clause that queries the data to insert into the destination table. If the destination table has only one level of dynamic partitions, the value of the last field in the SELECT clause is the dynamic partition value of the destination table. The values for the output partitions are mapped by the order of the fields in the SELECT clause, not by field names. If the field order in the source table differs from that in the destination table, specify the fields in select_statement in the same order as the fields in the destination table.

  • from_statement (required): the FROM clause that specifies the data source, such as the name of the internal table from which you want to read data.

To insert data into dynamic partitions, see Insert or overwrite dynamic partition data (DYNAMIC PARTITION).
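A minimal sketch of a dynamic-partition write (the table and column names are assumptions; the last SELECT field supplies the partition value):

```sql
INSERT OVERWRITE TABLE mc_oss_extable_pt PARTITION (direction)
SELECT vehicleId,
       recordId,
       direction   -- last field supplies the dynamic partition value
FROM   vehicle_internal_table;
```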

Notes

  • If the INSERT OVERWRITE ... SELECT ... FROM ...; operation allocates 1,000 mappers for the source table from_tablename, 1,000 TSV or CSV files are generated.

  • You can control the number of generated files using the configurations that MaxCompute provides.

    • Output in Mapper: Use odps.stage.mapper.split.size to control the concurrency of mappers and adjust the number of generated files.

    • Output in Reducer or Joiner: Use odps.stage.reducer.num and odps.stage.joiner.num to adjust the number of generated files.
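    For example, the following illustrative settings (the values are assumptions, not recommendations) coarsen the mapper splits and pin the reducer count before a write:

    ```sql
    -- Larger split size -> fewer mappers -> fewer output files.
    SET odps.stage.mapper.split.size=512;
    -- An explicit reducer count caps the number of files produced by reducer output.
    SET odps.stage.reducer.num=10;
    INSERT OVERWRITE TABLE mc_oss_extable SELECT * FROM source_table;
    ```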

Write data to OSS using the OSS multipart upload feature

To write data to OSS in an open source format, you can use an OSS external table that is based on an open source data parser together with the OSS multipart upload feature to perform an INSERT operation.

You can enable the OSS multipart upload feature using the following settings:

  • Project level (the setting takes effect for the entire project):

    setproject odps.sql.unstructured.oss.commit.mode=true;

  • Session level (the setting takes effect only for the current task):

    set odps.sql.unstructured.oss.commit.mode=true;

The default value of the odps.sql.unstructured.oss.commit.mode property is false. The following table describes the implementation principles for each value of this property:

  • false: the data that MaxCompute writes to an OSS external table is stored in the .odps folder in the LOCATION directory. The .odps folder contains a .meta file that is used to ensure the consistency of MaxCompute data. Only MaxCompute can correctly process the content in the .odps folder; other data processing engines may not be able to correctly parse the content, which results in errors.

  • true: MaxCompute uses the multipart upload feature, which is compatible with other data processing engines. MaxCompute ensures data consistency using a two-phase commit mechanism. This process avoids creating the .odps directory and .meta file.
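For example, a session-level sketch that enables the two-phase-commit write for a single task (the table names are placeholders):

```sql
-- Enable multipart-upload-based commit for this task only.
SET odps.sql.unstructured.oss.commit.mode=true;
INSERT OVERWRITE TABLE mc_oss_parquet_extable SELECT * FROM source_table;
```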

Manage exported files

Parameters

When you need to add a prefix, suffix, or extension to the data files written to OSS, you can set the following parameters.

  • odps.external.data.output.prefix (compatible with odps.external.data.prefix)

    • Scenario: add this property when you need to add a custom prefix to the output files. The prefix can contain only letters, digits, and underscores (a-z, A-Z, 0-9, _) and must be 1 to 10 characters in length.

    • Value: a valid combination of characters, such as 'mc_'.

    • Default value: none.

  • odps.external.data.enable.extension

    • Scenario: add this property when you need to display the extension of the output files. True indicates that the extension of the output files is displayed; False indicates that it is not.

    • Value: True or False.

    • Default value: False.

  • odps.external.data.output.suffix

    • Scenario: add this property when you need to add a custom suffix to the output files. The suffix can contain only letters, digits, and underscores (a-z, A-Z, 0-9, _).

    • Value: a valid combination of characters, such as '_hangzhou'.

    • Default value: none.

  • odps.external.data.output.explicit.extension

    • Scenario: add this property when you need to add a custom extension to the output files. The extension can contain only letters, digits, and underscores (a-z, A-Z, 0-9, _) and must be 1 to 10 characters in length. This parameter has a higher priority than the odps.external.data.enable.extension parameter.

    • Value: a valid combination of characters, such as "jsonl".

    • Default value: none.

Usage examples

  1. Set the custom prefix of the output OSS file to test06_. The DDL statement is as follows:

    CREATE EXTERNAL TABLE  <mc_oss_extable_name>
    (
      vehicleId INT,
      recordId INT,
      patientId INT,
      calls INT,
      locationLatitude DOUBLE,
      locationLongitude DOUBLE,
      recordTime STRING,
      direction STRING
    )
    ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
    WITH serdeproperties (
      'odps.properties.rolearn'='acs:ram::<uid>:role/aliyunodpsdefaultrole'
    ) 
    STORED AS textfile
    LOCATION 'oss://oss-cn-beijing-internal.aliyuncs.com/***/'
    TBLPROPERTIES (
    -- Set a custom prefix.
        'odps.external.data.output.prefix'='test06_') 
    ;
    
    -- Write data to the external table.
    INSERT INTO  <mc_oss_extable_name> VALUES (1,32,76,1,63.32106,-92.08174,'9/14/2014 0:10','NW');

    The following figure shows the generated file.


  2. Set the custom suffix of the output OSS file to _beijing. The DDL statement is as follows:

    CREATE EXTERNAL TABLE <mc_oss_extable_name>
    (
      vehicleId INT,
      recordId INT,
      patientId INT,
      calls INT,
      locationLatitude DOUBLE,
      locationLongitude DOUBLE,
      recordTime STRING,
      direction STRING
    )
    ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
    WITH serdeproperties (
      'odps.properties.rolearn'='acs:ram::<uid>:role/aliyunodpsdefaultrole'
    ) 
    STORED AS textfile
    LOCATION 'oss://oss-cn-beijing-internal.aliyuncs.com/***/'
    TBLPROPERTIES (
    -- Set a custom suffix.
        'odps.external.data.output.suffix'='_beijing') 
    ;
    
    -- Write data to the external table.
    INSERT INTO <mc_oss_extable_name> VALUES (1,32,76,1,63.32106,-92.08174,'9/14/2014 0:10','NW');

    The following figure shows the generated file.


  3. An extension is automatically generated for the output OSS file. The DDL statement is as follows:

    CREATE EXTERNAL TABLE <mc_oss_extable_name>
    (
      vehicleId INT,
      recordId INT,
      patientId INT,
      calls INT,
      locationLatitude DOUBLE,
      locationLongitude DOUBLE,
      recordTime STRING,
      direction STRING
    )
    ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
    WITH serdeproperties (
      'odps.properties.rolearn'='acs:ram::<uid>:role/aliyunodpsdefaultrole'
    ) 
    STORED AS textfile
    LOCATION 'oss://oss-cn-beijing-internal.aliyuncs.com/***/'
    TBLPROPERTIES (
    -- Automatically generate an extension.
        'odps.external.data.enable.extension'='true') 
    ;
    
    -- Write data to the external table.
    INSERT INTO <mc_oss_extable_name> VALUES (1,32,76,1,63.32106,-92.08174,'9/14/2014 0:10','NW');

    The figure below shows the generated file.

  4. Set the custom extension of the output OSS file to jsonl. The DDL statement is as follows:

    CREATE EXTERNAL TABLE <mc_oss_extable_name>
    (
      vehicleId INT,
      recordId INT,
      patientId INT,
      calls INT,
      locationLatitude DOUBLE,
      locationLongitude DOUBLE,
      recordTime STRING,
      direction STRING
    )
    ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
    WITH serdeproperties (
      'odps.properties.rolearn'='acs:ram::<uid>:role/aliyunodpsdefaultrole'
    ) 
    STORED AS textfile
    LOCATION 'oss://oss-cn-beijing-internal.aliyuncs.com/***/'
    TBLPROPERTIES (
    -- Set a custom extension.
       'odps.external.data.output.explicit.extension'='jsonl') 
    ;
    
    -- Write data to the external table.
    INSERT INTO <mc_oss_extable_name> VALUES (1,32,76,1,63.32106,-92.08174,'9/14/2014 0:10','NW');

    The following figure shows the generated file.


  5. Set the prefix of the output OSS file to mc_, the suffix to _beijing, and the extension to jsonl. The DDL statement is as follows:

    CREATE EXTERNAL TABLE <mc_oss_extable_name>
    (
      vehicleId INT,
      recordId INT,
      patientId INT,
      calls INT,
      locationLatitude DOUBLE,
      locationLongitude DOUBLE,
      recordTime STRING,
      direction STRING
    )
    ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
    WITH serdeproperties (
      'odps.properties.rolearn'='acs:ram::<uid>:role/aliyunodpsdefaultrole'
    ) 
    STORED AS textfile
    LOCATION 'oss://oss-cn-beijing-internal.aliyuncs.com/***/'
    TBLPROPERTIES (
        -- Set a custom prefix.
        'odps.external.data.output.prefix'='mc_', 
        -- Set a custom suffix.
        'odps.external.data.output.suffix'='_beijing', 
        -- Set a custom extension.
        'odps.external.data.output.explicit.extension'='jsonl') 
    ;  
    
    -- Write data to the external table.
    INSERT INTO <mc_oss_extable_name> VALUES (1,32,76,1,63.32106,-92.08174,'9/14/2014 0:10','NW');

    The following figure shows the generated file.


Write large files in dynamic partitions

Business scenario

You need to export the computed results of an upstream table to OSS by partition and write them as large files, such as 4 GB files.

The odps.adaptive.shuffle.desired.partition.size parameter (in MB) controls the data volume processed by each reducer, which in turn determines the size of each output file. Using this parameter to create larger output files can reduce parallelism, potentially increasing the overall job runtime.

Parameters

-- You must enable the dynamic partition feature.
set odps.sql.reshuffle.dynamicpt=true; 	

-- Set the desired amount of data for each reducer to consume. Assume you want each file to be 4 GB.
set odps.adaptive.shuffle.desired.partition.size=4096;	

Usage example

Write a JSON file of about 4 GB to OSS.

  1. Prepare test data. Use the bigdata_public_dataset.tpcds_1t.web_sales table from the public dataset, which is about 30 GB. The data is stored in a compressed format in MaxCompute, and its size will increase after being exported.

  2. Create a JSON external table.

    -- Sample table name: json_ext_web_sales
    CREATE EXTERNAL TABLE  json_ext_web_sales(
        c_int INT ,
        c_string STRING 
    )
    PARTITIONED BY (pt STRING)
    ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
    WITH serdeproperties (
      'odps.properties.rolearn'='acs:ram::<uid>:role/aliyunodpsdefaultrole'
    )
    STORED AS textfile
    LOCATION 'oss://oss-cn-hangzhou-internal.aliyuncs.com/oss-mc-test/demo-test/';
  3. With no parameters set, write data from the test table to the JSON external table using dynamic partitions.

    -- Enable the three-layer (schema) model syntax.
    set odps.namespace.schema=true;
    
    -- Write data to the JSON external table using dynamic partitions.
    INSERT OVERWRITE json_ext_web_sales 
    PARTITION(pt) 
    SELECT CAST(ws_item_sk AS INT) AS c_int, 
           CAST(ws_bill_customer_sk AS string) AS c_string , 
          COALESCE(CONCAT(ws_bill_addr_sk %2, '_', ws_promo_sk %3),'null_pt') AS pt 
    FROM  bigdata_public_dataset.tpcds_1t.web_sales;

    The following figure shows the files stored on OSS:


  4. Add the odps.adaptive.shuffle.desired.partition.size parameter for outputting large files, and write the test table to the JSON external table in dynamic partitions.

    -- Enable the three-layer (schema) model syntax.
    set odps.namespace.schema=true;
    
    -- You must enable the dynamic partition feature.
    set odps.sql.reshuffle.dynamicpt=true; 	
    
    -- Set the desired amount of data for each reducer to consume. Assume you want each file to be 4 GB.
    set odps.adaptive.shuffle.desired.partition.size=4096;	
    
    -- Write data to the JSON external table in dynamic partitions.
    INSERT OVERWRITE json_ext_web_sales 
    PARTITION(pt) 
    SELECT CAST(ws_item_sk AS INT) AS c_int, 
           CAST(ws_bill_customer_sk AS string) AS c_string , 
          COALESCE(CONCAT(ws_bill_addr_sk %2, '_', ws_promo_sk %3),'null_pt') AS pt 
    FROM  bigdata_public_dataset.tpcds_1t.web_sales;

    You can view the resulting files, which are now larger, in the mapped OSS directory.
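As a rough back-of-the-envelope check (a simplification, not MaxCompute's actual planner logic), the desired partition size in MB bounds the reducer count, and therefore the number of output files per run:

```python
import math

# Rough sketch of how odps.adaptive.shuffle.desired.partition.size
# (in MB) relates to the number of reducers. The ~30 GB figure is the
# approximate exported size from this example; real plans may differ.
def estimated_reducers(total_shuffle_mb, desired_partition_size_mb):
    return max(1, math.ceil(total_shuffle_mb / desired_partition_size_mb))

print(estimated_reducers(30 * 1024, 4096))  # 8
```

With ~30 GB of exported data and a 4 GB target per reducer, you end up with on the order of eight larger files instead of many small ones.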

Import data from or export data to OSS

  • LOAD: Imports data from external storage in OSS to a MaxCompute table or table partition.

  • UNLOAD: Exports data from a MaxCompute project to external storage in OSS for use by other compute engines.

Appendix: Prepare sample data

  1. Prepare OSS directories

    The following sample data is used:

    • oss_endpoint: oss-cn-hangzhou-internal.aliyuncs.com, which is the endpoint for the China (Hangzhou) region.

    • Bucket name: oss-mc-test.

    • Directory names: Demo1/, Demo2/, Demo3/, and SampleData/.

  2. Non-partitioned table data

    The vehicle.csv file is uploaded to the Demo1/ directory and contains the following data. The Demo1/ directory is used to establish a mapping with a non-partitioned table that is created using the built-in text data parser.

    1,1,51,1,46.81006,-92.08174,9/14/2014 0:00,S
    1,2,13,1,46.81006,-92.08174,9/14/2014 0:00,NE
    1,3,48,1,46.81006,-92.08174,9/14/2014 0:00,NE
    1,4,30,1,46.81006,-92.08174,9/14/2014 0:00,W
    1,5,47,1,46.81006,-92.08174,9/14/2014 0:00,S
    1,6,9,1,46.81006,-92.08174,9/15/2014 0:00,S
    1,7,53,1,46.81006,-92.08174,9/15/2014 0:00,N
    1,8,63,1,46.81006,-92.08174,9/15/2014 0:00,SW
    1,9,4,1,46.81006,-92.08174,9/15/2014 0:00,NE
    1,10,31,1,46.81006,-92.08174,9/15/2014 0:00,N
  3. Partitioned table data

    The Demo2/ directory contains five subdirectories: direction=N/, direction=NE/, direction=S/, direction=SW/, and direction=W/. The files vehicle1.csv, vehicle2.csv, vehicle3.csv, vehicle4.csv, and vehicle5.csv are uploaded to these subdirectories, respectively, and contain the following data. The Demo2/ directory is used to map to a partitioned table that is created using the built-in text data parser.

    --vehicle1.csv
    1,7,53,1,46.81006,-92.08174,9/15/2014 0:00
    1,10,31,1,46.81006,-92.08174,9/15/2014 0:00
    
    --vehicle2.csv
    1,2,13,1,46.81006,-92.08174,9/14/2014 0:00
    1,3,48,1,46.81006,-92.08174,9/14/2014 0:00
    1,9,4,1,46.81006,-92.08174,9/15/2014 0:00
    
    --vehicle3.csv
    1,6,9,1,46.81006,-92.08174,9/15/2014 0:00
    1,5,47,1,46.81006,-92.08174,9/14/2014 0:00
    1,6,9,1,46.81006,-92.08174,9/15/2014 0:00
    
    --vehicle4.csv
    1,8,63,1,46.81006,-92.08174,9/15/2014 0:00
    
    --vehicle5.csv
    1,4,30,1,46.81006,-92.08174,9/14/2014 0:00
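The Demo2/ layout follows the Hive-style `column=value` path convention, which is how files map to partition column values. A quick illustrative sketch of deriving partition values from such a path (the helper name is made up):

```python
# Parse Hive-style partition segments (e.g. "direction=NE") out of an
# object path like Demo2/direction=NE/vehicle2.csv.
def partition_from_path(path):
    return dict(seg.split("=", 1) for seg in path.split("/") if "=" in seg)

print(partition_from_path("Demo2/direction=NE/vehicle2.csv"))
# {'direction': 'NE'}
```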
  4. Compressed data

    The vehicle.csv.gz file is uploaded to the Demo3/ directory. This compressed file contains vehicle.csv, which has the same content as the file in the Demo1/ directory. This file is used to create a mapping to a compressed OSS external table.

  5. Custom parser data

    The vehicle6.csv file is uploaded to the SampleData/ directory and contains the following data. The SampleData/ directory is used to map to an OSS external table that is created using an open-source data parser.

    1|1|51|1|46.81006|-92.08174|9/14/2014 0:00|S
    1|2|13|1|46.81006|-92.08174|9/14/2014 0:00|NE
    1|3|48|1|46.81006|-92.08174|9/14/2014 0:00|NE
    1|4|30|1|46.81006|-92.08174|9/14/2014 0:00|W
    1|5|47|1|46.81006|-92.08174|9/14/2014 0:00|S
    1|6|9|1|46.81006|-92.08174|9/14/2014 0:00|S
    1|7|53|1|46.81006|-92.08174|9/14/2014 0:00|N
    1|8|63|1|46.81006|-92.08174|9/14/2014 0:00|SW
    1|9|4|1|46.81006|-92.08174|9/14/2014 0:00|NE
    1|10|31|1|46.81006|-92.08174|9/14/2014 0:00|N

FAQ

How do I resolve the "Inline data exceeds the maximum allowed size" error when processing OSS data using an external table?

  • Symptom

    When you process OSS data, the error Inline data exceeds the maximum allowed size is reported.

  • Cause

    The size of a single file in OSS cannot exceed 3 GB. If a file exceeds this limit, this error is reported.

  • Solution

    You can adjust the following two properties to change the execution plan and control how much data each reducer writes to the OSS external table, so that no single OSS file exceeds the 3 GB limit.

    set odps.sql.mapper.split.size=256; -- Adjusts the amount of data read by each mapper, in MB.
    set odps.stage.reducer.num=100; -- Adjusts the number of workers in the Reduce stage.
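To pick a concrete reducer count, divide the expected output size by the 3 GB limit. A minimal sketch, assuming output is spread roughly evenly across reducers (the 250 GB figure is illustrative):

```python
import math

# Choose a reducer count so that each reducer writes at most
# max_file_gb to the OSS external table, staying under the 3 GB
# single-file limit.
def min_reducers(total_output_gb, max_file_gb=3):
    return math.ceil(total_output_gb / max_file_gb)

print(min_reducers(250))  # 84, i.e. set odps.stage.reducer.num=84;
```

Skewed partitions can still exceed the limit, so leave some headroom when setting the value.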

How do I resolve an out-of-memory error when accessing an OSS external table in MaxCompute with a UDF that passed local tests?

  • Symptom

    When accessing an OSS external table in MaxCompute, a UDF that passed local tests returns the following error after being uploaded.

    FAILED: ODPS-0123131:User defined function exception - Traceback:
    java.lang.OutOfMemoryError: Java heap space        

    After you set the following parameters, the running time increases, but the error persists.

    set odps.stage.mapper.mem = 2048; 
    set odps.stage.mapper.jvm.mem = 4096;       
  • Cause

    There are too many object files in the external table, and the table is not partitioned, which causes excessive memory usage.

  • Solution

    • Query a smaller amount of data.

    • Convert the table to a partitioned table so that each query reads fewer object files, which reduces memory usage.

How do I merge multiple small files into a single file using an OSS external table?

Check the Logview logs to see if the last stage in the SQL execution plan is a Reducer or a Joiner.

  • If it is a Reducer, execute the statement set odps.stage.reducer.num=1;

  • If it is a Joiner, execute the statement set odps.stage.joiner.num=1;

How do I resolve the "Couldn't connect to server" error when reading an OSS external table?

  • Symptom

    When you read data from an OSS external table, the error ODPS-0123131:User defined function exception - common/io/oss/oss_client.cpp(95): OSSRequestException: req_id: , http status code: -998, error code: HttpIoError, message: Couldn't connect to server is reported.

  • Causes

    • Cause 1: When creating the OSS external table, the oss_endpoint in the oss_location address used a public endpoint instead of an internal endpoint.

    • Cause 2: When creating the OSS external table, the oss_endpoint in the oss_location address used an endpoint from another region.

  • Solution

    • For Cause 1:

      Check whether the oss_endpoint in the oss_location parameter of the CREATE TABLE statement for the OSS external table is an internal endpoint. If it is a public endpoint, change it to the internal endpoint. For more information about the parameters, see the Parameter descriptions section.

      For example, if you created an external table in the Indonesia (Jakarta) region using the address oss://oss-ap-southeast-5.aliyuncs.com/<bucket>/...., change it to the corresponding internal address oss://oss-ap-southeast-5-internal.aliyuncs.com/<bucket>/.....

    • For Cause 2:

      Verify that the `oss_endpoint` in the `oss_location` parameter of the `CREATE TABLE` statement for the OSS external table is the endpoint for the region that you want to access. For more information about OSS classic network domain names, see OSS Regions and Endpoints.

How do I resolve the "Network is unreachable (connect failed)" error when creating an OSS external table?

  • Symptom

    When you create an OSS external table, the error ODPS-0130071:[1,1] Semantic analysis exception - external table checking failure, error message: Cannot connect to the endpoint 'oss-cn-beijing.aliyuncs.com': Connect to bucket.oss-cn-beijing.aliyuncs.com:80 [bucket.oss-cn-beijing.aliyuncs.com] failed: Network is unreachable (connect failed) is reported.

  • Cause

    When creating the OSS external table, the oss_endpoint in the oss_location address used a public endpoint instead of an internal endpoint.

  • Solution

    Check whether the oss_endpoint in the `oss_location` parameter of the CREATE TABLE statement for your OSS external table is an internal endpoint. If it is a public endpoint, change it to the internal endpoint. For more information about the parameters, see the Parameter descriptions section.

    For example, if you created an external table in the China (Beijing) region using the address oss://oss-cn-beijing.aliyuncs.com/<bucket>/...., change it to the corresponding internal address oss://oss-cn-beijing-internal.aliyuncs.com/<bucket>/.....
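The rewrite from a public endpoint to an internal one is mechanical: insert "-internal" before ".aliyuncs.com". A hypothetical Python helper (the function name is made up for this sketch) illustrates it:

```python
# Rewrite a public OSS endpoint in a LOCATION value to its internal
# (classic-network) form. Leaves already-internal endpoints untouched.
def to_internal_location(location):
    if "-internal.aliyuncs.com" in location:
        return location  # already an internal endpoint
    return location.replace(".aliyuncs.com", "-internal.aliyuncs.com", 1)

print(to_internal_location("oss://oss-cn-beijing.aliyuncs.com/mybucket/dir/"))
# oss://oss-cn-beijing-internal.aliyuncs.com/mybucket/dir/
```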

How do I resolve slow SQL job execution on OSS external tables?

  • Slow reading of GZ compressed files in an OSS external table

    • Symptom

      Reading data is slow from an OSS external table that uses a 200 GB GZ compressed file in OSS as the data source.

    • Cause

      The SQL processing is slow because too few mappers are executing the computation in the Map stage.

    • Solution

      • For structured data, you can set the following parameter to adjust the amount of data read by a single mapper to accelerate SQL execution.

        set odps.sql.mapper.split.size=256; -- Adjusts the amount of table data read by each mapper, in MB.
      • For unstructured data, check whether the OSS external table's path contains only one OSS file. Compressed unstructured data cannot be split, so a single file generates only one mapper, which makes processing slow. We recommend that you split the large OSS file into multiple smaller files in the external table's OSS path. This increases the number of mappers that read the external table and improves the read speed.
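If you split the source file offline before uploading, something like the following sketch works. The paths, part naming, and lines-per-part threshold are all illustrative assumptions:

```python
import gzip
import os

# Split one large text file into several gzip parts so that
# MaxCompute can assign one mapper per part.
def split_and_gzip(src_path, out_dir, lines_per_part=1000000):
    os.makedirs(out_dir, exist_ok=True)
    parts, buf, idx = [], [], 0

    def flush():
        nonlocal buf, idx
        if not buf:
            return
        part = os.path.join(out_dir, f"part-{idx:05d}.csv.gz")
        with gzip.open(part, "wt") as out:
            out.writelines(buf)
        parts.append(part)
        buf, idx = [], idx + 1

    with open(src_path) as src:
        for line in src:
            buf.append(line)
            if len(buf) >= lines_per_part:
                flush()
    flush()  # write the final partial part
    return parts
```

Upload the resulting part files to the external table's OSS directory; each part can then be read by its own mapper.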

  • Slow data search in a MaxCompute external table using an SDK

    • Symptom

      Searching for data in a MaxCompute external table using an SDK is slow.

    • Solution

      External tables support only full table scans, which are slow. We recommend that you use a MaxCompute internal table instead.

In a multipart upload scenario, how do I resolve the issue where old data is deleted but new data is not written?

  • Symptom

    If an INSERT OVERWRITE job fails, in rare cases the old data can be deleted while the new data is not written.

  • Cause

    The new data is not successfully written to the destination table because of a rare hardware failure or metadata update failure. The delete operation in OSS does not support rollbacks, which means the deleted old data cannot be recovered.

  • Solution

    • If you want to overwrite an OSS external table based on its existing data, for example, INSERT OVERWRITE TABLE t SELECT * FROM t;, back up the OSS data in advance. If the job fails, you can restore the OSS external table from the backup.

    • If the insert overwrite job can be submitted multiple times, you can resubmit the job if it fails.
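The back-up-then-restore pattern above can be sketched as follows, with an in-memory dict standing in for the OSS directory; in practice you would copy the OSS objects to a backup prefix before submitting the job:

```python
# Back up the current data, attempt the overwrite, and restore the
# backup if the job raises. All names here are illustrative.
def overwrite_with_backup(store, key, compute_new):
    backup = store.get(key)
    try:
        store[key] = compute_new(store.get(key))
    except Exception:
        store[key] = backup  # job failed: restore the backed-up data
        raise

store = {"T": [1, 2, 3]}
overwrite_with_backup(store, "T", lambda rows: [r * 2 for r in rows])
print(store["T"])  # [2, 4, 6]
```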

References