In MaxCompute, you can create an Object Storage Service (OSS) external table that maps to an OSS directory. This lets you access unstructured data from files in the OSS directory or write data from a MaxCompute project to the OSS directory. This topic describes the syntax and parameters for creating OSS external tables and reading data from or writing data to them.
Prerequisites
Your Alibaba Cloud account, RAM user, or RAM role must have the required permissions to access OSS external tables. For more information about authorization, see STS authorization for OSS.
(Optional) You have created an OSS bucket, OSS directories, and OSS data files. For more information, see Create buckets, Manage directories, and Simple upload.
MaxCompute can automatically create directories in OSS. You can use a single SQL statement to perform read and write operations that involve both external tables and UDFs. Manual directory creation is no longer required, but the legacy method is still supported.
You have created a MaxCompute project. For more information, see Create a MaxCompute project.
MaxCompute is deployed only in specific regions. To prevent cross-region data connectivity issues, use a bucket in the same region as your MaxCompute project.
Your Alibaba Cloud account or RAM user has the CreateTable permission on your project. For more information about table operation permissions, see MaxCompute permissions.
Notes
OSS external tables only record mappings to OSS directories. Deleting an OSS external table does not delete the data files in the mapped OSS directory.
If OSS data files are archived objects, you must first restore the objects.
Use the classic network endpoint for OSS. MaxCompute does not guarantee network connectivity for public endpoints.
Limits
The cluster property is not supported for OSS external tables.
A single file cannot exceed 3 GB. If a file is larger than 3 GB, you must split it.
Entry points
You can create OSS external tables and read data from or write data to them on the following platforms.
| Method | Platform |
| --- | --- |
| MaxCompute SQL | |
| Visualization | |
Create an OSS external table
OSS external tables can be partitioned or non-partitioned. The type of external table that you create depends on how the data files are stored in OSS. If data files are stored in partitioned directories, create a partitioned external table. Otherwise, create a non-partitioned external table.
Notes
The CREATE EXTERNAL TABLE statement requires different parameters depending on the data file format. To avoid errors during read or write operations, ensure you use the correct syntax and parameters for your specific business needs.
Syntax
Scenario: Create an external table using the built-in text data parser
Syntax:

```sql
CREATE EXTERNAL TABLE [IF NOT EXISTS] <mc_oss_extable_name>
(
  <col_name> <data_type>,
  ...
)
[COMMENT <table_comment>]
[PARTITIONED BY (<col_name> <data_type>, ...)]
STORED BY '<StorageHandler>'
WITH serdeproperties (
  ['<property_name>'='<property_value>', ...]
)
LOCATION '<oss_location>';
```

Supported data file formats for reading data from or writing data to OSS:
CSV
TSV
CSV or TSV files compressed in GZIP, SNAPPY, or LZO format
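For illustration, the following is a minimal sketch of this syntax against the Demo1/ sample data described in the appendix, using the built-in CSV storage handler com.aliyun.odps.CsvStorageHandler. The table name and the role ARN are placeholders.

```sql
CREATE EXTERNAL TABLE IF NOT EXISTS mc_oss_csv_extable
(
  vehicleId INT, recordId INT, patientId INT, calls INT,
  locationLatitute DOUBLE, locationLongitude DOUBLE,
  recordTime STRING, direction STRING
)
STORED BY 'com.aliyun.odps.CsvStorageHandler'  -- built-in text (CSV) parser
WITH serdeproperties (
  'odps.properties.rolearn'='acs:ram::<uid>:role/aliyunodpsdefaultrole'  -- placeholder ARN
)
LOCATION 'oss://oss-cn-hangzhou-internal.aliyuncs.com/oss-mc-test/Demo1/';
```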
Scenario: Create an external table using the built-in open source data parser
Syntax:

```sql
CREATE EXTERNAL TABLE [IF NOT EXISTS] <mc_oss_extable_name>
(
  <col_name> <data_type>,
  ...
)
[COMMENT <table_comment>]
[PARTITIONED BY (<col_name> <data_type>, ...)]
[ROW FORMAT SERDE '<serde_class>'
  [WITH serdeproperties (
    ['<property_name>'='<property_value>', ...]
  )]
]
STORED AS <file_format>
LOCATION '<oss_location>'
[USING '<resource_name>']
[TBLPROPERTIES ('<tbproperty_name>'='<tbproperty_value>', ...)];
```

Supported data file formats for reading data from or writing data to OSS:
PARQUET
TEXTFILE
ORC
RCFILE
AVRO
JSON
SEQUENCEFILE
Hudi (only reading Hudi data generated by Data Lake Formation (DLF) is supported)
PARQUET files compressed in ZSTD, SNAPPY, or GZIP format
ORC files compressed in SNAPPY or ZLIB format
TEXTFILE files compressed in GZIP, SNAPPY, or LZO format
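As an illustration, a minimal sketch that maps PARQUET files. The table name and the parquet_data/ directory under the sample bucket are placeholders, and STS authorization properties are omitted for brevity.

```sql
CREATE EXTERNAL TABLE IF NOT EXISTS mc_oss_parquet_extable
(
  vehicleId INT,
  recordTime STRING
)
STORED AS PARQUET  -- built-in open source parser; no custom StorageHandler needed
LOCATION 'oss://oss-cn-hangzhou-internal.aliyuncs.com/oss-mc-test/parquet_data/';
```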
Scenario: Create an external table using a custom parser
Syntax:

```sql
CREATE EXTERNAL TABLE [IF NOT EXISTS] <mc_oss_extable_name>
(
  <col_name> <data_type>,
  ...
)
[COMMENT <table_comment>]
[PARTITIONED BY (<col_name> <data_type>, ...)]
STORED BY '<StorageHandler>'
WITH serdeproperties (
  ['<property_name>'='<property_value>', ...]
)
LOCATION '<oss_location>'
USING '<jar_name>';
```

Supported data file formats for reading data from or writing data to OSS: data files in formats other than the preceding ones.
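The following is a minimal, hypothetical sketch of this syntax for the pipe-delimited vehicle6.csv sample in SampleData/ (see the appendix). The storage handler class com.example.TextStorageHandler, the delimiter property, and the resource vehicle_extractor.jar are placeholders for your own parser implementation, not built-in names.

```sql
CREATE EXTERNAL TABLE IF NOT EXISTS mc_oss_custom_extable
(
  vehicleId INT, recordId INT, patientId INT, calls INT,
  locationLatitute DOUBLE, locationLongitude DOUBLE,
  recordTime STRING, direction STRING
)
STORED BY 'com.example.TextStorageHandler'  -- placeholder: your custom StorageHandler class
WITH serdeproperties (
  'delimiter'='|'                           -- placeholder property read by the custom parser
)
LOCATION 'oss://oss-cn-hangzhou-internal.aliyuncs.com/oss-mc-test/SampleData/'
USING 'vehicle_extractor.jar';              -- placeholder: the uploaded JAR resource that contains the parser
```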
Parameters
The following table describes the common parameters for external tables of all formats. For parameters that are unique to a specific format, see the documentation for that format.
Basic syntax parameters
| Parameter | Required | Description |
| --- | --- | --- |
| mc_oss_extable_name | Yes | The name of the OSS external table that you want to create. Table names are not case-sensitive. When you query external tables, you do not need to distinguish between uppercase and lowercase letters. Forced case conversions are not supported. |
| col_name | Yes | The name of a column in the OSS external table. When you read data from OSS, the schema of the created OSS external table must be the same as the schema of the OSS data files. Otherwise, you cannot read the OSS data. |
| data_type | Yes | The data type of a column in the OSS external table. When you read data from OSS, the data types of columns in the OSS external table must be the same as the data types of columns in the OSS data files. Otherwise, you cannot read the OSS data. |
| table_comment | No | The comment of the table. The comment must be a valid string of up to 1,024 bytes. Otherwise, an error is returned. |
| partitioned by (col_name data_type, ...) | No | Required to create a partitioned table when data files in OSS are stored in partitioned directories. col_name: the name of a partition key column. data_type: the data type of a partition key column. |
| '<(tb)property_name>' = '<(tb)property_value>' | Yes | The extended properties of the OSS external table. For more information about the properties, see the documentation for the specific format. |
| oss_location | Yes | The OSS path where the data files are stored. By default, MaxCompute reads all data files in this path. The path format and permission options are described after this table. |
Format:

oss://<oss_endpoint>/<Bucket name>/<OSS directory name>/

oss_endpoint: the OSS endpoint. Use the classic network (internal) endpoint provided by OSS, which is the endpoint that contains -internal. Example: oss://oss-cn-beijing-internal.aliyuncs.com/xxx. For more information about OSS classic network endpoints, see OSS Regions and Endpoints.

We recommend that the OSS region where data files are stored be the same as the region where your MaxCompute project is located. If they are in different regions, data connectivity issues may occur.

You can omit the endpoint. If you do not specify an endpoint, the system uses the endpoint of the region where the current project is located by default. This method is not recommended. If the file storage is cross-region, data connectivity issues may occur.

Bucket name: the name of the OSS bucket. Example: oss://oss-cn-beijing-internal.aliyuncs.com/your_bucket/path/. For more information about bucket names, see List buckets.

OSS directory name: the name of the OSS directory. Do not specify a file name after the directory. Example: oss://oss-cn-beijing-internal.aliyuncs.com/oss-mc-test/Demo1/.

Incorrect examples:

```
-- HTTP connections are not supported.
http://oss-cn-shanghai-internal.aliyuncs.com/oss-mc-test/Demo1/
-- HTTPS connections are not supported.
https://oss-cn-shanghai-internal.aliyuncs.com/oss-mc-test/Demo1/
-- The endpoint is invalid.
oss://oss-cn-shanghai-internal.aliyuncs.com/Demo1
-- Do not specify a file name.
oss://oss-cn-shanghai-internal.aliyuncs.com/oss-mc-test/Demo1/vehicle.csv
```

Permission specification (RamRole):

Explicitly specify (recommended): create a custom role, attach an access policy to it, and use its ARN. For more information, see STS authorization for OSS.

Use the default role (not recommended): use the ARN of the aliyunodpsdefaultrole role.
WITH serdeproperties properties
| property_name | Scenario | property_value | Default value |
| --- | --- | --- | --- |
| odps.properties.rolearn | Add this property if you use Security Token Service (STS) authorization. | The Alibaba Cloud Resource Name (ARN) of a RAM role that is granted permissions to access OSS. You can obtain the ARN from the role details in the RAM console. Example: acs:ram::xxxxxx:role/aliyunodpsdefaultrole. | |

If the owners of MaxCompute and OSS are the same Alibaba Cloud account:

If you do not specify odps.properties.rolearn in the statement that is used to create a table, the ARN of the aliyunodpsdefaultrole role is used by default.

If you want to use the ARN of a custom role, you must create the custom role in advance. For more information, see STS authorization for OSS (Method 2).

If the owners of MaxCompute and OSS are different Alibaba Cloud accounts, you must specify the ARN of a custom role. For more information about how to create a custom role, see STS authorization for OSS (Method 3).
Syntax for adding partition data to an OSS external table
If you created a partitioned OSS external table, you must add partition data to it using one of the following methods:
Method 1 (recommended): Automatically parse the OSS directory to identify partitions and add the partition information to the OSS external table.
This method is suitable for scenarios where you want to add all missing historical partitions at once. MaxCompute automatically adds partitions to the OSS external table based on the partition directory that you specified when you created the OSS external table. You do not need to add partitions one by one based on partition key column names and values.
```sql
MSCK REPAIR TABLE <mc_oss_extable_name> ADD PARTITIONS [WITH properties (key:VALUE, key:VALUE ...)];
```

This method is not suitable for processing incremental data, especially when the OSS directory contains many partitions, such as more than 1,000. If the number of new partitions is much smaller than the number of existing ones, frequent use of the msck command causes repeated scans of the OSS directory and frequent metadata updates, which reduces the command's execution efficiency. For scenarios where you need to update incremental partitions, we recommend that you use Method 2.

Method 2: Manually run commands to add partition information to the OSS external table.
This method is suitable for scenarios where historical partitions are already created and you need to add new partitions frequently. Before you run a data write task, you must create partitions. You do not need to refresh the partitions when new data is written to OSS. The external table can then read the latest data directly from the OSS directory.
```sql
ALTER TABLE <mc_oss_extable_name> ADD PARTITION (<col_name>=<col_value>)
[ADD PARTITION (<col_name>=<col_value>) ...]
[LOCATION URL];
```

The col_name and col_value values must match the names of the directories where the partition data files are stored.
For example, assume the OSS directory structure for partition data files is as shown in the following figure. col_name corresponds to direction, and col_value corresponds to N, NE, S, SW, and W. Each ADD PARTITION statement corresponds to one subdirectory. To add multiple OSS subdirectories, you must run multiple ADD PARTITION statements.
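For illustration, the following is a minimal sketch of both methods against the Demo2/ sample directory from the appendix (partition key column direction with subdirectories direction=N/ through direction=W/). The table name mc_oss_csv_extable_pt is a placeholder.

```sql
-- Method 1: scan the OSS directory and add all missing partitions at once.
MSCK REPAIR TABLE mc_oss_csv_extable_pt ADD PARTITIONS;

-- Method 2: add partitions one by one; each statement maps one subdirectory,
-- such as .../Demo2/direction=N/, to one partition.
ALTER TABLE mc_oss_csv_extable_pt ADD PARTITION (direction = 'N');
ALTER TABLE mc_oss_csv_extable_pt ADD PARTITION (direction = 'NE');
```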
Read OSS data
Notes
After you create an OSS external table, you can use it to read data from OSS. For more information about the OSS data file formats that are supported by MaxCompute and the syntax for creating OSS external tables, see Syntax.
If an SQL statement involves complex data types, you must add the set odps.sql.type.system.odps2=true; command before the SQL statement and submit them together. For more information about data types, see Data Type Version Guide.

For OSS external tables that are mapped to open source data, you must run set odps.sql.hive.compatible=true; at the session level before you read data from OSS. Otherwise, an error is returned.

OSS provides limited outbound bandwidth. If the data read and write traffic exceeds the bandwidth limit of an instance within a short period, the read and write speed of OSS external tables is affected. For more information about OSS bandwidth, see Limits and performance metrics.
Syntax
```sql
<select_statement> FROM <from_statement>;
```

select_statement: the SELECT clause used to query data from the source table that will be inserted into the destination table.

from_statement: the FROM clause that specifies the data source, such as an external table name.
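As a concrete illustration, a minimal sketch that reads an external table mapped to open source data. Per the notes above, the hive-compatible flag is required; the table name mc_oss_parquet_extable is a placeholder.

```sql
-- Required at the session level for external tables that map to open source data.
set odps.sql.hive.compatible=true;
SELECT * FROM mc_oss_parquet_extable LIMIT 10;
```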
Non-partitioned data
After you create a non-partitioned OSS external table, you can use one of the following methods to read data from OSS:
Method 1 (recommended): Import data from OSS into a MaxCompute internal table, and then query the internal table.
This method is suitable for scenarios where you must repeatedly compute data or have high requirements for computing performance. You can create an internal table in your MaxCompute project that has the same schema as the OSS external table, import data from OSS to the internal table, and then execute complex queries. This method lets you take full advantage of the MaxCompute optimization mechanism for internal storage to improve computing performance. Example:
```sql
CREATE TABLE <table_internal> LIKE <mc_oss_extable_name>;
INSERT OVERWRITE TABLE <table_internal> SELECT * FROM <mc_oss_extable_name>;
```

Method 2: Directly read data from OSS in the same way you read data from a MaxCompute internal table.
This method is suitable for scenarios where you do not have high requirements for computing performance. Unlike reading data from an internal table, each read operation retrieves data directly from OSS.
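For Method 2, a minimal sketch that queries the external table directly, assuming a hypothetical table mc_oss_csv_extable mapped to the Demo1/ sample data:

```sql
-- Each run reads the CSV files in the mapped OSS directory directly, without importing them first.
SELECT recordId, direction
FROM mc_oss_csv_extable
WHERE direction = 'NE';
```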
Partitioned data
When you query an OSS external table, MaxCompute performs a full table scan on all data in the OSS directory, including data in subdirectories. If the data volume is large, a full directory scan causes unnecessary I/O consumption and increases data processing time. You can use one of the following methods to resolve this issue:
Method 1 (Recommended): Store data on OSS using either the standard partition path or a custom partition path format.
When you create an OSS external table, you must specify the partition and oss_location information in the creation statement. We recommend that you store OSS data in standard partition directories.
Method 2: Use multiple data storage paths.
Create multiple OSS external tables to read data from different paths. Each OSS external table maps to a subset of the OSS data. We do not recommend this method because it is cumbersome and results in poor data management.
Standard partition directory format
oss://<oss_endpoint>/<Bucket name>/<Directory name>/<partitionKey1=value1>/<partitionKey2=value2>/...

Example: A company stores log files that are generated every day in CSV format in OSS and processes the data on a daily basis using MaxCompute. You can specify the following standard partition directories for the OSS data:
oss://oss-odps-test/log_data/year=2016/month=06/day=01/logfile
oss://oss-odps-test/log_data/year=2016/month=06/day=02/logfile
oss://oss-odps-test/log_data/year=2016/month=07/day=10/logfile
oss://oss-odps-test/log_data/year=2016/month=08/day=08/logfile
...
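For illustration, a minimal sketch of Method 1 over the log_data directories above, assuming the China (Hangzhou) internal endpoint from the appendix. The table name log_table_std_external and the column log_content are placeholders.

```sql
CREATE EXTERNAL TABLE IF NOT EXISTS log_table_std_external
(
  log_content STRING  -- placeholder column for the daily log records
)
PARTITIONED BY (year STRING, month STRING, day STRING)
STORED BY 'com.aliyun.odps.CsvStorageHandler'
LOCATION 'oss://oss-cn-hangzhou-internal.aliyuncs.com/oss-odps-test/log_data/';

-- Register the existing year=/month=/day= subdirectories as partitions in one pass.
MSCK REPAIR TABLE log_table_std_external ADD PARTITIONS;
```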
Custom partition directory format
In a custom partition directory format, only partition column values are included, not partition column names. Example:
oss://oss-odps-test/log_data_customized/2016/06/01/logfile
oss://oss-odps-test/log_data_customized/2016/06/02/logfile
oss://oss-odps-test/log_data_customized/2016/07/10/logfile
oss://oss-odps-test/log_data_customized/2016/08/08/logfile
...

Data in OSS is stored by partition, but the path is not in the standard partition directory format. MaxCompute lets you bind subdirectories to different partitions to access the data within them.
Solution: After you create an OSS external table, run the alter table ... add partition ... location ... command to specify subdirectories and bind them to partitions. Example:
```sql
ALTER TABLE log_table_external ADD PARTITION (year = '2016', month = '06', day = '01')
location 'oss://oss-cn-hangzhou-internal.aliyuncs.com/bucket-name/oss-odps-test/log_data_customized/2016/06/01/';
ALTER TABLE log_table_external ADD PARTITION (year = '2016', month = '06', day = '02')
location 'oss://oss-cn-hangzhou-internal.aliyuncs.com/bucket-name/oss-odps-test/log_data_customized/2016/06/02/';
ALTER TABLE log_table_external ADD PARTITION (year = '2016', month = '07', day = '10')
location 'oss://oss-cn-hangzhou-internal.aliyuncs.com/bucket-name/oss-odps-test/log_data_customized/2016/07/10/';
ALTER TABLE log_table_external ADD PARTITION (year = '2016', month = '08', day = '08')
location 'oss://oss-cn-hangzhou-internal.aliyuncs.com/bucket-name/oss-odps-test/log_data_customized/2016/08/08/';
```

Query optimization
Dynamic statistics collection
Because data is stored in an external data lake and lacks pre-computed statistics, the query optimizer adopts a conservative strategy, which can result in low query efficiency. The dynamic statistics collection feature allows the optimizer to temporarily collect table statistics during query execution to identify small tables. The optimizer can then use this information to actively use hash joins, optimize join orders, reduce data shuffling, and shorten execution pipelines to optimize queries.
```sql
SET odps.meta.exttable.stats.onlinecollect=true;
SELECT * FROM <tablename>;
```

Optimize external table splits
You can optimize query efficiency by setting the split size parameter, which adjusts the amount of data processed by a single concurrent task. The split size setting has the following effects:
If the table contains a large amount of data and the split size is too small, too many splits are created. This results in a high degree of parallelism, and instances spend most of their time queuing for resources.
If the table contains a small amount of data and the split size is too large, too few splits are created. This results in insufficient concurrency, and most resources remain idle.
```sql
SET odps.stage.mapper.split.size=<value>;
SELECT * FROM <tablename>;
```
Write data to OSS
Similar to reading data from OSS, MaxCompute lets you write data from internal tables or data processed from external tables to OSS.
Write data to OSS using the built-in text or open source data parser: see CSV/TSV external tables and Write data to OSS using the built-in open source data parser.
Write data to OSS using a custom parser: see Example: Create an OSS external table using a custom parser.
Write data to OSS using the multipart upload feature of OSS: Write data to OSS using the multipart upload feature of OSS.
Syntax
```sql
INSERT {INTO|OVERWRITE} TABLE <table_name> PARTITION (<ptcol_name>[, <ptcol_name> ...])
<select_statement> FROM <from_statement>;
```

| Parameter | Required | Description |
| --- | --- | --- |
| table_name | Yes | The name of the external table to which you want to write data. |
| select_statement | Yes | The SELECT clause used to query the data from the source table that will be inserted into the destination external table. |
| from_statement | Yes | The FROM clause that specifies the data source, such as the name of the source table. |
To insert data into dynamic partitions, see Insert or overwrite dynamic partition data (DYNAMIC PARTITION).
Notes
If the INSERT OVERWRITE ... SELECT ... FROM ...; operation allocates 1,000 mappers for the source table from_tablename, 1,000 TSV or CSV files are generated.

You can control the number of generated files using the configurations that MaxCompute provides:

Output in a mapper: use odps.stage.mapper.split.size to control the concurrency of mappers and adjust the number of generated files.

Output in a reducer or joiner: use odps.stage.reducer.num and odps.stage.joiner.num to adjust the number of generated files.
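For example, a minimal sketch of the mapper-side control, assuming a hypothetical CSV external table mc_oss_csv_extable and source table from_tablename:

```sql
-- Read about 512 MB per mapper so that fewer mappers, and therefore fewer output files, are created.
set odps.stage.mapper.split.size=512;
INSERT OVERWRITE TABLE mc_oss_csv_extable
SELECT * FROM from_tablename;
```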
Write data to OSS using the OSS multipart upload feature
To write data to OSS in an open source format, you can use an OSS external table that is based on an open source data parser together with the OSS multipart upload feature to execute an INSERT operation.
You can enable the OSS multipart upload feature using the following settings:
| Scenario | Command |
| --- | --- |
| Set at the project level | setproject odps.sql.unstructured.oss.commit.mode=true; The setting takes effect for the entire project. |
| Set at the session level | set odps.sql.unstructured.oss.commit.mode=true; The setting takes effect only for the current task. |
The default value of the odps.sql.unstructured.oss.commit.mode property is false. The following table describes the implementation principles for each value of this property:
| Value | Principle |
| --- | --- |
| false | The data that MaxCompute writes to an OSS external table is stored in the .odps folder in the LOCATION directory. The .odps folder contains a .meta file that is used to ensure the consistency of MaxCompute data. Only MaxCompute can correctly process the content in the .odps folder. Other data processing engines may not be able to correctly parse the content, which results in errors. |
| true | MaxCompute uses the multipart upload feature, which is compatible with other data processing engines, and ensures data consistency through the multipart upload commit mechanism. |
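A minimal session-level sketch, assuming a hypothetical open-source-format external table mc_oss_parquet_extable and source table from_tablename:

```sql
-- Enable the multipart upload commit mode for this task only (property described above).
set odps.sql.unstructured.oss.commit.mode=true;
INSERT OVERWRITE TABLE mc_oss_parquet_extable
SELECT * FROM from_tablename;
```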
Manage exported files
Parameters
When you need to add a prefix, suffix, or extension to the data files written to OSS, you can set the following parameters.
| property_name | Scenario | Description | property_value | Default value |
| --- | --- | --- | --- | --- |
| odps.external.data.output.prefix (compatible with odps.external.data.prefix) | Add this property when you need to add a custom prefix to the output files. | | A valid combination of characters, such as 'mc_'. | None |
| odps.external.data.enable.extension | Add this property when you need to display the extension of the output files. | True indicates that the extension of the output files is displayed. False indicates that the extension is not displayed. | True or False | False |
| odps.external.data.output.suffix | Add this property when you need to add a custom suffix to the output files. | Contains only letters, digits, and underscores (a-z, A-Z, 0-9, _). | A valid combination of characters, such as '_hangzhou'. | None |
| odps.external.data.output.explicit.extension | Add this property when you need to add a custom extension to the output files. | | A valid combination of characters, such as "jsonl". | None |
Usage examples
Set the custom prefix of the output OSS file to test06_. The DDL statement is as follows:

```sql
CREATE EXTERNAL TABLE <mc_oss_extable_name>
(
  vehicleId INT, recordId INT, patientId INT, calls INT,
  locationLatitute DOUBLE, locationLongitude DOUBLE,
  recordTime STRING, direction STRING
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
WITH serdeproperties ('odps.properties.rolearn'='acs:ram::<uid>:role/aliyunodpsdefaultrole')
STORED AS textfile
LOCATION 'oss://oss-cn-beijing-internal.aliyuncs.com/***/'
TBLPROPERTIES (
  -- Set a custom prefix.
  'odps.external.data.output.prefix'='test06_'
);

-- Write data to the external table.
INSERT INTO <mc_oss_extable_name> VALUES (1,32,76,1,63.32106,-92.08174,'9/14/2014 0:10','NW');
```

The following figure shows the generated file.

Set the custom suffix of the output OSS file to _beijing. The DDL statement is as follows:

```sql
CREATE EXTERNAL TABLE <mc_oss_extable_name>
(
  vehicleId INT, recordId INT, patientId INT, calls INT,
  locationLatitute DOUBLE, locationLongitude DOUBLE,
  recordTime STRING, direction STRING
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
WITH serdeproperties ('odps.properties.rolearn'='acs:ram::<uid>:role/aliyunodpsdefaultrole')
STORED AS textfile
LOCATION 'oss://oss-cn-beijing-internal.aliyuncs.com/***/'
TBLPROPERTIES (
  -- Set a custom suffix.
  'odps.external.data.output.suffix'='_beijing'
);

-- Write data to the external table.
INSERT INTO <mc_oss_extable_name> VALUES (1,32,76,1,63.32106,-92.08174,'9/14/2014 0:10','NW');
```

The following figure shows the generated file.

An extension is automatically generated for the output OSS file. The DDL statement is as follows:

```sql
CREATE EXTERNAL TABLE <mc_oss_extable_name>
(
  vehicleId INT, recordId INT, patientId INT, calls INT,
  locationLatitute DOUBLE, locationLongitude DOUBLE,
  recordTime STRING, direction STRING
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
WITH serdeproperties ('odps.properties.rolearn'='acs:ram::<uid>:role/aliyunodpsdefaultrole')
STORED AS textfile
LOCATION 'oss://oss-cn-beijing-internal.aliyuncs.com/***/'
TBLPROPERTIES (
  -- Automatically generate an extension.
  'odps.external.data.enable.extension'='true'
);

-- Write data to the external table.
INSERT INTO <mc_oss_extable_name> VALUES (1,32,76,1,63.32106,-92.08174,'9/14/2014 0:10','NW');
```

The figure below shows the generated file.
Set the custom extension of the output OSS file to jsonl. The DDL statement is as follows:

```sql
CREATE EXTERNAL TABLE <mc_oss_extable_name>
(
  vehicleId INT, recordId INT, patientId INT, calls INT,
  locationLatitute DOUBLE, locationLongitude DOUBLE,
  recordTime STRING, direction STRING
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
WITH serdeproperties ('odps.properties.rolearn'='acs:ram::<uid>:role/aliyunodpsdefaultrole')
STORED AS textfile
LOCATION 'oss://oss-cn-beijing-internal.aliyuncs.com/***/'
TBLPROPERTIES (
  -- Set a custom extension.
  'odps.external.data.output.explicit.extension'='jsonl'
);

-- Write data to the external table.
INSERT INTO <mc_oss_extable_name> VALUES (1,32,76,1,63.32106,-92.08174,'9/14/2014 0:10','NW');
```

The following figure shows the generated file.

Set the prefix of the output OSS file to mc_, the suffix to _beijing, and the extension to jsonl. The DDL statement is as follows:

```sql
CREATE EXTERNAL TABLE <mc_oss_extable_name>
(
  vehicleId INT, recordId INT, patientId INT, calls INT,
  locationLatitute DOUBLE, locationLongitude DOUBLE,
  recordTime STRING, direction STRING
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
WITH serdeproperties ('odps.properties.rolearn'='acs:ram::<uid>:role/aliyunodpsdefaultrole')
STORED AS textfile
LOCATION 'oss://oss-cn-beijing-internal.aliyuncs.com/***/'
TBLPROPERTIES (
  -- Set a custom prefix.
  'odps.external.data.output.prefix'='mc_',
  -- Set a custom suffix.
  'odps.external.data.output.suffix'='_beijing',
  -- Set a custom extension.
  'odps.external.data.output.explicit.extension'='jsonl'
);

-- Write data to the external table.
INSERT INTO <mc_oss_extable_name> VALUES (1,32,76,1,63.32106,-92.08174,'9/14/2014 0:10','NW');
```

The following figure shows the generated file.

Write large files in dynamic partitions
Business scenario
You need to export the calculation results of an upstream table to OSS in partitions and write them as large files, such as 4 GB files.
The odps.adaptive.shuffle.desired.partition.size parameter (in MB) controls the data volume processed by each reducer, which in turn determines the size of each output file. Using this parameter to create larger output files can reduce parallelism, potentially increasing the overall job runtime.
Parameters
```sql
-- You must enable the dynamic partition feature.
set odps.sql.reshuffle.dynamicpt=true;
-- Set the desired amount of data for each reducer to consume. Assume you want each file to be 4 GB.
set odps.adaptive.shuffle.desired.partition.size=4096;
```

Usage example
Write a JSON file of about 4 GB to OSS.
Prepare test data. Use the bigdata_public_dataset.tpcds_1t.web_sales table from the public dataset, which is about 30 GB. The data is stored in a compressed format in MaxCompute, and its size will increase after being exported.

Create a JSON external table.
```sql
-- Sample table name: json_ext_web_sales
CREATE EXTERNAL TABLE json_ext_web_sales
(
  c_int INT,
  c_string STRING
)
PARTITIONED BY (pt STRING)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
WITH serdeproperties ('odps.properties.rolearn'='acs:ram::<uid>:role/aliyunodpsdefaultrole')
STORED AS textfile
LOCATION 'oss://oss-cn-hangzhou-internal.aliyuncs.com/oss-mc-test/demo-test/';
```

With no parameters set, write data from the test table to the JSON external table using dynamic partitions.
```sql
-- Enable the Layer 3 syntax switch.
set odps.namespace.schema=true;
-- Write data to the JSON external table using dynamic partitions.
INSERT OVERWRITE json_ext_web_sales PARTITION(pt)
SELECT CAST(ws_item_sk AS INT) AS c_int,
       CAST(ws_bill_customer_sk AS string) AS c_string,
       COALESCE(CONCAT(ws_bill_addr_sk %2, '_', ws_promo_sk %3),'null_pt') AS pt
FROM bigdata_public_dataset.tpcds_1t.web_sales;
```

The following figure shows the files stored on OSS:

Add the odps.adaptive.shuffle.desired.partition.size parameter for outputting large files, and write the test table to the JSON external table in dynamic partitions.

```sql
-- Enable Layer 3 schema syntax.
set odps.namespace.schema=true;
-- You must enable the dynamic partition feature.
set odps.sql.reshuffle.dynamicpt=true;
-- Set the desired amount of data for each reducer to consume. Assume you want each file to be 4 GB.
set odps.adaptive.shuffle.desired.partition.size=4096;
-- Write data to the JSON external table in dynamic partitions.
INSERT OVERWRITE json_ext_web_sales PARTITION(pt)
SELECT CAST(ws_item_sk AS INT) AS c_int,
       CAST(ws_bill_customer_sk AS string) AS c_string,
       COALESCE(CONCAT(ws_bill_addr_sk %2, '_', ws_promo_sk %3),'null_pt') AS pt
FROM bigdata_public_dataset.tpcds_1t.web_sales;
```

The files stored on OSS are shown in the following figure:
Import data from or export data to OSS
Appendix: Prepare sample data
Prepare an OSS directory
The following sample data is used:
oss_endpoint: oss-cn-hangzhou-internal.aliyuncs.com, which is the endpoint for the China (Hangzhou) region.

Bucket name: oss-mc-test.

Directory names: Demo1/, Demo2/, Demo3/, and SampleData/.
Non-partitioned table data
The vehicle.csv file is uploaded to the Demo1/ directory and contains the following data. The Demo1/ directory is used to establish a mapping with a non-partitioned table that is created using the built-in text data parser.

```
1,1,51,1,46.81006,-92.08174,9/14/2014 0:00,S
1,2,13,1,46.81006,-92.08174,9/14/2014 0:00,NE
1,3,48,1,46.81006,-92.08174,9/14/2014 0:00,NE
1,4,30,1,46.81006,-92.08174,9/14/2014 0:00,W
1,5,47,1,46.81006,-92.08174,9/14/2014 0:00,S
1,6,9,1,46.81006,-92.08174,9/15/2014 0:00,S
1,7,53,1,46.81006,-92.08174,9/15/2014 0:00,N
1,8,63,1,46.81006,-92.08174,9/15/2014 0:00,SW
1,9,4,1,46.81006,-92.08174,9/15/2014 0:00,NE
1,10,31,1,46.81006,-92.08174,9/15/2014 0:00,N
```

Partitioned table data
The Demo2/ directory contains five subdirectories: direction=N/, direction=NE/, direction=S/, direction=SW/, and direction=W/. The files vehicle1.csv, vehicle2.csv, vehicle3.csv, vehicle4.csv, and vehicle5.csv are uploaded to these subdirectories, respectively, and contain the following data. The Demo2/ directory is used to map to a partitioned table that is created using the built-in text data parser.

```
--vehicle1.csv
1,7,53,1,46.81006,-92.08174,9/15/2014 0:00
1,10,31,1,46.81006,-92.08174,9/15/2014 0:00
--vehicle2.csv
1,2,13,1,46.81006,-92.08174,9/14/2014 0:00
1,3,48,1,46.81006,-92.08174,9/14/2014 0:00
1,9,4,1,46.81006,-92.08174,9/15/2014 0:00
--vehicle3.csv
1,6,9,1,46.81006,-92.08174,9/15/2014 0:00
1,5,47,1,46.81006,-92.08174,9/14/2014 0:00
1,6,9,1,46.81006,-92.08174,9/15/2014 0:00
--vehicle4.csv
1,8,63,1,46.81006,-92.08174,9/15/2014 0:00
--vehicle5.csv
1,4,30,1,46.81006,-92.08174,9/14/2014 0:00
```

Compressed data
The vehicle.csv.gz file is uploaded to the Demo3/ directory. This compressed file contains vehicle.csv, which has the same content as the file in the Demo1/ directory. This file is used to create a mapping to a compressed OSS external table.

Custom parser data
The vehicle6.csv file is uploaded to the SampleData/ directory and contains the following data. The SampleData/ directory is used to map to an OSS external table that is created using an open-source data parser.

```
1|1|51|1|46.81006|-92.08174|9/14/2014 0:00|S
1|2|13|1|46.81006|-92.08174|9/14/2014 0:00|NE
1|3|48|1|46.81006|-92.08174|9/14/2014 0:00|NE
1|4|30|1|46.81006|-92.08174|9/14/2014 0:00|W
1|5|47|1|46.81006|-92.08174|9/14/2014 0:00|S
1|6|9|1|46.81006|-92.08174|9/14/2014 0:00|S
1|7|53|1|46.81006|-92.08174|9/14/2014 0:00|N
1|8|63|1|46.81006|-92.08174|9/14/2014 0:00|SW
1|9|4|1|46.81006|-92.08174|9/14/2014 0:00|NE
1|10|31|1|46.81006|-92.08174|9/14/2014 0:00|N
```
FAQ
How do I resolve the "Inline data exceeds the maximum allowed size" error when processing OSS data using an external table?
Symptom
When you process OSS data, the error Inline data exceeds the maximum allowed size is reported.

Cause
The size of a single file in OSS cannot exceed 3 GB. If a file exceeds this limit, this error is reported.
Solution
To resolve this issue, you can adjust the following two properties. The purpose is to adjust the execution plan using these property values to control the size of data that each reducer writes to the OSS external table, ensuring that the OSS file does not exceed the 3 GB limit.
```sql
set odps.sql.mapper.split.size=256;  -- Adjusts the size of data read by each mapper, in MB.
set odps.stage.reducer.num=100;      -- Adjusts the number of workers in the Reduce stage.
```
How do I resolve an out-of-memory error when accessing an OSS external table in MaxCompute with a UDF that passed local tests?
Symptom
When accessing an OSS external table in MaxCompute, a UDF that passed local tests returns the following error after being uploaded.
FAILED: ODPS-0123131:User defined function exception - Traceback: java.lang.OutOfMemoryError: Java heap space

After you set the following parameters, the running time increases, but the error persists.

```sql
set odps.stage.mapper.mem = 2048;
set odps.stage.mapper.jvm.mem = 4096;
```

Cause
There are too many object files in the external table, and the table is not partitioned, which causes excessive memory usage.
Solution
Query a smaller amount of data.
Partition the object files to reduce memory usage.
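A minimal sketch of the second approach, assuming the OSS objects are reorganized into partitioned subdirectories (such as direction=N/ in the appendix sample) and mapped by a hypothetical partitioned external table mc_oss_csv_extable_pt; my_udf is a placeholder for your UDF.

```sql
-- Register the partition subdirectories so that partition pruning can take effect.
MSCK REPAIR TABLE mc_oss_csv_extable_pt ADD PARTITIONS;
-- Query a single partition so that the UDF processes far fewer OSS objects per run.
SELECT my_udf(recordTime) FROM mc_oss_csv_extable_pt WHERE direction = 'N';
```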
How do I merge multiple small files into a single file using an OSS external table?
Check the Logview logs to see if the last stage in the SQL execution plan is a Reducer or a Joiner.
If it is a Reducer, execute the statement set odps.stage.reducer.num=1;

If it is a Joiner, execute the statement set odps.stage.joiner.num=1;
How do I resolve the "Couldn't connect to server" error when reading an OSS external table?
Symptom
When you read data from an OSS external table, the following error is reported:

ODPS-0123131:User defined function exception - common/io/oss/oss_client.cpp(95): OSSRequestException: req_id: , http status code: -998, error code: HttpIoError, message: Couldn't connect to server

Causes

Cause 1: When creating the OSS external table, the oss_endpoint in the oss_location address used a public endpoint instead of an internal endpoint.

Cause 2: When creating the OSS external table, the oss_endpoint in the oss_location address used an endpoint from another region.
Solution
For Cause 1:
Check whether the oss_endpoint of the oss_location parameter in the CREATE TABLE statement for the OSS external table is an internal network address. If it is a public network address, change it to an internal network address. For more information about the parameters, see the Parameters section.

For example, if a user in the Indonesia (Jakarta) region created an external table using the address oss://oss-ap-southeast-5.aliyuncs.com/<bucket>/...., change it to the corresponding internal address oss://oss-ap-southeast-5-internal.aliyuncs.com/<bucket>/.....

For Cause 2:
Verify that the `oss_endpoint` of the `oss_location` parameter in the `CREATE TABLE` statement for the OSS foreign table is the endpoint for the region that you want to access. For more information about OSS classic network domain names, see OSS Regions and Endpoints.
How do I resolve the "Network is unreachable (connect failed)" error when creating an OSS external table?
Symptom
When you create an OSS external table, the following error is reported:

ODPS-0130071:[1,1] Semantic analysis exception - external table checking failure, error message: Cannot connect to the endpoint 'oss-cn-beijing.aliyuncs.com': Connect to bucket.oss-cn-beijing.aliyuncs.com:80 [bucket.oss-cn-beijing.aliyuncs.com] failed: Network is unreachable (connect failed)

Cause

When creating the OSS external table, the oss_endpoint in the oss_location address used a public endpoint instead of an internal endpoint.

Solution
Check whether the oss_endpoint in the oss_location parameter of the CREATE TABLE statement for the OSS external table is an internal network address. If it is a public network address, change it to an internal network address. For more information about the parameters, see the Parameters section.

For example, if a user in the China (Beijing) region created an external table using the address oss://oss-cn-beijing.aliyuncs.com/<bucket>/...., change it to the corresponding internal address oss://oss-cn-beijing-internal.aliyuncs.com/<bucket>/.....
How do I resolve slow SQL job execution on OSS external tables?
Slow reading of GZ compressed files in an OSS external table
Symptom
Reading data is slow from an OSS external table that uses a 200 GB GZ compressed file in OSS as the data source.
Cause
The SQL processing is slow because too few mappers are executing the computation in the Map stage.
Solution
For structured data, you can set the following parameter to adjust the amount of data read by a single mapper to accelerate SQL execution:

```sql
set odps.sql.mapper.split.size=256;  -- Adjusts the size of table data read by each mapper, in MB.
```

For unstructured data, check whether there is only one OSS file in the OSS external table's path. If there is only one, because unstructured data in a compressed format cannot be split, only one mapper can be generated, which results in slow processing. We recommend that you split the large OSS file into smaller files in the corresponding external table path on OSS. This increases the number of mappers generated when reading the external table and improves the reading speed.
Slow data search in a MaxCompute external table using an SDK
Symptom
Searching for data in a MaxCompute external table using an SDK is slow.
Solution
External tables support only full table scans, which are slow. We recommend that you use a MaxCompute internal table instead.
In a multipart upload scenario, how do I resolve the issue where old data is deleted but new data is not written?
Symptom
In an insert overwrite operation, an unexpected issue can occur in extreme cases if the job fails: the old data is deleted, but the new data is not written.
The new data is not successfully written to the destination table because of a rare hardware failure or metadata update failure. The delete operation in OSS does not support rollbacks, which means the deleted old data cannot be recovered.
Solution
If you want to overwrite an OSS external table based on its existing data, for example, insert overwrite table T select * from T;, you must back up the OSS data in advance. If the job fails, you can then overwrite the OSS external table using the backed-up data.

If the insert overwrite job can be submitted multiple times, you can resubmit the job if it fails.
References
Supported formats for OSS external tables:
Create OSS external tables using a custom parser, and read data from or write data to OSS: Custom parser.
Parse OSS files into datasets with schemas and support for column filtering and processing: Schemaless Query.