MaxCompute introduced the Object Table feature. This feature allows the data warehouse compute engine to access unstructured data and its metadata in data lake storage. This topic describes the command syntax for Object Tables and provides examples.
Background information
Many AI processes require data warehouse developers who are familiar with data and business operations. These developers use the low-cost, large-scale computing power of big data platforms to pre-process data for large models or handle unstructured data. The processes and their results interact with data in the data warehouse or data lake.
Using SQL to process unstructured data presents the following challenges.
Big data SQL engines are unaware of object sizes when reading files from Object Storage Service (OSS). This makes it difficult to optimize execution plans, control concurrency, or automatically start the correct number of concurrent tasks. Additionally, filter conditions cannot be effectively pushed down. This prevents the full use of computing power when data skew occurs.
Reading metadata from Object Storage Service is inefficient. Each query requires remote access to the storage service, which causes high latency.
File lists from Object Storage Service (OSS) can only be retrieved serially in a single process within a user-defined table-valued function (UDTF). This results in poor data read performance.
The logic for handling permissions and network connections to storage services must be implemented within the user-defined function (UDF).
UDFs have limited capabilities for processing unstructured data. Traditional data warehouses lack a flexible and secure way to upload custom images. They also lack a secure runtime environment for UDFs. Additionally, remote invocations require concurrent integration with distributed computing services.
Features
MaxCompute introduced the Object Table feature. This feature allows the data warehouse compute engine to access unstructured data and its metadata in data lake storage. It provides the following capabilities:
Allows the engine to read OSS object metadata as a table.
Caches various types of OSS object metadata in a versioned manner based on the Metadata Table capability. The SQL engine can then use the Metadata Table to perform effective query optimizations, such as data filtering and condition pushdown.
Allows you to read the content of unstructured data files in multiple ways using built-in document functions.
The MaxCompute SQL engine uses Object Table metadata to perform concurrent chunking. This enables large-scale distributed computing to improve the efficiency of data reading and processing.
Supports uploading custom images to build UDFs and process unstructured data read by the engine.
Allows the engine to process unstructured data, generate structured data results, and write them to internal and foreign tables in the data warehouse. It will also support generating unstructured data results and writing them back to Object Storage Service through Object Tables.
Supports the Maxframe engine for the Python ecosystem.
Limits
The MaxCompute project must support the Schema feature. For more information, see Enable the Schema feature.
MaxCompute must support the 2.0 data type system.
Object Tables do not support partitions.
Billing
An Object Table is a collection of object metadata from OSS. Storage fees are charged for the metadata that is refreshed and stored in the Object Table. For more information, see Storage fees. Because the files on OSS are not stored within MaxCompute, MaxCompute does not charge storage fees. OSS charges for data storage and access. For more information, see OSS storage fees.
In a task that extracts and refreshes OSS metadata, the
inputsizefor each scanned file is a value related to the metadata, not the actual file size. Therefore, the total cost of a refresh task is related to the number of files, not their actual size on OSS. For more information, see SQL billing for foreign tables.Compute fees are generated when you use Object Tables and their metadata to analyze, extract, and process unstructured data from OSS.
In pay-as-you-go mode, metadata analysis on an Object Table is billed as an internal table. For more information, see Standard SQL billing. Processing the content of unstructured data from OSS is billed as a foreign table. For more information, see SQL billing for foreign tables.
In subscription mode, subscription computing resources are used. For more information, see Compute fees (Subscription).
Create an OBJECT TABLE
Syntax
CREATE OBJECT TABLE [IF NOT EXISTS] <objecttable_name>
WITH SERDEPROPERTIES ('<key>' = '<value>')
LOCATION '<location>'
[TBLPROPERTIES ('<key>' = '<value>')]
[COMMENT '<comment>']
;Object Tables must be used in a project that supports Schema mode, and the Schema syntax switch must be enabled.
You do not need to define columns for an Object Table. The system provides the metadata columns.
Parameters
Parameter | Required | Description |
objecttable_name | Required. | The table name. |
SERDEPROPERTIES ('<key>'='<value>') | Required. | Specifies the RAM role for authentication. If you do not specify this parameter, the RAM role named Example: Before you use the role, make sure that you have completed the one-click authorization for
Note One-click authorization can be performed only when the ProjectOwner of the MaxCompute project is the OSS Alibaba Cloud account. |
location | Required. |
|
TBLPROPERTIES ('<key>'='<value>') | Optional. |
|
comment | Optional. | The table comment. |
Example
SET odps.namespace.schema=true;
CREATE OBJECT TABLE ot_demo_day
WITH serdeproperties (
'odps.properties.rolearn'='acs:ram::xxxxxx:role/aliyunodpsdefaultrole')
LOCATION 'oss://oss-cn-hangzhou-internal.aliyuncs.com/odps-external-****/ottest/';View OBJECT TABLE properties
Syntax
DESC <object_table_name>Parameters
object_table_name: Required. The table name.
Example
DESC ot_demo_day; The following result is returned.
+------------------------------------------------------------------------------------+
| Owner: ALIYUN$****@test.aliyunid.com |
| Project: test_objecttable |
| Schema: default |
| TableComment: |
+------------------------------------------------------------------------------------+
| CreateTime: 2024-09-02 20:01:56 |
| LastDDLTime: 2024-09-02 20:01:56 |
| LastModifiedTime: 2024-09-02 20:01:56 |
+------------------------------------------------------------------------------------+
| InternalTable: YES | Size: 0 |
+------------------------------------------------------------------------------------+
| Native Columns: |
+------------------------------------------------------------------------------------+
| Field | Type | Label | Comment |
+------------------------------------------------------------------------------------+
| key | varchar(2048) | | The name of the object. |
| size | bigint | | The size of the returned object in bytes. |
| type | varchar(32) | | The type of the object and valid values: Normal, Multipart, Appendable, and Symlink. |
| last_modified | timestamp | | The last modified time of the object. |
| storage_class | varchar(32) | | The storage class of the object. |
| etag | varchar(64) | | The entity tag (ETag). When an object is created, an ETag is created to identify the content of the object. |
| restore_info | varchar(256) | | The restoration status of the object. |
| owner_id | bigint | | The ID of the bucket owner. |
| owner_display_name | varchar(256) | | The display name of the bucket owner. |
+------------------------------------------------------------------------------------+The following table describes some of the columns in the returned result.
Column Name | Type Description | NULL Allowed | Description |
key | VARCHAR(2048) The native length constraint is 1023. For more information, see OSS object naming conventions and examples. | False | The relative path name of the object in the Object Table. |
size | BIGINT | False | The size of the object in bytes. |
type | VARCHAR(32) | False | The file type of the object in OSS: Normal, Multipart, Appendable, and Symlink. |
last_modified | TIMESTAMP_NTZ | False | The time when the object data was last modified on OSS. |
storage_class | VARCHAR(32) | False | The storage class of the object on OSS. For more information about specific types, see Storage classes. |
etag | VARCHAR(64) | False | The ETag is an entity tag created when each object is generated. It is a signature used to identify whether the content of an object has changed between two updates, but it is not a unique identifier. |
restore_info | VARCHAR(256) | True | Describes whether an object has been restored from cold storage. If an object is being restored, this field contains information about the process. |
owner_id | BIGINT | True | The ID of the object owner. |
owner_display_name | VARCHAR(256) | True | The display name of the object owner. |
View the DDL statement of an OBJECT TABLE
Syntax
SHOW CREATE TABLE <object_table_name>;Parameters
object_table_name: Required. The table name.
Example
SHOW CREATE TABLE ot_demo_day; The following result is returned.
CREATE OBJECT TABLE IF NOT EXISTS yunqi_object_****.`default`.ot_demo_day
WITH SERDEPROPERTIES (
'serialization.format'='1',
'odps.properties.rolearn'='acs:ram::139699392458****:role/aliyunodpsdefaultrole')
LOCATION
'oss://oss-cn-hangzhou-internal.aliyuncs.com/odps-external-****/ottest/'
TBLPROPERTIES (
'last_modified_time'='1731478307',
'transient_lastDdlTime'='1731478307',
'metadata.cache.mode'='manual',
'metadata.staleness.seconds'='3600');Refresh OBJECT TABLE metadata
The actual object data for an Object Table is stored in OSS. MaxCompute caches the metadata of these objects and performs queries and computations based on the cached metadata. Therefore, before you use an Object Table, you must refresh its cache. You can refresh the cache manually or configure periodic refreshes when you create the table.
Both manual and periodic refreshes are full refreshes.
Manual refresh
Each refresh is a full metadata synchronization. You can control the timing and frequency of the refreshes.
Syntax
ALTER TABLE <objecttable_name> REFRESH METADATA;Parameters
objecttable_name: Required. The table name.
Example
ALTER TABLE ot_demo_day REFRESH METADATA;
Periodic refresh
If files in the OSS directory that is mapped to the Object Table change frequently, you can refresh the metadata periodically. To do this, specify the relevant parameters when you create the table. This reduces maintenance costs.
Syntax
SET odps.namespace.schema=true; SET odps.sql.type.system.odps2 = true; CREATE OBJECT TABLE ot_demo_day WITH serdeproperties ( 'odps.properties.rolearn'='acs:ram::xxxxxx:role/aliyunodpsdefaultrole' ) location 'oss://oss-cn-hangzhou-internal.aliyuncs.com/odps-external-****/ottest/' tblproperties ( 'metadata.cache.mode' = 'periodic', 'metadata.staleness.seconds' = '3600' );Parameters
metadata.staleness.seconds: The refresh period. Forperiodicmode, you must specify this parameter. The value must be within the range of[1, 604800], which is 1 second to 1 week. This parameter is not a strict guarantee. The scheduler attempts to execute the refresh based on this parameter.metadata.cache.mode: The refresh trigger method. The following options are available.periodic: The refresh is triggered periodically.manual: The refresh is triggered manually (default). You can control the trigger timing.
View refresh tasks
You can use the following command to view the status of historical refresh tasks.
SHOW refresh task history FOR object TABLE <object_table_name>;Parameters
<object_table_name> must be an Object Table.
Return values: The command returns the instance ID (InstanceId), start time (CreateTime), end time (EndTime), and status (Status) of each refresh task.
If the status is Failed, you can run
wait InstanceId;and then print the log view to see the error details.
Example
-- View the historical refresh tasks of the Object Table. SHOW refresh task history for object table ot_demo_day04; -- The following result is returned. ID = 20260105*******f +---------------------------------------------------------------------------------------------------+ | Project: test_project | | Schema: default | | Task: *** | +---------------------------------------------------------------------------------------------------+ | History: | +---------------------------------------------------------------------------------------------------+ | InstanceId | CreateTime | EndTime | Status | +---------------------------------------------------------------------------------------------------+ | 20260105******************ks | 2026-01-05 14:12:00 | 2026-01-05 14:12:04 | Terminated | | 20260105******************y3 | 2026-01-05 14:10:00 | 2026-01-05 14:10:03 | Terminated | +---------------------------------------------------------------------------------------------------+ OK
Query an OBJECT TABLE
An Object Table retrieves the metadata of files in an OSS directory. You can query the Object Table to view this metadata. You can also use SQL statements to perform computations on the metadata, such as filtering and matching. These computations include but are not limited to aggregations, joins, window functions, and `ORDER BY` and `LIMIT` clauses.
Syntax
SELECT * FROM <object_table_name>;Parameters
object_table_name: Required. The table name.
Example
-- You can query the data uploaded to the specified OSS directory. If the data volume is large, you can view five records.
SELECT * FROM ot_demo_day [limit 5];Query object content for business computation
The query process performs computations only on the object metadata. It does not involve the actual content of the objects. However, the computation process needs to read the actual content of the objects. MaxCompute has a built-in download function that you can integrate into your computation process.
GET_DATA_FROM_OSS function syntax
The GET_DATA_FROM_OSS function reads part or all of an object's content and returns it in binary format.
BINARY GET_DATA_FROM_OSS (
STRING <full_object_table_name>,
STRING <key>
[, BIGINT <offset>]
[, BIGINT <length>]
[, STRING <object_not_found_policy>]
)Parameters
Parameter | Required | Data type | Description | Default value |
full_object_table_name | Yes | STRING | The full path to the OBJECT TABLE in the three-tier model, including the Project and Schema names, such as If you use RoleARN for authentication when you create the table, this parameter helps automatically generate a Security Token Service (STS) token to access OSS. | None |
key | Yes | STRING | The name of the accessed object in the Object Table. For more information, see the description of the `key` parameter in the response described in View OBJECT TABLE properties. | None |
offset | No | BIGINT | The starting position from which to read the object content. The value must be greater than or equal to 0. | 0. This means the read operation starts from the beginning of the object. |
length | No | BIGINT | The number of bytes to read. | -1. This means the length is not limited. |
object_not_found_policy | No | STRING | Specifies how MaxCompute returns the result of the function invocation if an object key exists in the cached data but the object no longer exists in OSS. Valid values:
| The default value is OUTPUT_NULL. |
Example 1
To output the result of the GET_DATA_FROM_OSS function as the STRING type, nest it within the `STRING` function.
SELECT STRING(
GET_DATA_FROM_OSS('<project_name>.default.ot_demo_day', key, 0, -1, 'OUTPUT_NULL')
)
FROM ot_demo_day;Example 2
Read all the content of the Object Table and return it as a binary value. The full path of the Object Table is <project_name>.default.ot_demo_day. The following code shows different parameter combinations.
-- The complete format.
SELECT GET_DATA_FROM_OSS('<project_name>.default.ot_demo_day', key, 0, -1, 'OUTPUT_NULL') FROM ot_demo_day;
-- The following statements are equivalent to get_data_from_oss('<project_name>.default.ot_demo_day', key, 0, -1, 'OUTPUT_NULL').
SELECT GET_DATA_FROM_OSS('<project_name>.default.ot_demo_day', key) FROM ot_demo_day;
SELECT GET_DATA_FROM_OSS('<project_name>.default.ot_demo_day', key, 0) FROM ot_demo_day;
SELECT GET_DATA_FROM_OSS('<project_name>.default.ot_demo_day', key, 0, -1) FROM ot_demo_day;
SELECT GET_DATA_FROM_OSS('<project_name>.default.ot_demo_day', key, 'OUTPUT_NULL') FROM ot_demo_day;
SELECT GET_DATA_FROM_OSS('<project_name>.default.ot_demo_day', key, 0, 'OUTPUT_NULL') FROM ot_demo_day;Use query optimization to improve performance
By default, when you access a table in a MaxCompute SQL statement, MaxCompute creates evenly sized chunks based on the number of records, the size of the records in bytes, and the total computing resources available under your quota. This provides parallel computing capabilities for different shards, which effectively controls long-tail problems in parallel tasks and improves overall query performance.
This chunking method is not ideal for SQL statements that need to download the content of objects from an Object Table for in-memory computation. Performance is not optimal, and I/O operations can easily become a bottleneck, leading to long-tail problems. Consider the following case:
key | size |
a0000.jpg | 10 MB |
a0001.jpg | 10 MB |
a0002.jpg | 10 MB |
... | ... |
a1022.jpg | 10 MB |
a1023.jpg | 10 MB |
b.avi | 10 GB |
Assume that only two worker resources are available. If chunks are created based on the number of rows or record bytes, as is done for standard internal tables, two split objects might be created: split1 ([a0000.jpg to a0511.jpg]) and split2 ([a0512.jpg to a1023.jpg, b.avi]). The actual data volume to download for split1 is 10 MB × 512 = 5 GB. The data volume for split2 is 5 GB + 10 GB = 15 GB. The computation load for split2 is significantly higher than for split1, which causes a severe long-tail problem.
A more logical approach is to create chunks based on the actual size of the objects when your SQL statement needs to download them. This helps control long-tail problems. If the logic for consuming OSS object data is more time-consuming, a more flexible chunking capability is needed. For the preceding case, you can create two splits: split1 ([a0000.jpg to a1023.jpg]) and split2 ([b.avi]). The download volume for both splits is 10 GB.
MaxCompute Object Tables provide this capability and create chunks based on the actual object size by default. The default value is 1 GB. You can adjust this value as needed, with granularity levels of KB, MB, or GB.
# The default value is 1 GB. You can adjust it.
SET odps.sql.object.table.split.unit.gb = 1;
SELECT get_data_from_oss('project.default.ot_demo_day', key) FROM ot_demo_da WHERE ...
# Alternatively, you can control it at the MB level. This has a higher priority than the GB parameter.
SET odps.sql.object.table.split.unit.mb = 1;
SELECT get_data_from_oss('project.default.ot_demo_day', key) FROM ot_demo_da WHERE ...
# Alternatively, you can control it at the KB level. This has a higher priority than the GB and MB parameters.
SET odps.sql.object.table.split.unit.kb = 1;
SELECT get_data_from_oss('project.default.ot_demo_day', key) FROM ot_demo_da WHERE ...In some cases, you may want to disable this query optimization solution. This solution starts two jobs. The first job performs pre-processing, and you can view its information in Logview. MaxCompute Object Tables provide a way to do this:
SET odps.sql.object.table.split.by.object.size.enabled = false;
SELECT get_data_from_oss('project.default.ot_demo_day', key) FROM ot_demo_day WHERE ...Delete an OBJECT TABLE
Object Tables cache your metadata, which consumes storage space and incurs storage costs. If you no longer need this cached data for your business, you can delete the Object Table. To use the data again later, you can recreate the Object Table.
Syntax
DROP TABLE [IF EXISTS] <object_table_name>; Parameters
object_table_name: Required. The table name.
Example
DROP TABLE IF EXISTS ot_demo_day;FAQ
ODPS-0010000:System internal error
Symptom
The following error message is returned:
ODPS-0010000:System internal error - ActionHandler job failed with failinfo storage service worker error occured: common/io/oss/oss_file_system_cppsdk.cpp(919): OSSRequestException: Status: -50, RequestId: , ErrorCode: ClientError:-50, Message: E_HTTP_ERROR_CONN_REFUSEDCause
A public endpoint for OSS was used when the Object Table was created.
Solution
When you create an Object Table, the
oss_endpointin thelocationparameter must be an internal network address. For more information about how to obtain the address, see Parameters.
Periodic refresh fails
Symptom
You set parameters for a periodic refresh when you create an Object Table. However, the refresh does not occur when the refresh period is reached.
Solution
Check whether the value of the `location` parameter configured for the Object Table is an OSS internal network domain name. For more information about how to create an Object Table, see Parameters.
Periodic refresh task reports an error
Error message
FAILED: ODPS-0010000:System internal error - ActionHandler job failed with failinfo storage service worker error occured: common/io/oss/oss_file_system_cppsdk.cpp(877): OSSRequestException: Status: -50, RequestId: , ErrorCode: ClientError:-50, Message:
Solution
This is an internal error. You can submit a ticket to contact the MaxCompute technical support team.