
MaxCompute: Schemaless Query

Last Updated: Aug 09, 2025

MaxCompute supports schemaless queries that read Parquet data in Object Storage Service (OSS) without creating an external table. You can export the parsed dataset to OSS, write it to an internal table, or embed it as a subquery in SQL operations. This feature simplifies data operations in a data lake.

Background information

When you use Spark to process structured data for ad hoc exploration, you do not need a fixed data warehouse model. However, reading and writing OSS files through an external table requires you to define a table schema for both input and output, map fields from the OSS files to the external table, and maintain partitions, which makes the workflow complex.

Schemaless Query simplifies this process. When you use a command to read a Parquet external table, MaxCompute automatically parses the file format and reads the data into a dataset with a schema. You can then select specific columns from the dataset for data processing, similar to how you would with a table. The results can be exported to OSS using the UNLOAD command or imported into an internal table using CREATE TABLE AS. You can also use the dataset as a subquery in other SQL statements. This provides a flexible way to operate on data in a data lake using MaxCompute.
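
A minimal sketch of the flow, assuming a hypothetical OSS directory oss://oss-cn-hangzhou-internal.aliyuncs.com/my-bucket/data/ that already contains Parquet files:

-- MaxCompute infers the schema from the Parquet files at this location.
SELECT * FROM 
LOCATION 'oss://oss-cn-hangzhou-internal.aliyuncs.com/my-bucket/data/'
(
  'file_format'='parquet'
);

The same LOCATION clause can then feed CREATE TABLE AS, UNLOAD, or a subquery, as the examples below demonstrate.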

Limits

Schemaless Query does not support treating subdirectories in an OSS bucket as partitions.

Syntax

SELECT *, <col_name>, <table_alias>.<col_name> 
FROM 
 LOCATION '<location_path>'  
 ('key'='value' [, 'key1'='value1', ...]) 
 [AS <table_alias>];

Parameters

| Parameter | Required | Description |
| --- | --- | --- |
| * | Yes | Queries all fields in the Parquet file. |
| col_name | Yes | Queries a field with a known name in the Parquet file. |
| table_alias.col_name | Yes | Queries a field with a known name in the Parquet file by its full path, which consists of the table alias and the field name. |
| table_alias | No | A custom alias for the table. |
| location_path | Yes | The location of the Parquet files. This must be an OSS directory with the structure oss://oss_endpoint/bucket_name/path/. A subdirectory under path can be a partition directory in the partition_name=partition_value format. Note: Schemaless Query does not support treating subdirectories in the location as partitions, so partition pruning is not supported. |
| key&value | Yes | The parameters and their values for the query. For more information, see the following table. |

The following table describes the supported key-value parameters:

| Key | Value | Required | Description | Default value |
| --- | --- | --- | --- | --- |
| file_format | parquet | Yes | Specifies the format of the files in the location. Only Parquet is supported. If you specify another format, an error is reported. | parquet |
| rolearn | acs:ram::xxxxxx:role/aliyunodpsdefaultrole | No | Specifies the RoleArn required to access the location. You can obtain the RoleArn in the RAM console by searching for AliyunODPSDefaultRole on the Roles page. Note: If you do not specify a RoleArn in the SQL statement, the system uses the ARN of the AliyunODPSDefaultRole role by default. | acs:ram::1234****:role/aliyunodpsdefaultrole |
| file_pattern_blacklist | A regular expression. Example: ".*_SUCCESS$,.*\\.hive_staging.*" | No | Specifies a blacklist of files to exclude. If a scanned file name matches the blacklist, the file is not read. | None |
| file_pattern_whitelist | A regular expression. Example: ".*_20250124_.*.parquet" | No | Specifies a whitelist of files to read. A file is read only if its name matches the whitelist. | .* |
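
Putting the syntax and parameters together, a query can mix plain column names with alias-qualified full paths. The following is a minimal sketch; the directory and column names are hypothetical:

-- t is the custom table alias; id is addressed through it, name directly.
SELECT t.id, name 
FROM 
 LOCATION 'oss://oss-cn-hangzhou-internal.aliyuncs.com/my-bucket/data/'
 ('file_format'='parquet') 
 AS t;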

Examples

Example 1: Read OSS data by setting blacklist and whitelist parameters

  1. Prepare the data.

    Log on to the OSS console and upload the test data to the object-table-test/schema/ directory in an OSS bucket. For more information, see Upload files to OSS.

    • Prepare a Parquet file to read and verify the whitelist parameter.

    • Prepare a CSV file to verify the blacklist parameter.

  2. Read the Parquet file.

    Log on to the MaxCompute client (odpscmd) and execute the following SQL commands.

    • Add test_oss.csv to the blacklist and read the Parquet file from OSS.

      SELECT tinyint_name, int_name, binary_name, float_name, varchar_name 
      FROM LOCATION 'oss://oss-cn-hangzhou-internal.aliyuncs.com/object-table-test/schema/'
      (
      'file_format'='parquet',
      'rolearn'='acs:ram::<uid>:role/aliyunodpsdefaultrole',
      'file_pattern_blacklist'='.*test_oss.*'
      );
    • Add 20250610TableSink to the whitelist and read the Parquet file from OSS.

      SELECT tinyint_name, int_name, binary_name, float_name, varchar_name 
      FROM location 'oss://oss-cn-hangzhou-internal.aliyuncs.com/object-table-test/schema/'
      (
      'file_format'='parquet',
      'rolearn'='acs:ram::<uid>:role/aliyunodpsdefaultrole',
      'file_pattern_whitelist'='.*20250610TableSink.*'
      );

    In both cases, the 20250610TableSink file is read. The following result is returned:

    +--------------+------------+-------------+------------+--------------+
    | tinyint_name | int_name   | binary_name | float_name | varchar_name |
    +--------------+------------+-------------+------------+--------------+
    | 1            | 100        | abc         | 3.14       | N            |
    +--------------+------------+-------------+------------+--------------+
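
    The blacklist example in the Parameters section joins two patterns with a comma (".*_SUCCESS$,.*\\.hive_staging.*"), which suggests that several patterns can be supplied at once. The following sketch excludes both the CSV file and any _SUCCESS marker files in a single query; treating the comma as a pattern separator is an assumption based on that example:

      SELECT tinyint_name, int_name, binary_name, float_name, varchar_name 
      FROM LOCATION 'oss://oss-cn-hangzhou-internal.aliyuncs.com/object-table-test/schema/'
      (
      'file_format'='parquet',
      'rolearn'='acs:ram::<uid>:role/aliyunodpsdefaultrole',
      -- Two patterns, comma-separated (assumed from the documented blacklist example).
      'file_pattern_blacklist'='.*test_oss.*,.*_SUCCESS.*'
      );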

Example 2: Read data written by Spark

  1. Generate Parquet data using Serverless Spark. For more information, see Create an SQL task.

    1. Log on to the E-MapReduce console and create an EMR Serverless Spark workspace.

    2. On the EMR Serverless > Spark page, click the newly created workspace. On the Data Development page, execute the following SQL statement.

      CREATE TABLE example_table_parquet04 (
          id STRING,
          name STRING,
          age STRING,
          salary DOUBLE,
          is_active BOOLEAN,
          created_at TIMESTAMP,
          details STRUCT<department:STRING, position:STRING>
      )
      USING PARQUET;
      
      INSERT INTO example_table_parquet04 VALUES
      ('1', 'Alice', '30', 5000.50, TRUE, TIMESTAMP '2024-01-01 10:00:00', STRUCT('HR', 'Manager')),
      ('2', 'Bob', '25', 6000.75, FALSE, TIMESTAMP '2024-02-01 11:00:00', STRUCT('Engineering', 'Developer')),
      ('3', 'Charlie','35', 7000.00, TRUE, TIMESTAMP '2024-03-01 12:00:00', STRUCT('Marketing', 'Analyst')),
      ('4', 'David', '40', 8000.25, FALSE, TIMESTAMP '2024-04-01 13:00:00', STRUCT('Sales', 'Representative')),
      ('5', 'Eve', '28', 5500.50, TRUE, TIMESTAMP '2024-05-01 14:00:00', STRUCT('Support', 'Technician'));
      
      SELECT * FROM example_table_parquet04;
  2. Log on to the OSS console and view the generated data files in the destination path.

  3. Log on to the MaxCompute client, add _SUCCESS to the blacklist, and read the Parquet files from OSS. Assume that the specified OSS directory is object-table-test/spark.

    SELECT * 
    FROM LOCATION 'oss://oss-cn-hangzhou-internal.aliyuncs.com/object-table-test/spark/example_table_parquet04/'
    (
    'file_format'='parquet',
    'rolearn'='acs:ram::<uid>:role/aliyunodpsdefaultrole',
    'file_pattern_blacklist'='.*_SUCCESS.*'
    );

    The following result is returned:

    +----+---------+-----+------------+-----------+---------------------+----------------------------------------------+
    | id | name    | age | salary     | is_active | created_at          | details                                      |
    +----+---------+-----+------------+-----------+---------------------+----------------------------------------------+
    | 1  | Alice   | 30  | 5000.5     | true      | 2024-01-01 10:00:00 | {department:HR, position:Manager}            |
    | 2  | Bob     | 25  | 6000.75    | false     | 2024-02-01 11:00:00 | {department:Engineering, position:Developer} |
    | 3  | Charlie | 35  | 7000.0     | true      | 2024-03-01 12:00:00 | {department:Marketing, position:Analyst}     |
    | 4  | David   | 40  | 8000.25    | false     | 2024-04-01 13:00:00 | {department:Sales, position:Representative}  |
    | 5  | Eve     | 28  | 5500.5     | true      | 2024-05-01 14:00:00 | {department:Support, position:Technician}    |
    +----+---------+-----+------------+-----------+---------------------+----------------------------------------------+
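
    The details column is a STRUCT, so individual fields can be pulled out of the inferred schema. The following sketch also exercises the optional table alias from the syntax section; dot-notation STRUCT access is standard MaxCompute SQL, but its use on a schemaless dataset is assumed here:

    -- Alias the inferred dataset as t, then read a nested STRUCT field.
    SELECT t.name, t.details.department AS department
    FROM LOCATION 'oss://oss-cn-hangzhou-internal.aliyuncs.com/object-table-test/spark/example_table_parquet04/'
    (
    'file_format'='parquet',
    'file_pattern_blacklist'='.*_SUCCESS.*'
    ) AS t;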

Example 3: Use Schemaless Query in a subquery

  1. Prepare the data.

    Log on to the OSS console and upload the test data part-00001.snappy.parquet to the specified directory object-table-test/schema/ in an OSS bucket. For more information, see Upload files to OSS.

  2. Log on to the MaxCompute client and create an internal table to store the automatically discovered data from OSS.

    CREATE TABLE ow_test (
        id INT,
        name STRING,
        age INT
    );
  3. Read the OSS data using Schemaless Query. The command is as follows.

    SELECT * FROM 
    LOCATION 'oss://oss-cn-hangzhou-internal.aliyuncs.com/object-table-test/schema/' 
    (
      'file_format'='parquet'
    );

    The following result is returned:

    +------------+------------+------------+
    | id         | name       | age        |
    +------------+------------+------------+
    | 3          | Charlie    | 35         |
    | 4          | David      | 40         |
    | 5          | Eve        | 28         |
    +------------+------------+------------+
  4. Pass the data read from OSS as a subquery to an outer SQL statement to query the ow_test table.

    INSERT OVERWRITE TABLE ow_test
    SELECT id, name, age FROM 
    (
        SELECT * FROM 
        LOCATION 'oss://oss-cn-hangzhou-internal.aliyuncs.com/object-table-test/schema/' 
        (
          'file_format'='parquet'
        )
    );
    
    SELECT * FROM ow_test;

    The following result is returned:

    +------------+------------+------------+
    | id         | name       | age        |
    +------------+------------+------------+
    | 3          | Charlie    | 35         |
    | 4          | David      | 40         |
    | 5          | Eve        | 28         |
    +------------+------------+------------+
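
    Because the dataset behaves like a table, the subquery can be combined with ordinary SQL clauses before the rows reach ow_test. The following sketch adds a filter during the insert; the predicate age >= 30 is purely illustrative:

    INSERT OVERWRITE TABLE ow_test
    SELECT id, name, age FROM 
    (
        SELECT * FROM 
        LOCATION 'oss://oss-cn-hangzhou-internal.aliyuncs.com/object-table-test/schema/' 
        (
          'file_format'='parquet'
        )
    )
    -- Keep only rows whose age is at least 30 (illustrative filter).
    WHERE age >= 30;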

Example 4: Save the Schemaless Query result to an internal data warehouse table

  1. Prepare the data.

    Log on to the OSS console and upload the test data part-00001.snappy.parquet to the object-table-test/schema/ directory in an OSS bucket. For more information, see Upload files to OSS.

  2. Log on to the MaxCompute client and read the OSS data using Schemaless Query.

    SELECT * FROM 
    LOCATION 'oss://oss-cn-hangzhou-internal.aliyuncs.com/object-table-test/schema/' 
    (
      'file_format'='parquet'
    );

    The following result is returned:

    +------------+------------+------------+
    | id         | name       | age        |
    +------------+------------+------------+
    | 3          | Charlie    | 35         |
    | 4          | David      | 40         |
    | 5          | Eve        | 28         |
    +------------+------------+------------+
  3. Copy the automatically discovered data from OSS to an internal table using the CREATE TABLE AS statement, and then query the table.

    CREATE TABLE ow_test_2 AS 
      SELECT * FROM 
      LOCATION 'oss://oss-cn-hangzhou-internal.aliyuncs.com/object-table-test/schema/' 
      (
        'file_format'='parquet'
       );
    
    -- Query the result table ow_test_2
    SELECT * FROM ow_test_2;

    The following result is returned:

    +------------+------------+------------+
    | id         | name       | age        |
    +------------+------------+------------+
    | 3          | Charlie    | 35         |
    | 4          | David      | 40         |
    | 5          | Eve        | 28         |
    +------------+------------+------------+
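
    CREATE TABLE AS derives the column types of ow_test_2 from the inferred schema, so it can be worth confirming the result. DESC is the standard MaxCompute command for this; the output is omitted here:

    -- Show the columns and types that were inferred for the new internal table.
    DESC ow_test_2;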

Example 5: Unload the Schemaless Query result back to the data lake

  1. Prepare the data.

    Log on to the OSS console and upload the test data part-00001.snappy.parquet to the object-table-test/schema/ directory in an OSS bucket. For more information, see Upload files to OSS.

  2. Log on to the MaxCompute client and read the OSS data using Schemaless Query.

    SELECT * FROM 
    LOCATION 'oss://oss-cn-hangzhou-internal.aliyuncs.com/object-table-test/schema/' 
    (
      'file_format'='parquet'
    );

    The following result is returned:

    +------------+------------+------------+
    | id         | name       | age        |
    +------------+------------+------------+
    | 3          | Charlie    | 35         |
    | 4          | David      | 40         |
    | 5          | Eve        | 28         |
    +------------+------------+------------+
  3. Unload the automatically discovered data to OSS. For more information about the UNLOAD operation, see UNLOAD.

    UNLOAD FROM (
      SELECT * FROM 
      LOCATION 'oss://oss-cn-hangzhou-internal.aliyuncs.com/object-table-test/schema/' 
      ('file_format'='parquet')
    ) 
    INTO 
    LOCATION 'oss://oss-cn-hangzhou-internal.aliyuncs.com/object-table-test/unload/ow_test_3/'
    ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
    WITH SERDEPROPERTIES ('odps.external.data.enable.extension'='true')
    STORED AS PARQUET;

    View the files generated in the OSS directory.
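
    To verify the round trip, the unloaded directory can itself be read back with another Schemaless Query. This sketch assumes the UNLOAD wrote Parquet files directly under the target path:

    -- Read back the files that UNLOAD just produced.
    SELECT * FROM 
    LOCATION 'oss://oss-cn-hangzhou-internal.aliyuncs.com/object-table-test/unload/ow_test_3/'
    (
      'file_format'='parquet'
    );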