MaxCompute supports schemaless queries on data in Parquet external tables that are stored in Object Storage Service (OSS). You can export the parsed dataset to OSS or write it to an internal table. You can also embed the dataset as a subquery in SQL operations to easily operate on data in a data lake.
Background
When you use Spark to perform ad-hoc exploration of structured data, you do not need to rely on a fixed data warehouse model. If you must define table schemas for input and output, manually map file fields, and maintain partitions before you can read or write OSS data, the process becomes cumbersome and inflexible.
When you read a Parquet external table, the LOAD instruction automatically parses the file format and reads the data into a dataset with a schema. This lets you process the data as if it were in a table by selecting specific columns. You can export the results to OSS using the UNLOAD command or import the results into an internal table using CREATE TABLE AS. You can also use the dataset as a subquery in other SQL statements. This provides a flexible way for you to use MaxCompute to operate on data in a data lake.
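For example, the same schemaless SELECT can feed each of these output paths. The following is a condensed sketch; the OSS endpoint, bucket path, and table name are placeholders, and complete examples follow later in this topic:

```sql
-- Placeholders: replace the OSS path and table name with your own values.
-- Embed the schemaless query as a subquery:
SELECT COUNT(*)
FROM (
  SELECT * FROM LOCATION 'oss://<endpoint>/<bucket>/<directory>/' ('file_format'='parquet')
) t;

-- Or materialize the result into an internal table:
CREATE TABLE my_internal_table AS
SELECT * FROM LOCATION 'oss://<endpoint>/<bucket>/<directory>/' ('file_format'='parquet');
```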
Usage notes
Schemaless queries currently do not support treating subdirectories in an OSS bucket as partitions.
Syntax
```sql
SELECT *, <col_name>, <table_alias>.<col_name>
FROM
LOCATION '<location_path>'
('key'='value' [, 'key1'='value1', ...])
[AS <table_alias>];
```

Parameters
| Parameter | Required | Description |
| --- | --- | --- |
| * | Yes | Queries all fields in the Parquet file. |
| col_name | Yes | Queries a field with a known column name in the Parquet file. |
| table_alias.col_name | Yes | Queries a field with a known column name in the Parquet file, expressed as the full path of the table alias and field name. |
| table_alias | No | A custom table alias. |
| location_path | Yes | The OSS directory that stores the files to read, in the format `oss://<endpoint>/<bucket>/<directory>/`. |
| key and value | Yes | The parameters and their values for the query statement. For more information, see the following table. |
key and value parameters
| key | Required | Description | value | Default value |
| --- | --- | --- | --- | --- |
| file_format | Yes | Specifies the format of the files at the location. Only Parquet is supported. Other formats cause an error. | parquet | parquet |
| rolearn | No | Specifies the RoleARN that is required to access the location. Note: If you do not specify a RoleARN in the SQL statement, the system uses the ARN of the default role. | A RoleARN. | acs:ram::1234****:role/aliyunodpsdefaultrole |
| file_pattern_blacklist | No | Specifies a blacklist of files to read. If a scanned file name matches the blacklist, the file is not read. | A regular expression. For example: `.*test_oss.*` | None |
| file_pattern_whitelist | No | Specifies a whitelist of files to read. A file is read only if its name matches the whitelist. | A regular expression. For example: `.*sample_data.*` | None |
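For example, the following statement combines these parameters to read all Parquet files in a directory. This is a minimal sketch; the bucket path and `<uid>` are placeholders that you must replace with your own values:

```sql
-- Read all Parquet files in the directory; the RoleARN grants access to OSS.
SELECT *
FROM LOCATION 'oss://oss-cn-hangzhou-internal.aliyuncs.com/<bucket>/<directory>/'
(
  'file_format'='parquet',
  'rolearn'='acs:ram::<uid>:role/aliyunodpsdefaultrole'
) AS t;
```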
Examples
Example 1: Read OSS data by setting blacklist and whitelist parameters
Prepare the data.
Log on to the Object Storage Service (OSS) console.
In the navigation pane on the left, click Buckets.
On the Buckets page, click Create Bucket.
Create the `object-table-test/schema/` directory in the OSS bucket.
Prepare a Parquet file that is used to validate the whitelist parameter. You can run the following Python code locally to create the Parquet file.
```python
import pandas as pd

# Sample data
data = [
    {'id': 3, 'name': 'Charlie', 'age': 35},
    {'id': 4, 'name': 'David', 'age': 40},
    {'id': 5, 'name': 'Eve', 'age': 28}
]
df = pd.DataFrame(data)
df['id'] = df['id'].astype('int32')
df['name'] = df['name'].astype('str')
df['age'] = df['age'].astype('int32')

output_filename = 'sample_data.parquet'
df.to_parquet(output_filename, index=False, engine='pyarrow')
```

Upload the Parquet file to the `object-table-test/schema/` directory in the OSS bucket.
Log on to the Object Storage Service (OSS) console.
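Optionally, verify the generated file locally before you upload it (a quick sanity check, not a required step):

```python
import pandas as pd

# Read the file back to confirm the schema and rows.
print(pd.read_parquet('sample_data.parquet'))
```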
In the `object-table-test/schema/` directory of the bucket, click Upload Object.
In the `object-table-test/schema/` directory of the OSS bucket, prepare a CSV file that is used to validate the blacklist parameter.
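If you do not have a CSV file at hand, the following is a minimal sketch that creates one locally; the file name `test_oss.csv` is chosen so that it matches the blacklist pattern used below:

```python
import pandas as pd

# Create a placeholder CSV whose name matches the blacklist pattern .*test_oss.*
pd.DataFrame([{'id': 1, 'name': 'Alice', 'age': 30}]).to_csv('test_oss.csv', index=False)
```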
Read the Parquet file.
Log on to the MaxCompute client (odpscmd) and execute the following SQL commands.
Add `test_oss.csv` to the blacklist and read the Parquet file from OSS.

```sql
SELECT id, name, age
FROM LOCATION 'oss://oss-cn-hangzhou-internal.aliyuncs.com/object-table-test/schema/'
(
  'file_format'='parquet',
  'rolearn'='acs:ram::<uid>:role/aliyunodpsdefaultrole',
  'file_pattern_blacklist'='.*test_oss.*'
);
```

Add `sample_data` to the whitelist and read the Parquet file from OSS.

```sql
SELECT id, name, age
FROM LOCATION 'oss://oss-cn-hangzhou-internal.aliyuncs.com/object-table-test/schema/'
(
  'file_format'='parquet',
  'rolearn'='acs:ram::<uid>:role/aliyunodpsdefaultrole',
  'file_pattern_whitelist'='.*sample_data.*'
);
```
With either of the preceding parameter settings, the `sample_data` file is read and the following result is returned:

```
+------------+------------+------------+
| id         | name       | age        |
+------------+------------+------------+
| 3          | Charlie    | 35         |
| 4          | David      | 40         |
| 5          | Eve        | 28         |
+------------+------------+------------+
```
Example 2: Read data written by Spark
Create the `object-table-test/spark/` directory in the OSS bucket.
Use Serverless Spark to generate Parquet data. For more information, see Create an SQL task. If Parquet files that are generated by Spark already exist in the OSS directory, you can skip this step.
Log on to the E-MapReduce console and select a region in the upper-left corner.
In the navigation pane on the left, choose EMR Serverless > Spark.
On the Spark page, click the name of the target workspace to open it. You can also click Create Workspace. After the workspace is created, click the name of the new workspace to open it.
In the navigation pane on the left, select Development, create a new SparkSQL file, and execute the following SQL statement:
```sql
CREATE TABLE example_table_parquet04 (
  id STRING,
  name STRING,
  age STRING,
  salary DOUBLE,
  is_active BOOLEAN,
  created_at TIMESTAMP,
  details STRUCT<department:STRING, position:STRING>
) USING PARQUET;

INSERT INTO example_table_parquet04 VALUES
('1', 'Alice',   '30', 5000.50, TRUE,  TIMESTAMP '2024-01-01 10:00:00', STRUCT('HR', 'Manager')),
('2', 'Bob',     '25', 6000.75, FALSE, TIMESTAMP '2024-02-01 11:00:00', STRUCT('Engineering', 'Developer')),
('3', 'Charlie', '35', 7000.00, TRUE,  TIMESTAMP '2024-03-01 12:00:00', STRUCT('Marketing', 'Analyst')),
('4', 'David',   '40', 8000.25, FALSE, TIMESTAMP '2024-04-01 13:00:00', STRUCT('Sales', 'Representative')),
('5', 'Eve',     '28', 5500.50, TRUE,  TIMESTAMP '2024-05-01 14:00:00', STRUCT('Support', 'Technician'));

SELECT * FROM example_table_parquet04;
```
Log on to the OSS console and view the generated data file in the destination path.
Log on to the MaxCompute client, add `_SUCCESS` to the blacklist (Spark writes a `_SUCCESS` marker file alongside the data files, and this marker is not a valid Parquet file), and read the Parquet files from OSS.

```sql
SELECT *
FROM LOCATION 'oss://oss-cn-hangzhou-internal.aliyuncs.com/object-table-test/spark/example_table_parquet04/'
(
  'file_format'='parquet',
  'rolearn'='acs:ram::<uid>:role/aliyunodpsdefaultrole',
  'file_pattern_blacklist'='.*_SUCCESS.*'
);
```

The following result is returned:
```
+----+---------+-----+------------+-----------+---------------------+----------------------------------------------+
| id | name    | age | salary     | is_active | created_at          | details                                      |
+----+---------+-----+------------+-----------+---------------------+----------------------------------------------+
| 1  | Alice   | 30  | 5000.5     | true      | 2024-01-01 10:00:00 | {department:HR, position:Manager}            |
| 2  | Bob     | 25  | 6000.75    | false     | 2024-02-01 11:00:00 | {department:Engineering, position:Developer} |
| 3  | Charlie | 35  | 7000.0     | true      | 2024-03-01 12:00:00 | {department:Marketing, position:Analyst}     |
| 4  | David   | 40  | 8000.25    | false     | 2024-04-01 13:00:00 | {department:Sales, position:Representative}  |
| 5  | Eve     | 28  | 5500.5     | true      | 2024-05-01 14:00:00 | {department:Support, position:Technician}    |
+----+---------+-----+------------+-----------+---------------------+----------------------------------------------+
```
For more information about schemaless query operations, see Use a schemaless query to read Parquet data from a data lake.
Example 3: Use a schemaless query as a subquery
Prepare the data.
Log on to the OSS console and upload the test data file part-00001.snappy.parquet to the `object-table-test/schema/` directory of the specified OSS bucket. For more information, see Upload files to OSS.
Log on to the MaxCompute client and create an internal table to store the automatically discovered OSS data.
```sql
CREATE TABLE ow_test (
  id   INT,
  name STRING,
  age  INT
);
```

Read the OSS data by using a schemaless query. The command is as follows:

```sql
SELECT * FROM LOCATION 'oss://oss-cn-hangzhou-internal.aliyuncs.com/object-table-test/schema/'
('file_format'='parquet');
```

The following result is returned:

```
+------------+------------+------------+
| id         | name       | age        |
+------------+------------+------------+
| 3          | Charlie    | 35         |
| 4          | David      | 40         |
| 5          | Eve        | 28         |
+------------+------------+------------+
```

You can pass the data read from OSS as a subquery to an outer SQL statement and query the ow_test result table.

```sql
INSERT OVERWRITE TABLE ow_test
SELECT id, name, age
FROM (
  SELECT * FROM LOCATION 'oss://oss-cn-hangzhou-internal.aliyuncs.com/object-table-test/schema/'
  ('file_format'='parquet')
);

SELECT * FROM ow_test;
```

The following result is returned:

```
+------------+------------+------------+
| id         | name       | age        |
+------------+------------+------------+
| 3          | Charlie    | 35         |
| 4          | David      | 40         |
| 5          | Eve        | 28         |
+------------+------------+------------+
```
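Because the schemaless query behaves like a table, the outer statement can also filter or aggregate the subquery result. The following is a sketch that assumes the same sample data as above:

```sql
-- Aggregate directly over the schemaless query result.
SELECT COUNT(*) AS cnt, AVG(age) AS avg_age
FROM (
  SELECT * FROM LOCATION 'oss://oss-cn-hangzhou-internal.aliyuncs.com/object-table-test/schema/'
  ('file_format'='parquet')
) t;
```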
Example 4: Save the results of a schemaless query to an internal data warehouse table
Prepare the data.
Log on to the OSS console and upload the test data file part-00001.snappy.parquet to the `object-table-test/schema/` directory of the specified OSS bucket. For more information, see Upload files to OSS.
Log on to the MaxCompute client and read the OSS data by using a schemaless query.
```sql
SELECT * FROM LOCATION 'oss://oss-cn-hangzhou-internal.aliyuncs.com/object-table-test/schema/'
('file_format'='parquet');
```

The following result is returned:

```
+------------+------------+------------+
| id         | name       | age        |
+------------+------------+------------+
| 3          | Charlie    | 35         |
| 4          | David      | 40         |
| 5          | Eve        | 28         |
+------------+------------+------------+
```

Copy the automatically discovered OSS data to an internal table by using a `CREATE TABLE AS` statement and query the result.

```sql
CREATE TABLE ow_test_2 AS
SELECT * FROM LOCATION 'oss://oss-cn-hangzhou-internal.aliyuncs.com/object-table-test/schema/'
('file_format'='parquet');

-- Query the result table ow_test_2
SELECT * FROM ow_test_2;
```

The following result is returned:

```
+------------+------------+------------+
| id         | name       | age        |
+------------+------------+------------+
| 3          | Charlie    | 35         |
| 4          | David      | 40         |
| 5          | Eve        | 28         |
+------------+------------+------------+
```
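You do not have to copy every column. The CTAS statement can also project a subset of columns; the following is a sketch that assumes the same sample data, and the table name ow_test_names is illustrative:

```sql
-- Keep only the id and name columns in the new internal table.
CREATE TABLE ow_test_names AS
SELECT id, name
FROM LOCATION 'oss://oss-cn-hangzhou-internal.aliyuncs.com/object-table-test/schema/'
('file_format'='parquet');
```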
Example 5: Unload the results of a schemaless query back to the data lake
Prepare the data.
Log on to the OSS console and upload the test data file part-00001.snappy.parquet to the `object-table-test/schema/` directory of the specified OSS bucket. For more information, see Upload files to OSS.
Log on to the MaxCompute client and read the OSS data by using a schemaless query.
```sql
SELECT * FROM LOCATION 'oss://oss-cn-hangzhou-internal.aliyuncs.com/object-table-test/schema/'
('file_format'='parquet');
```

The following result is returned:

```
+------------+------------+------------+
| id         | name       | age        |
+------------+------------+------------+
| 3          | Charlie    | 35         |
| 4          | David      | 40         |
| 5          | Eve        | 28         |
+------------+------------+------------+
```

Unload the automatically discovered results to OSS. For more information about the UNLOAD operation, see UNLOAD.

```sql
UNLOAD FROM (
  SELECT * FROM LOCATION 'oss://oss-cn-hangzhou-internal.aliyuncs.com/object-table-test/schema/'
  ('file_format'='parquet')
)
INTO LOCATION 'oss://oss-cn-hangzhou-internal.aliyuncs.com/object-table-test/unload/ow_test_3/'
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES ('odps.external.data.enable.extension'='true')
STORED AS PARQUET;
```

View the generated files in the OSS directory.
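To confirm that the unloaded files are readable, you can point another schemaless query at the output directory (a sketch; it assumes the unload directory contains only Parquet files):

```sql
SELECT * FROM LOCATION 'oss://oss-cn-hangzhou-internal.aliyuncs.com/object-table-test/unload/ow_test_3/'
('file_format'='parquet');
```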