Query Parquet Files from Data Lake without Schema via UNLOAD - MaxCompute

This topic describes how to use Schemaless Query in MaxCompute to read Parquet files generated by Spark SQL and export the computation results to OSS by using the UNLOAD command. This example uses an E-MapReduce Serverless Spark cluster.

Prerequisites

An EMR Serverless Spark workspace is created.
This topic uses the China (Hangzhou) region as an example. The workspace name is schemaless_test and the OSS bucket name is oss-mc-test.
A MaxCompute project is created.

Step 1: Generate Parquet data based on Serverless Spark

Log on to the E-MapReduce console. In the left-side navigation pane, choose EMR Serverless > Spark.
On the Spark page, click the name of the created workspace. On the EMR Serverless Spark page, click Data Development in the left-side navigation pane.
Create a Spark SQL job, enter the following SQL commands to create a table in Parquet format and write data to it, then click Run.
Important
Before you execute the following commands, make sure that the data directory and database selected in the upper-right corner of the page are associated with the path of the created OSS bucket.
```
CREATE TABLE example_table01 (
    id INT,
    name STRING,
    age INT
) USING PARQUET;

INSERT INTO example_table01 VALUES
(1, 'Alice', 30),
(2, 'Bob', 25),
(3, 'Charlie', 35),
(4, 'David', 40),
(5, 'Eve', 32),
(6, 'Frank', 28),
(7, 'Grace', 33),
(8, 'Hannah', 29),
(9, 'Ian', 36),
(10, 'Julia', 31);

SELECT * FROM example_table01;
```
The following figure shows the sample output.
After the job is successfully executed, you can view the generated example_table01 table on the Metadata page in the Data Lake Formation console.
You can also view the generated Parquet files in the OSS bucket directory.

Step 2: Read data by using Schemaless Query

For more information, see Schemaless query.

When you write data to a Parquet table, a file named _SUCCESS is generated. You must use the file_pattern_blacklist parameter to add the _SUCCESS file to the blacklist so that this file is not read. If you do not add this parameter, an error will occur.

SELECT * FROM 
LOCATION 'oss://oss-cn-hangzhou-internal.aliyuncs.com/oss-mc-test/example_table01/' 
(
  'file_format'='parquet',
  'file_pattern_blacklist'='.*_SUCCESS.*'
);

The following result is returned:

+------------+------------+------------+
| id         | name       | age        |
+------------+------------+------------+
| 1          | Alice      | 30         |
| 2          | Bob        | 25         |
| 3          | Charlie    | 35         |
| 4          | David      | 40         |
| 5          | Eve        | 32         |
| 6          | Frank      | 28         |
| 7          | Grace      | 33         |
| 8          | Hannah     | 29         |
| 9          | Ian        | 36         |
| 10         | Julia      | 31         |
+------------+------------+------------+

Step 3: Perform calculations using SQL

Query the total number of people who are older than 30.

SELECT COUNT(*) FROM 
LOCATION 'oss://oss-cn-hangzhou-internal.aliyuncs.com/oss-mc-test/example_table01/' 
(
  'file_format'='parquet',
  'file_pattern_blacklist'='.*_SUCCESS.*'
)
WHERE age>30;

The following result is returned:

+------------+
| _c0        |
+------------+
| 6          |
+------------+

Step 4: Export calculation results to OSS using the UNLOAD command

MaxCompute allows you to export data from a MaxCompute project to external storage such as OSS for use by other computing engines. For more information, see UNLOAD.

Execute the following command in MaxCompute to export the calculation results from Step 3: Perform calculations using SQL to OSS in Parquet format.

Before you execute the following code, you must create the unload_schemaless directory in the OSS bucket oss-mc-test.

UNLOAD FROM  
(
  SELECT COUNT(*) FROM 
  LOCATION 'oss://oss-cn-hangzhou-internal.aliyuncs.com/oss-mc-test/example_table01/' 
  ('file_format'='parquet','file_pattern_blacklist'='.*_SUCCESS.*')
  WHERE age>30
)
INTO
LOCATION 'oss://oss-cn-hangzhou-internal.aliyuncs.com/oss-mc-test/unload_schemaless/'
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
WITH SERDEPROPERTIES ('odps.properties.rolearn'='acs:ram::<uid>:role/AliyunODPSDefaultRole') 
STORED AS PARQUET 
PROPERTIES('mcfed.parquet.compression'='SNAPPY')
;

Log on to the OSS console to check whether the UNLOAD operation is successful.