This topic describes how to create, read, and write to OSS external tables in Parquet format.
Prerequisites
Ensure your Alibaba Cloud account, RAM user, or RAM role has the required permissions to access OSS external tables. For more information about authorization, see STS authorization for OSS.
(Optional) Prepare an OSS bucket, directories, and data files. For more information, see Create buckets, Manage directories, and Simple upload.
MaxCompute can automatically create directories in OSS. You can use a single SQL statement to read from or write to an external table that uses a UDF. Manual directory creation is no longer required, but the legacy method is still supported.
A MaxCompute project has been created. For more information, see Create a MaxCompute project.
MaxCompute is deployed only in specific regions. To prevent cross-region data connectivity issues, we recommend that you use a bucket in the same region as your MaxCompute project.
The Alibaba Cloud account or RAM user has the CreateTable permission on your project. For more information about table operation permissions, see MaxCompute permissions.
Limits
The cluster attribute is not supported for OSS external tables.
A single file cannot exceed 3 GB. If a file exceeds this limit, we recommend that you split it.
Notes
If the schema of a Parquet file is inconsistent with the external table schema:
The number of columns is inconsistent: If the number of columns in the Parquet file is less than the number of columns in the DDL statement for the external table, the system fills the missing column values with NULL when reading the Parquet data. If the number of columns in the Parquet file is greater, the extra column data is discarded.
The column types are inconsistent: If a column type in the Parquet file is inconsistent with the corresponding column type in the DDL statement for the external table, an error is reported when you read the Parquet data. For example, if you use the STRING (or INT) type to read data of the INT (or STRING) type from a Parquet file, the error ODPS-0123131:User defined function exception - Traceback:xxx is reported.
Data type support
The following table lists data type support for each mode.
For more information about MaxCompute data types, see Data types (V1.0) and Data types (V2.0).
JNI mode: set odps.ext.parquet.native=false. When you read data from an external table, the system uses the original open-source Java-based implementation to parse the Parquet data files. This mode supports read and write operations.
Native mode: set odps.ext.parquet.native=true. When you read data from an external table, the system uses the new C++-based native implementation to parse the Parquet data files. This mode supports only read operations.
A session-level example of switching between the two modes follows the table.
Data type | JNI mode supported (read/write) | Native mode supported (read-only) |
TINYINT | | |
SMALLINT | | |
INT | | |
BIGINT | | |
BINARY | | |
FLOAT | | |
DOUBLE | | |
DECIMAL(precision,scale) | | |
VARCHAR(n) | | |
CHAR(n) | | |
STRING | | |
DATE | | |
DATETIME | | |
TIMESTAMP | | |
TIMESTAMP_NTZ | | |
BOOLEAN | | |
ARRAY | | |
MAP | | |
STRUCT | | |
JSON | | |
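For reference, the following is a minimal session-level sketch of switching between the two modes. The table name mc_oss_parquet_demo is a placeholder for an existing Parquet external table.
-- Parse Parquet files with the C++ native reader (read-only mode).
set odps.ext.parquet.native=true;
SELECT * FROM mc_oss_parquet_demo LIMIT 10;
-- Switch back to the Java-based JNI reader, which also supports writes.
set odps.ext.parquet.native=false;
SELECT * FROM mc_oss_parquet_demo LIMIT 10;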
Supported compression formats
When you read from or write to compressed OSS files, add the WITH SERDEPROPERTIES clause to the CREATE TABLE statement. For more information, see Properties for WITH SERDEPROPERTIES.
MaxCompute supports reading from and writing to Parquet files compressed with ZSTD, SNAPPY, or GZIP.
Create an external table
Syntax
For more information about the syntax for creating external tables in different formats, see OSS external tables.
Simplified syntax
CREATE EXTERNAL TABLE [IF NOT EXISTS] <mc_oss_extable_name>
(
  <col_name> <data_type>,
  ...
)
[COMMENT <table_comment>]
[PARTITIONED BY (<col_name> <data_type>, ...)]
STORED AS parquet
LOCATION '<oss_location>'
[tblproperties ('<tbproperty_name>'='<tbproperty_value>',...)];
Detailed syntax
CREATE EXTERNAL TABLE [IF NOT EXISTS] <mc_oss_extable_name>
(
  <col_name> <data_type>,
  ...
)
[COMMENT <table_comment>]
[PARTITIONED BY (<col_name> <data_type>, ...)]
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH serdeproperties(
  'odps.properties.rolearn'='acs:ram::<uid>:role/aliyunodpsdefaultrole',
  'mcfed.parquet.compression'='ZSTD/SNAPPY/GZIP'
)
STORED AS parquet
LOCATION '<oss_location>';
Common parameters
For information about common parameters, see Basic syntax parameters.
Private parameters
Properties for WITH SERDEPROPERTIES
property_name | Scenario | Description | property_value | Default value |
mcfed.parquet.compression | Add this property when you need to write Parquet data to OSS in a compressed format. | Parquet compression property. By default, Parquet data is not compressed. | ZSTD, SNAPPY, or GZIP | None |
mcfed.parquet.compression.codec.zstd.level | You can add this property when mcfed.parquet.compression is set to ZSTD. | A higher level value results in a higher compression ratio. However, our tests indicate that high levels offer minimal data reduction while significantly increasing time and resource consumption. This trade-off reduces cost-effectiveness. Therefore, for scenarios that involve reading and writing large compressed Parquet files, a low zstd compression level (level 3 to level 5) provides the best results. For an example, see the sketch after this table. | Valid values: 1 to 22. | 3 |
parquet.file.cache.size | Add this property to improve the performance of reading OSS data files when processing Parquet data. | Specifies the amount of data that can be cached when reading an OSS data file. Unit: KB. | 1024 | None |
parquet.io.buffer.size | Add this property to improve the performance of reading OSS data files when processing Parquet data. | Specifies the amount of data that can be cached when the OSS data file size exceeds 1024 KB. Unit: KB. | 4096 | None |
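For reference, the following sketch shows how these properties can be combined in a table definition. The table name mc_oss_parquet_zstd_demo, the columns, and the OSS path are placeholders; the property values follow the ranges and defaults listed above.
CREATE EXTERNAL TABLE IF NOT EXISTS mc_oss_parquet_zstd_demo
(
  id BIGINT,
  content STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH serdeproperties(
  'odps.properties.rolearn'='acs:ram::<uid>:role/aliyunodpsdefaultrole',
  -- Write ZSTD-compressed Parquet files.
  'mcfed.parquet.compression'='zstd',
  -- A low zstd level (3 to 5) usually balances compression ratio and resource cost.
  'mcfed.parquet.compression.codec.zstd.level'='5',
  -- Optional read-side tuning. Both values are in KB.
  'parquet.file.cache.size'='1024',
  'parquet.io.buffer.size'='4096'
)
STORED AS parquet
LOCATION 'oss://<oss_endpoint>/<bucket>/<directory>/';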
Properties for TBLPROPERTIES
property_name | Scenario | Description | property_value | Default value |
io.compression.codecs | Add this property when the OSS data file is in Raw-Snappy format. | This property is for scenarios that use the built-in open source data parser for the SNAPPY format. Configure this parameter to allow MaxCompute to read the compressed data. Otherwise, MaxCompute cannot read the data. | com.aliyun.odps.io.compress.SnappyRawCodec | None |
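For reference, a minimal sketch for the Raw-Snappy scenario follows. The table name mc_oss_raw_snappy_demo, the columns, and the OSS path are placeholders.
CREATE EXTERNAL TABLE IF NOT EXISTS mc_oss_raw_snappy_demo
(
  id BIGINT,
  content STRING
)
STORED AS parquet
LOCATION 'oss://<oss_endpoint>/<bucket>/<raw_snappy_directory>/'
-- Allow MaxCompute to read Raw-Snappy compressed data with the built-in open source parser.
tblproperties ('io.compression.codecs'='com.aliyun.odps.io.compress.SnappyRawCodec');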
Write data
For more information about the write syntax in MaxCompute, see Write syntax.
Query analysis
For more information about the SELECT syntax, see Query syntax.
For more information about optimizing query plans, see Query optimization.
To read data directly from the files specified by LOCATION, see Schemaless query.
Example
This example shows how to create an external table for Parquet files that are compressed using ZSTD, and then read data from and write data to the table.
When you run the following sample code, replace <uid> with your Alibaba Cloud account ID.
The role used in the following examples is aliyunodpsdefaultrole. If you want to use a different role, replace aliyunodpsdefaultrole with the name of the target role and grant the target role permissions to access OSS.
Prepare a data file in ZSTD format.
In the oss-mc-test bucket described in Sample data, create the parquet_zstd_jni/dt=20230418 folder hierarchy, and store the data file in the dt=20230418 partition folder.
Create a Parquet external table for ZSTD-compressed data.
CREATE EXTERNAL TABLE IF NOT EXISTS mc_oss_parquet_data_type_zstd
(
  vehicleId INT,
  recordId INT,
  patientId INT,
  calls INT,
  locationLatitute DOUBLE,
  locationLongtitue DOUBLE,
  recordTime STRING,
  direction STRING
)
PARTITIONED BY (dt STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH serdeproperties(
  'odps.properties.rolearn'='acs:ram::<uid>:role/aliyunodpsdefaultrole',
  'mcfed.parquet.compression'='zstd'
)
STORED AS parquet
LOCATION 'oss://oss-cn-hangzhou-internal.aliyuncs.com/oss-mc-test/parquet_zstd_jni/';
Discover partitions. If the external table is partitioned, you must run the following command to discover and register the partitions from the underlying OSS location. For more information, see Syntax for adding partition data to an OSS external table.
-- Discover partition data.
MSCK REPAIR TABLE mc_oss_parquet_data_type_zstd ADD PARTITIONS;
Read data from the Parquet external table.
SELECT * FROM mc_oss_parquet_data_type_zstd WHERE dt='20230418' LIMIT 10;
A partial result is returned:
+------------+------------+------------+------------+------------------+-------------------+----------------+------------+------------+
| vehicleid  | recordid   | patientid  | calls      | locationlatitute | locationlongtitue | recordtime     | direction  | dt         |
+------------+------------+------------+------------+------------------+-------------------+----------------+------------+------------+
| 1          | 12         | 76         | 1          | 46.81006         | -92.08174         | 9/14/2014 0:10 | SW         | 20230418   |
| 1          | 1          | 51         | 1          | 46.81006         | -92.08174         | 9/14/2014 0:00 | S          | 20230418   |
| 1          | 2          | 13         | 1          | 46.81006         | -92.08174         | 9/14/2014 0:01 | NE         | 20230418   |
| 1          | 3          | 48         | 1          | 46.81006         | -92.08174         | 9/14/2014 0:02 | NE         | 20230418   |
| 1          | 4          | 30         | 1          | 46.81006         | -92.08174         | 9/14/2014 0:03 | W          | 20230418   |
| 1          | 5          | 47         | 1          | 46.81006         | -92.08174         | 9/14/2014 0:04 | S          | 20230418   |
| 1          | 6          | 9          | 1          | 46.81006         | -92.08174         | 9/14/2014 0:05 | S          | 20230418   |
| 1          | 7          | 53         | 1          | 46.81006         | -92.08174         | 9/14/2014 0:06 | N          | 20230418   |
| 1          | 8          | 63         | 1          | 46.81006         | -92.08174         | 9/14/2014 0:07 | SW         | 20230418   |
| 1          | 9          | 4          | 1          | 46.81006         | -92.08174         | 9/14/2014 0:08 | NE         | 20230418   |
| 1          | 10         | 31         | 1          | 46.81006         | -92.08174         | 9/14/2014 0:09 | N          | 20230418   |
+------------+------------+------------+------------+------------------+-------------------+----------------+------------+------------+
Write data to the Parquet external table.
INSERT INTO mc_oss_parquet_data_type_zstd PARTITION (dt = '20230418') VALUES (1,16,76,1,46.81006,-92.08174,'9/14/2014 0:10','SW');
-- Query the newly written data.
SELECT * FROM mc_oss_parquet_data_type_zstd WHERE dt = '20230418' AND recordId = 16;
The following result is returned:
+------------+------------+------------+------------+------------------+-------------------+----------------+------------+------------+
| vehicleid  | recordid   | patientid  | calls      | locationlatitute | locationlongtitue | recordtime     | direction  | dt         |
+------------+------------+------------+------------+------------------+-------------------+----------------+------------+------------+
| 1          | 16         | 76         | 1          | 46.81006         | -92.08174         | 9/14/2014 0:10 | SW         | 20230418   |
+------------+------------+------------+------------+------------------+-------------------+----------------+------------+------------+
FAQ
The column type in the Parquet file is inconsistent with the type in the DDL statement for the external table
Error message
ODPS-0123131:User defined function exception - Traceback:
java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be cast to org.apache.hadoop.io.IntWritable
at org.apache.hadoop.hive.serde2.objectinspector.primitive.WritableIntObjectInspector.getPrimitiveJavaObject(WritableIntObjectInspector.java:46)
Error description
The column in the Parquet file is stored as a 64-bit integer (LongWritable), which is inconsistent with the INT type in the DDL statement for the external table.
Solution
Change the INT type in the DDL statement for the external table to BIGINT.
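For example, if a column that is declared as INT actually stores 64-bit integers in the Parquet file, you can drop and recreate the external table with BIGINT; dropping an external table does not delete the underlying OSS files. The table name, columns, and OSS path below are placeholders.
DROP TABLE IF EXISTS mc_oss_parquet_demo;
CREATE EXTERNAL TABLE mc_oss_parquet_demo
(
  -- Previously declared as INT; BIGINT matches the 64-bit integer values in the Parquet file.
  id BIGINT,
  content STRING
)
STORED AS parquet
LOCATION 'oss://<oss_endpoint>/<bucket>/<directory>/';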