MaxCompute:Parquet external tables

Last Updated: Aug 26, 2025

This topic describes how to create OSS external tables in the Parquet format and how to read data from and write data to these tables.

Prerequisites

  • Ensure your Alibaba Cloud account, RAM user, or RAM role has the required permissions to access OSS external tables. For more information about authorization, see STS authorization for OSS.

  • (Optional) Prepare an OSS bucket, directories, and data files. For more information, see Create buckets, Manage directories, and Simple upload.

    MaxCompute can automatically create the required directories in OSS. When you read from or write to an external table with a single SQL statement or through a UDF, you no longer need to create the directories manually. The legacy manual method is still supported.
  • A MaxCompute project has been created. For more information, see Create a MaxCompute project.

    MaxCompute is deployed only in specific regions. To prevent cross-region data connectivity issues, we recommend that you use a bucket in the same region as your MaxCompute project.
  • The Alibaba Cloud account or RAM user has the CreateTable permission on your project. For more information about table operation permissions, see MaxCompute permissions.

Limits

  • The cluster attribute is not supported for OSS external tables.

  • A single file cannot exceed 3 GB. If a file exceeds this limit, we recommend that you split it.

Notes

If the schema of a Parquet file is inconsistent with the external table schema:

  • The number of columns is inconsistent: If the number of columns in the Parquet file is less than the number of columns in the DDL statement for the external table, the system fills the missing column values with NULL when reading the Parquet data. If the number of columns in the Parquet file is greater, the extra column data is discarded.

  • The column types are inconsistent: If a column type in the Parquet file does not match the corresponding column type in the DDL statement for the external table, an error is reported when you read the Parquet data. For example, if the external table declares a column as STRING but the Parquet file stores it as INT (or the reverse), the error ODPS-0123131:User defined function exception - Traceback:xxx is reported.

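For example, the following sketch (hypothetical table name, columns, and OSS path) declares four columns while the underlying Parquet files are assumed to contain only the first three. The missing column is returned as NULL on read; a surplus column in the files would be discarded instead.

    -- Hypothetical example: the Parquet files in the LOCATION contain only
    -- the columns id, name, and amount, but the DDL declares a fourth column.
    CREATE EXTERNAL TABLE IF NOT EXISTS demo_parquet_mismatch (
        id     BIGINT,
        name   STRING,
        amount DOUBLE,
        remark STRING   -- not present in the Parquet files; read as NULL
    )
    STORED AS parquet
    LOCATION 'oss://<endpoint>/<bucket>/demo_parquet_mismatch/';

    -- remark is NULL for every row read from the existing files.
    SELECT id, remark FROM demo_parquet_mismatch LIMIT 10;
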
Data type support

Note

In the following table, Supported indicates that the data type is supported in the corresponding mode, and Not supported indicates that it is not.

  • For more information about MaxCompute data types, see Data types (V1.0) and Data types (V2.0).

  • JNI mode: set odps.ext.parquet.native=false. When you read data from an external table, the system uses the original open-source Java-based implementation to parse the Parquet data files. This mode supports read and write operations.

  • Native mode: set odps.ext.parquet.native=true. When you read data from an external table, the system uses the new C++-based native implementation to parse the Parquet data files. This mode supports only read operations.

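For example, you can switch the parser for the current session by setting the flag before your query. The table name below is a placeholder.

    -- Parse Parquet files with the native (C++) implementation; read-only.
    SET odps.ext.parquet.native=true;
    SELECT * FROM <your_parquet_external_table> LIMIT 10;

    -- Switch back to the JNI (Java) implementation when you also need to write.
    SET odps.ext.parquet.native=false;
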
| Data type                 | JNI mode (read and write) | Native mode (read-only) |
|---------------------------|---------------------------|-------------------------|
| TINYINT                   | Supported                 | Supported               |
| SMALLINT                  | Supported                 | Supported               |
| INT                       | Supported                 | Supported               |
| BIGINT                    | Supported                 | Supported               |
| BINARY                    | Supported                 | Supported               |
| FLOAT                     | Supported                 | Supported               |
| DOUBLE                    | Supported                 | Supported               |
| DECIMAL(precision,scale)  | Supported                 | Supported               |
| VARCHAR(n)                | Supported                 | Supported               |
| CHAR(n)                   | Supported                 | Supported               |
| STRING                    | Supported                 | Supported               |
| DATE                      | Supported                 | Supported               |
| DATETIME                  | Supported                 | Supported               |
| TIMESTAMP                 | Supported                 | Supported               |
| TIMESTAMP_NTZ             | Supported                 | Supported               |
| BOOLEAN                   | Supported                 | Supported               |
| ARRAY                     | Supported                 | Supported               |
| MAP                       | Supported                 | Supported               |
| STRUCT                    | Supported                 | Supported               |
| JSON                      | Not supported             | Not supported           |

Supported compression formats

When you read from or write to compressed OSS files, add the with serdeproperties attribute to the CREATE TABLE statement. For more information, see with serdeproperties parameters.

MaxCompute supports reading from and writing to Parquet files compressed with ZSTD, SNAPPY, or GZIP.

Create an external table

Syntax

For more information about the syntax for creating external tables in different formats, see OSS external tables.

  • Simplified syntax

    CREATE EXTERNAL TABLE [IF NOT EXISTS] <mc_oss_extable_name>
    (
      <col_name> <data_type>,
      ...
    )
    [COMMENT <table_comment>]
    [PARTITIONED BY (<col_name> <data_type>, ...)]
    STORED AS parquet 
    LOCATION '<oss_location>' 
    [tblproperties ('<tbproperty_name>'='<tbproperty_value>',...)];
  • Detailed syntax

    CREATE EXTERNAL TABLE [IF NOT EXISTS] <mc_oss_extable_name>
    (
      <col_name> <data_type>,
      ...
    )
    [COMMENT <table_comment>]
    [PARTITIONED BY (<col_name> <data_type>, ...)]
    ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
    WITH serdeproperties(
        'odps.properties.rolearn'='acs:ram::<uid>:role/aliyunodpsdefaultrole',
        'mcfed.parquet.compression'='ZSTD/SNAPPY/GZIP'
    )
    STORED AS parquet 
    LOCATION '<oss_location>' 
    ;

Common parameters

For information about common parameters, see Basic syntax parameters.

Private parameters

Properties for WITH SERDEPROPERTIES

| property_name | Scenario | Description | property_value | Default value |
| --- | --- | --- | --- | --- |
| mcfed.parquet.compression | Add this property when you need to write Parquet data to OSS in a compressed format. | Parquet compression property. By default, Parquet data is not compressed. | ZSTD, SNAPPY, or GZIP | None |
| mcfed.parquet.compression.codec.zstd.level | You can add this property when 'mcfed.parquet.compression'='zstd'. If you do not specify this property, the default compression level of 3 is used. | A higher level results in a higher compression ratio. However, tests indicate that high levels offer minimal additional data reduction while significantly increasing time and resource consumption, which reduces cost-effectiveness. For scenarios that read and write large compressed Parquet files, a low zstd compression level (3 to 5) provides the best results. Example: 'mcfed.parquet.compression.codec.zstd.level'='5'. | 1 to 22 | 3 |
| parquet.file.cache.size | Add this property to improve the performance of reading OSS data files when processing Parquet data. | The amount of data that can be cached when reading an OSS data file. Unit: KB. | 1024 | None |
| parquet.io.buffer.size | Add this property to improve the performance of reading OSS data files when processing Parquet data. | The amount of data that can be cached when the OSS data file size exceeds 1024 KB. Unit: KB. | 4096 | None |

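For example, the following sketch (hypothetical table name and OSS path) combines the compression and read-performance properties above in WITH SERDEPROPERTIES:

    CREATE EXTERNAL TABLE IF NOT EXISTS mc_oss_parquet_tuned (
        id   BIGINT,
        name STRING
    )
    ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
    WITH serdeproperties(
        'odps.properties.rolearn'='acs:ram::<uid>:role/aliyunodpsdefaultrole',
        'mcfed.parquet.compression'='zstd',
        'mcfed.parquet.compression.codec.zstd.level'='5',
        'parquet.file.cache.size'='1024',
        'parquet.io.buffer.size'='4096'
    )
    STORED AS parquet
    LOCATION 'oss://<endpoint>/<bucket>/parquet_tuned/';
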
Properties for TBLPROPERTIES

| property_name | Scenario | Description | property_value | Default value |
| --- | --- | --- | --- | --- |
| io.compression.codecs | Add this property when the OSS data file is in Raw-Snappy format. | Used in scenarios that rely on the built-in open-source data parser for the SNAPPY format. Configure this property so that MaxCompute can read the compressed data; otherwise, MaxCompute cannot read it. | com.aliyun.odps.io.compress.SnappyRawCodec | None |

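For example, a sketch (hypothetical table name and OSS path) that sets the codec in tblproperties so that MaxCompute can read Raw-Snappy data:

    CREATE EXTERNAL TABLE IF NOT EXISTS mc_oss_parquet_raw_snappy (
        id   BIGINT,
        name STRING
    )
    STORED AS parquet
    LOCATION 'oss://<endpoint>/<bucket>/parquet_raw_snappy/'
    tblproperties ('io.compression.codecs'='com.aliyun.odps.io.compress.SnappyRawCodec');
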
Write data

For more information about the write syntax in MaxCompute, see Write syntax.

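For example, a minimal sketch (placeholder table and column names) that overwrites a single partition of a partitioned Parquet external table from a source table:

    -- Overwrite the dt='20230418' partition of the external table.
    INSERT OVERWRITE TABLE <your_parquet_external_table> PARTITION (dt='20230418')
    SELECT <col1>, <col2> FROM <your_source_table>;
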
Query analysis

Example

This example shows how to create an external table for Parquet files that are compressed using ZSTD, and then read data from and write data to the table.

Note
  • When you run the following sample code, replace <uid> with your Alibaba Cloud account ID.

  • The role used in the following examples is aliyunodpsdefaultrole. If you want to use a different role, replace aliyunodpsdefaultrole with the name of the target role and grant the target role permissions to access OSS.

  1. Prepare a Parquet data file compressed with ZSTD.

    In the oss-mc-test bucket described in Sample data, create the parquet_zstd_jni/dt=20230418 folder hierarchy, and store the data file in the dt=20230418 partition folder.

  2. Create a Parquet external table for ZSTD-compressed data.

    CREATE EXTERNAL TABLE IF NOT EXISTS mc_oss_parquet_data_type_zstd (
        vehicleId INT,
        recordId INT,
        patientId INT,
        calls INT,
        locationLatitute DOUBLE,
        locationLongtitue DOUBLE,
        recordTime STRING,
        direction STRING
    )
    PARTITIONED BY (dt STRING )
    ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
    WITH serdeproperties(
      'odps.properties.rolearn'='acs:ram::<uid>:role/aliyunodpsdefaultrole',
      'mcfed.parquet.compression'='zstd'
    )
    STORED AS parquet
    LOCATION 'oss://oss-cn-hangzhou-internal.aliyuncs.com/oss-mc-test/parquet_zstd_jni/';
  3. Discover partitions. If the external table is partitioned, you must run the following command to discover and register the partitions from the underlying OSS location. For more information, see Syntax for adding partition data to an OSS external table.

    -- Discover partition data.
    MSCK REPAIR TABLE mc_oss_parquet_data_type_zstd ADD PARTITIONS;
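    Alternatively, if you only need to register a known partition, you can add it explicitly. The following is a sketch that assumes the folder created in step 1:

    -- Register a single known partition.
    ALTER TABLE mc_oss_parquet_data_type_zstd ADD IF NOT EXISTS PARTITION (dt='20230418');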
  4. Read data from the Parquet external table.

    SELECT * FROM mc_oss_parquet_data_type_zstd WHERE dt='20230418' LIMIT 10;

    A partial result is returned:

    +------------+------------+------------+------------+------------------+-------------------+----------------+------------+------------+
    | vehicleid  | recordid   | patientid  | calls      | locationlatitute | locationlongtitue | recordtime     | direction  | dt         |
    +------------+------------+------------+------------+------------------+-------------------+----------------+------------+------------+
    | 1          | 12         | 76         | 1          | 46.81006         | -92.08174         | 9/14/2014 0:10 | SW         | 20230418   |
    | 1          | 1          | 51         | 1          | 46.81006         | -92.08174         | 9/14/2014 0:00 | S          | 20230418   |
    | 1          | 2          | 13         | 1          | 46.81006         | -92.08174         | 9/14/2014 0:01 | NE         | 20230418   |
    | 1          | 3          | 48         | 1          | 46.81006         | -92.08174         | 9/14/2014 0:02 | NE         | 20230418   |
    | 1          | 4          | 30         | 1          | 46.81006         | -92.08174         | 9/14/2014 0:03 | W          | 20230418   |
    | 1          | 5          | 47         | 1          | 46.81006         | -92.08174         | 9/14/2014 0:04 | S          | 20230418   |
    | 1          | 6          | 9          | 1          | 46.81006         | -92.08174         | 9/14/2014 0:05 | S          | 20230418   |
    | 1          | 7          | 53         | 1          | 46.81006         | -92.08174         | 9/14/2014 0:06 | N          | 20230418   |
    | 1          | 8          | 63         | 1          | 46.81006         | -92.08174         | 9/14/2014 0:07 | SW         | 20230418   |
    | 1          | 9          | 4          | 1          | 46.81006         | -92.08174         | 9/14/2014 0:08 | NE         | 20230418   |
    | 1          | 10         | 31         | 1          | 46.81006         | -92.08174         | 9/14/2014 0:09 | N          | 20230418   |
    +------------+------------+------------+------------+------------------+-------------------+----------------+------------+------------+
  5. Write data to the Parquet external table.

    INSERT INTO mc_oss_parquet_data_type_zstd PARTITION ( dt = '20230418') 
      VALUES  (1,16,76,1,46.81006,-92.08174,'9/14/2014 0:10','SW');
    
    -- Query the newly written data.
    SELECT * FROM mc_oss_parquet_data_type_zstd WHERE dt = '20230418' AND recordId=16;

    The following result is returned:

    +------------+------------+------------+------------+------------------+-------------------+----------------+------------+------------+
    | vehicleid  | recordid   | patientid  | calls      | locationlatitute | locationlongtitue | recordtime     | direction  | dt         |
    +------------+------------+------------+------------+------------------+-------------------+----------------+------------+------------+
    | 1          | 16         | 76         | 1          | 46.81006         | -92.08174         | 9/14/2014 0:10 | SW         | 20230418   |
    +------------+------------+------------+------------+------------------+-------------------+----------------+------------+------------+

FAQ

The column type in the Parquet file is inconsistent with the type in the DDL statement for the external table

  • Error message

    ODPS-0123131:User defined function exception - Traceback:
    java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be cast to org.apache.hadoop.io.IntWritable 
       at org.apache.hadoop.hive.serde2.objectinspector.primitive.WritableIntObjectInspector.getPrimitiveJavaObject(WritableIntObjectInspector.java:46)
  • Error description

    The LongWritable field type in the Parquet file is inconsistent with the INT type in the DDL statement for the external table.

  • Solution

    Change the INT type in the DDL statement for the external table to BIGINT.
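
    The following is a minimal sketch of the fix, with a hypothetical table name and column list. Dropping an external table does not delete the files in OSS, so you can drop the mismatched definition and recreate it with BIGINT:

    DROP TABLE IF EXISTS t_parquet_ints;

    CREATE EXTERNAL TABLE t_parquet_ints (
        id   BIGINT,  -- previously declared as INT, which caused the cast error
        name STRING
    )
    STORED AS parquet
    LOCATION 'oss://<endpoint>/<bucket>/t_parquet_ints/';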