
MaxCompute:Parquet external tables

Last Updated: Mar 26, 2026

Parquet external tables let you query and write Parquet files stored in OSS directly from MaxCompute, without loading data into internal tables.

Constraints

  • OSS external tables do not support the cluster property.

  • A single file cannot exceed 2 GB. Split files that exceed this limit.

  • MaxCompute and OSS must be in the same region.

Parsing modes

MaxCompute supports two modes for parsing Parquet files. Set the mode before running queries or write operations.

| Mode | Flag | Implementation | Supports |
| --- | --- | --- | --- |
| JNI mode | set odps.ext.parquet.native=false | Java-based open-source implementation | Read and write |
| Native mode | set odps.ext.parquet.native=true | C++-based native implementation | Read only |
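For example, switching modes at the session level might look like the following sketch (the table names are hypothetical):

```sql
-- Native mode: C++ reader, read only. Suitable for scan-heavy queries.
SET odps.ext.parquet.native=true;
SELECT COUNT(*) FROM mc_oss_parquet_demo;

-- JNI mode: Java implementation, required for write operations.
SET odps.ext.parquet.native=false;
INSERT INTO mc_oss_parquet_demo SELECT * FROM mc_staging;
```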

Supported data types

For a full description of MaxCompute data types, see Data Type Version 1.0 and Data Type Version 2.0.

| Data type | JNI mode (read/write) | Native mode (read only) |
| --- | --- | --- |
| TINYINT | Supported | Supported |
| SMALLINT | Supported | Supported |
| INT | Supported | Supported |
| BIGINT | Supported | Supported |
| BINARY | Supported | Supported |
| FLOAT | Supported | Supported |
| DOUBLE | Supported | Supported |
| DECIMAL(precision,scale) | Not supported | Supported |
| VARCHAR(n) | Supported | Supported |
| CHAR(n) | Supported | Supported |
| STRING | Supported | Supported |
| DATE | Supported | Supported |
| DATETIME | Supported | Supported |
| TIMESTAMP | Supported | Supported |
| TIMESTAMP_NTZ | Not supported | Not supported |
| BOOLEAN | Supported | Supported |
| ARRAY | Supported | Supported |
| MAP | Supported | Supported |
| STRUCT | Supported | Supported |
| JSON | Not supported | Not supported |

Supported compression formats

Parquet external tables support ZSTD, SNAPPY, and GZIP compression. To use compression, add the with serdeproperties clause to the CREATE EXTERNAL TABLE statement. See with serdeproperties parameters for details.

Schema evolution

Parquet external tables map column values by name, not position. This means you can safely add, delete, or reorder columns without breaking reads of existing data.

The Data compatibility column below describes whether the table can read existing data written before the schema change.

| Operation | Supported | Description | Data compatibility |
| --- | --- | --- | --- |
| Add column | Yes | New columns are added at the end of the table. Position cannot be specified. Default values apply only to data written by MaxCompute after the change. | Compatible for new data. Historical rows return NULL for the new column. |
| Delete column | Yes | Parquet maps columns by name, so removing a column from the table schema does not affect reads of existing files. | Compatible |
| Reorder columns | Yes | Parquet maps columns by name, so column order in the DDL does not need to match the file. | Compatible |
| Change column data type | No | Parquet enforces strict type validation at the file level. Changing a column's type in the DDL does not rewrite the underlying files, causing type mismatches on read. | Not applicable |
| Rename column | No | Parquet maps columns by name. Renaming a column in the DDL breaks the mapping for all existing files that still use the original column name. | Not applicable |
| Update column comment | Yes | Comments must be valid strings no longer than 1,024 bytes. | Compatible |
| Change column nullability | No | Columns are nullable by default. Changing nullability is not supported. | Not applicable |
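As a sketch of the supported operations, assuming a hypothetical external table mc_oss_parquet_demo with an id column:

```sql
-- Add a column; it is appended at the end of the schema.
-- Existing files return NULL for the new column.
ALTER TABLE mc_oss_parquet_demo ADD COLUMNS (created_at DATETIME);

-- Update a column comment (a string of at most 1,024 bytes).
ALTER TABLE mc_oss_parquet_demo CHANGE COLUMN id COMMENT 'primary record id';
```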

Create an external table

Tip: If you don't know the column structure of your Parquet files, use Schemaless Query to explore the file schema before writing the DDL.

Schema mismatch behavior

When the Parquet file schema differs from the external table DDL:

  • Fewer columns in the file than in the DDL: Missing columns return NULL.

  • More columns in the file than in the DDL: Extra columns are ignored.

  • Column type mismatch: Reading fails with ODPS-0123131:User defined function exception - Traceback:xxx. Align the column types in the DDL with the actual Parquet file schema to resolve this.

Syntax

Use the simplified syntax when you do not need compression or performance tuning. Use the detailed syntax to configure compression and other Parquet-specific options.

Simplified syntax structure

CREATE EXTERNAL TABLE [IF NOT EXISTS] <mc_oss_extable_name>
(
  <col_name> <data_type>,
  ...
)
[COMMENT <table_comment>]
[PARTITIONED BY (<col_name> <data_type>, ...)]
STORED AS parquet
LOCATION '<oss_location>'
[tblproperties ('<tblproperty_name>'='<tblproperty_value>', ...)];

Detailed syntax structure

CREATE EXTERNAL TABLE [IF NOT EXISTS] <mc_oss_extable_name>
(
  <col_name> <data_type>,
  ...
)
[COMMENT <table_comment>]
[PARTITIONED BY (<col_name> <data_type>, ...)]
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH serdeproperties(
    'odps.properties.rolearn'='acs:ram::<uid>:role/aliyunodpsdefaultrole',
    'mcfed.parquet.compression'='ZSTD/SNAPPY/GZIP'
)
STORED AS parquet
LOCATION '<oss_location>';

For common parameters shared across OSS external table types, see Basic syntax parameters.

Unique parameters

with serdeproperties parameters

Use WITH serdeproperties to configure compression and read performance at table-creation time. These settings apply to both read and write operations.

| Parameter | When to add | Description | Value | Default |
| --- | --- | --- | --- | --- |
| mcfed.parquet.compression | When writing Parquet data to OSS in compressed format | Sets the compression codec. Parquet data is uncompressed by default. | ZSTD, SNAPPY, or GZIP | None |
| mcfed.parquet.compression.codec.zstd.level | When mcfed.parquet.compression is set to ZSTD | ZSTD compression level. Higher values increase the compression ratio but reduce write throughput. For large-scale Parquet I/O, use levels 3–5. | 1–22 | 3 |
| parquet.file.cache.size | To improve OSS read performance | Cache size for OSS data files, in KB. | 1024 | None |
| parquet.io.buffer.size | To improve OSS read performance for files larger than 1,024 KB | Buffer size for OSS data files, in KB. | 4096 | None |
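For example, a table that combines ZSTD compression with the read-performance settings might be created as follows (a sketch; the table name, bucket path, and column list are hypothetical, and the endpoint and UID are placeholders):

```sql
CREATE EXTERNAL TABLE IF NOT EXISTS mc_oss_parquet_tuned (
    id BIGINT,
    payload STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH serdeproperties(
    'odps.properties.rolearn'='acs:ram::<uid>:role/aliyunodpsdefaultrole',
    -- Write compressed data at a moderate ZSTD level.
    'mcfed.parquet.compression'='ZSTD',
    'mcfed.parquet.compression.codec.zstd.level'='5',
    -- Read-performance settings, in KB.
    'parquet.file.cache.size'='1024',
    'parquet.io.buffer.size'='4096'
)
STORED AS parquet
LOCATION 'oss://<endpoint>/<bucket>/parquet_tuned/';
```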

tblproperties parameters

Use tblproperties to control output file naming, compression, and write behavior. These settings persist as table properties and apply at write time.

mcfed.parquet.compression appears in both serdeproperties and tblproperties. Use serdeproperties to set compression at table-creation time for both read and write. Use tblproperties when you want to control write-time behavior as a persistent table property.
| Parameter | When to add | Description | Value | Default |
| --- | --- | --- | --- | --- |
| io.compression.codecs | When OSS data files use Raw-Snappy format | Enables the built-in resolver for Raw-Snappy compressed data. When set, MaxCompute can read the compressed data. | com.aliyun.odps.io.compress.SnappyRawCodec | None |
| odps.external.data.output.prefix (also: odps.external.data.prefix) | To add a custom prefix to output filenames | Letters, digits, and underscores only (a–z, A–Z, 0–9, _). Length: 1–10 characters. | For example, mc_ | None |
| odps.external.data.enable.extension | To include file extensions in output filenames | True shows extensions; False hides them. | True or False | False |
| odps.external.data.output.suffix | To add a custom suffix to output filenames | Letters, digits, and underscores only. | For example, _hangzhou | None |
| odps.external.data.output.explicit.extension | To set a custom file extension | Letters, digits, and underscores only. Length: 1–10 characters. Takes precedence over odps.external.data.enable.extension. | For example, jsonl | None |
| mcfed.parquet.compression | When writing Parquet data to OSS in compressed format | Sets the compression codec. Parquet data is uncompressed by default. | SNAPPY, GZIP, or ZSTD | None |
| mcfed.parquet.block.size | To control storage efficiency and read performance | Parquet block size in bytes. | Non-negative integer | 134217728 (128 MB) |
| mcfed.parquet.block.row.count.limit | To prevent out-of-memory (OOM) errors when writing large datasets | Maximum records per row group. Reduce this value if OOM occurs. If JVM memory is 1 GB and the average record size is 1 MB, set this to approximately 100 (the default row group size is 128 MB). Do not set this too low. | Non-negative integer | 2147483647 (Integer.MAX_VALUE) |
| mcfed.parquet.page.size.row.check.min | To control memory check frequency during writes | Minimum records between memory checks. Reduce this value if OOM occurs. | Non-negative integer | 100 |
| mcfed.parquet.page.size.row.check.max | To control memory check frequency during writes | Maximum records between memory checks. By default, MaxCompute checks memory every 10,000 records. For small records, reduce to 1,000 for more frequent checks. Because frequent memory checks add overhead, first reduce mcfed.parquet.block.row.count.limit; reduce this parameter only if OOM persists or output files are too large. | Non-negative integer | 10000 |
| mcfed.parquet.compression.codec.zstd.level | When writing Parquet data with ZSTD compression | ZSTD compression level. Valid values: 1–22. | Non-negative integer | 3 |
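As a sketch, the naming-related properties combine as follows, so that output files carry the configured prefix, suffix, and extension (the table name and OSS path are hypothetical; the endpoint is a placeholder):

```sql
CREATE EXTERNAL TABLE IF NOT EXISTS mc_oss_parquet_named (
    id BIGINT,
    payload STRING
)
STORED AS parquet
LOCATION 'oss://<endpoint>/<bucket>/parquet_named/'
tblproperties (
    -- Prefix and suffix: letters, digits, and underscores only.
    'odps.external.data.output.prefix'='mc_',
    'odps.external.data.output.suffix'='_hangzhou',
    -- Explicit extension; takes precedence over enable.extension.
    'odps.external.data.output.explicit.extension'='parquet'
);
```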

Write data

For write syntax, see Write syntax.

Query and analysis

Predicate push down

By default, queries on Parquet external tables with WHERE filters scan all data, regardless of filter conditions. Parquet Predicate Push Down (PPD) uses row group metadata embedded in Parquet files to skip row groups that cannot match the filter, reducing scanned data, query latency, and resource usage.

PPD requires Native mode (odps.ext.parquet.native=true). Enable it by running these session-level SET commands before your query:

-- Enable Native mode (required for PPD).
SET odps.ext.parquet.native = true;
-- Enable Parquet PPD.
SET odps.sql.parquet.use.predicate.pushdown = true;

Performance comparison

The following results are from a 1 TB TPCDS benchmark using the tpcds_1t_store_sales table (2,879,987,999 total rows).

| Mode | Rows scanned | Bytes scanned | Mapper time | CPU | Memory |
| --- | --- | --- | --- | --- | --- |
| External table, PPD disabled | 2,879,987,999 (100%) | 19,386,793,984 (100%) | 18 s | 19.25 Core×Min | 24.07 GB×Min |
| External table, PPD enabled | 762,366,649 (26.47%) | 3,339,386,880 (17.22%) | 12 s | 11.47 Core×Min | 14.33 GB×Min (~59.58%) |
| Internal table, PPD enabled | 32,830,000 (1.14%) | 1,633,880,386 (8.43%) | 9 s | 5.62 Core×Min | 7.02 GB×Min (~29.19%) |

Enabling PPD on the external table reduced bytes scanned from 100% to 17.22% and cut resource usage by approximately 40%. Internal tables show even greater gains because data is sorted, making row group pruning more effective.

Even with PPD enabled, external tables scan significantly more data than internal tables (17.22% vs. 8.43% of bytes). If query performance is critical and data changes infrequently, consider loading data into an internal table.

Test details

  1. External table with PPD disabled

    SET odps.ext.parquet.native = true;
    SET odps.sql.parquet.use.predicate.pushdown = false;
    
    SELECT SUM(ss_sold_date_sk) FROM tpcds_1t_store_sales
      WHERE ss_store_sk = 2 AND ss_sold_date_sk >= 2451871 AND ss_sold_date_sk <= 2451880;


  2. External table with PPD enabled

    SET odps.ext.parquet.native = true;
    SET odps.sql.parquet.use.predicate.pushdown = true;
    
    SELECT SUM(ss_sold_date_sk) FROM tpcds_1t_store_sales
      WHERE ss_store_sk = 2 AND ss_sold_date_sk >= 2451871 AND ss_sold_date_sk <= 2451880;

    Many mappers are empty and skip reading data; the execution log shows the actual row group pruning.

  3. Internal table with PPD enabled. Internal table data is sorted, so pruning is more effective.

    SELECT SUM(ss_sold_date_sk) FROM bigdata_public_dataset.tpcds_1t.store_sales
      WHERE ss_store_sk = 2 AND ss_sold_date_sk >= 2451871 AND ss_sold_date_sk <= 2451880;


Example: create and use a ZSTD-compressed Parquet external table

This example creates a partitioned Parquet external table with ZSTD compression, then reads and writes data.

Prerequisites

Before you begin, ensure that the sample data is available in your OSS bucket and that MaxCompute is authorized to access OSS, for example through the aliyunodpsdefaultrole RAM role used in the DDL below.

Steps

  1. Prepare ZSTD-formatted data files. In the oss-mc-test bucket from the sample data, create the directory parquet_zstd_jni/dt=20230418 and place the partitioned data in that directory.

  2. Create a ZSTD-compressed Parquet external table.

    CREATE EXTERNAL TABLE IF NOT EXISTS mc_oss_parquet_data_type_zstd (
        vehicleId INT,
        recordId INT,
        patientId INT,
        calls INT,
        locationLatitute DOUBLE,
        locationLongtitue DOUBLE,
        recordTime STRING,
        direction STRING
    )
    PARTITIONED BY (dt STRING)
    ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
    WITH serdeproperties(
      'odps.properties.rolearn'='acs:ram::<uid>:role/aliyunodpsdefaultrole',
      'mcfed.parquet.compression'='zstd'
    )
    STORED AS parquet
    LOCATION 'oss://oss-cn-hangzhou-internal.aliyuncs.com/oss-mc-test/parquet_zstd_jni/';
  3. Add partitions to the external table.

    MSCK REPAIR TABLE mc_oss_parquet_data_type_zstd ADD PARTITIONS;

    For more options, see Add partitions to OSS external tables.

  4. Read data from the table.

    SELECT * FROM mc_oss_parquet_data_type_zstd WHERE dt='20230418' LIMIT 10;

    Expected output:

    +------------+------------+------------+------------+------------------+-------------------+----------------+------------+------------+
    | vehicleid  | recordid   | patientid  | calls      | locationlatitute | locationlongtitue | recordtime     | direction  | dt         |
    +------------+------------+------------+------------+------------------+-------------------+----------------+------------+------------+
    | 1          | 12         | 76         | 1          | 46.81006         | -92.08174         | 9/14/2014 0:10 | SW         | 20230418   |
    | 1          | 1          | 51         | 1          | 46.81006         | -92.08174         | 9/14/2014 0:00 | S          | 20230418   |
    | 1          | 2          | 13         | 1          | 46.81006         | -92.08174         | 9/14/2014 0:01 | NE         | 20230418   |
    | 1          | 3          | 48         | 1          | 46.81006         | -92.08174         | 9/14/2014 0:02 | NE         | 20230418   |
    | 1          | 4          | 30         | 1          | 46.81006         | -92.08174         | 9/14/2014 0:03 | W          | 20230418   |
    | 1          | 5          | 47         | 1          | 46.81006         | -92.08174         | 9/14/2014 0:04 | S          | 20230418   |
    | 1          | 6          | 9          | 1          | 46.81006         | -92.08174         | 9/14/2014 0:05 | S          | 20230418   |
    | 1          | 7          | 53         | 1          | 46.81006         | -92.08174         | 9/14/2014 0:06 | N          | 20230418   |
    | 1          | 8          | 63         | 1          | 46.81006         | -92.08174         | 9/14/2014 0:07 | SW         | 20230418   |
    | 1          | 9          | 4          | 1          | 46.81006         | -92.08174         | 9/14/2014 0:08 | NE         | 20230418   |
    +------------+------------+------------+------------+------------------+-------------------+----------------+------------+------------+
  5. Write data to the table.

    INSERT INTO mc_oss_parquet_data_type_zstd PARTITION (dt = '20230418')
      VALUES (1,16,76,1,46.81006,-92.08174,'9/14/2014 0:10','SW');
    
    -- Verify the inserted row.
    SELECT * FROM mc_oss_parquet_data_type_zstd WHERE dt = '20230418' AND recordid=16;

    Expected output:

    +------------+------------+------------+------------+------------------+-------------------+----------------+------------+------------+
    | vehicleid  | recordid   | patientid  | calls      | locationlatitute | locationlongtitue | recordtime     | direction  | dt         |
    +------------+------------+------------+------------+------------------+-------------------+----------------+------------+------------+
    | 1          | 16         | 76         | 1          | 46.81006         | -92.08174         | 9/14/2014 0:10 | SW         | 20230418   |
    +------------+------------+------------+------------+------------------+-------------------+----------------+------------+------------+

Troubleshooting

Column type mismatch

Error:

ODPS-0123131:User defined function exception - Traceback:
java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be cast to org.apache.hadoop.io.IntWritable
   at org.apache.hadoop.hive.serde2.objectinspector.primitive.WritableIntObjectInspector.getPrimitiveJavaObject(WritableIntObjectInspector.java:46)

The Parquet file stores the column as LongWritable (BIGINT), but the external table DDL defines it as INT. Change the column type in the DDL from INT to BIGINT.
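Because changing a column's data type is not supported for Parquet external tables (see Schema evolution), drop and recreate the table with the corrected type; dropping an external table removes only the MaxCompute metadata, not the files in OSS. A sketch with a hypothetical table and placeholder path:

```sql
DROP TABLE IF EXISTS mc_oss_parquet_mismatch;

CREATE EXTERNAL TABLE mc_oss_parquet_mismatch (
    record_id BIGINT,  -- was INT; the Parquet files store INT64
    payload   STRING
)
STORED AS parquet
LOCATION 'oss://<endpoint>/<bucket>/parquet_data/';
```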

Out-of-memory error when writing

Error:

ODPS-0123131:User defined function exception - Traceback:
java.lang.OutOfMemoryError: Java heap space
    at java.io.ByteArrayOutputStream.<init>(ByteArrayOutputStream.java:77)
    at org.apache.parquet.bytes.BytesInput$BAOS.<init>(BytesInput.java:175)
    at org.apache.parquet.bytes.BytesInput$BAOS.<init>(BytesInput.java:173)
    at org.apache.parquet.bytes.BytesInput.toByteArray(BytesInput.java:161)

This error occurs when writing large amounts of data to a Parquet external table. Resolve it in this order:

  1. Reduce mcfed.parquet.block.row.count.limit in tblproperties.

  2. If OOM persists or output files are too large, also reduce mcfed.parquet.page.size.row.check.max.

Add these session-level settings before writing:

-- Set maximum JVM heap memory for UDFs.
SET odps.sql.udf.jvm.memory=12288;
-- Control batch size on the runtime side.
SET odps.sql.executionengine.batch.rowcount=64;
-- Set memory size per mapper.
SET odps.stage.mapper.mem=12288;
-- Adjust input data size per mapper (file split size).
SET odps.stage.mapper.split.size=64;
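The two tuning steps above can also be persisted as table properties instead of session settings. A hedged sketch, with a hypothetical table and placeholder path:

```sql
CREATE EXTERNAL TABLE IF NOT EXISTS mc_oss_parquet_bulk (
    id BIGINT,
    payload STRING
)
STORED AS parquet
LOCATION 'oss://<endpoint>/<bucket>/parquet_bulk/'
tblproperties (
    -- Step 1: cap records per row group (default: Integer.MAX_VALUE).
    'mcfed.parquet.block.row.count.limit'='100',
    -- Step 2: only if OOM persists, check memory more frequently.
    'mcfed.parquet.page.size.row.check.max'='1000'
);
```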