MaxCompute has used Alibaba Optimized Row Columnar (AliORC) as its data storage format since February 2020. To show how this format performs, this topic compares AliORC with Apache Optimized Row Columnar (ORC) and Apache Parquet based on TPC Benchmark DS (TPC-DS) tests.

Test results

  • The following table compares the test results of AliORC with those of Apache ORC and Apache Parquet. The results are aggregated over the full dataset of 24 TPC-DS test tables.
    Item                                | File size              | Writer elapsed time    | Reader elapsed time
    AliORC compared with Apache ORC     | Drops by more than 8%  | Drops by more than 85% | Drops by more than 76%
    AliORC compared with Apache Parquet | Drops by more than 22% | Drops by more than 50% | Drops by more than 28%
    The following figure shows the test results of the dataset.
    Parameter description:
    • File size: the data storage size of all tables combined. Unit: bytes.
    • Writer elapsed time: the time required to write the TPC-DS CSV data in the AliORC, Apache ORC, or Apache Parquet format. Unit: seconds.
    • Reader elapsed time: the time required to complete a full scan of the data stored in the AliORC, Apache ORC, or Apache Parquet format. Unit: seconds.
  • The following table describes the comparison results based on the store_sales table, which is the largest table among the test tables.
    Item                                | File size              | Writer elapsed time    | Reader elapsed time
    AliORC compared with Apache ORC     | Drops by more than 7%  | Drops by more than 86% | Drops by more than 74%
    AliORC compared with Apache Parquet | Drops by more than 20% | Drops by more than 54% | Drops by more than 30%
    The following figure shows the test results of the store_sales table.
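
The "Drops by more than N%" entries in the tables above read most naturally as relative reductions against the baseline format. The following minimal sketch shows that arithmetic, assuming this interpretation. The function name percent_drop and the sample values are hypothetical and are not taken from the benchmark.

```python
def percent_drop(baseline: float, candidate: float) -> float:
    """Relative reduction of `candidate` versus `baseline`, in percent.

    A positive result means the candidate is smaller or faster than the baseline.
    """
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - candidate) * 100.0 / baseline


if __name__ == "__main__":
    # Hypothetical numbers for illustration only, not measured results.
    baseline_seconds = 100.0   # assumed writer elapsed time of a baseline format
    candidate_seconds = 14.0   # assumed writer elapsed time of the format under test
    drop = percent_drop(baseline_seconds, candidate_seconds)
    print(f"Writer elapsed time drops by {drop:.1f}%")
```

The same function applies to file size (bytes) and reader elapsed time (seconds), because each metric is compared against the baseline in its own unit.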

Test environments

  • Apache Parquet version: Apache Arrow C++ V0.16.0
  • Apache ORC version: C++ V1.6.2
  • Dataset: TPC-DS, 10 GB (scale factor SF=10)
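
The three metrics reported under Test results (file size, writer elapsed time, and reader elapsed time) could be collected with a harness along the following lines. This is a sketch only, not the harness used for the benchmark: the tests used the C++ libraries listed above, whereas this example uses Python with a pipe-delimited text writer and reader as stand-ins for the columnar formats. All function names and the sample rows are hypothetical.

```python
import csv
import os
import tempfile
import time


def write_delimited(rows, path):
    """Stand-in writer; a real run would call the columnar format's writer here."""
    with open(path, "w", newline="") as f:
        csv.writer(f, delimiter="|").writerows(rows)


def scan_delimited(path):
    """Stand-in reader; performs a full scan and returns the number of rows read."""
    with open(path, newline="") as f:
        return sum(1 for _ in csv.reader(f, delimiter="|"))


def measure(write_fn, read_fn, rows, path):
    """Collect file size (bytes), writer time (s), and reader time (s) for one format."""
    start = time.perf_counter()
    write_fn(rows, path)
    writer_seconds = time.perf_counter() - start

    file_size_bytes = os.path.getsize(path)

    start = time.perf_counter()
    rows_scanned = read_fn(path)
    reader_seconds = time.perf_counter() - start

    return {
        "file_size_bytes": file_size_bytes,
        "writer_seconds": writer_seconds,
        "reader_seconds": reader_seconds,
        "rows_scanned": rows_scanned,
    }


if __name__ == "__main__":
    # Made-up rows for illustration; the benchmark used TPC-DS data instead.
    sample_rows = [[i, f"item_{i}", i * 1.5] for i in range(1000)]
    with tempfile.TemporaryDirectory() as tmp:
        result = measure(write_delimited, scan_delimited, sample_rows,
                         os.path.join(tmp, "sample.dat"))
    print(result)
```

Passing the writer and reader as functions keeps the timing logic identical for every format, so differences in the results come from the format implementations rather than from the harness.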

Dataset

TPC-DS is a decision support benchmark that uses multi-dimensional data models, such as star and snowflake data models. The benchmark contains 7 fact tables and 17 dimension tables, with an average of 18 columns per table. The tables contain skewed data and values to simulate real-world scenarios. TPC-DS is widely used to benchmark different versions of Hadoop and SQL-on-Hadoop engines.

The following list shows the 24 tables of the TPC-DS dataset used in this test:
store_sales
catalog_sales
inventory
web_sales
store_returns
catalog_returns
web_returns
customer_demographics
customer
item
customer_address
date_dim
time_dim
catalog_page
household_demographics
promotion
store
web_page
web_site
call_center
reason
warehouse
ship_mode
income_band
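
To relate the list above to the 7 fact tables and 17 dimension tables mentioned earlier, the following sketch groups the 24 tables by role. The grouping follows the standard TPC-DS schema (sales, returns, and inventory tables are fact tables); the variable names are illustrative.

```python
# Fact tables of the TPC-DS schema: sales, returns, and inventory.
FACT_TABLES = [
    "store_sales", "catalog_sales", "web_sales",
    "store_returns", "catalog_returns", "web_returns",
    "inventory",
]

# Dimension tables: the remaining tables from the list above.
DIMENSION_TABLES = [
    "customer_demographics", "customer", "item", "customer_address",
    "date_dim", "time_dim", "catalog_page", "household_demographics",
    "promotion", "store", "web_page", "web_site", "call_center",
    "reason", "warehouse", "ship_mode", "income_band",
]

ALL_TABLES = FACT_TABLES + DIMENSION_TABLES

if __name__ == "__main__":
    print(f"{len(FACT_TABLES)} fact tables, "
          f"{len(DIMENSION_TABLES)} dimension tables, "
          f"{len(ALL_TABLES)} tables in total")
```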