MaxCompute: Overview of public datasets

Last Updated: Mar 26, 2026

MaxCompute provides a set of public datasets you can query immediately using MaxCompute SQL — no data loading required. The datasets cover GitHub activity, national statistics, TPC benchmark data, e-commerce behavior, and more, all stored in the BIGDATA_PUBLIC_DATASET project.

Important

Public datasets are for testing and exploration only. The data is not refreshed on a regular schedule, and accuracy is not guaranteed. Do not use this data in production environments.

Available datasets

All datasets are stored as schemas within the BIGDATA_PUBLIC_DATASET project.

Category | What you can analyze | Schema
GitHub public event data | Open source project activity: stars, commits, forks, and other events across public repositories | github_events
National statistics | Annual GDP trends for countries worldwide and for provinces in China | national_data
TPC-DS performance data | Decision support queries across a simulated multi-channel retail chain | tpcds_10g, tpcds_100g, tpcds_1t, tpcds_10t
TPC-H performance data | OLAP queries on supplier-buyer transaction data | tpch_10g, tpch_100g, tpch_1t, tpch_10t
TPCx-BB performance data | Big data benchmark queries on an online retail scenario | tpcxbb_10g, tpcxbb_100g, tpcxbb_1t, tpcxbb_10t
Digital commerce | Ad click behavior and shopping activity from Taobao users | commerce
Life service | Movie box office results, mobile number attribution, and administrative division codes | life_service
Financial stock | Stock prices and quarterly financial reports | finance

Usage notes

Public datasets are accessible to all MaxCompute users. Before querying, note the following:

Cross-project access is required. Data is stored in the BIGDATA_PUBLIC_DATASET project, but you are not added as a member of that project. When writing SQL, prefix every table name with the project name and schema name:

-- Enable session-level schema syntax (required if tenant-level schema syntax is not enabled).
SET odps.namespace.schema=true;

-- Query 100 records from the dwd_github_events_odps table.
SELECT * FROM bigdata_public_dataset.github_events.dwd_github_events_odps WHERE ds='2024-05-10' LIMIT 100;

If tenant-level schema syntax is enabled for your account, the SET statement is not required.

Storage is free; compute is billed. You are not charged for storing data in public datasets. However, the computing resources your queries consume are billed at the standard Pay-as-you-go rate. For details, see Computing fees (Pay-as-you-go).
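
Because compute is billed, it can help to estimate a query before running it. The following is a minimal sketch, assuming your client supports the COST SQL command (the MaxCompute client does; other tools may not). It also filters on the ds partition column so that only one day of data is scanned. The partition value is taken from the earlier example and may need adjusting.

```sql
-- Enable session-level schema syntax (skip if enabled at the tenant level).
SET odps.namespace.schema=true;

-- Ask MaxCompute for an estimate instead of executing the query.
-- Filtering on the ds partition limits the scan to a single day of events.
COST SQL
SELECT
    repo_id,
    repo_name
FROM
    bigdata_public_dataset.github_events.dwd_github_events_odps
WHERE
    ds = '2024-05-10';
```

Selecting only the columns you need and always filtering on the partition column are the two simplest ways to keep the billed scan small.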

DataWorks limitations. Because cross-project access is required:

  • Tables in BIGDATA_PUBLIC_DATASET do not appear in the DataWorks Data Map.

  • If tenant-level schema syntax is not enabled for your account, the datasets are not visible in DataWorks DataAnalysis. You can still query them by running SQL statements directly.

TPC benchmark disclaimer. The TPC data in MaxCompute is based on TPC benchmarks. Results from queries on these datasets cannot be compared with published TPC benchmark results, because the tests do not meet all TPC benchmark requirements. To generate your own TPC data, see the official TPC documentation.

Dataset details

GitHub public event data

Property Details
Project BIGDATA_PUBLIC_DATASET
Schema github_events
Available regions China (Hangzhou), China (Shanghai), China (Beijing), China (Zhangjiakou), China (Ulanqab), China (Shenzhen), China (Chengdu)
Data source GH Archive

Use this dataset to analyze open source project activity on GitHub — who is starring repositories, committing code, opening pull requests, and more. For a full list of event types, see GitHub Events.

Tables:

Table | Description | Update cycle
dwd_github_events_odps | Fact table for GitHub public events | T+1 hour
dws_overview_by_repo_month | Monthly aggregated metrics per repository | T+1 day

Inspect the table schema:

-- Enable session-level schema syntax.
SET odps.namespace.schema=true;
-- Replace the schema and table name to inspect other tables.
DESC bigdata_public_dataset.github_events.dwd_github_events_odps;

Example query — find the most-starred repositories in the past year:

Which repositories received the most stars in the last 365 days?

-- Enable session-level schema syntax.
SET odps.namespace.schema=true;
-- Returns the top 10 repositories by star count over the last 365 days.
-- Note: This query does not account for users who later unstarred a repository.
SELECT
    repo_id,
    repo_name,
    COUNT(actor_login) total
FROM
    bigdata_public_dataset.github_events.dwd_github_events_odps
WHERE
    ds >= date_add(getdate(), -365)
    AND type = 'WatchEvent'
GROUP BY
    repo_id,
    repo_name
ORDER BY
    total DESC
LIMIT 10;

For more data details and query samples, see GitHub public event data.
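
As a further illustration, the sketch below counts events by type for a single day. It uses only the type and ds columns shown in the examples above. The partition value '2024-05-10' is an assumption borrowed from the usage notes, so replace it with a date that exists in the table.

```sql
-- Enable session-level schema syntax.
SET odps.namespace.schema=true;
-- Count events per event type for one daily partition.
SELECT
    type,
    COUNT(*) AS event_count
FROM
    bigdata_public_dataset.github_events.dwd_github_events_odps
WHERE
    ds = '2024-05-10'
GROUP BY
    type
ORDER BY
    event_count DESC
LIMIT 20;
```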

National statistics

Property Details
Project BIGDATA_PUBLIC_DATASET
Schema national_data
Available regions China (Hangzhou), China (Shanghai), China (Beijing), China (Zhangjiakou), China (Ulanqab), China (Shenzhen), China (Chengdu)
Update cycle Static — not updated

Use this dataset to analyze long-term economic trends. It contains annual GDP data for provinces in China and for countries worldwide.

Tables:

Table | Description | Data source
annual_gdp_by_province | Annual GDP by province in China | National Bureau of Statistics of China
annual_gdp_by_country | Annual GDP by country | International Monetary Fund (IMF)

Inspect the table schema:

-- Enable session-level schema syntax.
SET odps.namespace.schema=true;
-- Replace the table name to inspect other tables.
DESC bigdata_public_dataset.national_data.annual_gdp_by_province;

Example query — view Beijing's GDP trend over the past 20 years:

How has Beijing's GDP changed over the last 20 years?

-- Enable session-level schema syntax.
SET odps.namespace.schema=true;
SELECT
    region,
    gdp,
    year
FROM
    bigdata_public_dataset.national_data.annual_gdp_by_province
WHERE
    region = 'Beijing'
ORDER BY
    year DESC  -- newest first, so LIMIT keeps the 20 most recent years
LIMIT 20;
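
To turn the raw values into a trend, one option is to compare each year with the previous one. The sketch below uses the LAG window function and assumes that gdp holds numeric values and that the table has one row per region and year. Run DESC on the table first to confirm the column types.

```sql
-- Enable session-level schema syntax.
SET odps.namespace.schema=true;
SELECT
    region,
    year,
    gdp,
    -- GDP of the preceding year for the same region (NULL for the first year).
    LAG(gdp) OVER (PARTITION BY region ORDER BY year) AS prev_gdp,
    -- Absolute change compared with the preceding year.
    gdp - LAG(gdp) OVER (PARTITION BY region ORDER BY year) AS gdp_change
FROM
    bigdata_public_dataset.national_data.annual_gdp_by_province
WHERE
    region = 'Beijing'
ORDER BY
    year DESC
LIMIT 20;
```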

TPC-DS data

Property Details
Project BIGDATA_PUBLIC_DATASET
Schemas tpcds_10g, tpcds_100g, tpcds_1t, tpcds_10t
Available regions China (Hangzhou), China (Shanghai), China (Beijing), China (Zhangjiakou), China (Ulanqab), China (Shenzhen), China (Chengdu), China (Hong Kong), Japan (Tokyo), Singapore, Malaysia (Kuala Lumpur), Indonesia (Jakarta), US (Virginia), US (Silicon Valley), UK (London), Germany (Frankfurt), UAE (Dubai), China (Shanghai) Finance Cloud, China (Beijing) Finance Cloud (Invitational Preview), China (Beijing) Alibaba Gov Cloud 1, China (Shenzhen) Finance Cloud
Update cycle Static — not updated
Data source TPC

Use TPC-DS data to run decision support benchmark queries on MaxCompute. TPC-DS simulates the sales system of a large retail chain across three channels — physical stores, online stores, and phone orders (catalog) — with data covering sales, returns, customers, products, and promotions. Choose the schema that matches your target data volume for performance testing.

Tables (24 total):

Table Description
call_center Customer service center information
catalog_page Product catalog information
catalog_returns Return records from the phone order channel
catalog_sales Sales records from the phone order channel
customer Customer information
customer_address Customer address information
customer_demographics Basic customer credit information
date_dim Date dimension information
household_demographics Basic household credit information
income_band Income information
inventory Warehouse inventory information
item Product information
promotion Product promotion information
reason Reasons for customer returns
ship_mode Shipping method information
store Merchant (physical outlet) information
store_returns Return records from the physical outlet channel
store_sales Sales records from the physical outlet channel
time_dim Time dimension information
warehouse Warehouse information
web_page Product web page information
web_returns Return records from the web channel
web_sales Sales records from the web channel
web_site Product website information

Inspect the table schema:

-- Enable session-level schema syntax.
SET odps.namespace.schema=TRUE;
-- This example uses tpcds_10g. Replace the schema name to query other dataset sizes.
DESC bigdata_public_dataset.tpcds_10g.call_center;

Example query — find top-selling brands in December by year:

Which brands had the highest sales in December each year, for a specific manufacturer?

SET odps.namespace.schema=TRUE;
SELECT
    dt.d_year,
    item.i_brand_id  brand_id,
    item.i_brand     brand,
    SUM(ss_sales_price) sum_agg
FROM
    bigdata_public_dataset.tpcds_10g.date_dim       dt,
    bigdata_public_dataset.tpcds_10g.store_sales,
    bigdata_public_dataset.tpcds_10g.item
WHERE
    dt.d_date_sk = store_sales.ss_sold_date_sk
    AND store_sales.ss_item_sk = item.i_item_sk
    AND item.i_manufact_id = 190
    AND dt.d_moy = 12
GROUP BY
    dt.d_year,
    item.i_brand,
    item.i_brand_id
ORDER BY
    dt.d_year,
    sum_agg DESC,
    brand_id
LIMIT 100;

For query sample files covering all dataset sizes, see TPC-DS data. For benchmark specification details, see the official TPC Benchmark DS standard specification.
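
A smaller join can be a useful warm-up before running the full benchmark queries. The following sketch ranks return reasons for the physical store channel. It assumes the tables follow the standard TPC-DS column names (sr_reason_sk, sr_return_amt, r_reason_sk, r_reason_desc), which you can confirm with DESC.

```sql
SET odps.namespace.schema=TRUE;
-- Rank return reasons by number of store returns.
SELECT
    r.r_reason_desc        AS reason,
    COUNT(*)               AS return_count,
    SUM(sr.sr_return_amt)  AS total_return_amt
FROM
    bigdata_public_dataset.tpcds_10g.store_returns sr
JOIN
    bigdata_public_dataset.tpcds_10g.reason r
ON
    sr.sr_reason_sk = r.r_reason_sk
GROUP BY
    r.r_reason_desc
ORDER BY
    return_count DESC
LIMIT 10;
```

To test at a larger scale, replace tpcds_10g with another schema such as tpcds_100g in both table references.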

TPC-H data

Property Details
Project BIGDATA_PUBLIC_DATASET
Schemas tpch_10g, tpch_100g, tpch_1t, tpch_10t
Available regions China (Hangzhou), China (Shanghai), China (Beijing), China (Zhangjiakou), China (Ulanqab), China (Shenzhen), China (Chengdu), China (Hong Kong), Japan (Tokyo), Singapore, Malaysia (Kuala Lumpur), Indonesia (Jakarta), US (Virginia), US (Silicon Valley), UK (London), Germany (Frankfurt), UAE (Dubai), China (Shanghai) Finance Cloud, China (Beijing) Finance Cloud (Invitational Preview), China (Beijing) Alibaba Gov Cloud 1, China (Shenzhen) Finance Cloud
Update cycle Static — not updated
Data source TPC

Use TPC-H data to benchmark Online Analytical Processing (OLAP) queries on MaxCompute. TPC-H uses a set of business-oriented ad hoc queries with concurrent data modifications to evaluate query performance on large datasets. The data simulates transactions between suppliers and buyers.

Tables (8 total):

Table Description
customer Customer information
lineitem Line item details for orders
nation Nation information
orders Order information
part Part information
partsupp Supplier part information
region Region information
supplier Supplier information

Inspect the table schema:

-- Enable session-level schema syntax.
SET odps.namespace.schema=TRUE;
-- This example uses tpch_10g. Replace the schema name to query other dataset sizes.
DESC bigdata_public_dataset.tpch_10g.lineitem;

Example query — pricing summary report for shipped line items:

What is the summary of pricing information — quantities, prices, discounts, taxes, and order counts — for all line items shipped before a given date?

SET odps.namespace.schema=TRUE;
SET odps.sql.validate.orderby.limit=FALSE;
SET odps.sql.hive.compatible=TRUE;
SELECT
    l_returnflag,
    l_linestatus,
    sum(l_quantity)                                      AS sum_qty,
    sum(l_extendedprice)                                 AS sum_base_price,
    sum(l_extendedprice * (1 - l_discount))              AS sum_disc_price,
    sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) AS sum_charge,
    avg(l_quantity)                                      AS avg_qty,
    avg(l_extendedprice)                                 AS avg_price,
    avg(l_discount)                                      AS avg_disc,
    count(*)                                             AS count_order
FROM
    bigdata_public_dataset.tpch_10g.lineitem
WHERE
    l_shipdate <= date'1998-12-01' - interval '90' DAY
GROUP BY
    l_returnflag,
    l_linestatus
ORDER BY
    l_returnflag,
    l_linestatus;

For benchmark specification details, see the official TPC Benchmark H standard specification.
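
As a second illustration, the sketch below is modeled on the TPC-H forecasting revenue change query. It estimates how much revenue would have been gained by removing small discounts on low-quantity orders in one year. The date range and thresholds are illustrative values, and the SET statements are copied from the example above on the assumption that the same session settings are needed for the date literals.

```sql
SET odps.namespace.schema=TRUE;
SET odps.sql.hive.compatible=TRUE;
-- Sum the discount amounts for qualifying line items shipped in 1994.
SELECT
    sum(l_extendedprice * l_discount) AS revenue
FROM
    bigdata_public_dataset.tpch_10g.lineitem
WHERE
    l_shipdate >= date'1994-01-01'
    AND l_shipdate < date'1995-01-01'
    AND l_discount BETWEEN 0.05 AND 0.07
    AND l_quantity < 24;
```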

TPCx-BB data

Property Details
Project BIGDATA_PUBLIC_DATASET
Schemas tpcxbb_10g, tpcxbb_100g, tpcxbb_1t, tpcxbb_10t
Available regions China (Hangzhou), China (Shanghai), China (Beijing), China (Zhangjiakou), China (Ulanqab), China (Shenzhen), China (Chengdu), China (Hong Kong), Japan (Tokyo), Singapore, Malaysia (Kuala Lumpur), Indonesia (Jakarta), US (Virginia), US (Silicon Valley), UK (London), Germany (Frankfurt), UAE (Dubai), China (Shanghai) Finance Cloud, China (Beijing) Finance Cloud (Invitational Preview), China (Beijing) Alibaba Gov Cloud 1, China (Shenzhen) Finance Cloud
Update cycle Static — not updated
Data source TPC

TPCx-BB is a big data benchmark that measures the performance of Hadoop-based big data systems. It evaluates hardware and software by running 30 common analytical queries on a simulated online retail scenario, covering sales, returns, products, promotions, and web clickstreams.

Tables (23 total):

Table Description
customer Customer information
customer_address Customer address information
customer_demographics Basic customer credit information
date_dim Date dimension information
household_demographics Basic household credit information
income_band Income information
inventory Warehouse inventory information
item Product information
item_marketprices Competitor price information for products
product_reviews Product review information
promotion Product promotion information
reason Reasons for customer returns
ship_mode Shipping method information
store Physical outlet information
store_returns Return records from the physical outlet channel
store_sales Sales records from the physical outlet channel
time_dim Time dimension information
warehouse Warehouse information
web_clickstreams Web clickstream information
web_page Product web page information
web_returns Return records from the web channel
web_sales Sales records from the web channel
web_site Product website information

Inspect the table schema:

-- Enable session-level schema syntax.
SET odps.namespace.schema=TRUE;
-- This example uses tpcxbb_10g. Replace the schema name to query other dataset sizes.
DESC bigdata_public_dataset.tpcxbb_10g.web_sales;

Example query:

SET odps.namespace.schema=TRUE;
SELECT * FROM bigdata_public_dataset.tpcxbb_10g.web_sales LIMIT 100;

For benchmark specification details, see the official TPCx-BB standard specification.
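
For something closer to an analytical query than SELECT *, the sketch below finds the items with the highest quantity sold through the web channel. It assumes web_sales has the ws_item_sk and ws_quantity columns used in the TPC-DS-style schema. Check the DESC output above before running it.

```sql
SET odps.namespace.schema=TRUE;
-- Top 10 items by total quantity sold on the web channel.
SELECT
    ws_item_sk,
    SUM(ws_quantity) AS total_quantity
FROM
    bigdata_public_dataset.tpcxbb_10g.web_sales
GROUP BY
    ws_item_sk
ORDER BY
    total_quantity DESC
LIMIT 10;
```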

Digital commerce dataset

Property Details
Project BIGDATA_PUBLIC_DATASET
Schema commerce
Available regions China (Hangzhou), China (Shanghai), China (Beijing), China (Zhangjiakou), China (Ulanqab), China (Shenzhen), China (Chengdu)
Update cycle Static — incremental updates are no longer provided
Data source Tianchi Lab — Taobao Display Ad Click-Through Rate Prediction dataset

Use this dataset to analyze ad click-through rates and shopping behavior. It contains display ad click logs and shopping activity — browsing, adding to cart, liking, and purchasing — from over one million randomly sampled Taobao users.

Tables:

Table Description
adv_raw_sample Raw display ad click logs from over 1 million users over 8 days
adv_ad_feature Basic information about the ads in adv_raw_sample
user_profile Basic profile for all users in adv_raw_sample
behavior_log Shopping activity for all users in adv_raw_sample over 22 days

Inspect the table schema:

-- Enable session-level schema syntax.
SET odps.namespace.schema=TRUE;
-- Replace the table name to inspect other tables.
DESC bigdata_public_dataset.commerce.behavior_log;

Example query — find the top three product categories by sales volume over 22 days:

Which product categories had the highest purchase volume across the 22-day observation period?

-- Enable session-level schema syntax.
SET odps.namespace.schema=TRUE;
SELECT
    cate,
    count(btag) sales
FROM
    bigdata_public_dataset.commerce.behavior_log
WHERE
    btag = 'buy'
GROUP BY
    cate
ORDER BY
    sales DESC
LIMIT 3;
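
Before filtering on a single behavior, it can be useful to see how the log is distributed across behavior types. The sketch below groups by the btag column used in the example above. It does not assume which values other than 'buy' appear in the data.

```sql
-- Enable session-level schema syntax.
SET odps.namespace.schema=TRUE;
-- Count log records per behavior type.
SELECT
    btag,
    COUNT(*) AS records
FROM
    bigdata_public_dataset.commerce.behavior_log
GROUP BY
    btag
ORDER BY
    records DESC;
```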

Life service dataset

Property Details
Project BIGDATA_PUBLIC_DATASET
Schema life_service
Available regions China (Hangzhou), China (Shanghai), China (Beijing), China (Zhangjiakou), China (Ulanqab), China (Shenzhen), China (Chengdu)

Use this dataset to analyze entertainment trends and geographic reference data. It includes movie box office results and administrative reference codes for China.

Tables:

Table | Description | Update cycle
movie_basic_info | Basic movie information | Fixed date partitions; incremental updates are no longer provided
movie_box | Box office results by date | Fixed date partitions; incremental updates are no longer provided
areacode_basic_info_2020 | Administrative and urban/rural division codes (2020) | Static (not updated)
phoneno_basic_info_2020 | Mobile number attribution data (2020) | Static (not updated)

Inspect the table schema:

-- Enable session-level schema syntax.
SET odps.namespace.schema=TRUE;
-- Replace the table name to inspect other tables.
DESC bigdata_public_dataset.life_service.movie_box;

Example query — find the top 10 movies at the box office on January 14, 2017:

Which movies had the highest box office ranking on January 14, 2017?

-- Enable session-level schema syntax.
SET odps.namespace.schema=TRUE;
SELECT
    moviename
FROM
    bigdata_public_dataset.life_service.movie_box
WHERE
    ds = '20170114'
ORDER BY
    rank ASC
LIMIT 10;
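
Because movie_box contains a fixed set of date partitions, a query on a date that was never loaded returns no rows. One way to see which dates are available is to list the partitions first. This is a sketch that assumes your account is permitted to list partitions on the public project.

```sql
-- Enable session-level schema syntax.
SET odps.namespace.schema=TRUE;
-- List the date partitions that exist in the movie_box table.
SHOW PARTITIONS bigdata_public_dataset.life_service.movie_box;
```

Use one of the returned ds values in the WHERE clause of the example query above.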

Financial stock dataset

Property Details
Project BIGDATA_PUBLIC_DATASET
Schema finance
Available regions China (Hangzhou), China (Shanghai), China (Beijing), China (Zhangjiakou), China (Ulanqab), China (Shenzhen), China (Chengdu)
Update cycle Fixed date partitions — incremental updates no longer provided

Use this dataset to analyze stock performance trends and quarterly financial results. It contains stock price history and quarterly financial reports for publicly traded companies.

Tables:

Table Description
ods_enterprise_share_basic Basic stock information
ods_enterprise_share_quarter_cashflow Quarterly cash flow report
ods_enterprise_share_quarter_growth Quarterly business growth data
ods_enterprise_share_quarter_operation Quarterly financial turnover data
ods_enterprise_share_quarter_profit Quarterly profit statement
ods_enterprise_share_quarter_report Quarterly report
ods_enterprise_share_trade_h Stock price history

Inspect the table schema:

-- Enable session-level schema syntax.
SET odps.namespace.schema=TRUE;
-- Replace the table name to inspect other tables.
DESC bigdata_public_dataset.finance.ods_enterprise_share_basic;

Example query — view basic stock data for January 14, 2017:

What basic stock information was recorded on January 14, 2017?

-- Enable session-level schema syntax.
SET odps.namespace.schema=TRUE;
SELECT *
FROM bigdata_public_dataset.finance.ods_enterprise_share_basic
WHERE ds = '20170114'
LIMIT 10;

Query public datasets

Prerequisites

Before you begin, ensure that you have:

  • An active MaxCompute service

  • A MaxCompute project. For setup instructions, see Create a project

Supported tools

You can run queries against public datasets with any tool that connects to MaxCompute. The following procedure uses DataWorks.

Query using DataWorks

  1. Log on to the DataWorks console and select a region in the upper-left corner.

  2. Create a workspace.

  3. Attach a MaxCompute data source.

  4. Create an ODPS SQL node and enter your SQL query. The following example returns the annual GDP of every province in China, ordered by year:

    SET odps.namespace.schema=true;
    SET odps.sql.validate.orderby.limit = false;
    SELECT
        region,
        gdp,
        year
    FROM
        bigdata_public_dataset.national_data.annual_gdp_by_province
    ORDER BY
        year ASC;
  5. Click the run button to execute the query and view results.
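
If you want to keep a working copy of a public table in your own project, for example before exporting it, a CREATE TABLE ... AS statement is one option. This is a sketch only: gdp_by_province_copy is a hypothetical table name, and the copy is stored in your own project, so it will likely count toward your own storage usage.

```sql
SET odps.namespace.schema=true;
-- Copy the public table into the current project (hypothetical table name).
CREATE TABLE IF NOT EXISTS gdp_by_province_copy AS
SELECT
    region,
    gdp,
    year
FROM
    bigdata_public_dataset.national_data.annual_gdp_by_province;
```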

What's next

To export data from MaxCompute after running your queries:

  • Download: Download query results or data to your local machine.

  • UNLOAD: Export data to external storage such as OSS or Hologres.