MaxCompute: Overview of public datasets

Last Updated: Dec 04, 2025

If you have activated MaxCompute, you can query the tables in the public datasets with MaxCompute SQL to quickly try out the service. This topic describes the public datasets and explains how to query and analyze their data.

Introduction

MaxCompute offers public datasets in several categories, such as GitHub public event data, national statistics, TPC performance test data, digital commerce data, life service data, and financial stock data. This data is stored in different schemas within the BIGDATA_PUBLIC_DATASET public project in MaxCompute.

The following list gives the description, dataset names, and schema names for each category.

Category: GitHub public event data

Developers on GitHub generate a large volume of events when they work on open source projects. GitHub records the type and details of each event, the developer, and the code repository. Public events, such as starring a repository or committing code, are made available.

  • Dataset name: GitHub public event dataset

  • Schema name: github_events

Category: National statistics

Includes annual GDP data for countries around the world and for provinces in China.

  • Dataset name: National statistics dataset

  • Schema name: national_data

Category: TPC performance data (TPC-DS)

TPC-DS is a benchmark for decision support systems. It models common aspects of these systems, such as queries and data maintenance. This lets you run benchmark tests on emerging technologies like big data systems.

  • TPC-DS 10 GB performance test set (schema name: tpcds_10g)

  • TPC-DS 100 GB performance test set (schema name: tpcds_100g)

  • TPC-DS 1 TB performance test set (schema name: tpcds_1t)

  • TPC-DS 10 TB performance test set (schema name: tpcds_10t)

Category: TPC performance data (TPC-H)

TPC-H is a benchmark for decision support systems. It uses a set of business-oriented ad hoc queries and concurrent data modifications. It runs complex queries on large data volumes to answer key business questions.

  • TPC-H 10 GB performance test set (schema name: tpch_10g)

  • TPC-H 100 GB performance test set (schema name: tpch_100g)

  • TPC-H 1 TB performance test set (schema name: tpch_1t)

  • TPC-H 10 TB performance test set (schema name: tpch_10t)

Category: TPC performance data (TPCx-BB)

TPC Express Benchmark BB (TPCx-BB) is a big data benchmark. It measures the performance of Hadoop-based big data systems. It evaluates hardware and software components by running 30 common analytical queries.

  • TPCx-BB 10 GB performance test set (schema name: tpcbb_10g)

  • TPCx-BB 100 GB performance test set (schema name: tpcbb_100g)

  • TPCx-BB 1 TB performance test set (schema name: tpcbb_1t)

  • TPCx-BB 10 TB performance test set (schema name: tpcbb_10t)

Category: Digital commerce

Includes data from Taobao advertising, Taobao shopping, and Alibaba E-commerce.

  • Dataset name: Digital commerce dataset

  • Schema name: commerce

Category: Life service

Includes data on second-hand real estate, movies and box office results, mobile number attribution, and administrative and urban/rural division codes.

  • Dataset name: Life service dataset

  • Schema name: life_service

Category: Financial stock

Stock information.

  • Dataset name: Financial stock dataset

  • Schema name: finance

Disclaimer

  • The public datasets provided by MaxCompute are for product testing only. The data is not updated periodically, and its accuracy is not guaranteed. Do not use this data in production environments.

  • The generation and analysis of TPC data in the MaxCompute public datasets are based on TPC benchmarks. The results cannot be compared with published TPC benchmark results because the tests run on the MaxCompute public datasets do not meet all TPC benchmark requirements.

  • The TPC performance test data in MaxCompute originates from TPC. You can also generate TPC data yourself. For more information, see the official TPC documentation.

Precautions

Public datasets are available to all MaxCompute users. When you use public datasets, take note of the following items:

  • All data in the public datasets is stored in the BIGDATA_PUBLIC_DATASET project in MaxCompute. Because users are not added to this project as members, you must access the data across projects: in your SQL script, specify the project name and schema name before the table name. If tenant-level schema syntax is not enabled, enable session-level schema syntax before you execute a statement. Sample statements:

    -- Enable the session-level schema syntax.
    set odps.namespace.schema=true; 
    -- Query 100 data records from the dwd_github_events_odps table.
    select * from bigdata_public_dataset.github_events.dwd_github_events_odps where ds='2024-05-10' limit 100;
    Important

    You are not charged for the storage of the data in the public datasets. However, you are charged computing fees if you execute query statements. For more information, see Computing pricing (pay-as-you-go).

  • You cannot find the tables in the public datasets on the Data Map page of DataWorks because cross-project access is required.

  • Public datasets are stored by schema. If you do not enable the tenant-level schema syntax, you cannot view the public datasets in DataWorks DataAnalysis. In this case, you can query the public datasets only by executing SQL statements.
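Because queries incur computing fees, it can help to restrict a query to a single partition while you explore a table. The following is a minimal sketch, assuming that ds is the partition column of dwd_github_events_odps (as the filter in the sample statement above suggests). It counts events by type for one day. The date is illustrative only.

```sql
-- Enable the session-level schema syntax.
SET odps.namespace.schema=true;
-- Count events by type for a single day (assumes ds is the partition column).
SELECT
    type,
    COUNT(*) AS event_count
FROM
    bigdata_public_dataset.github_events.dwd_github_events_odps
WHERE
    ds = '2024-05-10'
GROUP BY
    type
ORDER BY
    event_count DESC
LIMIT 20;
```

The table name follows the same project.schema.table pattern as every other cross-project query in this topic.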

Detailed table information

The following tables provide detailed information about the tables in each schema of the BIGDATA_PUBLIC_DATASET public project.

GitHub public event data

Project name

BIGDATA_PUBLIC_DATASET

Schema name

github_events

Available regions

China (Hangzhou), China (Shanghai), China (Beijing), China (Zhangjiakou), China (Ulanqab), China (Shenzhen), China (Chengdu)

Table names and descriptions

Developers on GitHub generate a large volume of events when they work on open source projects. GitHub records the type and details of each event, the developer, and the code repository. Public events, such as starring a repository or committing code, are made available. For more information about event types, see GitHub Events.

MaxCompute processes the public event data from GH Archive offline to generate the following tables:

  • dwd_github_events_odps (fact table for GitHub public event data)

  • dws_overview_by_repo_month (aggregation table for monthly metrics of GitHub public events)

Note

The data in the tables is from GH Archive.

Update cycle

  • dwd_github_events_odps: Updated T+1 hour.

  • dws_overview_by_repo_month: Updated T+1 day.

Query table schema

-- Enable session-level schema syntax.
SET odps.namespace.schema=true; 
-- Query the schema of the dwd_github_events_odps table. To query other tables, replace the schema name and table name.
DESC bigdata_public_dataset.github_events.dwd_github_events_odps;

Query example

-- Enable session-level schema syntax.
SET odps.namespace.schema=true; 
-- Ranks the most starred projects in the last year. (Note: This example does not consider cases where users unstar projects.)
SELECT
    repo_id,
    repo_name,
    COUNT(actor_login) total
FROM
    bigdata_public_dataset.github_events.dwd_github_events_odps
WHERE
    ds>=date_add(getdate(), -365)
    AND type = 'WatchEvent'
GROUP BY
    repo_id,
    repo_name
ORDER BY
    total DESC
LIMIT 10;

For more information about the data and for query samples, see GitHub public event data.
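The same fact table can also show a trend over time for a single repository. The following sketch counts WatchEvent records per month. It assumes that ds is a string in yyyy-mm-dd format (as in the earlier sample statement), that repo_name uses the owner/name form, and that the chosen date range exists in the table. The repository name 'apache/spark' is a hypothetical value used only for illustration.

```sql
-- Enable session-level schema syntax.
SET odps.namespace.schema=true;
-- Monthly star counts for one repository (repository name and date range are illustrative).
SELECT
    SUBSTR(ds, 1, 7) AS star_month,
    COUNT(*) AS stars
FROM
    bigdata_public_dataset.github_events.dwd_github_events_odps
WHERE
    ds >= '2024-01-01'
    AND ds < '2024-07-01'
    AND type = 'WatchEvent'
    AND repo_name = 'apache/spark'
GROUP BY
    SUBSTR(ds, 1, 7)
ORDER BY
    star_month
LIMIT 100;
```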

National statistics

Project name

BIGDATA_PUBLIC_DATASET

Schema name

national_data

Available regions

China (Hangzhou), China (Shanghai), China (Beijing), China (Zhangjiakou), China (Ulanqab), China (Shenzhen), China (Chengdu)

Table names and descriptions

  • annual_gdp_by_province (annual GDP data by province in China)

  • annual_gdp_by_country (annual GDP data by country)

Note

The data for annual_gdp_by_province is from the National Bureau of Statistics of China. The data for annual_gdp_by_country is from the International Monetary Fund (IMF).

Update cycle

Provides static data that is not updated.

Query table schema

-- Enable session-level schema syntax.
SET odps.namespace.schema=true; 
-- Query the schema of the annual_gdp_by_province table. To query other tables, replace the schema name and table name.
DESC bigdata_public_dataset.national_data.annual_gdp_by_province;

Query example

-- Enable session-level schema syntax.
SET odps.namespace.schema=true; 
-- View the GDP trend of Beijing. The query returns at most 20 rows in ascending order of year.
SELECT
    region,
    gdp,
    year
FROM
    bigdata_public_dataset.national_data.annual_gdp_by_province
WHERE
    region='Beijing'
ORDER BY
    year ASC
LIMIT 20;
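Because the query above sorts by year in ascending order before it applies LIMIT 20, it returns the earliest 20 rows for the region. If you want the most recent 20 years instead, one option is to select the latest rows in a subquery and then sort them again. This is a sketch that assumes the values in the year column sort chronologically.

```sql
-- Enable session-level schema syntax.
SET odps.namespace.schema=true;
-- Take the 20 most recent years, then present them in ascending order.
SELECT
    region,
    gdp,
    year
FROM (
    SELECT
        region,
        gdp,
        year
    FROM
        bigdata_public_dataset.national_data.annual_gdp_by_province
    WHERE
        region = 'Beijing'
    ORDER BY
        year DESC
    LIMIT 20
) latest
ORDER BY
    year ASC
LIMIT 20;
```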

TPC-DS data

Project name

BIGDATA_PUBLIC_DATASET

Schema name

tpcds_10g, tpcds_100g, tpcds_1t, tpcds_10t

Available regions

China (Hangzhou), China (Shanghai), China (Beijing), China (Zhangjiakou), China (Ulanqab), China (Shenzhen), China (Chengdu), China (Hong Kong), Japan (Tokyo), Singapore, Malaysia (Kuala Lumpur), Indonesia (Jakarta), US (Virginia), US (Silicon Valley), UK (London), Germany (Frankfurt), UAE (Dubai), China (Shanghai) Finance Cloud, China (Beijing) Finance Cloud (Invitational Preview), China (Beijing) Alibaba Gov Cloud 1, China (Shenzhen) Finance Cloud

Table names and descriptions

The TPC-DS model simulates the sales system of a large retail chain with a national presence. It includes three sales channels: store (physical outlets), web (online stores), and catalog (phone orders). Each channel uses two tables to simulate sales and return records. The model also includes dimension tables for information about products, promotions, and customers. The details are as follows:

  • call_center (information about customer service centers)

  • catalog_page (information about product catalogs)

  • catalog_returns (product return records from the phone order channel)

  • catalog_sales (product sales records from the phone order channel)

  • customer (customer information)

  • customer_address (customer address information)

  • customer_demographics (basic customer credit information)

  • date_dim (date dimension information)

  • household_demographics (basic household credit information)

  • income_band (income information)

  • inventory (inventory information)

  • item (product information)

  • promotion (product promotion information)

  • reason (reasons for customer returns)

  • ship_mode (product shipping information)

  • store (merchant information)

  • store_returns (product return records from the physical outlet channel)

  • store_sales (product sales records from the physical outlet channel)

  • time_dim (time dimension information)

  • warehouse (warehouse information)

  • web_page (product web page information)

  • web_returns (product return records from the web channel)

  • web_sales (product sales records from the web channel)

  • web_site (basic product website information)

Note

The data in the tables is from TPC.

Update cycle

Provides static data that is not updated.

Query table schema

-- Enable session-level schema syntax.
SET odps.namespace.schema=TRUE; 
-- Queries the schema of the call_center table in tpcds_10g. To query tables in other dataset specifications, replace the schema name and table name.
DESC bigdata_public_dataset.tpcds_10g.call_center;

Query example

SET odps.namespace.schema=TRUE; 
SELECT dt.d_year ,
       item.i_brand_id brand_id ,
       item.i_brand brand ,
       SUM(ss_sales_price) sum_agg
FROM bigdata_public_dataset.tpcds_10g.date_dim dt ,
     bigdata_public_dataset.tpcds_10g.store_sales ,
     bigdata_public_dataset.tpcds_10g.item
WHERE dt.d_date_sk = store_sales.ss_sold_date_sk
  AND store_sales.ss_item_sk = item.i_item_sk
  AND item.i_manufact_id = 190
  AND dt.d_moy = 12
GROUP BY dt.d_year ,
         item.i_brand ,
         item.i_brand_id
ORDER BY dt.d_year,
         sum_agg DESC,
         brand_id LIMIT 100;

For query sample files for different data specifications, see TPC-DS data.

For more information about the data, see the official TPC Benchmark DS standard specification.
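The query example above joins tables with comma-separated FROM items. As an alternative sketch, the same kind of join can be written with explicit JOIN syntax. The following query uses only columns that appear in the example above and totals the store sales price by year. To run it against another data size, replace tpcds_10g with the corresponding schema name.

```sql
-- Enable session-level schema syntax.
SET odps.namespace.schema=TRUE;
-- Total store sales price per year, written with an explicit JOIN.
SELECT dt.d_year,
       COUNT(*) AS sales_rows,
       SUM(ss.ss_sales_price) AS total_sales_price
FROM bigdata_public_dataset.tpcds_10g.store_sales ss
JOIN bigdata_public_dataset.tpcds_10g.date_dim dt
  ON ss.ss_sold_date_sk = dt.d_date_sk
GROUP BY dt.d_year
ORDER BY dt.d_year
LIMIT 100;
```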

TPC-H data

Project name

BIGDATA_PUBLIC_DATASET

Schema name

tpch_10g, tpch_100g, tpch_1t, tpch_10t

Available regions

China (Hangzhou), China (Shanghai), China (Beijing), China (Zhangjiakou), China (Ulanqab), China (Shenzhen), China (Chengdu), China (Hong Kong), Japan (Tokyo), Singapore, Malaysia (Kuala Lumpur), Indonesia (Jakarta), US (Virginia), US (Silicon Valley), UK (London), Germany (Frankfurt), UAE (Dubai), China (Shanghai) Finance Cloud, China (Beijing) Finance Cloud (Invitational Preview), China (Beijing) Alibaba Gov Cloud 1, China (Shenzhen) Finance Cloud

Table names and descriptions

TPC-H is a benchmark program used to evaluate Online Analytical Processing (OLAP). It simulates transactions between suppliers and their buyers. It contains information about orders, products, and customers. The details are as follows:

  • customer (consumer information)

  • lineitem (order line item information)

  • nation (nation information)

  • orders (order information)

  • part (part information)

  • partsupp (supplier part information)

  • region (region information)

  • supplier (supplier information)

Note

The data in the tables is from TPC.

Update cycle

Provides static data that is not updated.

Query table schema

-- Enable session-level schema syntax.
SET odps.namespace.schema=TRUE; 
-- Queries the schema of the lineitem table in tpch_10g. To query tables in other dataset specifications, replace the schema name and table name.
DESC bigdata_public_dataset.tpch_10g.lineitem;

Query example

SET odps.namespace.schema=TRUE; 
SET odps.sql.validate.orderby.limit=FALSE;
SET odps.sql.hive.compatible=TRUE;
SELECT l_returnflag,
       l_linestatus,
       sum(l_quantity) AS sum_qty,
       sum(l_extendedprice) AS sum_base_price,
       sum(l_extendedprice * (1 - l_discount)) AS sum_disc_price,
       sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) AS sum_charge,
       avg(l_quantity) AS avg_qty,
       avg(l_extendedprice) AS avg_price,
       avg(l_discount) AS avg_disc,
       count(*) AS count_order
FROM bigdata_public_dataset.tpch_10g.lineitem
WHERE l_shipdate <= date'1998-12-01' - interval '90' DAY
GROUP BY l_returnflag,
         l_linestatus
ORDER BY l_returnflag,
         l_linestatus;

For more information about the data and for query samples, see the official TPC Benchmark H standard specification.
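As a second, smaller illustration on the same table, the following sketch is patterned after the revenue-forecast query in the TPC-H specification (query 6). It uses only lineitem columns that already appear in the example above. The date range, discount range, and quantity threshold are illustrative parameter values, and the session settings mirror the example above on the assumption that date literals need them.

```sql
SET odps.namespace.schema=TRUE;
SET odps.sql.hive.compatible=TRUE;
-- Revenue from discounted line items shipped within one year (parameter values are illustrative).
SELECT sum(l_extendedprice * l_discount) AS revenue
FROM bigdata_public_dataset.tpch_10g.lineitem
WHERE l_shipdate >= date'1994-01-01'
  AND l_shipdate < date'1995-01-01'
  AND l_discount BETWEEN 0.05 AND 0.07
  AND l_quantity < 24;
```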

TPCx-BB data

Project name

BIGDATA_PUBLIC_DATASET

Schema name

tpcxbb_10g, tpcxbb_100g, tpcxbb_1t, tpcxbb_10t

Available regions

China (Hangzhou), China (Shanghai), China (Beijing), China (Zhangjiakou), China (Ulanqab), China (Shenzhen), China (Chengdu), China (Hong Kong), Japan (Tokyo), Singapore, Malaysia (Kuala Lumpur), Indonesia (Jakarta), US (Virginia), US (Silicon Valley), UK (London), Germany (Frankfurt), UAE (Dubai), China (Shanghai) Finance Cloud, China (Beijing) Finance Cloud (Invitational Preview), China (Beijing) Alibaba Gov Cloud 1, China (Shenzhen) Finance Cloud

Table names and descriptions

TPCx-BB is a big data benchmark tool. It simulates an online retail scenario that includes sales and return records. It also contains information about products and promotions. The details are as follows:

  • customer (customer information)

  • customer_address (customer address information)

  • customer_demographics (basic customer credit information)

  • date_dim (date dimension information)

  • household_demographics (basic household credit information)

  • income_band (income information)

  • inventory (inventory information)

  • item (product information)

  • item_marketprices (competitor price information for products)

  • product_reviews (product review information)

  • promotion (product promotion information)

  • reason (reasons for customer returns)

  • ship_mode (product shipping information)

  • store (outlet information)

  • store_returns (product return records from the physical outlet channel)

  • store_sales (product sales records from the physical outlet channel)

  • time_dim (time dimension information)

  • warehouse (warehouse information)

  • web_clickstreams (web clickstream information)

  • web_page (product web page information)

  • web_returns (product return records from the web channel)

  • web_sales (product sales records from the web channel)

  • web_site (product website information)

Note

The data in the tables is from TPC.

Update cycle

Provides static data that is not updated.

Query table schema

-- Enable session-level schema syntax.
SET odps.namespace.schema=TRUE; 
-- Queries the schema of the web_sales table in tpcxbb_10g. To query tables in other dataset specifications, replace the schema name and table name.
DESC bigdata_public_dataset.tpcxbb_10g.web_sales;

Query example

SET odps.namespace.schema=TRUE; 
SELECT * FROM bigdata_public_dataset.tpcxbb_10g.web_sales limit 100;

For more information about the data and for query samples, see the official TPCx-BB standard specification.

Digital commerce dataset

Project name

BIGDATA_PUBLIC_DATASET

Schema name

commerce

Available regions

China (Hangzhou), China (Shanghai), China (Beijing), China (Zhangjiakou), China (Ulanqab), China (Shenzhen), China (Chengdu)

Table names and descriptions

  • adv_raw_sample (a raw sample skeleton composed of display ad click logs from over 1 million randomly sampled users on Taobao over 8 days)

  • adv_ad_feature (basic information about some of the ads in the adv_raw_sample table)

  • user_profile (basic information about all users in adv_raw_sample)

  • behavior_log (shopping behavior, such as browsing, adding to cart, liking, and purchasing, of all users in adv_raw_sample over 22 days)

Update cycle

Provides static data. Incremental updates are no longer provided.

Query table schema

-- Enable session-level schema syntax.
SET odps.namespace.schema=TRUE; 
-- Queries the schema of the behavior_log table. To query other tables, replace the table name.
DESC bigdata_public_dataset.commerce.behavior_log;

Query example

-- Enable session-level schema syntax.
SET odps.namespace.schema=TRUE; 
-- Find the three product category IDs with the most purchases over the 22 days recorded in behavior_log.
SELECT cate,
       count(btag) sales
FROM bigdata_public_dataset.commerce.behavior_log
WHERE btag='buy'
GROUP BY cate
ORDER BY sales DESC LIMIT 3;
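The query above filters on btag='buy'. To see which behavior types the table records and how often each occurs, a sketch such as the following may help. It uses only the btag column from the example above and makes no assumption about the values it returns.

```sql
-- Enable session-level schema syntax.
SET odps.namespace.schema=TRUE;
-- Count the records for each behavior type in behavior_log.
SELECT btag,
       count(*) AS behavior_count
FROM bigdata_public_dataset.commerce.behavior_log
GROUP BY btag
ORDER BY behavior_count DESC LIMIT 10;
```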

Life service dataset

Project name

BIGDATA_PUBLIC_DATASET

Schema name

life_service

Available regions

China (Hangzhou), China (Shanghai), China (Beijing), China (Zhangjiakou), China (Ulanqab), China (Shenzhen), China (Chengdu)

Table names and descriptions

  • movie_basic_info (basic movie information table)

  • movie_box (basic box office information table)

  • areacode_basic_info_2020 (basic information table for administrative and urban/rural division codes for 2020)

  • phoneno_basic_info_2020 (basic information table for mobile number attribution for 2020)

Update cycle

  • movie_basic_info, movie_box: Provides data for fixed date partitions. Incremental updates are no longer provided.

  • areacode_basic_info_2020, phoneno_basic_info_2020: Provides static data. Incremental updates are no longer provided.

Query table schema

-- Enable session-level schema syntax.
SET odps.namespace.schema=TRUE; 
-- Queries the schema of the movie_box table. To query other tables, replace the table name.
DESC bigdata_public_dataset.life_service.movie_box;

Query example

-- Enable session-level schema syntax.
SET odps.namespace.schema=TRUE;
-- Queries the names of the top 10 movies at the box office on January 14, 2017.
SELECT moviename
FROM bigdata_public_dataset.life_service.movie_box
WHERE ds ='20170114'
ORDER BY rank ASC LIMIT 10;

Financial stock dataset

Project name

BIGDATA_PUBLIC_DATASET

Schema name

finance

Available regions

China (Hangzhou), China (Shanghai), China (Beijing), China (Zhangjiakou), China (Ulanqab), China (Shenzhen), China (Chengdu)

Table names and descriptions

  • ods_enterprise_share_basic (basic stock information table)

  • ods_enterprise_share_quarter_cashflow (quarterly cash flow report)

  • ods_enterprise_share_quarter_growth (quarterly business growth data table)

  • ods_enterprise_share_quarter_operation (quarterly financial turnover data table)

  • ods_enterprise_share_quarter_profit (quarterly profit statement)

  • ods_enterprise_share_quarter_report (quarterly report)

  • ods_enterprise_share_trade_h (stock price table)

Update cycle

Provides data for fixed date partitions. Incremental updates are no longer provided.

Query table schema

-- Enable session-level schema syntax.
SET odps.namespace.schema=TRUE; 
-- Queries the schema of the ods_enterprise_share_basic table. To query other tables, replace the table name.
DESC bigdata_public_dataset.finance.ods_enterprise_share_basic;

Query example

-- Enable session-level schema syntax.
SET odps.namespace.schema=TRUE;
-- Queries basic stock information data for January 14, 2017.
SELECT *
FROM bigdata_public_dataset.finance.ods_enterprise_share_basic
WHERE ds ='20170114' LIMIT 10;

Use public datasets

Prerequisites

You have activated MaxCompute and created a project. For more information, see Create a MaxCompute project.

Supported tools or platforms

Procedure (DataWorks Data Development node example)

  1. Log on to the DataWorks console and select a region in the upper-left corner.

  2. Create a workspace.

  3. Attach a MaxCompute data source.

  4. Create an ODPS SQL node and enter the following SQL example.

    -- View the GDP trend of each province in China over the last 20 years.
    SET odps.namespace.schema=true; 
    SET odps.sql.validate.orderby.limit = false;
    SELECT
        region,
        gdp,
        year
    FROM
        bigdata_public_dataset.national_data.annual_gdp_by_province
    ORDER BY
        year ASC;
  5. Run the node to view the results.

References

For more information about how to export MaxCompute data, see the following topics:

  • Download: Allows you to download data or the execution results of a specified instance to your local computer.

  • UNLOAD: Allows you to export data to external storage, such as OSS or Hologres.