MaxCompute provides a set of public datasets you can query immediately using MaxCompute SQL — no data loading required. The datasets cover GitHub activity, national statistics, TPC benchmark data, e-commerce behavior, and more, all stored in the BIGDATA_PUBLIC_DATASET project.
Public datasets are for testing and exploration only. The data is not refreshed on a regular schedule, and accuracy is not guaranteed. Do not use this data in production environments.
Available datasets
All datasets are stored as schemas within the BIGDATA_PUBLIC_DATASET project.
| Category | What you can analyze | Schema |
|---|---|---|
| GitHub public event data | Open source project activity: stars, commits, forks, and other events across public repositories | github_events |
| National statistics | Annual GDP trends for countries worldwide and for provinces in China | national_data |
| TPC-DS performance data | Decision support queries across a simulated multi-channel retail chain | tpcds_10g, tpcds_100g, tpcds_1t, tpcds_10t |
| TPC-H performance data | OLAP queries on supplier-buyer transaction data | tpch_10g, tpch_100g, tpch_1t, tpch_10t |
| TPCx-BB performance data | Big data benchmark queries on an online retail scenario | tpcxbb_10g, tpcxbb_100g, tpcxbb_1t, tpcxbb_10t |
| Digital commerce | Ad click behavior and shopping activity from Taobao users | commerce |
| Life service | Movie box office results, mobile number attribution, and administrative division codes | life_service |
| Financial stock | Stock prices and quarterly financial reports | finance |
Usage notes
Public datasets are accessible to all MaxCompute users. Before querying, note the following:
Cross-project access is required. Data is stored in the BIGDATA_PUBLIC_DATASET project, but you are not added as a member of that project. When writing SQL, prefix every table name with the project name and schema name:
-- Enable session-level schema syntax (required if tenant-level schema syntax is not enabled).
SET odps.namespace.schema=true;
-- Query 100 records from the dwd_github_events_odps table.
SELECT * FROM bigdata_public_dataset.github_events.dwd_github_events_odps WHERE ds='2024-05-10' LIMIT 100;
If tenant-level schema syntax is enabled for your account, the SET statement is not required.
Storage is free; compute is billed. You are not charged for storing data in public datasets. However, the computing resources your queries consume are billed at the standard Pay-as-you-go rate. For details, see Computing fees (Pay-as-you-go).
DataWorks limitations. Because cross-project access is required:
- Tables in BIGDATA_PUBLIC_DATASET do not appear in the DataWorks Data Map.
- If tenant-level schema syntax is not enabled for your account, the datasets are not visible in DataWorks DataAnalysis. You can still query them by running SQL statements directly.
TPC benchmark disclaimer. The TPC data in MaxCompute is based on TPC benchmarks. Results from queries on these datasets cannot be compared with published TPC benchmark results, because the tests do not meet all TPC benchmark requirements. To generate your own TPC data, see the official TPC documentation.
Dataset details
GitHub public event data
| Property | Details |
|---|---|
| Project | BIGDATA_PUBLIC_DATASET |
| Schema | github_events |
| Available regions | China (Hangzhou), China (Shanghai), China (Beijing), China (Zhangjiakou), China (Ulanqab), China (Shenzhen), China (Chengdu) |
| Data source | GH Archive |
Use this dataset to analyze open source project activity on GitHub — who is starring repositories, committing code, opening pull requests, and more. For a full list of event types, see GitHub Events.
Tables:
| Table | Description | Update cycle |
|---|---|---|
| dwd_github_events_odps | Fact table for GitHub public events | T+1 hour |
| dws_overview_by_repo_month | Monthly aggregated metrics per repository | T+1 day |
Inspect the table schema:
-- Enable session-level schema syntax.
SET odps.namespace.schema=true;
-- Replace the schema and table name to inspect other tables.
DESC bigdata_public_dataset.github_events.dwd_github_events_odps;
Example query — find the most-starred repositories in the past year:
Which repositories received the most stars in the last 365 days?
-- Enable session-level schema syntax.
SET odps.namespace.schema=true;
-- Returns the top 10 repositories by star count over the last 365 days.
-- Note: This query does not account for users who later unstarred a repository.
SELECT
repo_id,
repo_name,
COUNT(actor_login) total
FROM
bigdata_public_dataset.github_events.dwd_github_events_odps
WHERE
ds >= date_add(getdate(), -365)
AND type = 'WatchEvent'
GROUP BY
repo_id,
repo_name
ORDER BY
total DESC
LIMIT 10;
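Because compute is billed, it can help to explore the fact table one day at a time before running year-long scans. The following is a minimal sketch that counts events by type for a single day. It assumes that ds is the partition column, as the filters in the examples above suggest, and it reuses the sample date from the usage notes, which may need adjusting to a date that is present in the data. The alias event_count is illustrative.

```sql
-- Enable session-level schema syntax.
SET odps.namespace.schema=true;
-- Count events per type for one day (assumes ds is the partition column).
SELECT
    type,
    COUNT(*) AS event_count
FROM
    bigdata_public_dataset.github_events.dwd_github_events_odps
WHERE
    ds = '2024-05-10'
GROUP BY
    type
ORDER BY
    event_count DESC
LIMIT 20;
```

The result also shows which type values exist, which is useful before filtering on a specific value such as 'WatchEvent'.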
For more data details and query samples, see GitHub public event data.
National statistics
| Property | Details |
|---|---|
| Project | BIGDATA_PUBLIC_DATASET |
| Schema | national_data |
| Available regions | China (Hangzhou), China (Shanghai), China (Beijing), China (Zhangjiakou), China (Ulanqab), China (Shenzhen), China (Chengdu) |
| Update cycle | Static — not updated |
Use this dataset to analyze long-term economic trends. It contains annual GDP data for provinces in China and for countries worldwide.
Tables:
| Table | Description | Data source |
|---|---|---|
| annual_gdp_by_province | Annual GDP by province in China | National Bureau of Statistics of China |
| annual_gdp_by_country | Annual GDP by country | International Monetary Fund (IMF) |
Inspect the table schema:
-- Enable session-level schema syntax.
SET odps.namespace.schema=true;
-- Replace the table name to inspect other tables.
DESC bigdata_public_dataset.national_data.annual_gdp_by_province;
Example query — view Beijing's GDP trend over the past 20 years:
How has Beijing's GDP changed over the last 20 years?
-- Enable session-level schema syntax.
SET odps.namespace.schema=true;
SELECT
region,
gdp,
year
FROM
bigdata_public_dataset.national_data.annual_gdp_by_province
WHERE
region = 'Beijing'
ORDER BY
year DESC
LIMIT 20;
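Before relying on a 20-year window, it may be worth checking which years each region actually covers and how region names are spelled in the table. The following is a minimal sketch that uses only the region and year columns shown above. The aliases first_year, last_year, and row_count are illustrative.

```sql
-- Enable session-level schema syntax.
SET odps.namespace.schema=true;
-- Summarize the year range and row count available for each region.
SELECT
    region,
    MIN(year) AS first_year,
    MAX(year) AS last_year,
    COUNT(*) AS row_count
FROM
    bigdata_public_dataset.national_data.annual_gdp_by_province
GROUP BY
    region
ORDER BY
    region
LIMIT 100;
```

If the region value used in a filter does not appear in this output, the filtered query returns no rows.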
TPC-DS data
| Property | Details |
|---|---|
| Project | BIGDATA_PUBLIC_DATASET |
| Schemas | tpcds_10g, tpcds_100g, tpcds_1t, tpcds_10t |
| Available regions | China (Hangzhou), China (Shanghai), China (Beijing), China (Zhangjiakou), China (Ulanqab), China (Shenzhen), China (Chengdu), China (Hong Kong), Japan (Tokyo), Singapore, Malaysia (Kuala Lumpur), Indonesia (Jakarta), US (Virginia), US (Silicon Valley), UK (London), Germany (Frankfurt), UAE (Dubai), China (Shanghai) Finance Cloud, China (Beijing) Finance Cloud (Invitational Preview), China (Beijing) Alibaba Gov Cloud 1, China (Shenzhen) Finance Cloud |
| Update cycle | Static — not updated |
| Data source | TPC |
Use TPC-DS data to run decision support benchmark queries on MaxCompute. TPC-DS simulates the sales system of a large retail chain across three channels — physical stores, online stores, and phone orders (catalog) — with data covering sales, returns, customers, products, and promotions. Choose the schema that matches your target data volume for performance testing.
Tables (24 total):
| Table | Description |
|---|---|
| call_center | Customer service center information |
| catalog_page | Product catalog information |
| catalog_returns | Return records from the phone order channel |
| catalog_sales | Sales records from the phone order channel |
| customer | Customer information |
| customer_address | Customer address information |
| customer_demographics | Basic customer credit information |
| date_dim | Time dimension information |
| household_demographics | Basic household credit information |
| income_band | Income information |
| inventory | Warehouse inventory information |
| item | Product information |
| promotion | Product promotion information |
| reason | Reasons for customer returns |
| ship_mode | Shipping method information |
| store | Merchant (physical outlet) information |
| store_returns | Return records from the physical outlet channel |
| store_sales | Sales records from the physical outlet channel |
| time_dim | Time dimension information |
| warehouse | Warehouse information |
| web_page | Product web page information |
| web_returns | Return records from the web channel |
| web_sales | Sales records from the web channel |
| web_site | Product website information |
Inspect the table schema:
-- Enable session-level schema syntax.
SET odps.namespace.schema=TRUE;
-- This example uses tpcds_10g. Replace the schema name to query other dataset sizes.
DESC bigdata_public_dataset.tpcds_10g.call_center;
Example query — find top-selling brands in December by year:
Which brands had the highest sales in December each year, for a specific manufacturer?
SET odps.namespace.schema=TRUE;
SELECT
dt.d_year,
item.i_brand_id brand_id,
item.i_brand brand,
SUM(ss_sales_price) sum_agg
FROM
bigdata_public_dataset.tpcds_10g.date_dim dt,
bigdata_public_dataset.tpcds_10g.store_sales,
bigdata_public_dataset.tpcds_10g.item
WHERE
dt.d_date_sk = store_sales.ss_sold_date_sk
AND store_sales.ss_item_sk = item.i_item_sk
AND item.i_manufact_id = 190
AND dt.d_moy = 12
GROUP BY
dt.d_year,
item.i_brand,
item.i_brand_id
ORDER BY
dt.d_year,
sum_agg DESC,
brand_id
LIMIT 100;
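The same join pattern applies to the other sales channels. The sketch below totals web-channel sales by year using explicit JOIN syntax. It assumes that the tables follow the column names in the TPC-DS specification (ws_sold_date_sk, ws_ext_sales_price), which has not been verified against these schemas. Run DESC on web_sales first to confirm. The alias web_sales_total is illustrative.

```sql
-- Enable session-level schema syntax.
SET odps.namespace.schema=TRUE;
-- Total web-channel sales per year (column names assumed from the TPC-DS specification).
SELECT
    dt.d_year,
    SUM(ws.ws_ext_sales_price) AS web_sales_total
FROM
    bigdata_public_dataset.tpcds_10g.web_sales ws
JOIN
    bigdata_public_dataset.tpcds_10g.date_dim dt
ON
    ws.ws_sold_date_sk = dt.d_date_sk
GROUP BY
    dt.d_year
ORDER BY
    dt.d_year
LIMIT 100;
```

To test at a larger scale, replace tpcds_10g with another schema name in both table references.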
For query sample files covering all dataset sizes, see TPC-DS data. For benchmark specification details, see the official TPC Benchmark DS standard specification.
TPC-H data
| Property | Details |
|---|---|
| Project | BIGDATA_PUBLIC_DATASET |
| Schemas | tpch_10g, tpch_100g, tpch_1t, tpch_10t |
| Available regions | China (Hangzhou), China (Shanghai), China (Beijing), China (Zhangjiakou), China (Ulanqab), China (Shenzhen), China (Chengdu), China (Hong Kong), Japan (Tokyo), Singapore, Malaysia (Kuala Lumpur), Indonesia (Jakarta), US (Virginia), US (Silicon Valley), UK (London), Germany (Frankfurt), UAE (Dubai), China (Shanghai) Finance Cloud, China (Beijing) Finance Cloud (Invitational Preview), China (Beijing) Alibaba Gov Cloud 1, China (Shenzhen) Finance Cloud |
| Update cycle | Static — not updated |
| Data source | TPC |
Use TPC-H data to benchmark Online Analytical Processing (OLAP) queries on MaxCompute. TPC-H uses a set of business-oriented ad hoc queries with concurrent data modifications to evaluate query performance on large datasets. The data simulates transactions between suppliers and buyers.
Tables (8 total):
| Table | Description |
|---|---|
| customer | Customer information |
| lineitem | Line item details for orders |
| nation | Nation information |
| orders | Order information |
| part | Part information |
| partsupp | Supplier part information |
| region | Region information |
| supplier | Supplier information |
Inspect the table schema:
-- Enable session-level schema syntax.
SET odps.namespace.schema=TRUE;
-- This example uses tpch_10g. Replace the schema name to query other dataset sizes.
DESC bigdata_public_dataset.tpch_10g.lineitem;
Example query — pricing summary report for shipped line items:
What is the summary of pricing information — quantities, prices, discounts, taxes, and order counts — for all line items shipped before a given date?
SET odps.namespace.schema=TRUE;
SET odps.sql.validate.orderby.limit=FALSE;
SET odps.sql.hive.compatible=TRUE;
SELECT
l_returnflag,
l_linestatus,
sum(l_quantity) AS sum_qty,
sum(l_extendedprice) AS sum_base_price,
sum(l_extendedprice * (1 - l_discount)) AS sum_disc_price,
sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) AS sum_charge,
avg(l_quantity) AS avg_qty,
avg(l_extendedprice) AS avg_price,
avg(l_discount) AS avg_disc,
count(*) AS count_order
FROM
bigdata_public_dataset.tpch_10g.lineitem
WHERE
l_shipdate <= date'1998-12-01' - interval '90' DAY
GROUP BY
l_returnflag,
l_linestatus
ORDER BY
l_returnflag,
l_linestatus;
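A single-table filter query is a lighter second test. The following sketch is modeled on the shape of the TPC-H forecasting revenue change query (Q6). The date range and thresholds are illustrative values rather than an official benchmark run, and the column names are assumed to match the TPC-H specification, as they do in the example above. The SET flags are copied from that example.

```sql
SET odps.namespace.schema=TRUE;
SET odps.sql.hive.compatible=TRUE;
-- Revenue that discounted line items in a one-year window would have contributed.
SELECT
    sum(l_extendedprice * l_discount) AS revenue
FROM
    bigdata_public_dataset.tpch_10g.lineitem
WHERE
    l_shipdate >= date'1994-01-01'
    AND l_shipdate < date'1995-01-01'
    AND l_discount BETWEEN 0.05 AND 0.07
    AND l_quantity < 24;
```

The query returns a single row, so it needs no ORDER BY clause and no LIMIT.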
For benchmark specification details, see the official TPC Benchmark H standard specification.
TPCx-BB data
| Property | Details |
|---|---|
| Project | BIGDATA_PUBLIC_DATASET |
| Schemas | tpcxbb_10g, tpcxbb_100g, tpcxbb_1t, tpcxbb_10t |
| Available regions | China (Hangzhou), China (Shanghai), China (Beijing), China (Zhangjiakou), China (Ulanqab), China (Shenzhen), China (Chengdu), China (Hong Kong), Japan (Tokyo), Singapore, Malaysia (Kuala Lumpur), Indonesia (Jakarta), US (Virginia), US (Silicon Valley), UK (London), Germany (Frankfurt), UAE (Dubai), China (Shanghai) Finance Cloud, China (Beijing) Finance Cloud (Invitational Preview), China (Beijing) Alibaba Gov Cloud 1, China (Shenzhen) Finance Cloud |
| Update cycle | Static — not updated |
| Data source | TPC |
TPCx-BB is a big data benchmark that measures the performance of Hadoop-based big data systems. It evaluates hardware and software by running 30 common analytical queries on a simulated online retail scenario, covering sales, returns, products, promotions, and web clickstreams.
Tables (23 total):
| Table | Description |
|---|---|
| customer | Customer information |
| customer_address | Customer address information |
| customer_demographics | Basic customer credit information |
| date_dim | Time dimension information |
| household_demographics | Basic household credit information |
| income_band | Income information |
| inventory | Warehouse inventory information |
| item | Product information |
| item_marketprices | Competitor price information for products |
| product_reviews | Product review information |
| promotion | Product promotion information |
| reason | Reasons for customer returns |
| ship_mode | Shipping method information |
| store | Physical outlet information |
| store_returns | Return records from the physical outlet channel |
| store_sales | Sales records from the physical outlet channel |
| time_dim | Time dimension information |
| warehouse | Warehouse information |
| web_clickstreams | Web clickstream information |
| web_page | Product web page information |
| web_returns | Return records from the web channel |
| web_sales | Sales records from the web channel |
| web_site | Product website information |
Inspect the table schema:
-- Enable session-level schema syntax.
SET odps.namespace.schema=TRUE;
-- This example uses tpcxbb_10g. Replace the schema name to query other dataset sizes.
DESC bigdata_public_dataset.tpcxbb_10g.web_sales;
Example query:
SET odps.namespace.schema=TRUE;
SELECT * FROM bigdata_public_dataset.tpcxbb_10g.web_sales LIMIT 100;
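To get a feel for how the dataset sizes differ, one option is to compare row counts of the same table across two schemas. This is a minimal sketch with illustrative aliases (scale, row_count). Note that the second branch reads the larger schema, so it consumes more billed compute than the 10g queries above.

```sql
-- Enable session-level schema syntax.
SET odps.namespace.schema=TRUE;
-- Compare web_sales row counts between two dataset sizes.
SELECT 'tpcxbb_10g' AS scale, COUNT(*) AS row_count
FROM bigdata_public_dataset.tpcxbb_10g.web_sales
UNION ALL
SELECT 'tpcxbb_100g' AS scale, COUNT(*) AS row_count
FROM bigdata_public_dataset.tpcxbb_100g.web_sales;
```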
For benchmark specification details, see the official TPCx-BB standard specification.
Digital commerce dataset
| Property | Details |
|---|---|
| Project | BIGDATA_PUBLIC_DATASET |
| Schema | commerce |
| Available regions | China (Hangzhou), China (Shanghai), China (Beijing), China (Zhangjiakou), China (Ulanqab), China (Shenzhen), China (Chengdu) |
| Update cycle | Static — incremental updates are no longer provided |
| Data source | Tianchi Lab — Taobao Display Ad Click-Through Rate Prediction dataset |
Use this dataset to analyze ad click-through rates and shopping behavior. It contains display ad click logs and shopping activity — browsing, adding to cart, liking, and purchasing — from over one million randomly sampled Taobao users.
Tables:
| Table | Description |
|---|---|
| adv_raw_sample | Raw display ad click logs from over 1 million users over 8 days |
| adv_ad_feature | Basic information about the ads in adv_raw_sample |
| user_profile | Basic profile for all users in adv_raw_sample |
| behavior_log | Shopping activity for all users in adv_raw_sample over 22 days |
Inspect the table schema:
-- Enable session-level schema syntax.
SET odps.namespace.schema=TRUE;
-- Replace the table name to inspect other tables.
DESC bigdata_public_dataset.commerce.behavior_log;
Example query — find the top three product categories by sales volume over 22 days:
Which product categories had the highest purchase volume across the 22-day observation period?
-- Enable session-level schema syntax.
SET odps.namespace.schema=TRUE;
SELECT
cate,
count(btag) sales
FROM
bigdata_public_dataset.commerce.behavior_log
WHERE
btag = 'buy'
GROUP BY
cate
ORDER BY
sales DESC
LIMIT 3;
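The filter btag = 'buy' depends on how behavior types are encoded in the table. A quick way to check is to list the distinct btag values with their counts. The following is a minimal sketch that uses only the btag column from the query above. The alias events is illustrative.

```sql
-- Enable session-level schema syntax.
SET odps.namespace.schema=TRUE;
-- List each behavior type with the number of logged events.
SELECT
    btag,
    COUNT(*) AS events
FROM
    bigdata_public_dataset.commerce.behavior_log
GROUP BY
    btag
ORDER BY
    events DESC
LIMIT 10;
```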
Life service dataset
| Property | Details |
|---|---|
| Project | BIGDATA_PUBLIC_DATASET |
| Schema | life_service |
| Available regions | China (Hangzhou), China (Shanghai), China (Beijing), China (Zhangjiakou), China (Ulanqab), China (Shenzhen), China (Chengdu) |
Use this dataset to analyze entertainment trends and geographic reference data. It includes movie box office results and administrative reference codes for China.
Tables:
| Table | Description | Update cycle |
|---|---|---|
| movie_basic_info | Basic movie information | Fixed date partitions; incremental updates no longer provided |
| movie_box | Box office results by date | Fixed date partitions; incremental updates no longer provided |
| areacode_basic_info_2020 | Administrative and urban/rural division codes (2020) | Static (not updated) |
| phoneno_basic_info_2020 | Mobile number attribution data (2020) | Static (not updated) |
Inspect the table schema:
-- Enable session-level schema syntax.
SET odps.namespace.schema=TRUE;
-- Replace the table name to inspect other tables.
DESC bigdata_public_dataset.life_service.movie_box;
Example query — find the top 10 movies at the box office on January 14, 2017:
Which movies had the highest box office ranking on January 14, 2017?
-- Enable session-level schema syntax.
SET odps.namespace.schema=TRUE;
SELECT
moviename
FROM
bigdata_public_dataset.life_service.movie_box
WHERE
ds = '20170114'
ORDER BY
rank ASC
LIMIT 10;
Financial stock dataset
| Property | Details |
|---|---|
| Project | BIGDATA_PUBLIC_DATASET |
| Schema | finance |
| Available regions | China (Hangzhou), China (Shanghai), China (Beijing), China (Zhangjiakou), China (Ulanqab), China (Shenzhen), China (Chengdu) |
| Update cycle | Fixed date partitions — incremental updates no longer provided |
Use this dataset to analyze stock performance trends and quarterly financial results. It contains stock price history and quarterly financial reports for publicly traded companies.
Tables:
| Table | Description |
|---|---|
| ods_enterprise_share_basic | Basic stock information |
| ods_enterprise_share_quarter_cashflow | Quarterly cash flow report |
| ods_enterprise_share_quarter_growth | Quarterly business growth data |
| ods_enterprise_share_quarter_operation | Quarterly financial turnover data |
| ods_enterprise_share_quarter_profit | Quarterly profit statement |
| ods_enterprise_share_quarter_report | Quarterly report |
| ods_enterprise_share_trade_h | Stock price history |
Inspect the table schema:
-- Enable session-level schema syntax.
SET odps.namespace.schema=TRUE;
-- Replace the table name to inspect other tables.
DESC bigdata_public_dataset.finance.ods_enterprise_share_basic;
Example query — view basic stock data for January 14, 2017:
What basic stock information was recorded on January 14, 2017?
-- Enable session-level schema syntax.
SET odps.namespace.schema=TRUE;
SELECT *
FROM bigdata_public_dataset.finance.ods_enterprise_share_basic
WHERE ds = '20170114'
LIMIT 10;
Query public datasets
Prerequisites
Before you begin, ensure that you have:
- An active MaxCompute service
- A MaxCompute project. For setup instructions, see Create a project
Supported tools
Run queries against public datasets using any of the following:
Query using DataWorks
- Log on to the DataWorks console and select a region in the upper-left corner.
- Create an ODPS SQL node and enter your SQL query. The following example views GDP trends across all provinces in China over the past 20 years:
SET odps.namespace.schema=true;
SET odps.sql.validate.orderby.limit = false;
SELECT region, gdp, year
FROM bigdata_public_dataset.national_data.annual_gdp_by_province
ORDER BY year ASC;
- Click the run button to execute the query and view results.
What's next
To export data from MaxCompute after running your queries: