What public datasets are available in MaxCompute

If you have activated MaxCompute, you can query tables in public datasets using MaxCompute SQL analysis. This lets you quickly try out the service. This topic describes the public datasets and explains how to query and analyze data.

Introduction

MaxCompute offers public datasets in several categories, such as GitHub public event data, national statistics, TPC performance test data, digital commerce data, life service data, and financial stock data. This data is stored in different schemas within the BIGDATA_PUBLIC_DATASET public project in MaxCompute.

Category		Introduction	Dataset name	Schema name
GitHub public event data		Developers on GitHub generate a large volume of events when they work on open source projects. GitHub records the type and details of each event, the developer, and the code repository. Public events, such as starring a repository or committing code, are made available.	GitHub public event dataset	github_events
National statistics		Includes annual GDP data for countries around the world and for provinces in China.	National statistics dataset	national_data
TPC performance data	TPC-DS	TPC-DS is a benchmark for decision support systems. It models common aspects of these systems, such as queries and data maintenance. This lets you run benchmark tests on emerging technologies like big data systems.	TPC-DS 10 GB, 100 GB, 1 TB, and 10 TB performance test sets	tpcds_10g, tpcds_100g, tpcds_1t, tpcds_10t
	TPC-H	TPC-H is a benchmark for decision support systems. It uses a set of business-oriented ad hoc queries and concurrent data modifications. It runs complex queries on large data volumes to answer key business questions.	TPC-H 10 GB, 100 GB, 1 TB, and 10 TB performance test sets	tpch_10g, tpch_100g, tpch_1t, tpch_10t
	TPCx-BB	TPC Express Benchmark BB (TPCx-BB) is a big data benchmark. It measures the performance of Hadoop-based big data systems. It evaluates hardware and software components by running 30 common analytical queries.	TPCx-BB 10 GB, 100 GB, 1 TB, and 10 TB performance test sets	tpcxbb_10g, tpcxbb_100g, tpcxbb_1t, tpcxbb_10t
Digital commerce		Includes data from Taobao advertising, Taobao shopping, and Alibaba E-commerce.	Digital commerce dataset	commerce
Life service		Includes data on second-hand real estate, movies and box office results, mobile number attribution, and administrative and urban/rural division codes.	Life service dataset	life_service
Financial stock		Stock information.	Financial stock dataset	finance

Disclaimer

The public datasets provided by MaxCompute are for product testing only. The data is not updated periodically, and its accuracy is not guaranteed. Do not use this data in production environments.
The generation and analysis of TPC data in the MaxCompute public datasets are based on TPC benchmarks. The results cannot be compared with published TPC benchmark results because the tests run on the MaxCompute public datasets do not meet all TPC benchmark requirements.
The TPC performance test data in MaxCompute originates from TPC. You can also generate TPC data yourself. For more information, see the official TPC documentation.

Precautions

Public datasets are available to all MaxCompute users. When you use public datasets, take note of the following items:

All public dataset data is stored in the BIGDATA_PUBLIC_DATASET project in MaxCompute. Because no users are added to this project as members, you must access the data across projects. When you write an SQL script, specify the project name and schema name before the table name. If tenant-level schema syntax is not enabled, enable session-level schema syntax before you execute a statement. Sample statements:
```
-- Enable the session-level schema syntax.
set odps.namespace.schema=true; 
-- Query 100 data records from the dwd_github_events_odps table.
select * from bigdata_public_dataset.github_events.dwd_github_events_odps where ds='2024-05-10' limit 100;
```
Important
You are not charged for the storage of the data in the public datasets. However, you are charged computing fees if you execute query statements. For more information, see Computing pricing (pay-as-you-go).
You cannot find the tables in the public datasets on the Data Map page of DataWorks because cross-project access is required.
Public datasets are stored by schema. If you do not enable the tenant-level schema syntax, you cannot view the public datasets in DataWorks DataAnalysis. In this case, you can query the public datasets only by executing SQL statements.
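The naming rule above can be sketched as a small helper. This is a minimal illustration in Python that only builds SQL text; it is not part of any MaxCompute SDK, and the function names `qualified_table` and `with_session_schema` are hypothetical.

```python
PUBLIC_PROJECT = "bigdata_public_dataset"  # public project named in this topic


def qualified_table(schema: str, table: str, project: str = PUBLIC_PROJECT) -> str:
    """Return project.schema.table, the three-part name used for cross-project access."""
    for part in (project, schema, table):
        # Accept only plain identifiers (letters, digits, underscores).
        if not part.replace("_", "").isalnum():
            raise ValueError(f"not a plain identifier: {part!r}")
    return f"{project}.{schema}.{table}"


def with_session_schema(statement: str) -> str:
    """Prefix a statement with the session-level schema flag."""
    return "SET odps.namespace.schema=true;\n" + statement.rstrip().rstrip(";") + ";"


sql = with_session_schema(
    f"SELECT * FROM {qualified_table('github_events', 'dwd_github_events_odps')} "
    "WHERE ds='2024-05-10' LIMIT 100"
)
print(sql)
```

The printed script can be pasted into any SQL client that is connected to your own project. Query statements are still billed as described in the notes above.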

Detailed table information

The following tables provide detailed information about the tables in each schema of the BIGDATA_PUBLIC_DATASET public project.

GitHub public event data

Project name	BIGDATA_PUBLIC_DATASET
Schema name	github_events
Available regions	China (Hangzhou), China (Shanghai), China (Beijing), China (Zhangjiakou), China (Ulanqab), China (Shenzhen), China (Chengdu)
Table names and descriptions	Developers on GitHub generate a large volume of events when they work on open source projects. GitHub records the type and details of each event, the developer, and the code repository. Public events, such as starring a repository or committing code, are made available. For more information about event types, see GitHub Events. MaxCompute processes the public event data provided by GH Archive offline to generate the following tables: dwd_github_events_odps (fact table for GitHub public event data); dws_overview_by_repo_month (aggregation table for monthly metrics of GitHub public events). Note: The data in the tables is from GH Archive.
Update cycle	dwd_github_events_odps: Updated T+1 hour. dws_overview_by_repo_month: Updated T+1 day.
Query table schema
```
-- Enable session-level schema syntax.
SET odps.namespace.schema=true;
-- Query the schema of the dwd_github_events_odps table. To query other tables, replace the schema name and table name.
DESC bigdata_public_dataset.github_events.dwd_github_events_odps;
```
Query example
```
-- Enable session-level schema syntax.
SET odps.namespace.schema=true;
-- Ranks the most starred projects in the last year. (Note: This example does not consider cases where users unstar projects.)
SELECT
    repo_id,
    repo_name,
    COUNT(actor_login) total
FROM
    bigdata_public_dataset.github_events.dwd_github_events_odps
WHERE
    ds>=date_add(getdate(), -365)
    AND type = 'WatchEvent'
GROUP BY
    repo_id,
    repo_name
ORDER BY
    total DESC
LIMIT 10;
```
For more information about the data and for query samples, see GitHub public event data.
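If you prefer an explicit partition bound to a date function, the same ranking query can be generated with a literal `ds` value. The following Python sketch assumes that `ds` is a string partition in `yyyy-mm-dd` format, as in the `ds='2024-05-10'` sample earlier in this topic. The function name `star_ranking_sql` is hypothetical, and the literal bound is an alternative for illustration, not the documented approach.

```python
from datetime import date, timedelta


def star_ranking_sql(today: date, days: int = 365, top_n: int = 10) -> str:
    """Build the star-ranking query with a literal ds lower bound."""
    # isoformat() yields yyyy-mm-dd, which is assumed to match the ds partition format.
    start = (today - timedelta(days=days)).isoformat()
    return (
        "SET odps.namespace.schema=true;\n"
        "SELECT repo_id, repo_name, COUNT(actor_login) total\n"
        "FROM bigdata_public_dataset.github_events.dwd_github_events_odps\n"
        f"WHERE ds >= '{start}' AND type = 'WatchEvent'\n"
        "GROUP BY repo_id, repo_name\n"
        "ORDER BY total DESC\n"
        f"LIMIT {int(top_n)};"
    )


print(star_ranking_sql(date(2024, 5, 10)))
```

A shorter window, for example `days=30`, scans fewer partitions and can therefore reduce the computing fees of a trial query.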

National statistics

Project name	BIGDATA_PUBLIC_DATASET
Schema name	national_data
Available regions	China (Hangzhou), China (Shanghai), China (Beijing), China (Zhangjiakou), China (Ulanqab), China (Shenzhen), China (Chengdu)
Table names and descriptions	annual_gdp_by_province (annual GDP data by province in China); annual_gdp_by_country (annual GDP data by country). Note: The data for annual_gdp_by_province is from the National Bureau of Statistics of China. The data for annual_gdp_by_country is from the International Monetary Fund (IMF).
Update cycle	Provides static data that is not updated.
Query table schema
```
-- Enable session-level schema syntax.
SET odps.namespace.schema=true;
-- Query the schema of the annual_gdp_by_province table. To query other tables, replace the schema name and table name.
DESC bigdata_public_dataset.national_data.annual_gdp_by_province;
```
Query example
```
-- Enable session-level schema syntax.
SET odps.namespace.schema=true;
-- View the GDP of Beijing for the 20 most recent years in the table.
SELECT
    region,
    gdp,
    year
FROM
    bigdata_public_dataset.national_data.annual_gdp_by_province
WHERE
    region='Beijing'
ORDER BY
    year DESC
LIMIT 20;
```

TPC-DS data

Project name	BIGDATA_PUBLIC_DATASET
Schema name	tpcds_10g, tpcds_100g, tpcds_1t, tpcds_10t
Available regions	China (Hangzhou), China (Shanghai), China (Beijing), China (Zhangjiakou), China (Ulanqab), China (Shenzhen), China (Chengdu), China (Hong Kong), Japan (Tokyo), Singapore, Malaysia (Kuala Lumpur), Indonesia (Jakarta), US (Virginia), US (Silicon Valley), UK (London), Germany (Frankfurt), UAE (Dubai), China (Shanghai) Finance Cloud, China (Beijing) Finance Cloud (Invitational Preview), China (Beijing) Alibaba Gov Cloud 1, China (Shenzhen) Finance Cloud
Table names and descriptions	The TPC-DS model simulates the sales system of a large retail chain with a national presence. It includes three sales channels: store (physical outlets), web (online stores), and catalog (phone orders). Each channel uses two tables to simulate sales and return records. The model also includes dimension tables for information about products, promotions, and customers. The tables are as follows: call_center (information about customer service centers); catalog_page (information about product catalogs); catalog_returns (product return records from the phone order channel); catalog_sales (product sales records from the phone order channel); customer (customer information); customer_address (customer address information); customer_demographics (basic customer credit information); date_dim (date dimension information); household_demographics (basic household credit information); income_band (income information); inventory (product inventory information); item (product information); promotion (product promotion information); reason (reasons for customer returns); ship_mode (product shipping information); store (merchant information); store_returns (product return records from the physical outlet channel); store_sales (product sales records from the physical outlet channel); time_dim (time dimension information); warehouse (warehouse information); web_page (product web page information); web_returns (product return records from the web channel); web_sales (product sales records from the web channel); web_site (basic product website information). Note: The data in the tables is from TPC.
Update cycle	Provides static data that is not updated.
Query table schema
```
-- Enable session-level schema syntax.
SET odps.namespace.schema=TRUE;
-- Queries the schema of the call_center table in tpcds_10g. To query tables in other dataset specifications, replace the schema name and table name.
DESC bigdata_public_dataset.tpcds_10g.call_center;
```
Query example
```
SET odps.namespace.schema=TRUE;
SELECT
    dt.d_year,
    item.i_brand_id brand_id,
    item.i_brand brand,
    SUM(ss_sales_price) sum_agg
FROM
    bigdata_public_dataset.tpcds_10g.date_dim dt,
    bigdata_public_dataset.tpcds_10g.store_sales,
    bigdata_public_dataset.tpcds_10g.item
WHERE
    dt.d_date_sk = store_sales.ss_sold_date_sk
    AND store_sales.ss_item_sk = item.i_item_sk
    AND item.i_manufact_id = 190
    AND dt.d_moy = 12
GROUP BY
    dt.d_year,
    item.i_brand,
    item.i_brand_id
ORDER BY
    dt.d_year,
    sum_agg DESC,
    brand_id
LIMIT 100;
```
For query sample files for different data specifications, see TPC-DS data. For more information about the data, see the official TPC Benchmark DS standard specification.
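Because the four TPC-DS schemas contain the same tables at different scales, the sample query can be pointed at another scale by changing only the schema name. The following Python sketch illustrates this by templating the query text; the names `TPCDS_SCHEMAS` and `tpcds_query` are hypothetical, and the sketch only generates SQL.

```python
# Scale label -> schema name, as listed in the table above.
TPCDS_SCHEMAS = {
    "10g": "tpcds_10g",
    "100g": "tpcds_100g",
    "1t": "tpcds_1t",
    "10t": "tpcds_10t",
}

QUERY_TEMPLATE = """SET odps.namespace.schema=TRUE;
SELECT dt.d_year, item.i_brand_id brand_id, item.i_brand brand, SUM(ss_sales_price) sum_agg
FROM bigdata_public_dataset.{schema}.date_dim dt,
     bigdata_public_dataset.{schema}.store_sales,
     bigdata_public_dataset.{schema}.item
WHERE dt.d_date_sk = store_sales.ss_sold_date_sk
  AND store_sales.ss_item_sk = item.i_item_sk
  AND item.i_manufact_id = 190
  AND dt.d_moy = 12
GROUP BY dt.d_year, item.i_brand, item.i_brand_id
ORDER BY dt.d_year, sum_agg DESC, brand_id
LIMIT 100;"""


def tpcds_query(scale: str) -> str:
    """Return the sample query bound to the schema for the given scale."""
    if scale not in TPCDS_SCHEMAS:
        raise ValueError(f"unknown scale {scale!r}; expected one of {sorted(TPCDS_SCHEMAS)}")
    return QUERY_TEMPLATE.format(schema=TPCDS_SCHEMAS[scale])


print(tpcds_query("100g"))
```

Starting at the 10 GB scale and moving up keeps the computing fees of early experiments small.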

TPC-H data

Project name	BIGDATA_PUBLIC_DATASET
Schema name	tpch_10g, tpch_100g, tpch_1t, tpch_10t
Available regions	China (Hangzhou), China (Shanghai), China (Beijing), China (Zhangjiakou), China (Ulanqab), China (Shenzhen), China (Chengdu), China (Hong Kong), Japan (Tokyo), Singapore, Malaysia (Kuala Lumpur), Indonesia (Jakarta), US (Virginia), US (Silicon Valley), UK (London), Germany (Frankfurt), UAE (Dubai), China (Shanghai) Finance Cloud, China (Beijing) Finance Cloud (Invitational Preview), China (Beijing) Alibaba Gov Cloud 1, China (Shenzhen) Finance Cloud
Table names and descriptions	TPC-H is a benchmark program used to evaluate Online Analytical Processing (OLAP). It simulates transactions between suppliers and their buyers. It contains information about orders, products, and customers. The tables are as follows: customer (customer information); lineitem (order line item information); nation (nation information); orders (order information); part (part information); partsupp (supplier part information); region (region information); supplier (supplier information). Note: The data in the tables is from TPC.
Update cycle	Provides static data that is not updated.
Query table schema
```
-- Enable session-level schema syntax.
SET odps.namespace.schema=TRUE;
-- Queries the schema of the lineitem table in tpch_10g. To query tables in other dataset specifications, replace the schema name and table name.
DESC bigdata_public_dataset.tpch_10g.lineitem;
```
Query example
```
SET odps.namespace.schema=TRUE;
SET odps.sql.validate.orderby.limit=FALSE;
SET odps.sql.hive.compatible=TRUE;
SELECT
    l_returnflag,
    l_linestatus,
    sum(l_quantity) AS sum_qty,
    sum(l_extendedprice) AS sum_base_price,
    sum(l_extendedprice * (1 - l_discount)) AS sum_disc_price,
    sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) AS sum_charge,
    avg(l_quantity) AS avg_qty,
    avg(l_extendedprice) AS avg_price,
    avg(l_discount) AS avg_disc,
    count(*) AS count_order
FROM
    bigdata_public_dataset.tpch_10g.lineitem
WHERE
    l_shipdate <= date'1998-12-01' - interval '90' DAY
GROUP BY
    l_returnflag,
    l_linestatus
ORDER BY
    l_returnflag,
    l_linestatus;
```
For more information about the data and for query samples, see the official TPC Benchmark H standard specification.
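To make the aggregates in the query example concrete, the following Python sketch computes the same columns for three made-up line items. The rows are invented for illustration and do not come from the dataset, the function name `pricing_summary` is hypothetical, and the `l_shipdate` filter is omitted for brevity.

```python
from collections import defaultdict
from decimal import Decimal as D

# Made-up rows: (l_returnflag, l_linestatus, l_quantity, l_extendedprice, l_discount, l_tax)
rows = [
    ("A", "F", D("10"), D("1000.00"), D("0.10"), D("0.05")),
    ("A", "F", D("20"), D("2000.00"), D("0.00"), D("0.10")),
    ("N", "O", D("5"), D("500.00"), D("0.20"), D("0.00")),
]


def pricing_summary(items):
    """Group by (returnflag, linestatus) and compute the query's aggregates."""
    groups = defaultdict(list)
    for flag, status, qty, price, disc, tax in items:
        groups[(flag, status)].append((qty, price, disc, tax))

    result = {}
    for key in sorted(groups):  # mirrors ORDER BY l_returnflag, l_linestatus
        g = groups[key]
        n = len(g)
        result[key] = {
            "sum_qty": sum(q for q, _, _, _ in g),
            "sum_base_price": sum(p for _, p, _, _ in g),
            # Price after discount.
            "sum_disc_price": sum(p * (1 - d) for _, p, d, _ in g),
            # Price after discount, plus tax.
            "sum_charge": sum(p * (1 - d) * (1 + t) for _, p, d, t in g),
            "avg_qty": sum(q for q, _, _, _ in g) / n,
            "avg_price": sum(p for _, p, _, _ in g) / n,
            "avg_disc": sum(d for _, _, d, _ in g) / n,
            "count_order": n,
        }
    return result


for key, agg in pricing_summary(rows).items():
    print(key, agg)
```

For the group ("A", "F"), the discounted prices are 900 and 2000, so `sum_disc_price` is 2900. Applying tax gives 945 and 2200, so `sum_charge` is 3145.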

TPCx-BB data

Project name	BIGDATA_PUBLIC_DATASET
Schema name	tpcxbb_10g, tpcxbb_100g, tpcxbb_1t, tpcxbb_10t
Available regions	China (Hangzhou), China (Shanghai), China (Beijing), China (Zhangjiakou), China (Ulanqab), China (Shenzhen), China (Chengdu), China (Hong Kong), Japan (Tokyo), Singapore, Malaysia (Kuala Lumpur), Indonesia (Jakarta), US (Virginia), US (Silicon Valley), UK (London), Germany (Frankfurt), UAE (Dubai), China (Shanghai) Finance Cloud, China (Beijing) Finance Cloud (Invitational Preview), China (Beijing) Alibaba Gov Cloud 1, China (Shenzhen) Finance Cloud
Table names and descriptions	TPCx-BB is a big data benchmark tool. It simulates an online retail scenario that includes sales and return records. It also contains information about products and promotions. The tables are as follows: customer (customer information); customer_address (customer address information); customer_demographics (basic customer credit information); date_dim (date dimension information); household_demographics (basic household credit information); income_band (income information); inventory (product inventory information); item (product information); item_marketprices (competitor price information for products); product_reviews (product review information); promotion (product promotion information); reason (reasons for customer returns); ship_mode (product shipping information); store (outlet information); store_returns (product return records from the physical outlet channel); store_sales (product sales records from the physical outlet channel); time_dim (time dimension information); warehouse (warehouse information); web_clickstreams (web clickstream information); web_page (product web page information); web_returns (product return records from the web channel); web_sales (product sales records from the web channel); web_site (product website information). Note: The data in the tables is from TPC.
Update cycle	Provides static data that is not updated.
Query table schema
```
-- Enable session-level schema syntax.
SET odps.namespace.schema=TRUE;
-- Queries the schema of the web_sales table in tpcxbb_10g. To query tables in other dataset specifications, replace the schema name and table name.
DESC bigdata_public_dataset.tpcxbb_10g.web_sales;
```
Query example
```
SET odps.namespace.schema=TRUE;
SELECT * FROM bigdata_public_dataset.tpcxbb_10g.web_sales LIMIT 100;
```
For more information about the data and for query samples, see the official TPCx-BB standard specification.

Digital commerce dataset

Project name	BIGDATA_PUBLIC_DATASET
Schema name	commerce
Available regions	China (Hangzhou), China (Shanghai), China (Beijing), China (Zhangjiakou), China (Ulanqab), China (Shenzhen), China (Chengdu)
Table names and descriptions	adv_raw_sample (raw samples built from display ad click logs of over 1 million randomly sampled Taobao users over 8 days); adv_ad_feature (basic information about some of the ads in adv_raw_sample); user_profile (basic information about all users in adv_raw_sample); behavior_log (shopping behavior, such as browsing, adding to cart, liking, and purchasing, of all users in adv_raw_sample over 22 days). Note: The data in the tables is from the Tianchi Lab - Taobao Display Ad Click-Through Rate Prediction dataset.
Update cycle	Provides static data. Incremental updates are no longer provided.
Query table schema
```
-- Enable session-level schema syntax.
SET odps.namespace.schema=TRUE;
-- Queries the schema of the behavior_log table. To query other tables, replace the table name.
DESC bigdata_public_dataset.commerce.behavior_log;
```
Query example
```
-- Enable session-level schema syntax.
SET odps.namespace.schema=TRUE;
-- Counts the top three product category IDs with the highest sales within 22 days using behavior_log.
SELECT
    cate,
    count(btag) sales
FROM
    bigdata_public_dataset.commerce.behavior_log
WHERE
    btag='buy'
GROUP BY
    cate
ORDER BY
    sales DESC
LIMIT 3;
```

Life service dataset

Project name	BIGDATA_PUBLIC_DATASET
Schema name	life_service
Available regions	China (Hangzhou), China (Shanghai), China (Beijing), China (Zhangjiakou), China (Ulanqab), China (Shenzhen), China (Chengdu)
Table names and descriptions	movie_basic_info (basic movie information table); movie_box (basic box office information table); areacode_basic_info_2020 (basic information table for administrative and urban/rural division codes for 2020); phoneno_basic_info_2020 (basic information table for mobile number attribution for 2020)
Update cycle	movie_basic_info, movie_box: Provides data for fixed date partitions. Incremental updates are no longer provided. areacode_basic_info_2020, phoneno_basic_info_2020: Provides static data. Incremental updates are no longer provided.
Query table schema
```
-- Enable session-level schema syntax.
SET odps.namespace.schema=TRUE;
-- Queries the schema of the movie_box table. To query other tables, replace the table name.
DESC bigdata_public_dataset.life_service.movie_box;
```
Query example
```
-- Enable session-level schema syntax.
SET odps.namespace.schema=TRUE;
-- Queries the names of the top 10 movies at the box office on January 14, 2017.
SELECT
    moviename
FROM
    bigdata_public_dataset.life_service.movie_box
WHERE
    ds ='20170114'
ORDER BY
    rank ASC
LIMIT 10;
```

Financial stock dataset

Project name	BIGDATA_PUBLIC_DATASET
Schema name	finance
Available regions	China (Hangzhou), China (Shanghai), China (Beijing), China (Zhangjiakou), China (Ulanqab), China (Shenzhen), China (Chengdu)
Table names and descriptions	ods_enterprise_share_basic (basic stock information table); ods_enterprise_share_quarter_cashflow (quarterly cash flow report); ods_enterprise_share_quarter_growth (quarterly business growth data table); ods_enterprise_share_quarter_operation (quarterly financial turnover data table); ods_enterprise_share_quarter_profit (quarterly profit statement); ods_enterprise_share_quarter_report (quarterly report); ods_enterprise_share_trade_h (stock price table)
Update cycle	Provides data for fixed date partitions. Incremental updates are no longer provided.
Query table schema
```
-- Enable session-level schema syntax.
SET odps.namespace.schema=TRUE;
-- Queries the schema of the ods_enterprise_share_basic table. To query other tables, replace the table name.
DESC bigdata_public_dataset.finance.ods_enterprise_share_basic;
```
Query example
```
-- Enable session-level schema syntax.
SET odps.namespace.schema=TRUE;
-- Queries basic stock information data for January 14, 2017.
SELECT * FROM bigdata_public_dataset.finance.ods_enterprise_share_basic WHERE ds ='20170114' LIMIT 10;
```

Use public datasets

Prerequisites

You have activated MaxCompute and created a project. For more information, see Create a MaxCompute project.

Supported tools or platforms

Procedure (DataWorks Data Development node example)

Log on to the DataWorks console and select a region in the upper-left corner.
Create a workspace.
Attach a MaxCompute data source.

Create an ODPS SQL node and enter the following SQL example.

```
-- Enable session-level schema syntax.
SET odps.namespace.schema=true;
-- Allow ORDER BY without LIMIT.
SET odps.sql.validate.orderby.limit = false;
-- View the GDP of each province in China, sorted by year.
SELECT
    region,
    gdp,
    year
FROM
    bigdata_public_dataset.national_data.annual_gdp_by_province
ORDER BY
    year ASC;
```

Run the node to view the results.
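The same query can also be submitted from a script instead of a DataWorks node. The following is a minimal sketch that assumes the third-party PyODPS package (`pip install pyodps`) is installed. The environment variable names and the function name `run_query` are placeholders chosen for illustration, and the session flags are passed as hints rather than as `SET` statements.

```python
import os

# The SELECT from the procedure above, without the SET statements.
SQL = (
    "SELECT region, gdp, year "
    "FROM bigdata_public_dataset.national_data.annual_gdp_by_province "
    "ORDER BY year ASC"
)

# Session flags from the procedure above, expressed as execution hints.
HINTS = {
    "odps.namespace.schema": "true",
    "odps.sql.validate.orderby.limit": "false",
}


def run_query(sql: str = SQL, hints: dict = HINTS):
    """Run the query in your own project and return (region, gdp, year) tuples."""
    from odps import ODPS  # imported lazily so the module loads without PyODPS

    # Placeholder variable names: set them to your own credentials, project, and endpoint.
    o = ODPS(
        os.environ["ALIBABA_CLOUD_ACCESS_KEY_ID"],
        os.environ["ALIBABA_CLOUD_ACCESS_KEY_SECRET"],
        project=os.environ["MAXCOMPUTE_PROJECT"],
        endpoint=os.environ["MAXCOMPUTE_ENDPOINT"],
    )
    instance = o.execute_sql(sql, hints=hints)  # blocks until the job finishes
    with instance.open_reader() as reader:
        return [(r["region"], r["gdp"], r["year"]) for r in reader]
```

Calling `run_query()` submits a billable job in the project that the environment variables point to, so the computing fees described in the precautions apply.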

References

For more information about how to export MaxCompute data, see the following topics:

Download: Allows you to download data or the execution results of a specified instance to your local computer.
UNLOAD: Allows you to export data to external storage, such as OSS or Hologres.