If you have activated MaxCompute, you can query tables in public datasets using MaxCompute SQL analysis. This lets you quickly try out the service. This topic describes the public datasets and explains how to query and analyze data.
Introduction
MaxCompute offers public datasets in several categories, such as GitHub public event data, national statistics, TPC performance test data, digital commerce data, life service data, and financial stock data. This data is stored in different schemas within the BIGDATA_PUBLIC_DATASET public project in MaxCompute.
Category | Introduction | Dataset name | Schema name |
GitHub public event data | Developers on GitHub generate a large volume of events when they work on open source projects. GitHub records the type and details of each event, the developer, and the code repository. Public events, such as starring a repository or committing code, are made available. | GitHub public event dataset | github_events |
National statistics | Includes annual GDP data for countries around the world and for provinces in China. | National statistics dataset | national_data |
TPC performance data | TPC-DS: TPC-DS is a benchmark for decision support systems. It models common aspects of these systems, such as queries and data maintenance. This lets you run benchmark tests on emerging technologies like big data systems. | TPC-DS data | tpcds_10g, tpcds_100g, tpcds_1t, tpcds_10t |
TPC performance data | TPC-H: TPC-H is a benchmark for decision support systems. It uses a set of business-oriented ad hoc queries and concurrent data modifications. It runs complex queries on large data volumes to answer key business questions. | TPC-H data | tpch_10g, tpch_100g, tpch_1t, tpch_10t |
TPC performance data | TPCx-BB: TPC Express Benchmark BB (TPCx-BB) is a big data benchmark. It measures the performance of Hadoop-based big data systems. It evaluates hardware and software components by running 30 common analytical queries. | TPCx-BB data | tpcxbb_10g, tpcxbb_100g, tpcxbb_1t, tpcxbb_10t |
Digital commerce | Includes data from Taobao advertising, Taobao shopping, and Alibaba E-commerce. | Digital commerce dataset | commerce |
Life service | Includes data on second-hand real estate, movies and box office results, mobile number attribution, and administrative and urban/rural division codes. | Life service dataset | life_service |
Financial stock | Stock information. | Financial stock dataset | finance |
Disclaimer
The public datasets provided by MaxCompute are for product testing only. The data is not updated periodically, and its accuracy is not guaranteed. Do not use this data in production environments.
The generation and analysis of TPC data in the MaxCompute public datasets are based on TPC benchmarks. The results cannot be compared with published TPC benchmark results because the tests run on the MaxCompute public datasets do not meet all TPC benchmark requirements.
The TPC performance test data in MaxCompute originates from TPC. You can also generate TPC data yourself. For more information, see the official TPC documentation.
Precautions
Public datasets are available to all MaxCompute users. When you use public datasets, take note of the following items:
All data in the public datasets is stored in the BIGDATA_PUBLIC_DATASET project in MaxCompute. No users are added to this project as members, so you must access the data across projects. When you write an SQL script, specify the project name and schema name before the table name. If the tenant-level schema syntax is not enabled, enable the session-level schema syntax before you execute a statement. Sample statements:
-- Enable the session-level schema syntax.
set odps.namespace.schema=true;
-- Query 100 data records from the dwd_github_events_odps table.
select * from bigdata_public_dataset.github_events.dwd_github_events_odps where ds='2024-05-10' limit 100;
Important: You are not charged for the storage of the data in the public datasets. However, you are charged computing fees if you execute query statements. For more information, see Computing pricing (pay-as-you-go).
You cannot find the tables in the public datasets on the Data Map page of DataWorks because cross-project access is required.
Public datasets are stored by schema. If you do not enable the tenant-level schema syntax, you cannot view the public datasets in DataWorks DataAnalysis. In this case, you can query the public datasets only by executing SQL statements.
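Because query statements incur computing fees, it can help to restrict a query to a single partition while you explore a table. The following is a minimal sketch that reuses the table and the `ds` partition value from the sample statements above. It assumes that the fee depends on the amount of data a statement reads, which you should confirm against the pricing topic.

```sql
-- Enable the session-level schema syntax.
-- This is needed only if the tenant-level schema syntax is not enabled.
SET odps.namespace.schema=true;

-- Count the events in one partition instead of reading the whole table.
SELECT COUNT(*) AS event_count
FROM bigdata_public_dataset.github_events.dwd_github_events_odps
WHERE ds = '2024-05-10';
```

The alias `event_count` is illustrative only. Replace the `ds` value with a partition that exists in your region.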
Detailed table information
The following tables provide detailed information about the tables in each schema of the BIGDATA_PUBLIC_DATASET public project.
GitHub public event data
Project name | BIGDATA_PUBLIC_DATASET |
Schema name | github_events |
Available regions | China (Hangzhou), China (Shanghai), China (Beijing), China (Zhangjiakou), China (Ulanqab), China (Shenzhen), China (Chengdu) |
Table names and descriptions | Developers on GitHub generate a large volume of events when they work on open source projects. GitHub records the type and details of each event, the developer, and the code repository. Public events, such as starring a repository or committing code, are made available. For more information about event types, see GitHub Events. MaxCompute processes the large volume of public event data provided by GH Archive offline to generate the following tables:
Note The data in the tables is from GH Archive. |
Update cycle |
Query table schema | |
Query example | |
For more information about the data and for query samples, see GitHub public event data. | |
National statistics
Project name | BIGDATA_PUBLIC_DATASET |
Schema name | national_data |
Available regions | China (Hangzhou), China (Shanghai), China (Beijing), China (Zhangjiakou), China (Ulanqab), China (Shenzhen), China (Chengdu) |
Table names and descriptions |
Note The data for annual_gdp_by_province is from the National Bureau of Statistics of China. The data for annual_gdp_by_country is from the International Monetary Fund (IMF). |
Update cycle | Provides static data that is not updated. |
Query table schema | |
Query example | |
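As a rough illustration of how you might inspect a table in this schema before querying it, the sketch below describes the annual_gdp_by_province table and then reads a few rows. The column names region, gdp, and year are taken from the procedure example later in this topic. It is an assumption that `DESC` accepts the three-part project.schema.table name for cross-project access in your environment.

```sql
-- Enable the session-level schema syntax.
SET odps.namespace.schema=true;

-- Assumption: DESC works with the project.schema.table form for this project.
DESC bigdata_public_dataset.national_data.annual_gdp_by_province;

-- Preview a small number of rows to check the column contents.
SELECT region, gdp, year
FROM bigdata_public_dataset.national_data.annual_gdp_by_province
LIMIT 10;
```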
TPC-DS data
Project name | BIGDATA_PUBLIC_DATASET |
Schema name | tpcds_10g, tpcds_100g, tpcds_1t, tpcds_10t |
Available regions | China (Hangzhou), China (Shanghai), China (Beijing), China (Zhangjiakou), China (Ulanqab), China (Shenzhen), China (Chengdu), China (Hong Kong), Japan (Tokyo), Singapore, Malaysia (Kuala Lumpur), Indonesia (Jakarta), US (Virginia), US (Silicon Valley), UK (London), Germany (Frankfurt), UAE (Dubai), China (Shanghai) Finance Cloud, China (Beijing) Finance Cloud (Invitational Preview), China (Beijing) Alibaba Gov Cloud 1, China (Shenzhen) Finance Cloud |
Table names and descriptions | The TPC-DS model simulates the sales system of a large retail chain with a national presence. It includes three sales channels: store (physical outlets), web (online stores), and catalog (phone orders). Each channel uses two tables to simulate sales and return records. The model also includes dimension tables for information about products, promotions, and customers. The details are as follows:
Note The data in the tables is from TPC. |
Update cycle | Provides static data that is not updated. |
Query table schema | |
Query example | |
For query sample files for different data specifications, see TPC-DS data. For more information about the data, see the official TPC Benchmark DS standard specification. | |
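The following sketch shows what a simple query against the smallest TPC-DS schema might look like. The table names store_sales and date_dim and the columns ss_sold_date_sk, ss_ext_sales_price, d_date_sk, and d_year come from the TPC-DS standard specification. It is an assumption that the tables in tpcds_10g use the same names, so check the table schema before you run it.

```sql
-- Enable the session-level schema syntax.
SET odps.namespace.schema=true;

-- Total store sales per calendar year on the 10 GB scale.
-- Assumption: table and column names follow the TPC-DS specification.
SELECT d.d_year,
       SUM(ss.ss_ext_sales_price) AS total_sales
FROM bigdata_public_dataset.tpcds_10g.store_sales ss
JOIN bigdata_public_dataset.tpcds_10g.date_dim d
  ON ss.ss_sold_date_sk = d.d_date_sk
GROUP BY d.d_year
ORDER BY d.d_year
LIMIT 100;
```

Starting with tpcds_10g keeps the amount of data read small. The same statement can target a larger scale by changing only the schema name.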
TPC-H data
Project name | BIGDATA_PUBLIC_DATASET |
Schema name | tpch_10g, tpch_100g, tpch_1t, tpch_10t |
Available regions | China (Hangzhou), China (Shanghai), China (Beijing), China (Zhangjiakou), China (Ulanqab), China (Shenzhen), China (Chengdu), China (Hong Kong), Japan (Tokyo), Singapore, Malaysia (Kuala Lumpur), Indonesia (Jakarta), US (Virginia), US (Silicon Valley), UK (London), Germany (Frankfurt), UAE (Dubai), China (Shanghai) Finance Cloud, China (Beijing) Finance Cloud (Invitational Preview), China (Beijing) Alibaba Gov Cloud 1, China (Shenzhen) Finance Cloud |
Table names and descriptions | TPC-H is a benchmark program used to evaluate Online Analytical Processing (OLAP). It simulates transactions between suppliers and their buyers. It contains information about orders, products, and customers. The details are as follows:
Note The data in the tables is from TPC. |
Update cycle | Provides static data that is not updated. |
Query table schema | |
Query example | |
For more information about the data and for query samples, see the official TPC Benchmark H standard specification. | |
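As an illustration, the sketch below is a simplified pricing summary in the spirit of the first TPC-H query, without its ship date filter. The table name lineitem and the columns l_returnflag, l_linestatus, l_quantity, and l_extendedprice come from the TPC-H standard specification. It is an assumption that the tables in tpch_10g use the same names.

```sql
-- Enable the session-level schema syntax.
SET odps.namespace.schema=true;

-- Summarize line items by return flag and line status on the 10 GB scale.
-- Assumption: table and column names follow the TPC-H specification.
SELECT l_returnflag,
       l_linestatus,
       SUM(l_quantity)      AS sum_qty,
       SUM(l_extendedprice) AS sum_base_price,
       COUNT(*)             AS count_order
FROM bigdata_public_dataset.tpch_10g.lineitem
GROUP BY l_returnflag, l_linestatus
ORDER BY l_returnflag, l_linestatus
LIMIT 100;
```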
TPCx-BB data
Project name | BIGDATA_PUBLIC_DATASET |
Schema name | tpcxbb_10g, tpcxbb_100g, tpcxbb_1t, tpcxbb_10t |
Available regions | China (Hangzhou), China (Shanghai), China (Beijing), China (Zhangjiakou), China (Ulanqab), China (Shenzhen), China (Chengdu), China (Hong Kong), Japan (Tokyo), Singapore, Malaysia (Kuala Lumpur), Indonesia (Jakarta), US (Virginia), US (Silicon Valley), UK (London), Germany (Frankfurt), UAE (Dubai), China (Shanghai) Finance Cloud, China (Beijing) Finance Cloud (Invitational Preview), China (Beijing) Alibaba Gov Cloud 1, China (Shenzhen) Finance Cloud |
Table names and descriptions | TPCx-BB is a big data benchmark tool. It simulates an online retail scenario that includes sales and return records. It also contains information about products and promotions. The details are as follows:
Note The data in the tables is from TPC. |
Update cycle | Provides static data that is not updated. |
Query table schema | |
Query example | |
For more information about the data and for query samples, see the official TPCx-BB standard specification. | |
Digital commerce dataset
Project name | BIGDATA_PUBLIC_DATASET |
Schema name | commerce |
Available regions | China (Hangzhou), China (Shanghai), China (Beijing), China (Zhangjiakou), China (Ulanqab), China (Shenzhen), China (Chengdu) |
Table names and descriptions |
Note The data in the tables is from the Tianchi Lab - Taobao Display Ad Click-Through Rate Prediction dataset. |
Update cycle | Provides static data. Incremental updates are no longer provided. |
Query table schema | |
Query example | |
Life service dataset
Project name | BIGDATA_PUBLIC_DATASET |
Schema name | life_service |
Available regions | China (Hangzhou), China (Shanghai), China (Beijing), China (Zhangjiakou), China (Ulanqab), China (Shenzhen), China (Chengdu) |
Table names and descriptions |
Update cycle |
Query table schema | |
Query example | |
Financial stock dataset
Project name | BIGDATA_PUBLIC_DATASET |
Schema name | finance |
Available regions | China (Hangzhou), China (Shanghai), China (Beijing), China (Zhangjiakou), China (Ulanqab), China (Shenzhen), China (Chengdu) |
Table names and descriptions |
Update cycle | Provides data for fixed date partitions. Incremental updates are no longer provided. |
Query table schema | |
Query example | |
Use public datasets
Prerequisites
You have activated MaxCompute and created a project. For more information, see Create a MaxCompute project.
Supported tools or platforms
Procedure (DataWorks Data Development node example)
Log on to the DataWorks console and select a region in the upper-left corner.
Create an ODPS SQL node and enter the following SQL example.
-- View the GDP trend of each province in China over the last 20 years.
SET odps.namespace.schema=true;
SET odps.sql.validate.orderby.limit = false;
SELECT region, gdp, year FROM bigdata_public_dataset.national_data.annual_gdp_by_province ORDER BY year ASC;
Run the node to view the results.
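As a possible follow-up in the same node, the sketch below ranks regions by GDP for the most recent year in the table. It uses only the columns from the example above. The data types of gdp and year are not stated in this topic, so the `CAST` to DOUBLE is a precaution based on the assumption that gdp might be stored as a string.

```sql
-- Enable the session-level schema syntax.
SET odps.namespace.schema=true;

-- Top 10 regions by GDP in the latest year available in the table.
-- Assumption: gdp can be cast to DOUBLE; the cast is harmless if it is already numeric.
SELECT region,
       CAST(gdp AS DOUBLE) AS gdp_value
FROM bigdata_public_dataset.national_data.annual_gdp_by_province
WHERE year = (
    SELECT MAX(year)
    FROM bigdata_public_dataset.national_data.annual_gdp_by_province
)
ORDER BY gdp_value DESC
LIMIT 10;
```

Because the statement ends with a `LIMIT` clause, it does not need the odps.sql.validate.orderby.limit setting used in the example above.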
References
For more information about how to export MaxCompute data, see the following topics: