MaxCompute provides access to GitHub public event data sourced from GH Archive, a project that archives public GitHub activity by the hour. Use this dataset to analyze open source ecosystems, track programming language trends, and measure project activity across millions of repositories.
This dataset is for product testing only. Data accuracy is not guaranteed. Do not use it in a production environment.
Available tables
The dataset is stored in the github_events schema of the BIGDATA_PUBLIC_DATASET project. All tables are accessible via cross-project SQL queries.
| Table name | Description | Update cycle |
|---|---|---|
dwd_github_events_odps | Fact table — one row per GitHub public event | T+1 hour |
dws_overview_by_repo_month | Aggregate table — monthly metrics per repository | T+1 day |
db_repos | IDs and names of open source database projects | — |
programming_language_repos | IDs and names of open source programming language projects | — |
The db_repos and programming_language_repos project lists are sourced from ossinsight.
To learn how this data is generated, see Integrated offline and real-time data processing based on the GitHub public event dataset.
Table schemas
dwd_github_events_odps
This fact table stores one record per GitHub public event, updated every T+1 hour.
| Field name | Data type | Description |
|---|---|---|
id | BIGINT | The event ID. |
actor_id | BIGINT | The ID of the event initiator. |
actor_login | STRING | The login name of the event initiator. |
repo_id | BIGINT | The repository ID. |
repo_name | STRING | The repository name, in owner/repository_name format. |
org_id | BIGINT | The ID of the organization that owns the repository. |
org_login | STRING | The name of the organization that owns the repository. |
type | STRING | The event type. For a full list, see GitHub event types. |
created_at | DATETIME | The time when the event occurred. |
action | STRING | The event action (for example, opened or created). |
iss_or_pr_id | BIGINT | The issue or pull request ID. |
number | BIGINT | The issue or pull request sequence number. |
comment_id | BIGINT | The comment ID. |
commit_id | STRING | The commit ID. |
member_id | BIGINT | The member ID. |
rev_or_push_or_rel_id | BIGINT | The review, push, or release ID. |
ref | STRING | The name of the created or deleted resource. |
ref_type | STRING | The type of the created or deleted resource. |
state | STRING | The status of the issue, pull request, or pull request review. |
author_association | STRING | The relationship between the event initiator and the repository. |
language | STRING | The programming language of the code in the merge request. |
merged | BOOLEAN | Whether the pull request was merged. |
merged_at | DATETIME | The time when the code was merged. |
additions | BIGINT | The number of lines added. |
deletions | BIGINT | The number of lines deleted. |
changed_files | BIGINT | The number of files changed by the pull request. |
push_size | BIGINT | The total number of commits in the push. |
push_distinct_size | BIGINT | The number of distinct commits in the push. |
hr | STRING | The hour when the event occurred. For example, an event at 00:23 has the value 00. |
month | STRING | The month when the event occurred, in yyyy-MM format. For example, October 2015 is 2015-10. |
year | STRING | The year when the event occurred. For example, 2015. |
ds | STRING | The date when the event occurred, in yyyy-MM-dd format. |
dws_overview_by_repo_month
This aggregate table stores monthly metrics per repository, updated every T+1 day.
| Field name | Data type | Description |
|---|---|---|
repo_id | BIGINT | The repository ID. |
repo_name | STRING | The repository name, in owner/repository_name format. |
stars | BIGINT | The number of stars the repository received that month. |
commits | BIGINT | The number of commits to the repository that month. |
pushes | BIGINT | The number of pushes to the repository that month. |
total_prs | BIGINT | The total number of pull requests that month. |
pr_creators | BIGINT | The number of pull request creators that month. |
pr_reviews | BIGINT | The number of pull request reviews that month. |
pr_reviewers | BIGINT | The number of pull request reviewers that month. |
total_issues | BIGINT | The total number of issues that month. |
forks | BIGINT | The number of forks that month. |
month | STRING | The month, in yyyy-MM format. For example, October 2015 is 2015-10. |
Available regions
| Region | Region ID |
|---|---|
| China (Hangzhou) | cn-hangzhou |
| China (Shanghai) | cn-shanghai |
| China (Beijing) | cn-beijing |
| China (Zhangjiakou) | cn-zhangjiakou |
| China (Ulanqab) | cn-wulanchabu |
| China (Shenzhen) | cn-shenzhen |
| China (Chengdu) | cn-chengdu |
Usage notes
Public datasets are stored in the BIGDATA_PUBLIC_DATASET project. Because you are not a member of that project, all queries use cross-project access. Keep the following in mind:
Specify the full table path in SQL. Prefix every table name with the project and schema:
bigdata_public_dataset.github_events.<table_name>.Enable schema syntax. If tenant-level schema syntax is not enabled for your account, add the following line at the start of every SQL script:
SET odps.namespace.schema = true;Storage is free; queries are billed. You are not charged for storing public dataset data, but computing resources consumed by your queries are billed. For details, see Computing fees (pay-as-you-go).
Data Map is unavailable for public datasets. Cross-project access means you cannot browse these tables in the DataWorks Data Map.
DataWorks DataAnalysis requires SQL. If tenant-level schema syntax is not enabled, query the data by running SQL statements directly instead of using the DataAnalysis visual interface.
For more information about schema operations, see Schema operations.
Query the dataset
Prerequisites
Before you begin, ensure that you have:
An activated MaxCompute service
A MaxCompute project. For details, see Create a MaxCompute project.
Supported tools
Run queries using any of the following tools:
Run your first query
The following example retrieves 100 rows from the fact table for a specific date:
-- Enable session-level schema syntax if tenant-level syntax is not enabled.
SET odps.namespace.schema = true;
-- Query 100 records from May 10, 2024.
SELECT *
FROM bigdata_public_dataset.github_events.dwd_github_events_odps
WHERE ds = '2024-05-10'
LIMIT 100;Example queries and analysis
The following examples use the dwd_github_events_odps and dws_overview_by_repo_month tables to answer common questions about open source databases and programming languages. All examples require schema syntax to be enabled.
Open source databases
This group of queries uses the db_repos reference table to scope results to known open source database projects, covering popularity rankings, star trends, and contributor activity.
Which open source databases were most popular from 2018 to 2022?
SET odps.namespace.schema = TRUE;
SELECT
dws.repo_id AS repo_id,
repos.name AS repo_name,
SUM(dws.stars) AS stars
FROM bigdata_public_dataset.github_events.dws_overview_by_repo_month dws
JOIN bigdata_public_dataset.github_events.db_repos repos ON dws.repo_id = repos.id
WHERE month >= '2015-01' AND month <= '2022-12'
GROUP BY dws.repo_id, repos.name
ORDER BY stars DESC
LIMIT 10;Analysis result: elasticsearch was the most popular open source database project, followed by redis and prometheus.

How did the rankings of the top database projects change each year from 2018 to 2022?
SET odps.namespace.schema = TRUE;
SET odps.sql.validate.orderby.limit = FALSE;
WITH tmp AS (
SELECT
dws.repo_id AS repo_id,
repos.name AS repo_name,
SUM(dws.stars) AS stars,
SUBSTR(month, 1, 4) AS year,
ROW_NUMBER() OVER (PARTITION BY SUBSTR(month, 1, 4) ORDER BY SUM(dws.stars) DESC) AS ranknum
FROM bigdata_public_dataset.github_events.dws_overview_by_repo_month dws
JOIN bigdata_public_dataset.github_events.db_repos repos ON dws.repo_id = repos.id
WHERE month >= '2018-01'
GROUP BY dws.repo_id, repos.name, SUBSTR(month, 1, 4)
)
SELECT repo_id, repo_name, stars, ranknum, year
FROM tmp
WHERE year <= 2022 AND ranknum <= 10
ORDER BY year ASC, ranknum ASC;Analysis result: clickhouse showed the fastest growth of any project, rising from tenth place in 2018 to first place in 2021. In 2022, it was overtaken by the steadily rising redis. taosdata/TDengine reached first place in 2019 but dropped to ninth within a year.

How did monthly star counts for the top 10 database projects trend from 2018 to 2022?
SET odps.namespace.schema = true;
SET odps.sql.validate.orderby.limit = false;
WITH top_10_repos AS (
SELECT
dws.repo_id AS repo_id,
repos.name AS repo_name,
SUM(dws.stars) AS stars
FROM bigdata_public_dataset.github_events.dws_overview_by_repo_month dws
JOIN bigdata_public_dataset.github_events.db_repos repos ON dws.repo_id = repos.id
WHERE month >= '2018-01' AND month <= '2022-12'
GROUP BY dws.repo_id, repos.name
ORDER BY stars DESC
LIMIT 10
),
tmp AS (
SELECT
month,
repo_id,
stars,
SUM(stars) OVER (PARTITION BY repo_id ORDER BY month ASC) AS total_stars
FROM bigdata_public_dataset.github_events.dws_overview_by_repo_month
WHERE month >= '2015-01' AND stars IS NOT NULL
AND repo_id IN (SELECT repo_id FROM top_10_repos)
GROUP BY repo_id, month, stars
ORDER BY month ASC, repo_id
)
SELECT
tmp.month AS month,
top_10_repos.repo_name AS repo_name,
tmp.total_stars AS total_stars
FROM tmp
JOIN top_10_repos ON top_10_repos.repo_id = tmp.repo_id
GROUP BY month, repo_name, total_stars
ORDER BY month ASC, repo_name;Analysis result: elasticsearch was consistently the most starred database project. clickhouse saw rapid cumulative growth starting in 2021.

Which databases were most popular in the first half of 2023?
SET odps.namespace.schema = true;
SELECT
repos.name AS repo_name,
SUM(dws.stars) AS stars
FROM bigdata_public_dataset.github_events.dws_overview_by_repo_month dws
JOIN bigdata_public_dataset.github_events.db_repos repos ON repos.id = dws.repo_id
WHERE month >= '2023-01' AND month <= '2023-06'
GROUP BY repo_name
ORDER BY stars DESC
LIMIT 10;Analysis result: clickhouse was the most popular database in the first half of 2023, followed by prometheus and redis.

Which databases were most actively maintained in the first half of 2023?
This query measures activity by counting opened pull request events, which reflects ongoing development work.
SET odps.namespace.schema = true;
SELECT
repos.name AS repo_name,
COUNT(dwd.id) AS num
FROM bigdata_public_dataset.github_events.dwd_github_events_odps dwd
JOIN bigdata_public_dataset.github_events.db_repos repos ON repos.id = dwd.repo_id
WHERE type = 'PullRequestEvent'
AND ds >= '2023-01-01' AND ds <= '2023-06-30'
AND action = 'opened'
GROUP BY repos.name
ORDER BY num DESC
LIMIT 10;Analysis result: StarRocks was the most actively maintained open source database project in the first half of 2023.

Who were the top individual contributors to the most active database project?
This query counts contribution events — pull requests, issues, comments, reviews, and pushes — per contributor for a given repository and time range.
SET odps.namespace.schema = true;
WITH a AS (
SELECT
repo_id,
repo_name,
actor_id,
actor_login,
COUNT(*) AS contribution,
ds
FROM bigdata_public_dataset.github_events.dwd_github_events_odps
WHERE ds >= '2021-01-01' AND ds <= '2021-12-31'
AND (
(type = 'PullRequestEvent' AND action = 'opened')
OR (type = 'IssuesEvent' AND action = 'opened')
OR (type = 'IssueCommentEvent' AND action = 'created')
OR (type = 'PullRequestReviewEvent' AND action = 'created')
OR (type = 'PullRequestReviewCommentEvent' AND action = 'created')
OR (type = 'PushEvent' AND action IS NULL)
)
GROUP BY repo_id, repo_name, actor_id, actor_login, ds
)
SELECT
repo_name,
actor_login,
SUM(contribution) AS contribution
FROM a
WHERE repo_name = 'StarRocks/starrocks'
AND actor_login NOT LIKE '%[bot]'
AND actor_login NOT LIKE 'cockroach%'
GROUP BY repo_name, actor_login
ORDER BY contribution DESC
LIMIT 10;Analysis result: kangkaisen was the top individual contributor to the StarRocks/starrocks repository in 2021.

Programming languages
This group of queries uses the programming_language_repos reference table to scope results to language runtime and compiler projects, answering questions about usage and popularity trends.
What were the top 10 most used programming languages over the past year?
SET odps.namespace.schema = true;
SELECT
language,
COUNT(*) AS total
FROM bigdata_public_dataset.github_events.dwd_github_events_odps
WHERE ds >= DATE_ADD(GETDATE(), -365)
AND language IS NOT NULL
GROUP BY language
ORDER BY total DESC
LIMIT 10;Analysis result: JavaScript was the most used language, followed by TypeScript and Python.

Which programming language projects received the most stars from 2018 to 2022?
SET odps.namespace.schema = TRUE;
SELECT
dws.repo_id AS repo_id,
repos.name AS repo_name,
SUM(dws.stars) AS stars
FROM bigdata_public_dataset.github_events.dws_overview_by_repo_month dws
JOIN bigdata_public_dataset.github_events.programming_language_repos repos ON dws.repo_id = repos.id
WHERE month >= '2015-01'
GROUP BY dws.repo_id, repos.name
ORDER BY stars DESC
LIMIT 10;Analysis result: Go was the most popular programming language project with 81,642 stars, followed by TypeScript and Node.

How did cumulative star counts for the top 10 language projects grow month by month?
SET odps.namespace.schema = true;
SET odps.sql.validate.orderby.limit = false;
WITH top_10_repos AS (
SELECT
dws.repo_id AS repo_id,
repos.name AS repo_name,
SUM(dws.stars) AS stars
FROM bigdata_public_dataset.github_events.dws_overview_by_repo_month dws
JOIN bigdata_public_dataset.github_events.programming_language_repos repos ON dws.repo_id = repos.id
WHERE month >= '2018-01' AND month <= '2022-12'
GROUP BY dws.repo_id, repos.name
ORDER BY stars DESC
LIMIT 10
),
tmp AS (
SELECT
month,
repo_id,
stars,
SUM(stars) OVER (PARTITION BY repo_id ORDER BY month ASC) AS total_stars
FROM bigdata_public_dataset.github_events.dws_overview_by_repo_month
WHERE month >= '2015-01' AND stars IS NOT NULL
AND repo_id IN (SELECT repo_id FROM top_10_repos)
GROUP BY repo_id, month, stars
ORDER BY month ASC, repo_id
)
SELECT
tmp.month AS month,
top_10_repos.repo_name AS repo_name,
tmp.total_stars AS total_stars
FROM tmp
JOIN top_10_repos ON top_10_repos.repo_id = tmp.repo_id
GROUP BY month, repo_name, total_stars
ORDER BY month ASC, repo_name;Analysis result: Go had the fastest cumulative star growth of any programming language project over the five-year period.
