This topic describes the GitHub public event data available in the MaxCompute public dataset. It also explains how to query this data and provides query examples with analysis results.
Overview
Many developers work on open source projects on GitHub. This work generates a large volume of events. GitHub records information about each event, such as the event type, details, developer, and code repository. GitHub also makes public events available, such as starring a repository or submitting code. For more information about event types, see GitHub Event Types. The GH Archive project summarizes public GitHub events by the hour and makes them accessible to developers. For more information about the project, see GH Archive.
MaxCompute processes the large volume of public event data from GH Archive to generate the following tables:
Table name | Description | Update cycle |
dwd_github_events_odps | Fact table of GitHub public event data | T+1 hour |
dws_overview_by_repo_month | Aggregate table of monthly metrics for GitHub public events | T+1 day |
The dwd_github_events_odps and dws_overview_by_repo_month tables are stored in the github_events schema of the BIGDATA_PUBLIC_DATASET public project. For more information, see Schema operations. If you have activated MaxCompute, you can query these tables using cross-project access.
Table details
dwd_github_events_odps
This fact table stores the main information about each GitHub public event. It is updated every T+1 hour. The table contains the following fields:
Field name | Data type | Description |
id | BIGINT | The event ID. |
actor_id | BIGINT | The ID of the event initiator. |
actor_login | string | The logon name of the event initiator. |
repo_id | BIGINT | The repository ID. |
repo_name | String | The repository name in the owner/Repository_name format. |
org_id | BIGINT | The ID of the organization to which the repository belongs. |
org_login | string | The name of the organization to which the repository belongs. |
type | string | The event type. For more information about types and descriptions, see GitHub Event Types. |
created_at | DATETIME | The time when the event occurred. |
action | string | The event behavior. |
iss_or_pr_id | BIGINT | The issue or pull_request ID. |
number | BIGINT | The issue or pull_request ordinal number. |
comment_id | BIGINT | The comment ID. |
commit_id | string | The commit ID. |
member_id | BIGINT | The member ID. |
rev_or_push_or_rel_id | BIGINT | The review, push, or release ID. |
ref | string | The name of the created or deleted resource. |
ref_type | String | The type of the created or deleted resource. |
state | string | The status of the issue, pull_request, or pull_request_review. |
author_association | string | The relationship between the actor and the repository. |
language | string | The language of the code in the merge request. |
merged | boolean | Indicates whether the merge is accepted. |
merged_at | DATETIME | The time when the code was merged. |
additions | BIGINT | The number of lines added to the code. |
deletions | BIGINT | The number of lines deleted from the code. |
changed_files | BIGINT | The number of files changed by the pull request. |
push_size | BIGINT | The number of commits. |
push_distinct_size | BIGINT | The number of distinct commits. |
hr | string | The hour when the event occurred. For example, if the event occurred at |
month | String | The month when the event occurred. For example, if the event occurred in October 2015, the value of this parameter is |
year | string | The year when the event occurred. For example, if the event occurred in 2015, the value of this parameter is |
ds | string | The date on which the event occurred. The value is in |
dws_overview_by_repo_month
This aggregate table stores a monthly summary of event metrics for each repository. It is updated every T+1 day. The table contains the following fields:
Field name | Data type | Description |
repo_id | BIGINT | The repository ID. |
repo_name | string | The repository name in the owner/Repository_name format. |
stars | BIGINT | The number of stars for the repository. |
commits | BIGINT | The number of commits to the repository. |
pushes | BIGINT | The number of pushes to the repository. |
total_prs | BIGINT | The total number of pull requests for the repository. |
pr_creators | BIGINT | The total number of pull request creators for the repository. |
pr_reviews | BIGINT | The total number of pull request reviews for the repository. |
pr_reviewers | BIGINT | The total number of pull request reviewers for the repository. |
total_issues | BIGINT | The total number of issues for the repository. |
forks | BIGINT | The number of stars for the repository. |
month | string | The month when the event occurred. For example, if the event occurred in October 2015, the value of this parameter is |
To learn how this data is generated, see Integrated offline and real-time data processing based on the GitHub public event dataset.
To analyze specific objects, MaxCompute also stores the db_repos table, which contains the IDs and names of open source database projects, and the programming_language_repos table, which contains the IDs and names of open source programming language projects. The project lists are sourced from ossinsight.
Available regions
Region | Region ID |
China (Hangzhou) | cn-hangzhou |
China (Shanghai) | cn-shanghai |
China (Beijing) | cn-beijing |
China (Zhangjiakou) | cn-zhangjiakou |
China (Ulanqab) | cn-wulanchabu |
China (Shenzhen) | cn-shenzhen |
China (Chengdu) | cn-chengdu |
Disclaimer
The GitHub public event data provided by MaxCompute is for product testing only. The accuracy of the data is not guaranteed. Do not use this data in a production environment.
Precautions
Public datasets are available to all MaxCompute users. Note the following:
The data for public datasets is stored in the
BIGDATA_PUBLIC_DATASETproject. Users are not added as members to this project. Therefore, you must access the data across projects. When you write an SQL script, you must specify the project and schema name before the table name. If tenant-level schema syntax is not enabled, you must enable session-level schema syntax to run commands. For example:-- Enable session-level schema syntax. SET odps.namespace.schema=true; -- Query 100 records from the dwd_github_events_odps table. SELECT * FROM bigdata_public_dataset.github_events.dwd_github_events_odps WHERE ds='2024-05-10' limit 100;ImportantYou are not charged for data storage in public datasets. However, you are charged for the computing resources consumed by your queries. For more information, see Computing fees (Pay-as-you-go).
Because cross-project access is required, you cannot view tables from the public dataset in the Data Map of DataWorks.
The public dataset project stores data by schema. If tenant-level schema syntax is not enabled for your account, you cannot view the public dataset directly in DataWorks DataAnalysis. You can still query the data by running SQL statements.
Use MaxCompute to explore GitHub public event data
Prerequisites
MaxCompute is activated. A MaxCompute project is created. For more information about how to create a MaxCompute project, see Create a MaxCompute project.
Supported tools and platforms
Appendix: Query examples and analysis results
Explore open source databases
View the top 10 most popular open source database projects over the last five years (2018 to 2022)
Query example:
SET odps.namespace.schema = TRUE; SELECT dws.repo_id AS repo_id, repos.name AS repo_name, SUM(dws.stars) AS stars FROM bigdata_public_dataset.github_events.dws_overview_by_repo_month dws JOIN bigdata_public_dataset.github_events.db_repos repos ON dws.repo_id = repos.id WHERE MONTH >= '2015-01' AND MONTH <='2022-12' GROUP BY dws.repo_id, repos.name ORDER BY stars DESC LIMIT 10;Analysis result: In the last five years,
elasticsearchwas the most popular open source database project, followed byredisandprometheus.
View the ranking changes of the top 10 most popular open source database projects over the last five years (2018 to 2022)
Query example:
SET odps.namespace.schema = TRUE; SET odps.sql.validate.orderby.limit=FALSE; WITH tmp as (SELECT dws.repo_id AS repo_id, repos.name AS repo_name, SUM(dws.stars) AS stars, SUBSTR(MONTH,1,4) AS YEAR, row_number() over (partition BY SUBSTR(MONTH,1,4) ORDER BY SUM(dws.stars) DESC) AS ranknum FROM bigdata_public_dataset.github_events.dws_overview_by_repo_month dws JOIN bigdata_public_dataset.github_events.db_repos repos ON dws.repo_id = repos.id WHERE MONTH>='2018-01' GROUP BY dws.repo_id, repos.name,SUBSTR(MONTH,1,4)) SELECT repo_id, repo_name, stars, ranknum, YEAR FROM tmp WHERE YEAR<=2022 AND ranknum<=10 ORDER BY YEAR ASC,ranknum ASC;Analysis result: Over the last five years, the rankings of the top 10 open source database projects have shifted.
clickhouseshowed the fastest growth, rising from tenth place in 2018 to first place in 2021. In 2022, it was surpassed by the steadily risingredis. In contrast,taosdata/TDengineranked first in 2019 but dropped to ninth place within a year.
View the monthly star growth trend for the top 10 most popular open source database projects over the last five years (2018 to 2022)
Query example:
SET odps.namespace.schema = true; SET odps.sql.validate.orderby.LIMIT=false; WITH top_10_repos AS ( SELECT dws.repo_id AS repo_id, repos.name AS repo_name, SUM(dws.stars) AS stars FROM bigdata_public_dataset.github_events.dws_overview_by_repo_month dws JOIN bigdata_public_dataset.github_events.db_repos repos ON dws.repo_id = repos.id WHERE MONTH >= '2018-01' AND MONTH <='2022-12' GROUP BY dws.repo_id, repos.name ORDER BY stars DESC LIMIT 10 ), tmp AS ( SELECT MONTH, repo_id, stars, SUM(stars) OVER (PARTITION BY repo_id ORDER BY MONTH ASC ) AS total_stars FROM bigdata_public_dataset.github_events.dws_overview_by_repo_month WHERE MONTH >= '2015-01' AND stars IS NOT NULL AND repo_id IN (SELECT repo_id FROM top_10_repos) GROUP BY repo_id,MONTH,stars ORDER BY MONTH ASC,repo_id) SELECT tmp.MONTH AS MONTH, top_10_repos.repo_name AS repo_name, tmp.total_stars AS total_stars FROM tmp JOIN top_10_repos ON top_10_repos.repo_id = tmp.repo_id GROUP BY MONTH,repo_name,total_stars ORDER BY MONTH ASC,repo_name ;Analysis result: Among the top 10 open source database projects,
elasticsearchhas consistently been the most popular. Theclickhouseproject has seen rapid growth since 2021.
Which databases were most popular in the first half of 2023
Query example:
set odps.namespace.schema = true; SELECT repos.name AS repo_name, sum(dws.stars) AS stars FROM bigdata_public_dataset.github_events.dws_overview_by_repo_month dws JOIN bigdata_public_dataset.github_events.db_repos repos ON repos.id = dws.repo_id WHERE month >= '2023-01' AND month <= '2023-06' GROUP BY repo_name ORDER BY stars DESC LIMIT 10;Analysis result: In the first half of 2023,
clickhousewas the most popular database, followed byprometheusandredis.
Which databases were actively maintained and updated in the first half of 2023
Query example:
SET odps.namespace.schema = true; SELECT repos.name AS repo_name, COUNT(dwd.id) AS num FROM bigdata_public_dataset.github_events.dwd_github_events_odps dwd JOIN bigdata_public_dataset.github_events.db_repos repos ON repos.id = dwd.repo_id WHERE type = 'PullRequestEvent' AND ds>='2023-01-01' and ds<='2023-06-30' AND action = 'opened' GROUP BY repos.name ORDER BY num DESC LIMIT 10;Analysis result: In the first half of 2023,
StarRockswas the most actively maintained and updated database. Project activity is defined by the number of createdPullRequestevents.
Who were the main individual contributors to the most active open source database project in the first half of 2023
Query example:
SET odps.namespace.schema=true; WITH a AS ( SELECT repo_id ,repo_name ,actor_id ,actor_login ,COUNT(*) AS contribution ,ds FROM bigdata_public_dataset.github_events.dwd_github_events_odps WHERE ds >='2021-01-01' and ds<='2021-12-31' AND ( ( type = 'PullRequestEvent' AND action = 'opened' ) OR ( type = 'IssuesEvent' AND action = 'opened' ) OR ( type = 'IssueCommentEvent' AND action = 'created' ) OR ( type = 'PullRequestReviewEvent' AND action = 'created' ) OR ( type = 'PullRequestReviewCommentEvent' AND action = 'created' ) OR ( type = 'PushEvent' AND action IS NULL ) ) GROUP BY repo_id ,repo_name ,actor_id ,actor_login ,ds ) SELECT repo_name ,actor_login ,SUM(contribution) AS contribution FROM a WHERE repo_name = 'StarRocks/starrocks' AND actor_login NOT LIKE '%[bot]' AND actor_login NOT LIKE 'cockroach%' GROUP BY repo_name ,actor_login ORDER BY contribution DESC LIMIT 10 ;Analysis result: In the first half of 2023,
kangkaisenwas the main individual contributor to the most active open source database project.
Explore programming languages
Count the top ten most used programming languages in the last year
Query example:
SET odps.namespace.schema=true; SELECT language, count(*) total FROM bigdata_public_dataset.github_events.dwd_github_events_odps WHERE ds>=date_add(getdate(), -365) AND language IS NOT NULL GROUP BY language ORDER BY total DESC LIMIT 10;Analysis result: The most used programming language was
JavaScript, followed byTypeScriptandPython.
View the top ten most popular programming language projects over the last five years (2018 to 2022)
Query example:
SET odps.namespace.schema = TRUE; SELECT dws.repo_id AS repo_id, repos.name AS repo_name, SUM(dws.stars) AS stars FROM bigdata_public_dataset.github_events.dws_overview_by_repo_month dws JOIN bigdata_public_dataset.github_events.programming_language_repos repos ON dws.repo_id = repos.id WHERE MONTH >= '2015-01' GROUP BY dws.repo_id, repos.name ORDER BY stars DESC LIMIT 10;Analysis result: Over the last five years, the most popular programming language project was
Go, which received 81,642 stars. It was followed byTypeScriptandNode.
View the monthly star growth trend for the top 10 most popular programming language projects over the last five years (2018 to 2022)
Query example:
SET odps.namespace.schema = true; SET odps.sql.validate.orderby.limit=false; WITH top_10_repos AS ( SELECT dws.repo_id AS repo_id, repos.name AS repo_name, SUM(dws.stars) AS stars FROM bigdata_public_dataset.github_events.dws_overview_by_repo_by_month dws JOIN bigdata_public_dataset.github_events.programming_language_repos repos ON dws.repo_id = repos.id WHERE month >= '2018-01' AND month <='2022-12' GROUP BY dws.repo_id, repos.name ORDER BY stars DESC LIMIT 10 ), tmp AS ( SELECT month, repo_id, stars, SUM(stars) OVER (PARTITION by repo_id ORDER BY month ASC ) AS total_stars FROM bigdata_public_dataset.github_events.dws_overview_by_repo_by_month WHERE month >= '2015-01' AND stars IS NOT NULL AND repo_id IN (SELECT repo_id FROM top_10_repos) GROUP BY repo_id,month,stars ORDER BY month ASC,repo_id) SELECT tmp.month AS month, top_10_repos.repo_name AS repo_name, tmp.total_stars AS total_stars FROM tmp JOIN top_10_repos ON top_10_repos.repo_id = tmp.repo_id GROUP BY month,repo_name,total_stars ORDER BY month ASC,repo_name ;Analysis result: Over the last five years,
Gowas the programming language project with the fastest growth in popularity.