All Products
Search
Document Center

MaxCompute:GitHub public event data

Last Updated:Aug 14, 2025

This topic describes the GitHub public event data available in the MaxCompute public dataset. It also explains how to query this data and provides query examples with analysis results.

Overview

Many developers work on open source projects on GitHub. This work generates a large volume of events. GitHub records information about each event, such as the event type, details, developer, and code repository. GitHub also makes public events available, such as starring a repository or submitting code. For more information about event types, see GitHub Event Types. The GH Archive project summarizes public GitHub events by the hour and makes them accessible to developers. For more information about the project, see GH Archive.

MaxCompute processes the large volume of public event data from GH Archive to generate the following tables:

Table name

Description

Update cycle

dwd_github_events_odps

Fact table of GitHub public event data

T+1 hour

dws_overview_by_repo_month

Aggregate table of monthly metrics for GitHub public events

T+1 day

The dwd_github_events_odps and dws_overview_by_repo_month tables are stored in the github_events schema of the BIGDATA_PUBLIC_DATASET public project. For more information, see Schema operations. If you have activated MaxCompute, you can query these tables using cross-project access.

Table details

dwd_github_events_odps

This fact table stores the main information about each GitHub public event. It is updated every T+1 hour. The table contains the following fields:

Field name

Data type

Description

id

BIGINT

The event ID.

actor_id

BIGINT

The ID of the event initiator.

actor_login

string

The logon name of the event initiator.

repo_id

BIGINT

The repository ID.

repo_name

String

The repository name in the owner/Repository_name format.

org_id

BIGINT

The ID of the organization to which the repository belongs.

org_login

string

The name of the organization to which the repository belongs.

type

string

The event type.

For more information about types and descriptions, see GitHub Event Types.

created_at

DATETIME

The time when the event occurred.

action

string

The event behavior.

iss_or_pr_id

BIGINT

The issue or pull_request ID.

number

BIGINT

The issue or pull_request ordinal number.

comment_id

BIGINT

The comment ID.

commit_id

string

The commit ID.

member_id

BIGINT

The member ID.

rev_or_push_or_rel_id

BIGINT

The review, push, or release ID.

ref

string

The name of the created or deleted resource.

ref_type

String

The type of the created or deleted resource.

state

string

The status of the issue, pull_request, or pull_request_review.

author_association

string

The relationship between the actor and the repository.

language

string

The language of the code in the merge request.

merged

boolean

Indicates whether the merge is accepted.

merged_at

DATETIME

The time when the code was merged.

additions

BIGINT

The number of lines added to the code.

deletions

BIGINT

The number of lines deleted from the code.

changed_files

BIGINT

The number of files changed by the pull request.

push_size

BIGINT

The number of commits.

push_distinct_size

BIGINT

The number of distinct commits.

hr

string

The hour when the event occurred.

For example, if the event occurred at 00:23, the value of this parameter is 00.

month

String

The month when the event occurred.

For example, if the event occurred in October 2015, the value of this parameter is 2015-10.

year

string

The year when the event occurred.

For example, if the event occurred in 2015, the value of this parameter is 2015.

ds

string

The date on which the event occurred.

The value is in yyyy-mm-dd format.

dws_overview_by_repo_month

This aggregate table stores a monthly summary of event metrics for each repository. It is updated every T+1 day. The table contains the following fields:

Field name

Data type

Description

repo_id

BIGINT

The repository ID.

repo_name

string

The repository name in the owner/Repository_name format.

stars

BIGINT

The number of stars for the repository.

commits

BIGINT

The number of commits to the repository.

pushes

BIGINT

The number of pushes to the repository.

total_prs

BIGINT

The total number of pull requests for the repository.

pr_creators

BIGINT

The total number of pull request creators for the repository.

pr_reviews

BIGINT

The total number of pull request reviews for the repository.

pr_reviewers

BIGINT

The total number of pull request reviewers for the repository.

total_issues

BIGINT

The total number of issues for the repository.

forks

BIGINT

The number of stars for the repository.

month

string

The month when the event occurred.

For example, if the event occurred in October 2015, the value of this parameter is 2015-10.

To learn how this data is generated, see Integrated offline and real-time data processing based on the GitHub public event dataset.

To analyze specific objects, MaxCompute also stores the db_repos table, which contains the IDs and names of open source database projects, and the programming_language_repos table, which contains the IDs and names of open source programming language projects. The project lists are sourced from ossinsight.

Available regions

Region

Region ID

China (Hangzhou)

cn-hangzhou

China (Shanghai)

cn-shanghai

China (Beijing)

cn-beijing

China (Zhangjiakou)

cn-zhangjiakou

China (Ulanqab)

cn-wulanchabu

China (Shenzhen)

cn-shenzhen

China (Chengdu)

cn-chengdu

Disclaimer

The GitHub public event data provided by MaxCompute is for product testing only. The accuracy of the data is not guaranteed. Do not use this data in a production environment.

Precautions

Public datasets are available to all MaxCompute users. Note the following:

  • The data for public datasets is stored in the BIGDATA_PUBLIC_DATASET project. Users are not added as members to this project. Therefore, you must access the data across projects. When you write an SQL script, you must specify the project and schema name before the table name. If tenant-level schema syntax is not enabled, you must enable session-level schema syntax to run commands. For example:

    -- Enable session-level schema syntax.
    SET odps.namespace.schema=true; 
    -- Query 100 records from the dwd_github_events_odps table.
    SELECT * FROM bigdata_public_dataset.github_events.dwd_github_events_odps WHERE ds='2024-05-10' limit 100;
    Important

    You are not charged for data storage in public datasets. However, you are charged for the computing resources consumed by your queries. For more information, see Computing fees (Pay-as-you-go).

  • Because cross-project access is required, you cannot view tables from the public dataset in the Data Map of DataWorks.

  • The public dataset project stores data by schema. If tenant-level schema syntax is not enabled for your account, you cannot view the public dataset directly in DataWorks DataAnalysis. You can still query the data by running SQL statements.

Use MaxCompute to explore GitHub public event data

Prerequisites

MaxCompute is activated. A MaxCompute project is created. For more information about how to create a MaxCompute project, see Create a MaxCompute project.

Supported tools and platforms

Appendix: Query examples and analysis results

Explore open source databases

View the top 10 most popular open source database projects over the last five years (2018 to 2022)

  • Query example:

    SET odps.namespace.schema = TRUE;
    SELECT dws.repo_id AS repo_id,
     repos.name AS repo_name,
     SUM(dws.stars) AS stars
    FROM bigdata_public_dataset.github_events.dws_overview_by_repo_month dws
    JOIN bigdata_public_dataset.github_events.db_repos repos ON dws.repo_id = repos.id
    WHERE MONTH >= '2015-01' AND MONTH <='2022-12'
    GROUP BY dws.repo_id,
     repos.name
    ORDER BY stars DESC LIMIT 10;
  • Analysis result: In the last five years, elasticsearch was the most popular open source database project, followed by redis and prometheus.image.png

View the ranking changes of the top 10 most popular open source database projects over the last five years (2018 to 2022)

  • Query example:

    SET odps.namespace.schema = TRUE;
    SET odps.sql.validate.orderby.limit=FALSE;
     WITH tmp as
     (SELECT dws.repo_id AS repo_id, repos.name AS repo_name, SUM(dws.stars) AS stars, SUBSTR(MONTH,1,4) AS YEAR, row_number() over (partition BY SUBSTR(MONTH,1,4)
     ORDER BY SUM(dws.stars) DESC) AS ranknum
     FROM bigdata_public_dataset.github_events.dws_overview_by_repo_month dws
     JOIN bigdata_public_dataset.github_events.db_repos repos ON dws.repo_id = repos.id
     WHERE MONTH>='2018-01'
     GROUP BY dws.repo_id, repos.name,SUBSTR(MONTH,1,4))
    SELECT repo_id,
     repo_name,
     stars,
     ranknum,
     YEAR
    FROM tmp
    WHERE YEAR<=2022
     AND ranknum<=10
    ORDER BY YEAR ASC,ranknum ASC;
  • Analysis result: Over the last five years, the rankings of the top 10 open source database projects have shifted. clickhouse showed the fastest growth, rising from tenth place in 2018 to first place in 2021. In 2022, it was surpassed by the steadily rising redis. In contrast, taosdata/TDengine ranked first in 2019 but dropped to ninth place within a year.image.png

View the monthly star growth trend for the top 10 most popular open source database projects over the last five years (2018 to 2022)

  • Query example:

    SET odps.namespace.schema = true;
    SET odps.sql.validate.orderby.LIMIT=false;
    WITH top_10_repos AS (
     SELECT
     dws.repo_id AS repo_id, 
     repos.name AS repo_name, 
     SUM(dws.stars) AS stars
     FROM bigdata_public_dataset.github_events.dws_overview_by_repo_month dws
     JOIN bigdata_public_dataset.github_events.db_repos repos ON dws.repo_id = repos.id
     WHERE MONTH >= '2018-01' AND MONTH <='2022-12'
     GROUP BY dws.repo_id, repos.name
     ORDER BY stars DESC
     LIMIT 10
    ),
    tmp AS (
     SELECT
     MONTH,
     repo_id,
     stars,
     SUM(stars) OVER (PARTITION BY repo_id ORDER BY MONTH ASC ) AS total_stars
     FROM bigdata_public_dataset.github_events.dws_overview_by_repo_month
     WHERE MONTH >= '2015-01' AND stars IS NOT NULL
     AND repo_id IN (SELECT repo_id FROM top_10_repos)
     GROUP BY repo_id,MONTH,stars
     ORDER BY MONTH ASC,repo_id)
     
    SELECT
     tmp.MONTH AS MONTH,
     top_10_repos.repo_name AS repo_name,
     tmp.total_stars AS total_stars
    
     FROM tmp
     JOIN top_10_repos ON top_10_repos.repo_id = tmp.repo_id
     GROUP BY MONTH,repo_name,total_stars
     ORDER BY MONTH ASC,repo_name
    ;
  • Analysis result: Among the top 10 open source database projects, elasticsearch has consistently been the most popular. The clickhouse project has seen rapid growth since 2021.

    image.png

Which databases were most popular in the first half of 2023

  • Query example:

    set odps.namespace.schema = true;
    SELECT
     repos.name AS repo_name,
     sum(dws.stars) AS stars
    FROM bigdata_public_dataset.github_events.dws_overview_by_repo_month dws
    JOIN bigdata_public_dataset.github_events.db_repos repos ON repos.id = dws.repo_id
    WHERE month >= '2023-01' AND month <= '2023-06'
    GROUP BY repo_name
    ORDER BY stars DESC
    LIMIT 10;
  • Analysis result: In the first half of 2023, clickhouse was the most popular database, followed by prometheus and redis.image.png

Which databases were actively maintained and updated in the first half of 2023

  • Query example:

    SET odps.namespace.schema = true;
    SELECT
     repos.name AS repo_name,
     COUNT(dwd.id) AS num
    FROM bigdata_public_dataset.github_events.dwd_github_events_odps dwd
     JOIN bigdata_public_dataset.github_events.db_repos repos ON repos.id = dwd.repo_id
    WHERE type = 'PullRequestEvent'
     AND ds>='2023-01-01' and ds<='2023-06-30'
     AND action = 'opened'
    GROUP BY repos.name
    ORDER BY num DESC
    LIMIT 10;
  • Analysis result: In the first half of 2023, StarRocks was the most actively maintained and updated database. Project activity is defined by the number of created PullRequest events.

    image.png

Who were the main individual contributors to the most active open source database project in the first half of 2023

  • Query example:

    SET odps.namespace.schema=true; 
    WITH a AS 
    (
     SELECT repo_id
     ,repo_name
     ,actor_id
     ,actor_login
     ,COUNT(*) AS contribution
     ,ds
     FROM bigdata_public_dataset.github_events.dwd_github_events_odps
     WHERE ds >='2021-01-01' and ds<='2021-12-31'
     AND (
     (
     type = 'PullRequestEvent'
     AND action = 'opened'
     )
     OR (
     type = 'IssuesEvent'
     AND action = 'opened'
     )
     OR (
     type = 'IssueCommentEvent'
     AND action = 'created'
     )
     OR (
     type = 'PullRequestReviewEvent'
     AND action = 'created'
     )
     OR (
     type = 'PullRequestReviewCommentEvent'
     AND action = 'created'
     )
     OR (
     type = 'PushEvent'
     AND action IS NULL
     )
     )
     GROUP BY repo_id
     ,repo_name
     ,actor_id
     ,actor_login
     ,ds
    )
    SELECT repo_name
     ,actor_login
     ,SUM(contribution) AS contribution
    FROM a
    WHERE repo_name = 'StarRocks/starrocks'
    AND actor_login NOT LIKE '%[bot]'
    AND actor_login NOT LIKE 'cockroach%'
    
    GROUP BY repo_name
     ,actor_login
    ORDER BY contribution DESC
    LIMIT 10
    ;
  • Analysis result: In the first half of 2023, kangkaisen was the main individual contributor to the most active open source database project.image.png

Explore programming languages

Count the top ten most used programming languages in the last year

  • Query example:

    SET odps.namespace.schema=true; 
    SELECT
     language,
     count(*) total
    FROM
     bigdata_public_dataset.github_events.dwd_github_events_odps
    WHERE
     ds>=date_add(getdate(), -365)
     AND language IS NOT NULL
    GROUP BY
     language
    ORDER BY
     total DESC
    LIMIT 10;
  • Analysis result: The most used programming language was JavaScript, followed by TypeScript and Python.image.png

View the top ten most popular programming language projects over the last five years (2018 to 2022)

  • Query example:

    SET odps.namespace.schema = TRUE;
    
    SELECT dws.repo_id AS repo_id,
     repos.name AS repo_name,
     SUM(dws.stars) AS stars
    FROM bigdata_public_dataset.github_events.dws_overview_by_repo_month dws
    JOIN bigdata_public_dataset.github_events.programming_language_repos repos ON dws.repo_id = repos.id
    WHERE MONTH >= '2015-01'
    GROUP BY dws.repo_id,
     repos.name
    ORDER BY stars DESC LIMIT 10;
  • Analysis result: Over the last five years, the most popular programming language project was Go, which received 81,642 stars. It was followed by TypeScript and Node.

    image.png

View the monthly star growth trend for the top 10 most popular programming language projects over the last five years (2018 to 2022)

  • Query example:

    SET odps.namespace.schema = true;
    SET odps.sql.validate.orderby.limit=false;
    WITH top_10_repos AS (
     SELECT
     dws.repo_id AS repo_id, 
     repos.name AS repo_name, 
     SUM(dws.stars) AS stars
     FROM bigdata_public_dataset.github_events.dws_overview_by_repo_by_month dws
     JOIN bigdata_public_dataset.github_events.programming_language_repos repos ON dws.repo_id = repos.id
     WHERE month >= '2018-01' AND month <='2022-12'
     GROUP BY dws.repo_id, repos.name
     ORDER BY stars DESC
     LIMIT 10
    ),
    tmp AS (
     SELECT
     month,
     repo_id,
     stars,
     SUM(stars) OVER (PARTITION by repo_id ORDER BY month ASC ) AS total_stars
     FROM bigdata_public_dataset.github_events.dws_overview_by_repo_by_month
     WHERE month >= '2015-01' AND stars IS NOT NULL
     AND repo_id IN (SELECT repo_id FROM top_10_repos)
     GROUP BY repo_id,month,stars
     ORDER BY month ASC,repo_id)
     
    SELECT
     tmp.month AS month,
     top_10_repos.repo_name AS repo_name,
     tmp.total_stars AS total_stars
    
     FROM tmp
     JOIN top_10_repos ON top_10_repos.repo_id = tmp.repo_id
     GROUP BY month,repo_name,total_stars
     ORDER BY month ASC,repo_name
    ;
  • Analysis result: Over the last five years, Go was the programming language project with the fastest growth in popularity.image.png