All Products
Search
Document Center

MaxCompute:GitHub public event data

Last Updated:Mar 26, 2026

MaxCompute provides access to GitHub public event data sourced from GH Archive, a project that archives public GitHub activity by the hour. Use this dataset to analyze open source ecosystems, track programming language trends, and measure project activity across millions of repositories.

This dataset is for product testing only. Data accuracy is not guaranteed. Do not use it in a production environment.

Available tables

The dataset is stored in the github_events schema of the BIGDATA_PUBLIC_DATASET project. All tables are accessible via cross-project SQL queries.

Table nameDescriptionUpdate cycle
dwd_github_events_odpsFact table — one row per GitHub public eventT+1 hour
dws_overview_by_repo_monthAggregate table — monthly metrics per repositoryT+1 day
db_reposIDs and names of open source database projects
programming_language_reposIDs and names of open source programming language projects

The db_repos and programming_language_repos project lists are sourced from ossinsight.

To learn how this data is generated, see Integrated offline and real-time data processing based on the GitHub public event dataset.

Table schemas

dwd_github_events_odps

This fact table stores one record per GitHub public event, updated every T+1 hour.

Field nameData typeDescription
idBIGINTThe event ID.
actor_idBIGINTThe ID of the event initiator.
actor_loginSTRINGThe login name of the event initiator.
repo_idBIGINTThe repository ID.
repo_nameSTRINGThe repository name, in owner/repository_name format.
org_idBIGINTThe ID of the organization that owns the repository.
org_loginSTRINGThe name of the organization that owns the repository.
typeSTRINGThe event type. For a full list, see GitHub event types.
created_atDATETIMEThe time when the event occurred.
actionSTRINGThe event action (for example, opened or created).
iss_or_pr_idBIGINTThe issue or pull request ID.
numberBIGINTThe issue or pull request sequence number.
comment_idBIGINTThe comment ID.
commit_idSTRINGThe commit ID.
member_idBIGINTThe member ID.
rev_or_push_or_rel_idBIGINTThe review, push, or release ID.
refSTRINGThe name of the created or deleted resource.
ref_typeSTRINGThe type of the created or deleted resource.
stateSTRINGThe status of the issue, pull request, or pull request review.
author_associationSTRINGThe relationship between the event initiator and the repository.
languageSTRINGThe programming language of the code in the merge request.
mergedBOOLEANWhether the pull request was merged.
merged_atDATETIMEThe time when the code was merged.
additionsBIGINTThe number of lines added.
deletionsBIGINTThe number of lines deleted.
changed_filesBIGINTThe number of files changed by the pull request.
push_sizeBIGINTThe total number of commits in the push.
push_distinct_sizeBIGINTThe number of distinct commits in the push.
hrSTRINGThe hour when the event occurred. For example, an event at 00:23 has the value 00.
monthSTRINGThe month when the event occurred, in yyyy-MM format. For example, October 2015 is 2015-10.
yearSTRINGThe year when the event occurred. For example, 2015.
dsSTRINGThe date when the event occurred, in yyyy-MM-dd format.

dws_overview_by_repo_month

This aggregate table stores monthly metrics per repository, updated every T+1 day.

Field nameData typeDescription
repo_idBIGINTThe repository ID.
repo_nameSTRINGThe repository name, in owner/repository_name format.
starsBIGINTThe number of stars the repository received that month.
commitsBIGINTThe number of commits to the repository that month.
pushesBIGINTThe number of pushes to the repository that month.
total_prsBIGINTThe total number of pull requests that month.
pr_creatorsBIGINTThe number of pull request creators that month.
pr_reviewsBIGINTThe number of pull request reviews that month.
pr_reviewersBIGINTThe number of pull request reviewers that month.
total_issuesBIGINTThe total number of issues that month.
forksBIGINTThe number of forks that month.
monthSTRINGThe month, in yyyy-MM format. For example, October 2015 is 2015-10.

Available regions

RegionRegion ID
China (Hangzhou)cn-hangzhou
China (Shanghai)cn-shanghai
China (Beijing)cn-beijing
China (Zhangjiakou)cn-zhangjiakou
China (Ulanqab)cn-wulanchabu
China (Shenzhen)cn-shenzhen
China (Chengdu)cn-chengdu

Usage notes

Public datasets are stored in the BIGDATA_PUBLIC_DATASET project. Because you are not a member of that project, all queries use cross-project access. Keep the following in mind:

  • Specify the full table path in SQL. Prefix every table name with the project and schema: bigdata_public_dataset.github_events.<table_name>.

  • Enable schema syntax. If tenant-level schema syntax is not enabled for your account, add the following line at the start of every SQL script:

    SET odps.namespace.schema = true;
  • Storage is free; queries are billed. You are not charged for storing public dataset data, but computing resources consumed by your queries are billed. For details, see Computing fees (pay-as-you-go).

  • Data Map is unavailable for public datasets. Cross-project access means you cannot browse these tables in the DataWorks Data Map.

  • DataWorks DataAnalysis requires SQL. If tenant-level schema syntax is not enabled, query the data by running SQL statements directly instead of using the DataAnalysis visual interface.

For more information about schema operations, see Schema operations.

Query the dataset

Prerequisites

Before you begin, ensure that you have:

Supported tools

Run queries using any of the following tools:

Run your first query

The following example retrieves 100 rows from the fact table for a specific date:

-- Enable session-level schema syntax if tenant-level syntax is not enabled.
SET odps.namespace.schema = true;

-- Query 100 records from May 10, 2024.
SELECT *
FROM bigdata_public_dataset.github_events.dwd_github_events_odps
WHERE ds = '2024-05-10'
LIMIT 100;

Example queries and analysis

The following examples use the dwd_github_events_odps and dws_overview_by_repo_month tables to answer common questions about open source databases and programming languages. All examples require schema syntax to be enabled.

Open source databases

This group of queries uses the db_repos reference table to scope results to known open source database projects, covering popularity rankings, star trends, and contributor activity.

Which open source databases were most popular from 2018 to 2022?

SET odps.namespace.schema = TRUE;

SELECT
  dws.repo_id AS repo_id,
  repos.name AS repo_name,
  SUM(dws.stars) AS stars
FROM bigdata_public_dataset.github_events.dws_overview_by_repo_month dws
JOIN bigdata_public_dataset.github_events.db_repos repos ON dws.repo_id = repos.id
WHERE month >= '2015-01' AND month <= '2022-12'
GROUP BY dws.repo_id, repos.name
ORDER BY stars DESC
LIMIT 10;

Analysis result: elasticsearch was the most popular open source database project, followed by redis and prometheus.

image.png

How did the rankings of the top database projects change each year from 2018 to 2022?

SET odps.namespace.schema = TRUE;
SET odps.sql.validate.orderby.limit = FALSE;

WITH tmp AS (
  SELECT
    dws.repo_id AS repo_id,
    repos.name AS repo_name,
    SUM(dws.stars) AS stars,
    SUBSTR(month, 1, 4) AS year,
    ROW_NUMBER() OVER (PARTITION BY SUBSTR(month, 1, 4) ORDER BY SUM(dws.stars) DESC) AS ranknum
  FROM bigdata_public_dataset.github_events.dws_overview_by_repo_month dws
  JOIN bigdata_public_dataset.github_events.db_repos repos ON dws.repo_id = repos.id
  WHERE month >= '2018-01'
  GROUP BY dws.repo_id, repos.name, SUBSTR(month, 1, 4)
)
SELECT repo_id, repo_name, stars, ranknum, year
FROM tmp
WHERE year <= 2022 AND ranknum <= 10
ORDER BY year ASC, ranknum ASC;

Analysis result: clickhouse showed the fastest growth of any project, rising from tenth place in 2018 to first place in 2021. In 2022, it was overtaken by the steadily rising redis. taosdata/TDengine reached first place in 2019 but dropped to ninth within a year.

image.png

How did monthly star counts for the top 10 database projects trend from 2018 to 2022?

SET odps.namespace.schema = true;
SET odps.sql.validate.orderby.limit = false;

WITH top_10_repos AS (
  SELECT
    dws.repo_id AS repo_id,
    repos.name AS repo_name,
    SUM(dws.stars) AS stars
  FROM bigdata_public_dataset.github_events.dws_overview_by_repo_month dws
  JOIN bigdata_public_dataset.github_events.db_repos repos ON dws.repo_id = repos.id
  WHERE month >= '2018-01' AND month <= '2022-12'
  GROUP BY dws.repo_id, repos.name
  ORDER BY stars DESC
  LIMIT 10
),
tmp AS (
  SELECT
    month,
    repo_id,
    stars,
    SUM(stars) OVER (PARTITION BY repo_id ORDER BY month ASC) AS total_stars
  FROM bigdata_public_dataset.github_events.dws_overview_by_repo_month
  WHERE month >= '2015-01' AND stars IS NOT NULL
    AND repo_id IN (SELECT repo_id FROM top_10_repos)
  GROUP BY repo_id, month, stars
  ORDER BY month ASC, repo_id
)
SELECT
  tmp.month AS month,
  top_10_repos.repo_name AS repo_name,
  tmp.total_stars AS total_stars
FROM tmp
JOIN top_10_repos ON top_10_repos.repo_id = tmp.repo_id
GROUP BY month, repo_name, total_stars
ORDER BY month ASC, repo_name;

Analysis result: elasticsearch was consistently the most starred database project. clickhouse saw rapid cumulative growth starting in 2021.

image.png

Which databases were most popular in the first half of 2023?

SET odps.namespace.schema = true;

SELECT
  repos.name AS repo_name,
  SUM(dws.stars) AS stars
FROM bigdata_public_dataset.github_events.dws_overview_by_repo_month dws
JOIN bigdata_public_dataset.github_events.db_repos repos ON repos.id = dws.repo_id
WHERE month >= '2023-01' AND month <= '2023-06'
GROUP BY repo_name
ORDER BY stars DESC
LIMIT 10;

Analysis result: clickhouse was the most popular database in the first half of 2023, followed by prometheus and redis.

image.png

Which databases were most actively maintained in the first half of 2023?

This query measures activity by counting opened pull request events, which reflects ongoing development work.

SET odps.namespace.schema = true;

SELECT
  repos.name AS repo_name,
  COUNT(dwd.id) AS num
FROM bigdata_public_dataset.github_events.dwd_github_events_odps dwd
JOIN bigdata_public_dataset.github_events.db_repos repos ON repos.id = dwd.repo_id
WHERE type = 'PullRequestEvent'
  AND ds >= '2023-01-01' AND ds <= '2023-06-30'
  AND action = 'opened'
GROUP BY repos.name
ORDER BY num DESC
LIMIT 10;

Analysis result: StarRocks was the most actively maintained open source database project in the first half of 2023.

image.png

Who were the top individual contributors to the most active database project?

This query counts contribution events — pull requests, issues, comments, reviews, and pushes — per contributor for a given repository and time range.

SET odps.namespace.schema = true;

WITH a AS (
  SELECT
    repo_id,
    repo_name,
    actor_id,
    actor_login,
    COUNT(*) AS contribution,
    ds
  FROM bigdata_public_dataset.github_events.dwd_github_events_odps
  WHERE ds >= '2021-01-01' AND ds <= '2021-12-31'
    AND (
      (type = 'PullRequestEvent' AND action = 'opened')
      OR (type = 'IssuesEvent' AND action = 'opened')
      OR (type = 'IssueCommentEvent' AND action = 'created')
      OR (type = 'PullRequestReviewEvent' AND action = 'created')
      OR (type = 'PullRequestReviewCommentEvent' AND action = 'created')
      OR (type = 'PushEvent' AND action IS NULL)
    )
  GROUP BY repo_id, repo_name, actor_id, actor_login, ds
)
SELECT
  repo_name,
  actor_login,
  SUM(contribution) AS contribution
FROM a
WHERE repo_name = 'StarRocks/starrocks'
  AND actor_login NOT LIKE '%[bot]'
  AND actor_login NOT LIKE 'cockroach%'
GROUP BY repo_name, actor_login
ORDER BY contribution DESC
LIMIT 10;

Analysis result: kangkaisen was the top individual contributor to the StarRocks/starrocks repository in 2021.

image.png

Programming languages

This group of queries uses the programming_language_repos reference table to scope results to language runtime and compiler projects, answering questions about usage and popularity trends.

What were the top 10 most used programming languages over the past year?

SET odps.namespace.schema = true;

SELECT
  language,
  COUNT(*) AS total
FROM bigdata_public_dataset.github_events.dwd_github_events_odps
WHERE ds >= DATE_ADD(GETDATE(), -365)
  AND language IS NOT NULL
GROUP BY language
ORDER BY total DESC
LIMIT 10;

Analysis result: JavaScript was the most used language, followed by TypeScript and Python.

image.png

Which programming language projects received the most stars from 2018 to 2022?

SET odps.namespace.schema = TRUE;

SELECT
  dws.repo_id AS repo_id,
  repos.name AS repo_name,
  SUM(dws.stars) AS stars
FROM bigdata_public_dataset.github_events.dws_overview_by_repo_month dws
JOIN bigdata_public_dataset.github_events.programming_language_repos repos ON dws.repo_id = repos.id
WHERE month >= '2015-01'
GROUP BY dws.repo_id, repos.name
ORDER BY stars DESC
LIMIT 10;

Analysis result: Go was the most popular programming language project with 81,642 stars, followed by TypeScript and Node.

image.png

How did cumulative star counts for the top 10 language projects grow month by month?

SET odps.namespace.schema = true;
SET odps.sql.validate.orderby.limit = false;

WITH top_10_repos AS (
  SELECT
    dws.repo_id AS repo_id,
    repos.name AS repo_name,
    SUM(dws.stars) AS stars
  FROM bigdata_public_dataset.github_events.dws_overview_by_repo_month dws
  JOIN bigdata_public_dataset.github_events.programming_language_repos repos ON dws.repo_id = repos.id
  WHERE month >= '2018-01' AND month <= '2022-12'
  GROUP BY dws.repo_id, repos.name
  ORDER BY stars DESC
  LIMIT 10
),
tmp AS (
  SELECT
    month,
    repo_id,
    stars,
    SUM(stars) OVER (PARTITION BY repo_id ORDER BY month ASC) AS total_stars
  FROM bigdata_public_dataset.github_events.dws_overview_by_repo_month
  WHERE month >= '2015-01' AND stars IS NOT NULL
    AND repo_id IN (SELECT repo_id FROM top_10_repos)
  GROUP BY repo_id, month, stars
  ORDER BY month ASC, repo_id
)
SELECT
  tmp.month AS month,
  top_10_repos.repo_name AS repo_name,
  tmp.total_stars AS total_stars
FROM tmp
JOIN top_10_repos ON top_10_repos.repo_id = tmp.repo_id
GROUP BY month, repo_name, total_stars
ORDER BY month ASC, repo_name;

Analysis result: Go had the fastest cumulative star growth of any programming language project over the five-year period.

image.png