Platform For AI: Clean GitHub code data for LLM training

Last Updated: Mar 10, 2026

This topic uses a small data sample from the open source RedPajama project on GitHub to demonstrate how to use the data processing components for large language models (LLMs) in PAI to clean and process GitHub code data.

Prerequisites

Dataset

This demonstration uses 5,000 data samples extracted from the raw GitHub data of the RedPajama open source project.

You can clean and process the data by following the steps in the Data processing workflow section. Cleaning is intended to improve data quality and, as a result, model training performance.

Data processing workflow

  1. Go to the Machine Learning Designer page.

    1. Log on to the PAI console.

    2. In the left-side navigation pane, click Workspaces. On the Workspaces page, click the name of the workspace that you want to manage.

    3. In the left-side navigation pane, choose Model Training > Visualized Modeling (Designer).

  2. Build the workflow.

    1. On the Machine Learning Designer page, click the Preset Templates tab.

    2. On the LLM tab, in the LLM Data Processing - GitHub Code section, click Create.

    3. In the Create Workflow dialog box, configure the parameters and click OK. You can use the default values.

      The Workflow Data Storage parameter specifies the path of the OSS bucket used to store data that is generated when the workflow runs.

    4. In the workflow list, double-click the target workflow to open it.

    5. The system automatically builds a workflow based on the preset template. The workflow contains the following components, each listed with its description.

      LLM-MaskSensitiveInfo-1

      Masks sensitive information. For example:

      • Replaces email addresses with [EMAIL].

      • Replaces phone numbers with [TELEPHONE] or [MOBILEPHONE].

      • Replaces ID card numbers with IDNUM.

      After processing, for instance, an email address in the content field appears as [EMAIL].
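The email case can be illustrated with a minimal Python sketch. It is not the component's implementation: the regular expression and the function name `mask_emails` are assumptions made for illustration, and only the [EMAIL] placeholder comes from the description above.

```python
import re

# Deliberately simple pattern (an assumption); real email matching has more edge cases.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def mask_emails(text: str) -> str:
    """Replace every email-like substring with the [EMAIL] placeholder."""
    return EMAIL_RE.sub("[EMAIL]", text)


sample = "# Maintainer: Jane Doe <jane.doe@example.com>"
print(mask_emails(sample))  # prints "# Maintainer: Jane Doe <[EMAIL]>"
```

Phone and ID card numbers would need their own patterns, which depend on the formats you expect in the data.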

      LLM-RemoveSpecialContent-1

      Deletes URLs from the content field.

      After processing, the URL no longer appears in the content field.
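As a rough illustration of URL removal, the following Python sketch deletes anything that looks like an http or https URL. The pattern and the name `remove_urls` are assumptions for this example; the component may recognize URLs differently.

```python
import re

# Assumed pattern: "http://" or "https://" followed by non-whitespace characters.
URL_RE = re.compile(r"https?://\S+")


def remove_urls(text: str) -> str:
    """Delete URL-like substrings and leave the surrounding text untouched."""
    return URL_RE.sub("", text)


print(remove_urls("// see https://example.com/docs for details"))
```

Because the pattern consumes every non-whitespace character after the scheme, punctuation attached to the end of a URL is removed with it.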

      LLM-NormalizeText-1

      Applies Unicode normalization to the text in the content field.

      After processing, the text in the content field is in normalized Unicode form.
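Unicode normalization can be sketched with Python's standard `unicodedata` module. The description does not say which normalization form the component applies, so the NFKC default below is an assumption, as is the function name `normalize_text`.

```python
import unicodedata


def normalize_text(text: str, form: str = "NFKC") -> str:
    """Apply a Unicode normalization form (NFKC here is an assumed default)."""
    return unicodedata.normalize(form, text)


# Full-width Latin letters are folded to their ASCII equivalents under NFKC.
print(normalize_text("\uff46\uff4f\uff4f"))  # prints "foo"
```

NFKC also folds compatibility characters such as full-width letters, which changes the visible text. NFC would be the more conservative choice if only canonical composition is wanted.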

      LLM-RemoveCopyright-1

      Deletes copyright information from the content field.

      After processing, the copyright information no longer appears in the content field.
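One plausible way to remove copyright information is to drop a leading comment header that mentions "copyright". The Python sketch below follows that heuristic. It is an assumption about the approach, not the component's actual logic, and `remove_copyright_header` is a hypothetical name.

```python
import re

# A /* ... */ block comment at the very start of the file.
LEADING_BLOCK_RE = re.compile(r"\A\s*/\*.*?\*/\s*", re.DOTALL)


def remove_copyright_header(code: str) -> str:
    """Drop a leading comment header if it mentions copyright."""
    # Case 1: leading block comment.
    match = LEADING_BLOCK_RE.match(code)
    if match and "copyright" in match.group(0).lower():
        return code[match.end():]

    # Case 2: leading run of line comments that all use the same marker.
    lines = code.splitlines(keepends=True)
    if not lines:
        return code
    first = lines[0].lstrip()
    if first.startswith("//"):
        marker = "//"
    elif first.startswith("#"):
        marker = "#"
    else:
        return code
    end = 0
    while end < len(lines) and lines[end].lstrip().startswith(marker):
        end += 1
    if "copyright" in "".join(lines[:end]).lower():
        return "".join(lines[end:])
    return code


print(remove_copyright_header("/* Copyright 2020 Foo */\nint main() {}\n"))
```

This heuristic removes the whole leading comment run, so a shebang line or other header comments in the same run are dropped as well.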

      LLM-CountFilter-1

      Removes samples whose content field does not meet the specified ratio of letters and digits. Because most characters in the GitHub code dataset are letters and digits, this component can remove some dirty data.

      In this example, a large amount of dirty data is removed.
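The ratio check can be sketched in Python as follows. The threshold value and the names `alnum_ratio` and `keep_by_alnum_ratio` are illustrative assumptions; the actual threshold is set in the component's parameters.

```python
def alnum_ratio(text: str) -> float:
    """Fraction of characters that are letters or digits (0.0 for empty text)."""
    if not text:
        return 0.0
    # str.isalnum is Unicode-aware, so non-ASCII letters and digits count too.
    return sum(ch.isalnum() for ch in text) / len(text)


def keep_by_alnum_ratio(text: str, min_ratio: float = 0.25) -> bool:
    """Keep a sample only if its ratio reaches min_ratio (an assumed value)."""
    return alnum_ratio(text) >= min_ratio


print(alnum_ratio("x = 1"))             # prints 0.4
print(keep_by_alnum_ratio("{{{{}}}}"))  # prints False
```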

      LLM-LengthFilter-1

      Filters samples based on the total length, average line length, and maximum line length of the content field. The average and maximum line lengths are calculated by splitting the sample by the line feed character ("\n").

      Much of the dirty code data that is too short or too long is removed.
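The three length statistics can be computed as in the Python sketch below. Splitting on "\n" follows the description above, while the function names and every threshold are hypothetical placeholders.

```python
def length_stats(text: str) -> dict:
    """Total length plus average and maximum line length, splitting on "\\n"."""
    line_lengths = [len(line) for line in text.split("\n")]
    return {
        "total": len(text),
        "avg_line": sum(line_lengths) / len(line_lengths),
        "max_line": max(line_lengths),
    }


def keep_by_length(text: str, min_total: int = 10, max_total: int = 100_000,
                   max_avg_line: float = 100.0, max_line: int = 1_000) -> bool:
    """Keep a sample only if all three statistics fall within the assumed limits."""
    stats = length_stats(text)
    return (min_total <= stats["total"] <= max_total
            and stats["avg_line"] <= max_avg_line
            and stats["max_line"] <= max_line)


print(length_stats("ab\ncdef"))  # prints {'total': 7, 'avg_line': 3.0, 'max_line': 4}
```

Note that `str.split("\n")` always returns at least one element, so the average is defined even for an empty sample.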

      LLM-FilterByNGramRepetitionRatio-1

      Filters samples based on the character-level and word-level N-gram repetition ratio of the content field.

      The component processes the text using a sliding window of size N at the character or word level. This creates a sequence of N-length fragments, called grams. The component counts the occurrences of each gram. The repetition ratio is calculated as (total count of grams that appear more than once) / (total count of all grams). Samples are filtered based on this ratio.

      Note

      For word-level statistics, all words are converted to lowercase before the repetition ratio is calculated.
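The repetition ratio defined above can be written out as a short Python sketch. The function names are hypothetical, and splitting words on whitespace is an assumption; the formula and the lowercase conversion for word-level statistics follow the description.

```python
from collections import Counter


def ngram_repetition_ratio(tokens: list, n: int) -> float:
    """(total count of grams that appear more than once) / (total count of all grams)."""
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not grams:
        return 0.0  # sample shorter than n: no grams, so nothing repeats
    counts = Counter(grams)
    repeated = sum(count for count in counts.values() if count > 1)
    return repeated / len(grams)


def char_repetition_ratio(text: str, n: int) -> float:
    """Character-level ratio: slide a window of n characters."""
    return ngram_repetition_ratio(list(text), n)


def word_repetition_ratio(text: str, n: int) -> float:
    """Word-level ratio: lowercase first, then slide a window of n words."""
    return ngram_repetition_ratio(text.lower().split(), n)


print(char_repetition_ratio("abcabc", 3))       # prints 0.5
print(word_repetition_ratio("a b a b a b", 2))  # prints 1.0
```

In the first call the grams are abc, bca, cab, and abc. Only abc repeats, and its two occurrences out of four grams give 0.5.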

      LLM-LengthFilter-2

      This component splits a sample into a list of words based on spaces. It then filters the sample based on the length of the resulting list. This effectively filters samples by the number of words.
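A word-count filter of this kind takes only a few lines of Python. The limits and function names below are assumptions, and `str.split()` splits on any run of whitespace, which may differ slightly from the component's space-based split.

```python
def word_count(text: str) -> int:
    """Number of whitespace-separated words in the sample."""
    return len(text.split())


def keep_by_word_count(text: str, min_words: int = 10, max_words: int = 100_000) -> bool:
    """Keep a sample only if its word count lies within the assumed limits."""
    return min_words <= word_count(text) <= max_words


print(word_count("def f(x): return x"))  # prints 4
```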

      LLM-DeduplicateByMinHash-1

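      This component removes samples whose text is similar to that of other samples (near-duplicates). As the component name indicates, the similarity check is based on MinHash.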
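To show the general idea behind MinHash deduplication, here is a self-contained Python sketch. It is a textbook-style illustration, not the component's implementation: the shingle size, the number of hash functions, the similarity threshold, and all function names are assumptions.

```python
import hashlib


def shingles(text: str, k: int = 5) -> set:
    """Set of k-word shingles (k is an assumed value)."""
    tokens = text.split()
    if len(tokens) < k:
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def minhash_signature(items: set, num_hashes: int = 64) -> list:
    """For each seeded hash function, keep the minimum hash value over all items."""
    signature = []
    for seed in range(num_hashes):
        prefix = f"{seed}:".encode("utf-8")
        signature.append(min(
            int.from_bytes(hashlib.sha1(prefix + item.encode("utf-8")).digest()[:8], "big")
            for item in items
        ))
    return signature


def estimated_similarity(sig_a: list, sig_b: list) -> float:
    """Fraction of matching positions, an estimate of the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def deduplicate(texts: list, threshold: float = 0.8) -> list:
    """Keep a text only if it is not too similar to any text kept so far."""
    kept, kept_signatures = [], []
    for text in texts:
        signature = minhash_signature(shingles(text))
        if all(estimated_similarity(signature, other) < threshold
               for other in kept_signatures):
            kept.append(text)
            kept_signatures.append(signature)
    return kept


docs = [
    "a b c d e f g h",
    "a b c d e f g h",
    "completely different tokens appear in this other text",
]
print(deduplicate(docs))  # the exact duplicate is dropped
```

The pairwise comparison here is quadratic in the number of samples. Large-scale pipelines typically pair MinHash with locality-sensitive hashing so that only likely duplicates are compared.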

  3. Click the Run button above the canvas to run the workflow.

  4. After the workflow runs successfully, right-click the Write to Data Table-1 component and choose View Data > Output from the shortcut menu.

    The output shows the data samples that have been filtered and processed by all preceding components.