This topic uses a small sample of GitHub data from the open source RedPajama project to demonstrate how to use the large language model (LLM) data processing components in PAI to clean and process GitHub code data.
Prerequisites
- You have created a workspace. For more information, see Create and manage a workspace.
- You have associated MaxCompute resources with the workspace. For more information, see Manage a workspace.
Dataset
This demonstration uses 5,000 data samples extracted from the raw GitHub data of the RedPajama open source project.
You can clean and process the data by following the steps in the Data processing workflow section. This process improves data quality and model training performance.
Data processing workflow
- Go to the Machine Learning Designer page.
  - Log on to the PAI console.
  - In the left-side navigation pane, click Workspaces. On the Workspaces page, click the name of the workspace that you want to manage.
  - In the left-side navigation pane, choose .
- Build the workflow.
  - On the Machine Learning Designer page, click the Preset Templates tab.
  - On the LLM tab, in the LLM Data Processing - GitHub Code section, click Create.
  - In the Create Workflow dialog box, configure the parameters and click OK. You can use the default values.
    The Workflow Data Storage parameter specifies the path of the OSS bucket used to store data that is generated when the workflow runs.
  - In the workflow list, double-click the target workflow to open it.
  - The system automatically builds a workflow based on the preset template, as shown in the following figure.

Component
Description
LLM-MaskSensitiveInfo-1
Masks sensitive information. For example:
- Replaces email addresses with [EMAIL].
- Replaces phone numbers with [TELEPHONE] or [MOBILEPHONE].
- Replaces ID card numbers with IDNUM.
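As a rough illustration of this kind of masking, the following Python sketch substitutes the placeholders listed above. The regular expressions, the function name `mask_sensitive_info`, and the use of [MOBILEPHONE] for 11-digit numbers are assumptions made for illustration. They are not the patterns that the PAI component uses.

```python
import re

# Illustrative patterns only (assumptions, not the component's actual rules).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Assumption: an 11-digit mobile number that starts with 1.
MOBILE_RE = re.compile(r"(?<!\d)1[3-9]\d{9}(?!\d)")
# Assumption: an 18-character ID card number (17 digits plus a digit or X).
IDNUM_RE = re.compile(r"(?<!\d)\d{17}[\dXx](?!\d)")


def mask_sensitive_info(text: str) -> str:
    """Replace emails, mobile numbers, and ID numbers with placeholders."""
    # Mask the longest numeric pattern first as a precaution against partial matches.
    text = IDNUM_RE.sub("IDNUM", text)
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = MOBILE_RE.sub("[MOBILEPHONE]", text)
    return text


if __name__ == "__main__":
    print(mask_sensitive_info("# Author: alice@example.com, tel 13812345678"))
```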
The following example shows the data in the content field before and after processing. The email address is replaced with [EMAIL].
- Before processing:

- After processing:

LLM-RemoveSpecialContent-1
Deletes URLs from the content field.
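A minimal sketch of URL deletion, assuming that a URL is any `http://` or `https://` prefix followed by non-whitespace characters. The pattern and the function name `remove_urls` are hypothetical; the component's actual matching rules may differ.

```python
import re

# Assumption: a URL runs from "http://" or "https://" to the next whitespace.
URL_RE = re.compile(r"https?://\S+")


def remove_urls(text: str) -> str:
    """Delete every substring that matches the assumed URL pattern."""
    return URL_RE.sub("", text)


if __name__ == "__main__":
    print(remove_urls("# see https://example.com/docs for details"))
```

Note that this simple pattern leaves the surrounding whitespace untouched, so a deleted URL in the middle of a line leaves two adjacent spaces.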
The following example shows the data in the content field before and after processing. The URL is deleted.
- Before processing:

- After processing:

LLM-NormalizeText-1
Applies Unicode normalization to the text in the content field.
The following example shows the data in the content field after processing. The text is normalized.
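The normalization form that the component applies is not stated here. As a sketch under the assumption that a compatibility form such as NFKC is used, Python's standard `unicodedata` module shows the effect: full-width characters and ligatures are mapped to their plain equivalents, and combining sequences are composed.

```python
import unicodedata


def normalize_text(text: str, form: str = "NFKC") -> str:
    """Apply Unicode normalization; the default form is an assumption."""
    return unicodedata.normalize(form, text)


if __name__ == "__main__":
    # Full-width letters and digits become ASCII under NFKC.
    print(normalize_text("\uff21\uff22\uff23\uff11\uff12\uff13"))
```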
LLM-RemoveCopyright-1
Deletes copyright information from the content field.
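One way to approximate this step is to drop a leading comment header when it mentions the word "copyright". The sketch below is a simplified heuristic written for illustration, not the component's implementation. It handles only a leading `/* ... */` block or a leading run of `#` or `//` line comments, and the function name `remove_copyright` is hypothetical.

```python
import re


def remove_copyright(text: str) -> str:
    """Drop a leading comment header if it mentions 'copyright'."""
    # Case 1: a leading /* ... */ block comment.
    block = re.match(r"\s*/\*.*?\*/", text, flags=re.DOTALL)
    if block and "copyright" in block.group(0).lower():
        return text[block.end():].lstrip("\n")

    # Case 2: a leading run of line comments that start with # or //.
    lines = text.split("\n")
    end = 0
    while end < len(lines) and lines[end].lstrip().startswith(("#", "//")):
        end += 1
    if any("copyright" in line.lower() for line in lines[:end]):
        return "\n".join(lines[end:]).lstrip("\n")

    # No copyright header found: return the sample unchanged.
    return text


if __name__ == "__main__":
    print(remove_copyright("# Copyright (c) Foo\n# Licensed under MIT\nimport os\n"))
```

A heuristic this coarse also removes other comments in the same leading block, such as a shebang line, so real pipelines usually apply stricter rules.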
The following example shows the data in the content field before and after processing. The copyright information is deleted.
- Before processing:

- After processing:

LLM-CountFilter-1
Removes samples whose content field does not meet the specified ratio of letters and digits. Because most characters in the GitHub code dataset are letters and digits, this component removes some dirty data.
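The ratio check can be sketched as follows. The threshold value of 0.25 and the function names are hypothetical, and `str.isalnum` also counts non-ASCII letters and digits, which may differ from how the component counts characters.

```python
def alnum_ratio(text: str) -> float:
    """Fraction of characters in the text that are letters or digits."""
    if not text:
        return 0.0
    return sum(ch.isalnum() for ch in text) / len(text)


def keep_by_alnum_ratio(text: str, min_ratio: float = 0.25) -> bool:
    """Keep the sample only if enough of it is letters and digits.

    The default threshold is an illustrative assumption.
    """
    return alnum_ratio(text) >= min_ratio


if __name__ == "__main__":
    print(keep_by_alnum_ratio("def add(a, b): return a + b"))
    print(keep_by_alnum_ratio("!!!!====----####"))
```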
The following list shows some of the removed data. A large amount of dirty data is removed.

LLM-LengthFilter-1
Filters samples based on the total length, average line length, and maximum line length of the content field. The average and maximum line lengths are calculated by splitting the sample by the line feed character ("\n").
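The three statistics can be computed as in the sketch below, which splits on "\n" as described. All threshold defaults and function names are hypothetical values chosen for illustration; the component's parameters should be set in the Designer console.

```python
def line_stats(text: str):
    """Return (total length, average line length, maximum line length)."""
    lengths = [len(line) for line in text.split("\n")]
    return len(text), sum(lengths) / len(lengths), max(lengths)


def keep_by_length(text: str, min_total: int = 10, max_total: int = 100_000,
                   max_avg_line: float = 100.0, max_line: int = 1000) -> bool:
    """Keep the sample only if all three statistics are within the assumed bounds."""
    total, avg_line, longest = line_stats(text)
    return (min_total <= total <= max_total
            and avg_line <= max_avg_line
            and longest <= max_line)


if __name__ == "__main__":
    print(line_stats("ab\ncdef"))
```

The total length includes the line feed characters, whereas the per-line lengths do not.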
The following list shows some of the removed samples. Much of the dirty code data that is too short or too long is removed.

LLM-FilterByNGramRepetitionRatio-1
Filters samples based on the character-level and word-level N-gram repetition ratio of the content field.
The component slides a window of size N over the text at the character or word level. This creates a sequence of N-length fragments, called grams. The component counts the occurrences of each gram. The repetition ratio is calculated as (total count of grams that appear more than once) / (total count of all grams). Samples are filtered based on this ratio.
Note: For word-level statistics, all words are converted to lowercase before the repetition ratio is calculated.
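The ratio described above can be sketched in a few lines of Python. This sketch reads "total count of grams that appear more than once" as the sum of occurrences of every gram whose count exceeds one, which is an interpretation. The function names and the whitespace-based word splitting are also assumptions.

```python
from collections import Counter


def ngram_repetition_ratio(tokens, n: int) -> float:
    """Occurrences of repeated n-grams divided by the number of all n-grams."""
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not grams:
        return 0.0  # The text is shorter than one window.
    counts = Counter(grams)
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(grams)


def char_repetition_ratio(text: str, n: int) -> float:
    """Character-level ratio: each character is a token."""
    return ngram_repetition_ratio(list(text), n)


def word_repetition_ratio(text: str, n: int) -> float:
    """Word-level ratio: words are lowercased before counting."""
    return ngram_repetition_ratio(text.lower().split(), n)


if __name__ == "__main__":
    # Grams of "abab" with n=2 are ab, ba, ab; "ab" occurs twice, so the ratio is 2/3.
    print(char_repetition_ratio("abab", 2))
```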
LLM-LengthFilter-2
This component splits a sample into a list of words based on spaces. It then filters the sample based on the length of the resulting list. This effectively filters samples by the number of words.
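A minimal sketch of the word-count filter follows. It uses `str.split()`, which splits on any run of whitespace rather than on single spaces only, and the bounds `min_words` and `max_words` are hypothetical parameters used for illustration.

```python
def word_count(text: str) -> int:
    """Number of whitespace-separated words in the sample."""
    return len(text.split())


def keep_by_word_count(text: str, min_words: int = 10, max_words: int = 100_000) -> bool:
    """Keep the sample only if its word count is within the assumed bounds."""
    return min_words <= word_count(text) <= max_words


if __name__ == "__main__":
    print(word_count("int main ( )"))
```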
LLM-DeduplicateByMinHash-1
This component removes similar text.
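The component name points to MinHash, a technique that estimates the Jaccard similarity of two texts by comparing short signatures instead of the full texts. The sketch below is a small pure-Python illustration of that idea, not the component's implementation. The shingle size, signature length, similarity threshold, and all function names are assumptions, and the pairwise comparison is quadratic, whereas large-scale pipelines typically add locality-sensitive hashing.

```python
import hashlib


def shingles(text: str, k: int = 3) -> set:
    """Set of k-word shingles (lowercased); short texts yield a single shingle."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def minhash_signature(text: str, num_perm: int = 64, k: int = 3) -> list:
    """For each of num_perm salted hash functions, keep the minimum shingle hash."""
    items = [s.encode("utf-8") for s in shingles(text, k)]
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # A distinct salt acts as a distinct hash function.
        signature.append(min(
            int.from_bytes(hashlib.blake2b(item, digest_size=8, salt=salt).digest(), "big")
            for item in items
        ))
    return signature


def estimated_jaccard(sig_a: list, sig_b: list) -> float:
    """Fraction of signature positions that agree."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def deduplicate(samples: list, threshold: float = 0.8) -> list:
    """Keep a sample only if it is not too similar to any sample kept earlier."""
    kept, kept_sigs = [], []
    for sample in samples:
        sig = minhash_signature(sample)
        if all(estimated_jaccard(sig, other) < threshold for other in kept_sigs):
            kept.append(sample)
            kept_sigs.append(sig)
    return kept
```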
- Click the Run button above the canvas to run the workflow.
- After the workflow runs successfully, right-click the Write to Data Table-1 component and choose from the shortcut menu.
The output shows the data samples that have been filtered and processed by all preceding components.
