Large language model (LLM) data processing algorithms allow you to edit and transform data samples, filter low-quality samples, and identify and remove duplicate samples. You can combine different algorithms to filter data and generate text that meets your requirements. This process provides high-quality data for subsequent LLM training. This topic uses a small dataset from the open-source RedPajama-Data project as an example to demonstrate how to use the LLM data processing components provided by PAI to clean and process GitHub code data.
The DLC component supports large-scale data processing using the distributed Ray framework and features intelligent aggregation. This enables efficient data processing and resource utilization, and reduces unnecessary data storage operations. For more information, see Group and aggregate large model data processing components.
Dataset description
The dataset used in the "Data processing for LLM (GitHub code) - DLC component" preset template in Machine Learning Designer consists of 5,000 samples extracted from the raw data of the open-source RedPajama-Data project.
Create and run a workflow
Go to the Machine Learning Designer page.
Log on to the PAI console.
In the upper-left corner, select a region.
In the left navigation pane, choose Workspaces, and click the name of the target workspace.
In the left navigation pane, choose Model Development And Training > Visualized Modeling (Designer) to open the Machine Learning Designer page.
Create a workflow.
On the Preset Templates tab, choose Business Area > LLM. On the Data Processing for LLM (Github Code) - DLC Component template card, click Create.

Configure the workflow parameters, or keep the default settings, and then click OK.
In the workflow list, select the created workflow and click Open.
Workflow description:

The key algorithm components in the workflow are described as follows:
LLM-Sensitive Content Mask (DLC)-1
Masks sensitive information in the "content" field. For example:
Replaces email addresses with [EMAIL].
Replaces mobile phone numbers with [TELEPHONE] or [MOBILEPHONE].
Replaces ID card numbers with IDNUM.
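The masking behavior described above can be sketched with regular expressions. This is only an illustration, not the component's actual implementation: the patterns, the assumption that mobile numbers are 11-digit numbers starting with 1, the assumption that ID card numbers have 18 characters, and the function name mask_sensitive are all hypothetical.

```python
import re

# Illustrative patterns only; the component's real matching rules are not
# documented in this topic.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Assumption: an 11-digit mobile number that starts with 1.
MOBILE_RE = re.compile(r"(?<!\d)1[3-9]\d{9}(?!\d)")
# Assumption: an 18-character ID number (17 digits plus a digit or X).
IDNUM_RE = re.compile(r"(?<!\d)\d{17}[\dXx](?!\d)")


def mask_sensitive(text: str) -> str:
    """Replace ID numbers, mobile numbers, and emails with placeholder tags."""
    # The digit lookarounds keep the mobile pattern from matching inside
    # a longer run of digits such as an ID number.
    text = IDNUM_RE.sub("IDNUM", text)
    text = MOBILE_RE.sub("[MOBILEPHONE]", text)
    text = EMAIL_RE.sub("[EMAIL]", text)
    return text


if __name__ == "__main__":
    print(mask_sensitive("mail a.b@example.com or call 13812345678"))
```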
LLM-Clean Special Content (DLC)-1
Deletes URL links from the "content" field.
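As a rough sketch of URL removal, the following assumes that a URL link is an http, https, or ftp scheme followed by a run of characters that ends at whitespace, a quote, or a closing bracket. The pattern and the function name remove_urls are assumptions made for illustration; the component may use different rules.

```python
import re

# Assumption: a URL starts with http://, https://, or ftp:// and ends at
# whitespace, a quote, an angle bracket, or a closing bracket.
URL_RE = re.compile(r"(?:https?|ftp)://[^\s<>\"')\]]+")


def remove_urls(text: str) -> str:
    """Delete every substring that matches the URL pattern."""
    return URL_RE.sub("", text)


if __name__ == "__main__":
    print(remove_urls("# docs: (http://x.org/y)"))
```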
LLM-Text Normalizer (DLC)-1
Performs Unicode normalization on the text in the "content" field.
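Unicode normalization can be illustrated with the Python standard library. This topic does not state which normalization form the component applies, so the NFKC default below is an assumption, and normalize_text is a hypothetical helper name.

```python
import unicodedata


def normalize_text(text: str, form: str = "NFKC") -> str:
    """Return the text in the given Unicode normalization form.

    Assumption: NFKC is used here only as an example. It folds
    compatibility characters (such as full-width letters) into their
    canonical equivalents and composes combining sequences.
    """
    return unicodedata.normalize(form, text)


if __name__ == "__main__":
    # Full-width letters, digits, and parentheses become ASCII under NFKC.
    print(normalize_text("ｆｏｏ（１）"))
```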
LLM-Clean Copyright Information (DLC)-1
Deletes copyright information from the "content" field.
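One possible way to drop copyright information from source files is to remove a leading comment header that mentions the word "copyright". The sketch below follows that assumption. The comment prefixes, the header-only scope, and the function name clean_copyright are hypothetical and may differ from what the component does.

```python
import re

# Assumption: copyright notices sit in a comment header at the top of the file.
BLOCK_COMMENT_RE = re.compile(r"\A\s*/\*.*?\*/", re.DOTALL)
LINE_COMMENT_PREFIXES = ("//", "#", "--")


def clean_copyright(text: str) -> str:
    """Remove a leading comment header if it mentions copyright."""
    # Case 1: the file opens with a /* ... */ block that mentions copyright.
    match = BLOCK_COMMENT_RE.match(text)
    if match and "copyright" in match.group(0).lower():
        return text[match.end():].lstrip("\n")

    # Case 2: the file opens with a run of line comments that mentions copyright.
    lines = text.split("\n")
    end = 0
    while end < len(lines) and lines[end].lstrip().startswith(LINE_COMMENT_PREFIXES):
        end += 1
    header = "\n".join(lines[:end])
    if "copyright" in header.lower():
        return "\n".join(lines[end:]).lstrip("\n")

    # No copyright header found: return the text unchanged.
    return text


if __name__ == "__main__":
    print(clean_copyright("/* Copyright 2020 Foo */\nint main() {}\n"))
```

Note that this sketch removes the whole leading comment run, so a shebang line that sits directly above a copyright notice would be removed as well.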
LLM-Count Filter (DLC)-1
Removes samples whose "content" field does not meet the specified ratio of alphanumeric characters or the specified ratio of alphabetic characters to text tokens. Because most characters in a GitHub code dataset are letters and digits, this filter removes some dirty data.
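The two ratios can be sketched as follows. The thresholds, the whitespace-based tokenization, and the function names are assumptions chosen for illustration; the component's actual tokenizer and default values are not described in this topic.

```python
def alnum_ratio(text: str) -> float:
    """Fraction of characters that are letters or digits."""
    if not text:
        return 0.0
    return sum(ch.isalnum() for ch in text) / len(text)


def alpha_token_ratio(text: str) -> float:
    """Number of alphabetic characters divided by the number of tokens.

    Assumption: tokens are whitespace-separated pieces of text.
    """
    tokens = text.split()
    if not tokens:
        return 0.0
    return sum(ch.isalpha() for ch in text) / len(tokens)


def keep_sample(sample: dict,
                min_alnum_ratio: float = 0.25,
                min_alpha_token_ratio: float = 1.5) -> bool:
    """Keep a sample only if both ratios reach the hypothetical thresholds."""
    text = sample["content"]
    return (alnum_ratio(text) >= min_alnum_ratio
            and alpha_token_ratio(text) >= min_alpha_token_ratio)


if __name__ == "__main__":
    print(keep_sample({"content": "def add(a, b): return a + b"}))
```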
LLM-Length Filter (DLC)-1
Filters samples in the "content" field based on the total text length, the average line length, and the maximum line length. The average and maximum line lengths are calculated after the sample is split on the line feed character \n.
LLM-N-Gram Repetition Filter (DLC)-1
Filters samples in the "content" field based on the character-level and word-level N-gram repetition ratio. For word-level ratios, all words are converted to lowercase before the repetition is calculated. The component applies a sliding window of size N to the text to create a sequence of segments of length N. Each segment is called a gram. The component counts the occurrences of all grams. Finally, samples are filtered based on the repetition ratio, which is calculated as:
(Total frequency of grams that appear more than once) / (Total frequency of all grams).
LLM-Length Filter (DLC)-2
Filters samples based on the length of the "content" field.
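The line-length statistics used by the length filters and the repetition ratio used by the N-gram repetition filter can be sketched as follows. The function names are hypothetical, and splitting words on whitespace is an assumption. The ratio itself follows the formula given for LLM-N-Gram Repetition Filter (DLC)-1.

```python
from collections import Counter


def line_length_stats(text: str) -> tuple:
    """Return (total length, average line length, maximum line length)."""
    lengths = [len(line) for line in text.split("\n")]
    return len(text), sum(lengths) / len(lengths), max(lengths)


def ngram_repetition_ratio(units: list, n: int) -> float:
    """(Total frequency of grams seen more than once) / (total frequency of all grams)."""
    # Slide a window of size n over the sequence to build the grams.
    grams = [tuple(units[i:i + n]) for i in range(len(units) - n + 1)]
    if not grams:
        return 0.0
    counts = Counter(grams)
    repeated = sum(count for count in counts.values() if count > 1)
    return repeated / len(grams)


def char_ngram_repetition(text: str, n: int) -> float:
    """Character-level ratio: every character is one unit."""
    return ngram_repetition_ratio(list(text), n)


def word_ngram_repetition(text: str, n: int) -> float:
    """Word-level ratio: words are lowercased first, as described above."""
    return ngram_repetition_ratio(text.lower().split(), n)


if __name__ == "__main__":
    print(line_length_stats("ab\ncdef"))
    print(word_ngram_repetition("The cat the cat sat", 2))
```

A sample would then be kept or dropped by comparing these values with the configured minimum and maximum thresholds.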
LLM-Document Deduplicator (DLC)-1
Removes similar samples based on the configured values for window_size, num_blocks, and hamming_distance.
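The parameter names suggest a SimHash-style near-duplicate check, although this topic does not confirm which algorithm the component uses. Under that assumption, the sketch below builds a 64-bit fingerprint from shingles of window_size tokens and treats two samples as similar when their fingerprints differ in at most hamming_distance bits. The function names, the default values, and the use of MD5 as the shingle hash are hypothetical.

```python
import hashlib


def simhash(text: str, window_size: int = 6, bits: int = 64) -> int:
    """Compute a SimHash-style fingerprint (assumed algorithm, for illustration)."""
    tokens = text.split()
    # Shingles: sliding windows of window_size tokens (one shingle if the text is shorter).
    shingles = [" ".join(tokens[i:i + window_size])
                for i in range(max(1, len(tokens) - window_size + 1))]
    weights = [0] * bits
    for shingle in shingles:
        digest = hashlib.md5(shingle.encode("utf-8")).digest()
        value = int.from_bytes(digest[:bits // 8], "big")
        # Each shingle votes +1 or -1 on every bit position.
        for bit in range(bits):
            weights[bit] += 1 if (value >> bit) & 1 else -1
    fingerprint = 0
    for bit in range(bits):
        if weights[bit] > 0:
            fingerprint |= 1 << bit
    return fingerprint


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(a: str, b: str,
                      window_size: int = 6,
                      hamming_distance: int = 3) -> bool:
    """Treat two texts as similar if their fingerprints are close enough."""
    return hamming(simhash(a, window_size), simhash(b, window_size)) <= hamming_distance


if __name__ == "__main__":
    print(is_near_duplicate("int main() { return 0; }", "int main() { return 0; }"))
```

In SimHash-based systems, a fingerprint is commonly split into blocks so that candidate pairs can be found by exact match on a block instead of comparing every pair. The num_blocks parameter may play that role here, but the sketch omits this indexing step.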
Run the workflow.
After the workflow finishes running, right-click the LLM-Document Deduplicator (DLC)-1 component and choose View Data > Output Data (OSS) to view the sample files processed by the preceding components.

References
For more information about the LLM algorithm components, see LLM Data Processing (DLC).