LLM data processing algorithms let you edit and transform data samples, filter out low-quality samples, and identify and remove duplicate samples. You can combine different algorithms as needed to filter data, generate text that meets your requirements, and produce high-quality data for subsequent LLM training. This topic uses a small amount of data from the open source Alpaca-CoT project as an example to show how to use the PAI large model data processing components to clean and process SFT data. The DLC components support the distributed Ray framework for large-scale data processing, as well as an intelligent aggregation feature that improves processing efficiency and resource utilization and reduces unnecessary data storage operations. For more information, see Group large model data processing components by aggregation.
Dataset description
The 'LLM data processing - Alpaca-CoT (SFT data) - DLC component' preset template in Machine Learning Designer uses a dataset of 5,000 samples. These samples are extracted from the raw data of the open source Alpaca-CoT project.
Create and run a workflow
Go to the Machine Learning Designer page.
Log on to the PAI console.
In the upper-left corner of the page, select a region as needed.
In the navigation pane on the left, select Workspaces, and click the name of the target workspace.
In the navigation pane on the left, select Model Training > Visualized Modeling (Designer) to go to the Designer page.
Create a workflow.
On the Preset Templates tab, select Business Area > LLM, and click Create on the LLM Data Processing-Alpaca-CoT (SFT Data)-DLC template card.

Configure the workflow parameters (or keep the defaults), and click OK.
In the workflow list, select the created workflow, and click Open.
Workflow description:

Descriptions of key algorithm components in the workflow:
LLM-MD5 Deduplication (DLC)-1
Calculates the MD5 hash of the `text` field and removes duplicates. Only one sample is retained for each hash value.
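The idea behind MD5 deduplication can be sketched in a few lines of Python. This is an illustrative simplification, not the component's actual implementation: the function name and the dict-based sample layout are assumptions, and the real component runs distributed on Ray.

```python
import hashlib

def md5_dedup(samples):
    """Keep only the first sample for each distinct MD5 hash of the text field."""
    seen = set()
    kept = []
    for sample in samples:
        digest = hashlib.md5(sample["text"].encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(sample)
    return kept

samples = [{"text": "hello"}, {"text": "world"}, {"text": "hello"}]
print(md5_dedup(samples))  # only one copy of "hello" survives
```

Because the hash is exact, this step removes only byte-identical duplicates; near-duplicates are handled later by the SimHash component.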
LLM-Count Filter (DLC)-1
Removes samples whose `text` field does not meet the specified ratio of digits and letters. In an SFT dataset, most characters are letters or digits, so this component can remove some dirty data.
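A minimal sketch of this ratio check, assuming samples are dicts with a `text` field. The threshold value and function names are illustrative, not the component's actual parameters:

```python
def digit_letter_ratio(text):
    """Fraction of characters that are letters or digits."""
    if not text:
        return 0.0
    return sum(c.isalnum() for c in text) / len(text)

def ratio_filter(samples, min_ratio=0.7):
    """Drop samples whose letter/digit ratio falls below min_ratio (threshold is illustrative)."""
    return [s for s in samples if digit_letter_ratio(s["text"]) >= min_ratio]

print(digit_letter_ratio("a!!!"))  # 1 alphanumeric char out of 4 -> 0.25
```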
LLM-N-Gram Repetition Ratio Filter (DLC)-1
Filters samples based on the character-level N-gram repetition ratio of the `text` field. The component slides a window of size N over the characters of the text, producing a sequence of segments of length N. Each segment is a gram. The component counts the occurrences of each gram and then filters samples based on the repetition ratio, which is calculated as: (total frequency of grams that appear more than once) / (total frequency of all grams). For example, for the text "aaab" with N=2, the grams are "aa", "aa", and "ab"; the gram "aa" appears twice, so the ratio is 2/3.
LLM-Sensitive Word Filter (DLC)-1
Uses the system's preset sensitive word file to filter out samples whose `text` field contains sensitive words.
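The filtering logic can be sketched as a substring scan against a word list. This is a simplified illustration: the real component loads its preset sensitive word file, and its matching strategy (for example, tokenized matching or an Aho-Corasick automaton for large word lists) is not specified here, so the case-insensitive substring match below is an assumption.

```python
def sensitive_word_filter(samples, sensitive_words):
    """Drop samples whose text contains any listed word (case-insensitive substring match)."""
    lowered = [w.lower() for w in sensitive_words]
    return [
        s for s in samples
        if not any(w in s["text"].lower() for w in lowered)
    ]
```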
LLM-Length Filter (DLC)-1
Filters samples based on the total length of the `text` field and its maximum line length. The maximum line length is the length of the longest line after the sample is split at line feed characters (\n).
LLM-SimHash Similarity Deduplication (DLC)-1
Removes similar samples based on the configured `window_size`, `num_blocks`, and `hamming_distance` values.
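SimHash deduplication can be sketched as follows: build a fingerprint from character n-gram features (`window_size`), then drop any sample whose fingerprint is within `hamming_distance` of an already kept one. This is a simplified single-machine illustration; the real component also uses `num_blocks` to partition fingerprint bits for fast candidate lookup, which is omitted here, so the O(n²) scan and the default values are assumptions.

```python
import hashlib

def simhash(text, window_size=4, num_bits=64):
    """Compute a SimHash fingerprint from character n-gram features."""
    weights = [0] * num_bits
    grams = [text[i:i + window_size] for i in range(max(1, len(text) - window_size + 1))]
    for gram in grams:
        h = int(hashlib.md5(gram.encode("utf-8")).hexdigest(), 16) & ((1 << num_bits) - 1)
        for bit in range(num_bits):
            weights[bit] += 1 if (h >> bit) & 1 else -1
    return sum(1 << bit for bit in range(num_bits) if weights[bit] > 0)

def hamming_distance(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")

def simhash_dedup(samples, window_size=4, hamming_threshold=4):
    """Drop samples whose fingerprint is near an already kept one (naive O(n^2) scan)."""
    kept, fingerprints = [], []
    for s in samples:
        fp = simhash(s["text"], window_size)
        if all(hamming_distance(fp, f) > hamming_threshold for f in fingerprints):
            kept.append(s)
            fingerprints.append(fp)
    return kept
```

Unlike the MD5 component, which removes only byte-identical text, this step also removes near-duplicates whose fingerprints differ in at most `hamming_threshold` bits.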
Run the workflow.
After the run is complete, right-click the LLM-SimHash Similarity Deduplication (DLC)-1 component and select View Data > Output Data (OSS) to view the sample file processed by all the preceding components.

References
For more information about the LLM algorithm components, see LLM data processing (DLC).