
Platform For AI: Data processing for LLM (web text data from Wikipedia)

Last Updated: Dec 09, 2025

LLM data processing components let you edit, transform, filter, and deduplicate data samples. You can combine components as needed to select high-quality data and produce text that meets your requirements, which provides quality training data for large language models (LLMs). This topic uses a small amount of data from the open-source RedPajama Wikipedia dataset to demonstrate how to clean and process Wikipedia web text with PAI's LLM data processing components.

Dataset description

The "LLM Data Processing-Wikipedia (web text data)" preset template in Machine Learning Designer uses a dataset of 5,000 samples extracted from the raw data of the open-source RedPajama project.

Create and run a workflow

  1. Go to the Machine Learning Designer page.

    1. Log on to the PAI console.

    2. In the upper-left corner, select a region as needed.

    3. In the navigation pane on the left, click Workspaces. On the page that appears, click the name of the workspace that you want to use.

    4. In the navigation pane on the left, choose Model Training > Visualized Modeling (Designer).

  2. Create a workflow.

    1. On the Preset Templates tab, choose Business Area > Large Language Model. On the LLM Data Processing-Wikipedia (Web Text Data) template card, click Create.

    2. Configure the workflow parameters or keep the default settings, and then click OK.

    3. In the workflow list, select the workflow that you created and click Open.

  3. View the workflow description:

    The following list describes the key algorithm components in the workflow:

    • LLM-Sensitive Content Mask (MaxCompute)-1

      Masks sensitive information in the "text" field (see the masking sketch after this list). For example:

      • Replaces email addresses with [EMAIL].

      • Replaces phone numbers with [TELEPHONE] or [MOBILEPHONE].

      • Replaces ID card numbers with [IDNUM].

    • LLM-Clean Special Content (MaxCompute)-1

      Deletes URLs from the "text" field.

    • LLM-Text Normalizer (MaxCompute)-1

      Applies Unicode normalization to text in the "text" field and converts traditional Chinese characters to simplified Chinese characters.

    • LLM-Count Filter (MaxCompute)-1

      Removes samples whose "text" field does not meet the required count or ratio of alphanumeric characters. Because most characters in the Wikipedia dataset are letters and digits, this component removes some dirty data (see the filter sketch after this list).

    • LLM-Length Filter (MaxCompute)-1

      Filters samples based on the average line length in the "text" field. Lines are split by the line feed character \n.

    • LLM-N-Gram Repetition Filter (MaxCompute)-1

      Filters samples based on the character-level N-gram repetition rate of the "text" field. A sliding window of N characters moves over the text to produce a sequence of N-character segments, each called a gram. The component counts the occurrences of each gram and computes the repetition rate as the total count of grams that occur more than once divided by the total count of all grams, then filters samples against this rate (see the repetition-rate sketch after this list).

    • LLM-Sensitive Keywords Filter (MaxCompute)-1

      Uses the system's preset sensitive word file to filter samples in the "text" field that contain sensitive words.

    • LLM-Language Recognition and Filter (MaxCompute)-1

      Identifies the language of the text in the "text" field, calculates a confidence score for the prediction, and filters samples based on the configured confidence threshold.

    • LLM-Length Filter (MaxCompute)-2

      Filters samples based on the maximum line length in the "text" field. Lines are split by the line feed character \n.

    • LLM-Perplexity Filter (MaxCompute)-1

      Calculates the perplexity of the text in the "text" field and filters samples based on the configured perplexity threshold.

    • LLM-Special Characters Ratio Filter (MaxCompute)-1

      Removes samples whose "text" field does not meet the required ratio of special characters.

    • LLM-Length Filter (MaxCompute)-3

      Filters samples based on the length of the "text" field.

    • LLM-Tokenization (MaxCompute)-1

      Tokenizes the text in the "text" field and saves the result to a new column.

    • LLM-Length Filter (MaxCompute)-4

      Splits the "text" field into a list of words using the space character " " as the separator, and then filters samples based on the length of that list, that is, by the number of words.

    • LLM-N-Gram Repetition Filter (MaxCompute)-2

      Filters samples based on the word-level N-gram repetition rate of the "text" field. All words are first converted to lowercase. A sliding window of N words then produces a sequence of N-word grams, and the repetition rate is computed in the same way as in the character-level component above.

    • LLM-MinHash Deduplicator (MaxCompute)-1

      Removes similar samples based on the MinHash algorithm (see the deduplication sketch after this list).
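
    The following sketches illustrate, in plain Python, the kind of logic several of these components apply. They are minimal approximations for illustration only, with assumed patterns and thresholds; the actual components run as MaxCompute jobs with their own built-in rules and configurable parameters.

    A masking sketch for LLM-Sensitive Content Mask. The regular expressions and the phone number format are illustrative assumptions, not the component's actual patterns:

    ```python
    import re

    # Illustrative patterns (assumptions); the component ships its own rules.
    EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
    MOBILE = re.compile(r"\b\d{3}[- ]?\d{4}[- ]?\d{4}\b")  # hypothetical number format

    def mask_sensitive(text: str) -> str:
        """Replace sensitive substrings with placeholder tokens."""
        text = EMAIL.sub("[EMAIL]", text)
        text = MOBILE.sub("[MOBILEPHONE]", text)
        return text

    print(mask_sensitive("Contact alice@example.com or 138-1234-5678."))
    # Contact [EMAIL] or [MOBILEPHONE].
    ```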
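
    A filter sketch for LLM-Count Filter and the average-line-length variant of LLM-Length Filter. The threshold values are assumptions for illustration, not component defaults:

    ```python
    def alnum_ratio(text: str) -> float:
        """Fraction of characters that are letters or digits."""
        return sum(c.isalnum() for c in text) / max(len(text), 1)

    def avg_line_length(text: str) -> float:
        """Average line length, with lines split on the line feed character."""
        lines = text.split("\n")
        return sum(len(line) for line in lines) / len(lines)

    def keep_sample(text: str, min_ratio: float = 0.6, min_avg_len: float = 10.0) -> bool:
        # Thresholds here are illustrative assumptions, not the components' defaults.
        return alnum_ratio(text) >= min_ratio and avg_line_length(text) >= min_avg_len
    ```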
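
    A repetition-rate sketch matching the formula that both LLM-N-Gram Repetition Filter components use: the total count of grams that occur more than once divided by the total count of all grams. The character-level variant slides over raw characters, while the word-level variant first lowercases the text and splits on spaces:

    ```python
    from collections import Counter

    def ngram_repetition_rate(tokens: list, n: int) -> float:
        """Share of grams whose value occurs more than once in the sequence."""
        grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        if not grams:
            return 0.0
        counts = Counter(grams)
        repeated = sum(c for c in counts.values() if c > 1)  # total count of repeated grams
        return repeated / len(grams)

    text = "the cat sat on the mat because the cat sat"
    char_rate = ngram_repetition_rate(list(text), n=5)               # character-level
    word_rate = ngram_repetition_rate(text.lower().split(" "), n=2)  # word-level
    ```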
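
    A deduplication sketch for LLM-MinHash Deduplicator. This from-scratch approximation simulates hash permutations by salting a base hash; the similarity threshold is an assumption, and a production deduplicator would typically pair MinHash with locality-sensitive hashing instead of comparing signatures pairwise:

    ```python
    import hashlib

    def minhash_signature(tokens, num_perm: int = 64) -> list:
        """One minimum per simulated permutation, obtained by salting a base hash."""
        return [
            min(int(hashlib.md5(f"{seed}:{t}".encode()).hexdigest(), 16)
                for t in set(tokens))
            for seed in range(num_perm)
        ]

    def estimated_jaccard(sig_a: list, sig_b: list) -> float:
        """The fraction of matching signature positions estimates Jaccard similarity."""
        return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

    a = minhash_signature("the quick brown fox jumps over the lazy dog".split())
    b = minhash_signature("the quick brown fox leaps over a lazy dog".split())
    if estimated_jaccard(a, b) >= 0.6:  # illustrative threshold (assumption)
        print("near-duplicate pair: keep one sample, drop the other")
    ```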

  4. Run the workflow.

    After the run is complete, right-click the Write To Data Table-1 component and choose View Data > Outputs to view the samples processed by the preceding components.
