Platform for AI: LLM-Clean Special Content (MaxCompute)

Last Updated: Apr 02, 2026

The LLM-Clean Special Content (MaxCompute) component strips boilerplate and noise from raw text—navigation breadcrumbs, author metadata, timestamps, URLs, non-printable characters, and HTML markup—before you use the text as large language model (LLM) training data. It runs on MaxCompute resources and integrates with Machine Learning Designer in the Platform for AI (PAI) console.

Limitations

LLM-Clean Special Content (MaxCompute) supports only MaxCompute resources.

How it works

The component processes text in the following order:

  1. Splits text into lines using line breaks.

  2. Removes navigation information.

  3. Removes author information.

  4. Removes source information (first five lines only).

  5. Removes URLs.

  6. Removes non-printable characters.

  7. Parses and cleans HTML markup.

Steps 2 through 4 are order-dependent. If steps 2 and 3 remove navigation or author information, the "first five lines" in step 4 are counted from the remaining text, not from the original text.
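The effect of this ordering can be illustrated with a minimal Python sketch. It is not the component's implementation: the helper names are hypothetical, step 3 is omitted, the timestamp pattern is a simplified stand-in for the source-information rules, and it assumes that a matching line is dropped as a whole.

```python
import re

# Simplified stand-ins for steps 2 and 4 (assumptions, not the component's code).
NAV_KEYWORDS = ("Homepage>", "Homepage»", "Homepage/", "Homepage|")
TIMESTAMP_RE = re.compile(r"\d{4}[-/]\d{1,2}[-/]\d{1,2}\s\d{1,2}:\d{1,2}:\d{1,2}")


def drop_navigation(lines):
    """Step 2 (simplified): drop every line that contains a navigation keyword."""
    return [ln for ln in lines if not any(k in ln for k in NAV_KEYWORDS)]


def drop_source_info(lines, window=5):
    """Step 4 (simplified): drop timestamp lines, but only among the first `window` lines."""
    head = [ln for ln in lines[:window] if not TIMESTAMP_RE.search(ln)]
    return head + lines[window:]


raw = "\n".join([
    "Homepage> News",
    "Homepage/ Tech",
    "Title",
    "Intro",
    "Body A",
    "2024-01-02 10:11:12",  # line 6 of the original text
    "Body B",
])
lines = raw.split("\n")  # step 1: split on line breaks

# Documented order: navigation lines go first, so the timestamp moves up to
# line 4 and falls inside the five-line window.
documented_order = drop_source_info(drop_navigation(lines))

# Reversed order, for contrast only: the timestamp is still on line 6 when the
# window is applied, so it survives.
reversed_order = drop_navigation(drop_source_info(lines))

print(documented_order)  # ['Title', 'Intro', 'Body A', 'Body B']
print(reversed_order)    # ['Title', 'Intro', 'Body A', '2024-01-02 10:11:12', 'Body B']
```

In this sample the timestamp sits on line 6 of the raw text, yet it is removed because the two navigation lines above it are gone by the time the five-line window is applied.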

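The following list describes how each operation identifies and removes content. Each entry gives the operation, trigger type, trigger condition, and scope.

  * Operation: Remove navigation information
    Trigger type: Keywords
    Trigger condition: 'Homepage>', 'Homepage»', 'Homepage/', 'Homepage|'
    Scope: Full text

  * Operation: Remove navigation information
    Trigger type: Regex
    Trigger condition: 'Current location:.*[>]{1,}', 'Location:.*[>]{1,}'
    Scope: Full text

  * Operation: Remove author information
    Trigger type: Keywords + special characters
    Trigger condition: The line contains one of the author information keywords and at least one of . ? ! ; : . ? ! ; , , !
    Scope: Full text

  * Operation: Remove source information
    Trigger type: Regex
    Trigger condition: r'(\d{4}[-/year]\d{1,2}[-/month]\d{1,2}[day]{0,}\s\d{1,2}:\d{1,2}:\d{1,2})'
    Scope: First 5 lines

  * Operation: Remove source information
    Trigger type: Regex
    Trigger condition: r'\d{4}[-/]\d{1,2}[-/]\d{1,2}.*[Source: | Edit:]'
    Scope: First 5 lines

  * Operation: Remove URLs
    Trigger type: Regex
    Trigger condition: r'(https?|http)?:\/\/[\w\.\/\?\=\&\%\-\_]+'
    Scope: Full text

  * Operation: Remove non-printable characters
    Trigger type: Regex
    Trigger condition: '[\001\002\003\004\005\006\007\x08\x09\x0b\x0c\x0d\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a]+'
    Scope: Full text

  * Operation: Parse HTML markup
    Trigger type: Tag replacement
    Trigger condition: Replaces <li> and <ol> with \n*, removes </li> and </ol>, then parses the HTML
    Scope: Full text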
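The last two operations can be sketched in Python as follows. The non-printable pattern is copied from the trigger condition above. The HTML part is an assumption: the component's parser is not specified, so this sketch treats "parses the HTML" as extracting text content with the standard-library `html.parser`, and the function names are hypothetical.

```python
import re
from html.parser import HTMLParser

# Pattern copied from the trigger condition above. Note that \x0a (line feed)
# is not in the class, so line breaks survive, while tab (\x09) is removed.
NON_PRINTABLE_RE = re.compile(
    "[\001\002\003\004\005\006\007\x08\x09\x0b\x0c\x0d\x0e\x0f\x10\x11\x12"
    "\x13\x14\x15\x16\x17\x18\x19\x1a]+"
)


def strip_non_printable(text):
    """Delete every run of the listed control characters."""
    return NON_PRINTABLE_RE.sub("", text)


class _TextExtractor(HTMLParser):
    """Collects text nodes and ignores all tags (assumed meaning of 'parse')."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def clean_html(text):
    """Apply the documented tag replacement, then keep only the text content."""
    text = text.replace("<li>", "\n*").replace("<ol>", "\n*")
    text = text.replace("</li>", "").replace("</ol>", "")
    parser = _TextExtractor()
    parser.feed(text)
    parser.close()
    return "".join(parser.parts)


print(repr(strip_non_printable("a\x01b\tc\nd")))
# 'abc\nd'
print(repr(clean_html("<p>Items:</p><ol><li>one</li><li>two</li></ol>")))
# 'Items:\n*\n*one\n*two'
```

Because both `<ol>` and `<li>` are replaced with `\n*`, an ordered list produces one extra `\n*` marker before its first item in this sketch.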

Author information keywords

The author information removal step matches lines that contain one of the following keywords and at least one special character:

'Newspaper reporter', 'Source:', 'Edit:', 'Login | Register', 'Address of this topic:', 'Date of publication:', 'Addition time:', 'Share to:', '"Scan"', 'Related links:', 'Lottery', 'Website navigation', '| Contact us', 'Homepage', 'Current location:', 'Published at', 'Location: '
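A minimal Python sketch of this rule is shown below. It rests on several assumptions: matching is case-sensitive substring matching, the special character may appear anywhere in the line (including inside the keyword itself), a matching line is dropped as a whole, and only the distinct ASCII forms of the listed special characters are checked. The function names are hypothetical.

```python
# Keywords copied from the list above.
AUTHOR_KEYWORDS = (
    'Newspaper reporter', 'Source:', 'Edit:', 'Login | Register',
    'Address of this topic:', 'Date of publication:', 'Addition time:',
    'Share to:', '"Scan"', 'Related links:', 'Lottery', 'Website navigation',
    '| Contact us', 'Homepage', 'Current location:', 'Published at',
    'Location: ',
)

# Assumption: distinct ASCII forms of the special characters in the rule.
SPECIAL_CHARS = set(".?!;:,")


def is_author_line(line):
    """True if the line has at least one keyword AND at least one special character."""
    has_keyword = any(keyword in line for keyword in AUTHOR_KEYWORDS)
    has_special = any(ch in SPECIAL_CHARS for ch in line)
    return has_keyword and has_special


def remove_author_lines(text):
    """Drop every line that satisfies the author-information rule."""
    return "\n".join(ln for ln in text.split("\n") if not is_author_line(ln))


sample = "Source: Daily News\nThe homepage redesign shipped\nReal content here."
print(remove_author_lines(sample))
# The homepage redesign shipped
# Real content here.
```

Many keywords already end in a colon, so under these assumptions they satisfy the special-character condition on their own. A bare keyword such as 'Homepage' does not, which is why the second condition matters.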

Example: URL removal

The following example shows a text snippet before and after URL removal.

Before processing:

[Image: sample text before URL removal]

After processing:

[Image: sample text after URL removal]
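As a text-only illustration of the same step, the following sketch applies the URL pattern from the "How it works" section with Python's `re` module. The sample strings and the `remove_urls` helper are made up for this example, and replacing each match with an empty string is an assumption about how the component removes it.

```python
import re

# URL pattern copied from the "Remove URLs" operation.
URL_RE = re.compile(r'(https?|http)?:\/\/[\w\.\/\?\=\&\%\-\_]+')


def remove_urls(text):
    """Delete every substring that matches the URL pattern."""
    return URL_RE.sub("", text)


before = "See https://example.com/docs?id=1&lang=en for details."
after = remove_urls(before)
print(after)  # See  for details.

# The scheme group is optional, so for other schemes only the part from
# "://" onward is matched and the scheme name is left behind.
print(remove_urls("ftp://host/file"))  # ftp
```

Under this sketch the surrounding whitespace is left untouched, so a removed URL leaves two adjacent spaces in the output.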

Configure the component

Configure the following parameters in Machine Learning Designer in the PAI console.

| Tab | Parameter | Required | Description | Default value |
| --- | --- | --- | --- | --- |
| Fields Setting | Select Target Column | Yes | The columns to process. Select one or more columns. | No default value |
| Fields Setting | Output table lifecycle | No | The retention period for temporary tables generated by the component, in days. Valid values: positive integers. After the lifecycle period elapses, the temporary tables are recycled. | 28 |
| Tuning | Number of CPUs per instance of map task | No | The number of CPUs for each map task instance. Valid values: [50, 800]. | 100 |
| Tuning | The memory size per instance of map task | No | The memory size for each map task instance. Unit: MB. Valid values: [256, 12288]. | 1024 |
| Tuning | The maximum size of input data for a map | No | The maximum amount of input data that each map task instance processes. Unit: MB. Valid values: [1, Integer.MAX_VALUE]. | 256 |
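If you record these settings outside the console, for example in a review checklist or a script, the documented ranges can be checked up front. The sketch below is illustrative only: the class and field names are hypothetical and are not identifiers used by PAI or MaxCompute. Only the defaults and valid ranges come from the parameter descriptions above.

```python
from dataclasses import dataclass
from typing import List

JAVA_INT_MAX = 2**31 - 1  # value of Java's Integer.MAX_VALUE


@dataclass
class CleanSpecialContentParams:
    """Hypothetical container; field names are illustrative, not PAI identifiers."""

    target_columns: List[str]      # Select Target Column (required)
    lifecycle_days: int = 28       # Output table lifecycle
    map_cpu: int = 100             # Number of CPUs per instance of map task
    map_memory_mb: int = 1024      # The memory size per instance of map task
    map_max_input_mb: int = 256    # The maximum size of input data for a map

    def validate(self) -> List[str]:
        """Return a list of problems; an empty list means every value is in range."""
        problems = []
        if not self.target_columns:
            problems.append("Select Target Column: at least one column is required")
        if self.lifecycle_days < 1:
            problems.append("Output table lifecycle: must be a positive integer")
        if not 50 <= self.map_cpu <= 800:
            problems.append("Number of CPUs: must be in [50, 800]")
        if not 256 <= self.map_memory_mb <= 12288:
            problems.append("Memory size: must be in [256, 12288] MB")
        if not 1 <= self.map_max_input_mb <= JAVA_INT_MAX:
            problems.append("Maximum input size: must be in [1, Integer.MAX_VALUE] MB")
        return problems


print(CleanSpecialContentParams(["content"]).validate())      # []
print(len(CleanSpecialContentParams([], map_cpu=20).validate()))  # 2
```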
