LLM-Clean Special Content (MaxCompute) - Platform For AI

The LLM-Clean Special Content (MaxCompute) component strips boilerplate and noise from raw text—navigation breadcrumbs, author metadata, timestamps, URLs, non-printable characters, and HTML markup—before you use the text as large language model (LLM) training data. It runs on MaxCompute resources and integrates with Machine Learning Designer in the Platform for AI (PAI) console.

Limitations

LLM-Clean Special Content (MaxCompute) supports only MaxCompute resources.

How it works

The component processes text in the following order:

1. Splits text into lines using line breaks.
2. Removes navigation information.
3. Removes author information.
4. Removes source information (first five lines only).
5. Removes URLs.
6. Removes non-printable characters.
7. Parses and cleans HTML markup.

Step 4 depends on the results of steps 2 and 3. If those steps remove navigation or author information, the "first five lines" in step 4 are counted from the remaining text, not from the original text.
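The order dependence can be seen in a small sketch. This is not the component's implementation: it assumes that a matching line is dropped entirely, uses a simplified numeric-only timestamp pattern, and skips the author step for brevity. The function names `drop_navigation` and `drop_source` are hypothetical.

```python
import re

# Navigation keywords from the table below (assumed to be plain substrings).
NAV_KEYWORDS = ("Homepage>", "Homepage»", "Homepage/", "Homepage|")
# Simplified stand-in for the timestamp regex; numeric separators only.
TIMESTAMP = re.compile(r"\d{4}[-/]\d{1,2}[-/]\d{1,2}\s\d{1,2}:\d{1,2}:\d{1,2}")


def drop_navigation(lines):
    """Drop every line that contains a navigation keyword (full text)."""
    return [ln for ln in lines if not any(k in ln for k in NAV_KEYWORDS)]


def drop_source(lines, head=5):
    """Drop timestamp lines, but only among the first `head` lines."""
    kept_head = [ln for ln in lines[:head] if not TIMESTAMP.search(ln)]
    return kept_head + lines[head:]


raw = "\n".join([
    "Homepage> News",
    "Title",
    "para 1",
    "para 2",
    "para 3",
    "2023-01-02 10:11:12",
    "para 4",
])
lines = raw.split("\n")  # step 1: split on line breaks

# Navigation first, then source: the timestamp moves up to line 5.
cleaned = drop_source(drop_navigation(lines))

# Source step applied to the original lines: the timestamp is on line 6.
without_nav_step = drop_source(lines)
```

In this made-up sample the timestamp sits on line 6 of the original text, so the source step only catches it because the navigation line was removed first.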

The following table describes how each operation identifies and removes content:

Operation	Trigger type	Trigger condition	Scope
Remove navigation information	Keywords	`'Homepage>'`, `'Homepage»'`, `'Homepage/'`, `'Homepage|'`	Full text
Remove navigation information	Regex	`'Current location:.[>]{1,}'`, `'Location:.[>]{1,}'`	Full text
Remove author information	Keywords + special characters	Line contains one of the keywords and at least one of `. ? ! ; : . ? ! ; , , !`	Full text
Remove source information	Regex	`r'(\d{4}[-/year]\d{1,2}[-/month]\d{1,2}[day]{0,}\s\d{1,2}:\d{1,2}:\d{1,2})'`	First 5 lines
Remove source information	Regex	`r'\d{4}[-/]\d{1,2}[-/]\d{1,2}.*[Source: | Edit:]'`	First 5 lines
Remove URLs	Regex	`r'(https?|http)?:\/\/[\w\.\/\?\=\&\%\-\_]+'`	Full text
Remove non-printable characters	Regex	`'[\001\002\003\004\005\006\007\x08\x09\x0b\x0c\x0d\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a]+'`	Full text
Parse HTML markup	Tag replacement	Replaces `<li>` and `<ol>` with `\n*`, removes `</li>` and `</ol>`, then parses the HTML	Full text
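The last two rows of the table can be sketched with the Python standard library. The non-printable pattern is copied from the table. The HTML step is an assumption: the source does not name the parser the component uses, so this sketch uses `html.parser` and keeps only text nodes. The names `strip_non_printable`, `clean_html`, and `_TextExtractor` are hypothetical.

```python
import re
from html.parser import HTMLParser

# Control characters from the table; note that "\n" (0x0a) is not included.
NON_PRINTABLE = re.compile(
    "[\001\002\003\004\005\006\007\x08\x09\x0b\x0c\x0d\x0e\x0f"
    "\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a]+"
)


def strip_non_printable(text):
    """Remove runs of the listed control characters."""
    return NON_PRINTABLE.sub("", text)


class _TextExtractor(HTMLParser):
    """Collects text nodes and ignores all tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def clean_html(text):
    """Apply the tag replacements from the table, then keep only the text."""
    for tag in ("<li>", "<ol>"):
        text = text.replace(tag, "\n*")
    for tag in ("</li>", "</ol>"):
        text = text.replace(tag, "")
    parser = _TextExtractor()
    parser.feed(text)
    parser.close()
    return "".join(parser.parts)


plain = strip_non_printable("a\x01b\tc\nd")
listing = clean_html("<p>Intro</p><ol><li>one</li><li>two</li></ol>")
```

Because the tab character (`\x09`) is in the pattern, tab-separated content is joined together by this step, while line breaks are preserved.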

Author information keywords

The author information removal step matches lines that contain one of the following keywords and at least one special character:

'Newspaper reporter', 'Source:', 'Edit:', 'Login | Register', 'Address of this topic:', 'Date of publication:', 'Addition time:', 'Share to:', '"Scan"', 'Related links:', 'Lottery', 'Website navigation', '| Contact us', 'Homepage', 'Current location:', 'Published at', 'Location: '
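A minimal sketch of this rule, assuming that keywords are matched as plain substrings and that a matching line is removed entirely (the source does not state either detail). The special-character set uses only the distinct ASCII characters listed in the table, and `is_author_line` and `drop_author_lines` are hypothetical names.

```python
AUTHOR_KEYWORDS = (
    "Newspaper reporter", "Source:", "Edit:", "Login | Register",
    "Address of this topic:", "Date of publication:", "Addition time:",
    "Share to:", '"Scan"', "Related links:", "Lottery",
    "Website navigation", "| Contact us", "Homepage",
    "Current location:", "Published at", "Location: ",
)
# Distinct characters from the table's trigger condition (assumed ASCII only).
SPECIAL_CHARS = ".?!;:,"


def is_author_line(line):
    """True if the line has a keyword and at least one special character."""
    has_keyword = any(k in line for k in AUTHOR_KEYWORDS)
    has_special = any(c in line for c in SPECIAL_CHARS)
    return has_keyword and has_special


def drop_author_lines(lines):
    """Keep only the lines that do not look like author information."""
    return [ln for ln in lines if not is_author_line(ln)]


sample = [
    "Source: Daily News",
    "Newspaper reporter Zhang",
    "The committee met on Tuesday.",
]
kept = drop_author_lines(sample)
```

Keywords that end in a colon, such as `'Source:'`, already contain a special character, so under these assumptions any line containing them matches. A keyword without punctuation, such as `'Newspaper reporter'`, matches only when the rest of the line supplies one.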

Example: URL removal

The following example shows a text snippet before and after URL removal.

Before processing:

After processing:
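The sample text for this example is not shown here, so the following is a hedged stand-in rather than the original sample. It applies the URL pattern from the operations table, written with a plain `|` alternation, to a made-up string. It assumes the component replaces each match with an empty string; `remove_urls` is a hypothetical name.

```python
import re

# URL pattern from the operations table.
URL_PATTERN = re.compile(r'(https?|http)?:\/\/[\w\.\/\?\=\&\%\-\_]+')


def remove_urls(text):
    """Delete every substring that matches the URL pattern."""
    return URL_PATTERN.sub("", text)


before = "See https://example.com/docs?id=1&x=2 for details."
after = remove_urls(before)
```

Under these assumptions the URL disappears but the spaces around it remain, so `after` contains two consecutive spaces where the URL used to be.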

Configure the component

Configure the following parameters in Machine Learning Designer in the PAI console.

Tab	Parameter	Required	Description	Default value
Fields Setting	Select Target Column	Yes	The columns to process. Select one or more columns.	No default value
Fields Setting	Output table lifecycle	No	The retention period for temporary tables generated by the component, in days. Valid values: positive integers. After the lifecycle period elapses, the temporary tables are recycled.	28
Tuning	Number of CPUs per instance of map task	No	The number of CPUs for each map task instance. Valid values: [50, 800].	100
Tuning	The memory size per instance of map task	No	The memory size for each map task instance. Unit: MB. Valid values: [256, 12288].	1024
Tuning	The maximum size of input data for a map	No	The maximum amount of input data that each map task instance processes. Unit: MB. Valid values: [1, Integer.MAX_VALUE].	256
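The defaults and valid ranges in the table can be captured in a small validation sketch. This is only an illustration of the documented constraints, not an API of the component: the class name `CleanSpecialContentConfig` and its field names are hypothetical.

```python
from dataclasses import dataclass

INT_MAX = 2**31 - 1  # value of Java's Integer.MAX_VALUE


@dataclass
class CleanSpecialContentConfig:
    """Hypothetical container mirroring the parameter table."""

    target_columns: list          # Select Target Column (required)
    lifecycle_days: int = 28      # Output table lifecycle
    map_cpu: int = 100            # Number of CPUs per instance of map task
    map_memory_mb: int = 1024     # Memory size per instance of map task
    map_max_input_mb: int = 256   # Maximum size of input data for a map

    def validate(self):
        """Raise ValueError if a value is outside the documented range."""
        if not self.target_columns:
            raise ValueError("select at least one target column")
        if self.lifecycle_days < 1:
            raise ValueError("lifecycle must be a positive integer")
        if not 50 <= self.map_cpu <= 800:
            raise ValueError("CPU setting must be in [50, 800]")
        if not 256 <= self.map_memory_mb <= 12288:
            raise ValueError("memory must be in [256, 12288] MB")
        if not 1 <= self.map_max_input_mb <= INT_MAX:
            raise ValueError("max input must be in [1, Integer.MAX_VALUE] MB")
        return self


defaults = CleanSpecialContentConfig(target_columns=["content"]).validate()
```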
