Platform for AI: LLM-Clean Special Content (MaxCompute)

Last Updated: Jul 10, 2024

The LLM-Clean Special Content (MaxCompute) component of Platform for AI (PAI) removes special content from text, such as navigation information, author or source information, URLs, non-printable characters, and special HTML characters. You can use the component to preprocess text for large language models (LLMs).

Limits

The LLM-Clean Special Content (MaxCompute) component supports only MaxCompute resources.

Algorithm

The LLM-Clean Special Content (MaxCompute) component performs the following operations on the text:

  • Uses line breaks to split the text into multiple lines.

  • Removes navigation information.

    • Keywords: 'Homepage>', 'Homepage»', 'Homepage/', and 'Homepage|'.

    • Regular expressions: 'Current location:.*[>]{1,}' and 'Location:.*[>]{1,}'.

    • The component removes each text line that contains any of the preceding keywords or matches either of the preceding regular expressions.

  • Removes author information.

    The component removes each text line that contains at least one of the following keywords and at least one special character. The special characters include . ? ! ; : . ? ! ; , , !.

    Keywords: 'Newspaper reporter', 'Source:', 'Edit:', 'Login | Register', 'Address of this topic:', 'Date of publication:', 'Addition time:', 'Share to:', '"Scan"', 'Related links:', 'Lottery', 'Website navigation', '| Contact us', 'Homepage', 'Current location:', 'Published at', and 'Location: '.

  • Removes source information.

    Regular expressions: r'(\d{4}[-/year]\d{1,2}[-/month]\d{1,2}[day]{0,}\s\d{1,2}:\d{1,2}:\d{1,2})' and r'\d{4}[-/]\d{1,2}[-/]\d{1,2}.*[Source: | Edit:]'.

    The component applies the preceding regular expressions only to the first five text lines and removes the lines that match.

    Note

    If the navigation information and author information are removed from the text, the first five text lines are counted based on the text after the navigation information and author information are removed, rather than based on the original text.

  • Removes URLs.

    The component removes characters that match the regular expression r'(https?|http)?:\/\/[\w\.\/\?\=\&\%\-\_]+' from the text.

  • Removes non-printable characters.

    The component removes characters that match the regular expression '[\001\002\003\004\005\006\007\x08\x09\x0b\x0c\x0d\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a]+' from the text.

  • Removes HTML characters and parses the HTML text.

    The component replaces '<li>' in the text with '\n*', replaces '<ol>' in the text with '\n*', and removes '</li>' and '</ol>' from the text. Then, the component parses the HTML text.
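The line-level steps above (splitting, navigation removal, author removal, and source removal) can be sketched in Python as follows. This is a minimal illustration, not the component's actual implementation: the function names are hypothetical, the keywords and patterns are copied verbatim from the text above, and the use of re.search (a match anywhere in the line) and of an ASCII-only special character set are assumptions.

```python
import re

# Keywords and patterns copied verbatim from the description above.
NAV_KEYWORDS = ("Homepage>", "Homepage»", "Homepage/", "Homepage|")
NAV_PATTERNS = (
    re.compile(r"Current location:.*[>]{1,}"),
    re.compile(r"Location:.*[>]{1,}"),
)
AUTHOR_KEYWORDS = (
    "Newspaper reporter", "Source:", "Edit:", "Login | Register",
    "Address of this topic:", "Date of publication:", "Addition time:",
    "Share to:", '"Scan"', "Related links:", "Lottery",
    "Website navigation", "| Contact us", "Homepage",
    "Current location:", "Published at", "Location: ",
)
# Assumption: the ASCII punctuation from the list above, de-duplicated.
# The component's exact character set may differ.
SPECIAL_CHARS = set(".?!;:,")
SOURCE_PATTERNS = (
    re.compile(r"(\d{4}[-/year]\d{1,2}[-/month]\d{1,2}[day]{0,}\s\d{1,2}:\d{1,2}:\d{1,2})"),
    re.compile(r"\d{4}[-/]\d{1,2}[-/]\d{1,2}.*[Source: | Edit:]"),
)


def is_navigation(line):
    """True if the line contains a navigation keyword or matches a navigation pattern."""
    return (any(k in line for k in NAV_KEYWORDS)
            or any(p.search(line) for p in NAV_PATTERNS))


def is_author(line):
    """True if the line contains an author keyword AND at least one special character."""
    return (any(k in line for k in AUTHOR_KEYWORDS)
            and any(c in SPECIAL_CHARS for c in line))


def is_source(line):
    """True if the line matches one of the source patterns."""
    return any(p.search(line) for p in SOURCE_PATTERNS)


def clean_lines(text, head=5):
    """Apply the line-level steps in the documented order."""
    lines = text.split("\n")
    lines = [ln for ln in lines if not is_navigation(ln)]
    lines = [ln for ln in lines if not is_author(ln)]
    # The first `head` lines are counted AFTER navigation and author
    # removal, as the note above describes.
    kept_head = [ln for ln in lines[:head] if not is_source(ln)]
    return "\n".join(kept_head + lines[head:])
```

Because the source patterns are applied only to the first five remaining lines, a timestamp that appears later in the text is kept, while the same timestamp near the top is removed.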

The following figures show an example of removing a URL from the text.

  • Before processing

    [Figure: sample text before processing]

  • After processing

    [Figure: sample text after processing]
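The character-level steps (URL removal, non-printable character removal, and HTML handling) can be sketched in Python as follows. This is an illustrative sketch under stated assumptions: the function names are hypothetical, the two regular expressions are copied from the Algorithm section, and html.unescape stands in for "parses the HTML text", which the component may implement differently.

```python
import html
import re

# Patterns copied from the Algorithm section.
URL_RE = re.compile(r"(https?|http)?:\/\/[\w\.\/\?\=\&\%\-\_]+")
NONPRINTABLE_RE = re.compile(
    "[\001\002\003\004\005\006\007\x08\x09\x0b\x0c\x0d\x0e\x0f"
    "\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a]+"
)


def clean_html(text):
    """Rewrite list tags as described, then decode HTML entities."""
    text = text.replace("<li>", "\n*").replace("<ol>", "\n*")
    text = text.replace("</li>", "").replace("</ol>", "")
    # Assumption: entity decoding approximates the HTML parsing step.
    return html.unescape(text)


def clean_chars(text):
    """Apply the three character-level steps in the documented order."""
    text = URL_RE.sub("", text)
    text = NONPRINTABLE_RE.sub("", text)
    return clean_html(text)


print(repr(clean_chars("see https://example.com/a?b=1 now")))
```

Note that the non-printable character class omits \x0a, so line breaks survive this step, whereas tabs (\x09) and carriage returns (\x0d) are removed.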

Configure the component

You can configure the parameters of the component in the Machine Learning Designer module of the Platform for AI (PAI) console. The following table describes the parameters.

| Tab | Parameter | Required | Description | Default value |
| --- | --- | --- | --- | --- |
| Fields Setting | Select Target Column | Yes | The columns that you want to process. You can select multiple columns. | No default value |
| Fields Setting | Output table lifecycle | No | The value is a positive integer. Unit: days. After the table lifecycle elapses, the temporary tables generated by the component are recycled. | 28 |
| Tuning | Number of CPUs per instance of map task | No | The number of CPUs for each instance of a map task. Valid values: [50,800]. | 100 |
| Tuning | The memory size per instance of map task | No | The memory size of each instance of a map task. Unit: MB. Valid values: [256,12288]. | 1024 |
| Tuning | The maximum size of input data for a map | No | The maximum amount of data that each instance of a map task can process. You can use this parameter to manage the input of a map. Unit: MB. Valid values: [1,Integer.MAX_VALUE]. | 256 |
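If you prepare Tuning values in a script before you enter them in the console, a small range check can catch invalid values early. The following sketch is only an illustration: the dictionary keys and the function name are hypothetical and are not identifiers used by PAI. The ranges and default values are the ones described above, with Integer.MAX_VALUE taken as 2^31 - 1.

```python
INT_MAX = 2**31 - 1  # Java's Integer.MAX_VALUE

# (lower bound, upper bound, default) for each Tuning parameter.
# The keys are hypothetical names chosen for this sketch.
TUNING_RANGES = {
    "map_cpu": (50, 800, 100),
    "map_memory_mb": (256, 12288, 1024),
    "map_max_input_mb": (1, INT_MAX, 256),
}


def resolve_tuning(overrides=None):
    """Merge overrides into the defaults and reject out-of-range values."""
    overrides = overrides or {}
    unknown = set(overrides) - set(TUNING_RANGES)
    if unknown:
        raise KeyError(f"unknown parameters: {sorted(unknown)}")
    resolved = {}
    for name, (low, high, default) in TUNING_RANGES.items():
        value = overrides.get(name, default)
        is_int = isinstance(value, int) and not isinstance(value, bool)
        if not is_int or not low <= value <= high:
            raise ValueError(
                f"{name} must be an integer in [{low},{high}], got {value!r}"
            )
        resolved[name] = value
    return resolved


print(resolve_tuning({"map_memory_mb": 2048}))
```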

References

For more information about Machine Learning Designer, see Overview of Machine Learning Designer.