Platform For AI: LLM-Clean Copyright Information (MaxCompute)

Last Updated: Jan 03, 2025

The LLM-Clean Copyright Information (MaxCompute) component of Platform for AI (PAI) removes copyright information from text, such as the copyright comment header at the top of source code. You can use the component when you preprocess text data for large language models (LLMs).

Supported computing resources

MaxCompute

Algorithm description

The algorithm performs the following operations to remove the copyright information from text:

  1. Checks whether the text includes a string that matches the regular expression '/\\*[^*]*\\*+(?:[^/*][^*]*\\*+)*/', which describes a block comment delimited by /* and */.

    • If a string matches, the algorithm checks whether the matched string contains copyright fields. If it does, the algorithm deletes the matched string and returns the result. If it does not, the algorithm returns the text unchanged.

    • If no string matches, the algorithm proceeds to step 2.

  2. Splits the text at line feeds and traverses it line by line to check whether a line starts with one of the following comment characters: //, #, or --. When such a line is found, the algorithm continues to traverse the text until it reaches a line that does not start with a comment character, and then removes those consecutive comment lines.

The algorithm checks only the header of the text. Examples:

Before processing

[Figure: sample text before processing]

After processing

[Figure: sample text after processing]
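The two steps described above can be sketched in Python as follows. This is only an illustration of the documented logic, not the component's actual implementation. The function name clean_copyright is hypothetical, and two details are assumptions because the description does not spell them out: the "copyright field" is taken to be the word "copyright" (case-insensitive), and blank lines inside the leading comment block are treated as part of the header.

```python
import re

# Regular expression from step 1: matches a block comment of the form /* ... */.
BLOCK_COMMENT = re.compile(r"/\*[^*]*\*+(?:[^/*][^*]*\*+)*/")
# Assumption: the "copyright field" is the word "copyright", case-insensitive.
COPYRIGHT_FIELD = re.compile(r"copyright", re.IGNORECASE)
# Comment characters from step 2.
LINE_COMMENT_PREFIXES = ("//", "#", "--")


def clean_copyright(text: str) -> str:
    """Hypothetical helper that mirrors the two documented steps."""
    match = BLOCK_COMMENT.search(text)
    if match:
        # Step 1: inspect only the first block comment found in the text.
        if COPYRIGHT_FIELD.search(match.group(0)):
            return text[:match.start()] + text[match.end():]
        return text  # Block comment without a copyright field: no change.

    # Step 2: no block comment, so drop the leading run of comment lines.
    lines = text.split("\n")
    skip = 0
    for line in lines:
        # Assumption: blank lines inside the header count as part of it.
        if line.startswith(LINE_COMMENT_PREFIXES) or not line:
            skip += 1
        else:
            break
    return "\n".join(lines[skip:])


c_source = "/* Copyright 2024 Example Corp. */\nint main() { return 0; }\n"
py_source = "# Copyright 2024 Example Corp.\n# All rights reserved.\nprint('hi')\n"

print(repr(clean_copyright(c_source)))   # '\nint main() { return 0; }\n'
print(repr(clean_copyright(py_source)))  # "print('hi')\n"
```

In this sketch, a block comment that does not mention copyright stops processing at step 1, so line comments that follow it are left in place. Comment lines that appear after the first line of real content are also left untouched, which matches the statement that only the header is checked.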

Configure the component

Add an LLM-Clean Copyright Information (MaxCompute) component on the pipeline page of Machine Learning Designer and configure the following parameters.

| Category | Parameter | Default value | Description |
| --- | --- | --- | --- |
| Fields Setting | Select Target Column | None | The columns that you want to process. You can select multiple columns. |
| Fields Setting | Output table lifecycle | 28 | The lifecycle of the output table. The value must be a positive integer. Unit: days. After the lifecycle elapses, the temporary tables generated by the component are recycled. |
| Tuning | Number of CPUs per instance of map task | 100 | The number of CPUs for each instance of a map task. Valid values: 50 to 800. |
| Tuning | The memory size per instance of map task | 1024 | The memory size of each instance of a map task. Unit: MB. Valid values: 256 to 12288. |
| Tuning | The maximum size of input data for a map | 256 | The maximum amount of data that each instance of a map task can process. You can use this parameter to control the input of the map. Unit: MB. Valid values: 1 to Integer.MAX_VALUE. |
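If you prepare parameter values outside Machine Learning Designer, a small range check can catch invalid settings before you enter them. The sketch below encodes the defaults and valid ranges listed above. The class name and field names are hypothetical and do not correspond to any PAI API, and Integer.MAX_VALUE is taken to be the Java constant 2^31 - 1.

```python
from dataclasses import dataclass

INT_MAX = 2**31 - 1  # Java's Integer.MAX_VALUE


@dataclass
class CleanCopyrightParams:
    """Hypothetical container; field names are illustrative only."""
    lifecycle_days: int = 28      # Output table lifecycle, in days
    map_cpu: int = 100            # Number of CPUs per instance of map task
    map_memory_mb: int = 1024     # Memory size per instance of map task, in MB
    map_max_input_mb: int = 256   # Maximum size of input data for a map, in MB

    def validate(self) -> list:
        """Return a list of messages, one per out-of-range value."""
        errors = []
        if self.lifecycle_days < 1:
            errors.append("lifecycle_days must be a positive integer")
        if not 50 <= self.map_cpu <= 800:
            errors.append("map_cpu must be between 50 and 800")
        if not 256 <= self.map_memory_mb <= 12288:
            errors.append("map_memory_mb must be between 256 and 12288")
        if not 1 <= self.map_max_input_mb <= INT_MAX:
            errors.append("map_max_input_mb must be between 1 and Integer.MAX_VALUE")
        return errors


print(CleanCopyrightParams().validate())              # []
print(CleanCopyrightParams(map_cpu=10).validate())    # ['map_cpu must be between 50 and 800']
```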