Clean LLM Training Data by Removing Copyright Headers - Platform for AI

The LLM-Clean Copyright Information (MaxCompute) component of Platform for AI (PAI) removes copyright information from text, such as the copyright comment header at the top of source code. You can use the component when you preprocess text for large language models (LLMs).

Supported computing resources

MaxCompute

Algorithm description

The algorithm performs the following operations to remove the copyright information from text:

1. Checks whether the text includes a string that matches the regular expression '/\\*[^*]*\\*+(?:[^/*][^*]*\\*+)*/', which describes a /* ... */ block comment.
- If a matching string is found, the algorithm checks whether the string contains copyright fields. If it does, the algorithm deletes the string and returns the result. If it does not, the algorithm returns the text unchanged.
- If no string matches the regular expression, the algorithm proceeds to step 2.
2. Splits the text at line feeds and traverses it line by line to check whether a line starts with one of the following comment characters: //, #, or --. When such a line is found, the algorithm continues to traverse the text until the run of comment lines ends, and then removes those consecutive comment lines.
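
The two steps can be sketched in Python as follows. This is a minimal illustration of the description above, not the component's actual implementation. The function name `clean_copyright`, the keyword list `COPYRIGHT_KEYWORDS`, and the choice to anchor the block-comment match at the start of the text (after leading whitespace) are assumptions made for this example.

```python
import re

# The block-comment pattern from step 1, written as a Python raw string.
BLOCK_COMMENT = re.compile(r"/\*[^*]*\*+(?:[^/*][^*]*\*+)*/")
# Line-comment prefixes from step 2.
LINE_COMMENT_PREFIXES = ("//", "#", "--")
# Assumption: the "copyright fields" are approximated by this keyword list.
COPYRIGHT_KEYWORDS = ("copyright",)


def clean_copyright(text: str) -> str:
    """Remove a copyright comment header from the start of the text."""
    stripped = text.lstrip()
    # Step 1: look for a /* ... */ comment at the header of the text.
    match = BLOCK_COMMENT.match(stripped)
    if match:
        comment = match.group(0).lower()
        if any(keyword in comment for keyword in COPYRIGHT_KEYWORDS):
            # Delete the comment and the line feeds that follow it.
            return stripped[match.end():].lstrip("\n")
        # A block comment without copyright fields is left untouched.
        return text
    # Step 2: drop the run of consecutive line comments at the top.
    lines = text.split("\n")
    index = 0
    while index < len(lines) and lines[index].lstrip().startswith(LINE_COMMENT_PREFIXES):
        index += 1
    return "\n".join(lines[index:])


before = "/* Copyright 2023 Example Corp. */\nint main() { return 0; }"
after = clean_copyright(before)
print(after)
```

In this sketch, a text that begins with a non-copyright block comment is returned unchanged, and comment lines that appear after the first line of code are never removed, because only the header is inspected.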

The algorithm checks only the header of the text.

[Example table: text before processing and after processing]

Configure the component

Add an LLM-Clean Copyright Information (MaxCompute) component on the pipeline page of Machine Learning Designer and configure the following parameters.

Category	Parameter	Default value	Description
Fields Setting	Select Target Column	None	The columns that you want to process. You can select multiple columns.
Fields Setting	Output table lifecycle	28	The lifecycle of the output table. The value is a positive integer. Unit: days. After the lifecycle elapses, the temporary tables generated by the component are recycled.
Tuning	Number of CPUs per instance of map task	100	The number of CPUs for each instance of a map task. Valid values: 50 to 800.
Tuning	The memory size per instance of map task	1024	The memory size of each instance of a map task. Unit: MB. Valid values: 256 to 12288.
Tuning	The maximum size of input data for a map	256	The maximum amount of data that each instance of a map task can process. You can use this parameter to control the input of each map task. Unit: MB. Valid values: 1 to Integer.MAX_VALUE.
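
The valid ranges of the tuning parameters can be checked before a pipeline is submitted. The following Python sketch is only an illustration: the keys `map_cpu`, `map_memory_mb`, and `map_max_input_mb` and the function `validate_tuning` are hypothetical names chosen for this example, not identifiers exposed by the component. Integer.MAX_VALUE is taken to be the Java constant 2147483647.

```python
INT_MAX = 2**31 - 1  # Java's Integer.MAX_VALUE (2147483647)

# Hypothetical keys mapped to the (minimum, maximum) values from the table.
TUNING_RANGES = {
    "map_cpu": (50, 800),
    "map_memory_mb": (256, 12288),
    "map_max_input_mb": (1, INT_MAX),
}

# Default values from the table.
TUNING_DEFAULTS = {
    "map_cpu": 100,
    "map_memory_mb": 1024,
    "map_max_input_mb": 256,
}


def validate_tuning(settings: dict) -> dict:
    """Merge settings over the defaults and check each value against its range."""
    merged = {**TUNING_DEFAULTS, **settings}
    for name, value in merged.items():
        if name not in TUNING_RANGES:
            raise KeyError(f"unknown tuning parameter: {name}")
        low, high = TUNING_RANGES[name]
        if not isinstance(value, int) or not low <= value <= high:
            raise ValueError(f"{name}={value!r} is outside {low} to {high}")
    return merged


config = validate_tuning({"map_memory_mb": 2048})
print(config)
```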