The LLM-Clean Copyright Information (MaxCompute) component of Platform for AI (PAI) removes copyright information from text, such as the copyright comment header in source code. You can use the component when you preprocess text data for large language models (LLMs).
Supported computing resources
Algorithm description
The algorithm performs the following operations to remove the copyright information from text:
1. Checks whether the text contains a string that matches the regular expression '/\\*[^*]*\\*+(?:[^/*][^*]*\\*+)*/'. If a match is found, the algorithm checks whether the matched string contains the copyright field. If it does, the algorithm deletes the matched string and returns the result. If it does not, the algorithm returns the text unchanged. If no string matches the regular expression, the algorithm proceeds to step 2.
2. Splits the text by line feed and traverses the lines to check whether a line starts with one of the following comment characters: //, #, or --. If such a line is found, the algorithm continues to traverse the text until the comment ends, and then removes these consecutive comment lines.
The algorithm checks only the header of the text. Examples:
Before processing | After processing |
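The two steps of the algorithm description can be approximated with the following Python sketch. It is an illustration only, not the component's actual implementation. The function name clean_copyright is hypothetical, and the sketch makes three assumptions that the description does not confirm: the copyright check is case-insensitive, only the first block comment is examined, and leading whitespace before a line comment marker is ignored.

```python
import re

# Block-comment pattern from the algorithm description, written as a raw
# string so that each doubled backslash becomes a single regex escape.
BLOCK_COMMENT = re.compile(r"/\*[^*]*\*+(?:[^/*][^*]*\*+)*/")
LINE_COMMENT_PREFIXES = ("//", "#", "--")


def clean_copyright(text: str) -> str:
    """Hypothetical sketch of the two-step copyright header removal."""
    # Step 1: look for the first /* ... */ block comment.
    match = BLOCK_COMMENT.search(text)
    if match:
        # Assumption: the check for "copyright" ignores case.
        if "copyright" in match.group(0).lower():
            return text[:match.start()] + text[match.end():]
        # A block comment without the copyright field leaves the text as is.
        return text

    # Step 2: no block comment found, so drop the run of consecutive
    # line comments at the top of the text (the header only).
    lines = text.split("\n")
    index = 0
    while index < len(lines) and lines[index].lstrip().startswith(LINE_COMMENT_PREFIXES):
        index += 1
    return "\n".join(lines[index:])


if __name__ == "__main__":
    print(clean_copyright("# Copyright 2024 Example\n# All rights reserved.\nimport os\n# keep me"))
```

In this sketch, a comment line that appears after the first non-comment line is kept, which mirrors the statement that only the header of the text is checked.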
Configure the component
Add an LLM-Clean Copyright Information (MaxCompute) component on the pipeline page of Machine Learning Designer and configure the following parameters.
Category | Parameter | Default value | Description |
Fields Setting | Select Target Column | None | The columns that you want to process. You can select multiple columns. |
Fields Setting | Output table lifecycle | 28 | The lifecycle of the output table. The value must be a positive integer. Unit: days. After the lifecycle elapses, the temporary tables generated by the component are recycled. |
Tuning | Number of CPUs per instance of map task | 100 | The number of CPUs for each instance of a map task. Valid values: 50 to 800. |
Tuning | The memory size per instance of map task | 1024 | The memory size of each instance of a map task. Unit: MB. Valid values: 256 to 12288. |
Tuning | The maximum size of input data for a map | 256 | The maximum amount of data that each instance of a map task can process. Use this parameter to control the input size of each map instance. Unit: MB. Valid values: 1 to Integer.MAX_VALUE. |

