The LLM-Clean Copyright Information (DLC) component of Platform for AI (PAI) removes copyright information from text, such as the copyright comment header in source code. The input data file in Object Storage Service (OSS) must be in the JSON Lines format: each line of the file is a valid JSON object, but the file as a whole is not a single valid JSON object. For more information, see Example.
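The JSON Lines requirement can be illustrated with a minimal Python sketch. The field name content and the helper read_jsonl are hypothetical and are used only for illustration. In practice, the field name is whatever you specify for the Target Process Field parameter. Skipping blank lines is also an assumption and is not a documented rule of the component.

```python
import json

def read_jsonl(text):
    """Parse JSON Lines text: one JSON object per non-empty line."""
    records = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # assumption: blank lines are tolerated and skipped
        record = json.loads(line)  # raises json.JSONDecodeError on an invalid line
        if not isinstance(record, dict):
            raise ValueError(f"line {line_number} is not a JSON object")
        records.append(record)
    return records

# Two records, one per line. "content" is a hypothetical field name.
rows = [
    {"content": "/* Copyright 2024 Example Corp. */\nint main() { return 0; }"},
    {"content": "print('hello')"},
]
sample = "\n".join(json.dumps(row) for row in rows)

records = read_jsonl(sample)

# Each line is valid JSON, but the file as a whole is not.
try:
    json.loads(sample)
    whole_file_is_json = True
except json.JSONDecodeError:
    whole_file_is_json = False
```

In this sketch, each line parses on its own while parsing the whole file as one JSON document fails, which is the property described above.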
Supported computing resources
Algorithm description
The algorithm performs the following operations to remove the copyright information from the text:
1. Check whether the text includes strings that match the regular expression '/\\*[^*]*\\*+(?:[^/*][^*]*\\*+)*/', which matches block comments of the form /* ... */. If a string matches the regular expression, the algorithm checks whether the string contains the copyright field. If the string contains the field, the algorithm deletes the string and returns the result. If the string does not contain the field, the algorithm returns the text as is. If no string matches the regular expression, the algorithm proceeds to Step 2.
2. Split the text on line breaks. The algorithm traverses the text line by line to check whether a line starts with one of the following comment symbols: //, #, and --. If such a line is found, the algorithm continues to traverse the text until the comment ends, and then removes the consecutive comment lines from the text.
The algorithm checks only the header of the text. Example:
Before processing | After processing |
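The two steps described above can be approximated with the following Python sketch. It is not the component's actual implementation. The function name clean_copyright is hypothetical, and the sketch makes several assumptions: only the first block comment is inspected, the copyright check is case-insensitive, and Step 2 removes only the run of comment lines that starts at the first line of the text.

```python
import re

# Regular expression from the algorithm description, written as a raw string.
BLOCK_COMMENT = re.compile(r"/\*[^*]*\*+(?:[^/*][^*]*\*+)*/")
LINE_COMMENT_PREFIXES = ("//", "#", "--")

def clean_copyright(text):
    # Step 1: look for a block comment such as /* ... */.
    match = BLOCK_COMMENT.search(text)
    if match:
        # Assumption: the "copyright" check ignores case.
        if "copyright" in match.group(0).lower():
            return text[:match.start()] + text[match.end():]
        return text  # block comment without copyright: return unchanged

    # Step 2: drop the consecutive comment lines at the top of the text.
    lines = text.split("\n")
    end = 0
    while end < len(lines) and lines[end].lstrip().startswith(LINE_COMMENT_PREFIXES):
        end += 1
    return "\n".join(lines[end:])

block_header = "/* Copyright 2024 Example Corp. */\nint main() { return 0; }"
line_header = "# Copyright 2024 Example Corp.\n# All rights reserved.\nimport os\n"
plain_comment = "/* helper */\nint x;"

cleaned_block = clean_copyright(block_header)
cleaned_lines = clean_copyright(line_header)
unchanged = clean_copyright(plain_comment)
```

In this sketch, the block comment in block_header is removed because it contains the word copyright, the two leading # lines in line_header are removed by Step 2, and plain_comment is returned unchanged because its block comment does not mention copyright.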
Configure the component
On the pipeline page of Machine Learning Designer, configure the parameters of the LLM-Clean Copyright Information (DLC) component.
Tab | Parameter | Required | Description | Default value |
Fields Setting | Target Process Field | Yes | The name of the field that you want to process. | N/A |
| OSS Directory for Saving OutputData | No | The OSS directory in which the generated data is stored. If you do not specify this parameter, the default path of the workspace is used. | N/A |
Tuning | Number of Processes | No | The number of processes. | 8 |
Select Resource Group | Public Resource Group | No | The instance type (CPU or GPU), number of instances, and virtual private cloud (VPC) that you want to use. | N/A |
| Dedicated resource group | No | The number of vCPUs, memory, shared memory, number of GPUs, and number of instances that you want to use. | N/A |
| Maximum Running Duration | No | The maximum period of time for which the component can run. If this period of time is exceeded, the job is terminated. | N/A |

