The LLM-Clean Special Content (DLC) component of Platform for AI (PAI) removes URLs from text, removes HTML-formatted characters, and parses HTML text. The input data file in Object Storage Service (OSS) must be in the JSON Lines format: each line is a valid JSON object and the file consists of multiple such lines, so the file as a whole is not a single valid JSON object. For more information, see Example.
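The JSON Lines requirement can be checked locally before you upload a file to OSS. The following is a minimal sketch, not part of the component: the function name parse_json_lines and the field name "text" are hypothetical, and skipping blank lines is an assumption made for convenience.

```python
import json


def parse_json_lines(text):
    """Parse JSON Lines content and return one dict per non-empty line.

    Raises ValueError if a line is not valid JSON (json.JSONDecodeError is a
    subclass of ValueError) or if a line is valid JSON but not an object.
    """
    records = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            # Assumption: blank lines are ignored rather than rejected.
            continue
        obj = json.loads(line)
        if not isinstance(obj, dict):
            raise ValueError(f"line {line_number} is not a JSON object")
        records.append(obj)
    return records


# Hypothetical sample: two records, one JSON object per line.
sample = '{"text": "first record"}\n{"text": "second record"}\n'
records = parse_json_lines(sample)
```

Each line parses on its own, whereas passing the whole sample to json.loads fails, which is what "the file is not a valid JSON object" means in practice.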
Supported computing resources
Algorithm description
The LLM-Clean Special Content (DLC) component performs the following operations on the text:
Remove URLs
Remove the characters from the text that match the following regular expression: r'(https?|http)?:\/\/[\w\.\/\?\=\&\%\-\_]+'.
Remove HTML-formatted characters and parse HTML text
Perform the following operations on the text: replace '<li>' with '\n*', replace '<ol>' with '\n*', remove '</li>' and '</ol>' characters, and then parse the HTML text and return the result.
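The two operations described above can be approximated locally with the Python standard library. This is a sketch under stated assumptions, not the component's actual implementation: the function names remove_urls and clean_html are hypothetical, and the HTML parsing step uses html.parser to keep only text content, because the documentation does not say which parser the component uses.

```python
import re
from html.parser import HTMLParser

# Regular expression quoted in the algorithm description.
URL_PATTERN = re.compile(r'(https?|http)?:\/\/[\w\.\/\?\=\&\%\-\_]+')


class _TextExtractor(HTMLParser):
    """Collects text content and drops tags (assumed parsing behavior)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def remove_urls(text):
    """Delete every substring that matches URL_PATTERN."""
    return URL_PATTERN.sub('', text)


def clean_html(text):
    """Rewrite list tags as described, then extract the plain text."""
    text = text.replace('<li>', '\n*').replace('<ol>', '\n*')
    text = text.replace('</li>', '').replace('</ol>', '')
    parser = _TextExtractor()
    parser.feed(text)
    parser.close()  # flush any buffered trailing text
    return ''.join(parser.parts)
```

Because the scheme group in the expression is optional, a bare "://path" fragment is also removed. Only the matched characters are deleted, so the whitespace around a removed URL is left in place.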
Example of removing URLs from text:
Before processing | After processing
Configure the component
On the pipeline page of Machine Learning Designer, configure the parameters of the LLM-Clean Special Content (DLC) component.
Tab | Parameter | Required | Description | Default value
Fields Setting | Target Process Field | Yes | The name of the field that you want to process. | N/A
Fields Setting | Whether to remove the URL link | No | Specifies whether to remove URLs from the text. | Selected
Fields Setting | Whether to remove html format characters and parse html text | No | Specifies whether to remove HTML-formatted characters and parse HTML text. | Unselected
Fields Setting | OSS Directory for Saving OutputData | No | The OSS directory in which the generated data is stored. If you do not specify this parameter, the default path of the workspace is used. | N/A
Tuning | Number of Processes | No | The number of processes. | 8
Tuning | Select Resource Group: Public Resource Group | No | The instance type (CPU or GPU), number of instances, and virtual private cloud (VPC) that you want to use. | N/A
Tuning | Select Resource Group: Dedicated resource group | No | The number of vCPUs, memory, shared memory, number of GPUs, and number of instances that you want to use. | N/A
Tuning | Maximum Running Duration | No | The maximum period of time for which the component can run. If this period of time is exceeded, the job is terminated. | N/A

