The LLM - Special Content Removal (DLC) component removes URL links, strips HTML tags, and parses the resulting text. The input OSS data file must be in JSONL format: each line is a valid JSON object, but the file as a whole is not a single valid JSON document.
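As a rough illustration of the JSONL requirement, the following sketch reads a small in-memory sample line by line. The field name `text` and the helper `read_jsonl` are hypothetical and are used only for this example; they are not part of the component.

```python
import io
import json

def read_jsonl(stream):
    """Yield one parsed object per non-empty line of a JSONL stream."""
    for line_no, line in enumerate(stream, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        try:
            yield json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {line_no} is not valid JSON: {exc}") from exc

# Hypothetical sample data: two records, one JSON object per line.
sample_text = (
    '{"text": "See http://example.com for details."}\n'
    '{"text": "<ol><li>first</li></ol>"}\n'
)
records = list(read_jsonl(io.StringIO(sample_text)))
print(len(records))
```

Each line parses on its own, while passing the whole of `sample_text` to `json.loads` raises an error because the file contains more than one top-level object.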
Supported compute resources
Algorithm
The LLM - Special Content Removal (DLC) component supports the following features:
- Remove URL links
  Removes characters from the text that match the regular expression r'(https?|http)?:\/\/[\w\.\/\?\=\&\%\-\_]+'.
- Remove HTML tags and parse HTML text
  Replaces '<li>' and '<ol>' with '\n*', and removes the '</li>' and '</ol>' tags. The component then parses and returns the resulting text.
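The HTML step described above can be sketched as follows. The tag replacements come from the feature description; the parsing step is an assumption, since the document does not say which parser the component uses. This sketch substitutes Python's standard `html.parser` module, and the names `_TextExtractor` and `clean_html` are hypothetical.

```python
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment (assumed parsing step)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def clean_html(raw_html: str) -> str:
    # Documented rule: '<li>' and '<ol>' become a newline followed by '*'.
    for tag in ("<li>", "<ol>"):
        raw_html = raw_html.replace(tag, "\n*")
    # Documented rule: the closing tags are removed.
    for tag in ("</li>", "</ol>"):
        raw_html = raw_html.replace(tag, "")
    # Assumption: drop the remaining tags and keep the text content.
    extractor = _TextExtractor()
    extractor.feed(raw_html)
    extractor.close()
    return "".join(extractor.chunks)

result = clean_html("<p>Steps:</p><ol><li>Install</li><li>Run</li></ol>")
print(repr(result))
```

Because both '<ol>' and '<li>' are replaced with '\n*', an ordered list produces an extra '\n*' before its first item.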
For example, to remove URL links from an article:
| Before | After |
| --- | --- |
| Before processing, the current field value is the minified source code of AngularJS v1.3.0-beta.2, which still contains a URL. | The current field value dialog box displays the processed content. It is a snippet of minified JavaScript code from AngularJS v1.3.0-beta.2, including copyright comments ((c) 2010-2014 Google, Inc., License: MIT) and partial function definitions. The URL http://angularjs.org has been removed. |
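The URL removal in this example can be approximated with the regular expression listed for the Remove URL links feature. This is a minimal sketch: the input string is an illustrative stand-in modeled on the copyright comment described above, not the actual field value, and `remove_urls` is a hypothetical helper name.

```python
import re

# Pattern copied from the "Remove URL links" feature description.
URL_PATTERN = re.compile(r'(https?|http)?:\/\/[\w\.\/\?\=\&\%\-\_]+')

def remove_urls(text: str) -> str:
    """Delete every substring that matches URL_PATTERN."""
    return URL_PATTERN.sub('', text)

# Illustrative input; the real field holds the full minified source.
before = "(c) 2010-2014 Google, Inc. http://angularjs.org\n License: MIT"
after = remove_urls(before)
print(repr(after))
```

Two properties of the pattern are worth noting. The scheme group is optional, so a bare `://host/path` also matches. The character class does not include `:` or `#`, so a match stops at a port number or a fragment, and whitespace around the removed URL is left in place.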
Configure the component
In the Designer workflow, add the LLM - Special Content Removal (DLC) component and configure its parameters in the right-side pane.
| Parameter type | Parameter | Required | Description | Default |
| --- | --- | --- | --- | --- |
| field settings | target processing field | Yes | The name of the field to process. | None |
| | Remove URL links | No | Whether to remove URL links from the text. | Selected |
| | Remove HTML tags and parse HTML text | No | Whether to remove HTML tags and parse the resulting text. | Not selected |
| | data output OSS directory | No | The OSS directory to store the processed data. If this parameter is left empty, the component uses the default workspace path. | None |
| execution tuning | number of processes | No | The number of processes to use for the job. | 8 |
| Select resource group | public resource group | No | Select the instance specification (CPU or GPU), number of nodes, and Virtual Private Cloud. | None |
| | dedicated resource group | No | Select the number of CPU cores, memory, shared memory, number of GPUs, and number of nodes. | None |
| | maximum runtime | No | The maximum runtime of the component. If this time is exceeded, the system terminates the job. | None |
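To show how the field settings might combine, here is a hedged sketch that applies the two toggles to one JSONL line. The function name `process_line`, its keyword arguments, and the order of the two steps are assumptions made for illustration; only the regular expression, the tag replacements, and the default toggle states (URL removal selected, HTML removal not selected) come from this document. The HTML parsing again uses Python's standard `html.parser` as a stand-in.

```python
import json
import re
from html.parser import HTMLParser

URL_PATTERN = re.compile(r'(https?|http)?:\/\/[\w\.\/\?\=\&\%\-\_]+')

class _TextExtractor(HTMLParser):
    """Keeps text nodes only (assumed stand-in for the component's parser)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def _strip_html(raw: str) -> str:
    raw = raw.replace("<li>", "\n*").replace("<ol>", "\n*")
    raw = raw.replace("</li>", "").replace("</ol>", "")
    extractor = _TextExtractor()
    extractor.feed(raw)
    extractor.close()
    return "".join(extractor.chunks)

def process_line(line: str, field: str,
                 remove_urls: bool = True, remove_html: bool = False) -> str:
    """Process the target field of one JSONL line and return the new line."""
    record = json.loads(line)
    text = record[field]
    if remove_urls:   # mirrors the "Remove URL links" default (Selected)
        text = URL_PATTERN.sub("", text)
    if remove_html:   # mirrors the HTML option default (Not selected)
        text = _strip_html(text)
    record[field] = text
    return json.dumps(record, ensure_ascii=False)

line = '{"id": 1, "text": "Docs: http://example.com/a?b=1 end"}'
print(process_line(line, field="text"))
```

Only the target field is rewritten; other keys in the record pass through unchanged.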