The LLM-N-Gram Repetition Filter (DLC) component of Platform for AI (PAI) filters texts based on the repetition ratio of character-level or word-level N-grams. The input Object Storage Service (OSS) data file must be in the JSON Lines format: each line in the file is a valid JSON object, and the file consists of multiple such lines, so the file as a whole is not a single valid JSON object. For more information, see Example.
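The JSON Lines requirement can be checked locally before you upload a file to OSS. The following is a minimal sketch, not the component's own loader: the helper name `read_jsonl` and the field name `text` are hypothetical, and skipping blank lines is an assumption.

```python
import json


def read_jsonl(lines):
    """Parse JSON Lines input; every non-empty line must be a JSON object."""
    records = []
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # assumption: blank lines are tolerated
        obj = json.loads(line)  # raises json.JSONDecodeError on an invalid line
        if not isinstance(obj, dict):
            raise ValueError(
                f"line {line_no}: expected a JSON object, got {type(obj).__name__}"
            )
        records.append(obj)
    return records


# Two JSON objects, one per line. "text" is a hypothetical field name.
sample = '{"text": "hello world"}\n{"text": "foo bar foo bar"}\n'
records = read_jsonl(sample.splitlines())
```

Parsing `sample` as a whole with `json.loads(sample)` fails, which illustrates why the file itself is not a valid JSON object even though each line is.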
Supported computing resources
Algorithm description
The LLM-N-Gram Repetition Filter (DLC) component moves a window of size N across a text to generate sequences of N characters or N words. Each sequence is called an N-gram. The component counts the frequency of each N-gram and then calculates the repetition ratio by using the following formula: Repetition ratio = Cumulative frequency of N-grams that occur more than once / Total frequency of all N-grams. The component then filters texts based on this repetition ratio.
If the N-grams are sequences of words, the component converts all words to lowercase before calculating the repetition ratio.
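The repetition ratio described above can be sketched as follows. This illustrates the formula and is not the component's actual implementation: the function name `ngram_repetition_ratio`, whitespace tokenization for words, keeping whitespace in character-level N-grams, and returning 0.0 for texts shorter than N are all assumptions.

```python
from collections import Counter


def ngram_repetition_ratio(text, n, level="char"):
    """Return (frequency of N-grams occurring more than once) / (total N-grams)."""
    if level == "word":
        # Words are lowercased before counting, as described above.
        tokens = text.lower().split()  # assumption: split on whitespace
    else:
        tokens = list(text)  # assumption: every character counts, including spaces
    # Slide a window of size n across the tokens.
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0  # assumption: a text shorter than n has no repetition
    counts = Counter(ngrams)
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(ngrams)


char_ratio = ngram_repetition_ratio("abab", 2)
word_ratio = ngram_repetition_ratio("The cat the cat", 2, level="word")
```

For `"abab"` with N = 2, the N-grams are `ab`, `ba`, `ab`. Only `ab` occurs more than once, with frequency 2, so the ratio is 2/3. The word-level call gives the same value because `The` and `the` are treated as the same word after lowercasing.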
Configure the component
On the pipeline page of Machine Learning Designer, configure the parameters of the LLM-N-Gram Repetition Filter (DLC) component.
Tab | Parameter | Required | Description | Default value
Fields Setting | Target Process Field | Yes | The name of the field that you want to process. | N/A
Fields Setting | Whether to Filter with Character-level N-Gram Repetition Ratio | No | If you select this option, you must configure the following parameters: | Unselected
Fields Setting | Whether to Filter with Word-level N-Gram Repetition Ratio | No | If you select this option, you must configure the following parameters: | Unselected
Fields Setting | OSS Directory for Saving OutputData | No | The OSS directory in which the generated data is stored. If you do not specify this parameter, the default path of the workspace is used. | N/A
Tuning | Number of Processes | No | The number of processes. | 8
Tuning | Select Resource Group: Public Resource Group | No | The instance type (CPU or GPU), number of instances, and virtual private cloud (VPC) that you want to use. | N/A
Tuning | Select Resource Group: Dedicated resource group | No | The number of vCPUs, memory, shared memory, number of GPUs, and number of instances that you want to use. | N/A
Tuning | Maximum Running Duration | No | The maximum period of time for which the component can run. If this period of time is exceeded, the job is terminated. | N/A