The LLM-Special Characters Ratio Filter (DLC) component in Platform for AI (PAI) filters text samples based on the proportion of special characters. Use this component in a Machine Learning Designer pipeline to remove low-quality samples from LLM training data, such as texts dominated by punctuation, digits, or whitespace.
Input data must be stored in Object Storage Service (OSS) in JSON Lines format: each line is a standalone JSON object. The file itself is not a valid JSON object. For a sample input file, see Example.
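As a minimal sketch of the JSON Lines format, the following Python snippet parses an in-memory sample line by line and shows that the content as a whole is not a single JSON document. The field name `text` and the sample records are hypothetical and used only for illustration.

```python
import json

# Hypothetical sample content; the "text" field name is an assumption.
SAMPLE_JSONL = '{"text": "Hello, World!"}\n{"text": "!!!Hello!!!"}\n'


def parse_jsonl(content):
    """Parse JSON Lines content: one JSON object per non-empty line."""
    return [json.loads(line) for line in content.splitlines() if line.strip()]


records = parse_jsonl(SAMPLE_JSONL)
print(records)

# Parsing the whole content at once fails, because it holds several
# JSON objects rather than a single one.
try:
    json.loads(SAMPLE_JSONL)
    whole_file_is_json = True
except json.JSONDecodeError:
    whole_file_is_json = False
print(whole_file_is_json)
```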
Supported computing resources
How it works
The algorithm scans every character in the text and calculates the special character ratio:
special character ratio = number of special characters / total text length
A text is filtered out if its ratio falls outside the range defined by Minimum Ratio and Maximum Ratio:
Ratio < Minimum Ratio: the text is removed.
Ratio > Maximum Ratio: the text is removed.
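The calculation and the range check can be sketched in Python as follows. This is an illustration, not the component's actual implementation: it assumes that only the ASCII punctuation, digit, and whitespace sets from Python's `string` module count as special, and that an empty text has a ratio of 0. The function names are hypothetical.

```python
import string

# Assumption: only these ASCII sets count as special in this sketch.
ASCII_SPECIAL = set(string.punctuation + string.digits + string.whitespace)


def special_char_ratio(text):
    """Return the number of special characters divided by the text length."""
    if not text:
        # Assumption: treat empty text as ratio 0 to avoid division by zero.
        return 0.0
    return sum(ch in ASCII_SPECIAL for ch in text) / len(text)


def keep_text(text, max_ratio, min_ratio=0.0):
    """Keep the text only if min_ratio <= ratio <= max_ratio."""
    ratio = special_char_ratio(text)
    return min_ratio <= ratio <= max_ratio


for sample in ["HelloWorld", "Hello, World!", "!!!Hello!!!"]:
    print(sample, round(special_char_ratio(sample), 2), keep_text(sample, max_ratio=0.25))
```

Both bounds are inclusive in this sketch: a text whose ratio equals Minimum Ratio or Maximum Ratio is kept, because only ratios strictly below or above the bounds are removed.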
Special character categories
The following character types count as special characters:
| Category | Source |
|---|---|
| Punctuation | string.punctuation |
| Digits | string.digits |
| Whitespace | string.whitespace |
| Emojis | Unicode emoji characters |
| Other special characters | Additional Unicode symbols |
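The exact set of emojis and additional Unicode symbols that the component recognizes is not listed here. As a rough approximation, the sketch below combines the three `string` constants with a check on Unicode general categories. The choice of categories (`So`, `Sm`, `Sc`, `Sk`) and the function names are assumptions made for illustration.

```python
import string
import unicodedata

# ASCII categories named in the table above.
ASCII_SPECIAL = set(string.punctuation + string.digits + string.whitespace)

# Assumption: Unicode general categories treated as "emoji or other symbol".
# So = other symbol (covers most emojis), Sm = math symbol,
# Sc = currency symbol, Sk = modifier symbol.
SYMBOL_CATEGORIES = {"So", "Sm", "Sc", "Sk"}


def is_special(ch):
    """Return True if a single character counts as special in this sketch."""
    if ch in ASCII_SPECIAL:
        return True
    return unicodedata.category(ch) in SYMBOL_CATEGORIES


def count_special(text):
    """Count the special characters in a text."""
    return sum(is_special(ch) for ch in text)


print(count_special("Hi \U0001F600!"))
```

Accented letters such as é are not counted by this check, because they belong to a letter category rather than a symbol category.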
Example
With Minimum Ratio = 0.0 and Maximum Ratio = 0.25:
| Text | Approximate ratio | Result |
|---|---|---|
| HelloWorld (no spaces or punctuation) | 0.0 | Kept (ratio equals Minimum Ratio) |
| Hello, World! (3 special chars out of 13) | ~0.23 | Kept (within range) |
| !!!Hello!!! (6 special chars out of 11) | ~0.55 | Filtered out (exceeds Maximum Ratio) |
| @#$%^&* (all special chars) | 1.0 | Filtered out (exceeds Maximum Ratio) |
Spaces, digits, and emojis all count as special characters. For example, in Hello World 123, the two spaces and three digits are counted, which gives a ratio of 5/15, or about 0.33.
Configure the component
On the pipeline page of Machine Learning Designer, configure the LLM-Special Characters Ratio Filter (DLC) component with the following parameters.
Fields Setting
| Parameter | Required | Description | Default |
|---|---|---|---|
| Target Process Field | Yes | The name of the field that you want to process. | N/A |
| Minimum Ratio | No | Texts with a special character ratio below this threshold are filtered out. Value type: FLOAT. | 0 |
| Maximum Ratio | Yes | Texts with a special character ratio above this threshold are filtered out. Value type: FLOAT. | N/A |
| OSS Directory for Saving OutputData | No | The OSS path where output data is stored. If not specified, the default workspace path is used. | N/A |
Tuning
| Parameter | Required | Description | Default |
|---|---|---|---|
| Number of Processes | No | The number of parallel processes for data processing. If not specified, the value is automatically set to the number of CPU cores. | N/A |
Select Resource Group
| Parameter | Required | Description | Default |
|---|---|---|---|
| Public Resource Group | No | The instance type (CPU or GPU), number of instances, and virtual private cloud (VPC) to use. | N/A |
| Dedicated Resource Group | No | The number of vCPUs, memory, shared memory, number of GPUs, and number of instances to use. | N/A |
| Maximum Running Duration | No | The maximum time the component can run. If exceeded, the job is terminated. | N/A |