Filters text samples based on the proportion of special characters. Use this component in a Machine Learning Designer pipeline to remove low-quality samples from LLM training data, such as texts dominated by punctuation, digits, or whitespace.
Input data must be stored in Object Storage Service (OSS) in JSON Lines format. Each line is a standalone JSON object, and the file itself is not a valid JSON object. For a sample input file, see Example.
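As a rough illustration of the JSON Lines layout described above, the following sketch parses such data one line at a time. The field name `text` and the helper `iter_jsonl` are hypothetical, chosen only for this example. The in-memory stream stands in for a file that would normally be read from OSS.

```python
import io
import json


def iter_jsonl(stream):
    """Yield one parsed JSON object per non-empty line (hypothetical helper)."""
    for line in stream:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        yield json.loads(line)


# Two standalone JSON objects, one per line. The "text" field name is an assumption.
sample = io.StringIO('{"text": "Hello, World!"}\n{"text": "@#$%^&*"}\n')
records = list(iter_jsonl(sample))
```

Parsing the whole stream with a single `json.loads` call would fail, because the file as a whole is not one JSON document.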
Supported computing resources
How it works
The algorithm scans every character in the text and calculates the special character ratio:
special character ratio = number of special characters / total text length
A text is filtered out if its ratio falls outside the range defined by Minimum Ratio and Maximum Ratio:
- Ratio < Minimum Ratio: the text is removed.
- Ratio > Maximum Ratio: the text is removed.
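The two rules above amount to an inclusive range check: a text whose ratio equals either bound is kept. A minimal sketch, assuming the ratio has already been computed (the function name `keep_sample` is hypothetical and not part of the component):

```python
def keep_sample(ratio: float, max_ratio: float, min_ratio: float = 0.0) -> bool:
    """Return True if the text should be kept.

    A text is removed only when ratio < min_ratio or ratio > max_ratio,
    so both bounds are inclusive.
    """
    return min_ratio <= ratio <= max_ratio


print(keep_sample(0.23, max_ratio=0.25))  # True: within range
print(keep_sample(0.55, max_ratio=0.25))  # False: exceeds Maximum Ratio
```

`min_ratio` defaults to 0 here to mirror the default of the Minimum Ratio parameter, while `max_ratio` has no default because Maximum Ratio is required.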
Special character categories
The following character types count as special characters:
| Category | Source |
|---|---|
| Punctuation | string.punctuation |
| Digits | string.digits |
| Whitespace | string.whitespace |
| Emojis | Unicode emoji characters |
| Other special characters | Additional Unicode symbols |
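The following sketch shows one way the ratio could be computed from these categories. The first three categories map directly to constants in Python's `string` module. The component's exact emoji and symbol sets are not documented here, so this example treats any character in a Unicode symbol category (`Sm`, `Sc`, `Sk`, `So`) as special. That choice is an assumption, as are the function names and the handling of empty text.

```python
import string
import unicodedata

# Punctuation, digits, and whitespace, as listed in the table above.
ASCII_SPECIAL = set(string.punctuation + string.digits + string.whitespace)


def is_special(ch: str) -> bool:
    """Return True if ch counts as a special character in this sketch."""
    if ch in ASCII_SPECIAL:
        return True
    # Assumption: Unicode symbol categories stand in for
    # "emojis" and "other special characters".
    return unicodedata.category(ch).startswith("S")


def special_char_ratio(text: str) -> float:
    """Number of special characters divided by total text length."""
    if not text:
        return 0.0  # assumption: empty text has a ratio of 0
    return sum(is_special(ch) for ch in text) / len(text)


print(round(special_char_ratio("Hello, World!"), 2))  # 0.23 (3 of 13)
print(round(special_char_ratio("!!!Hello!!!"), 2))    # 0.55 (6 of 11)
```

The actual component may classify some characters differently, so results for text with non-ASCII symbols could vary from this sketch.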
Example
With Minimum Ratio = 0.0 and Maximum Ratio = 0.25:
| Text | Approximate ratio | Result |
|---|---|---|
| HelloWorld (no spaces or punctuation) | 0.0 | Kept (ratio equals Minimum Ratio) |
| Hello, World! (3 special chars out of 13) | ~0.23 | Kept (within range) |
| !!!Hello!!! (6 special chars out of 11) | ~0.55 | Filtered out (exceeds Maximum Ratio) |
| @#$%^&* (all special chars) | 1.0 | Filtered out (exceeds Maximum Ratio) |
Spaces, digits, and emojis all count as special characters. For example, Hello World 123 contains two spaces and three digits among its 15 characters, so its ratio is 5/15 ≈ 0.33 and it is filtered out under these settings.
Configure the component
On the pipeline page of Machine Learning Designer, configure the LLM-Special Characters Ratio Filter (DLC) component with the following parameters.
Fields Setting
| Parameter | Required | Description | Default |
|---|---|---|---|
| Target Process Field | Yes | The field to process. | N/A |
| Minimum Ratio | No | Texts with a ratio below this value are filtered out. Type: FLOAT. | 0 |
| Maximum Ratio | Yes | Texts with a ratio above this value are filtered out. Type: FLOAT. | N/A |
| OSS Directory for Saving OutputData | No | OSS path for output data. | Workspace path |
Tuning
| Parameter | Required | Description | Default |
|---|---|---|---|
| Number of Processes | No | Number of parallel processes. | Number of CPU cores |
Select Resource Group
| Parameter | Required | Description | Default |
|---|---|---|---|
| Public Resource Group | No | Instance type (CPU or GPU), number of instances, and VPC. | N/A |
| Dedicated resource group | No | Number of vCPUs, memory, shared memory, GPUs, and instances. | N/A |
| Maximum Running Duration | No | Maximum runtime. The job is terminated if this limit is exceeded. | N/A |