Platform For AI: LLM-Special Characters Ratio Filter (DLC)

Last Updated: Jun 23, 2026

Filters text samples based on the proportion of special characters. Use this component in a Machine Learning Designer pipeline to remove low-quality samples from LLM training data, such as texts dominated by punctuation, digits, or whitespace.

Input data must be stored in Object Storage Service (OSS) in JSON Lines format. Each line is a standalone JSON object, and the file itself is not a valid JSON object. For a sample input file, see Example.
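To illustrate the JSON Lines layout, the following minimal sketch parses a small in-memory sample one line at a time. The field name `text` and the sample records are hypothetical; in a real pipeline the field is whatever you set as Target Process Field, and the data would be read from OSS rather than from a string.

```python
import io
import json

# Hypothetical sample: one standalone JSON object per line.
SAMPLE_JSONL = '{"text": "Hello, World!"}\n{"text": "@#$%^&*"}\n'

def read_jsonl(stream):
    """Yield one dict per non-empty line of a JSON Lines stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)

records = list(read_jsonl(io.StringIO(SAMPLE_JSONL)))
print(records)
```

Parsing the whole sample with a single `json.loads` call would fail, because the file as a whole is not one JSON value.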

Supported computing resources

How it works

The algorithm scans every character in the text and calculates the special character ratio:

special character ratio = number of special characters / total text length

A text is filtered out if its ratio falls outside the range defined by Minimum Ratio and Maximum Ratio:

  • Ratio < Minimum Ratio: the text is removed.

  • Ratio > Maximum Ratio: the text is removed.
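The formula and the two removal rules above can be sketched as follows. This is an illustration, not the component's actual code: the function names are hypothetical, and the handling of empty text is an assumption because the source does not describe it.

```python
def special_char_ratio(num_special, total_length):
    """Ratio as defined above: special characters divided by total text length."""
    if total_length == 0:
        # ASSUMPTION: empty text is given a ratio of 0.0; the source does not say.
        return 0.0
    return num_special / total_length

def is_kept(ratio, min_ratio, max_ratio):
    """Keep a text unless ratio < min_ratio or ratio > max_ratio."""
    return min_ratio <= ratio <= max_ratio

ratio = special_char_ratio(3, 13)
print(round(ratio, 2), is_kept(ratio, 0.0, 0.25))
```

Because only strictly smaller or strictly larger ratios are removed, a ratio exactly equal to either bound is kept.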

Special character categories

The following character types count as special characters:

| Category | Source |
| --- | --- |
| Punctuation | string.punctuation |
| Digits | string.digits |
| Whitespace | string.whitespace |
| Emojis | Unicode emoji characters |
| Other special characters | Additional Unicode symbols |
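A rough sketch of how a single character might be assigned to one of these categories. The first three checks use the Python `string` constants named above. The source does not specify the exact emoji or "other symbol" sets, so the last check is an assumption: it treats any character in a Unicode Symbol category (So, Sm, Sc, Sk) as special. The function name `classify` is hypothetical.

```python
import string
import unicodedata

PUNCTUATION = set(string.punctuation)
DIGITS = set(string.digits)
WHITESPACE = set(string.whitespace)

def classify(ch):
    """Return a category label for one character, or None if it is not special."""
    if ch in PUNCTUATION:
        return "punctuation"
    if ch in DIGITS:
        return "digit"
    if ch in WHITESPACE:
        return "whitespace"
    # ASSUMPTION: Unicode Symbol categories approximate emoji and other symbols.
    if unicodedata.category(ch).startswith("S"):
        return "symbol"
    return None

for ch in [",", "7", "\t", "\U0001F600", "a"]:
    print(repr(ch), classify(ch))
```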

Example

With Minimum Ratio = 0.0 and Maximum Ratio = 0.25:

| Text | Approximate ratio | Result |
| --- | --- | --- |
| HelloWorld (no spaces or punctuation) | 0.0 | Kept (ratio equals Minimum Ratio) |
| Hello, World! (3 special chars out of 13) | ~0.23 | Kept (within range) |
| !!!Hello!!! (6 special chars out of 11) | ~0.55 | Filtered out (exceeds Maximum Ratio) |
| @#$%^&* (all special chars) | 1.0 | Filtered out (exceeds Maximum Ratio) |

Spaces, digits, and emojis all count as special characters. For example, Hello World 123 contains two spaces and three digits, so its ratio is 5/15 ≈ 0.33 and it is filtered out under these settings.
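The example values can be checked with a short script. This sketch counts only the ASCII categories (`string.punctuation`, `string.digits`, `string.whitespace`), which is sufficient here because none of the example texts contain emoji or other Unicode symbols. The names `SPECIAL` and `special_ratio` are hypothetical.

```python
import string

# ASCII-only stand-in for the special character set; emoji and other
# Unicode symbols are left out because the example texts contain none.
SPECIAL = set(string.punctuation + string.digits + string.whitespace)

def special_ratio(text):
    """Number of special characters divided by total text length."""
    return sum(ch in SPECIAL for ch in text) / len(text)

MIN_RATIO, MAX_RATIO = 0.0, 0.25
texts = ["HelloWorld", "Hello, World!", "!!!Hello!!!", "@#$%^&*", "Hello World 123"]

results = {}
for text in texts:
    ratio = special_ratio(text)
    kept = MIN_RATIO <= ratio <= MAX_RATIO
    results[text] = (round(ratio, 2), kept)
    print(f"{text!r}: ratio={ratio:.2f} -> {'kept' if kept else 'filtered out'}")
```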

Configure the component

On the pipeline page of Machine Learning Designer, configure the LLM-Special Characters Ratio Filter (DLC) component with the following parameters.

Fields Setting

| Parameter | Required | Description | Default |
| --- | --- | --- | --- |
| Target Process Field | Yes | The field to process. | N/A |
| Minimum Ratio | No | Texts with a ratio below this value are filtered out. Type: FLOAT. | 0 |
| Maximum Ratio | Yes | Texts with a ratio above this value are filtered out. Type: FLOAT. | N/A |
| OSS Directory for Saving OutputData | No | OSS path for output data. Defaults to the workspace path. | N/A |

Tuning

| Parameter | Required | Description | Default |
| --- | --- | --- | --- |
| Number of Processes | No | Number of parallel processes. Defaults to the number of CPU cores. | N/A |
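As a way to reason about how these parameters relate, the following sketch models them as a small configuration object with basic validation. The class `FilterConfig` and its attribute names are hypothetical and are not part of the component. The range check assumes both thresholds lie between 0 and 1, which follows from the ratio definition but is not stated as a constraint in the source.

```python
import os
from dataclasses import dataclass
from typing import Optional

@dataclass
class FilterConfig:
    target_field: str                    # Target Process Field (required)
    max_ratio: float                     # Maximum Ratio (required)
    min_ratio: float = 0.0               # Minimum Ratio (default 0)
    num_processes: Optional[int] = None  # Number of Processes (optional)

    def __post_init__(self):
        # ASSUMPTION: thresholds must satisfy 0 <= min <= max <= 1.
        if not 0.0 <= self.min_ratio <= self.max_ratio <= 1.0:
            raise ValueError("expected 0 <= min_ratio <= max_ratio <= 1")
        if self.num_processes is None:
            # Mirrors the documented default: the number of CPU cores.
            self.num_processes = os.cpu_count() or 1

cfg = FilterConfig(target_field="text", max_ratio=0.25)
print(cfg)
```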

Select Resource Group

| Parameter | Required | Description | Default |
| --- | --- | --- | --- |
| Public Resource Group | No | Instance type (CPU or GPU), number of instances, and VPC. | N/A |
| Dedicated resource group | No | Number of vCPUs, memory, shared memory, GPUs, and instances. | N/A |
| Maximum Running Duration | No | Maximum runtime. The job is terminated if this limit is exceeded. | N/A |