Platform for AI: LLM-Special Characters Ratio Filter (DLC)

Last Updated: Feb 27, 2026

The LLM-Special Characters Ratio Filter (DLC) component in Platform for AI (PAI) filters text samples based on the proportion of special characters. Use this component in a Machine Learning Designer pipeline to remove low-quality samples from LLM training data, such as texts dominated by punctuation, digits, or whitespace.

Input data must be stored in Object Storage Service (OSS) in JSON Lines format: each line is a standalone JSON object. The file itself is not a valid JSON object. For a sample input file, see Example.
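To illustrate the JSON Lines layout described above, the following minimal sketch parses a small in-memory sample line by line. The helper name read_json_lines and the "text" field are hypothetical and chosen only for illustration; this is not the component's own loader.

```python
import io
import json


def read_json_lines(stream):
    """Yield one parsed object per non-empty line of a JSON Lines stream."""
    for line_number, line in enumerate(stream, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        try:
            yield json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {line_number} is not valid JSON: {exc}") from exc


# Hypothetical sample: two records, each a standalone JSON object on its own line.
sample = io.StringIO('{"text": "Hello, World!"}\n{"text": "!!!Hello!!!"}\n')
records = list(read_json_lines(sample))
```

Each line parses on its own, whereas passing the whole file content to json.loads would fail because the file as a whole is not a single JSON value.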

Supported computing resources

How it works

The algorithm scans every character in the text and calculates the special character ratio:

special character ratio = number of special characters / total text length

A text is filtered out if its ratio falls outside the range defined by Minimum Ratio and Maximum Ratio:

  • Ratio < Minimum Ratio: the text is removed.

  • Ratio > Maximum Ratio: the text is removed.
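The two rules above can be sketched as a single range check. The function name is_kept is hypothetical. Because the rules use strict inequalities, a ratio exactly equal to either bound is treated as kept here.

```python
def is_kept(ratio: float, max_ratio: float, min_ratio: float = 0.0) -> bool:
    """Return True when the ratio lies inside [min_ratio, max_ratio].

    A text is removed only when ratio < min_ratio or ratio > max_ratio,
    so both bounds are inclusive.
    """
    return min_ratio <= ratio <= max_ratio
```

For example, with Minimum Ratio = 0.0 and Maximum Ratio = 0.25, is_kept(0.23, 0.25) is True and is_kept(0.55, 0.25) is False.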

Special character categories

The following character types count as special characters:

| Category | Source |
|---|---|
| Punctuation | string.punctuation |
| Digits | string.digits |
| Whitespace | string.whitespace |
| Emojis | Unicode emoji characters |
| Other special characters | Additional Unicode symbols |
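As a rough sketch of how such a ratio could be computed, the following Python uses the three string constants listed above. The exact emoji and "additional Unicode symbols" sets used by the component are not documented here, so this sketch assumes that any character whose Unicode general category starts with "S" (symbols, which includes most emoji) counts as special. The function names are hypothetical, and returning 0.0 for empty text is also an assumption.

```python
import string
import unicodedata

# ASCII special characters taken from the three documented string constants.
ASCII_SPECIAL = set(string.punctuation + string.digits + string.whitespace)


def is_special(ch: str) -> bool:
    """Return True if a single character counts as special in this sketch."""
    if ch in ASCII_SPECIAL:
        return True
    # Assumption: Unicode symbol categories (Sm, Sc, Sk, So) stand in for the
    # emoji and "other special characters" rows of the table.
    return unicodedata.category(ch).startswith("S")


def special_char_ratio(text: str) -> float:
    """Number of special characters divided by total text length."""
    if not text:
        return 0.0  # assumption: empty text is treated as ratio 0.0
    special = sum(1 for ch in text if is_special(ch))
    return special / len(text)
```

Note that len(text) counts Unicode code points, so an emoji built from several code points contributes more than one character to both the numerator and the denominator in this sketch.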

Example

With Minimum Ratio = 0.0 and Maximum Ratio = 0.25:

| Text | Approximate ratio | Result |
|---|---|---|
| HelloWorld (no spaces or punctuation) | 0.0 | Kept (ratio equals Minimum Ratio) |
| Hello, World! (3 special chars out of 13) | ~0.23 | Kept (within range) |
| !!!Hello!!! (6 special chars out of 11) | ~0.55 | Filtered out (exceeds Maximum Ratio) |
| @#$%^&* (all special chars) | 1.0 | Filtered out (exceeds Maximum Ratio) |

Spaces, digits, and emojis all count as special characters. For example, the sentence Hello World 123 contains two spaces and three digits, so 5 of its 15 characters are special. Its ratio of about 0.33 exceeds the Maximum Ratio of 0.25, so the sentence is filtered out.

Configure the component

On the pipeline page of Machine Learning Designer, configure the LLM-Special Characters Ratio Filter (DLC) component with the following parameters.

Fields Setting

| Parameter | Required | Description | Default |
|---|---|---|---|
| Target Process Field | Yes | The name of the field that you want to process. | N/A |
| Minimum Ratio | No | Texts with a special character ratio below this threshold are filtered out. Value type: FLOAT. | 0 |
| Maximum Ratio | Yes | Texts with a special character ratio above this threshold are filtered out. Value type: FLOAT. | N/A |
| OSS Directory for Saving OutputData | No | The OSS path where output data is stored. If not specified, the default workspace path is used. | N/A |
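The parameters above can be modeled as a small configuration object, which makes the required fields and defaults explicit. This is only an illustrative sketch: the class FilterParams and its attribute names are hypothetical and do not correspond to an actual PAI API. The check that both ratios fall in [0, 1] is an assumption that follows from the ratio being a count divided by the text length.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FilterParams:
    """Hypothetical container mirroring the Fields Setting parameters."""

    target_field: str                 # Target Process Field (required)
    max_ratio: float                  # Maximum Ratio (required)
    min_ratio: float = 0.0            # Minimum Ratio (default 0)
    output_dir: Optional[str] = None  # None means the default workspace path

    def __post_init__(self) -> None:
        if not self.target_field:
            raise ValueError("target_field must not be empty")
        # Assumption: a ratio is count / length, so valid bounds lie in [0, 1].
        if not 0.0 <= self.min_ratio <= self.max_ratio <= 1.0:
            raise ValueError("expected 0 <= min_ratio <= max_ratio <= 1")


params = FilterParams(target_field="text", max_ratio=0.25)
```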

Tuning

| Parameter | Required | Description | Default |
|---|---|---|---|
| Number of Processes | No | The number of parallel processes for data processing. If not specified, the value is automatically set to the number of CPU cores. | N/A |
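The default behavior described for Number of Processes can be sketched as follows. The function name resolve_num_processes is hypothetical, and the fallback to 1 when the core count cannot be determined is an assumption of this sketch, not documented component behavior.

```python
import os
from typing import Optional


def resolve_num_processes(requested: Optional[int] = None) -> int:
    """Return the requested process count, or the CPU core count if unset."""
    if requested is None:
        # os.cpu_count() may return None on some platforms; fall back to 1.
        return os.cpu_count() or 1
    if requested < 1:
        raise ValueError("number of processes must be at least 1")
    return requested
```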

Select Resource Group

| Parameter | Required | Description | Default |
|---|---|---|---|
| Public Resource Group | No | The instance type (CPU or GPU), number of instances, and virtual private cloud (VPC) to use. | N/A |
| Dedicated resource group | No | The number of vCPUs, memory, shared memory, number of GPUs, and number of instances to use. | N/A |
| Maximum Running Duration | No | The maximum time the component can run. If exceeded, the job is terminated. | N/A |