Description of the LLM-Text Normalizer (DLC) component - Platform For AI

Raw text datasets used for LLM training often contain broken Unicode characters — garbled punctuation, misencoded apostrophes, and fullwidth characters that degrade model quality — as well as traditional Chinese text that needs to be unified to simplified Chinese. The LLM-Text Normalizer (DLC) component fixes both issues automatically as part of a Machine Learning Designer pipeline, so you can feed cleaner data to downstream filtering and deduplication steps without writing custom preprocessing scripts.

How it works

The component processes each JSON object in the input file and applies the selected normalization operations to the target field:

Unicode normalization uses the ftfy library to repair broken Unicode and then applies NFKC (Normalization Form Compatibility Composition) normalization via ftfy.fix_text(text, normalization='NFKC').
Traditional-to-simplified Chinese conversion uses the opencc library to convert traditional Chinese characters to simplified Chinese.

Both operations are enabled by default and can be applied independently.
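The per-record flow described above can be sketched roughly as follows. This is an illustration, not the component's actual source: the function names `fix_unicode` and `normalize_record`, the `to_simplified` hook, the order of the two steps, and the handling of records that lack the target field are all assumptions. The sketch uses the `ftfy.fix_text` call quoted above when ftfy is installed and otherwise falls back to the standard library's NFKC normalization, which does not repair broken Unicode.

```python
import json
import unicodedata

try:
    import ftfy  # third-party; the library the component is described as using

    def fix_unicode(text):
        # The call quoted in the description above.
        return ftfy.fix_text(text, normalization="NFKC")
except ImportError:
    def fix_unicode(text):
        # Fallback for this sketch only: plain NFKC, without ftfy's repairs.
        return unicodedata.normalize("NFKC", text)


def normalize_record(line, field, unicode_nfkc=True, to_simplified=None):
    """Normalize one JSONL line and return it as a JSON string.

    `to_simplified` is a hypothetical hook: any callable str -> str, for
    example a wrapper around an opencc converter. Both steps are optional,
    mirroring the two independent switches of the component.
    """
    record = json.loads(line)
    text = record.get(field)
    if isinstance(text, str):  # assumption: other records pass through unchanged
        if unicode_nfkc:
            text = fix_unicode(text)
        if to_simplified is not None:
            text = to_simplified(text)
        record[field] = text
    return json.dumps(record, ensure_ascii=False)


print(normalize_record('{"id": 1, "text": "ＬＬＭ　ｄａｔａ"}', "text"))
# prints {"id": 1, "text": "LLM data"}
```

Passing the converter in as a callable keeps the sketch runnable without opencc installed and makes each step easy to switch off on its own.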

Prerequisites

Before you begin, make sure you have:

Input data stored in Object Storage Service (OSS) in JSON Lines (JSONL) format

Input data format

The input file must meet the following requirements:

Each line is a valid JSON object
The file contains one JSON object per line, with no enclosing array and no commas between objects
The file as a whole is therefore not a single valid JSON document

For a sample input file, see the example data.
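The format requirements above can be checked before uploading a file to OSS. The following is a minimal sketch using only the Python standard library; the function name `find_invalid_lines` and the sample records are made up for illustration, and treating blank lines as errors is a strict reading of the requirements rather than documented behavior of the component.

```python
import json


def find_invalid_lines(lines):
    """Return (line_number, reason) for every line that is not a JSON object."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            # Assumption: a blank line counts as a violation.
            problems.append((number, "blank line"))
            continue
        try:
            value = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((number, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(value, dict):
            # Valid JSON, but an array or scalar rather than an object.
            problems.append((number, "not a JSON object"))
    return problems


# Hypothetical sample: two valid records followed by two invalid lines.
sample = [
    '{"id": 1, "text": "first record"}',
    '{"id": 2, "text": "second record"}',
    '[1, 2, 3]',
    '{"id": 4, "text": "unterminated',
]

for number, reason in find_invalid_lines(sample):
    print(f"line {number}: {reason}")
```

To check a real file, pass an open file handle instead of the list, for example `find_invalid_lines(open(path, encoding="utf-8"))`.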

Supported computing resources

Deep Learning Containers (DLC)

Configure the component

On the pipeline page of Machine Learning Designer, configure the parameters of the LLM-Text Normalizer (DLC) component.

Fields setting

Parameter	Required	Description	Default
Target Process Field	Yes	The name of the JSON field to normalize.	—
Whether to normalize Unicode text (NFKC form)	No	Repairs broken Unicode with ftfy and applies NFKC normalization.	Selected
Whether to convert traditional to simplified chinese	No	Converts traditional Chinese characters to simplified Chinese using opencc.	Selected
OSS Directory for Saving OutputData	No	The OSS directory for the output data. If left blank, the default workspace path is used.	—

Tuning

Parameter	Required	Description	Default
Number of Processes	No	The number of parallel processes for normalization.	8
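To give a sense of what a process count means for this kind of workload, the sketch below spreads normalization over a pool of worker processes with Python's `multiprocessing` module. How the component parallelizes work internally is not documented here, so the pool, the chunk size, and the names `nfkc` and `normalize_all` are assumptions; only the default of 8 comes from the table above. Plain NFKC stands in for the full normalization step.

```python
import unicodedata
from multiprocessing import Pool


def nfkc(text):
    # Stand-in for the full normalization step (NFKC only).
    return unicodedata.normalize("NFKC", text)


def normalize_all(texts, processes=8):
    """Normalize a list of strings, using `processes` worker processes."""
    if processes <= 1:
        # Serial path: avoids process start-up cost for small inputs.
        return [nfkc(t) for t in texts]
    with Pool(processes=processes) as pool:
        # pool.map preserves input order; chunksize batches the work
        # so that each task sent to a worker covers many strings.
        return pool.map(nfkc, texts, chunksize=256)
```

When calling `normalize_all` with more than one process from a script, place the call under an `if __name__ == "__main__":` guard so that worker start-up does not re-run it. More processes help only while CPU cores are available, so a value above the instance's vCPU count is unlikely to speed things up.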

Select resource group

Parameter	Required	Description	Default
Public Resource Group	No	The instance type (CPU or GPU), number of instances, and virtual private cloud (VPC).	—
Dedicated resource group	No	The number of vCPUs, memory, shared memory, number of GPUs, and number of instances.	—
Maximum Running Duration	No	The maximum time the component can run. If exceeded, the job is terminated.	—

Examples

The following screenshots show the same text field before and after normalization.

Before processing: [screenshot]

After processing: [screenshot]
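As a rough text illustration of the kind of change the Unicode step makes, the sketch below prints a few strings before and after NFKC normalization. The sample strings are made up for this illustration and are not the content of the screenshots. The sketch uses the standard library's `unicodedata` module, so it shows only the NFKC part, not ftfy's repair of misencoded text or the traditional-to-simplified conversion.

```python
import unicodedata


def before_after(samples):
    """Pair each string with its NFKC-normalized form."""
    return [(s, unicodedata.normalize("NFKC", s)) for s in samples]


# Hypothetical inputs: fullwidth Latin and punctuation, a ligature with a
# circled digit, and halfwidth katakana.
samples = ["Ｈｅｌｌｏ，　ｗｏｒｌｄ！", "ﬁle ①", "ｶﾀｶﾅ"]

for before, after in before_after(samples):
    print(f"{before!r} -> {after!r}")
# 'Ｈｅｌｌｏ，　ｗｏｒｌｄ！' -> 'Hello, world!'
# 'ﬁle ①' -> 'file 1'
# 'ｶﾀｶﾅ' -> 'カタカナ'
```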