Platform for AI: LLM-Quality Predict and Language Recognition-FastText (DLC)

Last Updated: Feb 28, 2026

When preprocessing large text datasets for LLM training, you need to filter out records that are not in your target language or whose language cannot be identified with enough confidence. The LLM-Quality Predict and Language Recognition-FastText (DLC) component in Platform for AI (PAI) uses fastText to detect the language of each text record, assign a confidence score, and remove records that do not match the target languages or that score below the minimum threshold. The component supports 176 languages.

How it works

The component runs three operations on every record in the input dataset:

  1. Language identification -- fastText analyzes each text and returns an ISO 639 language code (for example, en, zh, or fr).

  2. Confidence scoring -- fastText assigns a confidence score between 0 and 1 that indicates prediction reliability.

  3. Filtering -- Records whose detected language is not in the target language list, or whose confidence score falls below the minimum threshold, are removed from the output.
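The three operations above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the component's actual implementation: filter_records and stub_predict are hypothetical names, the stub's scores are invented for the demonstration, and the predictor is assumed to return a fastText-style label such as __label__en together with a probability.

```python
from typing import Callable, Iterable, Iterator, Tuple

# A predictor maps text to (label, score). fastText language-ID models
# return labels of the form "__label__<code>"; this sketch assumes that shape.
Predictor = Callable[[str], Tuple[str, float]]

LABEL_PREFIX = "__label__"


def filter_records(
    records: Iterable[dict],
    predict: Predictor,
    field: str,
    target_langs: set,
    min_score: float,
) -> Iterator[dict]:
    """Yield records whose detected language is targeted and confident enough."""
    for record in records:
        # 1. Language identification and 2. confidence scoring
        label, score = predict(record[field])
        lang = label[len(LABEL_PREFIX):] if label.startswith(LABEL_PREFIX) else label
        # 3. Filtering: keep only target languages at or above the threshold
        if lang in target_langs and score >= min_score:
            yield record


def stub_predict(text: str) -> Tuple[str, float]:
    """Hypothetical stand-in for a real model; the scores are made up."""
    if any("\u4e00" <= ch <= "\u9fff" for ch in text):
        return "__label__zh", 0.99
    if "évolutifs" in text:
        return "__label__fr", 0.97
    return "__label__en", 0.95


records = [
    {"content": "Alibaba Cloud provides scalable cloud computing services.", "id": "doc_001"},
    {"content": "阿里云提供弹性云计算服务。", "id": "doc_002"},
    {"content": "Le cloud computing offre des services évolutifs.", "id": "doc_003"},
]
kept = list(filter_records(records, stub_predict, "content", {"en"}, 0.65))
print([r["id"] for r in kept])
```

With a real model, predict could wrap the fastText Python binding, whose model.predict(text) call returns a tuple of labels and a matching array of probabilities. That binding rejects text containing newline characters, so newlines would need to be replaced before prediction.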

In a typical LLM data preprocessing pipeline, language identification and filtering sit between raw text extraction and downstream quality operations such as content scoring and deduplication. Use this component after converting source data to JSON Lines format and before running quality or deduplication components.

Text extraction  -->  Language identification & filtering  -->  Quality scoring  -->  Deduplication

Input requirements

The input data file must be stored in Object Storage Service (OSS) and meet the following requirements:

Requirement | Details
Format | JSON Lines (.jsonl). Each line is a standalone JSON object. The file as a whole is not a single valid JSON object.
Encoding | UTF-8. The fastText model was trained on UTF-8 data and expects UTF-8 input.
Field | Each JSON object must contain a field that holds the text to process (content in the example below). Specify the name of that field in the Target Process Field parameter.
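Before uploading a file to OSS, it can help to check these requirements locally. The following is a minimal sketch, assuming the text field is named content; validate_jsonl is a hypothetical helper and is not part of PAI. It works on raw bytes so that encoding problems are caught separately from JSON problems.

```python
import json
from typing import Iterable, List, Tuple


def validate_jsonl(raw_lines: Iterable[bytes], field: str) -> List[Tuple[int, str]]:
    """Return (line_number, problem) pairs; an empty list means the input passes."""
    problems = []
    for number, raw in enumerate(raw_lines, start=1):
        try:
            text = raw.decode("utf-8")      # encoding must be UTF-8
        except UnicodeDecodeError:
            problems.append((number, "not valid UTF-8"))
            continue
        try:
            obj = json.loads(text)          # each line is standalone JSON
        except json.JSONDecodeError:
            problems.append((number, "not valid JSON"))
            continue
        if not isinstance(obj, dict):
            problems.append((number, "not a JSON object"))
        elif not isinstance(obj.get(field), str):
            problems.append((number, f"missing string field {field!r}"))
    return problems


sample = [
    b'{"content": "Alibaba Cloud provides scalable cloud computing services.", "id": "doc_001"}\n',
    b'{"id": "doc_002"}\n',
    b'{"content": "truncated...\n',
    b'\xff\xfe{"content": "wrong encoding"}\n',
]
for line_number, problem in validate_jsonl(sample, field="content"):
    print(line_number, problem)
```

To check a real file, pass a file object opened in binary mode, for example open(path, "rb"), which yields one bytes line at a time.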

Example input

{"content": "Alibaba Cloud provides scalable cloud computing services.", "id": "doc_001"}
{"content": "阿里云提供弹性云计算服务。", "id": "doc_002"}
{"content": "Le cloud computing offre des services évolutifs.", "id": "doc_003"}

For a complete sample file, see the example data.

Expected output

With Target Process Field set to content, Language ID Name set to en, and Minimum Score set to 0.65, only records identified as English with a confidence score of 0.65 or higher are retained. In the example above, doc_002 (Chinese) and doc_003 (French) are removed, and doc_001 is retained as long as its English confidence score is at least 0.65.

Supported computing resources

This component runs on Deep Learning Containers (DLC).

Supported languages

The component recognizes 176 languages. Specify languages in the Language ID Name parameter using the ISO codes listed below. Separate multiple codes with commas -- for example, en,zh,fr.

af, als, am, an, ar, arz, as, ast, av, az, azb, ba, bar, bcl, be, bg, bh, bn, bo, bpy, br, bs, bxr, ca, cbk, ce, ceb, ckb, co, cs, cv, cy, da, de, diq, dsb, dty, dv, el, eml, en, eo, es, et, eu, fa, fi, fr, frr, fy, ga, gd, gl, gn, gom, gu, gv, he, hi, hif, hr, hsb, ht, hu, hy, ia, id, ie, ilo, io, is, it, ja, jbo, jv, ka, kk, km, kn, ko, krc, ku, kv, kw, ky, la, lb, lez, li, lmo, lo, lrc, lt, lv, mai, mg, mhr, min, mk, ml, mn, mr, mrj, ms, mt, mwl, my, myv, mzn, nah, nap, nds, ne, new, nl, nn, no, oc, or, os, pa, pam, pfl, pl, pms, pnb, ps, pt, qu, rm, ro, ru, rue, sa, sah, sc, scn, sco, sd, sh, si, sk, sl, so, sq, sr, su, sv, sw, ta, te, tg, th, tk, tl, tr, tt, tyv, ug, uk, ur, uz, vec, vep, vi, vls, vo, wa, war, wuu, xal, xmf, yi, yo, yue, zh
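A typo in Language ID Name can be caught locally before the job is submitted. The sketch below copies the code list above into a set and validates a comma-separated value against it. parse_language_ids is a hypothetical helper, and trimming whitespace around each code is a lenient assumption; the component itself may be stricter.

```python
# The 176 codes listed above, copied verbatim.
SUPPORTED_CODES = frozenset(
    code.strip()
    for code in (
        "af, als, am, an, ar, arz, as, ast, av, az, azb, ba, bar, bcl, be, bg, bh, bn, bo, bpy, "
        "br, bs, bxr, ca, cbk, ce, ceb, ckb, co, cs, cv, cy, da, de, diq, dsb, dty, dv, el, eml, "
        "en, eo, es, et, eu, fa, fi, fr, frr, fy, ga, gd, gl, gn, gom, gu, gv, he, hi, hif, "
        "hr, hsb, ht, hu, hy, ia, id, ie, ilo, io, is, it, ja, jbo, jv, ka, kk, km, kn, ko, "
        "krc, ku, kv, kw, ky, la, lb, lez, li, lmo, lo, lrc, lt, lv, mai, mg, mhr, min, mk, ml, "
        "mn, mr, mrj, ms, mt, mwl, my, myv, mzn, nah, nap, nds, ne, new, nl, nn, no, oc, or, os, "
        "pa, pam, pfl, pl, pms, pnb, ps, pt, qu, rm, ro, ru, rue, sa, sah, sc, scn, sco, sd, sh, "
        "si, sk, sl, so, sq, sr, su, sv, sw, ta, te, tg, th, tk, tl, tr, tt, tyv, ug, uk, "
        "ur, uz, vec, vep, vi, vls, vo, wa, war, wuu, xal, xmf, yi, yo, yue, zh"
    ).split(",")
)


def parse_language_ids(value: str) -> list:
    """Split a comma-separated code list and reject codes not in the list."""
    codes = [code.strip() for code in value.split(",") if code.strip()]
    if not codes:
        raise ValueError("at least one language code is required")
    unknown = [code for code in codes if code not in SUPPORTED_CODES]
    if unknown:
        raise ValueError(f"unsupported language codes: {unknown}")
    return codes


print(parse_language_ids("en,zh,fr"))
```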

Configure the component

On the pipeline page of Machine Learning Designer, drag the LLM-Quality Predict and Language Recognition-FastText (DLC) component onto the canvas and configure the parameters described below.

Tab | Parameter | Required | Description | Default value
Fields Setting | Target Process Field | Yes | Name of the JSON field that contains the text to process. | No default
Fields Setting | Language ID Name | Yes | Target language codes for filtering. Separate multiple codes with commas (,). Example: en,zh. See Supported languages for valid codes. | No default
Fields Setting | Minimum Score | Yes | Confidence score threshold. Records with a score below this value are filtered out. A higher value produces stricter filtering. | No default
Fields Setting | OSS Directory for Saving OutputData | No | OSS path for the output data. If left blank, the default workspace path is used. | Default workspace path
Tuning | Number of Processes | No | Number of parallel processes for data processing. Increase this value for larger datasets. | 8
Tuning | Select Resource Group: Public Resource Group | No | Instance type (CPU or GPU), number of instances, and virtual private cloud (VPC) to use. | No default
Tuning | Select Resource Group: Dedicated resource group | No | Number of vCPUs, memory, shared memory, number of GPUs, and number of instances to use. | No default
Tuning | Maximum Running Duration | No | Maximum time the component is allowed to run. The DLC job is terminated if this duration is exceeded. | No default
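When the same settings are reused across pipelines, it can be convenient to record and sanity-check them in code before entering them in Machine Learning Designer. The following sketch mirrors the Fields Setting parameters and Number of Processes from the table above. The class name and attribute names are hypothetical, and the 0 to 1 range check on Minimum Score is an assumption derived from the range of the confidence score, not a documented constraint.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FastTextFilterParams:
    """Hypothetical container mirroring the documented parameters."""
    target_process_field: str               # required
    language_id_name: str                   # required, for example "en,zh"
    minimum_score: float                    # required
    output_oss_dir: Optional[str] = None    # None means the default workspace path
    number_of_processes: int = 8            # documented default

    def __post_init__(self) -> None:
        if not self.target_process_field:
            raise ValueError("Target Process Field is required")
        if not self.language_id_name.strip():
            raise ValueError("Language ID Name is required")
        # Assumption: thresholds outside the score range are never useful.
        if not 0.0 <= self.minimum_score <= 1.0:
            raise ValueError("Minimum Score should lie between 0 and 1")
        if self.number_of_processes < 1:
            raise ValueError("Number of Processes must be at least 1")


params = FastTextFilterParams(
    target_process_field="content",
    language_id_name="en",
    minimum_score=0.65,
)
print(params)
```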

Best practices

  • Set Minimum Score based on the precision and recall trade-off for your use case. A threshold of 0.65 is a common starting point. Lower values retain more data but include more misidentified records. Higher values produce cleaner output but may discard valid records.

  • Short texts (fewer than 20 characters) tend to produce lower confidence scores. Consider removing very short records before running this component.

  • The fastText model assigns a single language label per record. For mixed-language texts, the model returns the dominant language.
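The first two practices can be tried on a small sample before a full run. The sketch below drops records shorter than 20 characters and reports what fraction of records would survive several candidate Minimum Score values. Both helper names are hypothetical, and the confidence scores are made up for illustration; in practice they would come from running language identification on a sample of your own data.

```python
from typing import Dict, Iterable, List, Sequence


def drop_short(records: Iterable[dict], field: str, min_chars: int = 20) -> List[dict]:
    """Keep only records whose text has at least min_chars characters."""
    return [r for r in records if len(r[field].strip()) >= min_chars]


def retention_by_threshold(
    scores: Sequence[float], thresholds: Sequence[float]
) -> Dict[float, float]:
    """Fraction of records that would survive each candidate Minimum Score."""
    total = len(scores)
    return {
        t: (sum(s >= t for s in scores) / total if total else 0.0)
        for t in thresholds
    }


# Made-up confidence scores for a small sample of records.
sample_scores = [0.99, 0.97, 0.91, 0.80, 0.66, 0.64, 0.41, 0.30]
for threshold, kept in retention_by_threshold(sample_scores, [0.5, 0.65, 0.8]).items():
    print(f"min score {threshold:.2f}: keeps {kept:.1%}")
```

Retention alone does not show how many of the retained records are correctly identified, so a manual check of records near the chosen threshold is still worthwhile.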