When preprocessing large text datasets for LLM training, you need to filter out records that are not in your target language or that have low-quality language predictions. The LLM-Quality Predict and Language Recognition-FastText (DLC) component in Platform for AI (PAI) uses fastText to detect the language of each text record, assign a confidence score, and filter out records that do not match the target languages or fall below the minimum score threshold. The component supports 176 languages.
How it works
The component runs three operations on every record in the input dataset:
Language identification -- fastText analyzes each text and returns an ISO 639 language code (for example, en, zh, or fr).
Confidence scoring -- fastText assigns a confidence score between 0 and 1 that indicates prediction reliability.
Filtering -- Records whose detected language is not in the target language list, or whose confidence score falls below the minimum threshold, are removed from the output.
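The three operations can be sketched in a few lines of Python. This is a minimal illustration, not the component's actual implementation: the helper names `parse_prediction` and `keep_record` are hypothetical, and the prediction passed in at the end is invented sample data shaped like the (labels, probabilities) pair that the fastText Python method `model.predict(text, k=1)` returns, where each label carries a `__label__` prefix.

```python
from typing import Sequence, Set, Tuple

LABEL_PREFIX = "__label__"  # prefix fastText puts in front of each predicted label


def parse_prediction(labels: Sequence[str], probs: Sequence[float]) -> Tuple[str, float]:
    """Turn a top-1 fastText-style prediction into (language_code, score)."""
    label = labels[0]
    code = label[len(LABEL_PREFIX):] if label.startswith(LABEL_PREFIX) else label
    # Clamp to [0, 1]: floating-point rounding can push a probability slightly past 1.0.
    score = min(max(float(probs[0]), 0.0), 1.0)
    return code, score


def keep_record(code: str, score: float, targets: Set[str], min_score: float) -> bool:
    """A record survives only if its language is targeted AND its score meets the threshold."""
    return code in targets and score >= min_score


# Hypothetical prediction (not real model output), used only to exercise the helpers.
lang, score = parse_prediction(("__label__en",), [0.97])
print(lang, score, keep_record(lang, score, {"en"}, 0.65))
```

Both conditions must hold for a record to be kept, so a high-confidence prediction in a non-target language is removed just like a low-confidence prediction in a target language.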
In a typical LLM data preprocessing pipeline, language identification and filtering sit between raw text extraction and downstream quality operations such as deduplication and content scoring. Use this component after converting source data to JSON Lines format, and before running quality or deduplication components.
Text extraction --> Language identification & filtering --> Quality scoring --> Deduplication
Input requirements
The input data file must be stored in Object Storage Service (OSS) and meet the following requirements:
| Requirement | Details |
|---|---|
| Format | JSON Lines (.jsonl). Each line is a standalone JSON object. The file itself is not a valid JSON object. |
| Encoding | UTF-8. The fastText model was trained on UTF-8 data and expects UTF-8 input. |
| Field | Each JSON object must contain a field that holds the text to process (content in the example below). Specify the name of that field in the Target Process Field parameter. |
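Before uploading a file to OSS, it can help to check these requirements locally. The sketch below is one possible pre-flight check, not part of the component: `find_invalid_lines` is a hypothetical helper, and it also rejects blank lines, which is a stricter reading of "each line is a standalone JSON object" than the component may enforce.

```python
import json
from typing import Iterable, List, Tuple


def find_invalid_lines(lines: Iterable[bytes], field: str) -> List[Tuple[int, str]]:
    """Return (line_number, reason) for every line that breaks a requirement."""
    problems = []
    for number, raw in enumerate(lines, start=1):
        try:
            text = raw.decode("utf-8")  # requirement: UTF-8 encoding
        except UnicodeDecodeError:
            problems.append((number, "not valid UTF-8"))
            continue
        if not text.strip():
            problems.append((number, "blank line"))
            continue
        try:
            record = json.loads(text)  # requirement: one JSON value per line
        except json.JSONDecodeError:
            problems.append((number, "not valid JSON"))
            continue
        if not isinstance(record, dict):
            problems.append((number, "not a JSON object"))
        elif not isinstance(record.get(field), str):
            problems.append((number, f"missing or non-string field '{field}'"))
    return problems


# Invented sample lines; for a real file, pass a handle opened with open(path, "rb").
sample = [
    '{"content": "Hello", "id": "doc_001"}'.encode("utf-8"),
    b'{"id": "doc_002"}',
    b"[1, 2, 3]",
    b"\xff\xfe",
]
print(find_invalid_lines(sample, "content"))
```

Reading the file in binary mode and decoding each line explicitly lets the check report encoding problems per line instead of failing on the first bad byte.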
Example input
{"content": "Alibaba Cloud provides scalable cloud computing services.", "id": "doc_001"}
{"content": "阿里云提供弹性云计算服务。", "id": "doc_002"}
{"content": "Le cloud computing offre des services évolutifs.", "id": "doc_003"}
For a complete sample file, see the example data.
Expected output
After processing with Language ID Name set to en and Minimum Score set to 0.65, only records identified as English with a confidence score of 0.65 or higher are retained. In the example above, only the first record (doc_001) would pass the filter because doc_002 is Chinese and doc_003 is French.
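The walk-through above can be reproduced offline as a rough sketch. The language codes and scores in `ASSUMED_PREDICTIONS` are hypothetical stand-ins for model output, chosen only to make the example runnable without the fastText model, and `filter_jsonl` is an illustrative helper rather than anything the component exposes.

```python
import json

SAMPLE = """\
{"content": "Alibaba Cloud provides scalable cloud computing services.", "id": "doc_001"}
{"content": "阿里云提供弹性云计算服务。", "id": "doc_002"}
{"content": "Le cloud computing offre des services évolutifs.", "id": "doc_003"}
"""

# Hypothetical (language, score) per record id; NOT real model output.
ASSUMED_PREDICTIONS = {
    "doc_001": ("en", 0.95),
    "doc_002": ("zh", 0.99),
    "doc_003": ("fr", 0.97),
}


def filter_jsonl(jsonl, targets, min_score):
    """Keep records whose assumed language is targeted and whose score meets the threshold."""
    kept = []
    for line in jsonl.splitlines():
        record = json.loads(line)
        lang, score = ASSUMED_PREDICTIONS[record["id"]]
        if lang in targets and score >= min_score:
            kept.append(record)
    return kept


kept = filter_jsonl(SAMPLE, {"en"}, 0.65)
print([record["id"] for record in kept])
```

With these assumed scores, widening the target set to en, zh, and fr would keep all three records, while raising the threshold above the assumed score for doc_001 would leave the output empty.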
Supported computing resources
This component runs on Deep Learning Containers (DLC).
Supported languages
The component recognizes 176 languages. Specify languages in the Language ID Name parameter using the ISO codes listed below. Separate multiple codes with commas -- for example, en,zh,fr.
af, als, am, an, ar, arz, as, ast, av, az, azb, ba, bar, bcl, be, bg, bh, bn, bo, bpy, br, bs, bxr, ca, cbk, ce, ceb, ckb, co, cs, cv, cy, da, de, diq, dsb, dty, dv, el, eml, en, eo, es, et, eu, fa, fi, fr, frr, fy, ga, gd, gl, gn, gom, gu, gv, he, hi, hif, hr, hsb, ht, hu, hy, ia, id, ie, ilo, io, is, it, ja, jbo, jv, ka, kk, km, kn, ko, krc, ku, kv, kw, ky, la, lb, lez, li, lmo, lo, lrc, lt, lv, mai, mg, mhr, min, mk, ml, mn, mr, mrj, ms, mt, mwl, my, myv, mzn, nah, nap, nds, ne, new, nl, nn, no, oc, or, os, pa, pam, pfl, pl, pms, pnb, ps, pt, qu, rm, ro, ru, rue, sa, sah, sc, scn, sco, sd, sh, si, sk, sl, so, sq, sr, su, sv, sw, ta, te, tg, th, tk, tl, tr, tt, tyv, ug, uk, ur, uz, vec, vep, vi, vls, vo, wa, war, wuu, xal, xmf, yi, yo, yue, zh
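A mistyped code (for example, cn instead of zh) would match no records, so it can be worth checking the value before entering it in Language ID Name. The sketch below is a hypothetical local check: `parse_language_ids` and `unknown_codes` are illustrative names, `SUPPORTED_EXCERPT` is only a small excerpt of the list above, and whether the component itself tolerates spaces or uppercase letters is not stated here, so the code normalizes both as a precaution.

```python
from typing import Iterable, List, Set


def parse_language_ids(value: str) -> List[str]:
    """Split a comma-separated code list, trimming spaces and dropping duplicates."""
    codes: List[str] = []
    for part in value.split(","):
        code = part.strip().lower()
        if code and code not in codes:
            codes.append(code)
    return codes


def unknown_codes(codes: Iterable[str], supported: Set[str]) -> List[str]:
    """Return the codes that do not appear in the supported set."""
    return [code for code in codes if code not in supported]


# Excerpt of the supported list, for illustration only; use the full list in practice.
SUPPORTED_EXCERPT = {"en", "zh", "fr", "de", "ja", "yue", "wuu"}

codes = parse_language_ids("en, ZH,fr,en")
print(",".join(codes))
print(unknown_codes(["en", "cn"], SUPPORTED_EXCERPT))
```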
Configure the component
On the pipeline page of Machine Learning Designer, drag the LLM-Quality Predict and Language Recognition-FastText (DLC) component onto the canvas and configure the parameters described below.
| Tab | Parameter | Required | Description | Default value |
|---|---|---|---|---|
| Fields Setting | Target Process Field | Yes | Name of the JSON field that contains the text to process. | No default |
| Fields Setting | Language ID Name | Yes | Target language codes for filtering. Separate multiple codes with commas. | No default |
| Fields Setting | Minimum Score | Yes | Confidence score threshold. Records with a score below this value are filtered out. A higher value produces stricter filtering. | No default |
| Fields Setting | OSS Directory for Saving OutputData | No | OSS path for the output data. If left blank, the default workspace path is used. | Default workspace path |
| Tuning | Number of Processes | No | Number of parallel processes for data processing. Increase this value for larger datasets. | 8 |
| Tuning | Select Resource Group: Public Resource Group | No | Instance type (CPU or GPU), number of instances, and virtual private cloud (VPC) to use. | No default |
| Tuning | Select Resource Group: Dedicated resource group | No | Number of vCPUs, memory, shared memory, number of GPUs, and number of instances to use. | No default |
| Tuning | Maximum Running Duration | No | Maximum time the component is allowed to run. The DLC job is terminated if this duration is exceeded. | No default |
Best practices
Set Minimum Score based on the precision and recall trade-off for your use case. A threshold of 0.65 is a common starting point. Lower values retain more data but include more misidentified records. Higher values produce cleaner output but may discard valid records.
Short texts (fewer than 20 characters) tend to produce lower confidence scores. Consider removing very short records before running this component.
The fastText model assigns a single language label per record. For mixed-language texts, the model returns the dominant language.
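The first two practices can be tried on a small sample before a full run. The sketch below is illustrative only: `split_by_length` and `retained_at` are hypothetical helpers, the records and confidence scores are invented, and the 20-character cutoff simply mirrors the guideline above.

```python
from typing import Dict, Iterable, List, Sequence, Tuple


def split_by_length(records: Iterable[dict], field: str, min_chars: int = 20) -> Tuple[List[dict], List[dict]]:
    """Separate records whose text has at least min_chars characters from shorter ones."""
    long_enough, too_short = [], []
    for record in records:
        if len(record[field].strip()) >= min_chars:
            long_enough.append(record)
        else:
            too_short.append(record)
    return long_enough, too_short


def retained_at(scores: Sequence[float], thresholds: Sequence[float]) -> Dict[float, int]:
    """Count how many scores would pass each candidate Minimum Score value."""
    return {t: sum(s >= t for s in scores) for t in thresholds}


# Invented sample records and scores, for illustration only.
records = [
    {"content": "OK", "id": "a"},
    {"content": "Cloud computing scales on demand.", "id": "b"},
]
usable, short = split_by_length(records, "content")
print([r["id"] for r in usable], [r["id"] for r in short])

hypothetical_scores = [0.42, 0.58, 0.71, 0.93, 0.99]
print(retained_at(hypothetical_scores, [0.5, 0.65, 0.8]))
```

Counting retained records at several thresholds on a scored sample gives a quick view of how much data each Minimum Score setting would discard; judging how many of the discarded records were actually valid still requires manual inspection.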