Platform For AI: LLM-Quality Predict and Language Recognition-FastText (DLC)

Last Updated: Mar 11, 2026

Detect language, assign confidence scores, and filter text records using fastText. Supports 176 languages.

How it works

The component runs three operations on each record in the input dataset:

  1. Language identification -- fastText analyzes each text and returns an ISO 639 language code (such as en, zh, or fr).

  2. Confidence scoring -- fastText assigns a confidence score (0 to 1) indicating prediction reliability.

  3. Filtering -- Records whose detected language is not in the target language list or whose confidence score falls below the minimum threshold are removed from the output.
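The three operations above can be sketched in a few lines of Python. This is a minimal illustration, not the component's implementation: `predict` is a hypothetical callable that stands in for the language-identification model, `toy_predict` is a stub with made-up scores, and the `__label__` prefix handling assumes fastText-style labels.

```python
from typing import Callable, Iterable, Iterator, Tuple

# Hypothetical predictor type: takes a text, returns (label, confidence).
Predictor = Callable[[str], Tuple[str, float]]


def language_code(label: str) -> str:
    """Strip a fastText-style '__label__' prefix, if present."""
    prefix = "__label__"
    return label[len(prefix):] if label.startswith(prefix) else label


def filter_texts(
    texts: Iterable[str],
    predict: Predictor,
    target_languages: Iterable[str],
    min_score: float,
) -> Iterator[Tuple[str, str, float]]:
    """Yield (text, code, score) for every text that passes the filter."""
    targets = set(target_languages)
    for text in texts:
        label, score = predict(text)          # steps 1 and 2: language and confidence
        code = language_code(label)
        if code in targets and score >= min_score:  # step 3: filtering
            yield text, code, score


def toy_predict(text: str) -> Tuple[str, float]:
    """Stub predictor for illustration only; the scores are invented."""
    if any("\u4e00" <= ch <= "\u9fff" for ch in text):
        return ("__label__zh", 0.99)
    score = 0.40 if len(text) < 20 else 0.95
    return ("__label__en", score)
```

With a real model, `predict` would wrap the model's own prediction call; the filtering rule stays the same.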

In a typical LLM data preprocessing pipeline, language identification and filtering sit between raw text extraction and downstream quality operations such as deduplication and content scoring. Use this component after converting source data to JSON Lines format and before running quality or deduplication components.

Text extraction  -->  Language identification & filtering  -->  Quality scoring  -->  Deduplication

Input requirements

The input data file must be stored in Object Storage Service (OSS) and meet these requirements:

Requirement | Details
Format | JSON Lines (.jsonl). Each line is a standalone JSON object. The file as a whole is not a single valid JSON object.
Encoding | UTF-8. The fastText model was trained on UTF-8 data and requires UTF-8 input.
Field | Each JSON object must contain a field that holds the text to process (content in the example below). Specify the name of that field in the Target Process Field parameter.
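Before uploading a file to OSS, it can help to check these requirements locally. The following is a minimal sketch, not part of the component: `validate_jsonl` is a hypothetical helper, and skipping blank lines is an assumption about how lenient the check should be.

```python
import json
from typing import List


def validate_jsonl(raw: bytes, field: str) -> List[str]:
    """Return a list of problems; an empty list means the input looks valid."""
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        return [f"not valid UTF-8: {exc}"]

    problems = []
    # Split on "\n" only, so line separators inside JSON strings are left alone.
    for number, line in enumerate(text.split("\n"), start=1):
        if not line.strip():
            continue  # assumption: blank lines are ignored
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append(f"line {number}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(record, dict):
            problems.append(f"line {number}: not a JSON object")
        elif not isinstance(record.get(field), str):
            problems.append(f"line {number}: no string field {field!r}")
    return problems
```

Calling `validate_jsonl(data, "content")` on the bytes of a file returns an empty list when every non-blank line is a JSON object with a string `content` field.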

Example input

{"content": "Alibaba Cloud provides scalable cloud computing services.", "id": "doc_001"}
{"content": "阿里云提供弹性云计算服务。", "id": "doc_002"}
{"content": "Le cloud computing offre des services évolutifs.", "id": "doc_003"}

For a complete sample file, see the example data.

Example output

After processing with Language ID Name set to en and Minimum Score set to 0.65, only English records with a confidence score of 0.65 or higher are retained. In the example above, doc_002 (Chinese) and doc_003 (French) are removed because their detected language is not in the target list. Only the first record (doc_001) passes the filter, provided its confidence score is at least 0.65.
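The filtering outcome for the three sample records can be reproduced with a short script. This is a sketch under stated assumptions: the language codes and scores in `ASSUMED_PREDICTIONS` are invented stand-ins for model output, and writing surviving lines unchanged is an assumption about the output format.

```python
import json
from typing import Dict, List, Set, Tuple

INPUT = """\
{"content": "Alibaba Cloud provides scalable cloud computing services.", "id": "doc_001"}
{"content": "阿里云提供弹性云计算服务。", "id": "doc_002"}
{"content": "Le cloud computing offre des services évolutifs.", "id": "doc_003"}
"""

# Hypothetical predictions keyed by record id; a real run would get these
# from the language-identification model.
ASSUMED_PREDICTIONS: Dict[str, Tuple[str, float]] = {
    "doc_001": ("en", 0.97),
    "doc_002": ("zh", 0.99),
    "doc_003": ("fr", 0.96),
}


def filter_jsonl(
    jsonl: str,
    predictions: Dict[str, Tuple[str, float]],
    languages: Set[str],
    min_score: float,
) -> List[str]:
    """Return the input lines whose record passes the language filter."""
    kept = []
    for line in jsonl.split("\n"):
        if not line.strip():
            continue
        record = json.loads(line)
        code, score = predictions[record["id"]]
        if code in languages and score >= min_score:
            kept.append(line)  # assumption: records are passed through unchanged
    return kept


kept_lines = filter_jsonl(INPUT, ASSUMED_PREDICTIONS, {"en"}, 0.65)
```

Here `kept_lines` holds only the doc_001 line. Passing `{"en", "fr"}` instead would also keep doc_003.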

Supported computing resources

This component runs on Deep Learning Containers (DLC).

Supported languages

The component recognizes 176 languages. Specify languages in the Language ID Name parameter using the ISO codes listed below. Separate multiple codes with commas (for example, en,zh,fr).

af, als, am, an, ar, arz, as, ast, av, az, azb, ba, bar, bcl, be, bg, bh, bn, bo, bpy, br, bs, bxr, ca, cbk, ce, ceb, ckb, co, cs, cv, cy, da, de, diq, dsb, dty, dv, el, eml, en, eo, es, et, eu, fa, fi, fr, frr, fy, ga, gd, gl, gn, gom, gu, gv, he, hi, hif, hr, hsb, ht, hu, hy, ia, id, ie, ilo, io, is, it, ja, jbo, jv, ka, kk, km, kn, ko, krc, ku, kv, kw, ky, la, lb, lez, li, lmo, lo, lrc, lt, lv, mai, mg, mhr, min, mk, ml, mn, mr, mrj, ms, mt, mwl, my, myv, mzn, nah, nap, nds, ne, new, nl, nn, no, oc, or, os, pa, pam, pfl, pl, pms, pnb, ps, pt, qu, rm, ro, ru, rue, sa, sah, sc, scn, sco, sd, sh, si, sk, sl, so, sq, sr, su, sv, sw, ta, te, tg, th, tk, tl, tr, tt, tyv, ug, uk, ur, uz, vec, vep, vi, vls, vo, wa, war, wuu, xal, xmf, yi, yo, yue, zh
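A value for Language ID Name can be checked against this list before a job is submitted. The sketch below is illustrative only: `parse_language_ids` is a hypothetical helper, and tolerating spaces around the commas is a choice made here, not documented behavior of the component.

```python
from typing import List

# The codes listed above, copied as-is.
SUPPORTED_CODES = frozenset("""
af als am an ar arz as ast av az azb ba bar bcl be bg bh bn bo bpy br bs bxr
ca cbk ce ceb ckb co cs cv cy da de diq dsb dty dv el eml en eo es et eu fa
fi fr frr fy ga gd gl gn gom gu gv he hi hif hr hsb ht hu hy ia id ie ilo io
is it ja jbo jv ka kk km kn ko krc ku kv kw ky la lb lez li lmo lo lrc lt lv
mai mg mhr min mk ml mn mr mrj ms mt mwl my myv mzn nah nap nds ne new nl nn
no oc or os pa pam pfl pl pms pnb ps pt qu rm ro ru rue sa sah sc scn sco sd
sh si sk sl so sq sr su sv sw ta te tg th tk tl tr tt tyv ug uk ur uz vec vep
vi vls vo wa war wuu xal xmf yi yo yue zh
""".split())


def parse_language_ids(value: str) -> List[str]:
    """Split a comma-separated code list and reject unsupported codes."""
    codes = [part.strip() for part in value.split(",") if part.strip()]
    if not codes:
        raise ValueError("at least one language code is required")
    unknown = [code for code in codes if code not in SUPPORTED_CODES]
    if unknown:
        raise ValueError(f"unsupported language codes: {', '.join(unknown)}")
    return codes
```

For example, `parse_language_ids("en,zh,fr")` returns `["en", "zh", "fr"]`, while a value containing a code outside the list raises `ValueError`.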

Configure the component

On the pipeline page in Machine Learning Designer, drag the LLM-Quality Predict and Language Recognition-FastText (DLC) component onto the canvas and configure these parameters.

Section | Parameter | Required | Description | Default
Field settings | Target Process Field | Yes | Name of the JSON field containing the text to process. | No default
Field settings | Language ID Name | Yes | Target language codes for filtering. Separate multiple codes with commas (,). Example: en,zh. See Supported languages. | No default
Field settings | Minimum Score | Yes | Confidence score threshold. Records with scores below this value are filtered out. Higher values produce stricter filtering. | No default
Field settings | OSS Directory for Saving OutputData | No | OSS path for output data. If left blank, the default workspace path is used. | Default workspace path
Performance tuning | Number of Processes | No | Number of parallel processes for data processing. Increase this value for larger datasets. | 8
Performance tuning | Select Resource Group: Public Resource Group | No | Instance type (CPU or GPU), number of instances, and VPC to use. | No default
Performance tuning | Select Resource Group: Dedicated resource group | No | Number of vCPUs, memory, shared memory, GPUs, and instances to use. | No default
Performance tuning | Maximum Running Duration | No | Maximum time allowed for the component to run. The DLC job terminates if this duration is exceeded. | No default
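The required and optional parameters can be captured in a small configuration object for local checks before a pipeline run. This is a hypothetical sketch: the class and attribute names are invented for illustration and do not correspond to an API of Machine Learning Designer. The 0 to 1 range check follows the confidence score range stated earlier.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LanguageFilterConfig:
    """Hypothetical mirror of the component's parameters."""

    target_process_field: str              # required, no default
    language_id_name: str                  # required, for example "en,zh"
    minimum_score: float                   # required, confidence threshold
    output_oss_dir: Optional[str] = None   # None means the default workspace path
    number_of_processes: int = 8           # default from the parameter table

    def __post_init__(self) -> None:
        if not self.target_process_field:
            raise ValueError("Target Process Field is required")
        if not self.language_id_name:
            raise ValueError("Language ID Name is required")
        if not 0.0 <= self.minimum_score <= 1.0:
            raise ValueError("Minimum Score must be between 0 and 1")
        if self.number_of_processes < 1:
            raise ValueError("Number of Processes must be at least 1")
```

A call such as `LanguageFilterConfig("content", "en,zh", 0.65)` leaves the optional values at their defaults.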

Best practices

  • Set Minimum Score based on your precision and recall trade-off. A threshold of 0.65 is a common starting point. Lower values retain more data but include more misidentified records. Higher values produce cleaner output but may discard valid records.

  • Short texts (fewer than 20 characters) tend to have lower confidence scores. Consider removing very short records before running this component.

  • The fastText model assigns a single language label per record. For mixed-language texts, the model returns the dominant language.
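One way to choose a threshold is to measure precision and recall on a small hand-labelled sample at several candidate values, after dropping very short texts. The sketch below uses invented data: `SAMPLE` and its scores are hypothetical, and `drop_short` and `precision_recall` are illustrative helpers, not component features.

```python
from typing import List, Sequence, Tuple

# Hypothetical hand-labelled sample: (predicted language, true language, score).
SAMPLE: List[Tuple[str, str, float]] = [
    ("en", "en", 0.98),
    ("en", "en", 0.91),
    ("en", "en", 0.72),
    ("en", "en", 0.58),
    ("en", "de", 0.61),
    ("en", "nl", 0.44),
]


def drop_short(texts: Sequence[str], min_chars: int = 20) -> List[str]:
    """Keep texts with at least min_chars characters (ignoring outer whitespace)."""
    return [text for text in texts if len(text.strip()) >= min_chars]


def precision_recall(
    sample: Sequence[Tuple[str, str, float]], language: str, threshold: float
) -> Tuple[float, float]:
    """Precision and recall for one language at one confidence threshold."""
    kept = [(pred, true) for pred, true, score in sample
            if pred == language and score >= threshold]
    correct = sum(1 for _, true in kept if true == language)
    relevant = sum(1 for _, true, _ in sample if true == language)
    precision = correct / len(kept) if kept else 1.0
    recall = correct / relevant if relevant else 1.0
    return precision, recall


# Sweep a few candidate thresholds over the sample.
SWEEP = {t: precision_recall(SAMPLE, "en", t) for t in (0.40, 0.65, 0.95)}
```

On this invented sample, raising the threshold from 0.40 to 0.65 removes both misidentified records but also discards one valid English record, which is the trade-off described in the first bullet.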