Use the LLM-Count Filter (DLC) to filter text by digit and letter ratio

The LLM-Count Filter (DLC) component of Platform for AI (PAI) filters text based on the ratio of digits and letters. The input data file in Object Storage Service (OSS) must be in the JSON Lines format: each line of the file is a valid JSON object, so the file consists of multiple JSON objects, one per line, and the file as a whole is not a single valid JSON object. For more information, see Example.
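As a minimal sketch of the format requirement, the following Python snippet checks that every non-empty line parses as a JSON object. The function name `validate_jsonl` and the sample field name `text` are hypothetical and are not part of the component.

```python
import json


def validate_jsonl(lines):
    """Return the parsed records if every non-empty line is a JSON object."""
    records = []
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)  # raises json.JSONDecodeError on invalid JSON
        if not isinstance(record, dict):
            raise ValueError(f"line {lineno} is not a JSON object")
        records.append(record)
    return records


# Hypothetical sample data: two lines, each a complete JSON object.
sample = [
    '{"text": "hello world 123"}',
    '{"text": "another record"}',
]
records = validate_jsonl(sample)
```

Each line parses on its own, but joining the lines and parsing the result as one JSON document fails, which is the difference between JSON Lines and a regular JSON file.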

Supported computing resources

Deep Learning Containers (DLC)

Algorithm description

The LLM-Count Filter (DLC) component supports the following features:

Filter text based on the number or ratio of digits and letters.
The algorithm counts the digits and letters in the text and filters the text based on the configured minimum and maximum thresholds.
Filter text based on the ratio of letters to text tokens.
The algorithm splits the text into tokens by using the pythia-6.9b-deduped model, calculates the ratio of digits and letters to tokens, and filters the text based on the ratio.
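The two filters above can be sketched roughly as follows. This is an illustration under stated assumptions, not the component's implementation: all function names are hypothetical, "digits and letters" is taken to mean ASCII alphanumeric characters, the bounds are treated as inclusive, and the token-ratio function counts letters only, following the "Alpha Chars" parameter name. The tokenizer is passed in as a function. The component uses the pythia-6.9b-deduped tokenizer, whereas the example substitutes whitespace splitting as a stand-in.

```python
def alnum_count(text):
    """Count ASCII letters and digits (assumed meaning of 'digits and letters')."""
    return sum(1 for ch in text if ch.isascii() and ch.isalnum())


def alnum_ratio(text):
    """Ratio of ASCII letters and digits to the text length in characters."""
    return alnum_count(text) / len(text) if text else 0.0


def alpha_token_ratio(text, tokenize):
    """Ratio of ASCII letters to the number of tokens returned by `tokenize`."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    alpha = sum(1 for ch in text if ch.isascii() and ch.isalpha())
    return alpha / len(tokens)


def in_range(value, minimum, maximum):
    """Inclusive bounds check (inclusiveness is an assumption)."""
    return minimum <= value <= maximum


def filter_records(records, field, metric, minimum, maximum):
    """Keep the records whose `field` value has a metric within the bounds."""
    return [r for r in records if in_range(metric(r[field]), minimum, maximum)]


# Hypothetical records with a target field named "text".
records = [{"text": "abc 123!!"}, {"text": "!!! ??? ..."}]
kept = filter_records(records, "text", alnum_ratio, 0.5, 1.0)

# Whitespace splitting stands in for the real tokenizer here.
ratio = alpha_token_ratio("abc 123!!", str.split)
```

Passing the tokenizer as a parameter keeps the sketch runnable without downloading a model. A real tokenizer could be dropped in without changing the ratio logic.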

Configure the component

On the pipeline page of Machine Learning Designer, configure the parameters of the LLM-Count Filter (DLC) component. The following table describes the parameters.

Tab	Parameter		Required	Description	Default value
Fields Setting	Target Process Field		Yes	The name of the field that you want to process.	N/A
	Whether to Filter with AlphaNumeric Count or Ratio		No	Specifies whether to filter the text based on the number of digits and letters or on their ratio to the text length. If you select this option, you must configure the following parameters: Minimum Counts or Ratio of AlphaNumeric Chars, and Maximum Counts or Ratio of AlphaNumeric Chars.	Unselected
	Whether to Filter with the Ratio of the Number of alpha chars to the Number of Text Tokens		No	Specifies whether to filter the text based on the ratio of characters to text tokens. The algorithm splits the text into tokens by using the pythia-6.9b-deduped model, calculates the ratio of digits and letters to tokens, and then filters the text based on the ratio. If you select this option, you must configure the following parameters: Minimum Ratio of Alpha Chars to Text Tokens, and Maximum Ratio of Alpha Chars to Text Tokens.	Unselected
	OSS Directory for Saving OutputData		No	The OSS directory in which the generated data is stored. If you do not specify this parameter, the default path of the workspace is used.	N/A
Tuning	Number of Processes		No	The number of processes.	8
	Select Resource Group	Public Resource Group	No	The instance type (CPU or GPU), number of instances, and virtual private cloud (VPC) that you want to use.	N/A
	Select Resource Group	Dedicated resource group	No	The number of vCPUs, memory, shared memory, number of GPUs, and number of instances that you want to use.	N/A
	Maximum Running Duration		No	The maximum period of time for which the component can run. If this period of time is exceeded, the job is terminated.	N/A