
Platform for AI: LLM-Count Filter (MaxCompute)

Last Updated: May 22, 2024

You can use the LLM-Count Filter (MaxCompute) component to preprocess the text data that is used to train large language models (LLMs). The component filters text samples based on metrics related to letters, digits, or delimiters.

Limits

The LLM-Count Filter (MaxCompute) component supports only MaxCompute resources.

Algorithm

The LLM-Count Filter (MaxCompute) component filters text samples based on the following metrics:

  • The number of digits or the ratio of digits to total characters

    If you specify a delimiter, the component splits each text sample into a word list and calculates the metric at the word level instead of the character level.

  • The number of letters or the ratio of letters to total characters

    If you specify a delimiter, the component splits each text sample into a word list and calculates the metric at the word level instead of the character level.

  • The number of alphanumeric characters or the ratio of alphanumeric characters to total characters

    If you specify a delimiter, the component splits each text sample into a word list and calculates the metric at the word level instead of the character level.

  • The ratio of letters to total tokens

    The component uses the pythia-6.9b-deduped model to split a text sample into tokens and divides the number of letters by the total number of tokens to calculate the ratio.

  • The number of delimiters
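
The metrics above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the component's implementation. The function names (count_metrics, delimiter_count, alpha_token_ratio) are hypothetical. Python's str.isdigit, str.isalpha, and str.isalnum stand in for the component's character classes, which this page does not specify. In word mode, a word counts toward a metric only if every character in it matches, which is one possible reading of "at the word level". The tokenizer is passed in as a callable so that the sketch stays self-contained; a whitespace splitter takes the place of the pythia-6.9b-deduped tokenizer.

```python
from typing import Callable, Dict, List, Optional


def count_metrics(text: str, delimiter: Optional[str] = None) -> Dict[str, float]:
    """Return digit, letter, and alphanumeric counts and ratios.

    Without a delimiter, each character is one unit. With a delimiter,
    the text is split into words and each non-empty word is one unit
    (assumption: a word counts only if all of its characters match).
    """
    if delimiter:
        units: List[str] = [word for word in text.split(delimiter) if word]
    else:
        units = list(text)

    total = len(units)
    counts = {
        "digit": sum(unit.isdigit() for unit in units),
        "alpha": sum(unit.isalpha() for unit in units),
        "alnum": sum(unit.isalnum() for unit in units),
    }

    metrics: Dict[str, float] = {"total": total}
    for name, count in counts.items():
        metrics[f"{name}_count"] = count
        # Guard against empty samples to avoid division by zero.
        metrics[f"{name}_ratio"] = count / total if total else 0.0
    return metrics


def delimiter_count(text: str, delimiter: str) -> int:
    """Count non-overlapping occurrences of the delimiter in the text."""
    return text.count(delimiter)


def alpha_token_ratio(text: str, tokenize: Callable[[str], List[str]]) -> float:
    """Divide the number of letters by the number of tokens.

    `tokenize` is a hypothetical stand-in for the model tokenizer.
    """
    tokens = tokenize(text)
    letters = sum(ch.isalpha() for ch in text)
    return letters / len(tokens) if tokens else 0.0


# Usage: character-level metrics, then word-level metrics with a space delimiter.
char_level = count_metrics("abc123")
word_level = count_metrics("abc 123 a1", delimiter=" ")
```

With these definitions, char_level["digit_ratio"] is 0.5 (three digits out of six characters), and word_level["digit_count"] is 1 because only the word 123 consists entirely of digits.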

Configure the component

You can configure the parameters of the LLM-Count Filter (MaxCompute) component in the Machine Learning Designer module of the Platform for AI (PAI) console. The following table describes the parameters.

| Tab | Parameter | Required | Description | Default value |
| --- | --- | --- | --- | --- |
| Fields Setting | Select Target Column | Yes | The column that you want to process. You can select multiple columns. | No default value |
| | Text Separator | No | The delimiter that is used to split a text sample into a word list. After the split, the metric value is calculated by words. If you leave this parameter empty, the component calculates the metric value by characters. Enclose the delimiter in double quotation marks (""). | " " |
| | Whether to Filter with Numeric Count or Ratio | No | Minimum Counts or Ratio of Numeric Chars: if the number of digits or the ratio of digits to total characters is less than this value, the text sample is filtered out. Maximum Counts or Ratio of Numeric Chars: if the number of digits or the ratio of digits to total characters is greater than this value, the text sample is filtered out. For either parameter, specify a value greater than 1 to filter by the number of digits, or a value between 0.0 and 1.0 to filter by the ratio of digits to total characters. | No default value |
| | Whether to Filter with Alpha Count or Ratio | No | Minimum Counts or Ratio of Alpha Chars: if the number of letters or the ratio of letters to total characters is less than this value, the text sample is filtered out. Maximum Counts or Ratio of Alpha Chars: if the number of letters or the ratio of letters to total characters is greater than this value, the text sample is filtered out. For either parameter, specify a value greater than 1 to filter by the number of letters, or a value between 0.0 and 1.0 to filter by the ratio of letters to total characters. | No default value |
| | Whether to Filter with AlphaNumeric Count or Ratio | No | Minimum Counts or Ratio of AlphaNumeric Chars: if the number of alphanumeric characters or the ratio of alphanumeric characters to total characters is less than this value, the text sample is filtered out. Maximum Counts or Ratio of AlphaNumeric Chars: if the number of alphanumeric characters or the ratio of alphanumeric characters to total characters is greater than this value, the text sample is filtered out. For either parameter, specify a value greater than 1 to filter by the number of alphanumeric characters, or a value between 0.0 and 1.0 to filter by the ratio of alphanumeric characters to total characters. | No default value |
| | Whether to Filter with the Ratio of the Number of alpha chars to the Number of Text Tokens | No | Minimum Ratio of Alpha Chars to Text Tokens: if the ratio of letters to total tokens is less than this value, the text sample is filtered out. Maximum Ratio of Alpha Chars to Text Tokens: if the ratio of letters to total tokens is greater than this value, the text sample is filtered out. The component uses the pythia-6.9b-deduped model to split a text sample into tokens and divides the number of letters by the total number of tokens to calculate the ratio. | No default value |
| | Whether to Filter with Separator Count | No | Minimum Counts of Separators: if the number of delimiters in a text sample is less than this value, the text sample is filtered out. Maximum Counts of Separators: if the number of delimiters in a text sample is greater than this value, the text sample is filtered out. | No default value |
| | Output table lifecycle | No | The lifecycle of the temporary tables that the component generates. The value must be a positive integer. Unit: days. After the lifecycle elapses, the temporary tables are recycled. | 28 |
| Tuning | Number of CPUs per instance of map task | No | The number of CPUs for each instance of a map task. Valid values: [50,800]. | 100 |
| | The memory size per instance of map task | No | The memory size of each instance of a map task. Unit: MB. Valid values: [256,12288]. | 1024 |
| | The maximum size of input data for a map | No | The maximum amount of data that each instance of a map task can process. You can use this parameter to manage the input of a map. Unit: MB. Valid values: [1,Integer.MAX_VALUE]. | 256 |
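
The minimum and maximum threshold rules described above can be sketched as follows. This is a minimal sketch under stated assumptions rather than the component's actual logic: pick_metric and keep_sample are hypothetical names, and a threshold in the range 0.0 to 1.0 is compared against the ratio while any larger value is compared against the count. This page does not say how a value of exactly 1 is interpreted; the sketch treats it as a ratio.

```python
from typing import Optional


def pick_metric(count: int, ratio: float, threshold: float) -> float:
    """Choose the metric that a threshold is compared against.

    Assumption: thresholds in [0.0, 1.0] refer to the ratio, and
    larger thresholds refer to the count.
    """
    return ratio if 0.0 <= threshold <= 1.0 else float(count)


def keep_sample(
    count: int,
    ratio: float,
    minimum: Optional[float] = None,
    maximum: Optional[float] = None,
) -> bool:
    """Return True if the sample passes both optional thresholds.

    A sample is filtered out when its metric is less than the minimum
    or greater than the maximum. An unset threshold is skipped.
    """
    if minimum is not None and pick_metric(count, ratio, minimum) < minimum:
        return False
    if maximum is not None and pick_metric(count, ratio, maximum) > maximum:
        return False
    return True


# Usage: a sample with 3 digits out of 6 characters (ratio 0.5).
kept_by_ratio = keep_sample(3, 0.5, minimum=0.2, maximum=0.8)  # ratio thresholds
kept_by_count = keep_sample(3, 0.5, minimum=5)  # count threshold
```

Here kept_by_ratio is True because 0.5 lies between 0.2 and 0.8, and kept_by_count is False because the sample has fewer than 5 digits. Mixing a ratio minimum with a count maximum also works in this sketch, because each threshold selects its own metric.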

References

For more information about Machine Learning Designer, see Overview of Machine Learning Designer.