Platform For AI: LLM-Sensitive Content Mask (MaxCompute)

Last Updated: May 31, 2024

The LLM-Sensitive Content Mask (MaxCompute) component masks sensitive information in the text data that is used to train large language models (LLMs).

Limits

The LLM-Sensitive Content Mask (MaxCompute) component supports only MaxCompute resources.

Algorithm

The LLM-Sensitive Content Mask (MaxCompute) component masks the following sensitive information:

  • Mobile phone numbers: Strings that match the following regular expressions are replaced with [MOBILEPHONE].

    • r'(?<!\d)(1(3[0-9]|4[579]|5[0-3,5-9]|6[6]|7[0135678]|8[0-9]|9[89])\d{8})(?!\d)'

    • r'(?<!\d)(1[\d]{2}-\d{4}-\d{4}\D|\D1\d{10}\D|\D1[\d]{2} \d{4} \d{4})(?!\d)'

    • r'(?<!\d)(1[3-9]\d{9})(?!\d)'

  • Landline phone numbers: Strings that match the following regular expression are replaced with [TELEPHONE].

    • r'(?<!\d)(\(?0\d{2,3}[-\s)]?\d{7,8})(?!\d)'

  • Email addresses: Strings that match the following regular expression are replaced with [EMAIL].

    • r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+.[a-zA-Z0-9-.]+'

  • PRC resident identity card numbers: Strings that match the following regular expressions are replaced with [IDNUM].

    • r'(?<!\d)([1-6]\d{5}[12]\d{3}(0[1-9]|1[12])(0[1-9]|1[0-9]|2[0-9]|3[01])\d{3}(\d|X|x))(?!\d)'

    • r'(?<!\d)([1-9]\d{5}[12]\d{3}(0[1-9]|1[012])(0[1-9]|[12][0-9]|3[01])\d{3}[0-9xX])(?!\d)'
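The replacement rules above can be approximated outside the component with a minimal Python sketch. The patterns and placeholders are copied from the lists above. The function name mask_text, the MASK_RULES structure, and the order in which the rules are applied are assumptions made for illustration and may differ from what the component does internally.

```python
import re

# Patterns and placeholders copied verbatim from the lists above.
# The application order (ID numbers first) is an assumption of this sketch.
MASK_RULES = [
    ("[IDNUM]", [
        r'(?<!\d)([1-6]\d{5}[12]\d{3}(0[1-9]|1[12])(0[1-9]|1[0-9]|2[0-9]|3[01])\d{3}(\d|X|x))(?!\d)',
        r'(?<!\d)([1-9]\d{5}[12]\d{3}(0[1-9]|1[012])(0[1-9]|[12][0-9]|3[01])\d{3}[0-9xX])(?!\d)',
    ]),
    ("[MOBILEPHONE]", [
        r'(?<!\d)(1(3[0-9]|4[579]|5[0-3,5-9]|6[6]|7[0135678]|8[0-9]|9[89])\d{8})(?!\d)',
        r'(?<!\d)(1[\d]{2}-\d{4}-\d{4}\D|\D1\d{10}\D|\D1[\d]{2} \d{4} \d{4})(?!\d)',
        r'(?<!\d)(1[3-9]\d{9})(?!\d)',
    ]),
    ("[TELEPHONE]", [
        r'(?<!\d)(\(?0\d{2,3}[-\s)]?\d{7,8})(?!\d)',
    ]),
    ("[EMAIL]", [
        r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+.[a-zA-Z0-9-.]+',
    ]),
]


def mask_text(text: str) -> str:
    """Replace every match of each pattern with its placeholder (hypothetical helper)."""
    for placeholder, patterns in MASK_RULES:
        for pattern in patterns:
            text = re.sub(pattern, placeholder, text)
    return text


if __name__ == "__main__":
    print(mask_text("Call 13812345678 or mail a.b@example.com"))
    # Call [MOBILEPHONE] or mail [EMAIL]
```

The sample strings are made up. Running the rules over a small sample of your own data is a quick way to check which values a given pattern would catch before you run the component on a full table.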

The following example shows how an email address is masked.

  • Before processing

    image

  • After processing

    image
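As a text-only illustration of the same before-and-after behavior, the following minimal Python sketch applies the email pattern listed in the Algorithm section to two sample rows. The sample rows and the variable names are hypothetical, and the sketch only imitates the replacement rule. It does not reproduce the component itself.

```python
import re

# Email pattern copied verbatim from the Algorithm section.
EMAIL_PATTERN = re.compile(r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+.[a-zA-Z0-9-.]+')

# Hypothetical rows standing in for the values of a selected target column.
rows = [
    "Contact: jane.doe@example.com",
    "No address in this row",
]

masked = [EMAIL_PATTERN.sub("[EMAIL]", row) for row in rows]

if __name__ == "__main__":
    for before, after in zip(rows, masked):
        print(f"{before!r} -> {after!r}")
```

In this sketch the first row becomes "Contact: [EMAIL]" and the second row is unchanged. Because the `.` between the two domain character classes is not escaped in the pattern as printed, it matches any single character, so a string such as "alice@host name" would also be replaced in full.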

Configure the component

You can configure the parameters of the LLM-Sensitive Content Mask (MaxCompute) component in Machine Learning Designer. The following table describes the parameters.

| Tab | Parameter | Required | Description | Default value |
| --- | --- | --- | --- | --- |
| Fields Setting | Select Target Column | Yes | The columns that you want to process. You can select multiple columns. | No default value |
| Fields Setting | Output table lifecycle | No | The lifecycle of the output table. The value must be a positive integer. Unit: days. After the lifecycle elapses, the temporary tables generated by the component are recycled. | 28 |
| Tuning | Number of CPUs per instance of map task | No | The number of CPUs for each instance of a map task. Valid values: 50 to 800. | 100 |
| Tuning | The memory size per instance of map task | No | The memory size of each instance of a map task. Unit: MB. Valid values: 256 to 12288. | 1024 |
| Tuning | The maximum size of input data for a map | No | The maximum amount of data that each instance of a map task can process. Use this parameter to control how much input each map instance receives. Unit: MB. Valid values: 1 to Integer.MAX_VALUE (2147483647). | 256 |
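To get a feel for the tuning parameters, the following minimal Python sketch checks values against the documented ranges and estimates how many map instances a job needs if every instance receives at most the configured maximum input. The dictionary keys and function names are hypothetical, and the estimate is only a lower bound under that assumption. The number of instances that MaxCompute actually schedules may differ.

```python
import math

# Valid ranges taken from the parameter descriptions; the keys are hypothetical names.
VALID_RANGES = {
    "map_cpu": (50, 800),
    "map_memory_mb": (256, 12288),
    "map_max_input_mb": (1, 2**31 - 1),  # upper bound is Java's Integer.MAX_VALUE
}

# Default values taken from the parameter descriptions.
DEFAULTS = {"map_cpu": 100, "map_memory_mb": 1024, "map_max_input_mb": 256}


def validate(params: dict) -> None:
    """Raise ValueError if any value falls outside its documented range."""
    for name, value in params.items():
        low, high = VALID_RANGES[name]
        if not low <= value <= high:
            raise ValueError(f"{name}={value} is outside [{low}, {high}]")


def estimate_map_instances(total_input_mb: float, max_input_mb: int = 256) -> int:
    """Lower bound on map instances if each one reads at most max_input_mb."""
    return max(1, math.ceil(total_input_mb / max_input_mb))


if __name__ == "__main__":
    validate(DEFAULTS)
    # 10 GB of input with the default 256 MB limit needs at least 40 instances.
    print(estimate_map_instances(10 * 1024))
```

Under this assumption, lowering the maximum input size increases the number of map instances, and each instance handles less data.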

References

For more information about Machine Learning Designer, see Overview of Machine Learning Designer.