LLM-Sensitive Content Mask (MaxCompute) - Platform For AI

The LLM-Sensitive Content Mask (MaxCompute) component masks sensitive information in the text data that is used to train large language models (LLMs).

Limits

The LLM-Sensitive Content Mask (MaxCompute) component supports only MaxCompute resources.

Algorithm

The LLM-Sensitive Content Mask (MaxCompute) component masks the following sensitive information:

Mobile phone numbers: Strings that match the following regular expressions are replaced with [MOBILEPHONE].
- r'(?<!\d)(1(3[0-9]|4[579]|5[0-3,5-9]|6[6]|7[0135678]|8[0-9]|9[89])\d{8})(?!\d)'
- r'(?<!\d)(1[\d]{2}-\d{4}-\d{4}\D|\D1\d{10}\D|\D1[\d]{2} \d{4} \d{4})(?!\d)'
- r'(?<!\d)(1[3-9]\d{9})(?!\d)'
Landline phone numbers: Strings that match the following regular expression are replaced with [TELEPHONE].
- r'(?<!\d)(\(?0\d{2,3}[-\s)]?\d{7,8})(?!\d)'
Email addresses: Strings that match the following regular expression are replaced with [EMAIL].
- r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+.[a-zA-Z0-9-.]+'
Resident identity card numbers of the People's Republic of China (PRC): Strings that match the following regular expressions are replaced with [IDNUM].
- r'(?<!\d)([1-6]\d{5}[12]\d{3}(0[1-9]|1[12])(0[1-9]|1[0-9]|2[0-9]|3[01])\d{3}(\d|X|x))(?!\d)'
- r'(?<!\d)([1-9]\d{5}[12]\d{3}(0[1-9]|1[012])(0[1-9]|[12][0-9]|3[01])\d{3}[0-9xX])(?!\d)'
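
The rules above can be sketched in Python as follows. This is a minimal illustration, not the component's implementation: the patterns are copied verbatim from the list, but the function name mask_sensitive, the rule table, and the order in which the categories are applied are assumptions.

```python
import re

# Patterns copied verbatim from the list above. The order in which the
# categories are applied is an assumption made for this sketch.
MASK_RULES = [
    ("[MOBILEPHONE]", [
        r'(?<!\d)(1(3[0-9]|4[579]|5[0-3,5-9]|6[6]|7[0135678]|8[0-9]|9[89])\d{8})(?!\d)',
        r'(?<!\d)(1[\d]{2}-\d{4}-\d{4}\D|\D1\d{10}\D|\D1[\d]{2} \d{4} \d{4})(?!\d)',
        r'(?<!\d)(1[3-9]\d{9})(?!\d)',
    ]),
    ("[TELEPHONE]", [
        r'(?<!\d)(\(?0\d{2,3}[-\s)]?\d{7,8})(?!\d)',
    ]),
    ("[EMAIL]", [
        r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+.[a-zA-Z0-9-.]+',
    ]),
    ("[IDNUM]", [
        r'(?<!\d)([1-6]\d{5}[12]\d{3}(0[1-9]|1[12])(0[1-9]|1[0-9]|2[0-9]|3[01])\d{3}(\d|X|x))(?!\d)',
        r'(?<!\d)([1-9]\d{5}[12]\d{3}(0[1-9]|1[012])(0[1-9]|[12][0-9]|3[01])\d{3}[0-9xX])(?!\d)',
    ]),
]

# Compile once so repeated calls do not re-parse the patterns.
COMPILED_RULES = [
    (token, [re.compile(p) for p in patterns])
    for token, patterns in MASK_RULES
]


def mask_sensitive(text: str) -> str:
    """Replace every match of each pattern with its placeholder token."""
    for token, patterns in COMPILED_RULES:
        for pattern in patterns:
            text = pattern.sub(token, text)
    return text


if __name__ == "__main__":
    print(mask_sensitive("Contact alice@example.com today"))
    print(mask_sensitive("Call 13812345678 now"))
```

Note that some alternatives in the second mobile phone pattern include the neighboring non-digit character in the match, so a replacement based on that pattern also consumes that character.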

The following example shows how an email address is masked.

[Figure: sample data before processing and after processing]

Configure the component

You can configure the parameters of the LLM-Sensitive Content Mask (MaxCompute) component in Machine Learning Designer. The following table describes the parameters.

Tab	Parameter	Required	Description	Default value
Fields Setting	Select Target Column	Yes	The columns that you want to process. You can select multiple columns.	No default value
Fields Setting	Output table lifecycle	No	The lifecycle of the output table. The value must be a positive integer. Unit: days. After the lifecycle elapses, the temporary tables generated by the component are recycled.	28
Tuning	Number of CPUs per instance of map task	No	The number of CPUs for each instance of a map task. Valid values: 50 to 800.	100
Tuning	The memory size per instance of map task	No	The memory size of each instance of a map task. Unit: MB. Valid values: 256 to 12288.	1024
Tuning	The maximum size of input data for a map	No	The maximum amount of data that each instance of a map task can process. Use this parameter to control the input size of each map instance. Unit: MB. Valid values: 1 to Integer.MAX_VALUE (2147483647).	256
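
The defaults and valid ranges in the table can be checked before a run with a small helper such as the following. This is only a sketch: the class name MaskComponentSettings and its field names are hypothetical and are not part of Machine Learning Designer. Only the defaults and ranges come from the table.

```python
from dataclasses import dataclass
from typing import List

INT_MAX = 2**31 - 1  # value of Java's Integer.MAX_VALUE


@dataclass
class MaskComponentSettings:
    """Hypothetical container that mirrors the parameter table."""
    target_columns: List[str]      # Select Target Column (required)
    lifecycle_days: int = 28       # Output table lifecycle
    map_cpu: int = 100             # Number of CPUs per instance of map task
    map_memory_mb: int = 1024      # The memory size per instance of map task
    map_max_input_mb: int = 256    # The maximum size of input data for a map

    def validate(self) -> None:
        """Raise ValueError if any value is outside the documented range."""
        if not self.target_columns:
            raise ValueError("at least one target column is required")
        if self.lifecycle_days < 1:
            raise ValueError("lifecycle must be a positive integer (days)")
        if not 50 <= self.map_cpu <= 800:
            raise ValueError("map CPU value must be between 50 and 800")
        if not 256 <= self.map_memory_mb <= 12288:
            raise ValueError("map memory must be between 256 and 12288 MB")
        if not 1 <= self.map_max_input_mb <= INT_MAX:
            raise ValueError("map input size must be between 1 and INT_MAX MB")


if __name__ == "__main__":
    settings = MaskComponentSettings(target_columns=["content"])
    settings.validate()  # the defaults fall inside every documented range
```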

References

For more information about Machine Learning Designer, see Overview of Machine Learning Designer.