Supported data de-identification algorithms - Alibaba Cloud Documentation Center

This topic describes the data de-identification algorithms that are supported by Data Security Center (DSC).


Category	Description	Algorithm	Input	Applicable sensitive data and scenario
Hashing	Raw data cannot be retrieved after it is de-identified by using this type of algorithm. This type of algorithm is applicable to password protection or scenarios in which you must check whether data is sensitive by comparison. You can use common hash algorithms and specify a salt value.	MD5	Salt value	Sensitive data: keys. Scenario: data storage.
		Secure Hash Algorithm 1 (SHA-1)	Salt value
		SHA-256	Salt value
		Hash-based Message Authentication Code (HMAC)	Salt value
Redaction by using asterisks (*) or number signs (#)	Raw data cannot be retrieved after it is de-identified by using this type of algorithm. This type of algorithm is applicable to scenarios in which sensitive data is to be shown on a user interface or shared with others. This type of algorithm redacts specified text in sensitive data with asterisks (*) or number signs (#).	Keeps the first N characters and the last M characters	Values of N and M	Sensitive data: sensitive personal information. Scenarios: data usage, data sharing.
		Keeps characters from the Xth position to the Yth position	Values of X and Y
		Redacts the first N characters and the last M characters	Values of N and M
		Redacts characters from the Xth position to the Yth position	Values of X and Y
		Redacts characters that precede a special character when the special character appears for the first time	At sign (@), ampersand (&), or period (.)
		Redacts characters that follow a special character when the special character appears for the first time	At sign (@), ampersand (&), or period (.)
Substitution (customization supported)	Whether raw data can be retrieved after it is de-identified depends on the algorithm. This type of algorithm can be used to de-identify fields in fixed formats, such as ID card numbers. Mapped substitution replaces the entire value or part of the value of a field with a mapped value by using a mapping table. In this case, raw data can be retrieved after it is de-identified. Random substitution replaces the entire value or part of the value of a field randomly based on a random interval. In this case, raw data cannot be retrieved after it is de-identified. DSC provides multiple built-in mapping tables and allows you to customize substitution algorithms.	Substitutes specific content in ID card numbers with mapped values	Mapping table for substituting the IDs of administrative regions	Sensitive data: sensitive personal information, sensitive information of enterprises, sensitive information of devices. Scenarios: data storage, data sharing.
		Randomly substitutes specific content in ID card numbers	Code table for randomly substituting the IDs of administrative regions
		Randomly substitutes specific content in the IDs of military officer cards	Code table for randomly substituting type codes
		Randomly substitutes specific content in passport numbers	Code table for randomly substituting purpose fields
		Randomly substitutes specific content in permit numbers of Exit-Entry Permits for Travelling to and from Hong Kong and Macao	Code table for randomly substituting purpose fields
		Randomly substitutes specific content in bank card numbers	Code table for randomly substituting Bank Identification Numbers (BINs)
		Randomly substitutes specific content in landline telephone numbers	Code table for randomly substituting the IDs of administrative regions
		Randomly substitutes specific content in mobile numbers	Code table for randomly substituting mobile network codes
		Randomly substitutes specific content in unified social credit codes	Code table for randomly substituting the IDs of registration authorities, code table for randomly substituting type codes, and code table for randomly substituting the IDs of administrative regions
		Substitutes specific content in general tables with mapped values	Mapping table for substituting uppercase letters, mapping table for substituting lowercase letters, mapping table for substituting digits, and mapping table for substituting special characters
		Randomly substitutes specific content in general tables	Code table for randomly substituting uppercase letters, code table for randomly substituting lowercase letters, code table for randomly substituting digits, and code table for randomly substituting special characters
Rounding	Whether raw data can be retrieved after it is de-identified depends on the algorithm. This type of algorithm can be used to analyze and collect statistics on sensitive datasets. DSC provides two types of rounding algorithms. One algorithm rounds numbers and dates, and raw data cannot be retrieved after it is de-identified. The other algorithm bit-shifts text, and raw data can be retrieved after it is de-identified.	Rounds numbers	Numbers are rounded to the Nth digit before the decimal point. Valid values of N: 1 to 19.	Sensitive data: general sensitive information. Scenarios: data storage, data usage.
		Rounds dates	Dates are rounded to the year, month, day, hour, or minute level.
		Shifts characters	Number of places by which specific bits are moved and shift direction (left or right)
Encryption	Raw data can be retrieved after it is de-identified by using this type of algorithm. This type of algorithm can be used to encrypt sensitive fields that need to be retrieved after encryption. Common symmetric encryption algorithms are supported.	Data Encryption Standard (DES) algorithm	Encryption key	Sensitive data: sensitive personal information, sensitive information of enterprises. Scenario: data storage.
		Triple Data Encryption Standard (3DES) algorithm	Encryption key
		Advanced Encryption Standard (AES) algorithm	Encryption key
Shuffling	Raw data cannot be retrieved after it is de-identified by using this type of algorithm. This type of algorithm can be used to de-identify structured data columns. This type of algorithm extracts values of a field in a specified range from the source table and rearranges the values in a specific column. Alternatively, this type of algorithm randomly selects values from a specific column within the value range and rearranges the selected values. This way, the values are mixed up and de-identified.	Randomly shuffles data	Shuffle method: rearrangement or random selection	Sensitive data: sensitive information of devices, sensitive location information. Scenario: data storage.
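
The Hashing row in the table can be illustrated with a short sketch that uses Python's standard hashlib and hmac modules. This topic does not state how DSC combines the salt value with the data, so prepending the salt, using the salt as the HMAC key, choosing SHA-256 as the HMAC digest, and the function names salted_hash and hmac_hash are all assumptions made for illustration only.

```python
import hashlib
import hmac

SUPPORTED = ("md5", "sha1", "sha256")


def salted_hash(value: str, salt: str, algorithm: str = "sha256") -> str:
    """Return the hex digest of salt + value (salt placement is an assumption)."""
    if algorithm not in SUPPORTED:
        raise ValueError(f"unsupported algorithm: {algorithm}")
    digest = hashlib.new(algorithm)
    digest.update(salt.encode("utf-8"))
    digest.update(value.encode("utf-8"))
    return digest.hexdigest()


def hmac_hash(value: str, salt: str) -> str:
    """Return an HMAC-SHA-256 hex digest, using the salt as the key (assumption)."""
    return hmac.new(salt.encode("utf-8"), value.encode("utf-8"), hashlib.sha256).hexdigest()
```

Because the same value and salt always produce the same digest, two de-identified values can still be compared for equality, which matches the comparison scenario described in the table. The raw value cannot be recovered from the digest.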
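
The redaction algorithms in the table are simple string operations. The following sketch shows one possible reading of three of them in Python. The function names, the choice of 1-based inclusive positions for X and Y, and the behavior for values shorter than N + M characters are assumptions and are not taken from DSC.

```python
def keep_first_last(value: str, n: int, m: int, mask: str = "*") -> str:
    """Keep the first n and the last m characters and mask everything in between."""
    tail_start = max(n, len(value) - m)
    return value[:n] + mask * (tail_start - n) + value[tail_start:]


def redact_range(value: str, x: int, y: int, mask: str = "*") -> str:
    """Mask characters from position x to position y (1-based, inclusive; an assumption)."""
    start = max(x - 1, 0)
    end = min(y, len(value))
    if start >= end:
        return value
    return value[:start] + mask * (end - start) + value[end:]


def redact_before_first(value: str, special: str, mask: str = "*") -> str:
    """Mask every character that precedes the first occurrence of the special character."""
    index = value.find(special)
    if index == -1:
        return value  # special character not present: leave the value unchanged
    return mask * index + value[index:]


print(keep_first_last("13812345678", 3, 4))           # 138****5678
print(redact_range("13812345678", 4, 7, mask="#"))    # 138####5678
print(redact_before_first("alice@example.com", "@"))  # *****@example.com
```

The sample phone number and email address are dummy values. The remaining redaction variants in the table (keeping a range, redacting the first N and last M characters, redacting after a special character) follow the same slicing pattern.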
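
The difference between mapped and random substitution, as described in the Substitution row, can be sketched as follows. The mapping table DIGIT_MAP, the sample region codes, the 18-digit sample value, and all function names are hypothetical. DSC's built-in tables are not reproduced here.

```python
import random

# Hypothetical mapping table: a one-to-one permutation of the digits 0-9.
DIGIT_MAP = dict(zip("0123456789", "5820491736"))


def substitute_mapped(value: str, start: int, end: int, mapping: dict) -> str:
    """Replace the characters in value[start:end] by looking each one up in a mapping table."""
    segment = "".join(mapping.get(ch, ch) for ch in value[start:end])
    return value[:start] + segment + value[end:]


def restore_mapped(value: str, start: int, end: int, mapping: dict) -> str:
    """Invert substitute_mapped. This works only because the mapping is one-to-one."""
    inverse = {mapped: raw for raw, mapped in mapping.items()}
    return substitute_mapped(value, start, end, inverse)


def substitute_random(value: str, start: int, end: int, code_table: list, rng=None) -> str:
    """Replace value[start:end] with a code drawn at random from a code table."""
    rng = rng or random.Random()
    return value[:start] + rng.choice(code_table) + value[end:]


raw = "110105199001011234"  # dummy value, not a real ID card number
masked = substitute_mapped(raw, 0, 6, DIGIT_MAP)
print(masked)                                        # 885859199001011234
print(restore_mapped(masked, 0, 6, DIGIT_MAP) == raw)  # True
```

In this sketch the mapped variant is reversible for anyone who holds the mapping table, whereas the random variant discards the original segment, so the raw value cannot be retrieved.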
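
The Rounding row leaves some details open, so the following Python sketch fixes them by assumption. It treats "the Nth digit before the decimal point" as rounding to a multiple of 10^(N-1) with half-up rounding, treats date rounding as truncation to the chosen level, and treats the character shift as a circular rotation of the text. None of these choices, nor the function names, are confirmed by this topic.

```python
from datetime import datetime
from decimal import Decimal, ROUND_HALF_UP

_LEVELS = ["year", "month", "day", "hour", "minute"]
_RESET = {"month": 1, "day": 1, "hour": 0, "minute": 0, "second": 0, "microsecond": 0}


def round_number(value, n: int) -> int:
    """Round to a multiple of 10**(n - 1), half up (interpretation is an assumption)."""
    if not 1 <= n <= 19:
        raise ValueError("n must be between 1 and 19")
    unit = Decimal(10) ** (n - 1)
    scaled = (Decimal(str(value)) / unit).quantize(Decimal(1), rounding=ROUND_HALF_UP)
    return int(scaled * unit)


def round_date(ts: datetime, level: str) -> datetime:
    """Truncate a timestamp to the year, month, day, hour, or minute level."""
    if level not in _LEVELS:
        raise ValueError(f"unsupported level: {level}")
    finer_fields = list(_RESET)[_LEVELS.index(level):]
    return ts.replace(**{field: _RESET[field] for field in finer_fields})


def shift_chars(text: str, places: int, direction: str = "left") -> str:
    """Rotate the characters of text; shifting back in the other direction restores it."""
    if direction not in ("left", "right"):
        raise ValueError("direction must be 'left' or 'right'")
    if not text:
        return text
    k = places % len(text)
    if direction == "right":
        k = -k % len(text)
    return text[k:] + text[:k]


print(round_number(1234.56, 2))                           # 1230
print(round_date(datetime(2024, 5, 17, 13, 45), "month"))  # 2024-05-01 00:00:00
print(shift_chars("abcdef", 2, "left"))                   # cdefab
```

Rounding numbers and dates discards information, which is consistent with the table's statement that raw data cannot be retrieved. The rotation is lossless, so applying the same shift in the opposite direction returns the original text.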
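
The two shuffle methods named in the Shuffling row, rearrangement and random selection, can be sketched for a single column as follows. The function name shuffle_column, the method strings, and the sample location values are illustrative assumptions. How DSC selects the value range from the source table is not modeled.

```python
import random


def shuffle_column(values, method: str = "rearrangement", rng=None) -> list:
    """De-identify a column by mixing up its values.

    rearrangement:    every original value appears exactly once, in a new order.
    random_selection: each row receives a value drawn at random from the column,
                      so some values may repeat and others may disappear.
    """
    rng = rng or random.Random()
    pool = list(values)
    if method == "rearrangement":
        rng.shuffle(pool)
        return pool
    if method == "random_selection":
        return [rng.choice(pool) for _ in pool]
    raise ValueError(f"unsupported method: {method}")


locations = ["Hangzhou", "Beijing", "Shenzhen", "Chengdu"]  # dummy column values
print(shuffle_column(locations, "rearrangement"))
print(shuffle_column(locations, "random_selection"))
```

Rearrangement preserves the column's overall value distribution, which can matter for later statistics. Random selection does not guarantee that. In both cases the link between a row and its original value is lost.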