This topic describes how to use the dynamic data masking feature provided by the PolarDB proxy.

Prerequisites

The version of the PolarDB proxy must be 2.4.12 or later. For more information about how to view and upgrade the version of PolarDB proxy, see Upgrade versions.

Data masking solutions

You may need to obtain the latest customer data from databases in the production environment in real time to generate reports, analyze data, and perform development tests. However, to avoid disclosing personal identifiable information, you must mask the data before providing it to third parties. Alibaba Cloud provides two data masking solutions: dynamic data masking and static data masking. PolarDB proxy adopts dynamic data masking.
Table 1. Comparison of data masking solutions
Data masking solution Description Advantage Disadvantage
Dynamic data masking When your application initiates a data query request, PolarDB masks the sensitive data that is queried before returning the data to the application.

To achieve this, you need to specify the database account and name and the table or column that requires data masking before the data is queried.

  • You do not need to make code changes in your business system, saving your costs.
  • Your application can query the real-time data from production databases.
Production databases have lower query performance compared with mirror databases as production databases need to mask sensitive data in real time.
Static data masking The sensitive data of a production database is encrypted or masked when all the data in the database is exported to a mirror database. Your application queries mirror databases rather than production databases. In this case, data masking is not performed on the production databases.
  • You need to develop a set of components used for masking sensitive data in the data import toolkit, which incurs high development costs.
  • Unlike data in production databases, data in mirror databases may not be up-to-date.

How it works

After you set data masking rules on the PolarDB console, the console will write the rules to the proxy. When your application connects to the database by using the account specified in the data masking rules and queries the specified columns, the proxy masks the data that is returned from the database before returning the data to the client. 1
The data masking rules shown in the preceding figure:
  • take effect only when you use the testAcc account to query a database.
  • require that only data in the name and age columns be masked.

When your application uses the testAcc account to connect to a database and queries data in the name, age, and hobby columns in a table, the proxy masks data in the name and age columns before returning the data together with the unmasked data in the hobby column.

The proxy uses different methods to mask different types of data. The following table describes specific masking methods.

Data type Masking rule Example
Integer data, such as TINYINT, SMALLINT, MEDIUMINT, INT, and BIGINT The proxy selects a random value that meets the established rules based on the type of data.
  • Actual value: 12345
  • Masked value (random): 28175
Decimal data, such as DECIMAL, FLOAT, and DOUBLE
  • Actual value: 1.2345
  • Masked value (random): 8.2547
Data of the time and date type, such as DATE, DATETIME, TIMESTAMP, TIME, and YEAR
  • Actual value: 2021-01-01 00:00:00
  • Masked value (random): 4926-12-13 17:23:07
Data other than the preceding three types The proxy replaces the data with asterisks (*).
  • Actual value: Elon Musk
  • Masked value: *********

Considerations

  • The dynamic data masking feature applies only to cluster endpoints, including default cluster endpoints and customized cluster endpoints. When you query data from a primary endpoint, the dynamic data masking feature is not applied. For more information about how to view and apply for a cluster endpoint, see View an endpoint and port number.
  • If the query results contain data that needs to be masked and the size of a single row exceeds 16 MB, the query session is closed.

    For example, you want to query the name and description columns of the Person table in which the name column needs to be masked. However, the size of the data in a row of the description column exceeds 16 MB. In this case, the query session is closed when you execute the SELECT name, description FROM person statement.

  • If the data column you want to mask is used as a function parameter, data masking is not applied.

    For example, if a rule has been created to mask data in the name column, your application can still read the actual value of the name column when you execute the SELECT CONCAT(name, '') FROM person statement.

  • If the data column you want to mask is used in the UNION operator, data masking is not applied.

    For example, if a rule has been created to mask data in the name column, your application can still read the actual value of the name column when you execute the SELECT hobby FROM person UNION SELECT name FROM person statement.

How to enable the data masking feature

For more information, see Manage data masking rules.

Appendix: Impact on cluster performance

The data masking feature affects the performance of clusters in the following three scenarios:

Note This topic uses the read-only queries per second (QPS) of clusters for comparison.
Scenario Impact on performance
Account included in the data masking rule? Query hits the data masking rule?
No No Data masking rules are not applied to queries made by the account. The performance of the clusters is not affected.
Yes No The proxy only analyzes the column definition data in the result set and does not mask the actual data in the query results.

This results in a 6% performance loss, which means that the read-only QPS when the data masking feature is enabled is about 6% lower than that when the feature is disabled.

Yes The proxy analyzes the column definition data in the result set and masks the actual data in the query results.

In this case, the performance loss depends on the size of the result set. The more query results, the greater the loss.

In the event that a single row query result is returned, the performance loss is about 6%.