This topic provides an overview of the dynamic data masking feature provided by the PolarDB proxy.

Prerequisites

The version of the PolarDB proxy must be V2.4.12 or later. For more information about how to view and upgrade the version of the PolarDB proxy, see Upgrade the cluster version.

Data masking solutions

If you want to use third parties to generate reports, analyze data, and perform development and test activities, you may need to obtain the latest customer data from databases in the production environment in real time. To avoid disclosing personal information, data must be masked before it is provided to third parties. Alibaba Cloud provides the following data masking solutions: dynamic data masking and static data masking. The PolarDB proxy uses dynamic data masking.

Table 1. Comparison of data masking solutions

Solution: Dynamic data masking

Description: When your application initiates a data query request, the PolarDB proxy masks the sensitive data that is queried before the PolarDB proxy returns the data to the application. Before your application queries data, you need only to specify the database account and the name of the database, table, or column that requires data masking.

Advantages:
  • You do not need to change code in your business system. This reduces costs.
  • Your application can query the real-time data from production databases.

Limits: Compared with mirror databases, production databases have lower query performance because the PolarDB proxy masks the sensitive real-time data in the production databases.

Solution: Static data masking

Description: The PolarDB proxy exports all data in a production database to a mirror database, and encrypts or masks the sensitive data during the export.

Advantages: Your application queries data from mirror databases instead of production databases. In this case, data masking does not affect the services that require access to production databases.

Limits:
  • You must develop a set of components used for masking the sensitive data in the data import toolkit. This incurs high development costs.
  • Data in mirror databases is not as up-to-date as data in production databases.

How it works

After you configure data masking rules in the PolarDB console, the console writes these rules to the PolarDB proxy. When your application connects to a database by using the account specified in the data masking rules and queries the specified columns, the PolarDB proxy masks the data that is queried from the database and returns the masked data to the client.
The preceding figure shows the following data masking rules:
  • The data masking rules take effect only when you use the testAcc account to query data from a database.
  • The PolarDB proxy masks only the data that is queried in the name and age columns.

If your application uses the testAcc account to connect to a database and queries data in the name, age, and hobby columns of a table, the PolarDB proxy masks data in the name and age columns and returns the masked data together with the unmasked data in the hobby column.
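
The rule matching described above can be illustrated with a small simulation. This is only a sketch of the behavior, not the proxy's implementation: the `MaskingRule` class, the `apply_rule` function, the `<masked>` placeholder, and the sample row are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass

MASKED = "<masked>"  # hypothetical placeholder; the real masked value depends on the data type


@dataclass(frozen=True)
class MaskingRule:
    """Hypothetical rule shape: one account plus the columns to mask."""
    account: str
    columns: frozenset


def apply_rule(rule, account, row):
    """Return a copy of `row` with the rule's columns masked for `account`."""
    if account != rule.account:
        return dict(row)  # the rule does not apply to this account
    return {
        col: MASKED if col in rule.columns else val
        for col, val in row.items()
    }


rule = MaskingRule(account="testAcc", columns=frozenset({"name", "age"}))
row = {"name": "John Smith", "age": 30, "hobby": "chess"}

masked = apply_rule(rule, "testAcc", row)     # name and age masked, hobby untouched
unmasked = apply_rule(rule, "otherAcc", row)  # different account: nothing masked
```

In this sketch, only the combination of the matching account and a listed column triggers masking, which mirrors the testAcc example above.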

The PolarDB proxy uses different methods to mask different types of data. The following table describes data masking methods.

Data type: Integer data types (TINYINT, SMALLINT, MEDIUMINT, INT, and BIGINT)
Data masking method: The PolarDB proxy returns a random value in the format defined in the data type of the raw data.
Example:
  • Raw value: 12345
  • Masked value that is randomly selected: 28175

Data type: Decimal data types (DECIMAL, FLOAT, and DOUBLE)
Data masking method: The PolarDB proxy returns a random value in the format defined in the data type of the raw data.
Example:
  • Raw value: 1.2345
  • Masked value that is randomly selected: 8.2547

Data type: Date and time data types (DATE, TIME, DATETIME, TIMESTAMP, and YEAR)
Data masking method: The PolarDB proxy returns a random value in the format defined in the data type of the raw data.
Example:
  • Raw value: 2021-01-01 00:00:00
  • Masked value that is randomly selected: 4926-12-13 17:23:07

Data type: Other data types
Data masking method: The PolarDB proxy replaces the data with asterisks (*).
Example:
  • Raw value: John Smith
  • Masked value: *********
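
The type-dependent masking described above can be approximated as follows. This is a minimal sketch, not the proxy's algorithm: the function name `mask_by_type` is hypothetical, Python types stand in for SQL column types, and details such as keeping the same number of digits or emitting one asterisk per character are assumptions made for illustration.

```python
import random
from datetime import datetime, timedelta
from decimal import Decimal


def mask_by_type(value, rng=random):
    """Mask `value` based on its Python type (a stand-in for the SQL type)."""
    if isinstance(value, int):
        # Integer types: random integer with the same number of digits (assumption).
        digits = len(str(abs(value)))
        low = 10 ** (digits - 1) if digits > 1 else 0
        return rng.randint(low, 10 ** digits - 1)
    if isinstance(value, (float, Decimal)):
        # Decimal types: random value with the same number of fractional digits (assumption).
        text = str(value)
        scale = len(text.split(".")[1]) if "." in text else 0
        return round(rng.uniform(0, 10), scale)
    if isinstance(value, datetime):
        # Date and time types: random timestamp that is still a valid datetime.
        return datetime(1970, 1, 1) + timedelta(seconds=rng.randrange(0, 10 ** 11))
    # Other types: one asterisk per character (length preservation is an assumption).
    return "*" * len(str(value))


rng = random.Random(0)  # seeded only to make the sketch repeatable
masked_int = mask_by_type(12345, rng)
masked_decimal = mask_by_type(1.2345, rng)
masked_time = mask_by_type(datetime(2021, 1, 1), rng)
masked_text = mask_by_type("John Smith", rng)
```

The point of the sketch is that numeric and temporal values remain valid values of their type after masking, whereas all other values lose their content entirely.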

Additional considerations

  • The dynamic data masking feature applies only to cluster endpoints. Cluster endpoints consist of the default cluster endpoint and custom cluster endpoints. If you use the primary endpoint to connect to a database and query data from the database, the dynamic data masking feature does not take effect. For more information about how to view a cluster endpoint, see View an endpoint.
  • If query results contain data that must be masked and the size of a single row exceeds 16 MB, the query session is closed.

    For example, you want to query data in the name and description columns of the person table. In this table, the sensitive data in the name column must be masked. The size of the data in a row of the description column exceeds 16 MB. In this case, when you execute the SELECT name, description FROM person statement, the query session is closed.

  • If a column that contains sensitive data to be masked is passed as an argument to a function, data masking does not take effect.

    For example, a data masking rule is created to mask the sensitive data in the name column. When you execute the SELECT CONCAT(name, '') FROM person statement, your application can still read the raw values of the name column.

  • If a column in which you want to mask the sensitive data is used together with the UNION operator, data masking may not take effect.

    For example, a data masking rule is created to mask the sensitive data in the name column. When you execute the SELECT hobby FROM person UNION SELECT name FROM person statement, your application can still read the raw values of the name column.
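
One way to picture the function and UNION limitations is to assume that masking decisions are made from the column definitions in the result set, as the appendix of this topic suggests. The sketch below models that assumption with a hypothetical `columns_to_mask` helper that compares result-set column labels against the masked column names. It is an illustration of why these queries can bypass a rule, not a description of the proxy's actual matching logic.

```python
MASKED_COLUMNS = {"name"}  # columns named in a hypothetical masking rule


def columns_to_mask(result_columns):
    """Return the result-set column labels that match a masked column name."""
    return [col for col in result_columns if col in MASKED_COLUMNS]


# Result-set column labels for each statement. In MySQL, an expression column is
# labeled with the expression text, and a UNION takes its column names from the
# first SELECT.
plain_select = ["name", "hobby"]      # SELECT name, hobby FROM person
function_call = ["CONCAT(name, '')"]  # SELECT CONCAT(name, '') FROM person
union_query = ["hobby"]               # SELECT hobby FROM person UNION SELECT name FROM person

plain_hits = columns_to_mask(plain_select)      # the name column is recognized
function_hits = columns_to_mask(function_call)  # no label matches "name"
union_hits = columns_to_mask(union_query)       # name values travel under the "hobby" label
```

Under this assumption, the raw name values in the last two statements are returned under a label that no rule refers to, so they are not masked.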

Enable the dynamic data masking feature

For more information, see Manage data masking rules.

Appendix: Impacts on cluster performance

The dynamic data masking feature affects the performance of clusters in the following scenarios.

Note: In this example, the read-only queries per second (QPS) of clusters are used to show the difference in performance.

Scenario 1
  • Your account is included in the data masking rule: No
  • Your query hits the data masking rule: No
  • Impact on performance: Data masking does not take effect on queries made by your account, so the performance of your cluster is not affected.

Scenario 2
  • Your account is included in the data masking rule: Yes
  • Your query hits the data masking rule: No
  • Impact on performance: The PolarDB proxy analyzes only the column definition data in the result set and does not mask the raw data in the query results. This results in a performance overhead of approximately 6%. After the dynamic data masking feature is enabled, the read-only QPS decreases by approximately 6%.

Scenario 3
  • Your account is included in the data masking rule: Yes
  • Your query hits the data masking rule: Yes
  • Impact on performance: The PolarDB proxy analyzes the column definition data in the result set and masks the raw data in the query results. In this case, the performance overhead depends on the size of the result set: the more rows the query returns, the greater the overhead. If the query returns a single row, the performance overhead is approximately 6%.
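
As a quick worked example of the 6% figure, the following sketch estimates the read-only QPS after the overhead is applied. The function name `qps_with_overhead` and the baseline of 10,000 QPS are hypothetical. Only the approximate 6% overhead comes from this topic, and actual results vary with the workload.

```python
def qps_with_overhead(baseline_qps, overhead_percent=6):
    """Estimate read-only QPS after applying a percentage overhead."""
    if not 0 <= overhead_percent < 100:
        raise ValueError("overhead_percent must be in [0, 100)")
    return baseline_qps * (100 - overhead_percent) / 100


baseline = 10_000  # hypothetical read-only QPS before the feature is enabled
estimated = qps_with_overhead(baseline)  # 10,000 * 94 / 100 = 9,400
```

For queries that hit a masking rule and return many rows, the overhead is larger than 6%, so this estimate is best read as an upper bound on the remaining QPS.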