MaxCompute: Dynamic data masking

Last Updated: Feb 05, 2024

If a MaxCompute project contains sensitive data that should be visible only to specific users, you can enable the dynamic data masking feature of MaxCompute. Sensitive data is then hidden or replaced in real time when unauthorized users access or view it, which helps prevent data leaks. This topic describes how to enable the dynamic data masking feature of MaxCompute and provides usage examples for reference.

Background information

The dynamic data masking feature of MaxCompute depends on Data Security Guard of DataWorks. You must activate Data Security Guard of DataWorks before you can enable the dynamic data masking feature for a MaxCompute project.

After you enable the dynamic data masking feature for a MaxCompute project, you can configure data masking rules for the project based on the data identification rules that are configured in DataWorks. The masking rules define the types of data that you want to mask. When you query sensitive data from the MaxCompute client or Logview, the returned data is masked based on the configured masking rules. The dynamic data masking feature can effectively protect sensitive data, such as mobile phone numbers, ID card numbers, bank card numbers, license plate numbers, and IP addresses. After the dynamic data masking feature is enabled, only sensitive data in the query results is masked and the data that is stored at the underlying layer is not affected.
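
To make the idea of query-time masking concrete, the following Python sketch masks a mobile phone number in a returned row while leaving the stored value untouched. The function name mask_middle, the keep-first-3 and keep-last-4 pattern, and the sample row are illustrative assumptions. The actual masking method is determined by the rules that you configure in Data Security Guard, not by this code.

```python
def mask_middle(value: str, keep_head: int = 3, keep_tail: int = 4, fill: str = "*") -> str:
    """Replace the middle of a string with a fill character (illustrative only)."""
    if len(value) <= keep_head + keep_tail:
        # Too short to keep both ends: mask the whole value.
        return fill * len(value)
    hidden = len(value) - keep_head - keep_tail
    return value[:keep_head] + fill * hidden + value[len(value) - keep_tail:]


# Hypothetical stored row. Masking is applied only to the copy that is returned.
stored_row = {"user_id": "u001", "phone": "13800001234"}
returned_row = {**stored_row, "phone": mask_middle(stored_row["phone"])}

print(returned_row["phone"])  # 138****1234
print(stored_row["phone"])    # 13800001234 (stored data is unchanged)
```

The point of the sketch is the separation between the stored value and the value that a query returns, which mirrors the statement that data at the underlying layer is not affected.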

We recommend that you use the data identification rules that are preset in DataWorks. For more information about how to configure custom data identification rules, see Configure sensitive data identification rules.

Limits

  • You can use the dynamic data masking feature only if DataWorks Professional Edition or a more advanced edition is used. If you use DataWorks Basic Edition, upgrade DataWorks to an appropriate edition based on your business requirements. For more information about differences among DataWorks editions, see Differences among DataWorks editions.

  • The underlying data masking service of MaxCompute can be used for MaxCompute projects that reside in the following regions: China (Beijing), China (Shanghai), China (Hangzhou), China (Chengdu), China (Shenzhen), China North 2 Ali Gov 1, China East 2 Finance, China (Hong Kong), Singapore, Germany (Frankfurt), Malaysia (Kuala Lumpur), and US (Silicon Valley).

  • The configuration of the underlying data masking service of MaxCompute takes effect at the session level. In each session, you must run the SET commands that call the masking service before you query data. Otherwise, the masking configuration does not take effect.

  • The underlying data masking service of MaxCompute cannot be used to mask primary keys in MaxCompute tables.

  • The underlying data masking service of MaxCompute can be used only to mask fields of the STRING type.

  • The data masking feature can be used only if data already exists in the MaxCompute project and the data was created at least 24 hours ago.
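
The limits above can be summarized as a pre-check. The following Python sketch is only an illustration: the function masking_blockers and its parameters are hypothetical, the 24-hour rule is read as "data created at least 24 hours ago", and the DataWorks edition requirement is not modeled.

```python
from typing import List

# Regions that this topic lists as supported by the underlying masking service.
SUPPORTED_REGIONS = {
    "China (Beijing)", "China (Shanghai)", "China (Hangzhou)", "China (Chengdu)",
    "China (Shenzhen)", "China North 2 Ali Gov 1", "China East 2 Finance",
    "China (Hong Kong)", "Singapore", "Germany (Frankfurt)",
    "Malaysia (Kuala Lumpur)", "US (Silicon Valley)",
}


def masking_blockers(region: str, column_type: str, is_primary_key: bool,
                     data_age_hours: float) -> List[str]:
    """Return the reasons, if any, why a column cannot be masked."""
    reasons = []
    if region not in SUPPORTED_REGIONS:
        reasons.append(f"region not supported: {region}")
    if column_type.upper() != "STRING":
        reasons.append(f"only STRING fields can be masked, got {column_type}")
    if is_primary_key:
        reasons.append("primary keys cannot be masked")
    if data_age_hours < 24:
        reasons.append("data must have been created at least 24 hours ago")
    return reasons


print(masking_blockers("China (Hangzhou)", "STRING", False, 48))  # []
print(masking_blockers("China (Hangzhou)", "BIGINT", True, 2))
```

An empty list means that none of the modeled limits applies to the column.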

Preparations

  1. Prepare a MaxCompute project and data for masking. For more information, see Create a MaxCompute project and Import data to tables.

  2. Go to the Data Security Guard page and activate Data Security Guard. For more information, see the "Go to the Data Security Guard page" section in Overview.

    On the Terms of Service page, read the terms, select I have read and agree to all the preceding terms, and then click Activate.

  3. Apply for a whitelist.

    Use your Alibaba Cloud account to submit a ticket to apply for access to external networks from your MaxCompute project. You can call the data masking service only after the application is approved.

    If no access control is imposed on the destination IP address or endpoint, you can use your MaxCompute project to access the destination IP address or endpoint after the application is approved. The application processing period does not exceed three business days.

    The following sample code shows the format of the application content.

    Project name (name of the project for which you want to enable data masking): data_shield_hz
    Log address:
    Description: Enable an endpoint whitelist for the project to ensure that a specific user-defined function (UDF) can access the endpoints in the whitelist. 
    Region: China (Hangzhou)
    Destination endpoints: dsg-cn-hangzhou.data.aliyun.com and dsg-oss-dic-ori-hz.oss-cn-hangzhou.aliyuncs.com.
    Port numbers: 80 and 443

    The endpoints vary with regions. The following content lists the endpoints that correspond to different regions.

    China (Shanghai): dsg-cn-shanghai.data.aliyun.com, dsg-oss-dic-ori.oss-cn-shanghai.aliyuncs.com
    China (Hangzhou): dsg-cn-hangzhou.data.aliyun.com, dsg-oss-dic-ori-hz.oss-cn-hangzhou.aliyuncs.com
    China (Beijing): dsg-cn-beijing.data.aliyun.com, dsg-oss-dic-ori.oss-cn-beijing.aliyuncs.com   
    China (Chengdu): dsg-cn-chengdu.data.aliyun.com, dsg-oss-dic-ori-cd.oss-cn-chengdu.aliyuncs.com
    China (Shenzhen): dsg-cn-shenzhen.data.aliyun.com, dsg-oss-dic-ori-sz.oss-cn-shenzhen.aliyuncs.com
    China North 2 Ali Gov 1: dsg-cn-north-2-gov-1.data.aliyun.com, dsg-oss-dic-ori-north-2-gov-1.oss-cn-north-2-gov-1-internal.aliyuncs.com
    China East 2 Finance: dsg-cn-shanghai-finance-1.data.aliyun.com, dsg-oss-dic-ori-sh-fin-1.oss-cn-shanghai.aliyuncs.com
    China (Hong Kong): dsg-cn-hongkong.data.aliyun.com, dsg-oss-hongkong.oss-cn-hongkong.aliyuncs.com
    Singapore: dsg-ap-southeast-1.data.aliyun.com, dsg-oss-ap-southeast-1.oss-ap-southeast-1.aliyuncs.com
    US (Silicon Valley): dsg-us-west-1.data.aliyun.com, dsg-oss-us-west-1.oss-us-west-1.aliyuncs.com
    Malaysia (Kuala Lumpur): dsg-ap-southeast-3.data.aliyun.com, dsg-oss-ap-malaysia.oss-ap-southeast-3.aliyuncs.com
    Germany (Frankfurt): dsg-eu-central-1.data.aliyun.com, dsg-oss-eu-central-1.oss-eu-central-1-internal.aliyuncs.com
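
Because each region has two endpoints and the application covers ports 80 and 443, the host:port pairs can be derived mechanically. The following Python sketch copies the endpoints from the preceding list and builds the comma-separated value in the form that the odps.internet.access.list setting in this topic uses. The names REGION_ENDPOINTS and access_list are hypothetical helpers, not part of any MaxCompute tool.

```python
# Endpoints per region, copied from the preceding list.
REGION_ENDPOINTS = {
    "China (Shanghai)": ("dsg-cn-shanghai.data.aliyun.com", "dsg-oss-dic-ori.oss-cn-shanghai.aliyuncs.com"),
    "China (Hangzhou)": ("dsg-cn-hangzhou.data.aliyun.com", "dsg-oss-dic-ori-hz.oss-cn-hangzhou.aliyuncs.com"),
    "China (Beijing)": ("dsg-cn-beijing.data.aliyun.com", "dsg-oss-dic-ori.oss-cn-beijing.aliyuncs.com"),
    "China (Chengdu)": ("dsg-cn-chengdu.data.aliyun.com", "dsg-oss-dic-ori-cd.oss-cn-chengdu.aliyuncs.com"),
    "China (Shenzhen)": ("dsg-cn-shenzhen.data.aliyun.com", "dsg-oss-dic-ori-sz.oss-cn-shenzhen.aliyuncs.com"),
    "China North 2 Ali Gov 1": ("dsg-cn-north-2-gov-1.data.aliyun.com", "dsg-oss-dic-ori-north-2-gov-1.oss-cn-north-2-gov-1-internal.aliyuncs.com"),
    "China East 2 Finance": ("dsg-cn-shanghai-finance-1.data.aliyun.com", "dsg-oss-dic-ori-sh-fin-1.oss-cn-shanghai.aliyuncs.com"),
    "China (Hong Kong)": ("dsg-cn-hongkong.data.aliyun.com", "dsg-oss-hongkong.oss-cn-hongkong.aliyuncs.com"),
    "Singapore": ("dsg-ap-southeast-1.data.aliyun.com", "dsg-oss-ap-southeast-1.oss-ap-southeast-1.aliyuncs.com"),
    "US (Silicon Valley)": ("dsg-us-west-1.data.aliyun.com", "dsg-oss-us-west-1.oss-us-west-1.aliyuncs.com"),
    "Malaysia (Kuala Lumpur)": ("dsg-ap-southeast-3.data.aliyun.com", "dsg-oss-ap-malaysia.oss-ap-southeast-3.aliyuncs.com"),
    "Germany (Frankfurt)": ("dsg-eu-central-1.data.aliyun.com", "dsg-oss-eu-central-1.oss-eu-central-1-internal.aliyuncs.com"),
}
PORTS = (80, 443)


def access_list(region: str) -> str:
    """Build the comma-separated host:port list for one region."""
    endpoints = REGION_ENDPOINTS[region]
    return ",".join(f"{host}:{port}" for host in endpoints for port in PORTS)


print(access_list("China (Hangzhou)"))
```

Generating the value from one table avoids copy-and-paste mistakes, such as listing the same host and port twice.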

Enable the data masking feature

  1. Select a data masking scenario.

    1. Log on to the DataWorks console and go to the Data Security Guard page. For more information, see the "Go to the Data Security Guard page" section in Overview.

    2. In the left-side navigation pane, choose Rule Change > Data Masking.

    3. On the Data Masking page, click Layer masking of the MaxCompute engine in the Masking Scene section.

      Note

      To show the data masking effect in the DataWorks console, you must enable masking of displayed data in DataStudio and Data Map in the DataWorks console.

      For more information about how to create a data masking scenario, see Create a data masking scenario.

  2. Create a data masking rule.

  3. Optional: If the data that is specified by the masking rule does not need to be masked for specific users, configure a masking rule whitelist.

    1. On the Data Masking page, click the Whitelist tab.

    2. In the upper-right corner of the Whitelist tab, click Add Account.

    3. In the Add Account dialog box, configure the Rule, Account, and Effective From parameters.

      Note

      If a user account in the whitelist queries data outside the time range that is specified in the whitelist, sensitive data in the query results is still masked.
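
The whitelist behavior described in the note can be sketched as a simple decision function. This Python example is a conceptual illustration only: the WHITELIST structure, the account name, and the start and end timestamps are hypothetical, and the real check is performed by Data Security Guard.

```python
from datetime import datetime
from typing import Dict, Tuple

# Hypothetical whitelist: account -> (start, end) of the effective time range.
WHITELIST: Dict[str, Tuple[datetime, datetime]] = {
    "analyst_a": (datetime(2024, 1, 1), datetime(2024, 3, 31, 23, 59, 59)),
}


def is_masked_for(account: str, query_time: datetime) -> bool:
    """Return True if sensitive data should be masked for this account."""
    window = WHITELIST.get(account)
    if window is None:
        # Accounts that are not in the whitelist always see masked data.
        return True
    start, end = window
    # Whitelisted accounts see unmasked data only inside the time range.
    return not (start <= query_time <= end)


print(is_masked_for("analyst_a", datetime(2024, 2, 1)))  # False
print(is_masked_for("analyst_a", datetime(2024, 6, 1)))  # True
print(is_masked_for("analyst_b", datetime(2024, 2, 1)))  # True
```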

View the execution results of SQL statements

Use the DataStudio page in the DataWorks console

  1. Turn off the data masking switch. For more information, see the "Go to the Security Settings and Others tab" section in Configure settings on the Security Settings and Others tab.


  2. Execute an SQL statement for data queries.

    Before you execute an SQL statement, run SET commands to call the underlying masking service in the current session. The following code shows the SET commands that are used to call the underlying masking service in different regions.

    Note

    The underlying data masking service of MaxCompute can be used only at the session level.

    China (Shanghai)
    set odps.output.field.formatter={"name":"aegis:<SchemaName>:masking_v2","param":["alias","index"]};
    set odps.isolation.session.enable=true;
    set odps.internet.access.list=dsg-cn-shanghai.data.aliyun.com:80,dsg-cn-shanghai.data.aliyun.com:443,dsg-oss-dic-ori.oss-cn-shanghai.aliyuncs.com:80,dsg-oss-dic-ori.oss-cn-shanghai.aliyuncs.com:443;
    China (Hangzhou)
    set odps.output.field.formatter={"name":"aegis_hz:<SchemaName>:masking_v2","param":["alias","index"]};
    set odps.isolation.session.enable=true;
    set odps.internet.access.list=dsg-cn-hangzhou.data.aliyun.com:80,dsg-cn-hangzhou.data.aliyun.com:443,dsg-oss-dic-ori-hz.oss-cn-hangzhou.aliyuncs.com:80,dsg-oss-dic-ori-hz.oss-cn-hangzhou.aliyuncs.com:443;
    China (Beijing)
    set odps.output.field.formatter={"name":"aegis_bj:<SchemaName>:masking_v2","param":["alias","index"]};
    set odps.isolation.session.enable=true;
    set odps.internet.access.list=dsg-cn-beijing.data.aliyun.com:80,dsg-cn-beijing.data.aliyun.com:443,dsg-oss-dic-ori.oss-cn-beijing.aliyuncs.com:80,dsg-oss-dic-ori.oss-cn-beijing.aliyuncs.com:443;
    China (Chengdu)
    set odps.output.field.formatter={"name":"aegis_cd:<SchemaName>:masking_v2","param":["alias","index"]};
    set odps.isolation.session.enable=true;
    set odps.internet.access.list=dsg-cn-chengdu.data.aliyun.com:80,dsg-cn-chengdu.data.aliyun.com:443,dsg-oss-dic-ori-cd.oss-cn-chengdu.aliyuncs.com:80,dsg-oss-dic-ori-cd.oss-cn-chengdu.aliyuncs.com:443;
    China (Hong Kong)
    set odps.output.field.formatter={"name":"aegis_hk:<SchemaName>:masking_v2","param":["alias","index"]};
    set odps.isolation.session.enable=true;
    set odps.internet.access.list=dsg-cn-hongkong.data.aliyun.com:80,dsg-cn-hongkong.data.aliyun.com:443,dsg-oss-hongkong.oss-cn-hongkong.aliyuncs.com:80,dsg-oss-hongkong.oss-cn-hongkong.aliyuncs.com:443;

    The following list describes the parameters in the preceding commands.

    • odps.output.field.formatter: the MaxCompute masking function that you want to call. The field that you want to mask must be of the STRING type.

      • aegis_hz:<SchemaName>:masking_v2: the function name. The prefix (aegis_hz in this example) varies by region, as shown in the preceding commands.

        SchemaName specifies the name of the schema. If the three-layer schema model is configured for the MaxCompute project, you must specify this parameter. For more information about schemas, see Schema-related operations.

      • ["alias","index"]: the parameters. These are the default parameters.

    • odps.isolation.session.enable: specifies whether to enable calls at the session level. After the session ends, the masking configuration no longer takes effect.

    • odps.internet.access.list: the list of endpoints that are accessed when the specified function is executed. The endpoints are used to query the masking information that is preconfigured in Data Security Guard.

    The following code shows a sample script for querying data from a MaxCompute project whose SchemaName is default in the China (Hangzhou) region after the underlying data masking service is enabled for the project.

    set odps.output.field.formatter={"name":"aegis_hz:default:masking_v2","param":["alias","index"]};
    set odps.isolation.session.enable=true;
    set odps.internet.access.list=dsg-cn-hangzhou.data.aliyun.com:80,dsg-cn-hangzhou.data.aliyun.com:443,dsg-oss-dic-ori-hz.oss-cn-hangzhou.aliyuncs.com:80,dsg-oss-dic-ori-hz.oss-cn-hangzhou.aliyuncs.com:443;
    select * from <table_name>;
  3. View the masking result on the DataStudio page.
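
The SET commands in step 2 follow a fixed pattern for each region, so the session script can be assembled programmatically. The following Python sketch is a hypothetical helper (masking_session_script is not part of MaxCompute or DataWorks) that builds the three SET commands followed by a query. The table name my_table is a placeholder.

```python
import json
from typing import List


def masking_session_script(function_prefix: str, schema_name: str,
                           endpoints: List[str], query: str) -> str:
    """Assemble the session-level SET commands followed by the query."""
    # Compact JSON (no spaces) to match the format used in this topic.
    formatter = json.dumps(
        {"name": f"{function_prefix}:{schema_name}:masking_v2",
         "param": ["alias", "index"]},
        separators=(",", ":"),
    )
    # Each endpoint is listed with ports 80 and 443.
    access = ",".join(f"{host}:{port}" for host in endpoints for port in (80, 443))
    return "\n".join([
        f"set odps.output.field.formatter={formatter};",
        "set odps.isolation.session.enable=true;",
        f"set odps.internet.access.list={access};",
        query,
    ])


script = masking_session_script(
    "aegis_hz",   # function prefix for China (Hangzhou)
    "default",    # SchemaName
    ["dsg-cn-hangzhou.data.aliyun.com",
     "dsg-oss-dic-ori-hz.oss-cn-hangzhou.aliyuncs.com"],
    "select * from my_table;",  # placeholder query
)
print(script)
```

The printed script can be pasted into a query window. Because the settings apply only to the current session, the SET commands must be submitted together with the query.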

Use the MaxCompute client (odpscmd)

  1. Configure the endpoints.

    Before you execute an SQL statement, configure the endpoints that you want to access in the Config file of the MaxCompute client.

    The following code shows the endpoints that correspond to different regions.

    China (Shanghai)
    set odps.internet.access.list=dsg-cn-shanghai.data.aliyun.com:80,dsg-cn-shanghai.data.aliyun.com:443,dsg-oss-dic-ori.oss-cn-shanghai.aliyuncs.com:80,dsg-oss-dic-ori.oss-cn-shanghai.aliyuncs.com:443;
    China (Hangzhou)
    set odps.internet.access.list=dsg-cn-hangzhou.data.aliyun.com:80,dsg-cn-hangzhou.data.aliyun.com:443,dsg-oss-dic-ori-hz.oss-cn-hangzhou.aliyuncs.com:80,dsg-oss-dic-ori-hz.oss-cn-hangzhou.aliyuncs.com:443;
    China (Beijing)
    set odps.internet.access.list=dsg-cn-beijing.data.aliyun.com:80,dsg-cn-beijing.data.aliyun.com:443,dsg-oss-dic-ori.oss-cn-beijing.aliyuncs.com:80,dsg-oss-dic-ori.oss-cn-beijing.aliyuncs.com:443;
    China (Chengdu)
    set odps.internet.access.list=dsg-cn-chengdu.data.aliyun.com:80,dsg-cn-chengdu.data.aliyun.com:443,dsg-oss-dic-ori-cd.oss-cn-chengdu.aliyuncs.com:80,dsg-oss-dic-ori-cd.oss-cn-chengdu.aliyuncs.com:443;
    China (Hong Kong)
    set odps.internet.access.list=dsg-cn-hongkong.data.aliyun.com:80,dsg-cn-hongkong.data.aliyun.com:443,dsg-oss-hongkong.oss-cn-hongkong.aliyuncs.com:80,dsg-oss-hongkong.oss-cn-hongkong.aliyuncs.com:443;

    The preceding commands use the following parameter.

    • odps.internet.access.list: the list of endpoints that are accessed when the specified function is executed. The endpoints are used to query the masking information that is preconfigured in Data Security Guard.

    The following sample shows the content of the Config file for a MaxCompute project whose SchemaName is default in the China (Hangzhou) region.

    project_name=data_shield_hz
    # app access id and key are optional for individual users
    # app_access_id=<app_accessid>
    # app_access_key=<app_accesskey>
    access_id=AccessKey ID
    access_key=AccessKey secret
    # this endpoint is for office environment
    end_point=http://service.odps.aliyun.com/api
    # this url is for odpscmd update
    update_url=http://odps.alibaba-inc.com/official_downloads
    # download sql results by instance tunnel
    use_instance_tunnel=true
    # the max records when download sql results by instance tunnel
    instance_tunnel_max_record=10000
    set odps.internet.access.list=dsg-cn-hangzhou.data.aliyun.com:80,dsg-cn-hangzhou.data.aliyun.com:443,dsg-oss-dic-ori-hz.oss-cn-hangzhou.aliyuncs.com:80,dsg-oss-dic-ori-hz.oss-cn-hangzhou.aliyuncs.com:443;
  2. Execute an SQL statement for data queries.

    Before you execute an SQL statement, run SET commands to call the underlying masking service in the current session. The following code shows the SET commands that are used to call the underlying masking service in different regions.

    Note

    The underlying data masking service of MaxCompute can be used only at the session level.

    China (Shanghai)
    set odps.output.field.formatter={"name":"aegis:<SchemaName>:masking_v2","param":["alias","index"]};
    set odps.isolation.session.enable=true;
    China (Hangzhou)
    set odps.output.field.formatter={"name":"aegis_hz:<SchemaName>:masking_v2","param":["alias","index"]};
    set odps.isolation.session.enable=true;
    China (Beijing)
    set odps.output.field.formatter={"name":"aegis_bj:<SchemaName>:masking_v2","param":["alias","index"]};
    set odps.isolation.session.enable=true;
    China (Chengdu)
    set odps.output.field.formatter={"name":"aegis_cd:<SchemaName>:masking_v2","param":["alias","index"]};
    set odps.isolation.session.enable=true;
    China (Hong Kong)
    set odps.output.field.formatter={"name":"aegis_hk:<SchemaName>:masking_v2","param":["alias","index"]};
    set odps.isolation.session.enable=true;

    The following list describes the parameters in the preceding commands.

    • odps.output.field.formatter: the MaxCompute masking function that you want to call. The field that you want to mask must be of the STRING type.

      • aegis_hz:<SchemaName>:masking_v2: the function name. The prefix (aegis_hz in this example) varies by region, as shown in the preceding commands.

        SchemaName specifies the name of the schema. If the three-layer schema model is configured for the MaxCompute project, you must specify this parameter. For more information about schemas, see Schema-related operations.

      • ["alias","index"]: the parameters. These are the default parameters.

    • odps.isolation.session.enable: specifies whether to enable calls at the session level. After the session ends, the masking configuration no longer takes effect.

    The following code shows a sample script for querying data from a MaxCompute project in the China (Hangzhou) region after the underlying data masking service is enabled for the project.

    set odps.output.field.formatter={"name":"aegis_hz:default:masking_v2","param":["alias","index"]};
    set odps.isolation.session.enable=true;
    select * from <table_name>;
  3. View the masking result.


Disable the underlying data masking service

Execute the following SQL statements to disable the underlying data masking service:

set odps.output.field.formatter=;
select * from <table_name>;

In addition, if a data masking scenario is configured in DataWorks, make sure that the destination MaxCompute project is not selected in the scenario. For more information, see "Configure a data masking scenario" in Create a data masking scenario.