
DataWorks: Sample practice for performing underlying data masking on MaxCompute projects (old version)

Last Updated: Aug 13, 2024

If a user has the permissions to query specific sensitive data in a MaxCompute project but you do not want the user to view complete sensitive data, you can enable the dynamic data masking feature of MaxCompute. This way, MaxCompute can dynamically mask sensitive data in the query results. This topic describes how to enable the dynamic data masking feature of MaxCompute and provides an example.

Background information

MaxCompute does not mask sensitive data on its own. It relies on the Data Security Guard service of DataWorks to mask sensitive data. Therefore, before you enable the MaxCompute underlying data masking feature for a MaxCompute project, you must activate the Data Security Guard service of DataWorks.

After you enable the MaxCompute underlying data masking feature for a MaxCompute project, you can configure a data masking rule for the project based on the sensitive data identification rules that are configured in DataWorks. The data masking rule defines the data that you want to mask. When you query sensitive data from the MaxCompute client or MaxCompute LogView, the returned data is masked based on the configured masking rule. The MaxCompute underlying data masking feature can effectively protect sensitive data, such as mobile phone numbers, ID card numbers, bank card numbers, license plate numbers, and IP addresses. After you enable the MaxCompute underlying data masking feature, only sensitive data in the query results is masked and the data that is stored at the underlying layer is not affected.

We recommend that you use the sensitive data identification rules that are preset in DataWorks. For information about how to configure custom sensitive data identification rules, see Configure a sensitive data identification rule and run a sensitive data identification task.

Limits

  • You can use the MaxCompute underlying data masking feature only if DataWorks Professional Edition or a more advanced edition is used. If you use DataWorks Basic Edition, upgrade DataWorks to an appropriate edition based on your business requirements. For information about the differences among DataWorks editions, see Differences among DataWorks editions.

  • The MaxCompute underlying data masking feature is available for MaxCompute projects that reside in the following regions: China (Beijing), China (Shanghai), China (Hangzhou), China (Chengdu), China (Shenzhen), China North 2 Ali Gov 1, China East 2 Finance, China (Hong Kong), Singapore, Germany (Frankfurt), Malaysia (Kuala Lumpur), and US (Silicon Valley).

  • The MaxCompute underlying data masking feature takes effect for sessions. When you perform data queries for a session, you must add the SET commands related to masking service calls to make the masking configuration take effect.

  • The MaxCompute underlying data masking feature cannot be used to mask primary keys in MaxCompute tables.

  • The MaxCompute underlying data masking feature can be used only to mask fields of the STRING type.

  • The MaxCompute underlying data masking feature can be used only if data already exists in the MaxCompute project and the data was created at least 24 hours ago.
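The field-level limits above can be expressed as a simple pre-check. The following Python snippet is a minimal sketch, not part of MaxCompute or DataWorks: the Column class and the masking_blockers function are hypothetical names, and the 24-hour check assumes that the limit refers to the time that has elapsed since the data was created.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class Column:
    """Hypothetical column descriptor; not a MaxCompute SDK type."""
    name: str
    data_type: str            # for example "STRING" or "BIGINT"
    is_primary_key: bool
    data_created_at: datetime


def masking_blockers(col: Column, now: datetime) -> List[str]:
    """Return the documented limits that the column violates (empty if none)."""
    blockers = []
    if col.data_type.upper() != "STRING":
        blockers.append("only fields of the STRING type can be masked")
    if col.is_primary_key:
        blockers.append("primary keys cannot be masked")
    # Assumption: the limit means at least 24 hours since the data was created.
    if now - col.data_created_at < timedelta(hours=24):
        blockers.append("data must have existed for 24 hours")
    return blockers
```

An empty list means that none of the field-level limits applies. The edition, region, and session limits are not covered by this sketch.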

Preparations

  1. Prepare a MaxCompute project and data for masking. For more information, see Create a MaxCompute project and Import data to tables.

  2. Go to the Data Security Guard page and activate Data Security Guard. For more information, see the "Go to the Data Security Guard page" section in Overview.

    On the Terms of Service page, read the terms, select I have read and agree to all the preceding terms, and then click Activate.

  3. Apply for a whitelist.

    Use your Alibaba Cloud account to submit a ticket to apply for access to external networks from your MaxCompute project. You can call the MaxCompute underlying data masking feature only after the application is approved.

    If no access control is imposed on the destination IP address or endpoint, you can use your MaxCompute project to access the destination IP address or endpoint after the application is approved. The application processing period does not exceed three business days.

    The following sample shows the format of the application content.

    Project name (name of the project for which you want to enable data masking): data_shield_hz
    Log address:
    Description: Enable an endpoint whitelist for the project to ensure that a specific user-defined function (UDF) can access the endpoints in the whitelist. 
    Region: China (Hangzhou)
    Destination endpoints: dsg-cn-hangzhou.data.aliyun.com and dsg-oss-dic-ori-hz.oss-cn-hangzhou.aliyuncs.com.
    Port numbers: 80 and 443

    The endpoints vary with regions. The following content lists the endpoints that correspond to different regions.

    China (Shanghai): dsg-cn-shanghai.data.aliyun.com, dsg-oss-dic-ori.oss-cn-shanghai.aliyuncs.com
    China (Hangzhou): dsg-cn-hangzhou.data.aliyun.com, dsg-oss-dic-ori-hz.oss-cn-hangzhou.aliyuncs.com
    China (Beijing): dsg-cn-beijing.data.aliyun.com, dsg-oss-dic-ori.oss-cn-beijing.aliyuncs.com   
    China (Chengdu): dsg-cn-chengdu.data.aliyun.com, dsg-oss-dic-ori-cd.oss-cn-chengdu.aliyuncs.com
    China (Shenzhen): dsg-cn-shenzhen.data.aliyun.com, dsg-oss-dic-ori-sz.oss-cn-shenzhen.aliyuncs.com
    China North 2 Ali Gov 1: dsg-cn-north-2-gov-1.data.aliyun.com, dsg-oss-dic-ori-north-2-gov-1.oss-cn-north-2-gov-1-internal.aliyuncs.com
    China East 2 Finance: dsg-cn-shanghai-finance-1.data.aliyun.com, dsg-oss-dic-ori-sh-fin-1.oss-cn-shanghai.aliyuncs.com
    China (Hong Kong): dsg-cn-hongkong.data.aliyun.com, dsg-oss-hongkong.oss-cn-hongkong.aliyuncs.com
    Singapore: dsg-ap-southeast-1.data.aliyun.com, dsg-oss-ap-southeast-1.oss-ap-southeast-1.aliyuncs.com
    US (Silicon Valley): dsg-us-west-1.data.aliyun.com, dsg-oss-us-west-1.oss-us-west-1.aliyuncs.com
    Malaysia (Kuala Lumpur): dsg-ap-southeast-3.data.aliyun.com, dsg-oss-ap-malaysia.oss-ap-southeast-3.aliyuncs.com
    Germany (Frankfurt): dsg-eu-central-1.data.aliyun.com, dsg-oss-eu-central-1.oss-eu-central-1-internal.aliyuncs.com
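Because the endpoints differ by region, the application content can be generated from a lookup table to avoid copy errors. The following Python snippet is a minimal sketch: DSG_ENDPOINTS and whitelist_application are hypothetical names, only two regions are included, and the endpoint values are copied from the preceding list.

```python
# Endpoints copied from the region list (two regions shown as examples).
DSG_ENDPOINTS = {
    "China (Hangzhou)": (
        "dsg-cn-hangzhou.data.aliyun.com",
        "dsg-oss-dic-ori-hz.oss-cn-hangzhou.aliyuncs.com",
    ),
    "China (Beijing)": (
        "dsg-cn-beijing.data.aliyun.com",
        "dsg-oss-dic-ori.oss-cn-beijing.aliyuncs.com",
    ),
}


def whitelist_application(project: str, region: str) -> str:
    """Render the ticket text in the format of the sample application."""
    service, oss = DSG_ENDPOINTS[region]  # raises KeyError for unknown regions
    return "\n".join([
        f"Project name: {project}",
        "Log address:",
        "Description: Enable an endpoint whitelist for the project to ensure "
        "that a specific user-defined function (UDF) can access the endpoints "
        "in the whitelist.",
        f"Region: {region}",
        f"Destination endpoints: {service} and {oss}.",
        "Port numbers: 80 and 443",
    ])


print(whitelist_application("data_shield_hz", "China (Hangzhou)"))
```

The sketch only renders text. You must still submit the ticket with your Alibaba Cloud account.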

Enable the MaxCompute underlying data masking feature

  1. Select a data masking scenario.

    1. Log on to the DataWorks console and go to the Data Security Guard page. For more information, see the "Go to the Data Security Guard page" section in Overview.

    2. In the left-side navigation pane, choose Rule Configuration > Data Masking Management.

    3. On the Data Masking Management page, click Layer masking of the MaxCompute engine in the Underlying desensitization scenario subsection.

      Note

      To show the data masking effect in the DataWorks console, you must enable masking of displayed data in DataStudio and Data Map in the DataWorks console.

      For more information about how to create a data masking scenario, see Create a data masking scenario.

  2. Create a data masking rule.

  3. Optional: If the data that is specified by the masking rule does not need to be masked for specific users, configure a masking rule whitelist.

    1. On the Data Masking Management page, click the Configure Whitelist tab.

    2. In the upper-right corner of the Configure Whitelist tab, click Whitelist.

    3. In the Create Whitelist panel, configure the Sensitive Field Type, User Group Range, and Effective Time parameters.

      Note

      If a user account in the whitelist queries data out of the time range that is specified in the whitelist, sensitive data in the query results is still masked.
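The whitelist behavior described in the note can be modeled as follows. The following Python snippet is a conceptual sketch only and does not show how Data Security Guard evaluates whitelists: WhitelistEntry and is_masked are hypothetical names, and the inclusive handling of the start and end times is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import FrozenSet, Iterable


@dataclass(frozen=True)
class WhitelistEntry:
    """Hypothetical model of one whitelist row; field names are illustrative."""
    sensitive_field_type: str
    users: FrozenSet[str]
    start: datetime
    end: datetime


def is_masked(user: str, field_type: str, when: datetime,
              whitelist: Iterable[WhitelistEntry]) -> bool:
    """Return True if the queried sensitive field is masked for the user."""
    for entry in whitelist:
        if (entry.sensitive_field_type == field_type
                and user in entry.users
                # Assumption: both ends of the effective time range are inclusive.
                and entry.start <= when <= entry.end):
            return False  # a whitelist entry applies, so data is not masked
    return True  # no applicable entry: the masking rule takes effect
```

A user who is in the whitelist but queries data outside the effective time range falls through to the final return statement, so the data is still masked.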

View the execution results of SQL statements

Use the DataStudio page in the DataWorks console

  1. Turn off the data masking switch. For more information, see the "Go to the Security Settings and Others tab" section in Configure settings on the Security Settings and Others tab.


  2. Execute an SQL statement for data queries.

    Before you execute an SQL statement, run SET commands to call the MaxCompute underlying data masking feature in the current session. The following code shows the SET commands that are used to call the MaxCompute underlying data masking feature in different regions.

    Note

    The MaxCompute underlying data masking feature can be used only at the session level.

    China (Shanghai)
    set odps.output.field.formatter={"name":"aegis:<SchemaName>:masking_v2","param":["alias","index"]};
    set odps.isolation.session.enable=true;
    set odps.internet.access.list=dsg-cn-shanghai.data.aliyun.com:80,dsg-cn-shanghai.data.aliyun.com:443,dsg-oss-dic-ori.oss-cn-shanghai.aliyuncs.com:80,dsg-oss-dic-ori.oss-cn-shanghai.aliyuncs.com:443;
    China (Hangzhou)
    set odps.output.field.formatter={"name":"aegis_hz:<SchemaName>:masking_v2","param":["alias","index"]};
    set odps.isolation.session.enable=true;
    set odps.internet.access.list=dsg-cn-hangzhou.data.aliyun.com:80,dsg-cn-hangzhou.data.aliyun.com:443,dsg-oss-dic-ori-hz.oss-cn-hangzhou.aliyuncs.com:80,dsg-oss-dic-ori-hz.oss-cn-hangzhou.aliyuncs.com:443;
    China (Beijing)
    set odps.output.field.formatter={"name":"aegis_bj:<SchemaName>:masking_v2","param":["alias","index"]};
    set odps.isolation.session.enable=true;
    set odps.internet.access.list=dsg-cn-beijing.data.aliyun.com:80,dsg-cn-beijing.data.aliyun.com:443,dsg-oss-dic-ori.oss-cn-beijing.aliyuncs.com:80,dsg-oss-dic-ori.oss-cn-beijing.aliyuncs.com:443;
    China (Chengdu)
    set odps.output.field.formatter={"name":"aegis_cd:<SchemaName>:masking_v2","param":["alias","index"]};
    set odps.isolation.session.enable=true;
    set odps.internet.access.list=dsg-cn-chengdu.data.aliyun.com:80,dsg-cn-chengdu.data.aliyun.com:443,dsg-oss-dic-ori-cd.oss-cn-chengdu.aliyuncs.com:80,dsg-oss-dic-ori-cd.oss-cn-chengdu.aliyuncs.com:443;
    China (Hong Kong)
    set odps.output.field.formatter={"name":"aegis_hk:<SchemaName>:masking_v2","param":["alias","index"]};
    set odps.isolation.session.enable=true;
    set odps.internet.access.list=dsg-cn-hongkong.data.aliyun.com:80,dsg-cn-hongkong.data.aliyun.com:443,dsg-oss-hongkong.oss-cn-hongkong.aliyuncs.com:80,dsg-oss-hongkong.oss-cn-hongkong.aliyuncs.com:443;
    US (Silicon Valley)
    set odps.output.field.formatter={"name":"data_sheild_silicon_dev:<SchemaName>:masking_v2","param":["alias","index"]};
    set odps.isolation.session.enable=true;
    set odps.internet.access.list=dsg-us-west-1.data.aliyun.com:80,dsg-us-west-1.data.aliyun.com:443,dsg-oss-us-west-1.oss-us-west-1.aliyuncs.com:80,dsg-oss-us-west-1.oss-us-west-1.aliyuncs.com:443;

    The following list describes the parameters in the preceding commands.

    odps.output.field.formatter
      The MaxCompute masking function that you want to execute. To use this function, make sure that the field that you want to mask is of the STRING type.

      • aegis_hz:<SchemaName>:masking_v2: the function name. The prefix varies with regions. For example, aegis_hz is used for China (Hangzhou). The SchemaName parameter specifies the schema name if the MaxCompute project uses a three-layer schema model. For more information about schemas, see Schema-related operations.

      • ["alias","index"]: the parameters. These are default parameters.

    odps.isolation.session.enable
      Specifies whether to enable calls at the session level. After the session ends, the data masking feature becomes ineffective.

    odps.internet.access.list
      The list of endpoints that are accessed when you execute the specified function. The endpoints are used to query the masking information preconfigured in Data Security Guard.

    The following code shows a sample script for querying data from a MaxCompute project whose SchemaName is default in the China (Hangzhou) region after the MaxCompute underlying data masking feature is enabled for the project.

    set odps.output.field.formatter={"name":"aegis_hz:default:masking_v2","param":["alias","index"]};
    set odps.isolation.session.enable=true;
    set odps.internet.access.list=dsg-cn-hangzhou.data.aliyun.com:80,dsg-cn-hangzhou.data.aliyun.com:443,dsg-oss-dic-ori-hz.oss-cn-hangzhou.aliyuncs.com:80,dsg-oss-dic-ori-hz.oss-cn-hangzhou.aliyuncs.com:443;
    select * from <table_name>;

  3. View the masking result on the DataStudio page.
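The three SET commands follow the same pattern in every region, so they can be generated instead of copied by hand. The following Python snippet is a minimal sketch: REGIONS and masking_session_script are hypothetical names, only two regions are included, and the function-name prefixes and endpoints are copied from the commands in this section.

```python
import json

# (function-name prefix, service endpoint, OSS endpoint), copied from the
# SET commands in this section. Only two regions are listed as examples.
REGIONS = {
    "China (Hangzhou)": (
        "aegis_hz",
        "dsg-cn-hangzhou.data.aliyun.com",
        "dsg-oss-dic-ori-hz.oss-cn-hangzhou.aliyuncs.com",
    ),
    "China (Beijing)": (
        "aegis_bj",
        "dsg-cn-beijing.data.aliyun.com",
        "dsg-oss-dic-ori.oss-cn-beijing.aliyuncs.com",
    ),
}


def masking_session_script(region: str, schema: str, query: str) -> str:
    """Prefix a query with the session-level SET commands for the region."""
    prefix, service, oss = REGIONS[region]
    # Compact separators reproduce the JSON layout used in this topic.
    formatter = json.dumps(
        {"name": f"{prefix}:{schema}:masking_v2", "param": ["alias", "index"]},
        separators=(",", ":"),
    )
    # Each endpoint is listed once for port 80 and once for port 443.
    access = ",".join(f"{host}:{port}"
                      for host in (service, oss) for port in (80, 443))
    return "\n".join([
        f"set odps.output.field.formatter={formatter};",
        "set odps.isolation.session.enable=true;",
        f"set odps.internet.access.list={access};",
        query.rstrip().rstrip(";") + ";",
    ])


print(masking_session_script("China (Hangzhou)", "default",
                             "select * from my_table"))
```

The table name my_table is a placeholder. Because the commands take effect only at the session level, submit the generated text as a single script in one session.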

Use the MaxCompute client (odpscmd)

  1. Configure the endpoints.

    Before you execute an SQL statement, configure the endpoints that you want to access in the Config file of the MaxCompute client.

    The following code shows the endpoints that correspond to different regions.

    China (Shanghai)
    set odps.internet.access.list=dsg-cn-shanghai.data.aliyun.com:80,dsg-cn-shanghai.data.aliyun.com:443,dsg-oss-dic-ori.oss-cn-shanghai.aliyuncs.com:80,dsg-oss-dic-ori.oss-cn-shanghai.aliyuncs.com:443;
    China (Hangzhou)
    set odps.internet.access.list=dsg-cn-hangzhou.data.aliyun.com:80,dsg-cn-hangzhou.data.aliyun.com:443,dsg-oss-dic-ori-hz.oss-cn-hangzhou.aliyuncs.com:80,dsg-oss-dic-ori-hz.oss-cn-hangzhou.aliyuncs.com:443;
    China (Beijing)
    set odps.internet.access.list=dsg-cn-beijing.data.aliyun.com:80,dsg-cn-beijing.data.aliyun.com:443,dsg-oss-dic-ori.oss-cn-beijing.aliyuncs.com:80,dsg-oss-dic-ori.oss-cn-beijing.aliyuncs.com:443;
    China (Chengdu)
    set odps.internet.access.list=dsg-cn-chengdu.data.aliyun.com:80,dsg-cn-chengdu.data.aliyun.com:443,dsg-oss-dic-ori-cd.oss-cn-chengdu.aliyuncs.com:80,dsg-oss-dic-ori-cd.oss-cn-chengdu.aliyuncs.com:443;
    China (Hong Kong)
    set odps.internet.access.list=dsg-cn-hongkong.data.aliyun.com:80,dsg-cn-hongkong.data.aliyun.com:443,dsg-oss-hongkong.oss-cn-hongkong.aliyuncs.com:80,dsg-oss-hongkong.oss-cn-hongkong.aliyuncs.com:443;
    US (Silicon Valley)
    set odps.internet.access.list=dsg-us-west-1.data.aliyun.com:80,dsg-us-west-1.data.aliyun.com:443,dsg-oss-us-west-1.oss-us-west-1.aliyuncs.com:80,dsg-oss-us-west-1.oss-us-west-1.aliyuncs.com:443;

    The following list describes the parameter in the preceding commands.

    odps.internet.access.list
      The list of endpoints that are accessed when you execute the specified function. The endpoints are used to query the masking information preconfigured in Data Security Guard.

    The following sample shows the content of the Config file for a MaxCompute project whose SchemaName is default in the China (Hangzhou) region.

    project_name=data_shield_hz
    # app access id and key are optional for individual users
    # app_access_id=<app_accessid>
    # app_access_key=<app_accesskey>
    access_id=AccessKey ID
    access_key=AccessKey secret
    # this endpoint is for office environment
    end_point=http://service.odps.aliyun.com/api
    # this url is for odpscmd update
    update_url=http://odps.alibaba-inc.com/official_downloads
    # download sql results by instance tunnel
    use_instance_tunnel=true
    # the max records when download sql results by instance tunnel
    instance_tunnel_max_record=10000
    set odps.internet.access.list=dsg-cn-hangzhou.data.aliyun.com:80,dsg-cn-hangzhou.data.aliyun.com:443,dsg-oss-dic-ori-hz.oss-cn-hangzhou.aliyuncs.com:80,dsg-oss-dic-ori-hz.oss-cn-hangzhou.aliyuncs.com:443;

  2. Execute an SQL statement for data queries.

    Before you execute an SQL statement, run SET commands to call the MaxCompute underlying data masking feature in the current session. The following code shows the SET commands that are used to call the MaxCompute underlying data masking feature in different regions.

    Note

    The MaxCompute underlying data masking feature can be used only at the session level.

    China (Shanghai)
    set odps.output.field.formatter={"name":"aegis:<SchemaName>:masking_v2","param":["alias","index"]};
    set odps.isolation.session.enable=true;
    China (Hangzhou)
    set odps.output.field.formatter={"name":"aegis_hz:<SchemaName>:masking_v2","param":["alias","index"]};
    set odps.isolation.session.enable=true;
    China (Beijing)
    set odps.output.field.formatter={"name":"aegis_bj:<SchemaName>:masking_v2","param":["alias","index"]};
    set odps.isolation.session.enable=true;
    China (Chengdu)
    set odps.output.field.formatter={"name":"aegis_cd:<SchemaName>:masking_v2","param":["alias","index"]};
    set odps.isolation.session.enable=true;
    China (Hong Kong)
    set odps.output.field.formatter={"name":"aegis_hk:<SchemaName>:masking_v2","param":["alias","index"]};
    set odps.isolation.session.enable=true;
    US (Silicon Valley)
    set odps.output.field.formatter={"name":"data_sheild_silicon_dev:<SchemaName>:masking_v2","param":["alias","index"]};
    set odps.isolation.session.enable=true;

    The following list describes the parameters in the preceding commands.

    odps.output.field.formatter
      The MaxCompute masking function that you want to execute. To use this function, make sure that the field that you want to mask is of the STRING type.

      • aegis_hz:<SchemaName>:masking_v2: the function name. The prefix varies with regions. For example, aegis_hz is used for China (Hangzhou). The SchemaName parameter specifies the schema name if the MaxCompute project uses a three-layer schema model. For more information about schemas, see Schema-related operations.

      • ["alias","index"]: the parameters. These are default parameters.

    odps.isolation.session.enable
      Specifies whether to enable calls at the session level. After the session ends, the data masking feature becomes ineffective.

    The following code shows a sample script for querying data from a MaxCompute project in the China (Hangzhou) region after the MaxCompute underlying data masking feature is enabled for the project:

    set odps.output.field.formatter={"name":"aegis_hz:default:masking_v2","param":["alias","index"]};
    set odps.isolation.session.enable=true;
    select * from <table_name>;

  3. View data masking results.

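For the MaxCompute client, the endpoint list is stored in the Config file rather than in the session script. The following Python snippet is a minimal sketch of how the line could be added without creating duplicates. The with_access_list function is a hypothetical helper that works on the file content as text, and it does not read or write files.

```python
ACCESS_PREFIX = "set odps.internet.access.list="


def with_access_list(config_text: str, hosts, ports=(80, 443)) -> str:
    """Return the Config file text with exactly one access-list line."""
    entries = ",".join(f"{host}:{port}" for host in hosts for port in ports)
    # Drop any existing access-list line so that reruns do not add duplicates.
    kept = [line for line in config_text.splitlines()
            if not line.strip().startswith(ACCESS_PREFIX)]
    kept.append(f"{ACCESS_PREFIX}{entries};")
    return "\n".join(kept) + "\n"


hangzhou_hosts = [
    "dsg-cn-hangzhou.data.aliyun.com",
    "dsg-oss-dic-ori-hz.oss-cn-hangzhou.aliyuncs.com",
]
sample = "project_name=data_shield_hz\nuse_instance_tunnel=true\n"
print(with_access_list(sample, hangzhou_hosts))
```

Applying the function to its own output returns the same text, so the function can be rerun after the region or the endpoints change.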

Disable the MaxCompute underlying data masking feature

Execute the following SQL statements to disable the MaxCompute underlying data masking feature:

set odps.output.field.formatter=;
select * from <table_name>;

To keep data in a MaxCompute project unmasked, do not select that project when you configure a data masking scenario in DataWorks. For more information, see "Configure a data masking scenario" in Create a data masking scenario.