Sample practice for masking underlying data in E-MapReduce clusters - DataWorks

If a user has permission to query sensitive data in an E-MapReduce (EMR) cluster but should not see that data in full, you can enable the dynamic data masking feature for EMR so that sensitive values are masked in query results. This topic describes how to enable the feature and walks through an example.

Limits

EMR clusters support only the sensitive data identification and data masking features of Data Security Guard.
The sensitive data identification and data masking features are supported only by specific types of EMR clusters and tables. For more information, see "Which types of Hive tables can be previewed in Data Map?"
Metadata in Data Security Guard is updated with a delay of one day. An EMR table must therefore be created at least one day before you can mask its data.
Only exclusive resource groups for scheduling are supported. For more information, see Exclusive resource groups for scheduling.

Preparations

Prerequisites

By default, Data Security Guard uses the EMR cluster account that maps to your Alibaba Cloud account to sample data. If Lightweight Directory Access Protocol (LDAP) or Kerberos authentication is enabled for your EMR cluster and Ranger or DLF-Auth is used to manage table permissions, you must configure a mapping between the Alibaba Cloud account and the EMR cluster account. You must ensure that the mapped EMR cluster account has the required permissions to access tables in the EMR cluster. For more information, see DataStudio (old version): Associate an EMR computing resource.

Prepare data

Create an EMR table

Go to the DataStudio page.
Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose Data Development and O&M > Data Development. On the page that appears, select the desired workspace from the drop-down list and click Go to Data Development.
On the Data Development page, click Create and select Create Node > EMR Hive to create a Hive node.

Modify the node code and create the onefall_test_dsg table.

CREATE TABLE IF NOT EXISTS onefall_test_dsg
(
    username  STRING
    ,gender   STRING
    ,phone    STRING
    ,email    STRING
    ,card_no  STRING
    ,address  STRING
    ,zip_code STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
;
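
Before you load data, you may want to confirm that the table was created as expected. The following statements are a minimal sketch that uses standard HiveQL and assumes that you run them in the same EMR Hive node and database as the CREATE TABLE statement.

```sql
-- Confirm that the table exists in the current database.
SHOW TABLES LIKE 'onefall_test_dsg';

-- List the columns and their types; all seven columns should be STRING.
DESCRIBE onefall_test_dsg;
```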

Import test data to the onefall_test_dsg table.
1. Download the test data file data.csv.
2. Import the test data.
  - Upload the data.csv file to a node in the EMR cluster and execute the following SQL statement to load the test data:
```
LOAD DATA LOCAL INPATH '/…/data.csv' OVERWRITE INTO TABLE onefall_test_dsg;
```
  - Upload the data.csv file to an Object Storage Service (OSS) bucket and execute the following SQL statement to load the test data:
```
LOAD DATA INPATH 'oss://bucket-name.Endpoint/…/data.csv' OVERWRITE INTO TABLE onefall_test_dsg;
```
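
After either load statement finishes, a quick check such as the following sketch can confirm that the test data landed in the table. The alias row_count is only illustrative, and the number of rows depends on the contents of data.csv.

```sql
-- Count the loaded rows; a result of 0 suggests that the load did not succeed.
SELECT COUNT(*) AS row_count FROM onefall_test_dsg;

-- Inspect a few rows to confirm that the comma-delimited fields map to the expected columns.
SELECT * FROM onefall_test_dsg LIMIT 5;
```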

Update metadata at the Data Security Guard side

The metadata at the Data Security Guard side is updated with a delay of one day. After you create and publish the onefall_test_dsg table, you must wait until the next day before you perform the data masking operation.

Configure data masking

Step 1: Create a sensitive data identification rule

DataWorks uses sensitive data identification rules to identify sensitive fields in E-MapReduce tables. You must configure sensitive data identification rules before you configure data masking rules. For more information, see Configure a sensitive data identification rule and run a sensitive data identification task.

Go to the Data Identification Rules tab

Go to the DataStudio page.
Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose Data Development and O&M > Data Development. On the page that appears, select the desired workspace from the drop-down list and click Go to Data Development.
Click the icon in the upper-left corner. Then, choose All Products > Data Governance > Data Security Guard. On the page that appears, click Try Now to go to the Data Security Guard page.
Note
- If your Alibaba Cloud account is granted the required permissions, you can directly access the homepage of Data Security Guard.
- If your Alibaba Cloud account is not granted the required permissions, you are redirected to the authorization page of Data Security Guard. You can use the features of Data Security Guard only after your Alibaba Cloud account is granted the required permissions.

In the navigation pane on the left, click Rule Configuration > Sensitive Data Detection. The Data Detection Rules page appears.

Configure sensitive data identification rules

This example shows how to create sensitive data identification rules to identify and mask the gender, phone, and email fields in the onefall_test_dsg table that you created in the Prepare data section.

Specify a data category for the sensitive field types that you want to create.
In the Built-in Classification Template area on the left, select the data category of the newly added sensitive field. For more information, see Configure a sensitive data identification rule and run a sensitive data identification task.
Create a sensitive field type and configure a sensitive data identification rule for this type.
In the upper-right corner, click Sensitive Field Type. The sensitive data identification rule configuration page is displayed. For more information, see Configure a sensitive data identification rule and run a sensitive data identification task.
Note
To keep this example easy to follow, name the sensitive field types after the corresponding fields in the onefall_test_dsg table: gender, phone, and email.
After you configure the Data Identification Rules, click Batch Publish in the upper-right corner and select the created rules to publish them.

Step 2: Configure data masking management

DataWorks lets you configure data masking rules to mask sensitive fields in E-MapReduce tables. For more information, see Create a data masking rule.

Go to the Data Masking Management page

Log on to the DataWorks Console and go to the Data Security Guard page. For more information, see Data Security Guard.
Click Try Now. The Data Security Guard Homepage appears.
In the navigation pane on the left, click Rule Configuration > Data Masking Management. On the Data Masking Management page, you can create a new scenario type and configure data masking rules.

Create a data masking scenario

DataWorks provides several fixed, level-1 data masking scenarios. These include dynamic data masking scenarios, such as Data Development/Data Map Display Masking, Data Analysis Display Masking, MaxCompute Engine-layer Masking, and Hologres Engine-layer Masking, and the static data masking scenario of Data Integration Static Masking. You cannot add, edit, or delete these built-in scenarios. However, you can create custom level-2 scenarios based on the level-1 scenarios to meet your business requirements. For more information, see Create a data masking scenario.
This example focuses on the Data Development/Data Map Display Masking and Data Analysis Display Masking scenarios.
- Level-2 scenario name under Data Development/Data Map Display Masking: Development Display.
- Level-2 scenario name under Data Analysis Display Masking: SQL analysis.

Create a data masking rule

After you create a data masking scenario, you can click Masking Rule in the upper-right corner to create a data masking rule. Repeat the steps to create data masking rules for the gender, phone, and email sensitive field types. For more information, see Create a data masking rule.

Select a data masking scenario.
On the Data Masking Management page, select Masking Scenario as Data Development/Data Map Display Masking > Default Scenario, and click + Masking Rule on the right.

Create a data masking rule.

On the Create Data Masking Rule page, you can configure items such as Sensitive Field Type, Data Masking Rule Name, Data Masking Scenario, and Data Masking Method. For more information, see Data Masking Rule Configuration.

The following table describes the configuration of the data masking rule for each created sensitive field type.

| Parameter | gender | email | phone |
| --- | --- | --- | --- |
| Sensitive Field Type | gender | email | phone |
| Data Masking Rule Name | gender | email | phone |
| Data Masking Scenario | `Development Display`, `SQL analysis` | `Development Display`, `SQL analysis` | `Development Display`, `SQL analysis` |
| Masking Mode | Characters to replace | HASH encryption | Redaction |
| Masking Mode settings | Replacement Position: Replace All; Replace with Random Value | Data watermark: Turned Off; Encryption algorithm: MD5; Salt value: 5 | Masking Method: Recommended Method > Show First Three And Last Four Characters |
Note

Multiple data masking methods are available. This example uses Characters to replace, HASH encryption, and Redaction. For more information, see Configure the data masking method.
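
To get a feel for what the three methods in this example do to a value, you can approximate them with built-in Hive functions. The query below is only an illustrative sketch and is not how Data Security Guard implements masking. It makes several assumptions: a fixed `*` stands in for the random replacement value, the salt `5` is appended to the value before hashing, and the cluster runs a Hive version that provides the `md5` function (Hive 1.3.0 or later). The column aliases are hypothetical.

```sql
-- Illustration only: approximates the example masking methods with built-in Hive functions.
SELECT
    -- gender: replace every character (a fixed '*' stands in for a random value)
    repeat('*', length(gender)) AS gender_masked,
    -- email: MD5 hash; appending the salt '5' to the value is an assumption
    md5(concat(email, '5')) AS email_masked,
    -- phone: keep the first three and last four characters and mask the rest;
    -- values of seven characters or fewer are masked completely
    CASE
        WHEN length(phone) > 7
            THEN concat(substr(phone, 1, 3), repeat('*', length(phone) - 7), substr(phone, -4))
        ELSE repeat('*', length(phone))
    END AS phone_masked
FROM onefall_test_dsg
LIMIT 5;
```

The actual masked output in DataWorks can differ from this approximation, for example in the replacement characters that are used or in how the salt is combined with the value.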

Step 3: Enable sensitive data identification

Each day, after Data Security Guard in the production environment obtains the EMR metadata, it calls DataWorks API operations to sample table data and identifies sensitive fields based on the sensitive data identification rules. In this example, you manually run a sensitive data identification task to identify the sensitive fields.

In the navigation pane on the left, click Rule Configuration > Sensitive Data Identification. The Sensitive Data Identification page appears.
In the upper-left corner of the Sensitive Data Identification page, click Run Task. In the Enable Sensitive Data Identification Task panel, configure the parameters.
- Task Type: one-time task.
- Account Used For Identification: The current account is used to sample and scan data. The range of data that can be sampled varies based on the account permissions. In this example, Alibaba Cloud Account is selected.
- Content Identification: Set it to Content recognition or metadata recognition. In this example, Content recognition is selected.
- Sampling Quantity: You can specify a custom number of samples. We recommend that you use the default value of 100.
- Scan Scope: Set it to Custom Scope to specify the projects or databases to be scanned. In this example, the table name is onefall_test_dsg.
After you select the scan scope, click Run in the lower-right corner of the panel to start the sensitive data identification task.
Note
On the Sensitive Data Identification page, you can click Task Execution Records to view the execution details of the sensitive data identification task.

View the execution results of SQL statements

Preview the data masking result of the EMR table

Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose Data Governance > Data Map. On the page that appears, click Go to Data Map.
Click the button on the left. On the search page that appears, click the drop-down list at the top of the page and select the E-MapReduce data source. Then, enter the table name onefall_test_dsg in the search box.
Click the name of the onefall_test_dsg table to go to the details page of the table. Then, click the Data Preview tab to preview the table data.

Note

On the Data Preview tab, the fields in the table are masked based on the configured sensitive data identification rules and data masking rules.

View the data masking result on the Data Studio page

Whether query results are masked on the DataStudio page depends on the Mask Data in Page Query Results switch in the Data Security section of the Security Settings and Others tab. To turn on the switch, perform the following steps:

Go to the DataStudio page.
Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose Data Development and O&M > Data Development. On the page that appears, select the desired workspace from the drop-down list and click Go to Data Development.
In the navigation pane on the left of the Data Studio page, click the icon. The Settings page appears.
On the Settings page, click Security Settings And Others and turn on the Data Security > Mask Data In Page Query Results switch.

Test the masking effect of queried data

Go to the DataStudio page.
Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose Data Development and O&M > Data Development. On the page that appears, select the desired workspace from the drop-down list and click Go to Data Development.
In the navigation pane on the left, click the icon. In the Ad Hoc Query pane, click the icon and select Create > EMR Hive to create an ad hoc query node.
Query the onefall_test_dsg table in the node and view the masking effect of the table on the Data Development page.
```
SELECT * FROM onefall_test_dsg;
```
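
To make the comparison easier to read, you can also run a narrower query. The following sketch returns username, which has no masking rule in this example, next to the three fields that do. It assumes that the identification task matched gender, phone, and email and that the Mask Data in Page Query Results switch is turned on.

```sql
-- username has no masking rule in this example; gender, phone, and email
-- are expected to appear masked if the rules and the switch took effect.
SELECT username, gender, phone, email
FROM onefall_test_dsg
LIMIT 10;
```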