DataWorks: Example: Masking EMR data

Last Updated: Mar 26, 2026

Dynamic data masking lets users with query access to an E-MapReduce (EMR) cluster see transformed values instead of raw sensitive data — without altering the underlying data. At query time, Data Security Guard intercepts the result and applies masking rules based on the user identity and the masking scenario. This topic walks through an end-to-end example: creating a test table, configuring sensitive data identification and masking rules, and verifying that masked values appear in query results.

How it works

  1. Data Security Guard syncs EMR metadata once per day and uses sensitive data identification rules to classify fields (for example, phone, email, gender) as sensitive.

  2. When a user queries the table in Data Studio or Data Map, Data Security Guard matches the query context to a masking scenario and applies the corresponding masking rule to each sensitive field.

  3. The user sees the masked value (for example, 138****8888) in the result set. The underlying data is unchanged.
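
The flow above can be pictured as a lookup keyed by masking scenario and sensitive field type. The following Python sketch is illustrative only: the rule table, the function names, and the placeholder `star_all` rule are assumptions made for this example, not the Data Security Guard implementation or API.

```python
from typing import Callable, Dict, Tuple


def star_all(value: str) -> str:
    """Placeholder masking function (assumed): hide every character."""
    return "*" * len(value)


# Hypothetical rule table: (masking scenario, sensitive field type) -> masking function.
RULES: Dict[Tuple[str, str], Callable[[str], str]] = {
    ("SQL analysis", "phone"): star_all,
}


def apply_masking(scenario: str, row: Dict[str, str],
                  field_types: Dict[str, str]) -> Dict[str, str]:
    """Return a masked copy of the row; the input row itself is never modified."""
    masked = {}
    for column, value in row.items():
        # Columns without a sensitive field type, or without a rule, pass through.
        rule = RULES.get((scenario, field_types.get(column, "")))
        masked[column] = rule(value) if rule else value
    return masked


row = {"username": "alice", "phone": "13812348888"}
result = apply_masking("SQL analysis", row, {"phone": "phone"})
```

Because `apply_masking` builds a new dictionary, `row` keeps its raw values, which mirrors the point that masking changes only what the user sees.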

Limitations

  • EMR clusters support only the sensitive data identification and dynamic data masking features of Data Security Guard. Other Data Security Guard features are not supported.

  • Sensitive data identification and data masking are supported only by specific types of EMR clusters and tables. For details, see the Which types of Hive tables support data preview in Data Map? section in the "Data governance" topic.

  • Data Security Guard refreshes EMR metadata with a one-day delay, so the EMR table that you want to mask must exist for at least one day before you configure masking.

  • Only exclusive resource groups for scheduling are supported. For details, see Exclusive resource groups for scheduling.

Prerequisites

Before you begin, make sure that you have:

  • An EMR cluster with the required table permissions

  • Access to DataWorks Data Security Guard (your Alibaba Cloud account must be granted the required permissions)

  • (Conditional) A mapping between your Alibaba Cloud account and an EMR cluster account. This is required only if Lightweight Directory Access Protocol (LDAP) or Kerberos authentication is enabled for your EMR cluster and Ranger or DLF-Auth manages table permissions. The mapped account must have access to the target tables. For details, see Data Studio (legacy version): Associate an EMR computing resource.

Note: By default, Data Security Guard uses the EMR cluster account that maps to your Alibaba Cloud account to sample data.

Masking scenarios in this example

Data Security Guard organizes masking rules into a two-level hierarchy. Level-1 scenarios are fixed and define the display context. Level-2 scenarios are custom scenarios you create under a level-1 scenario.

This example uses two level-2 scenarios to cover the most common query surfaces:

| Level-2 scenario name | Based on level-1 scenario | Best used for |
| --- | --- | --- |
| development demonstration | Data development / Data map display desensitization | Previewing data in Data Map and Data Studio |
| SQL analysis | Data analysis and display desensitization | Running ad hoc queries and analytical workloads |

Step 1: Create an EMR table

  1. Log on to the DataWorks console. In the top navigation bar, select the region. In the left-side navigation pane, choose Data Development and O&M > Data Development. Select the workspace and click Go to Data Development.

  2. In the DATASTUDIO pane, click the Create icon and choose Create Node > EMR > EMR Hive.

  3. In the node editor, run the following SQL statement to create the onefall_test_dsg table:

    CREATE TABLE IF NOT EXISTS onefall_test_dsg
    (
        username  STRING,
        gender    STRING,
        phone     STRING,
        email     STRING,
        card_no   STRING,
        address   STRING,
        zip_code  STRING
    )
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ',';
  4. Import test data into the table.

    1. Download data.csv.

    2. Import the data using one of the following methods:

    • From an EMR cluster node: Upload data.csv to a node in the EMR cluster and run:

      LOAD DATA LOCAL INPATH '/…/data.csv' OVERWRITE INTO TABLE onefall_test_dsg;
    • From an Object Storage Service (OSS) bucket: Upload data.csv to an OSS bucket and run:

      LOAD DATA INPATH 'oss://bucket-name.Endpoint/…/data.csv' OVERWRITE INTO TABLE onefall_test_dsg;
  5. Wait until the next day before proceeding. Data Security Guard syncs EMR metadata once per day, so the table must exist for at least one day before masking takes effect.
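
If data.csv is not available to you, a small synthetic file with the same seven columns is enough to exercise the masking rules. The Python sketch below is a stand-in under stated assumptions, not the official sample file: every value is made up, and `make_rows` and `to_csv_text` are hypothetical helpers. It writes no header row, and no field contains a comma, because the table is declared with a plain comma delimiter.

```python
import random

# Column order of onefall_test_dsg, taken from the CREATE TABLE statement.
COLUMNS = ["username", "gender", "phone", "email", "card_no", "address", "zip_code"]


def digits(rng: random.Random, count: int) -> str:
    """Return a string of random decimal digits."""
    return "".join(rng.choice("0123456789") for _ in range(count))


def make_rows(count: int, seed: int = 0) -> list:
    """Build synthetic rows (made-up values) in the table's column order."""
    rng = random.Random(seed)
    rows = []
    for i in range(count):
        rows.append([
            f"user{i}",
            rng.choice(["Male", "Female"]),
            "138" + digits(rng, 8),
            f"user{i}@example.com",
            digits(rng, 16),
            f"{i + 1} Example Road",
            digits(rng, 6),
        ])
    return rows


def to_csv_text(rows: list) -> str:
    """Join fields with commas, without quoting, one row per line."""
    for row in rows:
        if any("," in field for field in row):
            raise ValueError("fields must not contain commas")
    return "\n".join(",".join(row) for row in rows) + "\n"


csv_text = to_csv_text(make_rows(5))
# Save csv_text as data.csv, then load it with one of the LOAD DATA statements above.
```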

Step 2: Create sensitive data identification rules

DataWorks uses sensitive data identification rules to classify fields in EMR tables as sensitive. You must publish these rules before configuring masking rules.

In this example, you create rules to identify the gender, phone, and email fields in onefall_test_dsg.

  1. Log on to the DataWorks console. In the top navigation bar, select the region. In the left-side navigation pane, choose Data Development and O&M > Data Development. Select the workspace and click Go to Data Development.

  2. Click the icon in the upper-left corner, then choose All Products > Data Governance > Data Security Guard. Click Try Now to go to the Data Security Guard page.

    Note: If your account is not yet granted the required permissions, you are redirected to the authorization page. Complete authorization before proceeding.
  3. In the left-side navigation pane, choose Rule Configuration > Sensitive Data Identification. The Data Identification Rules tab appears.

  4. In the BuildInClassificationTemplate section on the left, select the data category for the sensitive field types you want to create. For details, see Configure sensitive data detection rules and run tasks.

  5. In the upper-right corner of the tab, click Sensitive Field Type to open the configuration panel. Create a sensitive field type for each of gender, phone, and email. Using the field names from onefall_test_dsg as the type names makes them easier to identify. For details, see Configure sensitive data detection rules and run tasks.

  6. After creating the rules, click Batch Publish in the upper-right corner and select all three rules to publish them.

Step 3: Configure data masking rules

For each of the three sensitive field types (gender, phone, email), configure a masking rule that applies to both level-2 scenarios. The masking mode differs by field:

| Field | Sensitive field type | Data masking rule name | Masking scenarios | Masking mode |
| --- | --- | --- | --- | --- |
| gender | gender | gender | development demonstration, SQL analysis | Characters to replace → Replace with random value |
| email | email | email | development demonstration, SQL analysis | Hash → Encryption algorithm: MD5, Salt value: 5, Data watermarking: off |
| phone | phone | phone | development demonstration, SQL analysis | Masking out → Redaction mode: Show first three and last four characters |

Note: This example uses three different masking modes to illustrate the options. For a full description of each mode, see the Configure the data masking method section in "Create a data masking rule".
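
To make the three modes concrete, the following Python sketch approximates what each one does to a single value. It is an illustration under assumptions, not the product's algorithm: it assumes the hash algorithm is MD5, that the salt is appended to the value before hashing, and that random replacement draws lowercase letters. The function names are hypothetical.

```python
import hashlib
import random


def mask_out(value: str, keep_first: int = 3, keep_last: int = 4) -> str:
    """Masking out: keep the first and last characters, replace the rest with '*'."""
    hidden = len(value) - keep_first - keep_last
    if hidden <= 0:
        # Too short to keep both ends without revealing everything.
        return "*" * len(value)
    return value[:keep_first] + "*" * hidden + value[-keep_last:]


def hash_with_salt(value: str, salt: str = "5") -> str:
    """Hash: MD5 hex digest of value + salt (the combination order is assumed)."""
    return hashlib.md5((value + salt).encode("utf-8")).hexdigest()


def replace_random(value: str, rng: random.Random) -> str:
    """Random replacement: swap each character for a random lowercase letter (assumed alphabet)."""
    return "".join(rng.choice("abcdefghijklmnopqrstuvwxyz") for _ in value)


print(mask_out("13812348888"))               # 138****8888
print(hash_with_salt("user@example.com"))    # 32 hexadecimal characters
print(replace_random("Male", random.Random()))
```

Hashing is deterministic, so equal raw values yield equal masked values. Random replacement does not preserve that relationship.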

To configure the rules:

  1. Log on to the DataWorks console and go to the Data Security Guard page. For details, see Overview.

  2. Click Try Now to open the Data Security Guard homepage.

  3. In the left-side navigation pane, choose Rule Configuration > Data Masking Management.

  4. Create the two level-2 scenarios. For details, see Create a data masking scenario.

  5. For each of the three sensitive field types, click Masking Rule in the upper-right corner and configure the rule using the values in the table above. For general instructions, see Create a data masking rule.

Step 4: Run the sensitive data identification task

After publishing the identification rules, run the task manually to classify fields in onefall_test_dsg without waiting for the next scheduled sync.

  1. In the left-side navigation pane of Data Security Guard, choose Rule Configuration > Sensitive Data Identification.

  2. In the upper-left corner, click Run Task. In the Enable sensitive data identification tasks panel, configure the following parameters:

    | Parameter | Value for this example |
    | --- | --- |
    | Task type | Manual Tasks |
    | Account used for identification | Alibaba Cloud Account |
    | Content identification | Content recognition |
    | Sampling quantity | 100 (default) |
    | Scanning range | Partial data (select the workspace and database that contain onefall_test_dsg) |

  3. Click Run in the lower-right corner to start the task.

To view the task progress and results, go to the Task Execution Records tab on the Sensitive Data Identification page.
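
The Content recognition and Sampling quantity parameters suggest that the task classifies a column from a sample of its values. As a rough illustration only, the Python sketch below tests sampled values against one regular expression per sensitive field type. The patterns, the 80% match threshold, and the function name are assumptions for this example, not the built-in identification rules.

```python
import random
import re
from typing import Dict, List, Optional

# Hypothetical content patterns, one per sensitive field type.
PATTERNS: Dict[str, re.Pattern] = {
    "phone": re.compile(r"1\d{10}"),
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
    "gender": re.compile(r"(?i)male|female"),
}


def classify_column(values: List[str], sample_size: int = 100,
                    threshold: float = 0.8, seed: int = 0) -> Optional[str]:
    """Return the first field type whose pattern matches enough sampled values, else None."""
    if not values:
        return None
    rng = random.Random(seed)
    # Inspect at most sample_size values instead of the whole column.
    sample = rng.sample(values, min(sample_size, len(values)))
    for field_type, pattern in PATTERNS.items():
        hits = sum(1 for value in sample if pattern.fullmatch(value))
        if hits / len(sample) >= threshold:
            return field_type
    return None
```

For example, `classify_column(["13812348888", "13900001111"])` returns `"phone"`, while a column of street addresses matches no pattern and returns `None`.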

Verify the masking results

After the identification task completes, the gender, phone, and email fields are masked based on the rules you configured. The following table shows the expected transformation for each field:

| Field | Raw value (example) | Masked value |
| --- | --- | --- |
| gender | Male | A random replacement value |
| phone | 13812348888 | 138****8888 |
| email | user@example.com | A hashed string (MD5 algorithm) |
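
When you compare query output with your source file, a few mechanical checks can confirm that each field has the expected masked shape. The Python sketch below is one way to do that. It assumes the hashed email is displayed as a 32-character hexadecimal MD5 digest, and `check_row` is a hypothetical helper that takes rows as dictionaries keyed by column name.

```python
import re

PHONE_MASKED = re.compile(r"\d{3}\*+\d{4}")
MD5_HEX = re.compile(r"[0-9a-fA-F]{32}")  # assumed display format of the hash


def check_row(raw: dict, shown: dict) -> list:
    """Compare a raw row with the displayed row and return a list of problems."""
    problems = []
    if not PHONE_MASKED.fullmatch(shown["phone"]):
        problems.append("phone is not in the 138****8888 form")
    elif (shown["phone"][:3] != raw["phone"][:3]
          or shown["phone"][-4:] != raw["phone"][-4:]):
        problems.append("phone prefix or suffix does not match the raw value")
    if not MD5_HEX.fullmatch(shown["email"]):
        problems.append("email is not a 32-character hexadecimal digest")
    if shown["gender"] == raw["gender"]:
        # With random replacement, an unchanged value is unlikely but possible.
        problems.append("gender equals the raw value")
    return problems
```

An empty list means that all three fields look masked. Any entries name the fields to recheck against the rules from Step 3.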

Preview masked data in Data Map

  1. Log on to the DataWorks console. In the left-side navigation pane, choose Data Governance > Data Map, then click Go to Data Map.

  2. In the left-side navigation pane of the Data Map page, click the image icon. In the top navigation bar dropdown, select E-MapReduce. Enter onefall_test_dsg in the search box.

  3. Click the table name to open the table details page, then click the Data Preview tab.

The gender, phone, and email fields are masked based on the rules configured in Step 3.

View masked data in Data Studio

Masking in Data Studio page query results is controlled by the Mask Data in Page Query Results setting. Enable it before testing:

  1. Log on to the DataWorks console. In the left-side navigation pane, choose Data Development and O&M > Data Development. Select the workspace and click Go to Data Development.

  2. In the left-side navigation pane of the Data Studio page, click the image icon to open the Settings page.

  3. Click Security Settings and Others, then turn on Mask Data in Page Query Results in the Data Security section.

Test with an ad hoc query

  1. Log on to the DataWorks console. In the left-side navigation pane, choose Data Development and O&M > Data Development. Select the workspace and click Go to Data Development.

  2. In the left-side navigation pane, click the image icon. In the Ad Hoc Query pane, hover over the image icon and choose Create > EMR Hive.

  3. Run the following query:

    SELECT * FROM onefall_test_dsg;

    The gender, phone, and email fields in the result set are masked.

What's next