This topic describes how to configure the extract, transform, and load (ETL) feature to resolve data incompatibility and mask data for a Data Transmission Service (DTS) task.

Prerequisites

  • The source and destination database instances are created. In this topic, ApsaraDB RDS for MySQL instances are used for data synchronization. For more information about how to create an ApsaraDB RDS for MySQL instance, see Create an ApsaraDB RDS for MySQL instance.
  • The available storage space of the destination instance is larger than the total size of the data in the source instance.

Scenario 1: Resolve data incompatibility

DTS allows you to migrate and synchronize data between heterogeneous databases. During data migration and synchronization, data types are automatically mapped between the source and destination databases. However, data incompatibility may still occur. For example, when data is synchronized from a MySQL database to an Oracle database, an empty string in the varchar(10) NOT NULL column cannot be synchronized to a non-empty column in the destination database. This is because an empty string is treated as a null value in Oracle. As a result, the data synchronization task is interrupted.

If you skip the data in such a scenario, data inconsistency occurs. To resolve this issue, you can configure ETL in the data synchronization task and use domain-specific language (DSL) statements to handle invalid values.

Note In the following example, the source ApsaraDB RDS for MySQL instance allows null entries and the destination ApsaraDB RDS for MySQL instance does not.
  1. Create databases and tables in the source and destination instances.
    • Source instance
      CREATE DATABASE testdb;
      USE testdb;
      CREATE TABLE `src_table` (
          `id` int(11) NOT NULL,
          `user` varchar(100) NULL,
          PRIMARY KEY (`id`)
      ) ENGINE=InnoDB
      DEFAULT CHARACTER SET=utf8;
    • Destination instance
      CREATE DATABASE testdb;
      USE testdb;
      CREATE TABLE `dest_table` (
          `id` int(11) NOT NULL,
          `user` varchar(100) NOT NULL,
          PRIMARY KEY (`id`)
      ) ENGINE=InnoDB
      DEFAULT CHARACTER SET=utf8;
  2. Create a data synchronization task between the source and destination instances. For more information, see Overview of data synchronization scenarios.
    When you configure the task, take note of the following parameter settings:
    • Set the Processing Mode of Conflicting Tables parameter to Ignore Errors and Proceed so that no error is reported if the destination instance has a table with the same name as a table in the source instance.
    • Right-click the src_table table in the Selected Objects section and change the table name to dest_table.
  3. Insert NULL into the user column in the source instance.
    INSERT INTO src_table (id) VALUES (1);
    The data synchronization task fails because the user column in the destination instance is a non-null column.

    Error message: Column 'user' cannot be null.

  4. Configure ETL and use DSL statements to handle invalid values in the data synchronization task.
    1. Find the data synchronization task for which you want to modify ETL configurations and choose More icon > Modify ETL Configurations.
    2. In the Advanced Settings section of the Configure Objects and Advanced Settings step, set the Configure ETL parameter to Yes. Then, in the code editor, enter e_if(op_is_null(`user`), e_set(`user`, '_NULL_')).
      Note Syntax description of the DSL statement e_if(op_is_null(`user`), e_set(`user`, '_NULL_')):
      • op_is_null(`user`): checks whether an entry in the user column of the source instance is NULL.
      • e_set(`user`,'_NULL_'): sets the entry in the user column of the destination instance to _NULL_.
      • e_if(op_is_null(`user`), e_set(`user`, '_NULL_')): checks whether an entry in the user column of the source instance is NULL, and if yes, sets the entry in the user column of the destination instance to _NULL_.

      For more information about the DSL syntax, see Overview of DSL.

    3. After you complete this configuration, click Next: Precheck and Start Task.
  5. Check whether the task is resumed.
  6. In the destination instance, check whether NULL is replaced with _NULL_.

Scenario 2: Mask data

DTS provides features such as data backup and data integration. When you use these features, you may want to analyze data. For data security, you can configure ETL in a DTS task and use DSL statements to mask sensitive data.

You can use common data conversion functions such as the hash function in the DSL script for data masking. In the following example, ID card numbers and names are used as sensitive data.

  1. Create a database and a table in the source instance and populate the table with the sensitive data.
    CREATE DATABASE testdb2;
    USE testdb2;
    CREATE TABLE user(id int NOT NULL PRIMARY KEY, id_card varchar(100) NOT NULL, name varchar(100) NOT NULL);
    INSERT INTO user VALUES(1, '123456789', 'name1');
    INSERT INTO user VALUES(2, '987654321', 'name2');
  2. Create a data synchronization task between the source and destination instances. Synchronize the user table in the testdb2 database to the destination instance. For more information, see Overview of data synchronization scenarios.
    In the Advanced Settings section of the Configure Objects and Advanced Settings step, set the Configure ETL parameter to Yes. Then, in the code editor, enter the following DSL statements to mask the data:
    e_compose(
        e_if(op_eq(__TB__,'user'),e_set(`id_card`, str_md5(`id_card`))),
        e_if(op_eq(__TB__,'user'),e_set(`name`, str_mask(`name`, 1, 2, '*')))
    )
    Note Syntax description of the DSL statements:
    • str_md5(`id_card`): encrypts entries in the id_card column by using the MD5 algorithm.
    • str_mask(`name`, 1, 2, '*'): replaces the first and second bytes of the entries in the name column with asterisks (*).
    • e_set(`id_card`, str_md5(`id_card`)): sets entries in the id_card column to ciphertext generated by MD5.
    • e_set(`name`, str_mask(`name`, 1, 2, '*')): sets entries in the name column to the data after masking.
    • e_if(op_eq(__TB__,'user'),e_set(`id_card`, str_md5(`id_card`))),: sets entries in the id_card column of the user table to ciphertext generated by MD5.
    • e_if(op_eq(__TB__,'user'),e_set(`name`, str_mask(`name`, 1, 2, '*'))): sets entries in the name column of the user table to the data after masking.
    • e_compose( e_if(op_eq(__TB__,'user'),e_set(`id_card`, str_md5(`id_card`))), e_if(op_eq(__TB__,'user'),e_set(`name`, str_mask(`name`, 1, 2, '*')))): encrypts entries in the id_card column of the user table by using the MD5 algorithm and replaces the first and second bytes of the entries in the name column of the user table with asterisks (*).

    For more information about the DSL syntax, see Overview of DSL.

  3. Check whether the synchronized data is masked in the destination instance and whether the entries in the name column are masked with asterisks (*).