All Products
Search
Document Center

DataWorks:Migrate data from a self-managed MySQL database on an ECS instance to MaxCompute

Last Updated:Apr 15, 2024

This topic describes how to use exclusive resource groups for Data Integration to migrate data from a self-managed MySQL database hosted on an Elastic Compute Service (ECS) instance to MaxCompute.

Prerequisites

  • At least one ECS instance is purchased and associated with a virtual private cloud (VPC). You must make sure that the network type of the ECS instance is not classic network. A MySQL database that stores test data is deployed on the ECS instance. An account used to connect to the database is created. In this example, use the following statements to create a table in the MySQL database and insert test data to the table:

    CREATE TABLE IF NOT EXISTS good_sale(
       create_time timestamp,
       category varchar(20),
       brand varchar(20),
       buyer_id varchar(20),
       trans_num varchar(20),
       trans_amount DOUBLE,
       click_cnt varchar(20)
       );
    insert into good_sale values('2018-08-21','coat','brandA','lilei',3,500.6,7),
    ('2018-08-22','food','brandB','lilei',1,303,8),
    ('2018-08-22','coat','brandC','hanmeimei',2,510,2),
    ('2018-08-22','bath','brandA','hanmeimei',1,442.5,1),
    ('2018-08-22','food','brandD','hanmeimei',2,234,3),
    ('2018-08-23','coat','brandB','jimmy',9,2000,7),
    ('2018-08-23','food','brandA','jimmy',5,45.1,5),
    ('2018-08-23','coat','brandE','jimmy',5,100.2,4),
    ('2018-08-24','food','brandG','peiqi',10,5560,7),
    ('2018-08-24','bath','brandF','peiqi',1,445.6,2),
    ('2018-08-24','coat','brandA','ray',3,777,3),
    ('2018-08-24','bath','brandG','ray',3,122,3),
    ('2018-08-24','coat','brandC','ray',1,62,7) ;
  • The private IP address, VPC, and vSwitch of your ECS instance are noted.ECS

  • A security group rule is added for the ECS instance to allow access requests on the port used by the MySQL database. By default, the MySQL database uses port 3306. For more information, see Add a security group rule. The name of the security group is noted.

  • A DataWorks workspace is created. In this example, create a DataWorks workspace that is in basic mode and uses a MaxCompute compute engine. Make sure that the created DataWorks workspace belongs to the same region as the ECS instance. For more information about how to create a workspace, see Create a workspace.

  • An exclusive resource group for Data Integration is purchased and associated with the VPC where the ECS instance resides. The exclusive resource group and the ECS instance are in the same zone. For more information, see Create and use an exclusive resource group for Data Integration. After the exclusive resource group is associated with the VPC, you can view information about the exclusive resource group on the Resource Groups page.

  • On the VPC Binding tab of the exclusive resource group, whether the information about VPC, vSwitch, and security group of the exclusive resource group is the same as that of the ECS instance is checked.网络绑定

  • A MaxCompute data source is added. For more information, see Add a MaxCompute data source.

Background information

Exclusive resources can transmit your data in a fast and stable manner. You must make sure that the exclusive resource group for Data Integration resides in the same zone of the same region as the data source that you want to access. You must also make sure that the exclusive resource group for Data Integration belongs to the same region as the DataWorks workspace. In this example, a self-managed MySQL database deployed on an ECS instance is used as a data source.

Procedure

  1. Add a MySQL data source in DataWorks.

    1. Go to the Data Source page.

      1. Log on to the DataWorks console. In the left-side navigation pane, click Management Center. On the page that appears, select the desired workspace from the drop-down list and click Go to Management Center.

      2. In the left-side navigation pane of the page that appears, click Data Source. The Data Source page appears.

    2. On the Data Source page, click Add Data Source.

    3. In the Add Data Source dialog box, select MySQL.

    4. In the Add MySQL Data Source dialog box, configure the parameters. For more information, see Add a MySQL data source.

      In this example, set the Data Source Type parameter to Connection String Mode. Enter a URL that contains the private IP address of the ECS instance and the default port number 3306 of the MySQL database in the JDBC URL field.连接串模式

      Note

      DataWorks cannot test the connectivity of a self-managed MySQL database in a VPC. Therefore, it is normal that a connectivity test fails.

    5. Find the desired resource group and click Test Network Connectivity.

      You can use the shared resource group for Data Integration or an exclusive resource group for Data Integration to run data synchronization tasks. A data synchronization task can be run on only one resource group during each run. If multiple resource groups are available for a data synchronization task and you want the data synchronization task to be run as expected on every resource group, you must test the network connectivity between each of the selected resource groups and the data synchronization task. For more information, see Establish a network connection between a resource group and a data source.

    6. If the connectivity test is successful, click Complete.

  2. Create a MaxCompute table.

    You must create a table in DataWorks to receive test data from the MySQL database.

    1. Click the 图标 icon in the upper-left corner and choose All Products > Data Development And Task Operation > DataStudio.

    2. Create a workflow. For more information, see Create an auto triggered workflow.

    3. Right-click the created workflow and choose Create Table > MaxCompute > Table.

    4. Enter the name for your MaxCompute table. In this example, set the Table Name parameter to good_sale, which is the same as the name of the table in the MySQL database. Click DDL, enter the table creation statement, and then click Generate Table Schema.

      In this example, enter the following table creation statement. Pay attention to data type conversion.

      CREATE TABLE IF NOT EXISTS good_sale(
         create_time string,
         category STRING,
         brand STRING,
         buyer_id STRING,
         trans_num BIGINT,
         trans_amount DOUBLE,
         click_cnt BIGINT
         );
    5. Set the Display Name parameter and click Commit to Production Environment. The MaxCompute table named good_sale is created.

  3. Create and configure a data synchronization task.

    1. Right-click the workflow you created and choose Create Node > Data Integration > Offline synchronization to create a data synchronization task.

    2. Select the created MySQL data source as the source and the added MaxCompute data source as the destination. Click the Switch to Code Editor icon to switch to the code editor.

      If you cannot select the desired source table or an error is returned when you attempt to switch to the code editor, ignore the issue.

    3. Click the Resource Group configuration tab in the right-side navigation pane and select an exclusive resource group that you have purchased.

      If you do not select the exclusive resource group for Data Integration that you want to use to run the task, the task may fail to be run.

    4. Enter the following code for the data synchronization task:

      {
          "type": "job",
          "steps": [
              {
                  "stepType": "mysql",
                  "parameter": {
                      "column": [// The names of the columns.
                          "create_time",
                          "category",
                          "brand",
                          "buyer_id",
                          "trans_num",
                          "trans_amount",
                          "click_cnt"
                      ],
                      "connection": [
                          {
                              "datasource": "shuai",// The source.
                              "table": [
                                  "good_sale"// The name of the table in the source database. The name must be enclosed in brackets []. 
                              ]
                          }
                      ],
                      "where": "",
                      "splitPk": "",
                      "encoding": "UTF-8"
                  },
                  "name": "Reader",
                  "category": "reader"
              },
              {
                  "stepType": "odps",
                  "parameter": {
                      "partition": "",
                      "truncate": true,
                      "datasource": "odps_source",// The name of the MaxCompute data source.
                      "column": [// The columns in the destination table.
                          "create_time",
                          "category",
                          "brand",
                          "buyer_id",
                          "trans_num",
                          "trans_amount",
                          "click_cnt"
                      ],
                      "emptyAsNull": false,
                      "table": "good_sale"// The name of the destination table.
                  },
                  "name": "Writer",
                  "category": "writer"
              }
          ],
          "version": "2.0",
          "order": {
              "hops": [
                  {
                      "from": "Reader",
                      "to": "Writer"
                  }
              ]
          },
          "setting": {
              "errorLimit": {
                  "record": "0"
              },
              "speed": {
                  "throttle": false,
                  "concurrent": 2
              }
          }
      }
    5. Click the Run icon. You can click the Runtime Log tab in the lower part of the page to check whether the test data is synchronized to MaxCompute.

Result

To query data in the MaxCompute table, create an ODPS SQL node.

Enter the statement select * from good_sale ;, and click the Run icon. If the test data appears, the test data is synchronized to the MaxCompute table.