This topic describes how to use DataWorks to collect data to MaxCompute.

Background information

In this workshop, you must create connections to an Object Storage Service (OSS) bucket and an ApsaraDB RDS instance from which you want to read data. You must also create tables to which you want to write data.
Note
  • You can use the data stores that are prepared for you in this workshop. You can also use your own data stores.
  • The prepared data stores reside in the China (Shanghai) region. We recommend that you use a workspace in the China (Shanghai) region so that the prepared data stores are accessible when you create connections to them.

Create a connection to an OSS bucket from which you want to read data

  1. Go to the Data Source page.
    1. Log on to the DataWorks console.
    2. In the left-side navigation pane, click Workspaces.
    3. On the Workspaces page, find the workspace in which you want to create a connection and click Data Integration in the Actions column.
      If you are using another service of DataWorks, click the More icon in the upper-left corner and choose All Products > Data Aggregation > Data Integration to go to the Data Integration page.
    4. On the page that appears, click Connection in the left-side navigation pane. The Data Source page appears.
  2. On the Data Source page, click New data source in the upper-right corner.
  3. In the Add data source dialog box, click OSS in the Semi-structured storage section.
  4. In the Add OSS data source dialog box, set the parameters as required.
    • Data Source Name: The name of the connection. Enter oss_workshop_log.
    • Data source description: The description of the connection.
    • Applicable environment: The environment in which the connection is used. Select Development.
      Note: After you create a connection in the development environment, you must also create the same connection in the production environment by setting this parameter to Production. Otherwise, an error is returned when a node that uses the connection is run in the production environment.
    • Endpoint: The OSS endpoint. Enter http://oss-cn-shanghai-internal.aliyuncs.com.
    • Bucket: The name of the OSS bucket. Enter new-dataworks-workshop.
    • AccessKey ID: The AccessKey ID that is used to connect to OSS. Enter LTAI4FvGT3iU4xjKotpUMAjS.
    • AccessKey Secret: The AccessKey secret that is used to connect to OSS. Enter 9RSUoRmNxpRC9EhC4m9PjuG7Jzy7px.
  5. On the Data Integration tab, click Test connectivity in the Operation column of each resource group.
    A sync node uses only one resource group. To ensure that your sync nodes can be properly run, you must test the connectivity of all the resource groups for Data Integration on which your sync nodes will be run. If you need to test the connectivity of multiple resource groups for Data Integration at a time, select the resource groups and click Batch test connectivity. For more information, see Select a network connectivity solution.
  6. After the connection passes the connectivity test, click Complete.
    Note
    • If the connectivity test fails, check whether the AccessKey ID and AccessKey secret that you entered are correct and whether the DataWorks workspace is in the China (Shanghai) region.
    • If you cannot connect to OSS by using the internal endpoint, use the public endpoint instead, for example, http://oss-cn-shanghai.aliyuncs.com for the China (Shanghai) region.

Create a connection to an ApsaraDB RDS instance from which you want to read data

  1. On a service page of DataWorks, click the More icon in the upper-left corner and choose All Products > Data Aggregation > Data Integration.
  2. On the page that appears, click Connection in the left-side navigation pane. The Data Source page appears.
  3. On the Data Source page, click New data source in the upper-right corner.
  4. In the Add data source dialog box, click MySQL in the Relational Database section.
  5. In the Add MySQL data source dialog box, set the parameters as required.
    • Data source type: The type of the connection. Select Alibaba Cloud instance mode.
    • Data Source Name: The name of the connection. Enter rds_workshop_log.
    • Data source description: The description of the connection. Enter RDS user information synchronization.
    • Applicable environment: The environment in which the connection is used. Select Development.
      Note: After you create a connection in the development environment, you must also create the same connection in the production environment by setting this parameter to Production. Otherwise, an error is returned when a node that uses the connection is run in the production environment.
    • Region: The region where the ApsaraDB RDS instance resides.
    • RDS instance ID: The ID of the ApsaraDB RDS instance. Enter rm-2ev0681lc7042g16u.
    • RDS instance account ID: The ID of the Alibaba Cloud account that is used to purchase the ApsaraDB RDS instance. Enter 5600815724958382.
    • Database name: The name of the ApsaraDB RDS database. Enter workshop.
    • User name: The username that is used to connect to the database. Enter workshop.
    • Password: The password that is used to connect to the database. Enter workshop#2017.
  6. On the Data Integration tab, click Test connectivity in the Operation column of each resource group.
    A sync node uses only one resource group. To ensure that your sync nodes can be properly run, you must test the connectivity of all the resource groups for Data Integration on which your sync nodes will be run. If you need to test the connectivity of multiple resource groups for Data Integration at a time, select the resource groups and click Batch test connectivity. For more information, see Select a network connectivity solution.
  7. After the connection passes the connectivity test, click Complete.

Create a workflow

  1. On the Data Source page, click the More icon in the upper-left corner and choose All Products > Data Development > DataStudio. The DataStudio page appears.
  2. On the Data Analytics tab, right-click Business Flow and select Create Workflow.
  3. In the Create Workflow dialog box, set the Workflow Name and Description parameters.
    Notice The workflow name can be up to 128 characters in length and can contain letters, digits, underscores (_), and periods (.).
  4. Click Create.
  5. Double-click the new workflow to go to the workflow configuration tab. Drag Zero-Load Node under General to the canvas on the right.
  6. In the Create Node dialog box, set the Node Name parameter to workshop_start and click Commit.
    Then, drag Batch Synchronization under Data Integration to the canvas twice to create two batch sync nodes named oss_synchronization and rds_synchronization.
  7. Draw directed lines from the workshop_start node to the two batch sync nodes so that the workshop_start node becomes their parent node.

Configure the workshop_start node

  1. On the Data Analytics tab, double-click the workshop_start node in the new workflow. On the node configuration tab that appears, click Properties in the right-side navigation pane.
  2. In the Dependencies section, click Use Root Node to set the root node of the workspace as the parent node of the workshop_start node.
    In the latest version of DataWorks, each node must have a parent node. Therefore, you must set a parent node for the workshop_start node. In this workshop, the root node of the workspace is set as the parent node of the workshop_start node. The root node of the workspace is named in the Workspace name_root format.
  3. Click the Save icon in the top toolbar.

Create tables to which you want to write data

  1. On the Data Analytics tab, click the new workflow, right-click MaxCompute, and then choose Create > Table.
  2. In the Create Table dialog box, set the Table Name parameter and click Commit.
    In this workshop, you must create two tables named ods_raw_log_d and ods_user_info_d. The former is used to store logs that are synchronized from OSS and the latter is used to store user information that is synchronized from ApsaraDB RDS.
    Notice The table name can be up to 64 characters in length. It must start with a letter and cannot contain special characters.
  3. Configure the tables by executing DDL statements.
    • Configure the ods_raw_log_d table.
      On the configuration tab of the ods_raw_log_d table, click DDL Statement. In the DDL Statement dialog box, enter the following statement to configure the table:
      -- Configure the ods_raw_log_d table.
      CREATE TABLE IF NOT EXISTS ods_raw_log_d (
          col STRING
      )
      PARTITIONED BY (
          dt STRING
      );
    • Configure the ods_user_info_d table.
      On the configuration tab of the ods_user_info_d table, click DDL Statement. In the DDL Statement dialog box, enter the following statement to configure the table:
      -- Configure the ods_user_info_d table.
      CREATE TABLE IF NOT EXISTS ods_user_info_d (
          uid STRING COMMENT 'User ID',
          gender STRING COMMENT 'Gender',
          age_range STRING COMMENT 'Age range',
          zodiac STRING COMMENT 'Zodiac sign'
      )
      PARTITIONED BY (
          dt STRING
      );
  4. Click Generate Table Schema. In the Confirm message, click OK.
  5. On the configuration tab of each table, enter the display name in the General section.
  6. Click Commit in Development Environment and Commit to Production Environment in sequence.
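
After both tables are committed, you can optionally confirm that they were created as expected. The following statements are a minimal check that you can run in an ad-hoc query (see Verify data synchronization to MaxCompute below for how to create one). They assume only the two tables created above:

    -- List the tables in the MaxCompute project and inspect the new schemas.
    SHOW TABLES;
    DESC ods_raw_log_d;
    DESC ods_user_info_d;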

Configure batch sync nodes

Note In a workspace in standard mode, we recommend that you do not run batch sync nodes in the development environment. That is, do not run the nodes directly on their configuration tabs. Instead, deploy the nodes to the production environment and run them in test run mode to obtain complete operational logs.

After the nodes are deployed in the production environment, you can apply for the permissions to read data from and write data to the tables in the development environment.

  1. Configure the oss_synchronization node.
    1. On the Data Analytics tab, double-click the oss_synchronization node in the new workflow. The node configuration tab appears.
    2. Configure a connection to the source data store.
      Set the following parameters in the Source section:
      • Connection: The type and name of the connection. Select OSS and oss_workshop_log in sequence.
      • Object Name Prefix: The prefix of the OSS object that stores the data to be synchronized. Do not include the name of the OSS bucket. In this workshop, enter user_log.txt.
      • File Type: The object type. Select text.
      • Field Delimiter: The column delimiter. Enter |.
      • Encoding: The encoding format. Default value: UTF-8.
      • Null String: The string that represents null.
      • Compression Format: The compression format of the OSS object. Valid values: None, Gzip, Bzip2, and Zip. Select None.
      • Include Header: Specifies whether to include the table header. Default value: No.
    3. Configure a connection to the destination data store.
      Set the following parameters for the destination data store:
      • Connection: The type and name of the connection. Select ODPS and odps_first in sequence.
      • Table: The table for storing the synchronized data. Select the ods_raw_log_d table.
      • Partition Key Column: The partition information. Default value: ${bizdate}. This scheduling parameter resolves to the data timestamp of the node in the yyyymmdd format.
      • Writing Rule: The method that is used to process existing data before new data is written to MaxCompute. Default value: Write with Original Data Deleted (Insert Overwrite). A conceptual SQL sketch of this rule follows these steps.
      • Convert Empty Strings to Null: Specifies whether to convert empty strings to null. Select No.
      Note
      • DataWorks automatically generates the odps_first connection for a workspace after you bind a MaxCompute compute engine to the workspace for the first time.
      • The odps_first connection writes synchronized data to the MaxCompute project of the current workspace.
    4. Configure the mappings between fields in the source and destination in the Mappings section.
    5. Configure channel control policies.
      Set the following parameters in the Channel section:
      • Expected Maximum Concurrency: The maximum number of concurrent threads that the batch sync node uses to read data from the source data store and write data to the destination data store. You can configure the concurrency for the sync node on the codeless user interface (UI).
      • Bandwidth Throttling: Specifies whether to enable bandwidth throttling. You can enable bandwidth throttling and set a maximum transmission rate to avoid heavy read workloads on the source. We recommend that you enable bandwidth throttling and set the maximum transmission rate to a proper value.
      • Dirty Data Records Allowed: The maximum number of dirty data records allowed.
    6. On the node configuration tab, click Properties in the right-side navigation pane. In the Dependencies section, enter an output name in the Workspace name.ods_raw_log_d format next to Outputs and click the Add icon.
      Notice We recommend that you do not include Chinese characters in the output names of the current node. Chinese characters reduce the accuracy of automatic recommendation.
    7. Click the Save icon in the top toolbar.
    8. Close the node configuration tab.
  2. Configure the rds_synchronization node.
    1. On the Data Analytics tab, double-click the rds_synchronization node in the new workflow. The node configuration tab appears.
    2. Configure a connection to the source data store.
      Set the following parameters in the Source section:
      • Connection: The type and name of the connection. Select MySQL and rds_workshop_log in sequence.
      • Table: The table from which data is synchronized. Select the ods_user_info_d table in the ApsaraDB RDS instance.
      • Filter: The filter condition for the data to be synchronized. A filter is typically used to select only the incremental data to be synchronized. You can leave this parameter empty.
      • Shard Key: The shard key for the data to be synchronized. Default value: uid.
    3. Configure a connection to the destination data store.
      Set the following parameters for the destination data store:
      • Connection: The type and name of the connection. Select ODPS and odps_first in sequence.
      • Table: The table for storing the synchronized data. Select the ods_user_info_d table in MaxCompute.
      • Partition Key Column: The partition information. Default value: ${bizdate}.
      • Writing Rule: The method that is used to process existing data before new data is written to MaxCompute. Default value: Write with Original Data Deleted (Insert Overwrite).
      • Convert Empty Strings to Null: Specifies whether to convert empty strings to null. Select No.
    4. Configure the mappings between fields in the source and destination.
    5. Configure channel control policies.
    6. On the node configuration tab, click Properties in the right-side navigation pane. In the Dependencies section, enter an output name in the Workspace name.ods_user_info_d format next to Outputs and click the Add icon.
      If an output name does not comply with the naming rules, click Delete to delete the output name.
      Notice We recommend that you do not include Chinese characters in the output names of the current node. Chinese characters reduce the accuracy of automatic recommendation.
    7. Click the Save icon in the top toolbar.
    8. Close the node configuration tab.
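
The Write with Original Data Deleted (Insert Overwrite) writing rule replaces the destination partition on each run instead of appending to it, so rerunning a sync node for the same data timestamp does not duplicate data. The following MaxCompute SQL is only a conceptual sketch of this behavior, not what the sync node actually executes; staged_source_data is a hypothetical stand-in for the data that the node reads from OSS or ApsaraDB RDS:

    -- Conceptual equivalent of the Insert Overwrite writing rule: the destination
    -- partition for the data timestamp is fully replaced on each run.
    INSERT OVERWRITE TABLE ods_raw_log_d PARTITION (dt = '20180716')
    SELECT col
    FROM staged_source_data; -- Hypothetical staging table; the sync node reads from the source directly.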

Commit the workflow

  1. On the Data Analytics tab, double-click the new workflow. On the workflow configuration tab that appears, click the Submit icon in the top toolbar.
  2. In the Commit dialog box, select the nodes to be committed, enter your comments in the Change description field, and then select Ignore I/O Inconsistency Alerts.
  3. Click Commit. The Committed successfully message appears.

Run the workflow

  1. On the Data Analytics tab, double-click the new workflow. On the workflow configuration tab that appears, click the Run icon in the top toolbar.
  2. Right-click the rds_synchronization node and select View Log.
    If the logs indicate that the node has succeeded, the rds_synchronization node has been run and the data has been synchronized.
  3. Right-click the oss_synchronization node and select View Log. Check the logs in the same way to verify that the oss_synchronization node has been run and data has been synchronized.

Verify data synchronization to MaxCompute

  1. On the left-side navigation submenu, click the Ad-Hoc Query icon. The Ad-Hoc Query tab appears.
  2. On the Ad-Hoc Query tab, right-click Ad-Hoc Query and choose Create Node > ODPS SQL.
  3. In the Create Node dialog box, enter the node name and click Commit. On the node configuration tab that appears, write and execute SQL statements to view the number of data records that are synchronized to the ods_raw_log_d and ods_user_info_d tables.
    Note Use the following SQL statements. In each statement, change the partition key value to the data timestamp of the node. For example, if the node is run on July 17, 2018, the data timestamp is 20180716, which is one day before the node is run.
    -- Check whether the data is written to MaxCompute.
    select count(*) from ods_raw_log_d where dt = '20180716'; -- Replace 20180716 with the data timestamp of your node.
    select count(*) from ods_user_info_d where dt = '20180716';
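    If a count is unexpectedly 0, you can list the partitions that the sync nodes actually wrote to narrow down the issue. This optional check assumes only the two tables created earlier:
    -- List the partitions that the sync nodes have written.
    SHOW PARTITIONS ods_raw_log_d;
    SHOW PARTITIONS ods_user_info_d;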

What to do next

You have learned how to collect and synchronize data. You can now proceed with the next tutorial. The next tutorial describes how to compute and analyze collected data. For more information, see Process data.