This topic describes how to use DataWorks to collect logs to MaxCompute.

Background information

In this workshop, you must add an Object Storage Service (OSS) bucket and an ApsaraDB RDS instance as data sources from which you want to read data. You must also create tables to which you want to write data.

Note
  • You can use the data sources that are prepared for you in this workshop. You can also use your own data sources.
  • The prepared data sources reside in the China (Shanghai) region. We recommend that you use a workspace in the China (Shanghai) region to make sure that the prepared data sources are accessible when you add these data sources.

Add an OSS data source

  1. Go to the Data Source page.
    1. Log on to the DataWorks console.
    2. In the left-side navigation pane, click Workspaces.
    3. On the Workspaces page, find the workspace to which you want to add a data source and click Data Integration in the Actions column.
      If you are using another DataWorks service, click the icon in the upper-left corner and choose All Products > Data Aggregation > Data Integration to go to the Data Integration page.
    4. In the left-side navigation pane, choose Data Source > Data Sources. The Data Source page in Workspace Management appears.
  2. On the Data Source page, click Add data source in the upper-right corner.
  3. In the Add data source dialog box, click OSS.
  4. In the Add OSS data source dialog box, set the parameters based on your business requirements.
    • Data Source Name: The name of the data source. Enter oss_workshop_log.
    • Data source description: The description of the data source.
    • Environment: The environment in which the data source is used. Select Development.
      Note: After you add the data source in the development environment, add the same data source in the production environment by selecting Production. Otherwise, an error is reported when a node that uses this data source is run in the production environment.
    • Endpoint: The OSS endpoint. Enter http://oss-cn-shanghai-internal.aliyuncs.com.
    • Bucket: The name of the OSS bucket. Enter new-dataworks-workshop.
    • AccessKey ID: The AccessKey ID that is used to connect to OSS. Enter LTAI4FvGT3iU4xjKotpUMAjS.
    • AccessKey Secret: The AccessKey secret that is used to connect to OSS. Enter 9RSUoRmNxpRC9EhC4m9PjuG7Jzy7px.
  5. Find the desired resource group in the resource group list in the lower part of the dialog box and click Test connectivity in the Actions column.
    A synchronization node can use only one type of resource group. To ensure that your synchronization nodes can run as expected, you must test the connectivity of all the resource groups for Data Integration on which your synchronization nodes will be run. If you want to test the connectivity of multiple resource groups for Data Integration at a time, select the resource groups and click Batch test connectivity. For more information, see Select a network connectivity solution.
    Note
    • By default, the resource group list displays only exclusive resource groups for Data Integration. To ensure the stability and performance of data synchronization, we recommend that you use exclusive resource groups for Data Integration.
    • If you want to test the network connectivity between the shared resource group or a custom resource group and the data source, click Advanced below the resource group list. In the Warning message, click Confirm. Then, all available shared and custom resource groups appear in the resource group list.
  6. After the connection passes the connectivity test, click Complete.
    Note
    • If the connectivity test fails, check whether the AccessKey ID and AccessKey secret that you entered are correct and whether the DataWorks workspace is in the China (Shanghai) region.
    • If OSS cannot be accessed by using the internal endpoint, use the public endpoint.

Add an ApsaraDB RDS data source

  1. On a DataWorks service page, click the icon in the upper-left corner and choose All Products > Data Aggregation > Data Integration.
  2. On the page that appears, choose Data Source > Data Sources. The Data Source page in Workspace Management appears.
  3. On the Data Source page, click Add data source in the upper-right corner.
  4. In the Add data source dialog box, click MySQL.
  5. In the Add MySQL data source dialog box, set the parameters based on your business requirements.
    • Data source type: The mode in which the data source is added. Select Alibaba Cloud instance mode.
    • Data Source Name: The name of the data source. Enter rds_workshop_log.
    • Data source description: The description of the data source. Enter RDS user information synchronization.
    • Environment: The environment in which the data source is used. Select Development.
      Note: After you add the data source in the development environment, add the same data source in the production environment by selecting Production. Otherwise, an error is reported when a node that uses this data source is run in the production environment.
    • Region: The region where the ApsaraDB RDS instance resides.
    • RDS instance ID: The ID of the ApsaraDB RDS instance. Enter rm-bp1z69dodhh85z9qa.
    • RDS instance account ID: The ID of the Alibaba Cloud account that is used to purchase the ApsaraDB RDS instance. Enter 1156529087455811.
    • Default Database Name: The name of the ApsaraDB RDS database. Enter workshop.
    • User name: The username that is used to connect to the database. Enter workshop.
    • Password: The password that is used to connect to the database. Enter workshop#2017.
  6. Find the desired resource group in the resource group list in the lower part of the dialog box and click Test connectivity in the Actions column.
    A synchronization node can use only one type of resource group. To ensure that your synchronization nodes can run as expected, you must test the connectivity of all the resource groups for Data Integration on which your synchronization nodes will be run. If you want to test the connectivity of multiple resource groups for Data Integration at a time, select the resource groups and click Batch test connectivity. For more information, see Select a network connectivity solution.
    Note
    • By default, the resource group list displays only exclusive resource groups for Data Integration. To ensure the stability and performance of data synchronization, we recommend that you use exclusive resource groups for Data Integration.
    • If you want to test the network connectivity between the shared resource group or a custom resource group and the data source, click Advanced below the resource group list. In the Warning message, click Confirm. Then, all available shared and custom resource groups appear in the resource group list.
  7. After the data source passes the connectivity test, click Complete.

Create a workflow

  1. Click the icon in the upper-left corner of the Data Integration page and choose All Products > Data Development > DataStudio.
  2. In the Scheduled Workflow pane, right-click Business Flow and select Create Workflow.
  3. In the Create Workflow dialog box, set the Workflow Name and Description parameters.
    Notice The workflow name can be a maximum of 128 characters in length and can contain letters, digits, underscores (_), and periods (.).
  4. Click Create.
  5. Double-click the new workflow to go to the workflow configuration tab. Drag Zero-Load Node in the General section to the canvas on the right.
  6. In the Create Node dialog box, set the Node Name parameter to workshop_start and click Commit.

    Drag Batch Synchronization in the Data Integration section to the canvas on the right to create two batch sync nodes named oss_synchronization and rds_synchronization.

  7. Drag directed lines to configure the workshop_start node as the ancestor node of the two batch sync nodes.

Configure the workshop_start node

  1. In the Scheduled Workflow pane, double-click the workshop_start node in the workflow. On the node configuration tab, click the Properties tab in the right-side navigation pane.
  2. In the Dependencies section, click Add Root Node to configure the root node of the workspace as the ancestor node of the workshop_start node.
    In the latest version of DataWorks, each node must have its ancestor and descendant nodes. Therefore, you must configure an ancestor node for the workshop_start node. In this example, the root node of the workspace is configured as the ancestor node of the workshop_start node. The root node of the workspace is named in the Workspace name_root format.
  3. Click the Save icon in the top toolbar.

Create tables to which you want to write data

  1. In the Scheduled Workflow pane, click the new workflow, right-click MaxCompute, and then choose Create > Table.
  2. In the Create Table dialog box, set the Table Name parameter and click Create.
    In this workshop, you must create two tables named ods_raw_log_d and ods_user_info_d. The ods_raw_log_d table is used to store logs that are synchronized from OSS and the ods_user_info_d table is used to store user information that is synchronized from ApsaraDB RDS.
    Notice The table name can be a maximum of 64 characters in length. It must start with a letter and cannot contain special characters.
  3. Create the tables by executing DDL statements.
    • Create the ods_raw_log_d table.
      On the creation tab of the ods_raw_log_d table, click DDL Statement. In the DDL Statement dialog box, enter the following statement to create the table.
      -- Create the ods_raw_log_d table.
      CREATE TABLE IF NOT EXISTS ods_raw_log_d (
          col STRING
      )
      PARTITIONED BY (
          dt STRING
      );
    • Create the ods_user_info_d table.
      On the creation tab of the ods_user_info_d table, click DDL Statement. In the DDL Statement dialog box, enter the following statement to create the table.
      -- Create the ods_user_info_d table.
      CREATE TABLE IF NOT EXISTS ods_user_info_d (
          uid STRING COMMENT 'User ID',
          gender STRING COMMENT 'Gender',
          age_range STRING COMMENT 'Age range',
          zodiac STRING COMMENT 'Zodiac sign'
      )
      PARTITIONED BY (
          dt STRING
      );
  4. Click Generate Table Schema. In the Confirm message, click OK.
  5. On the creation tab for each table, enter the display name in the General section.
  6. After the creation is complete, click Commit to Development Environment and Commit to Production Environment in sequence.
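
If you want to double-check the table definitions after they are committed, you can run the following optional ODPS SQL statements, for example from an ad-hoc query node. This check is not part of the original procedure; it only lists the columns and the partition key of each table.

  -- Optional check: view the schema of the two new tables.
  DESC ods_raw_log_d;
  DESC ods_user_info_d;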

Configure the batch sync nodes

Note: In a workspace in standard mode, we recommend that you do not run batch sync nodes in the development environment, that is, do not run nodes directly on their configuration tabs. Instead, deploy the nodes to the production environment and then run them in test mode to obtain complete operational logs.

After the nodes are deployed in the production environment, you can apply for the permissions to read data from and write data to the tables that are in the development environment.

  1. Configure the oss_synchronization node.
    1. In the Scheduled Workflow pane, double-click the oss_synchronization node in the new workflow. The node configuration tab appears.
    2. Configure a source.
      • Connection: The type and name of the source. Select OSS and oss_workshop_log in sequence.
      • Object Name Prefix: The prefix of the OSS object that stores the data to be synchronized. Do not include the name of the OSS bucket. In this workshop, enter user_log.txt.
      • File Type: The object type. Select text.
      • Field Delimiter: The column delimiter. Enter |.
      • Encoding: The encoding format. Default value: UTF-8.
      • Null String: The string that represents a null value.
      • Compression Format: The compression format of the OSS object. Valid values: None, Gzip, Bzip2, and Zip. Select None.
      • Skip Header: Specifies whether to skip the table header. Default value: No.
    3. Configure a destination.
      • Connection: The type and name of the destination. Select ODPS and odps_first in sequence.
      • Table: The table for storing the synchronized data. Select the ods_raw_log_d table.
      • Partition Key Column: The partition information. Default value: ${bizdate}.
      • Writing Rule: The method that is used to process existing data before new data is written to MaxCompute. Default value: Write with Original Data Deleted (Insert Overwrite). A conceptual SQL sketch of this write rule follows this procedure.
      • Convert Empty Strings to Null: Specifies whether to convert empty strings to null. Select No.
      Note
      • The default odps_first data source is automatically generated by DataWorks after you associate a MaxCompute compute engine instance with the workspace for the first time.
      • The odps_first data source is used to write synchronized data to the MaxCompute project of the current workspace.
    4. Configure the mappings between fields in the source and destination.
    5. Set parameters in the Channel section.
      • Expected Maximum Concurrency: The maximum number of parallel threads that the sync node uses to read data from the source or write data to the destination. You can configure the parallelism for the sync node on the codeless UI.
      • Bandwidth Throttling: Specifies whether to enable bandwidth throttling. You can enable bandwidth throttling and specify a maximum transmission rate to prevent heavy read workloads on the source. We recommend that you enable bandwidth throttling and set the maximum transmission rate to an appropriate value based on the configurations of the source.
      • Dirty Data Records Allowed: The maximum number of dirty data records allowed.
    6. On the node configuration tab, click the Properties tab in the right-side navigation pane. In the Dependencies section, enter an output name in the Workspace name.ods_raw_log_d format below Outputs and click Create.
      Notice We recommend that you do not include Chinese characters in the output names of the current node. Chinese characters reduce the accuracy of automatic recommendation.
    7. Click the Save icon in the top toolbar.
    8. Close the node configuration tab.
  2. Configure the rds_synchronization node.
    1. In the Scheduled Workflow pane, double-click the rds_synchronization node in the new workflow. The node configuration tab appears.
    2. Configure a source.
      • Connection: The type and name of the source. Select MySQL and rds_workshop_log in sequence.
      • Table: The table from which data is to be synchronized. Select the ods_user_info_d table in the ApsaraDB RDS database.
      • Filter: The condition that is used to filter the data that you want to synchronize. The filter condition is often used to filter incremental data. You can leave this parameter unspecified.
      • Shard Key: The shard key for the data to be synchronized. Default value: uid.
    3. Configure a destination.
      • Connection: The type and name of the destination. Select ODPS and odps_first in sequence.
      • Table: The table for storing the synchronized data. Select the ods_user_info_d table in MaxCompute.
      • Partition Key Column: The partition information. Default value: ${bizdate}.
      • Writing Rule: The method that is used to process existing data before new data is written to MaxCompute. Default value: Write with Original Data Deleted (Insert Overwrite).
      • Convert Empty Strings to Null: Specifies whether to convert empty strings to null. Select No.
    4. Configure the mapping between fields in the source and destination.
    5. Set parameters in the Channel section.
    6. On the node configuration tab, click the Properties tab in the right-side navigation pane. In the Dependencies section, enter an output name in the Workspace name.ods_user_info_d format below Outputs and click Create.
      If an output name does not comply with the naming rules, click Delete to delete the output name.
      Notice We recommend that you do not include Chinese characters in the output names of the current node. Chinese characters reduce the accuracy of automatic recommendation.
    7. Click the Save icon in the top toolbar.
    8. Close the node configuration tab.
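
Both batch sync nodes write to a daily partition (dt=${bizdate}) of their destination table and use the Write with Original Data Deleted (Insert Overwrite) rule. The following ODPS SQL is only a rough, hypothetical illustration of what that rule means: it rewrites one partition of ods_user_info_d from another partition. The partition values are example dates, and the actual synchronization is performed by Data Integration, not by a statement like this.

  -- Hypothetical illustration only: INSERT OVERWRITE replaces all data that the
  -- target partition previously held before the new rows are written.
  INSERT OVERWRITE TABLE ods_user_info_d PARTITION (dt='20180716')
  SELECT uid, gender, age_range, zodiac
  FROM ods_user_info_d
  WHERE dt='20180715';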

Commit the workflow

  1. In the Scheduled Workflow pane, double-click the new workflow. On the workflow configuration tab, click the Submit icon in the top toolbar.
  2. In the Commit dialog box, select the nodes to be committed, enter your comments in the Change description field, and then select Ignore I/O Inconsistency Alerts.
  3. Click Commit. The Committed successfully message appears.

Run the workflow

  1. In the Scheduled Workflow pane, double-click the new workflow. On the workflow configuration tab that appears, click the Run icon in the top toolbar.
  2. Right-click the rds_synchronization node and select View Log.
    If the logs indicate that the rds_synchronization node has run successfully, the data is synchronized.
  3. Right-click the oss_synchronization node and select View Log. View the logs to check whether the oss_synchronization node is run and data is synchronized.

Verify data synchronization to MaxCompute

  1. In the left-side navigation pane, click Ad-Hoc Query.
  2. In the Ad-Hoc Query pane, right-click Ad-Hoc Query and choose Create Node > ODPS SQL.
  3. In the Create Node dialog box, enter the node name and click Commit. On the node configuration tab that appears, write and execute SQL statements to view the number of data records that are synchronized to the ods_raw_log_d and ods_user_info_d tables.
    Note Execute the following SQL statements. In each statement, change the partition key value to the data timestamp of the node. For example, if the node is run on July 17, 2018, the data timestamp is 20180716, which is one day before the node is run.
    -- Check whether the data is written to MaxCompute.
    select count(*) from ods_raw_log_d where dt='Data timestamp of the node';
    select count(*) from ods_user_info_d where dt='Data timestamp of the node';
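    For example, with the data timestamp from the note above (the node runs on July 17, 2018, so the data timestamp is 20180716), the checks look like the following. The date is only an example; use the data timestamp of your own run.
    -- Example with a data timestamp of 20180716 (node run on July 17, 2018).
    select count(*) from ods_raw_log_d where dt='20180716';
    select count(*) from ods_user_info_d where dt='20180716';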

Subsequent steps

You now understand how to collect and synchronize data. You can proceed to the next tutorial, which describes how to compute and analyze the collected data. For more information, see Process data.