This topic describes how to use DataWorks to collect logs to MaxCompute.

Background information

You must create an Object Storage Service (OSS) connection and a Relational Database Service (RDS) connection in this workshop.
Note The data resources provided for this workshop are all stored in the China (Shanghai) region. Therefore, we recommend that you use a workspace in the China (Shanghai) region. Otherwise, the network connectivity test fails when you add a connection.

Create an OSS connection

  1. Log on to the DataWorks console. In the left-side navigation pane, click Workspaces. On the Workspaces page, find the target workspace and click Data Integration in the Actions column.
    If you have logged on to a module of DataWorks, click the DataWorks icon in the upper-left corner and choose All Products > Data Integration to go to the Data Integration page.
  2. In the left-side navigation pane, click Connection. The Data Source page appears.
  3. On the Data Source page, click Add Connection in the upper-right corner.
  4. In the Add Connection dialog box that appears, select OSS.
  5. In the Add OSS Connection dialog box, set parameters for the OSS connection.
    Parameter descriptions:
    • Connection Name: The name of the connection. Set the value to oss_workshop_log.
    • Description: The description of the connection.
    • Applicable Environment: The environment in which the connection is used. Set the value to Development.
      Note After creating a connection in the development environment, create the same connection in the production environment by setting this parameter to Production. Otherwise, an error is returned when a node that uses this connection is run in the production environment.
    • Endpoint: The OSS endpoint. Set the value to http://oss-cn-shanghai-internal.aliyuncs.com.
    • Bucket: The name of the OSS bucket. Set the value to new-dataworks-workshop.
    • AccessKey ID: The AccessKey ID of the account used to access the OSS bucket. Set the value to LTAI4FvGT3iU4xjKotpUMAjS.
    • AccessKey Secret: The AccessKey secret of the account used to access the OSS bucket. Set the value to 9RSUoRmNxpRC9EhC4m9PjuG7Jzy7px.
  6. Click Test Connection.
  7. After the connectivity test is passed, click Complete.
    Note
    • If the connectivity test fails, check whether the AccessKey ID and AccessKey secret that you enter are correct and whether the DataWorks workspace is in the China (Shanghai) region.
    • If you cannot connect to OSS by using the internal endpoint, use the public endpoint instead.

Create an RDS connection

  1. Click the DataWorks icon in the upper-left corner and choose All Products > Data Integration.
  2. In the left-side navigation pane, click Connection. The Data Source page appears.
  3. On the Data Source page, click Add Connection in the upper-right corner.
  4. In the Add Connection dialog box that appears, select MySQL.
  5. In the Add MySQL Connection dialog box, set parameters for the RDS connection.
    Parameter descriptions:
    • Connect To: The type of the connection. Set the value to ApsaraDB for RDS.
    • Connection Name: The name of the connection. Set the value to rds_workshop_log.
    • Description: The description of the connection. Set the value to RDS user information synchronization.
    • Applicable Environment: The environment in which the connection is used. Set the value to Development.
      Note After creating a connection in the development environment, create the same connection in the production environment by setting this parameter to Production. Otherwise, an error is returned when a node that uses this connection is run in the production environment.
    • Region: The region where the RDS instance resides.
    • RDS Instance ID: The ID of the RDS instance. Set the value to rm-2ev0681lc7042g16u.
    • RDS Instance Account ID: The ID of the Alibaba Cloud account used to purchase the RDS instance. Set the value to 5600815724958382.
    • Database Name: The name of the RDS database. Set the value to workshop.
    • Username: The username used to connect to the database. Set the value to workshop.
    • Password: The password used to connect to the database. Set the value to workshop#2017.
  6. Click Test Connection.
  7. After the connectivity test is passed, click Complete.
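
If you want to double-check the source data beyond the connectivity test, you can run a quick query against the workshop database. This is a minimal sketch, assuming you have a MySQL client that can reach the RDS instance; the source table name, ods_user_info_d, is the one used in the synchronization configuration later in this topic.

    -- Confirm that the source table is reachable and contains data.
    -- Run this against the workshop database with the workshop account.
    SELECT COUNT(*) FROM ods_user_info_d;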

Create a workflow

  1. Click the DataWorks icon in the upper-left corner and choose All Products > DataStudio.
  2. In the Data Analytics section, right-click Business Flow and select Create Workflow.
  3. In the Create Workflow dialog box that appears, set Workflow Name and Description.
  4. Click Create.
  5. On the workflow editing tab that appears, click and hold Zero-Load Node on the left and drag it to the editing section on the right. In the Create Node dialog box that appears, set Node Name to workshop_start and click Commit.
    Create two batch synchronization nodes in the same way and name them oss_data synchronization and rds_data synchronization respectively.
  6. By drawing lines between nodes, configure the workshop_start node as the parent node of the two batch synchronization nodes.

Configure the workshop_start node

  1. In the Data Analytics section, double-click the workshop_start node in the corresponding workflow. On the node editing tab that appears, click the Properties tab on the right.
  2. In the Properties dialog box, click Use Root Node in the Dependencies section and set the root node of the workspace as the ancestor node of the workshop_start node.
    In the latest version of DataWorks, each node must have ancestor and descendant nodes. Therefore, you must set an ancestor node for the workshop_start node. In this example, the root node of the workspace is set as the ancestor node of the workshop_start node. The root node of the workspace is named in the Workspace name_root format.
  3. After the configuration is complete, click the Save icon in the upper-left corner.

Create tables

  1. In the Data Analytics section, click the created workflow, right-click MaxCompute, and then choose Create > Table.
  2. In the Create Table dialog box that appears, enter the table name and click Commit.

    Create two tables in total and name them ods_raw_log_d and ods_user_info_d respectively. One table is used to store logs synchronized from OSS and the other is used to store user information synchronized from RDS.

  3. Run DDL statements to create tables.
    • Create the ods_raw_log_d table.

      On the editing tab of the ods_raw_log_d table, click DDL Statement and enter the table creation statements in the DDL Statement dialog box.

      -- Create a table to store logs synchronized from OSS.
      CREATE TABLE IF NOT EXISTS ods_raw_log_d (
          col STRING
      )
      PARTITIONED BY (
          dt STRING
      );
    • Create the ods_user_info_d table.

      On the editing tab of the ods_user_info_d table, click DDL Statement and enter the table creation statements in the DDL Statement dialog box.

      -- Create a table to store user information synchronized from RDS.
      CREATE TABLE IF NOT EXISTS ods_user_info_d (
          uid STRING COMMENT 'The ID of the client user',
          gender STRING COMMENT 'The gender of the user',
          age_range STRING COMMENT 'The age range of the user',
          zodiac STRING COMMENT 'The zodiac sign of the user'
      )
      PARTITIONED BY (
          dt STRING
      );
  4. Click Generate Table Schema. In the Confirm dialog box that appears, click OK.
  5. On the table editing tab, enter the display name of the table in the General section.
  6. Click Commit in Development Environment and Commit to Production Environment in sequence.
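
To confirm that both tables were created with the expected schemas, you can describe them in an ad-hoc query. This is an optional quick check; DESC is a standard MaxCompute SQL statement.

    -- Verify that the tables exist and have the expected schemas.
    DESC ods_raw_log_d;
    DESC ods_user_info_d;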

Configure the batch synchronization nodes

Note In a workspace in standard mode, we recommend that you do not run batch synchronization nodes in the development environment, that is, do not run the nodes directly on their configuration tabs. Instead, we recommend that you deploy the nodes to the production environment and then run them in test run mode to obtain complete operational logs.

After the nodes are deployed in the production environment, you can apply for the permissions to read data from and write data to the tables in the development environment.

  1. Configure the oss_data synchronization node.
    1. In the Data Analytics section, double-click the oss_data synchronization node. The node configuration tab appears.
    2. Select the source connection.
      Parameter descriptions:
      • Connection: The connection for accessing the source data. Select OSS and oss_workshop_log in sequence.
      • Object Name Prefix: The prefix of the OSS object for storing the data to be synchronized. Set the value to user_log.txt.
      • File Type: The object type. Set the value to text.
      • Field Delimiter: The column delimiter. Set the value to |.
      • Encoding: The encoding format. Default value: UTF-8.
      • Null String: The string that represents null.
      • Compression Format: The compression format of the OSS object. Valid values: None, Gzip, Bzip2, and Zip. In this example, set the value to None.
      • Include Header: Specifies whether to include headers. Default value: No.
    3. Select the destination connection.
      Parameter descriptions:
      • Connection: The connection for storing source data in MaxCompute. Select ODPS and odps_first in sequence.
      • Table: The table for storing source data. Set the value to ods_raw_log_d.
      • Partition Key Column: The partition information. Default value: ${bizdate}.
      • Writing Rule: The rule for writing data to MaxCompute. Default value: Write with Original Data Deleted (Insert Overwrite).
      • Convert Empty Strings to Null: Specifies whether to convert empty strings to null. Set the value to No.
      Note The odps_first connection is used to write source data to the MaxCompute project in the current workspace.
    4. Configure the mapping between fields in the source and destination tables.
    5. Set parameters in the Channel step.
      Parameter descriptions:
      • Expected Maximum Concurrency: The maximum number of concurrent threads that the batch synchronization node uses to read data from or write data to the data stores. You can configure the concurrency for a node on the codeless user interface (UI).
      • Bandwidth Throttling: Specifies whether to enable bandwidth throttling. You can enable bandwidth throttling and set a maximum transmission rate to avoid a heavy read workload on the source. We recommend that you enable bandwidth throttling and set the maximum transmission rate to a proper value.
      • Dirty Data Records Allowed: The maximum number of dirty data records allowed.
      • Resource Group: The resource group used to run the batch synchronization node. By default, the node is run on the default resource group. If many nodes are run on the default resource group, some nodes may be delayed due to insufficient resources. We recommend that you purchase exclusive resources for Data Integration or add a custom resource group to run the batch synchronization node. For more information, see DataWorks exclusive resources and Add a custom resource group.
    6. Verify that the preceding configurations are correct and click the Save icon in the upper-left corner of the tab.
    7. Close the tab to return to the Data Analytics section.
  2. Configure the rds_data synchronization node.
    1. In the Data Analytics section, double-click the rds_data synchronization node. The node configuration tab appears.
    2. Select the source connection.
      Parameter descriptions:
      • Connection: The connection for accessing the source data. Select MySQL and rds_workshop_log in sequence.
      • Table: The table from which data is synchronized. Set the value to ods_user_info_d.
      • Filter: The filter condition for the data to be synchronized. The filter condition is often used to select only incremental data for synchronization. You can leave this parameter empty in this workshop. For an illustration of what a filter value looks like, see the sketch after these steps.
      • Shard Key: The shard key in the source data. Default value: uid.
    3. Select the destination connection.
      Parameter descriptions:
      • Connection: The connection for storing source data in MaxCompute. Select ODPS and odps_first in sequence.
      • Table: The table for storing source data. Set the value to ods_user_info_d.
      • Partition Key Column: The partition information. Default value: ${bizdate}.
      • Writing Rule: The rule for writing data to MaxCompute. Default value: Write with Original Data Deleted (Insert Overwrite).
      • Convert Empty Strings to Null: Specifies whether to convert empty strings to null. Set the value to No.
    4. Configure the mapping between fields in the source and destination tables.
    5. Set parameters in the Channel step.
    6. Verify that the preceding configurations are correct and click the Save icon in the upper-left corner of the tab.
    7. Close the tab to return to the Data Analytics section.
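
The Filter parameter of the rds_data synchronization node takes a WHERE-clause condition without the WHERE keyword and appends it to the query that reads the source table. As an illustrative sketch only (this workshop leaves the parameter empty, and the gmt_create column is hypothetical), a filter such as gmt_create >= '${bizdate}' would make the generated query conceptually equivalent to the following:

    -- Conceptual query for incremental synchronization.
    -- gmt_create is a hypothetical column; ${bizdate} is replaced with the
    -- data timestamp of the node at runtime.
    SELECT uid, gender, age_range, zodiac
    FROM ods_user_info_d
    WHERE gmt_create >= '${bizdate}';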

Commit the workflow

  1. In the Data Analytics section, double-click the corresponding workflow. On the workflow editing tab that appears, click the Submit icon in the upper-left corner.
  2. In the Commit dialog box that appears, select the nodes to be committed, set Description, and then select Ignore I/O Inconsistency Alerts.
  3. Click Commit. The message "Committed successfully" appears.

Run the workflow

  1. In the Data Analytics section, double-click the corresponding workflow. On the workflow editing tab that appears, click the Run icon in the upper-left corner.
  2. Right-click the rds_data synchronization node and select View Log. View the log to check whether the node ran successfully and data was synchronized.
  3. Right-click the oss_data synchronization node and select View Log. View the log to check whether the node ran successfully and data was synchronized.

Check whether data is synchronized to MaxCompute

  1. Click Ad-Hoc Query on the navigation submenu. The Ad-Hoc Query section appears.
  2. Right-click Ad-Hoc Query and choose Create Node > ODPS SQL.
  3. In the Create Node dialog box that appears, enter the node name and click Commit. On the tab that appears, run SQL query statements to view the number of data records synchronized to the ods_raw_log_d and ods_user_info_d tables respectively.
    Note Use the following SQL query statements. In each query statement, change the partition key value to the data timestamp of the node. For example, if the node is run on July 17, 2018, the data timestamp is 20180716, which is one day before the date on which the node is run.
    -- Check whether data is synchronized to MaxCompute.
    -- Replace 20180716 with the data timestamp of your node.
    select count(*) from ods_raw_log_d where dt='20180716';
    select count(*) from ods_user_info_d where dt='20180716';
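
    If a count returns 0, you can also list the partitions that the synchronization nodes actually created. This helps you confirm whether the partition key value in your query matches the one that was written (SHOW PARTITIONS is a standard MaxCompute SQL statement):

    -- List the partitions that exist in each table.
    SHOW PARTITIONS ods_raw_log_d;
    SHOW PARTITIONS ods_user_info_d;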

What to do next

Now, you have learned how to synchronize logs and collect data. You can proceed with the next tutorial to learn how to compute and analyze collected data. For more information, see Process data.