This topic describes how to use DataWorks to collect logs to MaxCompute.

Background information

In this topic, you add an Object Storage Service (OSS) bucket and an ApsaraDB RDS instance as the data sources from which data is read. You also create the MaxCompute tables to which the data is written.

Note
  • You can use the data sources that are prepared for you in this topic. You can also use your own data sources.
  • The prepared data sources reside in the China (Shanghai) region. We recommend that you use a workspace in the China (Shanghai) region to make sure that the prepared data sources are accessible when you add these data sources.

Add an OSS data source

  1. Go to the Data Source page.
    1. Log on to the DataWorks console.
    2. In the left-side navigation pane, click Workspaces.
    3. On the Workspaces page, find the desired workspace, move the pointer over the More icon in the Actions column, and then select Workspace Settings.
      If you are using a service of DataWorks, click the Workspace Management icon in the upper-right corner of the service page. The Workspace Management page appears.
    4. In the left-side navigation pane, click Data Source.
  2. On the Data Source page, click Add data source in the upper-right corner.
  3. In the Add data source dialog box, click OSS.
  4. In the Add OSS data source dialog box, configure the parameters.
    The following table describes the parameters.
    • Data Source Name: The name of the data source. Enter oss_workshop_log.
    • Data Source Description: The description of the data source.
    • Environment: The environment in which the data source is used. Select Development.
      Note After you add a data source in the development environment, add the same data source in the production environment by selecting Production. Otherwise, an error is reported when a node that uses this data source is run in the production environment.
    • Endpoint: The OSS endpoint. Enter http://oss-cn-shanghai-internal.aliyuncs.com.
    • Bucket: The name of the OSS bucket. Enter new-dataworks-workshop.
    • AccessKey ID & AccessKey Secret: The AccessKey pair that is used to connect to OSS. Enter LTAI4FvGT3iU4xjKotpUMAjS for the AccessKey ID and 9RSUoRmNxpRC9EhC4m9PjuG7Jzy7px for the AccessKey secret.
      Note To help you better experience DataWorks services, an AccessKey pair that has read-only permissions is provided for trial use in this topic. The sample AccessKey pair can be used only to read sample data. You can replace it with the AccessKey pair that is used to connect to your own data source in the production environment.
  5. Test the connectivity between the data source and the resource group that you select.
    In this topic, the shared resource group for Data Integration (debugging) is used to run the node that uses this data source. Click Advanced in the lower-right corner of the dialog box to add the shared resource group for Data Integration (debugging) and test the connectivity between the data source and the resource group. We recommend that you use an exclusive resource group for Data Integration for data development because exclusive resource groups provide enhanced capabilities that help ensure reliable data synchronization. For more information about how to add and use an exclusive resource group for Data Integration, see Create and use an exclusive resource group for Data Integration.
    A synchronization node can use only one type of resource group. To ensure that your synchronization nodes can run as expected, you must test the connectivity of all the resource groups for Data Integration on which your synchronization nodes will be run. If you want to test the connectivity of multiple resource groups for Data Integration at a time, select the resource groups and click Batch test connectivity. For more information, see Establish a network connection between a resource group and a data source.
    Note
    • By default, the resource group list displays only exclusive resource groups for Data Integration. To ensure the stability and performance of data synchronization, we recommend that you use exclusive resource groups for Data Integration.
    • If you want to test the network connectivity between the shared resource group or a custom resource group and the data source, click Advanced below the resource group list. In the Warning message, click Confirm. Then, all available shared and custom resource groups appear in the resource group list.
  6. After the connection passes the connectivity test, click Complete.
    Note
    • If the connectivity test fails, check whether the AccessKey ID and AccessKey secret that you entered are correct and whether the DataWorks workspace is in the China (Shanghai) region.
    • If OSS cannot be accessed by using the internal endpoint, use the public endpoint.

Add an ApsaraDB RDS data source

  1. Go to the Data Source page.
    1. Log on to the DataWorks console.
    2. In the left-side navigation pane, click Workspaces.
    3. On the Workspaces page, find the desired workspace, move the pointer over the More icon in the Actions column, and then select Workspace Settings.
      If you are using a service of DataWorks, click the Workspace Management icon in the upper-right corner of the service page. The Workspace Management page appears.
    4. In the left-side navigation pane, click Data Source.
  2. On the Data Source page, click Add data source in the upper-right corner. In the Add data source dialog box, click MySQL.
  3. In the Add MySQL data source dialog box, configure the parameters.
    The following table describes the parameters.
    • Data Source Type: Set this parameter to Connection String Mode.
    • Data Source Name: The name of the data source. Enter rds_workshop_log.
    • Data Source Description: The description of the data source. Enter RDS user information synchronization.
    • Environment: The environment in which the data source is used. Select Development.
      Note After you add a data source in the development environment, add the same data source in the production environment by selecting Production. Otherwise, an error is reported when a node that uses this data source is run in the production environment.
    • JDBC URL: Set this parameter to jdbc:mysql://rm-bp1z69dodhh85z9qa.mysql.rds.aliyuncs.com:3306/workshop.
    • Username: The username that is used to connect to the ApsaraDB RDS database. Enter workshop.
    • Password: The password that is used to connect to the ApsaraDB RDS database. Enter workshop#2017.
    • Special Authentication Method: Set this parameter to None Auth.
    A query sketch that you can use to spot-check the source database with these settings is provided after this procedure.
  4. Test the connectivity between the data source and the resource group that you select.
    In this topic, Data Integration is selected for Resource Group connectivity. Click Advanced in the lower-right corner of the Add MySQL data source dialog box to add a shared resource group and test the connectivity between the data source and the shared resource group. Because the data source is added in connection string mode with a public IP address, the MySQL data source that DataWorks provides for this tutorial can be accessed by using the public IP address. You are charged for the Internet traffic that is generated when the data source is accessed by using the public IP address. For information about the billing of Internet traffic in DataWorks, see Billing of Internet traffic.
    Important If the data sources used for your data synchronization node are created in Alibaba Cloud instance mode, you cannot use the shared resource group for Data Integration to run the node to synchronize data across accounts or regions. We recommend that you use an exclusive resource group for Data Integration for data development because exclusive resource groups provide enhanced capabilities that help ensure reliable data synchronization. For information about how to add and use an exclusive resource group for Data Integration, see Create and use an exclusive resource group for Data Integration.

    A synchronization node can use only one type of resource group. To ensure that your synchronization node can be run as expected, you must test the connectivity between the data source and all resource groups for Data Integration on which your synchronization node will be run. If you want to test the connectivity for multiple resource groups for Data Integration at a time, select the resource groups and click Batch test connectivity. For more information, see Establish a network connection between a resource group and a data source.

  5. After the data source passes the connectivity test, click Complete.
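
If you want to confirm the connection details outside DataWorks, you can query the source database directly. The following is a minimal sketch that assumes you connect to the workshop database over the host and port in the JDBC URL, with the username workshop and the password workshop#2017, and that the source table is ods_user_info_d, as described later in this topic:

    -- Run against the ApsaraDB RDS for MySQL database named workshop.
    -- Count the user records that will be synchronized to MaxCompute.
    SELECT COUNT(*) FROM ods_user_info_d;
    -- Preview a few records.
    SELECT * FROM ods_user_info_d LIMIT 10;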

Create a workflow

  1. Click the icon in the upper-left corner of the Data Integration page and choose All Products > Data Development > DataStudio.
  2. In the Scheduled Workflow pane, right-click Business Flow and select Create Workflow.
  3. In the Create Workflow dialog box, configure the Workflow Name and Description parameters.
    Important The workflow name can be a maximum of 128 characters in length and can contain letters, digits, underscores (_), and periods (.).
  4. Click Create.
  5. Double-click the new workflow to go to the workflow configuration tab. Drag Zero-Load Node in the General section to the canvas on the right.
  6. In the Create Node dialog box, set the Name parameter to workshop_start and click Commit.

    Drag Batch Synchronization in the Data Integration section to the canvas on the right to create two batch synchronization nodes named oss_synchronization and rds_synchronization.

  7. Drag directed lines to configure the workshop_start node as the ancestor node of the two batch synchronization nodes.

Configure the workshop_start node

  1. In the Scheduled Workflow pane, double-click the workshop_start node in the workflow. On the node configuration tab, click the Properties tab in the right-side navigation pane.
  2. In the Dependencies section, click Add Root Node to configure the root node of the workspace as the ancestor node of the workshop_start node.
    In the latest version of DataWorks, each node must have its ancestor and descendant nodes. Therefore, you must configure an ancestor node for the workshop_start node. In this example, the root node of the workspace is configured as the ancestor node of the workshop_start node. The root node of the workspace is named in the Workspace name_root format.
  3. Click the Save icon in the top toolbar.

Create tables to which you want to write data

  1. In the Scheduled Workflow pane, click the new workflow, right-click MaxCompute, and then choose Create > Table.
  2. In the Create Table dialog box, configure the Name parameter and click Create.
    In this workshop, you must create two tables named ods_raw_log_d and ods_user_info_d. The ods_raw_log_d table is used to store logs that are synchronized from OSS and the ods_user_info_d table is used to store user information that is synchronized from ApsaraDB RDS.
    Important The table name can be a maximum of 64 characters in length. It must start with a letter and cannot contain special characters.
  3. Create the tables by executing DDL statements. A sketch that you can use to verify the created tables is provided after this procedure.
    • Create the ods_raw_log_d table.
      On the creation tab of the ods_raw_log_d table, click DDL Statement. In the DDL Statement dialog box, enter the following statement to create the table.
      -- Create the ods_raw_log_d table.
      CREATE TABLE IF NOT EXISTS ods_raw_log_d (
          col STRING
      )
      PARTITIONED BY (
          dt STRING
      );
    • Create the ods_user_info_d table.
      On the creation tab of the ods_user_info_d table, click DDL Statement. In the DDL Statement dialog box, enter the following statement to create the table.
      -- Create the ods_user_info_d table.
      CREATE TABLE IF NOT EXISTS ods_user_info_d (
          uid STRING COMMENT 'User ID',
          gender STRING COMMENT 'Gender',
          age_range STRING COMMENT 'Age range',
          zodiac STRING COMMENT 'Zodiac sign'
      )
      PARTITIONED BY (
          dt STRING
      );
  4. Click Generate Table Schema. In the Confirm message, click OK.
  5. On the creation tab for each table, enter the display name in the General section.
  6. After the creation is complete, click Commit to Development Environment and Commit to Production Environment in sequence.
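
To verify the created tables, you can run the following statements on an ODPS SQL node or as an ad-hoc query. This is a minimal sketch: DESC shows each table schema, and SHOW PARTITIONS lists the daily partitions that exist; before the synchronization nodes have run, no partitions are returned.

    -- View the schemas of the created tables.
    DESC ods_raw_log_d;
    DESC ods_user_info_d;
    -- After the synchronization nodes have run, list the daily partitions that were written.
    SHOW PARTITIONS ods_raw_log_d;
    SHOW PARTITIONS ods_user_info_d;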

Configure the batch synchronization nodes

Note In a workspace in standard mode, we recommend that you do not run batch synchronization nodes in the development environment. This means that directly running nodes on their configuration tabs is not recommended. Instead, we recommend that you deploy the nodes in the production environment and then run the nodes in test mode to obtain complete run logs.

After the nodes are deployed in the production environment, you can apply for the permissions to read data from and write data to the tables that are in the development environment.

  1. Configure the oss_synchronization node.
    1. In the Scheduled Workflow pane, double-click the oss_synchronization node in the new workflow. The node configuration tab appears.
    2. Select the source.
      The following table describes the parameters in the Source section.
      • Connection: The type and name of the source. Select OSS and oss_workshop_log in sequence.
      • Object Path: The path of the OSS object. Do not enter the name of the OSS bucket. In this topic, enter user_log.txt.
      • File Type: The object type. Select text.
      • Column separator: The column delimiter. Enter |.
      • Encoding: The encoding format. Default value: UTF-8.
      • Null String: The string that represents null.
      • Compression Format: The compression format of the OSS object. Valid values: None, Gzip, Bzip2, and Zip. Select None.
      • Skip Header: Specifies whether to skip the header row. Default value: No.
    3. Select the destination.
      The following table describes the parameters in the Target section.
      • Connection: The type and name of the destination. Select ODPS and odps_first in sequence.
      • Table: The table for storing the synchronized data. Select the ods_raw_log_d table.
      • Partition Key Column: The partition information. Default value: ${bizdate}.
      • Writing Rule: The method that is used to process existing data before new data is written to MaxCompute. Default value: Write with Original Data Deleted (Insert Overwrite). A SQL sketch of this writing behavior is provided after this procedure.
      • Convert Empty Strings to Null: Specifies whether to convert empty strings to null. Select No.
      Note
      • The default odps_first data source is automatically generated for a workspace by DataWorks after you associate the MaxCompute compute engine with the workspace for the first time.
      • The odps_first data source is used to write synchronized data to a MaxCompute project in the current workspace.
    4. Configure the mappings between fields in the source and destination.
    5. Configure parameters in the Channel section.
      The following table describes the parameters in the Channel section.
      • Expected Maximum Concurrency: The maximum number of parallel threads that the data synchronization node uses to read data from the source or write data to the destination. You can configure the parallelism for the data synchronization node on the codeless UI.
      • Bandwidth Throttling: Specifies whether to enable bandwidth throttling. You can enable bandwidth throttling and specify a maximum transmission rate to prevent heavy read workloads on the source. We recommend that you enable bandwidth throttling and set the maximum transmission rate to an appropriate value based on the configurations of the source.
      • Dirty Data Records Allowed: The maximum number of dirty data records allowed.
    6. On the node configuration tab, click Properties in the right-side navigation pane. In the Dependencies section, enter an output name in the Workspace name.ods_raw_log_d format below Output and click the Add icon.
      Important We recommend that you do not include Chinese characters in the output names of the current node. Chinese characters reduce the accuracy of automatic recommendation.
    7. Click the Save icon in the top toolbar.
    8. Close the node configuration tab.
  2. Configure the rds_synchronization node.
    1. In the Scheduled Workflow pane, double-click the rds_synchronization node in the new workflow. The node configuration tab appears.
    2. Select the source.
      The following table describes the parameters in the Source section.
      • Connection: The type and name of the source. Select MySQL and rds_workshop_log in sequence.
      • Table: The table from which data is read. Select the ods_user_info_d table in the ApsaraDB RDS instance.
      • Filter: The condition used to filter the data that you want to synchronize. The filter condition is often used to filter incremental data. You can leave this parameter unspecified.
      • Shard Key: The shard key for the data to be synchronized. Default value: uid.
    3. Select the destination.
      The following table describes the parameters in the Target section.
      • Connection: The type and name of the destination. Select ODPS and odps_first in sequence.
      • Table: The table for storing the synchronized data. Select the ods_user_info_d table in the MaxCompute project.
      • Partition Key Column: The partition information. Default value: ${bizdate}.
      • Writing Rule: The method that is used to process existing data before new data is written to MaxCompute. Default value: Write with Original Data Deleted (Insert Overwrite).
      • Convert Empty Strings to Null: Specifies whether to convert empty strings to null. Select No.
    4. Configure the mappings between fields in the source and destination.
    5. Configure parameters in the Channel section.
    6. On the node configuration tab, click Properties in the right-side navigation pane. In the Dependencies section, enter an output name in the Workspace name.ods_user_info_d format below Output and click the Add icon.
      If an output name does not comply with the naming rules, click Delete to delete the output name.
      Important We recommend that you do not include Chinese characters in the output names of the current node. Chinese characters reduce the accuracy of automatic recommendation.
    7. Click the Save icon in the top toolbar.
    8. Close the node configuration tab.
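
For reference, the Write with Original Data Deleted (Insert Overwrite) writing rule replaces all existing data in the partition that corresponds to the data timestamp (${bizdate}) before the new data is written, so rerunning a node for the same day does not duplicate data. The following MaxCompute SQL sketch only illustrates this behavior on hypothetical scratch tables (demo_log_d and demo_staged); the synchronization node generates the actual write itself.

    -- Illustrative only: the writing rule behaves like INSERT OVERWRITE on one daily partition.
    CREATE TABLE IF NOT EXISTS demo_log_d (col STRING) PARTITIONED BY (dt STRING);
    CREATE TABLE IF NOT EXISTS demo_staged (col STRING);
    INSERT INTO TABLE demo_staged VALUES ('a sample log line');
    -- Each run replaces the contents of the partition for the data timestamp.
    INSERT OVERWRITE TABLE demo_log_d PARTITION (dt='20180716')
    SELECT col FROM demo_staged;
    SELECT COUNT(*) FROM demo_log_d WHERE dt='20180716';
    -- Clean up the scratch tables.
    DROP TABLE IF EXISTS demo_log_d;
    DROP TABLE IF EXISTS demo_staged;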

Commit the workflow

  1. In the Scheduled Workflow pane, double-click the new workflow. On the workflow configuration tab, click the Submit icon in the top toolbar.
  2. In the Commit dialog box, select the nodes to be committed, enter your comments in the Change description field, and then select Ignore I/O Inconsistency Alerts.
  3. Click Commit. The Committed successfully message appears.

Run the workflow

  1. In the Scheduled Workflow pane, double-click the new workflow. On the workflow configuration tab that appears, click the Run icon in the top toolbar.
  2. Right-click the rds_synchronization node and select View Log.
    If the logs indicate that the rds_synchronization node is successfully run, the data is synchronized.
  3. Right-click the oss_synchronization node and select View Log. View the logs to check whether the oss_synchronization node is run and data is synchronized.

Verify data synchronization to MaxCompute

  1. In the left-side navigation pane, click Ad-Hoc Query.
  2. In the Ad-Hoc Query pane, right-click Ad-Hoc Query and choose Create Node > ODPS SQL.
  3. In the Create Node dialog box, enter the node name and click Commit. On the node configuration tab that appears, write and execute SQL statements to view the number of data records that are synchronized to the ods_raw_log_d and ods_user_info_d tables.
    Note Execute the following SQL statements. In each statement, change the partition key value to the data timestamp of the node. For example, if the node is run on July 17, 2018, the data timestamp is 20180716, which is one day before the node is run.
    -- Check whether the data is written to MaxCompute.
    select count(*) from ods_raw_log_d where dt = '20180716';   -- Replace 20180716 with the data timestamp of the node.
    select count(*) from ods_user_info_d where dt = '20180716'; -- Replace 20180716 with the data timestamp of the node.
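    You can also spot-check a few of the synchronized records. The following is a minimal sketch that uses the same example data timestamp; replace it with the data timestamp of your node.
    -- Preview a few synchronized records.
    select * from ods_raw_log_d where dt = '20180716' limit 10;
    select * from ods_user_info_d where dt = '20180716' limit 10;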

Subsequent steps

You now understand how to collect and synchronize data. You can proceed to the next tutorial, in which you will learn how to compute and analyze the collected data. For more information, see Process data.