Related Products

The big data products involved in this lab are MaxCompute (a big data computing service) and DataWorks (Data Factory, formerly the Big Data Development Suite).

Prerequisites

Before you begin this lab, make sure that you have an Alibaba Cloud account and that it has passed real-name verification.

Activate MaxCompute

Note If you have already activated MaxCompute, skip this step and create the workspace directly.
  1. Log in to the Alibaba Cloud website, click Log In in the upper-right corner, and enter your Alibaba Cloud account and password.
  2. Select Products > Analytics & Big Data > MaxCompute to go to the MaxCompute product details page.

  3. Click Start now.
  4. Select Pay-As-You-Go, and then click Buy Now.

Create workspace

  1. Log on to the DataWorks console by using a primary account.
  2. You can create a workspace in two ways.
    • On the console Overview page, go to Commonly Used Features > Create Workspace.

    • On the console Workspace page, select region, and then click Create Workspace in the upper right corner.

  3. Fill in the configuration items in the Create Workspace dialog box. Select a region and a compute engine service.
    Note If you have not purchased the relevant services in a region, the dialog box directly displays that no service is available in that region. Data Analytics, O&M, and Administration are selected by default.
  4. Configure the basic information and advanced settings for the new project, and click Create Workspace.

    Note
    • The workspace name must begin with a letter or an underscore, and can contain only letters, underscores, and numbers.
    • The workspace name must be globally unique. We recommend that you use a distinctive name of your own as the workspace name for this lab.
  5. Once the workspace has been created successfully, you can go to the Workspace page, find the new workspace, and click Data Analytics to enter it.

Create data source

Note Based on the scenario simulated by this lab, you need to create both an OSS data source and an RDS data source.
  • Create a new OSS data source
    1. Go to the Data Integration > Data Source page, and click Add Data Source.

    2. Set the data source type to OSS, and fill in the other configuration items as follows.

      Parameters:
      • Data source name: oss_workshop_log
      • Endpoint: http://oss-cn-shanghai-internal.aliyuncs.com
      • Bucket: dataworks-workshop
      • AK ID: LTAINEhd4MZ8pX64
      • AK Key: lXnzUngTSebt3SfLYxZxoSjGAK6IaF
    3. Click Test Connectivity, and after the connectivity test passes, click Finish to save the configuration.
      Note If the connectivity test fails, check your AccessKey and the region in which the workspace is located. We recommend that you create the workspace in East China 2; network access is not guaranteed in other regions.
  • Add RDS Data Source
    1. Go to the Data Integration > Data Source page, and click Add Data Source.
    2. Set the data source type to MySQL, and fill in the configuration information as follows. (An optional manual connection check appears after these steps.)

      Parameters:
      • Data source type: ApsaraDB for RDS
      • Data source name: rds_workshop_log
      • Data source description: RDS log data synchronization
      • RDS instance name: rm-2ev0681lc7042g16u
      • RDS instance purchaser ID: 5600815724958382
      • Database name: workshop
      • Username/Password: workshop/workshop#2017
    3. Click Test Connectivity, and after the connectivity test passes, click Finish to save the configuration.
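    If the connectivity test fails, you can optionally verify the connection details outside DataWorks. The following is a minimal sketch, assuming a MySQL client with network access to the RDS instance, and that the source table is ods_user_info_d (the table used later in this lab):
      -- Optional manual check against the RDS source, run from any MySQL
      -- client that can reach the instance; use the parameters listed above.
      SELECT COUNT(*) FROM workshop.ods_user_info_d;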

Create a Business Flow

  1. Right-click Business Flow under Data Analytics, and select Create Business Flow.
  2. Fill in the Business Flow name and description.

  3. Click Create to complete the creation of the Business Flow.

  4. Enter the Business Flow development panel and drag one virtual node (workshop_start) and two data sync nodes (oss_datasync and rds_datasync) into the panel.



  5. Draw connections so that the workshop_start node is upstream of both data synchronization nodes.

Configure workshop_start task

Because the new version sets input and output nodes for each node, you need to set an input for the workshop_start node. The upstream node of a virtual node in a Business Flow can be set to the project root node, which is generally named projectname_root (for example, a workspace named workshop would have the root node workshop_root).

You can configure it by clicking Schedule. When the task configuration is complete, click Save.

Create Table

  1. Right-click Table and choose Create Table.

  2. Enter the table names: ods_raw_log_d for the OSS logs and ods_user_info_d for the RDS data, respectively.

  3. Type in your Table Alias and choose Partitioned Table.

  4. Enter the field and partition information, and click Submit to Development Environment and Submit to Production Environment.

    You can also click DDL Mode and use the following SQL statements to create the tables.
    -- Create the target table for the OSS logs
    CREATE TABLE IF NOT EXISTS ods_raw_log_d (
      col STRING
    )
    PARTITIONED BY (
      dt STRING
    );
    
    -- Create the target table for the RDS data
    CREATE TABLE IF NOT EXISTS ods_user_info_d (
      uid STRING COMMENT 'User ID',
      gender STRING COMMENT 'Gender',
      age_range STRING COMMENT 'Age range',
      zodiac STRING COMMENT 'Zodiac'
    )
    PARTITIONED BY (
      dt STRING
    );
  5. Click Submit to Development Environment and Submit to Production Environment. Configure both tables in this way; an optional schema check follows.
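    To confirm that both tables were created with the expected structure, you can run describe statements, for example in a temporary query (a minimal check; DESC is standard ODPS SQL):
    -- Optional: confirm the table schemas after submission
    DESC ods_raw_log_d;
    DESC ods_user_info_d;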

Configure the data synchronization task

  • Configure the oss_datasync node
    1. Double-click the oss_datasync node to go to the node configuration page.
    2. Select a data source.
      Set the data source to oss_workshop_log, the OSS data source created earlier.

      Parameters:
      • Data source: oss_workshop_log
      • Object Prefix: /user_log.txt
      • Column Separator: |
    3. Select a data destination.
      Set the data destination to the ods_raw_log_d table in the odps_first data source. Use the system defaults for both the partition information and the cleanup rule; the default partition configuration is ${bizdate}. (A script-mode sketch of both sync tasks follows this list.)

    4. Configure the field mapping by connecting the fields that you want to synchronize.

    5. Configure the transmission rate, setting the maximum rate to 10 Mb/s.

    6. Verify the current task configuration and modify it if necessary. After you confirm that it is correct, click Save in the upper-left corner.
    7. Close the current task and return to the Business Flow configuration panel.
  • Configure the rds_datasync node
    1. Double-click the rds_datasync node to go to the node configuration page.
    2. Select a data source.
      Set the data source to the MySQL data source rds_workshop_log and the table to ods_user_info_d. Use the default generated column as the split key.

    3. Select a data destination.
      Set the data destination to the ods_user_info_d table in the odps_first data source. Use the system defaults for both the partition information and the cleanup rule; the default partition configuration is ${bizdate}.

    4. Configure the field mapping; by default, fields are mapped by name.

    5. Configure the transmission rate, setting the maximum rate to 10 Mb/s.

    6. Verify the current task configuration and modify it if necessary. After you confirm that it is correct, click Save in the upper-left corner.

    7. Close the current task and return to the Business Flow configuration panel.
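
  For reference, the wizard settings above correspond to a script-mode configuration in Data Integration. The following JSON is a minimal sketch of the oss_datasync task; the exact field names can differ across DataWorks versions, so treat it as illustrative rather than exact. The rds_datasync task is analogous, with a MySQL reader (datasource rds_workshop_log, table ods_user_info_d) in place of the OSS reader.
    {
      "type": "job",
      "version": "2.0",
      "steps": [
        {
          "stepType": "oss",
          "category": "reader",
          "name": "Reader",
          "parameter": {
            "datasource": "oss_workshop_log",
            "object": ["user_log.txt"],
            "fieldDelimiter": "|",
            "column": [{"index": 0, "type": "string"}]
          }
        },
        {
          "stepType": "odps",
          "category": "writer",
          "name": "Writer",
          "parameter": {
            "datasource": "odps_first",
            "table": "ods_raw_log_d",
            "partition": "dt=${bizdate}",
            "column": ["col"],
            "truncate": true
          }
        }
      ],
      "setting": {
        "speed": {"throttle": true, "mbps": 10}
      },
      "order": {"hops": [{"from": "Reader", "to": "Writer"}]}
    }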

Submit Business Flow tasks

  1. Click Submit to submit the current Business Flow.
  2. Select the nodes in the submit dialog box, select Ignore Warning on I/O Inconsistency, and then click Submit.

Run workflow task

  1. Click Run.

    During a task run, you can view the run status.

  2. Right-click the rds_datasync task and select View Log.

    When the log indicates that the task finished successfully, the synchronization task has run and synchronized a batch of data.



  3. Right-click the oss_datasync task and select View Log. The confirmation method is consistent with the rds_datasync task.

Check if the data is successfully imported into MaxCompute

  1. Click Temporary Query in the left-hand navigation bar.
  2. Select New > ODPS SQL.

  3. Write and execute a SQL statement to check the number of entries imported into ods_raw_log_d.

  4. Also write and execute SQL statements to view the number of records imported into ods_user_info_d.
    Note The SQL statements are as follows; replace the partition value with your business date. If the task runs on 20180717, the business date is 20180716.
    -- Check that data was written to MaxCompute (business date assumed to be 20180716)
    select count(*) from ods_raw_log_d where dt='20180716';
    select count(*) from ods_user_info_d where dt='20180716';
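    If the counts look reasonable, you can also preview a few rows; the partition value below assumes the same business date of 20180716.
    -- Optional: preview a few synchronized rows
    select * from ods_user_info_d where dt='20180716' limit 10;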

Next step

Now that you have learned how to synchronize log data and complete data acquisition, you can continue with the next tutorial, in which you will learn how to compute and analyze the collected data. For more information, see Data processing: user portraits.