This topic describes how to compute and analyze collected data by using DataWorks.

Prerequisites

The data that is required for the workshop is collected. For more information, see Collect data.

Create tables

  1. Go to the DataStudio page.
    1. Log on to the DataWorks console.
    2. In the left-side navigation pane, click Workspaces.
    3. In the top navigation bar, select the region where your workspace resides, find the workspace, and then click Data Analytics in the Actions column.
  2. On the DataStudio page, click the created workflow. Right-click MaxCompute and choose Create > Table.
  3. In the Create Table dialog box, set the Table Name parameter and click Create.
    Create a table named ods_log_info_d at the operational data store (ODS) layer, a table named dw_user_info_all_d at the common data model (CDM) layer, and a table named rpt_user_info_d at the application data store (ADS) layer.
  4. Run DDL statements to create tables.
    • Create the ods_log_info_d table.
      Double-click the ods_log_info_d table. On the configuration tab on the right side, click DDL Statement and enter the following table creation statement:
      -- Create a table at the ODS layer.
      CREATE TABLE IF NOT EXISTS ods_log_info_d (
        ip STRING COMMENT 'The IP address',
        uid STRING COMMENT 'The ID of the user',
        time STRING COMMENT 'The time in the format of yyyymmddhh:mi:ss',
        status STRING COMMENT 'The status code that is returned by the server',
        bytes STRING COMMENT 'The number of bytes that are returned to the client',
        region STRING COMMENT 'The region, which is obtained based on the IP address',
        method STRING COMMENT 'The type of the HTTP request',
        url STRING COMMENT 'The URL of the request',
        protocol STRING COMMENT 'The version number of HTTP',
        referer STRING COMMENT 'The source URL',
        device STRING COMMENT 'The terminal type',
        identity STRING COMMENT 'The access type, which can be crawler, feed, user, or unknown'
      )
      PARTITIONED BY (
        dt STRING
      );
    • Create the dw_user_info_all_d table.
      Double-click the dw_user_info_all_d table. On the configuration tab on the right side, click DDL Statement and enter the following table creation statement:
      -- Create a table at the CDM layer.
      CREATE TABLE IF NOT EXISTS dw_user_info_all_d (
        uid STRING COMMENT 'The ID of the user',
        gender STRING COMMENT 'The gender',
        age_range STRING COMMENT 'The age range',
        zodiac STRING COMMENT 'The zodiac sign',
        region STRING COMMENT 'The region, which is obtained based on the IP address',
        device STRING COMMENT 'The terminal type',
        identity STRING COMMENT 'The access type, which can be crawler, feed, user, or unknown',
        method STRING COMMENT 'The type of the HTTP request',
        url STRING COMMENT 'The URL of the request',
        referer STRING COMMENT 'The source URL',
        time STRING COMMENT 'The time in the format of yyyymmddhh:mi:ss'
      )
      PARTITIONED BY (
        dt STRING
      );
    • Create the rpt_user_info_d table.
      Double-click the rpt_user_info_d table. On the configuration tab on the right side, click DDL Statement and enter the following table creation statement:
      -- Create a table at the ADS layer.
      CREATE TABLE IF NOT EXISTS rpt_user_info_d (
        uid STRING COMMENT 'The ID of the user',
        region STRING COMMENT 'The region, which is obtained based on the IP address',
        device STRING COMMENT 'The terminal type',
        pv BIGINT COMMENT 'The number of page views (PVs)',
        gender STRING COMMENT 'The gender',
        age_range STRING COMMENT 'The age range',
        zodiac STRING COMMENT 'The zodiac sign'
      )
      PARTITIONED BY (
        dt STRING
      );
                                          
  5. After you enter the table creation statement, click Generate Table Schema. Then, click OK to overwrite the current settings.
  6. On the table configuration tab, enter the display name of the table in the General section.
  7. After you complete the configuration, click Commit in Development Environment and Commit to Production Environment.
    Note If you use a workspace in basic mode, only Commit to Production Environment is available.
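
After the tables are committed, you can verify that they were created as expected. The following statements are a minimal check that can be run from an ad-hoc query or an ODPS SQL node; they assume that the tables were committed to the MaxCompute project that is associated with the current workspace.

  -- Verify the schemas of the three tables.
  DESC ods_log_info_d;
  DESC dw_user_info_all_d;
  DESC rpt_user_info_d;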

Design the workflow

For more information about how to configure the dependencies among nodes of a workflow, see Collect data.

Double-click the created workflow. On the configuration tab of the workflow, drag ODPS SQL to the canvas on the right. In the Create Node dialog box, set the Node Name parameter and click Commit.

Create three ODPS SQL nodes in total and name them ods_log_info_d, dw_user_info_all_d, and rpt_user_info_d. Then, configure the dependencies among the nodes, as shown in the following figure.

Create a UDF

  1. Create a resource.
    1. Download the ip2region.jar file.
    2. On the DataStudio page, click the created workflow, right-click MaxCompute, and then choose Create > Resource > JAR.
    3. In the Create Resource dialog box, set the Resource Name and Location parameters.
      Note
      • Select Upload to MaxCompute.
      • The resource name can be different from the name of the uploaded file.
      • A resource name can contain letters, digits, underscores (_), and periods (.), and is not case-sensitive. It must be 1 to 128 characters in length. A JAR resource name must end with .jar, and a Python resource name must end with .py.
    4. Click Upload, select the ip2region.jar file that is downloaded to your local computer, and then click Open.
    5. Click Create.
    6. Click the Submit icon in the toolbar.
  2. Register a function.
    1. On the DataStudio page, click the created workflow, right-click MaxCompute, and then choose Create > Function.
    2. In the Create Function dialog box, set the Function Name parameter and click Create. For example, you can set the Function Name parameter to getregion.
    3. In the Register Function section, set the parameters.
      • Function Type: The type of the function.
      • Engine Instance MaxCompute: By default, this parameter cannot be modified.
      • Function Name: The name of the function that you entered when you created the function.
      • Owner: The owner of the function.
      • Class Name: Set this parameter to org.alidata.odps.udf.Ip2Region.
      • Resources: Set this parameter to ip2region.jar.
      • Description: Set this parameter to Region conversion based on the IP address.
      • Expression Syntax: Set this parameter to getregion('ip').
      • Parameter Description: Set this parameter to IP address.
    4. Click the Save and Submit icons in the toolbar.
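
After the function is committed, you can optionally test it before it is used in the ods_log_info_d node. The following ad-hoc query is a minimal sketch: the sample IP address is only an illustration, and the region that is returned depends on the IP library that is packaged in ip2region.jar.

  -- Test the registered UDF. The sample IP address is hypothetical.
  SELECT getregion('1.2.3.4');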

Configure the ODPS SQL nodes

  1. Configure the ods_log_info_d node.
    1. Double-click the ods_log_info_d node to go to the configuration tab of the node.
    2. On the configuration tab of the node, enter the following SQL statements:
      INSERT OVERWRITE TABLE ods_log_info_d PARTITION (dt='${bdp.system.bizdate}')
      SELECT ip
        , uid
        , time
        , status
        , bytes 
        , getregion(ip) AS region -- Obtain the region based on the IP address by using the user-defined function (UDF).
        , regexp_substr(request, '(^[^ ]+ )') AS method -- Use regular expressions to extract the method, url, and protocol fields from the request. A standalone sanity check of these expressions appears after these steps.
        , regexp_extract(request, '^[^ ]+ (.*) [^ ]+$') AS url
        , regexp_substr(request, '([^ ]+$)') AS protocol
        , regexp_extract(referer, '^[^/]+://([^/]+){1}') AS referer -- Use a regular expression to clean the referrer and obtain a more accurate source URL.
        , CASE
          WHEN TOLOWER(agent) RLIKE 'android' THEN 'android' -- Obtain the terminal information and access types based on the agent parameter.
          WHEN TOLOWER(agent) RLIKE 'iphone' THEN 'iphone'
          WHEN TOLOWER(agent) RLIKE 'ipad' THEN 'ipad'
          WHEN TOLOWER(agent) RLIKE 'macintosh' THEN 'macintosh'
          WHEN TOLOWER(agent) RLIKE 'windows phone' THEN 'windows_phone'
          WHEN TOLOWER(agent) RLIKE 'windows' THEN 'windows_pc'
          ELSE 'unknown'
        END AS device
        , CASE
          WHEN TOLOWER(agent) RLIKE '(bot|spider|crawler|slurp)' THEN 'crawler'
          WHEN TOLOWER(agent) RLIKE 'feed'
          OR regexp_extract(request, '^[^ ]+ (.*) [^ ]+$') RLIKE 'feed' THEN 'feed'
          WHEN TOLOWER(agent) NOT RLIKE '(bot|spider|crawler|feed|slurp)'
          AND agent RLIKE '^(Mozilla|Opera)'
          AND regexp_extract(request, '^[^ ]+ (.*) [^ ]+$') NOT RLIKE 'feed' THEN 'user'
          ELSE 'unknown'
        END AS identity
        FROM (
          SELECT SPLIT(col, '##@@')[0] AS ip
          , SPLIT(col, '##@@')[1] AS uid
          , SPLIT(col, '##@@')[2] AS time
          , SPLIT(col, '##@@')[3] AS request
          , SPLIT(col, '##@@')[4] AS status
          , SPLIT(col, '##@@')[5] AS bytes
          , SPLIT(col, '##@@')[6] AS referer
          , SPLIT(col, '##@@')[7] AS agent
        FROM ods_raw_log_d
        WHERE dt = '${bdp.system.bizdate}'
      ) a;
    3. Click the Save icon in the toolbar.
  2. Configure the dw_user_info_all_d node.
    1. Double-click the dw_user_info_all_d node to go to the configuration tab of the node.
    2. On the configuration tab of the node, enter the following SQL statements:
      INSERT OVERWRITE TABLE dw_user_info_all_d PARTITION (dt='${bdp.system.bizdate}')
      SELECT COALESCE(a.uid, b.uid) AS uid
        , b.gender
        , b.age_range
        , b.zodiac
        , a.region
        , a.device
        , a.identity
        , a.method
        , a.url
        , a.referer
        , a.time
      FROM (
        SELECT *
        FROM ods_log_info_d
        WHERE dt = '${bdp.system.bizdate}'
      ) a
      LEFT OUTER JOIN (
        SELECT *
        FROM ods_user_info_d
        WHERE dt = '${bdp.system.bizdate}'
      ) b
      ON a.uid = b.uid;
    3. Click the Save icon in the toolbar.
  3. Configure the rpt_user_info_d node.
    1. Double-click the rpt_user_info_d node to go to the configuration tab of the node.
    2. On the configuration tab of the node, enter the following SQL statements:
      INSERT OVERWRITE TABLE rpt_user_info_d PARTITION (dt='${bdp.system.bizdate}')
      SELECT uid
        , MAX(region)
        , MAX(device)
        , COUNT(0) AS pv
        , MAX(gender)
        , MAX(age_range)
        , MAX(zodiac)
      FROM dw_user_info_all_d
      WHERE dt = '${bdp.system.bizdate}'
      GROUP BY uid;
    3. Click the Save icon in the toolbar.
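
The ods_log_info_d node relies on regular expressions to split the raw request field into the method, url, and protocol fields. If you want to confirm what these expressions return before you run the workflow, the following ad-hoc query is a small sanity check. The sample request string is hypothetical and only illustrates the expected "method path protocol" layout of the raw logs.

  -- Sanity check of the request-parsing expressions. The sample request string is hypothetical.
  SELECT regexp_substr('GET /getlog.htm HTTP/1.1', '(^[^ ]+ )') AS method         -- Expected: 'GET '
       , regexp_extract('GET /getlog.htm HTTP/1.1', '^[^ ]+ (.*) [^ ]+$') AS url  -- Expected: '/getlog.htm'
       , regexp_substr('GET /getlog.htm HTTP/1.1', '([^ ]+$)') AS protocol;       -- Expected: 'HTTP/1.1'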

Commit the workflow

  1. On the configuration tab of the workflow, click the Submit icon to commit the nodes that are configured in the workflow.
  2. In the Commit dialog box, select the nodes that you want to commit and select Ignore I/O Inconsistency Alerts.
  3. Click Commit.

Run the workflow

  1. On the configuration tab of the workflow, click the Run icon in the toolbar to verify the logic of node code.
  2. After all nodes are run and a green check mark (✓) appears, click Ad-Hoc Query on the left-side navigation submenu.
  3. On the Ad-Hoc Query tab, right-click Ad-Hoc Query and choose Create Node > ODPS SQL.
  4. Write and run an SQL statement to query the node running result and check whether required data is generated.
    Use the following SQL statement. By default, the data timestamp of a node is one day before the date on which the node is run. For example, if the node is run on June 21, 2023, the data timestamp is 20230620.
    -- View data in the rpt_user_info_d table. Replace <Data timestamp> with the actual data timestamp in the yyyymmdd format.
    select * from rpt_user_info_d where dt='<Data timestamp>' limit 10;
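
If the result of rpt_user_info_d looks incomplete, you can run similar ad-hoc queries against the upstream tables to locate the step in which data is missing. The following statements are a sketch; replace <Data timestamp> with the same data timestamp that is used in the preceding query.

  -- Check the upstream tables for the same data timestamp.
  select * from ods_log_info_d where dt='<Data timestamp>' limit 10;
  select * from dw_user_info_all_d where dt='<Data timestamp>' limit 10;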

Deploy the workflow

After you commit the workflow, the nodes in the workflow are in the development environment. You must deploy the configured nodes in the production environment because nodes in the development environment cannot be automatically scheduled.
Note
  • Before you deploy the nodes in the production environment, test the node code to make sure that the code is correct.
  • In a workspace in basic mode, the Deploy icon is unavailable. After you commit a node, click the Go to Operation Center icon to go to the Operation Center page.
  1. On the configuration tab of the workflow, click the Deploy icon in the toolbar to go to the Deploy page.
  2. Select the nodes that you want to deploy and click Add to List.
  3. Click To-Be-Deployed Node List in the upper-right corner. In the Nodes to Deploy panel, click Deploy All.
  4. In the Create Deploy Task dialog box, click Deploy.
  5. In the left-side navigation pane, click Release Package to view the deployment status.

Run the nodes in the production environment

  1. After you deploy the nodes, click Operation Center in the upper-right corner.
    You can also click Go to Operation Center in the toolbar on the configuration tab of the workflow to go to the Operation Center page.
  2. In the left-side navigation pane, choose Cycle Task Maintenance > Cycle Task to go to the Cycle Task page. Then, click the workshop workflow.
  3. Double-click the zero-load node in the directed acyclic graph (DAG) to show the workflow. Right-click the workshop_start node and choose Run > Current and Descendant Nodes Retroactively.
  4. Select nodes to generate retroactive data, specify the data timestamp, and then click OK. The Patch Data page appears.
  5. Click Refresh until all SQL nodes are run.

What to do next

You have learned how to create SQL nodes and process raw logs. You can now proceed with the next tutorial to learn how to monitor and ensure the quality of the data that is generated by the developed nodes. For more information, see Configure rules to monitor data quality.