This topic describes how to compute and analyze collected data by using DataWorks.

Prerequisites

The required data is collected. For more information, see Collect data.

Create tables

  1. Go to the DataStudio page.
    1. Log on to the DataWorks console.
    2. In the left-side navigation pane, click Workspaces.
    3. In the top navigation bar, select the region in which the workspace that you want to manage resides. Find the workspace and click DataStudio in the Actions column.
  2. In the Scheduled Workflow pane of the DataStudio page, find the created workflow and click the workflow name. Right-click MaxCompute and select Create Table.
  3. In the Create Table dialog box, configure the Name parameter and click Create.
    Create a table named ods_log_info_d at the operational data store (ODS) layer, a table named dw_user_info_all_d at the common data model (CDM) layer, and a table named rpt_user_info_d at the application data store (ADS) layer.
  4. Create tables by executing DDL statements.
    • Create the ods_log_info_d table.
      Double-click the name of the ods_log_info_d table. On the configuration tab of the table, click DDL Statement and enter the following table creation statement:
      -- Create a table at the ODS layer.
      CREATE TABLE IF NOT EXISTS ods_log_info_d (
        ip STRING COMMENT 'The IP address',
        uid STRING COMMENT 'The ID of the user',
        time STRING COMMENT 'The time in the format of yyyymmddhh:mi:ss',
        status STRING COMMENT 'The status code that is returned by the server',
        bytes STRING COMMENT 'The number of bytes that are returned to the client',
        region STRING COMMENT 'The region, which is obtained based on the IP address',
        method STRING COMMENT 'The type of the HTTP request',
        url STRING COMMENT 'The URL of the request',
        protocol STRING COMMENT 'The version number of HTTP',
        referer STRING COMMENT 'The source URL',
        device STRING COMMENT 'The terminal type',
        identity STRING COMMENT 'The access type, which can be crawler, feed, user, or unknown'
      )
      PARTITIONED BY (
        dt STRING
      );
    • Create the dw_user_info_all_d table.
      Double-click the name of the dw_user_info_all_d table. On the configuration tab of the table, click DDL Statement and enter the following table creation statement:
      -- Create a table at the CDM layer.
      CREATE TABLE IF NOT EXISTS dw_user_info_all_d (
        uid STRING COMMENT 'The ID of the user',
        gender STRING COMMENT 'The gender',
        age_range STRING COMMENT 'The age range',
        zodiac STRING COMMENT 'The zodiac sign',
        region STRING COMMENT 'The region, which is obtained based on the IP address',
        device STRING COMMENT 'The terminal type',
        identity STRING COMMENT 'The access type, which can be crawler, feed, user, or unknown',
        method STRING COMMENT 'The type of the HTTP request',
        url STRING COMMENT 'The URL of the request',
        referer STRING COMMENT 'The source URL',
        time STRING COMMENT 'The time in the format of yyyymmddhh:mi:ss'
      )
      PARTITIONED BY (
        dt STRING
      );
    • Create the rpt_user_info_d table.
      Double-click the name of the rpt_user_info_d table. On the configuration tab of the table, click DDL Statement and enter the following table creation statement:
      -- Create a table at the ADS layer.
      CREATE TABLE IF NOT EXISTS rpt_user_info_d (
        uid STRING COMMENT 'The ID of the user',
        region STRING COMMENT 'The region, which is obtained based on the IP address',
        device STRING COMMENT 'The terminal type',
        pv BIGINT COMMENT 'The number of page views',
        gender STRING COMMENT 'The gender',
        age_range STRING COMMENT 'The age range',
        zodiac STRING COMMENT 'The zodiac sign'
      )
      PARTITIONED BY (
        dt STRING
      );
                                          
  5. After you enter the CREATE TABLE statements, click Generate Table Schema. In the Confirm message, click OK.
  6. On the configuration tab of each table, enter a display name in the General section.
  7. After the tables are configured, click Commit to DEV and Commit to PROD in sequence.
    Note If you use a workspace that is in basic mode, only Commit to PROD is available.

Design the workflow

For more information about how to configure the dependencies among nodes in a workflow, see Collect data.

In the Scheduled Workflow pane of the DataStudio page, find the created workflow and double-click the workflow name. On the configuration tab of the workflow, click ODPS SQL or drag ODPS SQL to the configuration tab on the right. In the Create Node dialog box, configure the Name parameter and click Commit.

Create three ODPS SQL nodes in total and name them ods_log_info_d, dw_user_info_all_d, and rpt_user_info_d. Then, configure the dependencies among the nodes: ods_log_info_d is the ancestor node of dw_user_info_all_d, and dw_user_info_all_d is the ancestor node of rpt_user_info_d.

Create a user-defined function (UDF)

  1. Create a resource.
    1. Download the ip2region.jar file.
    2. In the Scheduled Workflow pane of the DataStudio page, find the created workflow and click the workflow name. Right-click MaxCompute and then choose Create Resource > JAR.
    3. In the Create Resource dialog box, configure the Name and Path parameters.
      Note
      • Select Upload to MaxCompute.
      • The resource name can be different from the name of the uploaded file.
    4. Click Upload, select the ip2region.jar file that is downloaded to your on-premises machine, and then click Open.
    5. Click Create.
    6. Click the Submit icon in the top toolbar.
  2. Register a function.
    1. In the Scheduled Workflow pane of the DataStudio page, find the created workflow and click the workflow name. Right-click MaxCompute, and then choose Create Function.
    2. In the Create Function dialog box, configure the Name parameter and click Create. For example, you can set the Name parameter to getregion.
    3. In the Register Function section of the configuration tab that appears, configure the parameters that are described in the following table.
      • Function Type: The type of the function.
      • MaxCompute Engine Instance: The MaxCompute compute engine instance. The value of this parameter cannot be changed.
      • Function Name: The name of the function.
      • Owner: The owner of the function.
      • Class Name: Set this parameter to org.alidata.odps.udf.Ip2Region.
      • Resources: Set this parameter to ip2region.jar.
      • Description: Set this parameter to Region conversion based on the IP address.
      • Expression Syntax: Set this parameter to getregion('ip').
      • Parameter Description: Set this parameter to IP address.
    4. Click the Save and Submit icons in the top toolbar.
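After registration, the function can be called as getregion('ip') in the SQL code of the ods_log_info_d node. Conceptually, the UDF maps an IP address string to a region name, as in the following Python sketch. The lookup table and prefix matching below are invented purely for illustration; the actual logic is packaged in ip2region.jar (class org.alidata.odps.udf.Ip2Region).

```python
# Hypothetical lookup table. The real ip2region.jar resolves regions from a
# bundled IP database; these prefixes and region names are made up.
REGION_TABLE = {
    "192.0.2": "example-region-a",
    "198.51.100": "example-region-b",
}

def getregion(ip: str) -> str:
    """Sketch of the UDF contract: IP address string in, region string out."""
    # Match on the first three octets, falling back to 'unknown'.
    prefix = ip.rsplit(".", 1)[0]
    return REGION_TABLE.get(prefix, "unknown")

print(getregion("192.0.2.10"))
```

The important point is the signature: the UDF takes a single STRING argument and returns a STRING, which is why it can be used directly in the SELECT list of an ODPS SQL statement.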

Configure the ODPS SQL nodes

  1. Configure the ods_log_info_d node.
    1. In the Scheduled Workflow pane of the DataStudio page, find the ods_log_info_d node and double-click the node name to go to the configuration tab of the node.
    2. On the configuration tab of the node, enter the following SQL statements:
      INSERT OVERWRITE TABLE ods_log_info_d PARTITION (dt='${bdp.system.bizdate}')
      SELECT ip
        , uid
        , time
        , status
        , bytes
        , getregion(ip) AS region -- Obtain the region based on the IP address by using the UDF. 
        , regexp_substr(request, '(^[^ ]+ )') AS method -- Use regular expressions to extract three fields (method, url, and protocol) from the request. 
        , regexp_extract(request, '^[^ ]+ (.*) [^ ]+$') AS url
        , regexp_substr(request, '([^ ]+$)') AS protocol 
        , regexp_extract(referer, '^[^/]+://([^/]+){1}') AS referer -- Use a regular expression to extract the domain from the referrer to obtain a more accurate source. 
        , CASE
          WHEN TOLOWER(agent) RLIKE 'android' THEN 'android' -- Obtain the terminal information and access types based on the agent parameter. 
          WHEN TOLOWER(agent) RLIKE 'iphone' THEN 'iphone'
          WHEN TOLOWER(agent) RLIKE 'ipad' THEN 'ipad'
          WHEN TOLOWER(agent) RLIKE 'macintosh' THEN 'macintosh'
          WHEN TOLOWER(agent) RLIKE 'windows phone' THEN 'windows_phone'
          WHEN TOLOWER(agent) RLIKE 'windows' THEN 'windows_pc'
          ELSE 'unknown'
        END AS device
        , CASE
          WHEN TOLOWER(agent) RLIKE '(bot|spider|crawler|slurp)' THEN 'crawler'
          WHEN TOLOWER(agent) RLIKE 'feed'
          OR regexp_extract(request, '^[^ ]+ (.*) [^ ]+$') RLIKE 'feed' THEN 'feed'
          WHEN TOLOWER(agent) NOT RLIKE '(bot|spider|crawler|feed|slurp)'
          AND agent RLIKE '^(Mozilla|Opera)' -- Use alternation, not a character class, to match the browser prefixes.
          AND regexp_extract(request, '^[^ ]+ (.*) [^ ]+$') NOT RLIKE 'feed' THEN 'user'
          ELSE 'unknown'
        END AS identity
      FROM (
        SELECT SPLIT(col, '##@@')[0] AS ip
          , SPLIT(col, '##@@')[1] AS uid
          , SPLIT(col, '##@@')[2] AS time
          , SPLIT(col, '##@@')[3] AS request
          , SPLIT(col, '##@@')[4] AS status
          , SPLIT(col, '##@@')[5] AS bytes
          , SPLIT(col, '##@@')[6] AS referer
          , SPLIT(col, '##@@')[7] AS agent
        FROM ods_raw_log_d
        WHERE dt = '${bdp.system.bizdate}'
      ) a;
    3. Click the Save icon in the top toolbar.
  2. Configure the dw_user_info_all_d node.
    1. In the Scheduled Workflow pane of the DataStudio page, find the dw_user_info_all_d node and double-click the node name to go to the configuration tab of the node.
    2. On the configuration tab of the node, enter the following SQL statements:
      INSERT OVERWRITE TABLE dw_user_info_all_d PARTITION (dt='${bdp.system.bizdate}')
      SELECT COALESCE(a.uid, b.uid) AS uid
        , b.gender
        , b.age_range
        , b.zodiac
        , a.region
        , a.device
        , a.identity
        , a.method
        , a.url
        , a.referer
        , a.time
      FROM (
        SELECT *
        FROM ods_log_info_d
        WHERE dt = '${bdp.system.bizdate}'
      ) a
      LEFT OUTER JOIN (
        SELECT *
        FROM ods_user_info_d
        WHERE dt = '${bdp.system.bizdate}'
      ) b
      ON a.uid = b.uid;
    3. Click the Save icon in the top toolbar.
  3. Configure the rpt_user_info_d node.
    1. In the Scheduled Workflow pane of the DataStudio page, find the rpt_user_info_d node and double-click the node name to go to the configuration tab of the node.
    2. On the configuration tab of the node, enter the following SQL statements:
      INSERT OVERWRITE TABLE rpt_user_info_d PARTITION (dt='${bdp.system.bizdate}')
      SELECT uid
        , MAX(region)
        , MAX(device)
        , COUNT(0) AS pv
        , MAX(gender)
        , MAX(age_range)
        , MAX(zodiac)
      FROM dw_user_info_all_d
      WHERE dt = '${bdp.system.bizdate}'
      GROUP BY uid;
    3. Click the Save icon in the top toolbar.
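The transformation logic of the ods_log_info_d node can be sanity-checked outside MaxCompute. The following Python sketch mirrors the SPLIT call on the '##@@' delimiter, the regexp_* extractions for method, url, and protocol, and the device CASE expression from the SQL statements above. The sample log line is made up for illustration.

```python
import re

# Made-up sample line in the same '##@@'-delimited shape as ods_raw_log_d:
# ip, uid, time, request, status, bytes, referer, agent.
raw = ("192.0.2.10##@@0016359810821##@@2023-01-01 10:00:00"
       "##@@GET /feed/index.html HTTP/1.1##@@200##@@512"
       "##@@https://example.com/home##@@Mozilla/5.0 (iPhone; CPU iPhone OS 15_0)")

ip, uid, time_, request, status, bytes_, referer, agent = raw.split("##@@")

# Mirror the three regexp_* calls: method is the first token of the request,
# protocol is the last token, and url is everything in between.
method = re.search(r"^[^ ]+ ", request).group(0)
url = re.search(r"^[^ ]+ (.*) [^ ]+$", request).group(1)
protocol = re.search(r"[^ ]+$", request).group(0)

def classify_device(agent: str) -> str:
    """Mirror the CASE expression that derives the device from the agent."""
    lowered = agent.lower()
    for keyword, device in [
        ("android", "android"),
        ("iphone", "iphone"),
        ("ipad", "ipad"),
        ("macintosh", "macintosh"),
        ("windows phone", "windows_phone"),
        ("windows", "windows_pc"),
    ]:
        if keyword in lowered:
            return device
    return "unknown"

print(method.strip(), url, protocol, classify_device(agent))
```

Because the keyword checks run in the same order as the WHEN branches, 'windows phone' is tested before 'windows', just as in the SQL.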

Commit the workflow

  1. On the configuration tab of the workflow, click the Submit icon in the top toolbar to commit the nodes that are configured in the workflow.
  2. In the Commit dialog box, select the nodes that you want to commit and select Ignore I/O Inconsistency Alerts.
  3. Click Commit.

Run the workflow

  1. On the configuration tab of the workflow, click the Run icon in the top toolbar to verify the logic of node code.
  2. After all nodes are run and a green check mark (✓) appears, click Ad Hoc Query in the left-side navigation pane.
  3. In the Ad Hoc Query pane, right-click Ad Hoc Query and choose Create Node > ODPS SQL.
  4. Write and execute an SQL statement to query the node execution result and check whether required data is generated.
    Execute the following SQL statement to view the data. By default, the data timestamp of a node is one day before the node is run. Replace the Data timestamp placeholder with the actual value, enclosed in single quotation marks.
    -- View data in the rpt_user_info_d table. 
    select * from rpt_user_info_d where dt='Data timestamp' limit 10;

Deploy the workflow

After you commit the workflow, the nodes in the workflow are in the development environment. You must deploy the configured nodes to the production environment because nodes in the development environment cannot be automatically scheduled.
Note
  • Before you deploy the nodes to the production environment, test the node code to ensure that the code is correct.
  • In a workspace in basic mode, the Deploy icon is unavailable. After you commit a node, click the Go to Operation Center icon to go to the Operation Center page.
  1. On the configuration tab of the workflow, click the Deploy icon in the top toolbar to go to the Deploy page.
  2. Select the nodes that you want to deploy and click Add to List.
  3. Click Nodes to Deploy in the upper-right corner. In the Nodes to Deploy panel, click Deploy All.
  4. In the Create Deploy Task dialog box, click Deploy.
  5. In the left-side navigation pane of the Deploy page, click Release Package to view the deployment status.

Run the nodes in the production environment

  1. After the nodes are deployed, click Operation Center in the upper part of the Deploy page.
    You can also click Operation Center in the top toolbar on the configuration tab of the workflow to go to the Operation Center page.
  2. In the left-side navigation pane of the Operation Center page, choose Cycle Task Maintenance > Cycle Task to go to the Cycle Task page. Then, click the root node of the workshop workflow.
  3. In the directed acyclic graph (DAG), double-click the zero load node, which is the root node of the workflow, to show the workflow. Right-click the workshop_start node and choose Run > Current and Descendant Nodes Retroactively.
  4. Select nodes to backfill data, specify the data timestamp, and then click OK. The Patch Data page appears.
  5. Click Refresh until all SQL nodes are successfully run.

What to do next

You have learned how to create SQL nodes and process raw log data. You can now proceed with the next tutorial to learn how to monitor and ensure the quality of the data that is generated by the developed nodes. For more information, see Configure rules to monitor data quality.