This topic describes how to use DataWorks to process the logs that have been collected to MaxCompute and generate user profiles.

Prerequisites

The data is collected. For more information, see Collect data.

Create tables

  1. In the Data Analytics section, click the created workflow, right-click MaxCompute, and then choose Create > Table.
  2. In the Create Table dialog box that appears, enter the table name and click Commit.
    Create the ods_log_info_d table at the ODS layer, the dw_user_info_all_d table at the DW layer, and the rpt_user_info_d table at the RPT layer.
  3. Run DDL statements to create tables.
    • Create the ods_log_info_d table.

      On the editing tab of the ods_log_info_d table, click DDL Statement and enter the table creation statements in the DDL Statement dialog box.

      -- Create a table at the ODS layer.
      CREATE TABLE IF NOT EXISTS ods_log_info_d (
        ip STRING COMMENT 'The IP address of the client that sends the request',
        uid STRING COMMENT 'The ID of the client user',
        time STRING COMMENT 'The time when the user accessed the webpage, in the format of yyyymmddhh:mm:ss',
        status STRING COMMENT 'The status code returned by the server',
        bytes STRING COMMENT 'The number of bytes returned to the client',
        region STRING COMMENT 'The region where the user resides, which is obtained based on the IP address',
        method STRING COMMENT 'The type of the HTTP request',
        url STRING COMMENT 'The URL of the webpage accessed by the user',
        protocol STRING COMMENT 'The version number of HTTP',
        referer STRING COMMENT 'The URL of the webpage linked to the resource being requested',
        device STRING COMMENT 'The terminal type',
        identity STRING COMMENT 'The access type, which can be crawler, feed, user, or unknown'
      )
      PARTITIONED BY (
        dt STRING
      );
    • Create the dw_user_info_all_d table.

      On the editing tab of the dw_user_info_all_d table, click DDL Statement and enter the table creation statements in the DDL Statement dialog box.

      -- Create a table at the DW layer.
      CREATE TABLE IF NOT EXISTS dw_user_info_all_d (
        uid STRING COMMENT 'The ID of the client user',
        gender STRING COMMENT 'The gender of the user',
        age_range STRING COMMENT 'The age range of the user',
        zodiac STRING COMMENT 'The zodiac sign of the user',
        region STRING COMMENT 'The region where the user resides, which is obtained based on the IP address',
        device STRING COMMENT 'The terminal type',
        identity STRING COMMENT 'The access type, which can be crawler, feed, user, or unknown',
        method STRING COMMENT 'The type of the HTTP request',
        url STRING COMMENT 'The URL of the webpage accessed by the user',
        referer STRING COMMENT 'The URL of the webpage linked to the resource being requested',
        time STRING COMMENT 'The time when the user accessed the webpage, in the format of yyyymmddhh:mm:ss'
      )
      PARTITIONED BY (
        dt STRING
      );
    • Create the rpt_user_info_d table.

      On the editing tab of the rpt_user_info_d table, click DDL Statement and enter the table creation statements in the DDL Statement dialog box.

      -- Create a table at the RPT layer.
      CREATE TABLE IF NOT EXISTS rpt_user_info_d (
        uid STRING COMMENT 'The ID of the client user',
        region STRING COMMENT 'The region where the user resides, which is obtained based on the IP address',
        device STRING COMMENT 'The terminal type',
        pv BIGINT COMMENT 'The number of times that the user viewed the webpage',
        gender STRING COMMENT 'The gender of the user',
        age_range STRING COMMENT 'The age range of the user',
        zodiac STRING COMMENT 'The zodiac sign of the user'
      )
      PARTITIONED BY (
        dt STRING
      );
      								
  4. Click Generate Table Schema. In the Confirm dialog box that appears, click OK.
  5. On the table editing tab, enter the display name of the table in the General section.
  6. Click Commit in Development Environment and Commit to Production Environment in sequence.
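
Optionally, you can confirm that the three tables were created as expected before you move on. The statements below are only a quick check; run them on a node editing tab or in an ad-hoc query in the same workspace.

  -- Optional check: view the schema of each table, including columns, comments, and the dt partition column.
  DESC ods_log_info_d;
  DESC dw_user_info_all_d;
  DESC rpt_user_info_d;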

Design the workflow

For more information about how to configure the dependencies among nodes of a workflow, see Collect data.

In the DataStudio console, double-click the created workflow in the left-side navigation pane. On the workflow editing tab that appears, click and hold ODPS SQL on the left and drag it to the editing section on the right. In the Create Node dialog box that appears, set Node Name and click Commit.

Create three ODPS SQL nodes in total and name them ods_log_info_d, dw_user_info_all_d, and rpt_user_info_d respectively. Then, configure the dependencies among the nodes.

Create a UDF

  1. Create a resource.
    1. Download the ip2region.jar file and store it in a local directory.
    2. In the Data Analytics section, click the created workflow, right-click MaxCompute, and then choose Create > Resource > JAR.
    3. In the Create Resource dialog box that appears, click Upload and select the downloaded ip2region.jar file.
    4. Click OK.
  2. Register a user-defined function (UDF).
    1. In the Data Analytics section, click the created workflow, right-click MaxCompute, and then choose Create > Function.
    2. In the Create Function dialog box that appears, enter the name of the UDF, for example, getregion, and click Commit.
    3. On the configuration tab that appears, set parameters in the Register Function section and click the Save and Submit icons in sequence.
      • Function Type: The type of the UDF.
      • Engine Instance MaxCompute: The MaxCompute engine instance bound to the current workspace. By default, you cannot change the engine instance.
      • Function Name: The name of the UDF you entered when you created the UDF.
      • Owner: The owner of the UDF.
      • Class Name: The name of the class for implementing the UDF. Set the value to org.alidata.odps.udf.Ip2Region.
      • Resources: The resource you created. Set the value to ip2region.jar.
      • Description: The description of the UDF. Set the value to Region conversion based on the IP address.
      • Expression Syntax: The expression syntax of the UDF. Set the value to getregion('ip').
      • Parameter Description: The description of the parameter in the expression. Set the value to IP address.
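
After the function is submitted, you can optionally verify it before the ods_log_info_d node references it. The ad-hoc query below is only a sketch: the IP address is a made-up example, and the returned region depends on the data packaged in ip2region.jar.

  -- Sketch: call the registered UDF with a hypothetical IP address.
  SELECT getregion('1.2.3.4') AS region;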

Configure the ODPS SQL nodes

  1. Configure the ods_log_info_d node.
    1. Double-click the ods_log_info_d node.
    2. On the node editing tab that appears, enter the following SQL statements:
      INSERT OVERWRITE TABLE ods_log_info_d PARTITION (dt='${bdp.system.bizdate}')
      SELECT ip
        , uid
        , time
        , status
        , bytes 
        , getregion(ip) AS region -- Obtain the region based on the IP address by using the UDF.
        , regexp_substr(request, '(^[^ ]+ )') AS method -- Use regular expressions to extract the method, URL, and protocol from the request. A sanity-check sketch follows these steps.
        , regexp_extract(request, '^[^ ]+ (.*) [^ ]+$') AS url
        , regexp_substr(request, '([^ ]+$)') AS protocol
        , regexp_extract(referer, '^[^/]+://([^/]+){1}') AS referer -- Use the regular expression to clean the HTTP referer and obtain a more accurate source URL.
        , CASE
          WHEN TOLOWER(agent) RLIKE 'android' THEN 'android' -- Obtain the terminal and access types from the value of the agent parameter.
          WHEN TOLOWER(agent) RLIKE 'iphone' THEN 'iphone'
          WHEN TOLOWER(agent) RLIKE 'ipad' THEN 'ipad'
          WHEN TOLOWER(agent) RLIKE 'macintosh' THEN 'macintosh'
          WHEN TOLOWER(agent) RLIKE 'windows phone' THEN 'windows_phone'
          WHEN TOLOWER(agent) RLIKE 'windows' THEN 'windows_pc'
          ELSE 'unknown'
        END AS device
        , CASE
          WHEN TOLOWER(agent) RLIKE '(bot|spider|crawler|slurp)' THEN 'crawler'
          WHEN TOLOWER(agent) RLIKE 'feed'
          OR regexp_extract(request, '^[^ ]+ (.*) [^ ]+$') RLIKE 'feed' THEN 'feed'
          WHEN TOLOWER(agent) NOT RLIKE '(bot|spider|crawler|feed|slurp)'
          AND agent RLIKE '^(Mozilla|Opera)'
          AND regexp_extract(request, '^[^ ]+ (.*) [^ ]+$') NOT RLIKE 'feed' THEN 'user'
          ELSE 'unknown'
        END AS identity
        FROM (
          SELECT SPLIT(col, '##@@')[0] AS ip
          , SPLIT(col, '##@@')[1] AS uid
          , SPLIT(col, '##@@')[2] AS time
          , SPLIT(col, '##@@')[3] AS request
          , SPLIT(col, '##@@')[4] AS status
          , SPLIT(col, '##@@')[5] AS bytes
          , SPLIT(col, '##@@')[6] AS referer
          , SPLIT(col, '##@@')[7] AS agent
        FROM ods_raw_log_d
        WHERE dt = ${bdp.system.bizdate}
      ) a;
    3. Click the Save icon in the upper-left corner.
  2. Configure the dw_user_info_all_d node.
    1. Double-click the dw_user_info_all_d node.
    2. On the node editing tab that appears, enter the following SQL statements:
      INSERT OVERWRITE TABLE dw_user_info_all_d PARTITION (dt='${bdp.system.bizdate}')
      SELECT COALESCE(a.uid, b.uid) AS uid
        , b.gender
        , b.age_range
        , b.zodiac
        , a.region
        , a.device
        , a.identity
        , a.method
        , a.url
        , a.referer
        , a.time
      FROM (
        SELECT *
        FROM ods_log_info_d
        WHERE dt = ${bdp.system.bizdate}
      ) a
      LEFT OUTER JOIN (
        SELECT *
        FROM ods_user_info_d
        WHERE dt = ${bdp.system.bizdate}
      ) b
      ON a.uid = b.uid;
    3. Click the Save icon in the upper-left corner.
  3. Configure the rpt_user_info_d node.
    1. Double-click the rpt_user_info_d node.
    2. On the node editing tab that appears, enter the following SQL statements:
      INSERT OVERWRITE TABLE rpt_user_info_d PARTITION (dt='${bdp.system.bizdate}')
      SELECT uid
        , MAX(region)
        , MAX(device)
        , COUNT(0) AS pv
        , MAX(gender)
        , MAX(age_range)
        , MAX(zodiac)
      FROM dw_user_info_all_d
      WHERE dt = ${bdp.system.bizdate}
      GROUP BY uid;
    3. Click the Save icon in the upper-left corner.
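
Before you commit the workflow, you can sanity-check the regular expressions that the ods_log_info_d node uses to split the request field. The query below is only a sketch that runs against a hypothetical request string; it should return GET (with a trailing space) as the method, /index.html as the URL, and HTTP/1.1 as the protocol.

  -- Sketch: the sample request string is hypothetical.
  SELECT regexp_substr('GET /index.html HTTP/1.1', '(^[^ ]+ )') AS method
    , regexp_extract('GET /index.html HTTP/1.1', '^[^ ]+ (.*) [^ ]+$') AS url
    , regexp_substr('GET /index.html HTTP/1.1', '([^ ]+$)') AS protocol;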

Commit the workflow

  1. On the workflow editing tab, click the Submit icon to commit the nodes configured in the workflow.
  2. In the Commit dialog box that appears, select the nodes to be committed, enter the description, and then select Ignore I/O Inconsistency Alerts.
  3. Click Commit.

Run the workflow

  1. On the workflow editing tab, click the Run icon to verify the logic of node code.
  2. After all nodes are run and a green check mark (✓) appears next to each node, click Ad-Hoc Query on the navigation submenu. The Ad-Hoc Query section appears.
  3. Right-click Ad-Hoc Query and choose Create Node > ODPS SQL.
  4. In the Create Node dialog box that appears, enter the node name and click Commit. On the tab that appears, run an SQL query statement to query the node running result and check whether required data is generated.
    Query the node running result

    Use the following SQL statement and replace Data timestamp with the data timestamp of the node. By default, the data timestamp of a node is one day before the node is run.

    -- View data in the rpt_user_info_d table.
    SELECT * FROM rpt_user_info_d WHERE dt='Data timestamp' LIMIT 10;
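
    Because the rpt_user_info_d node groups records by uid, the partition should contain one row per user. If you prefer an aggregate check to sampling rows, the following sketch works; replace Data timestamp with the same value as above.

    -- Sketch: the row count and the distinct user count should match because the data is grouped by uid.
    SELECT COUNT(*) AS row_cnt
      , COUNT(DISTINCT uid) AS uid_cnt
    FROM rpt_user_info_d
    WHERE dt='Data timestamp';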

Deploy the workflow

After you commit the workflow, the nodes in the workflow are in the development environment. You must deploy the configured nodes in the production environment because nodes in the development environment cannot be automatically scheduled.

Note
  • Before deploying the nodes in the production environment, test the node code to make sure that the code is correct.
  • In a workspace in basic mode, the Deploy icon is unavailable for a workflow. You can click Go to Operation Center to schedule nodes in Operation Center.
  1. On the workflow editing tab, click the Deploy icon.
  2. On the Deploy page that appears, select the nodes to be deployed and click Add to List.
  3. Click To-Be-Deployed Node List in the upper-right corner. In the Nodes to Deploy dialog box, click Deploy All.
  4. In the Create Deploy Task dialog box that appears, click Deploy.
  5. Click Release Package in the left-side navigation pane to view the deployment status.

Run the nodes in the production environment

  1. After you deploy the workflow, click Operation Center in the upper-right corner.
    You can also click Go to Operation Center in the toolbar on the workflow editing tab to go to the Operation Center page.
  2. On the Operation Center page, choose Cycle Task Maintenance > Cycle Task in the left-side navigation pane. On the Cycle Task page, click the workshop workflow.
  3. In the directed acyclic graph (DAG) on the right, right-click the workshop_start node and choose Run > Current and Descendant Nodes Retroactively.
  4. In the Patch Data dialog box that appears, select nodes to generate retroactive data, specify the data timestamp, and then click OK. The Patch Data page appears.
  5. Click Refresh until all retroactive data generation instances are in the Successful state.

What to do next

Now, you have learned how to create SQL nodes and process raw logs. You can proceed with the next tutorial to learn how to monitor and guarantee the quality of the data generated by the developed nodes. For more information, see Monitor data quality.