
Step 5: Data statistics and analysis

Last Updated: May 04, 2018

After completing the operations in the preceding tutorial, you have synchronized data from the RDS data source to a MaxCompute table. This tutorial describes how to perform statistical analysis of that data with MaxCompute SQL, MapReduce (MR), and other tools.

Procedure

Create target table

The goal of this example is to find, through statistical analysis, the five resale houses with the highest average price in each city and the district (area) each house belongs to. The required data includes the city name, house name, average price, rank, and area. First, create the target table.

  1. Log on to the DataWorks console as the project administrator, and click Enter Workspace for the relevant project in the Project List.

  2. Go to the Data Development page from the upper menu, click New, and select New Table.

  3. Enter the following table creation statement on the New Table page, and click OK.

```sql
CREATE TABLE IF NOT EXISTS house_unit_price_top5 (
    house_city STRING,
    house_community_name STRING,
    house_unit_price_all DOUBLE,
    area STRING,
    tops BIGINT
)
PARTITIONED BY (
    pt STRING
);
```
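
Optionally, you can confirm the schema afterwards from an SQL script; a minimal sketch using the standard DESC command:

```sql
-- Optional check: the output should list the five columns above plus
-- the pt partition column.
DESC house_unit_price_top5;
```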

Create a task for statistical analysis of the data

  1. Go to the Data Development page from the upper menu, click New, and select New Task.

  2. Create an ODPS_SQL node task.

Edit the SQL code

On the ODPS_SQL node task page, edit the following SQL code:

```sql
-- Generate a temporary table with the average price of each house in each city.
-- The partition value is the one configured for the corresponding data import
-- task. Make sure the task reads the most recent partition imported each day.
DROP TABLE IF EXISTS t_house_unit_price_info;
CREATE TABLE IF NOT EXISTS t_house_unit_price_info
AS
SELECT house_city, house_community_name, AVG(house_unit_price) AS house_unit_price_all
FROM house_basic_info
WHERE pt = '${bdp.system.bizdate}'
GROUP BY house_city, house_community_name;

-- Split the house_region field to output the district name as area, and store
-- the result in a temporary table.
-- The partition value is the one configured for the corresponding data import
-- task. Make sure the task reads the most recent partition imported each day.
DROP TABLE IF EXISTS t_house_area;
CREATE TABLE IF NOT EXISTS t_house_area
AS
SELECT DISTINCT house_city, house_community_name, split_part(house_region, ' ', 1) AS area
FROM house_basic_info
WHERE pt = '${bdp.system.bizdate}';

-- Generate the final target table: the resale houses with the top 5 average
-- prices in each city every day, together with the area of each house.
-- Make sure the date partition value generated when the task runs matches the
-- date of the source table data.
INSERT OVERWRITE TABLE house_unit_price_top5 PARTITION (pt='${bdp.system.bizdate}')
SELECT a.house_city, a.house_community_name, a.house_unit_price_all, b.area, a.tops
FROM (
    SELECT house_city, house_community_name, house_unit_price_all,
           ROW_NUMBER() OVER (PARTITION BY house_city ORDER BY house_unit_price_all DESC) AS tops
    FROM t_house_unit_price_info
) a
JOIN t_house_area b
ON a.house_city = b.house_city
   AND a.house_community_name = b.house_community_name
   AND a.tops < 6;
```
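
After these statements run, you can optionally spot-check the two temporary tables before the final insert; a minimal hedged sketch (table names taken from the code above):

```sql
-- Both temporary tables derive from the same partition of house_basic_info,
-- so their row counts should be close (one row per city/house combination).
SELECT COUNT(*) FROM t_house_unit_price_info;
SELECT COUNT(*) FROM t_house_area;
```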

Note:

MaxCompute SQL adopts syntax similar to that of common SQL and can be considered a subset of standard SQL. However, MaxCompute SQL lacks many features that a common database provides, such as transactions, primary key constraints, and indexes.

Some differences in DDL syntax have been briefly introduced in the sections describing how to import data into a target table. The main points about the DML statements above are as follows:

  • The statement that generates the temporary average price table of each house in each city can also run directly on MySQL once the pt condition in the WHERE clause is modified.

  • In the statement that splits the house_region field, split_part() is a string function provided by MaxCompute that can be used directly in SQL; it corresponds to substring_index() and similar string functions in MySQL (see the sketch after this list).

  • In the statement that generates the target table, ROW_NUMBER() is a window function provided by MaxCompute, used here to compute the ranking. It can be used directly in SQL; MySQL provides no corresponding function.

  • In the statement that generates the target table, the keyword “table” is required after “insert overwrite” (or “insert into”), but is not required in MySQL or Oracle.

  • For more differences between MaxCompute SQL and common SQLs, see Differences with other SQLs.
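
As a hedged illustration of the split_part() point above (the input string is made up for the example; house_region is assumed to hold space-separated values such as district and sub-district names):

```sql
-- MaxCompute: split_part(str, separator, n) returns the n-th part,
-- so this takes the first space-separated token.
SELECT split_part('Xuhui Xujiahui', ' ', 1);          -- 'Xuhui'

-- Rough MySQL equivalent: everything before the first separator.
-- SELECT substring_index('Xuhui Xujiahui', ' ', 1);  -- 'Xuhui'
```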

Scheduling configuration and parameter configuration

After editing the code, click Run to execute the SQL and inspect the result. After verifying that no errors exist, configure scheduling. The configuration mainly includes the scheduling property and the dependency property:

  • Scheduling property: Keep the default configuration for daily scheduling.

  • Dependency property: The data processed by this task is produced by the data import task “house_basic_info”, so the import task must be set as the upstream (parent) task of this task to ensure that the data import has finished before this task runs.

Note:

This task only uses the system parameter ${bdp.system.bizdate}, which is automatically replaced when scheduled tasks run, so no other parameter configuration is needed. For more information, see System parameters.
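
A minimal sketch of the substitution, assuming the default behavior where the business date resolves to the day before the scheduled run date in yyyymmdd format:

```sql
-- As written in the task:
SELECT COUNT(*) FROM house_basic_info WHERE pt = '${bdp.system.bizdate}';

-- What the scheduler submits for an instance running on 2018-05-05
-- (assumed default business-date behavior):
SELECT COUNT(*) FROM house_basic_info WHERE pt = '20180504';
```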

Save and submit

  1. Click Save and Submit to save and submit the task to the scheduling system.

  2. Click Go to O&M from the upper-right corner of the workspace to check the workflow status in the O&M center.

Run the task

This is similar to running the data import task. After the run completes successfully, you can check the target table data with an SQL script in the Data Development module. The target table is now produced. However, a MaxCompute SQL task waits some time for scheduling when it runs, which makes it suitable for batch processing of big data but not for serving reads from a website front end. Therefore, you must feed the target table back to the business database of the website.
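
For example, a hedged preview of the output (the partition value is illustrative; substitute an actual business date from your runs):

```sql
-- Top-5 houses per city for one day's partition.
SELECT house_city, house_community_name, house_unit_price_all, area, tops
FROM house_unit_price_top5
WHERE pt = '20180504'
ORDER BY house_city, tops
LIMIT 20;
```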

What to do next

Now you have learned how to process data with MaxCompute SQL and produce the final target table. You can continue with the next course to learn how to feed the target table back to the business database of the website. For more information, see Data feeding back.
