
Demo introduction

Last Updated: Mar 27, 2018

Example background

The following example explains how to perform a simple statistical analysis on website data by using DTplus MaxCompute and DataWorks.

With this example, you can quickly get started with MaxCompute for big data development, gain a brief understanding of the big data ETL process on MaxCompute, and learn about the fundamental differences between the SQL of common databases and MaxCompute SQL.

Target users

This example is intended for MaxCompute beginners, especially those who have experience with databases but are new to big data development.

Example description

Ranking lists are common on real estate websites, such as the projects with the most contracts in the past 30 days or the projects with the highest contract prices. This example statistically analyzes the information table of resale houses (house_basic_info) to find the five resale house projects with the lowest average prices in each city, together with their districts, and displays this data on the real estate websites.

Demand analysis

Core objective

The core objective is to find the five projects with the lowest average resale house prices in each city, together with their districts. The output contains the city, project, average price, ranking, and district.

Current data status

  • In the information table, each project may have multiple records, each with its own average price. This example calculates only the mean of those average prices across the entire project.

  • The house_region in the information table contains the address information including the district and street, from which the district information must be extracted.

  • The data changes every day, and each day's calculation covers the full data set.
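
The requirements above can be sketched in a few lines of Python. This is only an illustration of the logic, not the MaxCompute SQL used later in the example. The sample rows, the field names city, project, and avg_price, and the rule that the district is the first whitespace-separated token of house_region are all assumptions made for illustration. The sketch also assumes that the ranking is computed within each city, with the district carried along as an output column.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical sample rows standing in for house_basic_info.
# Only the table name and house_region come from the example text;
# every other field name and all values are made up.
rows = [
    {"city": "Hangzhou", "house_region": "Xihu Wensan Road", "project": "P1", "avg_price": 30000.0},
    {"city": "Hangzhou", "house_region": "Xihu Wensan Road", "project": "P1", "avg_price": 32000.0},
    {"city": "Hangzhou", "house_region": "Xihu Gudun Road", "project": "P2", "avg_price": 20000.0},
    {"city": "Hangzhou", "house_region": "Binjiang Jiangnan Avenue", "project": "P3", "avg_price": 25000.0},
    {"city": "Hangzhou", "house_region": "Gongshu Moganshan Road", "project": "P4", "avg_price": 18000.0},
    {"city": "Hangzhou", "house_region": "Xihu Tianmushan Road", "project": "P5", "avg_price": 40000.0},
    {"city": "Hangzhou", "house_region": "Binjiang Binsheng Road", "project": "P6", "avg_price": 22000.0},
    {"city": "Shanghai", "house_region": "Pudong Century Avenue", "project": "Q1", "avg_price": 50000.0},
]


def extract_district(house_region):
    """Return the district, assuming it is the first whitespace-separated token."""
    parts = house_region.split()
    return parts[0] if parts else ""


def top_n_lowest(rows, n=5):
    """Rank projects by mean average price (lowest first) within each city."""
    # Collect every recorded average price per (city, district, project).
    prices = defaultdict(list)
    for row in rows:
        key = (row["city"], extract_district(row["house_region"]), row["project"])
        prices[key].append(row["avg_price"])

    # Reduce each project to the mean of its average prices, grouped by city.
    by_city = defaultdict(list)
    for (city, district, project), values in prices.items():
        by_city[city].append((mean(values), project, district))

    # Keep the n cheapest projects per city and attach a 1-based ranking.
    result = []
    for city in sorted(by_city):
        ranked = sorted(by_city[city])[:n]
        for ranking, (avg_price, project, district) in enumerate(ranked, start=1):
            result.append({
                "city": city,
                "project": project,
                "avg_price": avg_price,
                "ranking": ranking,
                "district": district,
            })
    return result


for record in top_n_lowest(rows):
    print(record)
```

If the ranking should instead be computed within each district, only the grouping key in the second step needs to change from the city to the pair of city and district.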

Procedure

Step 1: Prepare the data

Step 2: Configure RDS data sources

Step 3: Configure data synchronization tasks

Step 4: Perform data import tasks

Step 5: Perform statistical analysis on data

Step 6: Perform the data backflow

Data backflow means sending the result tables back to the business system of the websites so that the frontend can call and display the data directly.

Conclusion

You can learn the following from the statistical analysis in this example:

  • DataWorks (formerly Data IDE) is a web tool based on MaxCompute for interface operations, data integration, and task scheduling. MaxCompute provides the computing and storage services.

  • After a MaxCompute SQL job is submitted, it takes anything from several seconds to several minutes to queue and schedule the job. Therefore, this service is suitable for batch jobs, where each job processes a massive volume of data. It is not suitable for frontend business systems that must process several thousand or tens of thousands of transactions per second.

  • MaxCompute SQL uses a syntax similar to standard SQL and can be regarded as a subset of it. However, MaxCompute SQL is not equivalent to a database, because it lacks many database features, such as transactions, primary key constraints, and indexes. For more differences, see Syntax Differences from Other SQLs.

  • Data synchronization in DataWorks can transfer data between RDS and MaxCompute in different regions without special processing.

For more advanced components (such as MapReduce and Graph), see the MaxCompute Documentation.
