
Build a data warehouse on Alibaba Cloud

Last Updated: Dec 04, 2018

This tutorial demonstrates how to build a Data Warehouse on Alibaba Cloud. It does not cover Data Warehouse theory and methodology; instead, it focuses on how to ingest data from different sources with DataIDE and how to analyze the data with MaxCompute SQL.

Introduction

A Data Warehouse is a central repository of integrated data from one or more disparate sources. It stores current and historical data in one place and helps to create analytical reports for knowledge workers throughout the enterprise. The basic architecture is as follows.


The following figure shows how the Data Warehouse works in Alibaba Cloud.


Prerequisites

Basic knowledge about Data Warehouse, Alibaba Cloud RDS, OSS, and SQL.

  1. Get data prepared on RDS.


    Load the data into the table tmall_user_brand (a sample source-table definition is sketched after this list).


  2. Get data prepared on OSS.


    Upload the data by using View URL.
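
For reference, the source table on RDS (MySQL) might be defined roughly as follows. This is only a sketch: the four column names come from the demo data, while the VARCHAR lengths are assumptions.

-- Assumed MySQL definition of the source table on the RDS instance.
CREATE TABLE IF NOT EXISTS tmall_user_brand (
  user_id        VARCHAR(32),
  brand_id       VARCHAR(32),
  type           VARCHAR(8)  COMMENT 'click-0, buy-1, collect-2, shopping_cart-3',
  visit_datetime VARCHAR(32)
);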

Build a Data Warehouse on Alibaba Cloud

Data Ingestion from RDS

Log on to the DTplus console, and use DataWorks to collect the data.

  1. Set DataSource.

    1. Add a whitelist to RDS.


    2. Set the data source of RDS.


    3. Click Test connectivity to check if it works.

    4. Similarly, create the data source for MaxCompute.


  2. Set the task.

    Use the Wizard mode.


    1. Create a destination table on MaxCompute and set the target, then go to Field Mapping.
    DROP TABLE IF EXISTS tmall_user_brand;
    CREATE TABLE tmall_user_brand (
    user_id string,
    brand_id string,
    type string COMMENT 'click-0, buy-1, collect-2, shopping_cart-3',
    visit_datetime string
    );


    2. Go through the procedure, and save the task as dw_demo_rds2max.

      Alternatively, use the Script mode after you are familiar with it.

      {
        "type": "job",
        "version": "1.0",
        "configuration": {
          "reader": {
            "plugin": "mysql",
            "parameter": {
              "datasource": "dw_demo_rds",
              "where": "",
              "splitPk": "",
              "connection": [
                {
                  "table": [
                    "`tmall_user_brand`"
                  ],
                  "datasource": "dw_demo_rds"
                }
              ],
              "connectionTable": "`tmall_user_brand`",
              "column": [
                "user_id",
                "brand_id",
                "type",
                "visit_datetime"
              ]
            }
          },
          "writer": {
            "plugin": "odps",
            "parameter": {
              "datasource": "dw_demo_maxcompute",
              "table": "tmall_user_brand",
              "truncate": true,
              "partition": "",
              "column": [
                "user_id",
                "brand_id",
                "type",
                "visit_datetime"
              ]
            }
          },
          "setting": {
            "speed": {
              "mbps": "1",
              "concurrent": "1"
            }
          }
        }
      }
    3. Click Run to load the data from RDS to MaxCompute; a quick verification query is sketched below.

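      To verify the sync, a quick count on the MaxCompute side is usually enough. A minimal sketch, to be run in DataIDE or the MaxCompute console:

      -- Assumed verification query against the destination table.
      SELECT COUNT(*) FROM tmall_user_brand;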

Data Ingestion from OSS

  • You can ingest data from OSS to MaxCompute as previously mentioned. DataIDE supports the following data sources.


    Note: ODPS is now called MaxCompute, and OTS is now called Table Store.

  • Alternatively, you can leverage the new MaxCompute 2.0 feature that enables MaxCompute to access OSS data directly, without loading the data into MaxCompute first.

    1. Grant MaxCompute access to OSS.

      Log on to the RAM console to confirm the authorization.

      Alternatively, log on to the RAM console, create a role AliyunODPSDefaultRole, and attach the following policy.

      {
        "Version": "1",
        "Statement": [
          {
            "Action": [
              "oss:ListBuckets",
              "oss:GetObject",
              "oss:ListObjects",
              "oss:PutObject",
              "oss:DeleteObject",
              "oss:AbortMultipartUpload",
              "oss:ListParts"
            ],
            "Resource": "*",
            "Effect": "Allow"
          }
        ]
      }
    2. Create an EXTERNAL MaxCompute table for OSS.

      CREATE EXTERNAL TABLE IF NOT EXISTS test_oss_external
      (
      user_id string,
      brand_id string,
      type string,
      visit_datetime string
      )
      STORED BY 'com.aliyun.odps.CsvStorageHandler'
      WITH SERDEPROPERTIES (
      'odps.properties.rolearn'='acs:ram::<UID>:role/aliyunodpsdefaultrole'
      )
      LOCATION 'oss://oss-cn-shanghai-internal.aliyuncs.com/bigcompute-demo/data/dw_demo/';

      Note: The location is defined as LOCATION 'oss://${endpoint}/${bucket}/${userPath}/'. The data on OSS is in CSV format, so the built-in CsvStorageHandler is used. Otherwise, you can implement your own handler to cope with the data.


    3. Though it is labeled as an external table, you can query the data just like a common MaxCompute table.
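
      As a minimal sketch, the following query groups the OSS-backed rows by action type; the expected counts depend on your own sample data.

      -- Sanity check on the external table backed by OSS data.
      SELECT type, COUNT(*) FROM test_oss_external GROUP BY type;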


Data Ingestion from Local

You can use the tunnel command in the MaxCompute console to upload data from a local file to MaxCompute, or download data from MaxCompute to a local file (a download sketch follows the upload example).

CREATE TABLE IF NOT EXISTS demo_tmall_local (
user_id string,
brand_id string,
type string COMMENT 'click-0, buy-1, collect-2, shopping_cart-3',
visit_datetime string
);
tunnel upload /home/admin/demo/data/tmall_result.csv demo_tmall_local;
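
The reverse direction works the same way. A minimal sketch of downloading the table back to a local file (the local path here is an assumption):

tunnel download demo_tmall_local /home/admin/demo/data/demo_tmall_local.csv;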


Note: To simplify the process, the same copy of sanitized Tmall data is used in the aforementioned demos to illustrate how to ingest data from different sources. In real cases, the data is more diverse and complex.

Similarly, a copy of sanitized web server access log is loaded to table ods_log_tracker (a loading sketch follows the table definition).

DROP TABLE IF EXISTS ods_log_tracker;
CREATE TABLE ods_log_tracker(
ip STRING COMMENT 'client ip address',
user STRING,
time DATETIME,
request STRING COMMENT 'HTTP request type + requested path without args + HTTP protocol version',
status BIGINT COMMENT 'HTTP response code from server',
size BIGINT,
referrer STRING,
agent STRING)
PARTITIONED BY(dt STRING);
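
Assuming the raw access log has already been parsed into comma-separated fields in the column order above, loading one day of data into a partition might look like the following sketch (the local file name and the partition value are assumptions):

-- Create the target daily partition, then upload the assumed local file into it.
ALTER TABLE ods_log_tracker ADD IF NOT EXISTS PARTITION (dt='20180101');
tunnel upload /home/admin/demo/data/access_log_20180101.csv ods_log_tracker/dt="20180101";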

The data is shown as follows.


This data is used to illustrate how to build a Data Warehouse on MaxCompute.

Data processing

After the source data is ready, you can design the Data Warehouse topics and models, and create the fact tables, dimension tables, and aggregate data mart tables.

-- fact table
DROP TABLE IF EXISTS dw_log_detail;
CREATE TABLE IF NOT EXISTS dw_log_detail(
ip STRING COMMENT 'client ip address',
time DATETIME,
method STRING COMMENT 'HTTP request type, such as GET POST...',
url STRING,
protocol STRING,
status BIGINT COMMENT 'HTTP response code from server',
size BIGINT,
referrer STRING COMMENT 'referrer domain',
agent STRING,
device STRING COMMENT 'android|iphone|ipad...',
identity STRING COMMENT 'identify: user, crawler, feed')
PARTITIONED BY(dt STRING);
DROP TABLE IF EXISTS dw_log_fact;
CREATE TABLE IF NOT EXISTS dw_log_fact(
uid STRING COMMENT 'unique user id',
time DATETIME,
method STRING COMMENT 'HTTP request type, such as GET POST...',
url STRING,
status BIGINT COMMENT 'HTTP response code from server',
size BIGINT,
referrer STRING)
PARTITIONED BY(dt STRING);
-- dim table
DROP TABLE IF EXISTS dim_user_info;
CREATE TABLE IF NOT EXISTS dim_user_info(
uid STRING COMMENT 'unique user id',
ip STRING COMMENT 'client ip address',
city STRING,
device STRING,
protocol STRING,
identity STRING COMMENT 'user, crawler, feed',
agent STRING);
-- aggregate table
DROP TABLE IF EXISTS adm_refer_info;
CREATE TABLE adm_refer_info(
referrer STRING,
count BIGINT)
PARTITIONED BY(dt STRING);
DROP TABLE IF EXISTS adm_user_measures;
CREATE TABLE IF NOT EXISTS adm_user_measures(
device STRING COMMENT 'such as android, iphone, ipad...',
pv BIGINT,
uv BIGINT)
PARTITIONED BY(dt STRING);
  1. Run these SQL statements in the MaxCompute console or DataIDE.

  2. Process the data with SQL. The following steps use DataIDE:

  3. Go to the Data Development panel and create a Workflow task.


  4. Workflow nodes support OPEN_MR, ODPS_SQL, Data Sync, Script, and so on. Drag the nodes to generate the flow.


  5. Double-click the node to edit the SQL (a sample node SQL is sketched after this list).


  6. Submit the flow and test run it. Go to the O&M center and check the running results.

    For each task, you can check the running log and the Logview.

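The node SQL depends on your own business logic. As a minimal sketch, the two aggregate tables might be filled from dw_log_fact and dim_user_info roughly as follows; the DataWorks scheduling parameter ${bdp.system.bizdate} and the join on uid are assumptions based on the table definitions above.

-- Sketch of an ODPS_SQL node: daily referrer counts.
INSERT OVERWRITE TABLE adm_refer_info PARTITION (dt='${bdp.system.bizdate}')
SELECT referrer, COUNT(*)
FROM dw_log_fact
WHERE dt = '${bdp.system.bizdate}'
GROUP BY referrer;

-- Sketch of an ODPS_SQL node: daily pv/uv per device.
INSERT OVERWRITE TABLE adm_user_measures PARTITION (dt='${bdp.system.bizdate}')
SELECT u.device, COUNT(*) AS pv, COUNT(DISTINCT f.uid) AS uv
FROM dw_log_fact f
JOIN dim_user_info u ON f.uid = u.uid
WHERE f.dt = '${bdp.system.bizdate}'
GROUP BY u.device;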

Data representation

You can use Quick BI to visualize the data.

  1. Log on to the Quick BI console, create data collections from data sources.

  2. Select the template, and set the measures and dimensions (see the sketch below).
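
For example, when adm_user_measures is used as the data source, device and dt typically serve as dimensions while pv and uv serve as measures. A per-day, per-device chart is then roughly equivalent to the following sketch:

-- Roughly the aggregation behind a per-day, per-device pv/uv chart.
SELECT dt, device, SUM(pv) AS pv, SUM(uv) AS uv
FROM adm_user_measures
GROUP BY dt, device;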

Conclusion

You can leverage DataIDE to build a Data Warehouse on MaxCompute. Compared with Hadoop, MaxCompute works out of the box: you do not have to worry about cluster setup and management. DataWorks provides an easy-to-use GUI to handle the whole process, which makes things easier for beginners.

Further reading

For more information about how to access OSS from MaxCompute, see Access OSS data.

For more information about MaxCompute, see MaxCompute.

For more information about DataIDE, see DataIDE.
