By Yi Xiu
You may have encountered problems with partition or scheduling parameter settings when you ship data to MaxCompute by using DataWorks. This article walks through a simulated real-world case to show how to solve these problems.
Official help document: https://www.alibabacloud.com/help/doc-detail/68322.html
Create a LogHub data source
Step 1. Go to Data Integration and then click the Data Source tab.
Step 2. In the upper-right corner, click Add Data Source and choose Message Queue > LogHub.
Step 3. Enter the required fields in the Add LogHub Data Source dialog box: Data Source Name, LogHub Endpoint, Project, AccessKey ID, and AccessKey Secret. Then click Test Connectivity.
Create the destination tables in MaxCompute
Step 1. Click Temporary Query in the left-side navigation pane. Right-click anywhere on the Query page and choose Create > ODPS SQL.
Step 2. Write the DDL statements for creating the tables (see the sketch after this procedure).
Step 3. Click the Run button to create the destination tables: ods_client_operation_log, ods_vedio_server_log, and ods_web_tracking_log.
Step 4. When the message "shell run successfully!" appears, the three DDL statements have been run successfully.
Step 5. Use the desc command to view the created tables. Run desc against the other two tables as well to confirm that they exist.
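For reference, the DDL for one of the destination tables might look like the following sketch. The column list is an assumption for illustration only; only the partition layout (ds, hr, and min) is taken from the scheduling parameters configured later in this article.

    -- Hypothetical schema; the partition columns ds, hr, and min match
    -- the partition parameters set in the synchronization task below.
    CREATE TABLE IF NOT EXISTS ods_web_tracking_log (
        client_ip   STRING COMMENT 'IP address of the visitor',
        request_uri STRING COMMENT 'requested page',
        user_agent  STRING COMMENT 'browser user agent'
    )
    PARTITIONED BY (ds STRING, hr STRING, min STRING);

    -- View the schema of a created table:
    desc ods_web_tracking_log;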
Create a data synchronization task
After creating the data source and testing the connectivity in DataWorks, you can synchronize data from the data source to MaxCompute through a data synchronization task.
Procedure
Step 1. Click Create Business Flow and then click Confirm. Name the business flow ApsaraVideo Live log collection.
Step 2. On the Business Flow Development panel, create the following data synchronization nodes in sequence and configure their dependencies: web_tracking_log_syn, client_operation_log_syn, and vedio_server_log_syn.
Step 3. Double-click web_tracking_log_syn to open the node configuration page. The configuration items include Data Source (Source and Destination), Mappings (Source Table and Destination Table), and Channel.
Specify the parameters based on the data collection window as follows:
Set the consumption checkpoint to once every five minutes. For the window from 00:00 to 23:59, set startTime=[yyyymmddhh24miss-10/24/60] (10 minutes before the system time) and endTime=[yyyymmddhh24miss-5/24/60] (5 minutes before the system time); note that this window differs from the consumption checkpoint shown in the preceding figure. Then set ds=[yyyymmdd-5/24/60], hr=[hh24-5/24/60], and min=[mi-5/24/60]. For example, for a hypothetical instance time of 10:10:00 on 2018-10-23, these expressions resolve to startTime=20181023100000, endTime=20181023100500, ds=20181023, hr=10, and min=05.
Step 4. Click Advanced run to test the node. You can manually enter custom parameter values for the test run.
Step 5. Use an SQL script to verify that the data has been written to the destination table, as shown in the following figure.
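As an illustration, the verification query can be as simple as the following; the partition values are hypothetical and correspond to the resolved example parameters above.

    -- Count the rows landed in one five-minute partition (hypothetical values):
    SELECT COUNT(*) AS row_cnt
    FROM ods_web_tracking_log
    WHERE ds = '20181023' AND hr = '10' AND min = '05';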
After the logs in Log Service are synchronized to MaxCompute, you can proceed with data processing.
For example, you can compute statistics such as top channels, region distribution, and buffering lag.
The detailed SQL logic is not elaborated here; implement the statistical analysis based on your actual business needs. The dependency relationship is configured as shown in the preceding figure.
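As a rough illustration only, a top-channels statistic might be sketched as follows; the channel_id column and the partition value are assumptions for this sketch, not the schema from this article.

    -- Hypothetical sketch: top 10 channels by request count for one day
    SELECT channel_id, COUNT(*) AS views
    FROM ods_vedio_server_log
    WHERE ds = '20181023'
    GROUP BY channel_id
    ORDER BY views DESC
    LIMIT 10;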