By Yi Xiu
When you ship data to MaxCompute by using DataWorks, you may run into problems setting the partition parameters or the DataWorks scheduling parameters. This article works through these problems by simulating a real case, as follows:
Official help document: https://www.alibabacloud.com/help/doc-detail/68322.html
Step 1. Go to Data Integration and then go to the Data Source tab page.
Step 2. In the upper right corner, click Add Data Source and choose Message Queue > LogHub.
Step 3. Enter the required fields in the Add LogHub Data Source dialog box: Data Source Name, LogHub Endpoint, Project, AccessKey ID, and AccessKey Secret. Then click Test Connectivity.
Step 1. Click Temporary Query in the left-side navigation pane. Right-click anywhere on the Query page, and select Create > ODPS SQL.
Step 2. Write the DDL statement for creating the tables.
Step 3. Click the Run button to create the destination tables: ods_client_operation_log, ods_vedio_server_log, and ods_web_tracking_log.
Step 4. When you see the message "shell run successfully!", these three DDL statements have been run successfully.
Step 5. Use the desc command to view the created tables.
You can use the desc command to view the other two tables, and ensure they exist.
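The DDL statements from Step 2 are not reproduced in this article. As a rough sketch only, the following Python snippet renders a CREATE TABLE statement for each of the three destination tables. It assumes the tables are partitioned by ds, hr, and min, which are the partition parameters set in the synchronization task in this article. The single content column is a hypothetical placeholder, not the real schema, and the render_ddl helper is invented for illustration.

```python
# Destination tables named in this article.
TABLES = ["ods_client_operation_log", "ods_vedio_server_log", "ods_web_tracking_log"]

# Hypothetical placeholder schema; replace it with the fields of your own logs.
PLACEHOLDER_COLUMNS = [("content", "STRING")]


def render_ddl(table, columns):
    """Render a CREATE TABLE statement partitioned by ds, hr, and min (assumed layout)."""
    column_lines = ",\n".join(f"    {name} {col_type}" for name, col_type in columns)
    return (
        f"CREATE TABLE IF NOT EXISTS {table} (\n"
        f"{column_lines}\n"
        ")\n"
        # Assumption: `min` is quoted as a precaution because it is also a function name.
        "PARTITIONED BY (ds STRING, hr STRING, `min` STRING);"
    )


if __name__ == "__main__":
    for table in TABLES:
        print(render_ddl(table, PLACEHOLDER_COLUMNS))
        print()
```

Paste the printed statements into the ODPS SQL editor only after replacing the placeholder column with your actual log fields.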
After creating the data source and testing the connectivity in DataWorks, you can synchronize data from the data source to MaxCompute through a data synchronization task.
Step 1. Click Create Business Flow, name the business flow ApsaraVideo Live log collection, and then click Confirm.
Step 2. On the Business Flow Development panel, create the following dependencies in sequence.
Configure the data synchronization nodes as follows: web_tracking_log_syn, client_operation_log_syn, and vedio_server_log_syn.
Step 3. Double-click web_tracking_log_syn to open the node configuration page. The configuration items include: Data Source (Source and Destination), Mappings (Source Table and Destination Table), and Channel.
Specify the parameters based on the data collection window as follows:
Consume data once every five minutes, from 00:00 to 23:59. Each run reads the five-minute window that starts 10 minutes before the system time and ends five minutes before the system time:
startTime=[yyyymmddhh24miss-10/24/60] (10 minutes before the system time)
endTime=[yyyymmddhh24miss-5/24/60] (five minutes before the system time)
Note that these times are different from the consumption checkpoint shown in the preceding figure. Then set the partition parameters to the same point in time as endTime: ds=[yyyymmdd-5/24/60], hr=[hh24-5/24/60], min=[mi-5/24/60].
Step 4. Click Advanced run to perform testing.
You can perform testing by manually entering custom parameters.
Step 5. Use a SQL script to verify that the data has been written into the destination table, as shown in the following figure.
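To make the window arithmetic concrete, here is a minimal Python sketch of what the parameter expressions evaluate to for a given system time, plus a verification query for the matching partition. It assumes that an offset such as -5/24/60 means "minus five minutes", that the destination table is partitioned by ds, hr, and min, and that a simple row count is enough for the check. The function names are hypothetical, and the query is not the script shown in the article's figure.

```python
from datetime import datetime, timedelta


def window_parameters(system_time):
    """Compute the values described above for one scheduled run."""
    start = system_time - timedelta(minutes=10)  # [yyyymmddhh24miss-10/24/60]
    end = system_time - timedelta(minutes=5)     # [yyyymmddhh24miss-5/24/60]
    return {
        "startTime": start.strftime("%Y%m%d%H%M%S"),
        "endTime": end.strftime("%Y%m%d%H%M%S"),
        "ds": end.strftime("%Y%m%d"),  # [yyyymmdd-5/24/60]
        "hr": end.strftime("%H"),      # [hh24-5/24/60]
        "min": end.strftime("%M"),     # [mi-5/24/60]
    }


def verification_query(table, params):
    """Build a row-count query for the partition that a run writes to (assumed layout)."""
    return (
        f"SELECT COUNT(*) FROM {table} "
        f"WHERE ds = '{params['ds']}' AND hr = '{params['hr']}' "
        f"AND `min` = '{params['min']}';"
    )


if __name__ == "__main__":
    # A run just after midnight shows how the date and hour roll back.
    params = window_parameters(datetime(2020, 4, 26, 0, 3, 0))
    for name, value in params.items():
        print(f"{name}={value}")
    print(verification_query("ods_web_tracking_log", params))
```

For a system time of 00:03:00 on April 26, 2020, the sketch gives startTime=20200425235300, endTime=20200425235800, ds=20200425, hr=23, and min=58. These are the kind of values you would enter by hand when testing with custom parameters.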
After synchronizing the logs from Log Service to MaxCompute, you can proceed with the data processing.
For example, you can record the statistics of top channels, region distribution, and buffering lag.
The detailed SQL logic will not be elaborated here. You can implement statistical analysis based on your actual business needs. The dependency relationship is configured as shown in the preceding figure.