
Archive Log Service Data to MaxCompute for Offline Analysis Using DataWorks

This article discusses solutions to the partitioning and DataWorks scheduling parameter problems that can arise when shipping Log Service data to MaxCompute.

By Yi Xiu

You may have encountered partitioning or DataWorks scheduling parameter problems when shipping data to MaxCompute through DataWorks. This article walks through a simulated real-world case and provides solutions to these problems.


Official help document: https://www.alibabacloud.com/help/doc-detail/68322.html

Create a Data Source:

Step 1. Go to Data Integration and open the Data Source tab.


Step 2. In the upper-right corner, click Add Data Source and choose Message Queue > LogHub.


Step 3. Enter the required fields in the Add LogHub Data Source dialog box: Data Source Name, LogHub Endpoint, Project, AccessKey ID, and AccessKey Secret. Then click Test Connectivity.
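The LogHub Endpoint depends on the region and network of your Log Service project; for example, the public endpoint for the China (Hangzhou) region takes the following form (check the Log Service console for your project's exact endpoint):

    http://cn-hangzhou.log.aliyuncs.com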


Create Destination Tables:

Step 1. Click Temporary Query in the left-side navigation pane. Right-click anywhere on the Query page and select Create > ODPS SQL.


Step 2. Write the DDL statements for creating the tables (a sample DDL sketch is given after these steps).

Step 3. Click the Run button to create the destination tables: ods_client_operation_log, ods_vedio_server_log, and ods_web_tracking_log.

Step 4. When you see the message "shell run successfully!", the three DDL statements have been executed successfully.


Step 5. Use the desc command to view the created tables.


You can use the desc command to check the other two tables in the same way and confirm that they exist.
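The exact schema depends on the fields in your logs. As a minimal sketch with hypothetical Web tracking columns, the DDL for ods_web_tracking_log could look like the following; note the ds, hr, and min partition columns, which match the scheduling parameters configured later:

    CREATE TABLE IF NOT EXISTS ods_web_tracking_log (
        ip          STRING COMMENT 'client IP address',
        user_agent  STRING COMMENT 'browser user agent',
        referer     STRING COMMENT 'referring page',
        request_uri STRING COMMENT 'requested URI',
        log_time    STRING COMMENT 'raw log timestamp'
    )
    PARTITIONED BY (ds STRING, hr STRING, min STRING);

    -- Verify the table structure after creation.
    DESC ods_web_tracking_log;

The DDL for the other two tables follows the same pattern with their own log fields.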

Create a Data Synchronization Task

After creating the data source and testing the connectivity in DataWorks, you can synchronize data from the data source to MaxCompute through a data synchronization task.

Procedure

Step 1. Click Create Business Flow and then click Confirm. Name the business flow ApsaraVideo Live log collection.


Step 2. Create the following dependencies in sequence on the Business Flow Development panel.


Name the three data synchronization nodes as follows: web_tracking_log_syn, client_operation_log_syn, and vedio_server_log_syn.

Step 3. Double-click web_tracking_log_syn to open the node configuration page. The configuration items include Data Source (Source and Destination), Mappings (Source Table and Destination Table), and Channel.


Specify the parameters based on the data collection window as follows:

Set the consumption checkpoint to once every five minutes, effective from 00:00 to 23:59. Set startTime=$[yyyymmddhh24miss-10/24/60] (10 minutes before the system time) and endTime=$[yyyymmddhh24miss-5/24/60] (5 minutes before the system time); note that this window is different from the consumption checkpoint. Then set ds=$[yyyymmdd-5/24/60], hr=$[hh24-5/24/60], and min=$[mi-5/24/60].
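Put together, the node's scheduling parameters would look like the following (assuming DataWorks' $[...] custom time parameter syntax):

    startTime=$[yyyymmddhh24miss-10/24/60]
    endTime=$[yyyymmddhh24miss-5/24/60]
    ds=$[yyyymmdd-5/24/60]
    hr=$[hh24-5/24/60]
    min=$[mi-5/24/60]

The ds, hr, and min values can then be referenced as ${ds}, ${hr}, and ${min} in the destination partition settings, so that each run writes into the five-minute partition whose data it collected.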

Step 4. Click Advanced run to perform testing.


You can perform testing by manually entering custom parameters.
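For a manual test, replace the expressions with concrete values, for example (hypothetical date):

    startTime=20180806000000
    endTime=20180806000500
    ds=20180806
    hr=00
    min=00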


Step 5. Use a SQL script to verify whether the data has been written to the destination table.
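For example, a quick row count over the partition that a test run should have filled (hypothetical partition values):

    -- Count rows written into one five-minute partition.
    SELECT COUNT(*) AS cnt
    FROM ods_web_tracking_log
    WHERE ds = '20180806' AND hr = '00' AND min = '00';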


After the logs are synchronized from Log Service to MaxCompute, you can proceed with data processing.

For example, you can compute statistics such as top channels, region distribution, and buffering lag.


The detailed SQL logic is not elaborated here; you can implement the statistical analysis based on your actual business needs. Configure the dependency relationships between the analysis nodes in the business flow accordingly.
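Purely as an illustration, a top-channels query might look like the following, assuming a hypothetical channel_id column in ods_vedio_server_log:

    -- Top 10 channels by request count for one day (hypothetical schema).
    SELECT channel_id, COUNT(*) AS request_cnt
    FROM ods_vedio_server_log
    WHERE ds = '20180806'
    GROUP BY channel_id
    ORDER BY request_cnt DESC
    LIMIT 10;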
