This topic describes how to use DataHub to stream log data into MaxCompute for batch processing. You will create a DataHub project and topic, set up a MaxCompute DataConnector, and verify that data flows into your MaxCompute table.
Prerequisites
Ensure that the following permissions are granted to the account authorized to access MaxCompute:
CreateInstance permission on MaxCompute projects
Permissions to view, modify, and update MaxCompute tables
For more information, see MaxCompute permissions.
How it works
DataHub is a platform designed to process streaming data. After data is uploaded to a DataHub topic, it is stored for real-time processing. A MaxCompute DataConnector within DataHub periodically batches the incoming records and writes them to a MaxCompute table, where you can run SQL queries for batch processing.
By default, DataHub triggers a sync to MaxCompute at five-minute intervals or when the buffered data reaches 64 MB, whichever comes first. To set up this pipeline, you only need to create and configure a DataConnector in DataHub.
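The flush policy described above (commit every five minutes or at 64 MB, whichever comes first) can be sketched as a small buffer model. This is an illustrative sketch of the behavior, not DataHub's actual implementation; the class and names are hypothetical.

```python
import time

# Hypothetical sketch of the DataConnector's flush policy (not the actual
# DataHub implementation): buffered records are committed to MaxCompute
# when 5 minutes have elapsed since the last commit or the buffer reaches
# 64 MB, whichever comes first.

FLUSH_INTERVAL_S = 5 * 60            # 5-minute sync interval
FLUSH_SIZE_BYTES = 64 * 1024 * 1024  # 64 MB buffer threshold

class SyncBuffer:
    def __init__(self, now=time.monotonic):
        self._now = now            # injectable clock, for testing
        self._buffered = 0         # bytes currently buffered
        self._last_flush = now()

    def add(self, record_size):
        self._buffered += record_size

    def should_flush(self):
        elapsed = self._now() - self._last_flush
        return self._buffered >= FLUSH_SIZE_BYTES or elapsed >= FLUSH_INTERVAL_S

    def flush(self):
        flushed, self._buffered = self._buffered, 0
        self._last_flush = self._now()
        return flushed
```

Either trigger alone is sufficient: a low-volume topic still syncs every five minutes, and a high-volume topic syncs as soon as 64 MB accumulates.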
Log Source ---> DataHub Topic ---> MaxCompute DataConnector ---> MaxCompute Table
(streaming)                        (batch sync every              (partitioned,
                                    5 min or 64 MB)                offline query)

Procedure
Step 1: Create a MaxCompute table
On the odpscmd client (MaxCompute command-line tool), create a table to store the data that will be synchronized from DataHub. For example, run the following SQL statement to create a partitioned table:
CREATE TABLE test(f1 string, f2 string, f3 double) PARTITIONED BY (ds string);

Step 2: Create a DataHub project
Log on to the DataHub console. In the upper-left corner, select a region.
In the left-side navigation pane, click Projects.
In the upper-right corner of the Projects page, click Create Project.
In the Create Project panel, configure Name and Description, and then click Create.
Step 3: Create a topic
On the Projects page, find the desired project and click View in the Actions column.
On the project details page, click Create Topic in the upper-right corner.
In the Create Topic panel, select Import MaxCompute Tables for Creation Type and configure the other parameters.

Click Next Step to complete the topic configuration.
Note
- The schema corresponds to a MaxCompute table. The field names, data types, and field order specified in the schema must be consistent with those of the MaxCompute table. You can create a DataConnector only if all three conditions are met.
- Topics of both the TUPLE and BLOB types can be synchronized to MaxCompute tables.
- By default, you can create a maximum of 20 topics. If you need more topics, submit a ticket.
- Only the owner or creator of a DataHub topic has the permissions to manage its DataConnectors, for example, to create or delete a DataConnector.
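The three consistency conditions in the note (matching field names, data types, and field order) can be sketched as a simple check. The `(name, type)` tuple representation below is an assumption made for illustration, not the DataHub SDK's actual schema model.

```python
# Illustrative check of the three conditions: field names, data types, and
# field order of the DataHub topic schema must all match the MaxCompute
# table. The (name, type) tuple representation is an assumption for this
# sketch, not the actual DataHub SDK schema model.

def schema_matches(topic_fields, table_fields):
    """Return True only if names, types, and order are all identical
    (compared case-insensitively)."""
    if len(topic_fields) != len(table_fields):
        return False
    return all(
        t_name.lower() == c_name.lower() and t_type.lower() == c_type.lower()
        for (t_name, t_type), (c_name, c_type) in zip(topic_fields, table_fields)
    )

# Example: the schema of the `test` table created in Step 1.
table = [("f1", "string"), ("f2", "string"), ("f3", "double")]
topic_ok = [("f1", "STRING"), ("f2", "STRING"), ("f3", "DOUBLE")]
topic_bad = [("f2", "STRING"), ("f1", "STRING"), ("f3", "DOUBLE")]  # wrong order
```

If any of the three checks fails, DataConnector creation is rejected, so it is worth validating the topic schema against the table before Step 4.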
Step 4: Create a MaxCompute DataConnector
On the Topic List tab of the project details page, find the newly created topic and click View in the Actions column.
On the topic details page, click Connector in the upper-right corner.
In the Create Connector panel, click MaxCompute, configure the parameters, and then click Create.
Step 5: View DataConnector details
In the left-side navigation pane, click Projects.
On the Projects page, find the desired project and click View in the Actions column.
On the Topic List tab, find the topic and click View in the Actions column.
On the topic details page, click the Connector tab.
Find the newly created DataConnector and click View to view DataConnector details.
By default, DataHub synchronizes data to the MaxCompute table at five-minute intervals or when the buffered data reaches 64 MB, whichever comes first. Sync Offset indicates the number of records that have been synchronized.

Step 6: Verify the migration
Execute the following SQL statement to check whether the log data has been migrated to MaxCompute:
SELECT * FROM test;

If the query returns the expected log data, the data has been migrated to MaxCompute successfully.

What's next
After you verify that the data pipeline is working, consider the following actions:
Monitor DataConnector status: Periodically check the Connector tab for your topic to confirm that Sync Offset is increasing and that no errors have occurred.
Query with partition filters: Use partition filters in your queries (for example, SELECT * FROM test WHERE ds='<partition_value>';) to improve query performance on large datasets.
Scale your pipeline: If you need higher throughput, you can increase the number of shards in your DataHub topic.
Appendix: Data type mappings
The following table lists the data type mappings between MaxCompute and DataHub. When you create a DataHub topic, the schema must use compatible data types.
| MaxCompute | DataHub | Notes |
|---|---|---|
| BIGINT | BIGINT | Direct mapping. |
| STRING | STRING | Direct mapping. |
| BOOLEAN | BOOLEAN | Direct mapping. |
| DOUBLE | DOUBLE | Direct mapping. |
| DATETIME | TIMESTAMP | DataHub TIMESTAMP maps to MaxCompute DATETIME. |
| DECIMAL | DECIMAL | Direct mapping. |
| TINYINT | TINYINT | Direct mapping. |
| SMALLINT | SMALLINT | Direct mapping. |
| INT | INTEGER | DataHub uses INTEGER; MaxCompute uses INT. |
| FLOAT | FLOAT | Direct mapping. |
| STRING | BLOB | DataHub BLOB data is written to MaxCompute as STRING. |
| MAP | Not supported | MAP columns have no DataHub equivalent and cannot be synchronized. |
| ARRAY | Not supported | ARRAY columns have no DataHub equivalent and cannot be synchronized. |
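The mapping table above can be expressed as a lookup from a MaxCompute column type to the compatible DataHub field type. This is a convenience sketch of the table, not part of any SDK; `None` marks types that cannot be synchronized.

```python
# Sketch of the mapping table above: MaxCompute column type -> compatible
# DataHub field type. None marks types that cannot be synchronized.
MAXCOMPUTE_TO_DATAHUB = {
    "BIGINT": "BIGINT",
    "STRING": "STRING",
    "BOOLEAN": "BOOLEAN",
    "DOUBLE": "DOUBLE",
    "DATETIME": "TIMESTAMP",
    "DECIMAL": "DECIMAL",
    "TINYINT": "TINYINT",
    "SMALLINT": "SMALLINT",
    "INT": "INTEGER",
    "FLOAT": "FLOAT",
    "MAP": None,
    "ARRAY": None,
}

def datahub_type_for(maxcompute_type):
    """Return the DataHub type for a MaxCompute column type,
    or raise ValueError for unsupported types."""
    mapped = MAXCOMPUTE_TO_DATAHUB.get(maxcompute_type.upper())
    if mapped is None:
        raise ValueError(f"{maxcompute_type} cannot be synchronized via DataHub")
    return mapped
```

For example, a MaxCompute DATETIME column requires a TIMESTAMP field in the topic schema, while a table with MAP or ARRAY columns cannot be used as a sync target.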