This topic describes how to use DataHub to stream log data into MaxCompute for batch processing. You will create a DataHub project and topic, set up a MaxCompute DataConnector, and verify that data flows into your MaxCompute table.
Prerequisites
Ensure that the following permissions are granted to the account authorized to access MaxCompute:
CreateInstance permission on MaxCompute projects
Permissions to view, modify, and update MaxCompute tables
For more information, see MaxCompute permissions.
How it works
DataHub is a platform designed to process streaming data. After data is uploaded to a DataHub topic, it is stored for real-time processing. A MaxCompute DataConnector within DataHub periodically batches the incoming records and writes them to a MaxCompute table, where you can run SQL queries for batch processing.
By default, DataHub triggers a sync to MaxCompute at five-minute intervals or when the buffered data reaches 64 MB, whichever comes first. To set up this pipeline, you only need to create and configure a DataConnector in DataHub.
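The flush policy described above (commit every five minutes or at 64 MB, whichever comes first) can be sketched as a small buffer model. This is an illustrative sketch of the behavior, not DataHub's actual implementation; the class and names are hypothetical.

```python
import time

# Hypothetical sketch of the DataConnector's flush policy (not the actual
# DataHub implementation): buffered records are committed to MaxCompute
# when 5 minutes have elapsed since the last commit or the buffer reaches
# 64 MB, whichever comes first.

FLUSH_INTERVAL_S = 5 * 60            # 5-minute sync interval
FLUSH_SIZE_BYTES = 64 * 1024 * 1024  # 64 MB buffer threshold

class SyncBuffer:
    def __init__(self, now=time.monotonic):
        self._now = now            # injectable clock, for testing
        self._buffered = 0         # bytes currently buffered
        self._last_flush = now()

    def add(self, record_size):
        self._buffered += record_size

    def should_flush(self):
        elapsed = self._now() - self._last_flush
        return self._buffered >= FLUSH_SIZE_BYTES or elapsed >= FLUSH_INTERVAL_S

    def flush(self):
        flushed, self._buffered = self._buffered, 0
        self._last_flush = self._now()
        return flushed
```

Either trigger alone is sufficient: a low-volume topic still syncs every five minutes, and a high-volume topic syncs as soon as 64 MB accumulates.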
Log Source ---> DataHub Topic ---> MaxCompute DataConnector ---> MaxCompute Table
(streaming)                        (batch sync every              (partitioned,
                                    5 min or 64 MB)                offline query)

Procedure
Step 1: Create a MaxCompute table
On the odpscmd client (MaxCompute command-line tool), create a table to store the data that will be synchronized from DataHub. For example, run the following SQL statement to create a partitioned table:
CREATE TABLE test(f1 string, f2 string, f3 double) PARTITIONED BY (ds string);

Step 2: Create a DataHub project
Log on to the DataHub console. In the upper-left corner, select a region.
In the left-side navigation pane, click Projects.
In the upper-right corner of the Projects page, click Create Project.
In the Create Project panel, configure Name and Description, and then click Create.
Step 3: Create a topic
On the Projects page, find the desired project and click View in the Actions column.
On the project details page, click Create Topic in the upper-right corner.
In the Create Topic panel, select Import MaxCompute Tables for Creation Type and configure the other parameters.

Click Next Step to complete the topic configuration.
Note
- The schema corresponds to a MaxCompute table. The field names, data types, and field order specified in the schema must be consistent with those of the MaxCompute table. You can create a DataConnector only if all three conditions are met.
- Topics of both the TUPLE and BLOB types can be synchronized to MaxCompute tables.
- By default, you can create a maximum of 20 topics. If you need more topics, submit a ticket.
- Only the owner or creator of a DataHub topic has the permissions to manage its DataConnectors, for example, to create or delete a DataConnector.
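The three consistency conditions in the note (matching field names, data types, and field order) can be sketched as a simple check. The `(name, type)` tuple representation below is an assumption made for illustration, not the DataHub SDK's actual schema model.

```python
# Illustrative check of the three conditions: field names, data types, and
# field order of the DataHub topic schema must all match the MaxCompute
# table. The (name, type) tuple representation is an assumption for this
# sketch, not the actual DataHub SDK schema model.

def schema_matches(topic_fields, table_fields):
    """Return True only if names, types, and order are all identical
    (compared case-insensitively)."""
    if len(topic_fields) != len(table_fields):
        return False
    return all(
        t_name.lower() == c_name.lower() and t_type.lower() == c_type.lower()
        for (t_name, t_type), (c_name, c_type) in zip(topic_fields, table_fields)
    )

# Example: the schema of the `test` table created in Step 1.
table = [("f1", "string"), ("f2", "string"), ("f3", "double")]
topic_ok = [("f1", "STRING"), ("f2", "STRING"), ("f3", "DOUBLE")]
topic_bad = [("f2", "STRING"), ("f1", "STRING"), ("f3", "DOUBLE")]  # wrong order
```

If any of the three checks fails, DataConnector creation is rejected, so it is worth validating the topic schema against the table before Step 4.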
Step 4: Create a MaxCompute DataConnector
On the Topic List tab of the project details page, find the newly created topic and click View in the Actions column.
On the topic details page, click Connector in the upper-right corner.
In the Create Connector panel, click MaxCompute, configure the parameters, and then click Create.
Step 5: View DataConnector details
In the left-side navigation pane, click Projects.
On the Projects page, find the desired project and click View in the Actions column.
On the Topic List tab, find the topic and click View in the Actions column.
On the topic details page, click the Connector tab.
Find the newly created DataConnector and click View to view DataConnector details.
By default, DataHub synchronizes data to the MaxCompute table at five-minute intervals or when the buffered data reaches 64 MB, whichever comes first. Sync Offset indicates the number of records that have been synchronized.

Step 6: Verify the migration
Execute the following SQL statement to check whether the log data has been migrated to MaxCompute:
SELECT * FROM test;

If the query returns the expected log data, the data has been migrated to MaxCompute successfully.

What's next
After you verify that the data pipeline is working, consider the following actions:
Monitor DataConnector status: Periodically check the Connector tab for your topic to confirm that Sync Offset is increasing and that no errors have occurred.
Query with partition filters: Use partition filters in your queries (for example, SELECT * FROM test WHERE ds='<partition_value>';) to improve query performance on large datasets.
Scale your pipeline: If you need higher throughput, you can increase the number of shards in your DataHub topic.
Appendix: Data type mappings
The following table lists the data type mappings between MaxCompute and DataHub. When you create a DataHub topic, the schema must use compatible data types.
| MaxCompute | DataHub | Notes |
|---|---|---|
| BIGINT | BIGINT | Direct mapping. |
| STRING | STRING | Direct mapping. |
| BOOLEAN | BOOLEAN | Direct mapping. |
| DOUBLE | DOUBLE | Direct mapping. |
| DATETIME | TIMESTAMP | DataHub TIMESTAMP maps to MaxCompute DATETIME. |
| DECIMAL | DECIMAL | Direct mapping. |
| TINYINT | TINYINT | Direct mapping. |
| SMALLINT | SMALLINT | Direct mapping. |
| INT | INTEGER | DataHub uses INTEGER; MaxCompute uses INT. |
| FLOAT | FLOAT | Direct mapping. |
| STRING | BLOB | DataHub BLOB data is written to MaxCompute as STRING. |
| MAP | Not supported | MAP columns have no DataHub equivalent and cannot be synchronized. |
| ARRAY | Not supported | ARRAY columns have no DataHub equivalent and cannot be synchronized. |
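The mapping table above can be expressed as a lookup from a MaxCompute column type to the compatible DataHub field type. This is a convenience sketch of the table, not part of any SDK; `None` marks types that cannot be synchronized.

```python
# Sketch of the mapping table above: MaxCompute column type -> compatible
# DataHub field type. None marks types that cannot be synchronized.
MAXCOMPUTE_TO_DATAHUB = {
    "BIGINT": "BIGINT",
    "STRING": "STRING",
    "BOOLEAN": "BOOLEAN",
    "DOUBLE": "DOUBLE",
    "DATETIME": "TIMESTAMP",
    "DECIMAL": "DECIMAL",
    "TINYINT": "TINYINT",
    "SMALLINT": "SMALLINT",
    "INT": "INTEGER",
    "FLOAT": "FLOAT",
    "MAP": None,
    "ARRAY": None,
}

def datahub_type_for(maxcompute_type):
    """Return the DataHub type for a MaxCompute column type,
    or raise ValueError for unsupported types."""
    mapped = MAXCOMPUTE_TO_DATAHUB.get(maxcompute_type.upper())
    if mapped is None:
        raise ValueError(f"{maxcompute_type} cannot be synchronized via DataHub")
    return mapped
```

For example, a MaxCompute DATETIME column requires a TIMESTAMP field in the topic schema, while a table with MAP or ARRAY columns cannot be used as a sync target.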