DataHub: Create a MaxCompute synchronization

Last Updated: Mar 12, 2026

Preparations

Create a MaxCompute table

DataHub lets you synchronize data to MaxCompute tables. It supports both partitioned and non-partitioned tables. We recommend using partitioned tables for easier data processing in MaxCompute.

You can use DataHub to synchronize data from TUPLE and BLOB topics to MaxCompute tables.

  • If you synchronize data from a TUPLE topic, the data types in the destination MaxCompute table must match the DataHub data types. The following table describes the data type mapping.

    MaxCompute | DataHub
    BIGINT     | BIGINT
    STRING     | STRING
    BOOLEAN    | BOOLEAN
    DOUBLE     | DOUBLE
    DATETIME   | TIMESTAMP
    DECIMAL    | DECIMAL
    TINYINT    | TINYINT
    SMALLINT   | SMALLINT
    INT        | INTEGER
    FLOAT      | FLOAT
    MAP        | Not supported
    ARRAY      | Not supported

    DataHub does not support all MaxCompute data types. You must create the MaxCompute table schema based on the DataHub data types.

  • If you synchronize data from a BLOB topic, the MaxCompute table schema must contain only one STRING column. By default, DataHub synchronizes data to this column.

    DataHub | MaxCompute
    BLOB    | STRING

  • To make data tracking and troubleshooting easier, you can add a __rowkey__ STRING field when you create the MaxCompute table schema. DataHub automatically syncs the trace information for the data to this column. This helps with future data investigation.
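The mapping above can be turned into a small helper that derives a MaxCompute table definition from a TUPLE topic schema. This is a minimal sketch, not a DataHub or MaxCompute API: the function name `build_create_table`, its parameters, and the table and field names in the usage line are hypothetical, and it assumes that partition columns are of the STRING type.

```python
# DataHub field type -> MaxCompute column type, taken from the mapping table above.
DATAHUB_TO_MAXCOMPUTE = {
    "BIGINT": "BIGINT",
    "STRING": "STRING",
    "BOOLEAN": "BOOLEAN",
    "DOUBLE": "DOUBLE",
    "TIMESTAMP": "DATETIME",
    "DECIMAL": "DECIMAL",
    "TINYINT": "TINYINT",
    "SMALLINT": "SMALLINT",
    "INTEGER": "INT",
    "FLOAT": "FLOAT",
}


def build_create_table(table, fields, partition_columns=(), add_rowkey=True):
    """Return a CREATE TABLE statement for a TUPLE topic schema (hypothetical helper).

    fields: list of (name, datahub_type) pairs.
    partition_columns: names of partition columns, assumed to be STRING.
    add_rowkey: append the optional __rowkey__ STRING column used for tracing.
    """
    columns = []
    for name, datahub_type in fields:
        mc_type = DATAHUB_TO_MAXCOMPUTE.get(datahub_type.upper())
        if mc_type is None:
            raise ValueError(f"no MaxCompute mapping for DataHub type {datahub_type!r}")
        columns.append(f"{name} {mc_type}")
    if add_rowkey:
        columns.append("__rowkey__ STRING")
    ddl = f"CREATE TABLE IF NOT EXISTS {table} ({', '.join(columns)})"
    if partition_columns:
        parts = ", ".join(f"{name} STRING" for name in partition_columns)
        ddl += f" PARTITIONED BY ({parts})"
    return ddl + ";"


# Usage with made-up names: two data fields and a three-level time partition.
print(build_create_table("demo_table", [("f1", "STRING"), ("f2", "BIGINT")], ("ds", "hh", "mm")))
```

Raising an error for unmapped types makes the unsupported cases explicit before the table is created, instead of failing later during synchronization.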

Prepare an account for the sync task and grant permissions

  • When you create a sync task for MaxCompute, you must enter the account information required to access the MaxCompute table. Make sure the account information is valid. A MaxCompute RAM user is usually sufficient.

  • You must grant the account the required permissions to access the MaxCompute table. These permissions include CreateInstance, Describe, Alter, and Update.

    You can use the DataWorks console to manage permissions for MaxCompute tables. For more information, see Configure MaxCompute Engine Permissions. You can also use the MaxCompute command line interface to grant permissions. For more information, see MaxCompute Usage and Authorization Management.

Confirm the TimestampUnit

  • The TimestampUnit parameter in the connector converts TIMESTAMP data. The data is converted based on the unit that you specify and then written to a date-type field, such as a datetime field, in the downstream system.

  • If the TIMESTAMP column value is in seconds, set TimestampUnit to "SECOND" when you create the connector. If the value is in milliseconds, set it to "MILLISECOND". If the value is in microseconds, set it to "MICROSECOND".
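The effect of TimestampUnit can be illustrated with a short conversion routine. This is only a sketch of the arithmetic, not the connector's implementation: the function name `timestamp_to_datetime` is hypothetical, and the result is expressed in UTC for simplicity.

```python
from datetime import datetime, timedelta, timezone

# Microseconds per unit for the three TimestampUnit values described above.
MICROS_PER_UNIT = {"SECOND": 1_000_000, "MILLISECOND": 1_000, "MICROSECOND": 1}

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def timestamp_to_datetime(value, timestamp_unit="MICROSECOND"):
    """Interpret an integer TIMESTAMP value in the given unit as a UTC datetime."""
    micros = value * MICROS_PER_UNIT[timestamp_unit]
    return EPOCH + timedelta(microseconds=micros)


# The same instant written in three different units.
print(timestamp_to_datetime(1614837765, "SECOND"))
print(timestamp_to_datetime(1614837765000, "MILLISECOND"))
print(timestamp_to_datetime(1614837765000000, "MICROSECOND"))
```

All three calls describe the same instant, which is why the unit configured on the connector has to match the unit used by the writer.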

Important

A large number of partitions can slow down data synchronization from DataHub. This is because of the current write constraints of MaxCompute. When you create a MaxCompute sync task, limit the number of partitions. This is especially important for the USER_DEFINE partition mode.

  • Keep data within the same partition as continuous as possible. Avoid frequent partition changes.

  • When you create partitions, do not create too many.

Note

If the whitelist feature is enabled for a MaxCompute project, only devices in the whitelist can access the project. After you enable the MaxCompute IP whitelist, you must also configure a service whitelist so that the sync service can access the project. For more information about how to configure the whitelist, see Overview.

Sync modes

Append mode

Data is appended to the destination table. This mode is suitable for scenarios where data only needs to be appended, not updated.

Upsert mode

Upsert combines Update and Insert operations. The logic is as follows:

  • If a record with the same primary key exists in the destination table, the existing record is updated.

  • If a record with the same primary key does not exist, a new record is inserted.

The Upsert mode lets you handle data updates and insertions more flexibly. It ensures that the data in the destination table is always current.
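The upsert logic can be sketched with an in-memory dictionary standing in for the destination table. The function name `upsert` and the sample rows are hypothetical, and the sketch assumes that an update replaces the whole row.

```python
def upsert(table, record, primary_key):
    """Insert the record, or replace the existing row that has the same primary key.

    table: dict mapping primary-key value -> row (stand-in for the destination table).
    Returns "update" or "insert" to show which branch was taken.
    """
    key = record[primary_key]
    action = "update" if key in table else "insert"
    table[key] = dict(record)  # assumption: the new record replaces the whole row
    return action


table = {}
print(upsert(table, {"id": 1, "name": "a"}, "id"))  # insert
print(upsert(table, {"id": 1, "name": "b"}, "id"))  # update
print(table)  # {1: {'id': 1, 'name': 'b'}}
```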

Note

For more information about the MaxCompute Upsert feature, see Terms.

Scenarios

  • Update data based on a primary key: Data may change over time and must be updated based on its primary key.

  • Maintain data uniqueness in the destination table: Ensure that each record in the destination table is unique to avoid data duplication.

  • Process duplicate data: Remove duplicates from large amounts of data based on a primary key.

Configuration description

  1. DataHub topic type: Must be a TUPLE topic.

  2. DataHub Topic Schema: The following two types are supported:

    1. The schema type for data synchronized from DTS to DataHub, which is referred to as the DTS format.

    2. A schema that you create yourself, in which you select a STRING column as the operation column. This is referred to as the custom format.

  3. ODPS destination table: Must be a Transactional Table 2.0.

Sync rules

1. DTS format

When you synchronize data from DTS to DataHub, DataHub uses the operation_flag, before_flag, and after_flag columns in the schema to determine how to synchronize data to the ODPS destination table. The rules are as follows:

operation_flag | before_flag | after_flag | OperationType | Sync to destination table
I              | *           | *          | UPSERT        | Update the record in the destination table based on the primary key.
U              | Y           | N          | DELETE        | Delete the record from the destination table based on the primary key.
U              | N           | Y          | UPSERT        | Update the record in the destination table based on the primary key.
D              | *           | *          | DELETE        | Delete the record from the destination table based on the primary key.
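The DTS-format rules can be read as a small decision function. The sketch below only restates the rule table in code. The function name `dts_operation_type` is hypothetical, and rejecting flag combinations that the table does not list is an assumption.

```python
def dts_operation_type(operation_flag, before_flag, after_flag):
    """Map the DTS flag columns of one record to an OperationType, per the rule table."""
    if operation_flag == "I":
        return "UPSERT"  # before_flag and after_flag are not checked (*)
    if operation_flag == "D":
        return "DELETE"  # before_flag and after_flag are not checked (*)
    if operation_flag == "U":
        if (before_flag, after_flag) == ("Y", "N"):
            return "DELETE"
        if (before_flag, after_flag) == ("N", "Y"):
            return "UPSERT"
    # Assumption: combinations not listed in the rule table are treated as errors.
    raise ValueError(f"unsupported flags: {operation_flag!r}, {before_flag!r}, {after_flag!r}")


print(dts_operation_type("U", "Y", "N"))  # DELETE
print(dts_operation_type("U", "N", "Y"))  # UPSERT
```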

2. Custom format

For custom data, DataHub determines how to synchronize data to the ODPS destination table based on the operation column that you select.

Operation column value | OperationType | Sync to destination table
U                      | UPSERT        | Update the record in the destination table based on the primary key.
D                      | DELETE        | Delete the record from the destination table based on the primary key.
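For the custom format, the operation column alone decides what happens to a record. The following sketch applies such records to an in-memory dictionary that stands in for the destination table. The function name `apply_custom_record`, the column name `op`, and the sample rows are hypothetical, and the sketch assumes that deleting a missing key is a no-op.

```python
# Operation column value -> OperationType, per the custom-format rules.
CUSTOM_OPERATIONS = {"U": "UPSERT", "D": "DELETE"}


def apply_custom_record(table, record, primary_key, operation_column):
    """Apply one custom-format record to `table` (dict keyed by primary-key value)."""
    operation = CUSTOM_OPERATIONS[record[operation_column]]
    key = record[primary_key]
    if operation == "DELETE":
        table.pop(key, None)  # assumption: deleting a missing key is a no-op
    else:
        table[key] = dict(record)
    return operation


table = {}
apply_custom_record(table, {"id": 1, "name": "a", "op": "U"}, "id", "op")
apply_custom_record(table, {"id": 2, "name": "b", "op": "U"}, "id", "op")
apply_custom_record(table, {"id": 1, "name": "a", "op": "D"}, "id", "op")
print(sorted(table))  # [2]
```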

Create a sync task

  1. In DataHub, click a topic to go to its details page.

  2. On the topic details page, click the Sync button in the upper-right corner to create a sync task.

  3. Select the MaxCompute job type to go to the Create Connector page.

    Configuration item description:

    • Parameters:

      Parameter | Options | Required | Description
      Project Name | / | Yes | The name of the MaxCompute project.
      Schema | / | No | The name of the MaxCompute schema. Note: To use the schema feature, enable schema syntax development. For more information about how to enable this feature and other schema details, see Schema operations.
      Table | / | Yes | The name of the MaxCompute table. Note: If you use the Upsert mode, the destination table must be a Transactional Table 2.0.
      Sync mode | Append | Yes | Appends data to the MaxCompute destination table.
      Sync mode | Upsert | Yes | Updates or deletes records in the MaxCompute transactional table based on the primary key. For more information, see the Sync modes section in this topic.
      Upsert method | SYNC_CUSTOM | Required if you set Sync mode to Upsert. Not applicable if you set Sync mode to Append. | A custom field for the upsert operation.
      Upsert method | SYNC_NONE | Required if you set Sync mode to Upsert. Not applicable if you set Sync mode to Append. | All data is written to the destination table as an upsert operation.
      Upsert method | SYNC_DTS | Required if you set Sync mode to Upsert. Not applicable if you set Sync mode to Append. | Used when data is written to DataHub from DTS and the new DTS attachment column rule is enabled.
      Upsert method | SYNC_DTS_OLD | Required if you set Sync mode to Upsert. Not applicable if you set Sync mode to Append. | This applies to scenarios where you use DTS to write data to DataHub and enable the new attachment column rule.
      Primary key field | / | Required if you set Sync mode to Upsert. Not applicable if you set Sync mode to Append. | The primary key column that you specify when you create the downstream table for the Upsert sync mode.
      Upsert operation field | / | Required if you set Upsert method to SYNC_CUSTOM. | Select a STRING column as the operation column. This column indicates whether the data is synchronized to the downstream table as an upsert or delete operation. For more information about the Upsert mode, see the Upsert mode section in this topic.

    • Import fields: DataHub can synchronize the content of specified columns to the MaxCompute table based on your settings.

    • Partition mode: This mode determines the MaxCompute partition to which data is written. DataHub supports the following partition modes:

      Partition mode | Partition basis | Supported topic type | Description
      USER_DEFINE | Value of the partition key column in the record. The column must have the same name as the partition field in MaxCompute. | TUPLE | (1) The DataHub schema must contain the MaxCompute partition field. (2) The value of this column must be a UTF-8 string. The value can be empty, which indicates that the data is not partitioned.
      SYSTEM_TIME | Time when the record is written to DataHub. | TUPLE / BLOB | (1) In the partition configuration, set the time format for the MaxCompute partition. (2) Set the time zone.
      EVENT_TIME | Value of the event_time (TIMESTAMP) column in the record. | TUPLE | (1) In the partition configuration, set the time format for the MaxCompute partition. (2) Set the time zone.
      META_TIME | Value of the __dh_meta_time__ attribute field of the record. | TUPLE / BLOB | (1) In the partition configuration, set the time format for the MaxCompute partition. (2) Set the time zone.

      The SYSTEM_TIME, EVENT_TIME, and META_TIME modes partition data based on the timestamp and time zone configurations. The default unit for the timestamp is microsecond.

    • Partition configuration: Specifies the settings for partitioning data based on timestamps. The console uses a fixed MaxCompute partition format by default. The partition configuration is as follows:

      Partition | Time Format | Description
      ds        | %Y%m%d      | day
      hh        | %H          | hour
      mm        | %M          | minute

      • The partition interval determines the time interval used to convert timestamps into MaxCompute partitions. The time range is 15 minutes to 1440 minutes (1 day), and the step interval is 15 minutes.

      • Time zone (TimeZone): Specifies the time zone for partitioning data based on timestamps.

      • Separator: When you synchronize BLOB data, you can specify a hexadecimal separator to split the data before it is synchronized to MaxCompute. For example, 0A represents a line feed (\n).

      • Base64 encoding: By default, DataHub stores BLOB data as binary data. The corresponding column in MaxCompute is of the STRING type. Therefore, when you create a sync task in the console, the data is Base64-encoded by default before synchronization. For more customization options, you can use the software development kit (SDK).
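The separator and Base64 options for BLOB data can be pictured with the following sketch. It is not the connector's implementation. The function names `split_blob` and `to_string_column` are hypothetical, and the sketch assumes that the payload is split first, that each piece is then encoded separately, and that empty pieces are dropped.

```python
import base64


def split_blob(payload, separator_hex):
    """Split a BLOB payload on a separator given as hex text, such as "0A" for a line feed."""
    separator = bytes.fromhex(separator_hex)
    if not separator:
        raise ValueError("separator must not be empty")
    # Assumption: empty pieces (for example after a trailing separator) are dropped.
    return [piece for piece in payload.split(separator) if piece]


def to_string_column(piece, use_base64=True):
    """Turn one binary piece into the text stored in the single STRING column."""
    if use_base64:
        return base64.b64encode(piece).decode("ascii")
    return piece.decode("utf-8")


pieces = split_blob(b"hello\nworld\n", "0A")
print([to_string_column(p) for p in pieces])         # ['aGVsbG8=', 'd29ybGQ=']
print([to_string_column(p, False) for p in pieces])  # ['hello', 'world']
```

Base64 keeps arbitrary binary content safe inside a STRING column, at the cost of requiring a decode step when the data is read back in MaxCompute.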

View a sync task

Open the details page of a connector to view the sync task. The page shows the sync status and sync checkpoints, and provides operations such as restart and stop. The following figure shows an example.

Synchronization examples

1. USER_DEFINE partition mode

  1. Create a DataHub topic.

    Note: The topic schema must contain the MaxCompute partition field. The field must be of the STRING type, as shown in the following figure: [Figure 5-5]

  2. Write data to the DataHub topic. You can use a DataHub SDK to write the data.

    During the test, use the SDK to write several records. The values for [ds,hh,mm] are [20210304,01,15] and [20210304,02,15]. The data is as follows: [Figure 5-6]

  3. Create a sync task.

    In USER_DEFINE partition mode, you can set the partition configuration fields during synchronization. If a corresponding table does not exist in MaxCompute, it is automatically created.

    In this example, the f1 and f2 fields are imported. The f3 field is not synchronized.

  4. Confirm the synchronized data.

    You can view the synchronization information for the sync task in the DataHub console. Query the data in MaxCompute. The result is as follows: [Figure 5-9]

    In USER_DEFINE mode, DataHub synchronizes data to the corresponding partition based on the value of the MaxCompute partition field.
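The routing in this example can be sketched as a grouping step: each record goes to the partition named by its own ds, hh, and mm values, and only the imported fields are kept. The function name `group_by_partition`, the partition label format, and the sample field values are hypothetical.

```python
from collections import defaultdict


def group_by_partition(records, partition_fields=("ds", "hh", "mm"), import_fields=("f1", "f2")):
    """Group records by the values of their partition fields, keeping only imported fields."""
    groups = defaultdict(list)
    for record in records:
        label = ",".join(f"{name}={record[name]}" for name in partition_fields)
        groups[label].append({name: record[name] for name in import_fields})
    return dict(groups)


# Made-up records that use the [ds,hh,mm] values from this example; f3 is not imported.
records = [
    {"f1": "a", "f2": 1, "f3": "x", "ds": "20210304", "hh": "01", "mm": "15"},
    {"f1": "b", "f2": 2, "f3": "y", "ds": "20210304", "hh": "02", "mm": "15"},
    {"f1": "c", "f2": 3, "f3": "z", "ds": "20210304", "hh": "01", "mm": "15"},
]
for label, rows in group_by_partition(records).items():
    print(label, rows)
```

Because every distinct combination of values produces its own partition, the number of partitions in this mode is driven entirely by the data that is written.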

2. SYSTEM_TIME partition mode

  1. Create a DataHub topic.

    Note: The partition is calculated based on the time when data is written to DataHub. Therefore, the topic schema only needs to contain data fields, not partition fields, as shown in the following figure:

  2. Write data to the DataHub topic. You can use a DataHub SDK to write the data.

    During the test, use the SDK to write several records. The time when the data is written to DataHub is 2021-03-04 14:02:45. The data is as follows: [Figure 5-11]

  3. Create a sync task.

    • Make sure that the partition configuration is consistent with the MaxCompute table partitions.

  4. Confirm the synchronized data.

    You can view the synchronization information for the sync task in the DataHub console, such as DoneTime. Query the data in MaxCompute. The result is as follows: [Figure 5-14]

    In SYSTEM_TIME mode, DataHub synchronizes data to the corresponding partition based on the time when the data was written to DataHub.
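How a write time turns into ds, hh, and mm values can be sketched as follows. This is an illustration under stated assumptions, not the connector's algorithm. The function name `time_partition` is hypothetical, the timestamp is taken to be in microseconds, the time is assumed to be rounded down to the start of its partition interval, and the example time 2021-03-04 14:02:45 is assumed to be in UTC+8.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def time_partition(timestamp_us, utc_offset_hours=8, interval_minutes=15):
    """Return the ds/hh/mm partition values for a microsecond timestamp."""
    if interval_minutes % 15 != 0 or not 15 <= interval_minutes <= 1440:
        raise ValueError("interval must be a multiple of 15 between 15 and 1440 minutes")
    tz = timezone(timedelta(hours=utc_offset_hours))
    local = (EPOCH + timedelta(microseconds=timestamp_us)).astimezone(tz)
    minutes_of_day = local.hour * 60 + local.minute
    # Assumption: round down to the start of the interval that contains the time.
    floored = minutes_of_day - minutes_of_day % interval_minutes
    return {
        "ds": local.strftime("%Y%m%d"),
        "hh": f"{floored // 60:02d}",
        "mm": f"{floored % 60:02d}",
    }


# 2021-03-04 14:02:45 at UTC+8, expressed in microseconds since the epoch.
print(time_partition(1_614_837_765_000_000))
# {'ds': '20210304', 'hh': '14', 'mm': '00'}
```

A larger partition interval maps more write times to the same partition, which is one way to keep the number of partitions low.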

FAQ

  • A timestamp field synchronized to MaxCompute shows the date 1970-01-19.

    Cause: The default unit for timestamps synchronized from DataHub to MaxCompute is microsecond, but the timestamp written by the user is in milliseconds.

    Solution: Write timestamps to DataHub in microseconds, or set TimestampUnit to "MILLISECOND" when you create the connector, as described in the Confirm the TimestampUnit section.
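The date in this FAQ can be reproduced with a few lines of arithmetic. The sample value below is an arbitrary millisecond timestamp from March 2021, chosen only for illustration, and the result is shown in UTC.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

millis = 1_614_837_765_000  # 2021-03-04 06:02:45 UTC, written in milliseconds

# The value is read as microseconds, so it is 1000 times too small.
misread = EPOCH + timedelta(microseconds=millis)
# The value is converted to microseconds before it is interpreted.
correct = EPOCH + timedelta(microseconds=millis * 1000)

print(misread.date())  # 1970-01-19
print(correct.date())  # 2021-03-04
```

A millisecond value read as microseconds covers only about 18.7 days instead of about 51 years, which is why the result lands in January 1970.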