Stream DataHub Topics into MaxCompute with Real-Time DataConnector - DataHub

Preparations

Create a MaxCompute table

DataHub lets you synchronize data to MaxCompute tables. It supports both partitioned and non-partitioned tables. We recommend using partitioned tables for easier data processing in MaxCompute.

You can use DataHub to synchronize data from TUPLE and BLOB topics to MaxCompute tables.

If you synchronize data from a TUPLE topic, the data types in the destination MaxCompute table must match the DataHub data types. The following table describes the data type mapping.
MaxCompute	DataHub
BIGINT	BIGINT
STRING	STRING
BOOLEAN	BOOLEAN
DOUBLE	DOUBLE
DATETIME	TIMESTAMP
DECIMAL	DECIMAL
TINYINT	TINYINT
SMALLINT	SMALLINT
INT	INTEGER
FLOAT	FLOAT
MAP	Not supported
ARRAY	Not supported
DataHub does not support all MaxCompute data types. You must create the MaxCompute table schema based on the DataHub data types.
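As a rough illustration of the mapping above, the following Python sketch derives MaxCompute column types from a DataHub TUPLE schema and assembles a CREATE TABLE statement. The helper name `build_create_table`, the sample table and field names, and the DDL layout are assumptions made for this example; the helper only builds a string, so check the statement against the MaxCompute SQL reference before you run it.

```python
# DataHub field type -> MaxCompute column type, taken from the mapping table above.
DATAHUB_TO_MAXCOMPUTE = {
    "BIGINT": "BIGINT",
    "STRING": "STRING",
    "BOOLEAN": "BOOLEAN",
    "DOUBLE": "DOUBLE",
    "TIMESTAMP": "DATETIME",
    "DECIMAL": "DECIMAL",
    "TINYINT": "TINYINT",
    "SMALLINT": "SMALLINT",
    "INTEGER": "INT",
    "FLOAT": "FLOAT",
}


def build_create_table(table, fields, partition_fields=()):
    """Return a CREATE TABLE statement for a list of (name, datahub_type) pairs.

    Hypothetical helper: it only builds a string and does not call MaxCompute.
    """
    columns = []
    for name, datahub_type in fields:
        if datahub_type not in DATAHUB_TO_MAXCOMPUTE:
            raise ValueError(f"no MaxCompute mapping for DataHub type {datahub_type!r}")
        columns.append(f"{name} {DATAHUB_TO_MAXCOMPUTE[datahub_type]}")
    ddl = f"CREATE TABLE IF NOT EXISTS {table} ({', '.join(columns)})"
    if partition_fields:
        partitions = ", ".join(f"{name} STRING" for name in partition_fields)
        ddl += f" PARTITIONED BY ({partitions})"
    return ddl + ";"


ddl = build_create_table(
    "datahub_sink_demo",  # hypothetical table name
    [("f1", "STRING"), ("f2", "BIGINT"), ("event_time", "TIMESTAMP")],
    partition_fields=("ds", "hh", "mm"),
)
print(ddl)
```

Rejecting unmapped types up front mirrors the rule that the table schema must be built from the DataHub data types.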
If you synchronize data from a BLOB topic, the MaxCompute table schema must contain only one STRING column. By default, DataHub synchronizes data to this column.
DataHub	MaxCompute
BLOB	STRING
To make data tracking and troubleshooting easier, add a __rowkey__ STRING column when you create the MaxCompute table schema. DataHub automatically writes the trace information for each record to this column.

Prepare an account for the sync task and grant permissions

When you create a sync task for MaxCompute, you must enter the account information required to access the MaxCompute table. Make sure the account information is valid. A MaxCompute RAM user is usually sufficient.
You must grant the account the required permissions to access the MaxCompute table. These permissions include CreateInstance, Describe, Alter, and Update.
You can use the DataWorks console to manage permissions for MaxCompute tables. For more information, see Configure MaxCompute Engine Permissions. You can also use the MaxCompute command line interface to grant permissions. For more information, see MaxCompute Usage and Authorization Management.

Confirm the TimestampUnit

The TimestampUnit parameter of the connector specifies the unit of the values in TIMESTAMP columns. DataHub converts each value based on this unit and then writes it to a date-type field, such as a DATETIME field, in the downstream system.
If the TIMESTAMP column value is in seconds, set TimestampUnit to "SECOND" when you create the connector. If the value is in milliseconds, set it to "MILLISECOND". If the value is in microseconds, set it to "MICROSECOND".
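The effect of the unit can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the connector's implementation; the function name `timestamp_to_datetime` and the sample values are assumptions.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

# Number of microseconds in one unit of each TimestampUnit value.
MICROSECONDS_PER_UNIT = {
    "SECOND": 1_000_000,
    "MILLISECOND": 1_000,
    "MICROSECOND": 1,
}


def timestamp_to_datetime(value, timestamp_unit):
    """Interpret an integer TIMESTAMP value in the given unit as a UTC datetime."""
    micros = value * MICROSECONDS_PER_UNIT[timestamp_unit]
    return EPOCH + timedelta(microseconds=micros)


# The same instant expressed in the three units yields the same datetime.
print(timestamp_to_datetime(1_614_866_565, "SECOND"))
print(timestamp_to_datetime(1_614_866_565_000, "MILLISECOND"))
print(timestamp_to_datetime(1_614_866_565_000_000, "MICROSECOND"))
```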

Important

A large number of partitions can slow down data synchronization from DataHub because of the way MaxCompute currently handles writes. When you create a MaxCompute sync task, limit the number of partitions. This is especially important for the USER_DEFINE partition mode.

Write the data for each partition as contiguously as possible and avoid switching partitions frequently.
Do not create more partitions than you need.

Note

If the whitelist feature is enabled for a MaxCompute project, only devices in the whitelist can access the project. After you enable the MaxCompute IP whitelist, you must also configure a service whitelist so that the sync service can access the project. For more information about how to configure the whitelist, see Overview.

Sync modes

Append mode

Data is appended to the destination table. This mode is suitable for scenarios where data only needs to be appended, not updated.

Upsert mode

Upsert combines Update and Insert operations. The logic is as follows:

If a record with the same primary key exists in the destination table, the existing record is updated.
If a record with the same primary key does not exist, a new record is inserted.

The Upsert mode lets you handle data updates and insertions more flexibly. It ensures that the data in the destination table is always current.
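The update-or-insert logic can be modeled with a dictionary keyed by the primary key. This is a minimal sketch under stated assumptions: the dictionary stands in for the destination table, the `id` key and sample records are hypothetical, and an update is assumed to replace the whole record.

```python
def upsert(table, record, key="id"):
    """Insert the record, or replace the existing one with the same primary key."""
    table[record[key]] = record


destination = {}  # primary key -> record; stands in for the destination table
upsert(destination, {"id": 1, "name": "alice"})
upsert(destination, {"id": 2, "name": "bob"})
upsert(destination, {"id": 1, "name": "alice-v2"})  # same key: update, no new row
print(destination)
```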

Note

For more information about the MaxCompute Upsert feature, see Terms.

Scenarios

Update data based on a primary key: Data may change over time and must be updated based on its primary key.
Maintain data uniqueness in the destination table: Ensure that each record in the destination table is unique to avoid data duplication.
Process duplicate data: Remove duplicates from large amounts of data based on a primary key.

Configuration description

DataHub topic type: Must be a TUPLE topic.
DataHub Topic Schema: The following two types are supported:
1. The schema type for data synchronized from DTS to DataHub, which is referred to as the DTS format.
2. A schema that you create yourself, in which you select a STRING column as the operation column. This is referred to as the custom format.
ODPS destination table: Must be a Transactional Table 2.0.

Sync rules

1. DTS format

When you synchronize data from DTS to DataHub, DataHub uses the operation_flag, before_flag, and after_flag columns in the schema to determine how to synchronize data to the ODPS destination table. The rules are as follows:

operation_flag	before_flag	after_flag	OperationType	Sync to destination table
I	*	*	UPSERT	Update the record in the destination table based on the primary key.
U	Y	N	DELETE	Delete the record from the destination table based on the primary key.
U	N	Y	UPSERT	Update the record in the destination table based on the primary key.
D	*	*	DELETE	Delete the record from the destination table based on the primary key.
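The rule table above can be read as a small decision function. The sketch below is illustrative only: the function name is hypothetical, and returning None for flag combinations that the table does not list is an assumption, not documented behavior.

```python
def dts_operation_type(operation_flag, before_flag, after_flag):
    """Map the DTS flag columns to UPSERT or DELETE following the rule table.

    Returns None for combinations that the table does not list.
    """
    if operation_flag == "I":
        return "UPSERT"
    if operation_flag == "D":
        return "DELETE"
    if operation_flag == "U":
        if (before_flag, after_flag) == ("Y", "N"):
            return "DELETE"
        if (before_flag, after_flag) == ("N", "Y"):
            return "UPSERT"
    return None


for flags in [("I", "N", "Y"), ("U", "Y", "N"), ("U", "N", "Y"), ("D", "Y", "N")]:
    print(flags, "->", dts_operation_type(*flags))
```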

2. Custom format

For custom data, DataHub determines how to synchronize data to the ODPS destination table based on the operation column that you select.

Operation column value	OperationType	Sync to destination table
U	UPSERT	Update the record in the destination table based on the primary key.
D	DELETE	Delete the record from the destination table based on the primary key.
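To show how a stream of custom-format records would change the destination table, the following sketch applies U and D operations to a dictionary keyed by the primary key. The column names `id` and `op`, the sample records, and the choice to leave the operation column out of the stored row are assumptions for illustration.

```python
def apply_custom_records(table, records, key="id", op_column="op"):
    """Apply records to a dict-based table using the operation column (U or D)."""
    for record in records:
        operation = record[op_column]
        row = {k: v for k, v in record.items() if k != op_column}
        if operation == "U":
            table[row[key]] = row  # upsert by primary key
        elif operation == "D":
            table.pop(row[key], None)  # delete by primary key
        else:
            raise ValueError(f"unsupported operation value: {operation!r}")
    return table


table = apply_custom_records({}, [
    {"id": 1, "name": "alice", "op": "U"},
    {"id": 2, "name": "bob", "op": "U"},
    {"id": 1, "name": "alice-v2", "op": "U"},
    {"id": 2, "name": "bob", "op": "D"},
])
print(table)
```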

Create a sync task

In DataHub, click a topic to go to its details page.
On the topic details page, click the Sync button in the upper-right corner to create a sync task.

Select the MaxCompute job type to go to the Create Connector page.

Configuration item description:

Parameter	Options	Required	Description
Project Name	/	Yes	The name of the MaxCompute project.
Schema	/	No	The name of the MaxCompute schema. Note To use the schema feature, enable schema syntax development. For more information about how to enable this feature and other schema details, see Schema operations.
Table	/	Yes	The name of the MaxCompute table. Note If you use the Upsert mode, the destination table must be a Transactional Table 2.0.
Sync mode	Append	Yes	Appends data to the MaxCompute destination table.
Sync mode	Upsert	Yes	Updates or deletes records in the MaxCompute transactional table based on the primary key. For more information, see the Sync modes section in this topic.
Upsert method	SYNC_CUSTOM	Required if you set Sync mode to Upsert. Not applicable if you set Sync mode to Append.	A custom field for the upsert operation.
	SYNC_NONE		All data is written to the destination table as an upsert operation.
	SYNC_DTS		Used when data is written to DataHub from DTS and the new DTS attachment column rule is enabled.
	SYNC_DTS_OLD		Used when data is written to DataHub from DTS and the new DTS attachment column rule is not enabled.
Primary key field	/		The primary key column that you specify when you create the downstream table for the Upsert sync mode.
Upsert operation field	/	Required if you set Upsert method to SYNC_CUSTOM.	Select a STRING column as the operation column. This column indicates whether the data is synchronized to the downstream table as an upsert or delete operation.

For more information about the Upsert mode, see the Upsert mode section in this topic.
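The dependencies between the parameters in the table above can be checked before you fill in the console form. The following sketch is not a DataHub API: the dictionary keys are hypothetical names for the console fields, and only the rules stated in the table are encoded.

```python
UPSERT_METHODS = {"SYNC_CUSTOM", "SYNC_NONE", "SYNC_DTS", "SYNC_DTS_OLD"}


def check_connector_settings(settings):
    """Return a list of problems found in a dict of connector settings.

    The dictionary keys are hypothetical names for the console fields.
    """
    problems = []
    for required in ("project", "table", "sync_mode"):
        if not settings.get(required):
            problems.append(f"{required} is required")
    mode = settings.get("sync_mode")
    if mode not in (None, "Append", "Upsert"):
        problems.append(f"unknown sync mode: {mode}")
    if mode == "Upsert":
        method = settings.get("upsert_method")
        if method not in UPSERT_METHODS:
            problems.append("upsert_method must be one of " + ", ".join(sorted(UPSERT_METHODS)))
        if method == "SYNC_CUSTOM" and not settings.get("upsert_operation_field"):
            problems.append("upsert_operation_field is required for SYNC_CUSTOM")
    return problems


print(check_connector_settings({"project": "p", "table": "t", "sync_mode": "Append"}))
print(check_connector_settings(
    {"project": "p", "table": "t", "sync_mode": "Upsert", "upsert_method": "SYNC_CUSTOM"}
))
```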

Import fields: DataHub can synchronize the content of specified columns to the MaxCompute table based on your settings.

Partition mode: This mode determines the MaxCompute partition to which data is written. DataHub supports the following partition modes:

Partition mode	Partition basis	Supported topic type	Description
USER_DEFINE	Value of the partition key column in the record. The column must have the same name as the partition field in MaxCompute.	TUPLE	(1) The DataHub schema must contain the MaxCompute partition field. (2) The value of this column must be a `UTF-8 string`. The value can be empty, which indicates that the data is not partitioned.
SYSTEM_TIME	Time when the record is written to DataHub.	TUPLE / BLOB	(1) In the partition configuration, set the time format for the MaxCompute partition. (2) Set the time zone.
EVENT_TIME	Value of the `event_time(TIMESTAMP)` column in the record.	TUPLE	(1) In the partition configuration, set the time format for the MaxCompute partition. (2) Set the time zone.
META_TIME	Value of the `__dh_meta_time__` attribute field of the record.	TUPLE / BLOB	(1) In the partition configuration, set the time format for the MaxCompute partition. (2) Set the time zone.

The SYSTEM_TIME, EVENT_TIME, and META_TIME modes partition data based on the timestamp and time zone configurations. The default unit for the timestamp is microsecond.
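How a microsecond timestamp and a time zone turn into partition values can be sketched as follows. The example uses the ds, hh, and mm time formats from the partition configuration, models the time zone as a fixed UTC offset, and ignores the partition interval; the function name and the sample timestamp are assumptions.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
PARTITION_FORMATS = {"ds": "%Y%m%d", "hh": "%H", "mm": "%M"}


def partition_values(timestamp_us, utc_offset_hours):
    """Format a microsecond timestamp as ds/hh/mm strings in a fixed-offset zone."""
    zone = timezone(timedelta(hours=utc_offset_hours))
    local = (EPOCH + timedelta(microseconds=timestamp_us)).astimezone(zone)
    return {name: local.strftime(fmt) for name, fmt in PARTITION_FORMATS.items()}


ts = 1_614_866_565_000_000  # 2021-03-04 14:02:45 UTC in microseconds
print(partition_values(ts, 0))
print(partition_values(ts, 8))
```

The same record lands in a different hh partition depending on the configured time zone, which is why the time zone must match the one you use when you query the table.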

Partition configuration: Specifies the settings for partitioning data based on timestamps. The console uses a fixed MaxCompute partition format by default. The partition configuration is as follows:
Partition	Time Format	Description
ds	%Y%m%d	day
hh	%H	hour
mm	%M	minute
- Partition interval: Determines the time interval used to convert timestamps into MaxCompute partitions. Valid values range from 15 minutes to 1440 minutes (1 day), in steps of 15 minutes.
- Time zone (TimeZone): Specifies the time zone for partitioning data based on timestamps.
- Separator: When you synchronize BLOB data, you can specify a hexadecimal separator to split the data before it is synchronized to MaxCompute. For example, 0A represents a line feed (\n).
- Base64 encoding: By default, DataHub stores BLOB data as binary data. The corresponding column in MaxCompute is of the STRING type. Therefore, when you create a sync task in the console, the data is Base64-encoded by default before synchronization. For more customization options, you can use the software development kit (SDK).
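The separator and Base64 options for BLOB data can be illustrated with the Python standard library. This sketch assumes that the data is split first and that each piece is then encoded, and it drops empty pieces; both choices are assumptions, and the function name is hypothetical.

```python
import base64


def split_and_encode(blob, separator_hex="0A", encode=True):
    """Split BLOB bytes on a hex separator and Base64-encode each piece."""
    separator = bytes.fromhex(separator_hex)  # "0A" -> b"\n"
    pieces = [piece for piece in blob.split(separator) if piece]
    if encode:
        return [base64.b64encode(piece).decode("ascii") for piece in pieces]
    return [piece.decode("utf-8") for piece in pieces]


rows = split_and_encode(b"first line\nsecond line\n")
print(rows)
# Decoding the STRING values recovers the original bytes.
print([base64.b64decode(row) for row in rows])
```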

View a sync task

Open the details page of a connector to view the sync status and sync checkpoints of the sync task and to perform operations such as restart and stop. The following figure shows an example.

Synchronization examples

1. USER_DEFINE partition mode

1. Create a DataHub topic.
Note: The topic schema must contain the MaxCompute partition field. The field must be of the STRING type, as shown in the following figure:
2. Write data to the DataHub topic. You can use a DataHub SDK to write the data.
During the test, use the SDK to write several records. The values for [ds,hh,mm] are [20210304,01,15] and [20210304,02,15]. The data is as follows:


3. Create a sync task.

In USER_DEFINE partition mode, you can set the partition configuration fields during synchronization. If a corresponding table does not exist in MaxCompute, it is automatically created.

In this example, the f1 and f2 fields are imported. The f3 field is not synchronized.

4. Confirm the synchronized data.

You can view the synchronization information for the sync task in the DataHub console. Then query the data in MaxCompute. In USER_DEFINE mode, DataHub synchronizes data to the corresponding partition based on the value of the MaxCompute partition field.
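The routing in this example can be sketched as a grouping step. Only the partition values and the f1, f2, and f3 field names come from the example above; the field values and the function name are made up for illustration.

```python
from collections import defaultdict

PARTITION_FIELDS = ("ds", "hh", "mm")
IMPORT_FIELDS = ("f1", "f2")  # f3 is not imported in this example


def route_records(records):
    """Group the imported columns of each record under its partition spec."""
    partitions = defaultdict(list)
    for record in records:
        spec = "/".join(f"{name}={record[name]}" for name in PARTITION_FIELDS)
        partitions[spec].append({name: record[name] for name in IMPORT_FIELDS})
    return dict(partitions)


# Made-up field values; only the partition values come from the example.
records = [
    {"f1": "a1", "f2": "b1", "f3": "c1", "ds": "20210304", "hh": "01", "mm": "15"},
    {"f1": "a2", "f2": "b2", "f3": "c2", "ds": "20210304", "hh": "02", "mm": "15"},
    {"f1": "a3", "f2": "b3", "f3": "c3", "ds": "20210304", "hh": "01", "mm": "15"},
]
routed = route_records(records)
for spec, rows in routed.items():
    print(spec, rows)
```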

2. SYSTEM_TIME partition mode

1. Create a DataHub topic.
Note: The partition is calculated based on the time when data is written to DataHub. Therefore, the topic schema only needs to contain data fields, not partition fields, as shown in the following figure:

2. Write data to the DataHub topic. You can use a DataHub SDK to write the data.
During the test, use the SDK to write several records. The time when the data is written to DataHub is 2021-03-04 14:02:45. The data is as follows:
3. Create a sync task.
- Make sure that the partition configuration is consistent with the MaxCompute table partitions.

4. Confirm the synchronized data.

You can view the synchronization information for the sync task in the DataHub console, such as DoneTime. Then query the data in MaxCompute. In SYSTEM_TIME mode, DataHub synchronizes data to the corresponding partition based on the time when the data was written to DataHub.
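To estimate which partition the example write time falls into, the sketch below rounds the write time down to the start of its partition interval. Two things are assumptions here: that the interval is applied by rounding down, and that the task uses a 15-minute interval. The function name is hypothetical, and the time zone is ignored.

```python
from datetime import datetime


def floor_to_interval(moment, interval_minutes=15):
    """Round a datetime down to the start of its partition interval within the day."""
    if interval_minutes % 15 or not 15 <= interval_minutes <= 1440:
        raise ValueError("interval must be a multiple of 15 between 15 and 1440")
    minutes = moment.hour * 60 + moment.minute
    floored = minutes - minutes % interval_minutes
    return moment.replace(hour=floored // 60, minute=floored % 60, second=0, microsecond=0)


written_at = datetime(2021, 3, 4, 14, 2, 45)  # write time from the example
start = floor_to_interval(written_at, 15)
print(start.strftime("ds=%Y%m%d/hh=%H/mm=%M"))
```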

FAQ

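Why does the timestamp field show a date of 1970-01-19 after it is synchronized to MaxCompute?
Cause: The default unit for timestamps synchronized from DataHub to MaxCompute is microsecond, but the timestamps were written in milliseconds.
Solution: Write timestamps to DataHub in microseconds. Alternatively, set TimestampUnit to "MILLISECOND" when you create the connector, as described in the Confirm the TimestampUnit section.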
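The 1970-01-19 date follows directly from the unit mismatch, as this short calculation shows. The sample instant is arbitrary; any present-day millisecond value read as microseconds lands about 19 days after the epoch.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

intended = datetime(2021, 3, 4, 14, 2, 45, tzinfo=timezone.utc)
millis = (intended - EPOCH) // timedelta(milliseconds=1)  # value written by the producer

# Reading the millisecond value as microseconds lands in January 1970.
misread = EPOCH + timedelta(microseconds=millis)
# Multiplying by 1000 first gives the value that the default unit expects.
correct = EPOCH + timedelta(microseconds=millis * 1000)

print(misread.date(), correct.date())
```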
